Introduction to Robots.txt
This article gives a basic understanding of:
- What is a robots.txt file?
- What is its purpose?
- How to create the robots.txt file in WordPress?
- Introduction to robots.txt instructions.
What is a robots.txt file?
Robots.txt is a plain-text file that tells web robots, such as search engine crawlers, which pages on your site they may crawl.
Why use a robots.txt?
Robots.txt is used for SEO. It is one of the ways to tell search engines which URLs should be crawled and which should not.
Search engines use web crawlers to scan your website.
The purpose of the file is to avoid wasting the search engine crawler's crawl budget on URLs that do not matter.
Crawlers can scan only a limited number of URLs, called the crawl budget, each time they do a pass of your website.
Many web crawlers exist.
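To make the crawl budget idea concrete, here is a minimal sketch in Python. It is a toy model, not how any real search engine schedules its crawl: the `crawl_pass` function, the `example.com` URLs, and the budget of 2 are all assumptions made for illustration. It only shows that URLs excluded by robots.txt no longer use up the budget.

```python
from urllib.robotparser import RobotFileParser


def crawl_pass(urls, robots_lines, budget):
    """Return the URLs a polite crawler would fetch in one pass (toy model)."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    fetched = []
    for url in urls:
        if len(fetched) >= budget:
            break  # budget used up; the remaining URLs wait for a later pass
        if parser.can_fetch("*", url):
            fetched.append(url)
    return fetched


# Hypothetical URLs, listed in the order the crawler discovers them.
URLS = [
    "https://example.com/wp-admin/options.php",
    "https://example.com/wp-admin/users.php",
    "https://example.com/blog/post-1",
    "https://example.com/blog/post-2",
]

# Empty robots.txt: the budget is spent on the admin pages.
without_rules = crawl_pass(URLS, [], budget=2)
# With a Disallow rule: the same budget reaches the blog posts.
with_rules = crawl_pass(URLS, ["User-agent: *", "Disallow: /wp-admin/"], budget=2)

print(without_rules)
print(with_rules)
```

With the empty file, both fetches go to the admin pages; with the rule in place, the same budget is spent on the two blog posts.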
How and where to create the robots.txt file in WordPress.
Before creating a new file, check whether you already have one.
- WordPress will create a robots.txt file for you.
- The Yoast SEO plugin can also manage the robots.txt file for you.
If you don’t already have one:
- Create an empty file named ‘robots.txt’.
- Upload robots.txt to your server under the root directory of your website (public_html in many cases).
- Test that your new robots.txt file is accessible by opening this URL in a browser: https://<your-domain-here>/robots.txt.
If you see a blank page with no errors, it works: the page is blank because the robots.txt file is still empty.
The next step is to edit the file and add instructions that the crawlers can read.
Note: If the file stays empty, crawlers have no restrictions and will crawl all the publicly accessible URLs of your site.
If you receive a 404 page, a “Permission denied” error or anything else, there is most likely a misconfiguration of the files and/or directories on your server. You will want to contact your hosting provider.
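The same check can be scripted instead of done in a browser. The sketch below uses only the Python standard library; the function names `fetch_status` and `describe` and the wording of the messages are assumptions made for this example, and the mapping of 401/403 to "Permission denied" is a simplification.

```python
import urllib.error
import urllib.request


def fetch_status(url, timeout=10):
    """Return (HTTP status, body) for url; HTTP errors are returned, not raised.

    Network-level failures (for example an unknown host) still raise URLError.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            body = response.read().decode("utf-8", errors="replace")
            return response.status, body
    except urllib.error.HTTPError as error:
        return error.code, ""


def describe(status, body):
    """Translate a status code and body into the cases described above."""
    if status == 200 and not body.strip():
        return "reachable but empty: crawlers have no restrictions"
    if status == 200:
        return "reachable and contains instructions"
    if status == 404:
        return "not found: check that the file is in the site root"
    if status in (401, 403):
        return "access denied: check file and directory permissions"
    return f"unexpected status {status}: contact your hosting provider"


# Example call (replace the placeholder with your own domain first):
# print(describe(*fetch_status("https://<your-domain-here>/robots.txt")))
```

Keeping the network call separate from the interpretation makes the second function easy to try with made-up values, such as `describe(404, "")`.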
Introduction to robots.txt instructions.
Note that web crawlers are not forced to obey the instructions in your robots.txt.
Fortunately, most of them behave properly, as Googlebot from Google does. But keep in mind that it’s up to the crawler to decide whether or not a rule will be applied.
- Allowing all crawlers:
  User-agent: *
  Disallow:
- Blocking all crawlers from all files:
  User-agent: *
  Disallow: /
- Blocking a specific file from all crawlers:
  User-agent: *
  Disallow: /<path-to-file>/<file-name>
- Blocking a specific folder from all crawlers:
  User-agent: *
  Disallow: /<specific-folder-here>/
- Blocking a specific crawler:
  User-agent: Googlebot
  Disallow: /
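One way to see how a well-behaved crawler reads these patterns is to feed them to `urllib.robotparser` from the Python standard library. This is a sketch under assumptions: the `allowed` helper, the `example.com` URLs, the `/private/` paths, and the `Bingbot` agent name are placeholders chosen for illustration, and the Googlebot pattern is assumed to be paired with `Disallow: /`. Real crawlers may interpret edge cases differently from this parser.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_txt, agent, url):
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# The five patterns from the list above, with hypothetical paths filled in.
ALLOW_ALL = "User-agent: *\nDisallow:"
BLOCK_ALL = "User-agent: *\nDisallow: /"
BLOCK_FILE = "User-agent: *\nDisallow: /private/report.html"
BLOCK_FOLDER = "User-agent: *\nDisallow: /private/"
BLOCK_GOOGLEBOT = "User-agent: Googlebot\nDisallow: /"  # assumed Disallow line

PAGE = "https://example.com/private/report.html"
OTHER = "https://example.com/index.html"

results = {
    "allow all": allowed(ALLOW_ALL, "Googlebot", PAGE),
    "block all": allowed(BLOCK_ALL, "Googlebot", OTHER),
    "block file, the file itself": allowed(BLOCK_FILE, "Googlebot", PAGE),
    "block file, another page": allowed(BLOCK_FILE, "Googlebot", OTHER),
    "block folder, page inside": allowed(BLOCK_FOLDER, "Googlebot", PAGE),
    "block folder, page outside": allowed(BLOCK_FOLDER, "Googlebot", OTHER),
    "block Googlebot, as Googlebot": allowed(BLOCK_GOOGLEBOT, "Googlebot", OTHER),
    "block Googlebot, as Bingbot": allowed(BLOCK_GOOGLEBOT, "Bingbot", OTHER),
}

for case, may_fetch in results.items():
    print(f"{case}: {'allowed' if may_fetch else 'blocked'}")
```

Note that an empty `Disallow:` value means nothing is blocked, while `Disallow: /` matches every path because every path starts with `/`.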
It is also good practice to declare the location of your sitemap in the robots.txt file.
Here is a basic example of a robots.txt:
# Accept all crawlers, and prevent crawling the wp-admin URL of WordPress.
User-agent: *
Disallow: /wp-admin/
Sitemap: https://<your-domain>/sitemap.xml
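Before uploading a file like this, it can be checked locally. The following sketch parses the example with Python's `urllib.robotparser`. The domain `example.com`, the crawler name `ExampleBot`, and the `/hello-world/` path are placeholders, and `site_maps()` requires Python 3.8 or later.

```python
from urllib.robotparser import RobotFileParser

# The example above, with a placeholder domain substituted in.
ROBOTS_TXT = """\
# Accept all crawlers, and prevent crawling the wp-admin URL of WordPress.
User-agent: *
Disallow: /wp-admin/
Sitemap: https://example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# "ExampleBot" is a made-up crawler name; it falls under the "*" rules.
admin_allowed = parser.can_fetch("ExampleBot", "https://example.com/wp-admin/")
post_allowed = parser.can_fetch("ExampleBot", "https://example.com/hello-world/")
sitemaps = parser.site_maps()  # list of Sitemap URLs, or None if there are none

print("wp-admin allowed:", admin_allowed)
print("post allowed:", post_allowed)
print("sitemaps:", sitemaps)
```

The comment line is ignored by the parser, and the `Sitemap` line is collected separately from the crawl rules.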
For a real-world example, have a look at Google’s own robots.txt.