Robots.txt
Robots.txt is something we check when performing SEO audits. Below is some information about what it is and why it is helpful for sites to have.
What is robots.txt?
It is a plain text file that webmasters use to instruct search engine crawlers how to crawl their site. You can “allow” or “disallow” the crawling of specific pages.
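For example, a minimal robots.txt might look like the two-line sketch below (the /private/ folder is just an illustrative placeholder):

    User-agent: *
    Disallow: /private/

The asterisk means the rule applies to every crawler, and the Disallow line asks crawlers to stay out of the /private/ directory while leaving the rest of the site crawlable.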
What kind of commands can you use?
- User-agent: The specific web crawler to which you’re giving crawl instructions (usually a search engine). Lists of common user agents are readily available online.
- Disallow: The command used to tell a user-agent not to crawl a particular URL. Only one “Disallow:” line is allowed for each URL.
- Allow (Only applicable for Googlebot): The command to tell Googlebot it can access a page or subfolder even though its parent page or subfolder may be disallowed.
- Crawl-delay: How many seconds a crawler should wait before loading and crawling page content. Note that Googlebot does not acknowledge this command, but crawl rate can be set in Google Search Console.
- Sitemap: Used to call out the location of any XML sitemap(s) associated with this URL. Note this command is only supported by Google, Ask, Bing, and Yahoo.
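Putting those directives together, a robots.txt file might look something like this sketch (the folder names, delay value, and sitemap URL are made-up placeholders):

    User-agent: *
    Crawl-delay: 10
    Disallow: /search/

    User-agent: Googlebot
    Disallow: /private/
    Allow: /private/press-kit.html

    Sitemap: https://www.example.com/sitemap.xml

The first group applies to all crawlers, the second gives Googlebot its own rules (including an Allow line that opens one page inside an otherwise disallowed folder), and the Sitemap line sits on its own because it applies to the file as a whole.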
Where does robots.txt go on a site?
Robots and other search engine crawlers know to look for robots.txt when they come to a site, but they only look for it in one place: the top-level directory of the site, at the root of your domain.
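For example, crawlers will request https://www.example.com/robots.txt (with example.com standing in for your own domain), but they will not go hunting for a file tucked away at https://www.example.com/pages/robots.txt; a robots.txt placed anywhere other than the root is simply ignored.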
Why is a robots.txt file helpful?
Robots.txt files control crawler access to certain areas of your site. While this can be very dangerous if you accidentally disallow Googlebot from crawling your entire site (!!), there are some situations in which a robots.txt file can be very handy.
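The accidental lockout mentioned above usually comes down to a single character. As a cautionary sketch, here is the pattern to watch out for:

    User-agent: *
    Disallow: /

A lone slash after Disallow blocks every URL on the site, while a Disallow line with nothing after it blocks nothing at all, so double-check that slash before publishing the file. With that caveat out of the way, robots.txt has plenty of legitimate jobs.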
Some common use cases include:
- Preventing duplicate content from appearing in SERPs (note that meta robots is often a better choice for this)
- Keeping entire sections of a website private (for instance, your engineering team’s staging site)
- Keeping internal search results pages from showing up on a public SERP
- Specifying the location of sitemap(s)
- Preventing search engines from indexing certain files on your website (images, PDFs, etc.)
- Specifying a crawl delay in order to prevent your servers from being overloaded when crawlers load multiple pieces of content at once
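As a rough sketch, a single file covering several of these use cases might look like the following (the paths, delay value, and sitemap URL are invented placeholders, and the /*.pdf$ pattern relies on wildcard matching that Google and Bing honor but some other crawlers do not):

    User-agent: *
    Crawl-delay: 5
    Disallow: /staging/
    Disallow: /search-results/
    Disallow: /*.pdf$

    Sitemap: https://www.example.com/sitemap.xml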
Check if you have a robots.txt file
To check whether the client site you are working on has a robots.txt file, simply type your root domain into the browser with /robots.txt added to the end. For example, visit: sebomarketing.com/robots.txt
Check out https://moz.com/learn/seo/robotstxt for additional information and SEO best practices.