This guide defines the main fields of the robots.txt file and provides examples for each.
The robots.txt file gives site owners control over how search engines access their site. It provides crawlers with guidelines for how site content may be visited and can supply additional information about the site.
If the robots.txt file is used correctly, it can have a positive impact on a site’s organic search performance by guiding crawlers to important areas of the site while restricting access to content with no SEO value.
How do you send these signals to crawlers? Through the file’s main fields: user-agent, allow, disallow, and sitemap. We’ll also review crawl-delay and wildcards, which can provide additional control over how your site is crawled.
Before we dive in, we’ll quickly describe the four main fields of the file covered in Google’s documentation. We’ll review each in more detail, with examples, further down in the post.
The ‘user-agent’ is the name that identifies crawlers with specific purposes and/or origins. User-agents should be defined when granting specific crawlers different levels of access across your site.
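For example, a group of directives aimed at Google’s image crawler might look like this (the disallowed path is hypothetical):

```
User-agent: Googlebot-Image
Disallow: /private-images/
```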
This is a user-agent from Google for their image search engine.
The directives following this will only apply to the ‘Googlebot-Image’ user agent.
There are two wildcard characters that are used in the robots.txt file. They are * and $.
The * wildcard character matches any sequence of characters.
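For example, the following line uses * to target every crawler at once:

```
User-agent: *
```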
This addresses all user-agents for the directives following this line of instruction.
The $ wildcard matches the end of a URL, so a rule ending in $ applies only to paths that end exactly as designated.
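For example, the following rule blocks only URLs whose path ends exactly in /no-crawl.php (a hypothetical page):

```
User-agent: *
Disallow: /no-crawl.php$
```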
The crawler would not access /no-crawl.php but could still access /no-crawl.php?crawl.
‘Allow:’ directs crawlers to crawl the specified site, section, or page. If no path is specified, the directive is ignored.
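For example, this rule (using the hypothetical directory /crawl-this/) explicitly permits crawling of that section:

```
User-agent: *
Allow: /crawl-this/
```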
URLs with the path example.com/crawl-this/ can be accessed unless further specifications are provided.
‘Disallow:’ directs the crawlers to not crawl the specified site, section(s), or page(s).
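For example, this rule blocks internal site search results (the ?s= parameter is the query string WordPress uses for its built-in search):

```
User-agent: *
Disallow: /?s=
```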
URLs containing the path example.com/?s= should not be accessed unless further specifications are added.
💡 Note: if directives conflict, the crawler will follow the most specific (longest) matching rule.
The crawl-delay directive specifies the number of seconds a crawler should wait before crawling or re-crawling the site. Google ignores crawl-delay requests, but some other search engines honor them.
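For example:

```
User-agent: *
Crawl-delay: 10
```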
The crawler should wait 10 seconds before re-accessing the site.
The sitemap field provides crawlers with the location of a website’s sitemap. The address is provided as the absolute URL. If more than one sitemap exists then multiple Sitemap: fields can be used.
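For example, assuming the sitemap lives at the conventional /sitemap.xml path:

```
Sitemap: https://www.example.com/sitemap.xml
```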
The sitemap for https://www.example.com is available at the path /sitemap.xml
Leave comments, or annotations, in your robots.txt file using the pound sign (#) to communicate the intention behind specific requests. This will make your file easier for you and your coworkers to read, understand, and update.
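For example, a hypothetical annotated rule might look like this:

```
# Block internal site search results from being crawled
User-agent: *
Disallow: /?s=
```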
A simple robots.txt file that allows all user agents full access includes a wildcard user-agent line and an empty Disallow rule.
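Assuming the sitemap lives at the conventional /sitemap.xml path, such a file could look like this:

```
User-agent: *
Disallow:

Sitemap: https://www.example.com/sitemap.xml
```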
💡 Note: adding the sitemap to the robots file is recommended but not mandatory.
The robots.txt file, which lives at the root of a domain, provides site owners with the ability to give directions to crawlers on how their site should be crawled.
Always test your robots.txt file before and after implementing changes! You can validate your robots.txt file in Google Search Console.
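If you’d rather script a quick sanity check, Python’s built-in urllib.robotparser module can evaluate a set of rules against sample URLs. The rules and URLs below are hypothetical; note that Python’s parser applies rules in file order rather than by specificity, so an Allow line should be listed before an overlapping Disallow line.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: block /no-crawl but allow /no-crawl-robots-guide.
# urllib.robotparser matches rules in file order, so the more specific
# Allow line comes before the overlapping Disallow line.
rules = """\
User-agent: *
Allow: /no-crawl-robots-guide
Disallow: /no-crawl
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

print(parser.can_fetch("*", "https://website.com/no-crawl"))              # False
print(parser.can_fetch("*", "https://website.com/no-crawl-robots-guide")) # True
```

This only checks how Python interprets the file; Googlebot uses longest-match precedence, so always confirm behavior in Search Console as well.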
If you think you need help creating or configuring your robots.txt file to get your website crawled more effectively, Seer is happy to help.
Can you write a robots file that includes the following?
a) Links to the sitemap
b) Does not allow website.com/no-crawl to be crawled
c) Does allow website.com/no-crawl-robots-guide to be crawled
d) A time delay
e) Comments which explain what each line does
💡 Share your answers with us on Twitter (@seerinteractive)!
Source: www.seerinteractive.com, originally published on 2022-02-03 11:49:49