
How to Read Robots.txt: Syntax & Examples

This guide defines and provides examples for the main fields of the robots.txt file: user-agent, wildcards, allow, disallow, crawl-delay, and sitemap.

A site’s robots.txt file gives site owners control over how search engines access their site. The file gives crawlers guidelines for how site content can be accessed and can provide additional information about the site. 

If the robots.txt file is utilized correctly, it can have a positive impact on a site’s organic search performance by guiding crawlers to important areas of the site while restricting access to content with no SEO value. 

How do you send these signals to crawlers? By using the main fields: user-agent, allow, disallow, and sitemap. We’ll also review crawl-delay and wildcards, which can provide additional control over how your site is crawled. 

Before we dive in, we’ll quickly describe the 4 main fields of the file described in Google’s documentation. We’ll review each in more detail with examples further down in the post.

  1. Disallow: URL path that cannot be crawled
  2. Allow: URL path that can be crawled
  3. User-agent: specifies the crawler that the rule applies to
  4. Sitemap: Provides the full location of the sitemap

User-agent examples include Googlebot, Bingbot, and DuckDuckBot.

User-Agent

What is it:

The ‘user-agent’ is the name that identifies crawlers with specific purposes and/or origins. User-agents should be defined when granting specific crawlers different access across your site.

Example: 

  • User-agent: Googlebot-Image

What it means:

This is a user-agent from Google for their image search engine. 

The directives following this will only apply to the ‘Googlebot-Image’ user agent.

Wildcards

There are two wildcard characters that are used in the robots.txt file. They are * and $.

* (Match Sequence)

What is it: 

The * wildcard character matches any sequence of characters, including an empty one.

Example: 
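
  • User-agent: *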

What it means:

The directives following this line apply to all user agents. 

$ (Match URL End)

What is it: 

The $ wildcard marks the end of a URL path: a rule ending in $ only applies to URLs that end with exactly the characters before the $.

Example: 
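
  • Disallow: /no-crawl.php$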

What it means:

The crawler would not access /no-crawl.php, but it could still access /no-crawl.php?crawl.

Allow and Disallow

Allow

What is it: 

‘Allow:’ tells crawlers that the specified site, section, or page can be crawled. If no path is specified, the ‘Allow’ directive is ignored. 

Example: 
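
  • Allow: /crawl-this/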

What it means:

URLs with the path example.com/crawl-this/ can be accessed unless further specifications are provided. 

Disallow

What is it:

‘Disallow:’ tells crawlers not to crawl the specified site, section(s), or page(s). 

Example: 
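
  • Disallow: /?s=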

What it means:

URLs containing the path example.com/?s= should not be accessed unless further specifications are added.

💡 Note: if there are contradicting directives, the crawler will follow the more specific request.
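
For instance, with both of the rules below in place, a crawler could still access example.com/no-crawl/crawl-this/ because the Allow rule is the more specific (longer) match:

  • Disallow: /no-crawl/
  • Allow: /no-crawl/crawl-this/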

Crawl Delay

What is it: 

The crawl-delay directive specifies the number of seconds a search engine should wait before crawling or re-crawling the site. Google does not follow crawl-delay requests, but some other search engines do. 

Example: 
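
  • Crawl-delay: 10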

What it means:

The crawler should wait 10 seconds before re-accessing the site.

Sitemap

What is it: 

The sitemap field provides crawlers with the location of a website’s sitemap. The address is provided as the absolute URL. If more than one sitemap exists then multiple Sitemap: fields can be used. 

Example: 

  • Sitemap: https://www.example.com/sitemap.xml

What it means:

The sitemap for https://www.example.com is available at the path /sitemap.xml

Comments

Leave comments, or annotations, in your robots.txt file using the pound sign (#) to communicate the intention behind specific requests. This will make your file easier for you and your coworkers to read, understand, and update.

Example:

  • # This is a comment explaining that the file allows access to all user agents
  • User-agent: *
  • Allow: /

Robots.txt Allow All Example

A simple robots.txt file that allows all user agents full access includes:

  1. The user-agent directive with the ‘match any’ wildcard character
  2. Either an empty Disallow or an Allow with the forward slash, as shown in the example below.
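
Example (using example.com as a placeholder domain; the empty Disallow grants full access, and the sitemap line is optional):

  • User-agent: *
  • Disallow:
  • Sitemap: https://www.example.com/sitemap.xml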

💡 Note: adding the sitemap to the robots file is recommended but not mandatory.

Final Thoughts On Reading Robots Files

The robots.txt file, which lives at the root of a domain, provides site owners with the ability to give directions to crawlers on how their site should be crawled. 

  • When used correctly, the file can help your site be crawled more effectively and provide additional information about your site to search engines.
  • When used incorrectly, the robots.txt file can be the reason your content isn’t able to be displayed within search results.

Testing Robots.txt

Always test your robots file before and after implementing! You can validate your robots.txt file in Google Search Console.
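
You can also spot-check individual URLs locally. As a rough sketch (not an official validator), the snippet below uses Python’s built-in urllib.robotparser module; the robots.txt URL and user agent are placeholders, and note that Python’s parser does not support Google-style wildcards:

    # Fetch a live robots.txt and check whether a given user agent may crawl a URL.
    from urllib import robotparser

    rp = robotparser.RobotFileParser()
    rp.set_url("https://www.example.com/robots.txt")  # placeholder robots.txt location
    rp.read()  # download and parse the file

    # True if the parsed rules allow this user agent to fetch the URL
    print(rp.can_fetch("Googlebot-Image", "https://www.example.com/no-crawl.php"))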

If you think you need help creating or configuring your robots.txt file to get your website crawled more effectively, Seer is happy to help. 

Pop Quiz!

Can you write a robots file that includes the following?

a) Links to the sitemap

b) Does not allow website.com/no-crawl to be crawled

c) Does allow website.com/no-crawl-robots-guide to be crawled

d) A time delay

e) Comments which explain what each line does

💡 Share your answers with us on Twitter (@seerinteractive)!

Source: www.seerinteractive.com, originally published on 2022-02-03 11:49:49