Crawl budget is how fast and how many pages a search engine wants to crawl on your site. It’s affected by the amount of resources a crawler wants to use on your site and the amount of crawling your server supports.
More crawling doesn’t mean you’ll rank better, but if your pages aren’t crawled and indexed they aren’t going to rank at all.
Most sites don’t need to worry about crawl budget, but there are few cases where you may want to take a look. Let’s look at some of those cases.
You usually don’t have to worry about crawl budget on popular pages. It’s usually pages that are newer, that aren’t well linked, or don’t change much that are not crawled often.
Crawl budget can be a concern for newer sites, especially those with a lot of pages. Your server may be able to support more crawling, but because your site is new and likely not very popular yet, a search engine may not want to crawl your site very much. This is mostly a disconnect in expectations. You want your pages crawled and indexed but Google doesn’t know if it’s worth indexing your pages and may not want to crawl as many pages as you want them to.
Crawl budget can also be a concern for larger sites with millions of pages or sites that are frequently updated. In general, if you have lots of pages not being crawled or updated as often as you’d like, then you may want to look into speeding up crawling. We’ll talk about how to do that later in the article.
If you want to see an overview of Google crawl activity and any issues they identified, the best place to look is the Crawl Stats report in Google Search Console.
There are various reports here to help you identify changes in crawling behavior, issues with crawling, and give you more information about how Google is crawling your site.
You definitely want to look into any flagged crawl statuses like the ones shown here:
There are also timestamps of when pages were last crawled.
If you want to see hits from all bots and users, you’ll need access to your log files. Depending on hosting and setup, you may have access to tools like Awstats and Webalizer as is seen here on a shared host with cPanel. These tools show some aggregated data from your log files.
For more complex setups you’ll have to get access to and store data from the raw log files, possibly from multiple sources. You may also need specialized tools for larger projects such as an ELK (elasticsearch, logstash, kibana) stack which allows for storage, processing, and visualization of log files. There are also log analysis tools such as Splunk.
These URLs may be found by crawling and parsing pages, or from a variety of other sources including sitemaps, RSS feeds, submitting URLs for indexing in Google Search Console, or using the indexing API.
There are also multiple Googlebots that share the crawl budget. You can find a list of the various Googlebots crawling your website in the Crawl Stats report in GSC.
Each website will have a different crawl budget that’s made up of a few different inputs.
Crawl demand is simply how much Google wants to crawl on your website. More popular pages and pages that experience significant changes will be crawled more.
Popular pages, or those with more links to them, will generally receive priority over other pages. Remember that Google has to prioritize your pages for crawling in some way, and links are an easy way to determine which pages on your site are more popular. It’s not just your site though, it’s all pages on all sites on the internet that Google has to figure out how to prioritize.
You can use the Best by links report in Site Explorer as an indication of which pages are likely to be crawled more often. It also shows you when Ahrefs last crawled your pages.
There’s also a concept of staleness. If Google sees that a page isn’t changing, they will crawl the page less frequentlly. For instance, if they crawl a page and see no changes after a day, they may wait three days before crawling again, ten days the next time, 30 days, 100 days, etc. There’s no actual set period they will wait between crawls, but it will become more infrequent over time. However, if Google sees large changes on the site as a whole or a site move, they will typically increase the crawl rate, at least temporarily.
Crawl rate limit is how much crawling your website can support. Websites have a certain amount of crawling they can take before having issues with the stability of the server like slowdowns or errors. Most crawlers will back off crawling if they start to see these issues so they do not harm the site.
Google will adjust based on the crawl health of the site. If the site is fine with more crawling, then the limit will increase. If the site is having issues, then Google will slow down the rate at which they crawl.
There are a few things you can do to make sure your site can support additional crawling and increase your site’s crawl demand. Let’s look at some of those options.
The way Google crawls pages is basically to download resources and then process them on their end. Your page speed as a user perceives it isn’t quite the same. What will impact crawl budget is how fast Google can connect and download resources which has more to do with the server and resources.
Remember that crawl demand is generally based on popularity or links. You can increase your budget by increasing the amount of external links and/or internal links. Internal links are easier since you control the site. You can find suggested internal links in the Link Opportunities report in Site Audit, which also includes a tutorial explaining how it works.
Keeping links to broken or redirected pages on your site active will have a small impact on crawl budget. Typically, the pages linked here will have a fairly low priority because they probably haven’t changed in a while, but cleaning up any issues is good for website maintenance in general and will help your crawl budget a bit.
You can find broken (4xx) and redirected (3xx) links on your site easily in the Internal pages report in Site Audit.
For broken or redirected links in the sitemap, check the All issues report for “3XX redirect in sitemap” and “4XX page in sitemap” issues.
This one is a little more technical in that it involves HTTP Request methods. Don’t use POST requests where GET requests work. It’s basically GET (pull) vs POST (push). POST requests aren’t cached so they do impact crawl budget, but GET requests can be cached.
If you need pages crawled faster, check if you’re eligible for Google’s Indexing API. Currently this is only available for a few use cases like job postings or live videos.
Bing also has an Indexing API that’s available to everyone.
There are a few things people sometimes try that won’t actually help with your crawl budget.
There are just a couple good ways to make Google crawl slower. There are a few other adjustments you could technically make like slowing down your website, but they’re not methods I would recommend.
The main control Google gives us to crawl slower is a rate limiter within Google Search Console. You can slow down the crawl rate with the tool, but it can take up to two days to take effect.
If you need a more immediate solution, you can take advantage of Google’s crawl rate adjustments related to your site health. If you serve Googlebot a ‘503 Service Unavailable’ or ‘429 Too Many Requests’ status codes on pages, they will start to crawl slower or may stop crawling temporarily. You don’t want to do this longer than a few days though or they may start to drop pages from the index.
Again, I want to reiterate that crawl budget isn’t something for most people to worry about. If you do have concerns, I hope this guide was useful.
I typically only look into it when there are issues with pages not getting crawled and indexed, I need to explain why someone shouldn’t be worried about it, or I happen to see something that concerns me in the crawl stats report in Google Search Console.
Have questions? Let me know on Twitter.
Source: ahrefs.com, originally published on 2021-05-28 09:15:20