With trillions of webpages in existence, the internet is a crowded place. And Google is continuously trying to surface every piece of useful information. With that, Google must also manage its resources in order to efficiently crawl as much of the web as possible. It’s through this balancing act that we get a crawl budget.
What is crawl rate?
A crawl rate is the number of URLs Googlebot will crawl on a website within a visit. Crawl rate optimization is centered around guiding the search engine bots to crawl your website’s most important pages.
Crawl rate is determined by two factors:
- Crawl Capacity Limit: How much time a bot will spend crawling your site without overwhelming your server. Capacity is affected by a few things: the health of your site, your own settings, and the host's crawling limits. Because Googlebot is designed not to overload a website, it will slow down or stop crawling if it detects that it is hurting the site's performance. Website owners can also choose to reduce how much crawling happens on their site (requesting an increase is an option, but it isn't guaranteed).
- Crawl Demand/Crawl Scheduling: Which URLs are most worth recrawling, based on their popularity and how often they are updated.
The more efficient your site is to crawl, the less time future crawl rate optimization efforts will take.
The best way to optimize your crawl rate
On the technical side of search, there are several ways you can communicate to the search engines which pages their bots should crawl and index.
Use robots.txt to block URLs you don’t want crawled
The use of robots.txt has evolved in recent years, but it is still a useful tool for managing your crawl budget. The disallow directive tells crawlers not to access certain sections or URLs of your site, so you can exclude the pages that offer no value in being crawled: admin pages, members-only areas, and even the shopping cart. It is important to note that robots.txt rules do not mean these pages won't be indexed, only that they won't be crawled.
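To make that concrete, here is a rough sketch of what those rules might look like for an eCommerce site (the paths are made up), plus a quick way to sanity-check them with Python's built-in urllib.robotparser before you push them live:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an eCommerce site: keep crawlers out of
# low-value sections (admin, member area, cart) and leave the rest open.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /members/
Disallow: /cart/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Sanity-check which URLs Googlebot would be allowed to fetch.
for path in ["/products/blue-widget", "/cart/checkout", "/admin/login"]:
    allowed = parser.can_fetch("Googlebot", f"https://www.example.com{path}")
    print(f"{path}: {'crawlable' if allowed else 'blocked'}")
```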
Check your Canonicals
Page bloat is a common issue among eCommerce sites with lots of filtering options. Faceted navigation can create an almost infinite combination of URLs, pulling Googlebot in too many directions. So make sure you are effectively using the rel=canonical tag to declare the original source of the content.
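For reference, the canonical declaration is just a link element in the page's head. Here is a small sketch, using only Python's standard library and some made-up markup, of how you might pull the canonical URL out of a page while auditing your faceted URLs:

```python
from html.parser import HTMLParser

class CanonicalFinder(HTMLParser):
    """Collect the href of any <link rel="canonical"> tag in a page."""
    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "link" and attrs.get("rel") == "canonical":
            self.canonical = attrs.get("href")

# Hypothetical markup: a filtered view that points back to the clean category page.
page_html = """
<html><head>
<link rel="canonical" href="https://www.example.com/shoes/" />
</head><body>Filtered view: /shoes/?color=red&amp;sort=price</body></html>
"""

finder = CanonicalFinder()
finder.feed(page_html)
print(finder.canonical)  # https://www.example.com/shoes/
```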
Eliminate duplicate content
This is often a byproduct of thin content, product variations, and too many category pages. Again, canonicals are your friend — identify that single source of information. Consolidate duplicate content, so that crawlers are focusing on crawling unique content rather than a bunch of copies.
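One rough way to find candidates for consolidation is to fingerprint page bodies and group URLs that serve identical content. The sketch below assumes you have already fetched the pages by some other means; the URLs and HTML are placeholders:

```python
import hashlib
from collections import defaultdict

def fingerprint(page_html: str) -> str:
    """Hash a page body so identical copies collapse to the same key."""
    # In practice you would strip boilerplate (nav, footer) before hashing.
    return hashlib.sha256(page_html.strip().lower().encode("utf-8")).hexdigest()

# Placeholder crawl data: URL -> HTML body you've already fetched.
pages = {
    "https://www.example.com/widgets/blue": "<p>Our classic widget, now in blue.</p>",
    "https://www.example.com/widgets/blue?ref=email": "<p>Our classic widget, now in blue.</p>",
    "https://www.example.com/widgets/red": "<p>Our classic widget, now in red.</p>",
}

groups = defaultdict(list)
for url, body in pages.items():
    groups[fingerprint(body)].append(url)

for urls in groups.values():
    if len(urls) > 1:
        print("Duplicate group (pick one canonical URL):", urls)
```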
Keep your sitemap up to date
A sitemap is a vital tool for helping search engines crawl and index your website, and it also communicates the relationships between your pages. Google reads both your XML and HTML sitemaps frequently, so the more up to date they are, the better: it helps Google understand your site more effectively.
Pro Tip: make sure you submit your XML sitemap through Google Search Console
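If your platform doesn't generate a sitemap for you, a minimal XML sitemap is simple to build. Here is a sketch using Python's standard library; the URLs and lastmod dates are placeholders you would swap for your own important pages:

```python
from datetime import date
from xml.etree.ElementTree import Element, SubElement, tostring

# Placeholder list of important URLs and the date each was last updated.
pages = [
    ("https://www.example.com/", date(2021, 6, 1)),
    ("https://www.example.com/shoes/", date(2021, 5, 20)),
    ("https://www.example.com/blog/crawl-budget/", date(2021, 5, 28)),
]

urlset = Element("urlset", xmlns="http://www.sitemaps.org/schemas/sitemap/0.9")
for loc, lastmod in pages:
    url = SubElement(urlset, "url")
    SubElement(url, "loc").text = loc
    SubElement(url, "lastmod").text = lastmod.isoformat()

# Write sitemap.xml to the site root, then submit it in Search Console.
with open("sitemap.xml", "wb") as f:
    f.write(b'<?xml version="1.0" encoding="UTF-8"?>\n')
    f.write(tostring(urlset))
```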
Use a 404 or 410 for dead pages
The notion that a 4XX error is always a bad thing isn't true. These response codes communicate useful information to the search engines. For the eCommerce folks in the room: when a product is sold out and its page removed, use a 404 (or 410) response code to tell Google not to keep crawling that URL, even if that product will be returning to stock.
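Here is what that can look like server-side, as a sketch only; it assumes a hypothetical Flask app and a made-up list of retired products rather than any particular eCommerce platform:

```python
from flask import Flask, abort

app = Flask(__name__)

# Hypothetical catalog: product slugs whose pages have been removed entirely.
RETIRED_SLUGS = {"vintage-widget", "limited-run-gadget"}

@app.route("/products/<slug>")
def product_page(slug):
    if slug in RETIRED_SLUGS:
        # 410 "Gone" is the strongest signal that the URL is permanently dead;
        # a plain 404 also keeps Googlebot from wasting crawl budget here.
        abort(410)
    return f"Product page for {slug}"

if __name__ == "__main__":
    app.run()
```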
Avoid long redirect chains and loops
Redirects definitely serve a real-world purpose, but too many redirects can feel like a wild goose chase for Googlebot. Google will give up if it encounters too many redirects to reach the endpoint. For pages that have permanently moved, use a 301 redirect so that Google knows to focus its attention on the new page.
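A quick way to audit this is to follow each URL's hops and flag long chains. The sketch below assumes the requests library is installed and uses a placeholder URL:

```python
import requests

def redirect_chain(url: str, max_hops: int = 5):
    """Return the sequence of URLs a request passes through before settling."""
    response = requests.get(url, allow_redirects=True, timeout=10)
    hops = [r.url for r in response.history] + [response.url]
    if len(response.history) > max_hops:
        print(f"WARNING: {url} takes {len(response.history)} hops; consider "
              f"redirecting the old URL straight to {response.url}")
    return hops

# Placeholder URL; run this against pages you suspect are chained.
print(redirect_chain("https://www.example.com/old-page"))
```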
Where to find your crawl history
So where exactly should you start? Google Search Console offers several reports for analyzing Google's crawl history on your site. The Index Coverage report gives you a good overview of what Googlebot has crawled on your website, and the Crawl Stats report shows where Google has run into issues while crawling your site, so you can address them.
If you want to get extra nerdy, get access to your server log files. Pop those into Screaming Frog's Log File Analyser and you will have hours of crawl rate fun.
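And if you would rather roll your own before firing up a tool, a few lines of Python can tally which URLs Googlebot is hitting hardest in a combined-format access log. The log path, log format, and user-agent check here are assumptions about your setup:

```python
import re
from collections import Counter

# Matches the request path and the user agent in a common/combined log line.
LINE_RE = re.compile(r'"[A-Z]+ (?P<path>\S+) HTTP/[^"]*".*"(?P<agent>[^"]*)"$')

googlebot_hits = Counter()
with open("access.log") as log:          # assumed log location
    for line in log:
        match = LINE_RE.search(line)
        if match and "Googlebot" in match.group("agent"):
            googlebot_hits[match.group("path")] += 1

# The URLs Googlebot spends the most of its crawl budget on.
for path, hits in googlebot_hits.most_common(20):
    print(f"{hits:6d}  {path}")
```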
Although Google has made it clear that obsessing over your crawl budget really only matters for websites with hundreds of thousands of URLs, you should still pay attention to which pages Googlebot is crawling on your website. Prioritizing your site's information is becoming increasingly important, and maximizing the time search engine crawlers spend on the important stuff will lead to more SEO success.