How to Do an SEO Log File Analysis [Template Included]

11 Apr

Log files have been receiving increasing recognition from technical SEOs over the past five years, and for good reason.

They’re the most trustworthy source of information to understand the URLs that search engines have crawled, which can be critical information to help diagnose problems with technical SEO.

Google itself recognizes their importance, releasing new features in Google Search Console and making it easy to see samples of data that would previously only be available by analyzing logs.

Crawl stats report; key data above and line graph showing trend of crawl requests below

In addition, Google Search Advocate John Mueller has publicly stated how much good information log files hold.

@glenngabe Log files are so underrated, so much good information in them.

— 🦝 John (personal) 🦝 (@JohnMu) April 5, 2016

With all this hype around the data in log files, you may want to understand logs better, how to analyze them, and whether the sites you’re working on will benefit from them.

This article will answer all of that and more.

A server log file is a file created and updated by a server that records the activities it has performed. A popular server log file is an access log file, which holds a history of HTTP requests to the server (by both users and bots).

When a non-developer mentions a log file, access logs are the ones they’ll usually be referring to.

Developers, however, find themselves spending more time looking at error logs, which report issues encountered by the server.

The above is important: If you request logs from a developer, the first thing they’ll ask is, “Which ones?”

Therefore, always be specific with log file requests. If you want logs to analyze crawling, ask for access logs.

Access log files contain lots of information about each request made to the server, such as the following:

IP addresses
User agents
URL path
Timestamps (when the bot/browser made the request)
Request type (GET or POST)
HTTP status codes
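To make those fields concrete, here is a minimal Python sketch that pulls them out of one access log line. It assumes the common "combined" layout used by Apache and Nginx; the sample line, the `LOG_PATTERN` regex, and the `parse_line` helper are all made up for illustration, and your own logs may order or name fields differently.

```python
import re
from datetime import datetime

# Assumed layout (combined log format):
# ip ident user [time] "method path protocol" status bytes "referer" "user-agent"
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<user_agent>[^"]*)"'
)


def parse_line(line):
    """Return a dict of fields for one log line, or None if it doesn't match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    entry["status"] = int(entry["status"])
    # Timestamps in this format look like 11/Apr/2022:10:15:32 +0000
    entry["time"] = datetime.strptime(entry["time"], "%d/%b/%Y:%H:%M:%S %z")
    return entry


# Made-up sample line, not taken from a real server.
SAMPLE_LINE = (
    '66.249.66.1 - - [11/Apr/2022:10:15:32 +0000] '
    '"GET /blog/log-file-analysis/ HTTP/1.1" 200 5123 '
    '"-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"'
)

entry = parse_line(SAMPLE_LINE)
print(entry["ip"], entry["method"], entry["path"], entry["status"])
```

Returning `None` for lines that don't match keeps the sketch tolerant of stray or malformed rows, which are common in real log exports.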

What servers include in access logs varies by the server type and sometimes what developers have configured the server to store in log files. Common formats for log files include the following:

Apache format – This is used by Nginx and Apache servers.
W3C format – This is used by Microsoft IIS servers.
ELB format – This is used by Amazon Elastic Load Balancing.
Custom formats – Many servers support outputting a custom log format.

Other forms exist, but these are the main ones you’ll encounter.

Now that we’ve got a basic understanding of log files, let’s see how they benefit SEO.

Here are some key ways:

Crawl monitoring – You can see the URLs search engines crawl and use this to spot crawler traps, look out for crawl budget wastage, or better understand how quickly content changes are picked up.
Status code reporting – This is particularly useful for prioritizing which errors to fix. Rather than just knowing you’ve got a 404, you can see precisely how many times users and search engines are requesting the 404 URL.
Trends analysis – By monitoring crawling over time to a URL, page type/site section, or your entire site, you can spot changes and investigate potential causes.
Orphan page discovery – You can cross-analyze data from log files and a site crawl you run yourself to discover orphan pages.
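Two of these ideas, status code reporting and orphan page discovery, boil down to counting and set comparison. Here is a rough Python sketch under some assumptions: the log entries are already parsed into `(path, status, user_agent)` tuples, `crawled_by_site_crawler` holds the paths your own site crawl found, and all of the data and function names are hypothetical.

```python
from collections import Counter

# Hypothetical, already-parsed log entries: (path, status, user_agent)
log_entries = [
    ("/", 200, "Googlebot"),
    ("/old-page/", 404, "Googlebot"),
    ("/old-page/", 404, "Googlebot"),
    ("/old-page/", 404, "Mozilla"),
    ("/forgotten-landing-page/", 200, "Googlebot"),
    ("/blog/", 200, "Googlebot"),
]

# Hypothetical set of paths discovered by a site crawl you ran yourself.
crawled_by_site_crawler = {"/", "/blog/"}


def count_hits_by_status(entries, status):
    """Count how often each path was requested with the given status code."""
    return Counter(path for path, code, _ in entries if code == status)


def find_orphan_candidates(entries, crawler_paths):
    """Paths that returned 200 in the logs but never appeared in the site crawl."""
    ok_paths = {path for path, code, _ in entries if code == 200}
    return sorted(ok_paths - crawler_paths)


not_found_hits = count_hits_by_status(log_entries, 404)
orphans = find_orphan_candidates(log_entries, crawled_by_site_crawler)

print(not_found_hits.most_common())
print(orphans)
```

The 404 counts give you a simple way to rank which broken URLs to fix first, and anything in `orphans` is only a candidate: it is worth checking by hand before treating it as a true orphan page.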

All sites will benefit from log file analysis to some degree, but the amount of benefit varies massively depending on site size.

This is because log files primarily benefit sites by helping you better manage crawling. Google itself states that managing crawl budget is something larger-scale or frequently changing sites will benefit from.

The same is true for log file analysis.

For example, smaller sites can likely use the “Crawl stats” data provided in Google Search Console and receive all of the benefits mentioned above—without ever needing to touch a log file.

Gif of Crawl stats report being scrolled down gradually

Yes, Google won’t provide you with all URLs crawled (like with log files), and the trends analysis is limited to three months of data.

However, smaller sites that change infrequently also need less ongoing technical SEO. It’ll likely suffice to have a site auditor discover and diagnose issues.

For example, a cross-analysis from a site crawler, XML sitemaps, Google Analytics, and Google Search Console will likely discover all orphan pages.

You can also use a site auditor to discover error status codes from internal links.

There are a few key reasons I’m pointing this out:

Access log files aren’t easy to get a hold of (more on this next).
For small sites that change infrequently, log files offer less benefit, so your SEO efforts will likely be better spent elsewhere.

In most cases, to analyze log files, you’ll first have to request access to log files from a developer.

The developer is then likely going to have a few issues, which they’ll bring to your attention. These include:

Partial data – Log files can include partial data scattered across multiple servers. This usually happens when developers use various servers, such as an origin server, load balancers, and a CDN. Getting an accurate picture of all logs will likely mean compiling the access logs from all servers.
File size – Access log files for high-traffic sites can end up in terabytes, if not petabytes, making them hard to transfer.
Privacy/compliance – Log files include user IP addresses, which are personally identifiable information (PII). User information may need to be removed before the logs can be shared with you.
Storage history – Due to file size, developers may have configured access logs to be stored for a few days only, making them not useful for spotting trends and issues.
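On the privacy point, one option a developer might consider is truncating IP addresses before handing logs over. Here is a minimal Python sketch using the standard library's `ipaddress` module. The `anonymize_ip` helper and the prefix lengths (/24 for IPv4, /48 for IPv6) are my assumptions for illustration; whether truncation is enough for your compliance requirements is a question for whoever owns privacy at your organization.

```python
import ipaddress


def anonymize_ip(ip_text):
    """Zero out the host part of an IP address (assumed prefixes: /24 and /48)."""
    address = ipaddress.ip_address(ip_text)
    prefix = 24 if address.version == 4 else 48
    # strict=False lets us build the network from a host address.
    network = ipaddress.ip_network(f"{address}/{prefix}", strict=False)
    return str(network.network_address)


# Documentation-range addresses, not real users.
print(anonymize_ip("203.0.113.57"))
print(anonymize_ip("2001:db8:abcd:12::1"))
```

Keep in mind that truncated addresses are less useful if you later want to check whether a request really came from a search engine's published IP ranges, so it may make sense to filter for bot traffic first and anonymize the rest.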

These issues will call into question whether storing, merging, filtering, and transferring log files is worth the dev effort, especially if developers already have a long list of priorities (which is often the case).

Developers will likely put the onus on the SEO to explain/build a case for why developers should invest time in this, which you will need to prioritize among other SEO focuses.

These issues are precisely why log file analysis doesn’t happen frequently.

Log files you receive from developers are also often formatted in ways that popular log file analysis tools don’t support, making analysis more difficult.

Thankfully, there are software solutions that simplify this process. My favorite is Logflare, a Cloudflare app that can store log files in a BigQuery database that you own.

Now it’s time to start analyzing your logs.

I’m going to show you how to do this in the context of Logflare specifically; however, the tips on how to use log data will work with any logs.

The template I’ll share shortly also works with any logs. You’ll just need to make sure the columns in the data sheets match up.

1. Start by setting up Logflare (optional)

Logflare is simple to set up. And with the BigQuery integration, it stores data long term. You’ll own the data, making it easily accessible for everyone.

There’s one difficulty. You need to swap out your domain name servers to use Cloudflare ones and manage your DNS there.

For most, this is fine. However, if you’re working with a more enterprise-level site, it’s unlikely you can convince the server infrastructure team to change the name servers to simplify log analysis.

I won’t go through every step on how to get Logflare working. But to get started, all you need to do is head to the Cloudflare Apps part of your dashboard.