What log file analysis actually is
Log file analysis is the practice of reading your web server's raw access logs to see exactly how search engine crawlers behave on your site. Every time Googlebot, Bingbot or any other crawler requests a URL, your server records that request with the URL, timestamp, user agent, IP address, status code returned and, if the log format is configured to capture it, response time. Pull a month of those records, filter to verified search bots, and you have a forensic record of how the bots really interact with the site.
Why does that matter? Because everything else is a model. A Screaming Frog crawl is what the crawler thinks Google might do. Search Console's Crawl Stats report is Google's own aggregated summary, useful but not granular. The server log is the actual ledger of every individual fetch. If you want to know whether Googlebot is wasting half its crawl budget on parameter URLs you never wanted indexed, the log is the only place that question gets answered with certainty.
Two distinctions before we get into the steps. First, log analysis answers crawl questions, not indexing or ranking questions. It tells you what got fetched, not what got indexed, and definitely not what is ranking. You pair it with Search Console for the indexing picture and with your SERP tracker for ranking. Second, it is an advanced diagnostic. For a 500-URL plumber site in Joondalup, log analysis is overkill. For a 40,000-URL Shopify store or a mining-services catalogue with thousands of equipment SKUs, log analysis is the only way to find the wasted crawl that nobody else can see.
The data itself is unglamorous. A single line in an Apache combined log format looks roughly like this: an IP address, a timestamp, the request method and URL, the status code Google got back, the response size, the referrer (usually empty for bots) and the user agent string. Multiply that by a few million rows for a month of a busy site and you can see why we run it through a tool rather than reading it line by line.
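To make that layout concrete, here is a minimal Python sketch that splits one combined-format line into named fields. The sample line, the regular expression and the field names are our own illustrative assumptions, not output from a real server, and a customised log format would need a different pattern.

```python
import re

# Apache "combined" layout:
# host, identity, user, [time], "request", status, bytes, "referrer", "user agent"
COMBINED = re.compile(
    r'(?P<ip>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+)(?: (?P<protocol>[^"]*))?" '
    r'(?P<status>\d{3}) (?P<size>\S+) "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_line(line):
    """Return a dict of fields, or None if the line does not match the combined layout."""
    match = COMBINED.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["status"] = int(fields["status"])
    # Apache writes "-" in the size column when no body was sent.
    fields["size"] = 0 if fields["size"] == "-" else int(fields["size"])
    return fields

# Invented example line: the IP, URL and timestamp are made up for illustration.
SAMPLE = (
    '66.249.66.1 - - [12/Mar/2024:06:25:24 +0800] '
    '"GET /products/widget?colour=blue HTTP/1.1" 200 5123 "-" '
    '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"'
)

record = parse_line(SAMPLE)
print(record["url"], record["status"], record["agent"])
```

Lines that return None are worth counting rather than discarding silently, since a high miss rate usually means the log is in a different format from the one assumed here.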
Why server logs beat the other tools
Two reasons. The honest one and the political one. Both real.
The honest reason: desktop crawlers and Search Console both miss things. Screaming Frog can only crawl URLs it finds by following links from the seeds you give it. If a parameter URL is generated by JavaScript on click, Screaming Frog will not find it. Googlebot will. We have seen sites where the live Googlebot crawl was hitting twelve times more URLs than the Screaming Frog crawl suggested existed. That gap is exactly where the wasted-crawl problems live. Search Console Crawl Stats helps, but it aggregates by purpose and file type, not by URL. You can see that 60 percent of crawl went to "Discovery" rather than "Refresh", but not which 18,000 individual URLs that was.
The political reason: a log file is unarguable. When the developer says "the canonical tags are fine, the issue must be the content", a log report showing Googlebot has hit ?colour=blue&sort=price-asc 47,000 times in the last month moves the conversation along. The data does not need defending. It is what the server recorded. We have unstuck more developer-vs-SEO standoffs with a log report than with any other artefact.
There is a third reason worth mentioning. Logs are the only tool that captures the response time of each individual Googlebot request, provided the log format records it. The standard Apache combined and Nginx combined formats do not, so the field has to be added to the format first (%D in Apache, $request_time in Nginx). Search Console shows an average. Logs show the distribution. If your top template responds in 180ms for 95 percent of Googlebot requests but in 4,000ms for the other 5 percent, that long tail is a crawl-budget tax you cannot see anywhere else. Slow responses cost crawl frequency on large sites because Google throttles its own request rate to whatever the server can serve.
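A small sketch of the average-versus-distribution point, assuming the response times (in milliseconds) have already been pulled out of verified Googlebot requests. The numbers are synthetic, chosen to mirror the 180ms and 4,000ms example, and the helper name percentile is ours.

```python
import math
from statistics import mean

def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct percent of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

# Synthetic data: 95 fast responses and 5 slow ones, in milliseconds.
response_ms = [180] * 95 + [4000] * 5

print("mean:", mean(response_ms))
print("p50: ", percentile(response_ms, 50))
print("p95: ", percentile(response_ms, 95))
print("p99: ", percentile(response_ms, 99))
```

On this synthetic data the mean comes out at 371ms, which looks healthy, while the 99th percentile is 4,000ms. Reporting the percentiles per template rather than one site-wide average is what exposes the slow tail.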
One last framing. Log analysis is the diagnostic that pays for itself fastest on big sites and is unnecessary on small ones. We do not recommend it for a typical Perth small business. We do recommend it for any site above 10,000 URLs that has ever had crawl-budget conversations or indexing concerns, because it catches things every other tool misses.
The seven-step process
The order matters. Each step relies on the cleaning done by the step before. Skipping the verification or the cleaning gives you a chart full of noise that looks meaningful and is not.
Step 1. Export raw server logs
You need at least 30 days. Less than that and the signal is noisy because Googlebot's crawl pattern varies day to day. Sixty to ninety days is better for a stable picture. Pull the logs from whichever component of the stack is actually receiving the public traffic. On a typical Perth WordPress site that is the web server (Apache or Nginx). On a stack behind Cloudflare or AWS CloudFront, you want the CDN logs, because the request never reaches your origin for cached pages. On a load-balanced setup, pull from the balancer, not the individual app servers, or you only see a fraction of the requests.
Three things to confirm at this step. One, the log format. Most are Apache combined or Nginx default, both of which the tools handle. IIS sites need a slightly different parse. Two, the timezone. Mixing logs in different timezones produces nonsense time-series charts. Three, that the file is complete. Some hosts truncate at midnight. Some skip non-200 responses by default. Both quietly hide the most interesting findings.
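For the timezone check, one option is to convert every timestamp to UTC before merging files from different sources. The sketch below assumes Apache or Nginx style timestamps with an explicit offset and English month abbreviations. The two sample timestamps are invented.

```python
from datetime import datetime, timezone

def to_utc(raw):
    """Parse a timestamp such as '12/Mar/2024:06:25:24 +0800' and return it in UTC."""
    return datetime.strptime(raw, "%d/%b/%Y:%H:%M:%S %z").astimezone(timezone.utc)

# The same instant as logged by an origin server on Perth time and by a CDN on UTC.
origin = to_utc("12/Mar/2024:06:25:24 +0800")
cdn = to_utc("11/Mar/2024:22:25:24 +0000")

print(origin.isoformat())
print(origin == cdn)
```

Without the conversion, those two records would land on different calendar days in a time-series chart even though they describe the same moment.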
Step 2. Verify real Googlebot via reverse DNS
Anyone can send a request with the Googlebot user agent string. Plenty of competitor-research tools, scrapers and AI training crawlers do exactly that. If you analyse the log without filtering, you will conclude things about Googlebot's behaviour that are actually true of someone else's bot pretending to be Googlebot.
The official verification method is a reverse DNS lookup. Take the IP address. Look up the hostname. Confirm it ends in googlebot.com or google.com. Then do a forward lookup on that hostname and confirm it resolves back to the same IP. Only then can you trust the request as real Googlebot. Screaming Frog Log File Analyser and similar tools automate this with the right bot list loaded.
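The reverse-then-forward check can be sketched in a few lines of Python. This is an illustration under stated assumptions, not how any particular tool implements it: the function name and the injectable resolver arguments are ours, the default resolvers handle IPv4 only, and the demo uses made-up DNS answers so it runs without network access.

```python
import socket

GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")

def is_verified_googlebot(ip, reverse_lookup=None, forward_lookup=None):
    """Reverse DNS on the IP, check the domain, then confirm the forward lookup returns the same IP."""
    # The resolvers are injectable so the logic can be exercised offline.
    reverse_lookup = reverse_lookup or (lambda addr: socket.gethostbyaddr(addr)[0])
    forward_lookup = forward_lookup or (lambda host: socket.gethostbyname_ex(host)[2])
    try:
        hostname = reverse_lookup(ip)
    except OSError:  # no PTR record, or the lookup failed
        return False
    if not hostname.lower().rstrip(".").endswith(GOOGLE_SUFFIXES):
        return False
    try:
        return ip in forward_lookup(hostname)
    except OSError:
        return False

# Invented DNS answers for the demo; real lookups would use the defaults above.
fake_ptr = {
    "66.249.66.1": "crawl-66-249-66-1.googlebot.com",
    "203.0.113.9": "host.example.net",
    "198.51.100.7": "fake.googlebot.com",  # spoofed PTR record
}
fake_a = {
    "crawl-66-249-66-1.googlebot.com": ["66.249.66.1"],
    "fake.googlebot.com": ["192.0.2.1"],
}

def demo(ip):
    return is_verified_googlebot(ip, fake_ptr.__getitem__, fake_a.__getitem__)

for ip in fake_ptr:
    print(ip, demo(ip))
```

The third demo address shows why the forward lookup matters: anyone who controls their own reverse DNS can claim a googlebot.com hostname, but they cannot make Google's forward record point back at their IP.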
The proportion of fake traffic varies. We routinely see 15 to 35 percent of self-declared Googlebot traffic fail verification on Australian commercial sites. On one Perth e-commerce store we audited, over half the "Googlebot" hits were a competitor's bot. Skip this step and your whole analysis is wrong.
Step 3. Load the cleaned log into a log analyser
The standard tool is Screaming Frog Log File Analyser. It is a desktop app, costs a modest annual fee, and is built for this job. Import the filtered log, choose your bot list, and let it index. For a busy site, indexing a month of logs takes a few minutes.
Alternatives exist. Splunk and ELK are overkill for one-off analyses but worth setting up if you want ongoing monitoring. grep, awk and a spreadsheet work for the technically confident on smaller datasets. We use Screaming Frog for client work because the reporting is opinionated in the right direction.
Step 4. Map crawl frequency by URL and template
This is the first chart that tells you something useful. Group every Googlebot request by URL pattern. For an e-commerce site that means: homepage hits, category-page hits, product-page hits, parameter-URL hits, blog post hits, internal-search hits, paginated-archive hits. Rank each group by request volume.
The shape of that ranking tells you Google's priorities. If product pages are getting 40 percent of crawl and parameter URLs are getting 35 percent, Google is wasting more than a third of its budget on URLs you almost certainly want canonicalised away. If your category pages get 5 percent and an old blog tag archive nobody has touched since 2019 gets 15 percent, the internal-link graph is sending the wrong signals.
Compare the crawl distribution to the importance distribution. The top revenue pages should be at the top of the crawl frequency list. They are usually not.
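If you are working outside a dedicated tool, the grouping can be approximated with a handful of path patterns. The sketch below is one way to do it in Python. The URL patterns describe an imagined store, not any real site, and the sample URLs are invented.

```python
import re
from collections import Counter
from urllib.parse import urlsplit

# Hypothetical path patterns for an imagined store; the first match wins.
TEMPLATES = [
    ("product", re.compile(r"^/products/[^/]+/?$")),
    ("category", re.compile(r"^/collections/[^/]+/?$")),
    ("blog", re.compile(r"^/blog/")),
    ("homepage", re.compile(r"^/$")),
]

def classify(url):
    """Assign one requested URL to a template group."""
    parts = urlsplit(url)
    if parts.path.startswith("/search"):
        return "internal-search"
    if parts.query:
        return "parameter"
    for name, pattern in TEMPLATES:
        if pattern.search(parts.path):
            return name
    return "other"

def crawl_share(urls):
    """Return (group, hits, percent of total) tuples, largest group first."""
    counts = Counter(classify(u) for u in urls)
    total = sum(counts.values())
    return [(name, hits, round(100 * hits / total, 1)) for name, hits in counts.most_common()]

# Invented sample of verified Googlebot requests.
requests = [
    "/", "/products/drill", "/products/drill?colour=blue",
    "/products/drill?colour=blue&sort=price-asc", "/collections/drills",
    "/search?q=drill", "/blog/how-to-drill", "/products/saw",
    "/products/saw?utm_source=x", "/tag/old",
]

for name, hits, pct in crawl_share(requests):
    print(f"{name:16} {hits:3} {pct:5.1f}%")
```

A large "other" group is a signal to add patterns rather than a finding in itself.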
Step 5. Find the wasted crawl
This is the stage that earns its keep. Filter the log to four bucket types:
- Non-200 status codes. Every 404 is a wasted hit. Every 301 chain wastes one hit per hop. 5xx errors mean the server failed; Googlebot retries those and counts each retry against the crawl budget.
- Parameter URLs. Anything with a query string: ?utm_, ?fbclid, ?sessionid, ?colour=blue&size=large. Each one is a separate URL to Googlebot unless canonicalised. We routinely see thousands of distinct parameter URLs eating crawl on sites where one canonical version would be enough.
- Faceted-navigation URLs. The combinatorial explosion of filter combinations on e-commerce category pages. Twenty filters with three options each multiply out to billions of theoretical URLs. Googlebot will eventually try a depressing number of them.
- URLs outside the sitemap. Anything Googlebot hit that is not in your XML sitemap. Sometimes those are pages you forgot to add. Often they are URLs that should not be crawled at all, generated by old plugins, legacy redirects or a CMS quirk.
Sum the requests in those buckets. Express it as a percentage of total crawl. The number is usually shocking the first time you see it.
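The bucket arithmetic can be sketched as follows. This is a simplified illustration: each request is assigned to the first bucket it matches so nothing is counted twice, every non-200 response is treated as waste (which also sweeps up 304s), and the facet parameter names, sample requests and sitemap are all invented.

```python
from collections import Counter
from urllib.parse import urlsplit, parse_qs

FACET_PARAMS = {"colour", "size", "sort"}  # hypothetical filter parameter names

def waste_bucket(url, status, sitemap_paths):
    """Return the waste bucket for one request, or None if the hit looks useful."""
    parts = urlsplit(url)
    if status != 200:
        return "non-200"
    if parts.query:
        params = set(parse_qs(parts.query, keep_blank_values=True))
        return "faceted" if params & FACET_PARAMS else "parameter"
    if parts.path not in sitemap_paths:
        return "outside-sitemap"
    return None

def wasted_crawl(requests, sitemap_paths):
    """Percentage of all requests in each waste bucket, plus the overall wasted percentage."""
    buckets = Counter(waste_bucket(url, status, sitemap_paths) for url, status in requests)
    total = sum(buckets.values())
    if total == 0:
        return {}
    useful = buckets.pop(None, 0)
    report = {name: round(100 * hits / total, 1) for name, hits in buckets.items()}
    report["total-wasted"] = round(100 * (total - useful) / total, 1)
    return report

# Invented sample: (requested URL, status code) for verified Googlebot hits.
sitemap = {"/", "/products/drill", "/products/saw", "/collections/drills"}
requests = [
    ("/products/drill", 200), ("/products/drill?colour=blue", 200),
    ("/products/drill?utm_source=news", 200), ("/old-page", 404),
    ("/legacy/print-view", 200), ("/collections/drills", 200),
    ("/products/saw", 200), ("/products/saw?sort=price-asc", 200),
    ("/products/hammer", 301), ("/", 200),
]

print(wasted_crawl(requests, sitemap))
```

The order of the checks is a design choice: putting status first means a 404 on a parameter URL is reported as a broken response rather than as parameter waste, which usually matches who has to fix it.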
Step 6. Cross-reference logs with the sitemap and Search Console
Two cross-references matter. First, take every URL in your sitemap. Mark which ones Googlebot has visited in the last 30 days. The unvisited list is the "Google does not care about these" pile. Common reasons: thin content, weak internal linking, the page is too many clicks from any hub. Each one needs a separate decision: expand the content, link to it harder from somewhere, or remove it from the sitemap if it does not deserve to be there.
Second, take Search Console's Pages report ("Not indexed" categories). Cross-reference with crawl frequency. Pages marked "Crawled - currently not indexed" with high hit counts in the log usually fail on quality signals. Pages marked "Discovered - currently not indexed" with zero hits in the log fail on crawl-budget or internal-linking signals. The label Google assigns tells you which lever to pull.
Both cross-references produce specific URL-level findings that go straight on the developer fix list.
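Both cross-references reduce to simple lookups once the hit counts are keyed by path. The Python sketch below assumes the status strings are copied from a Search Console export and that "high hit counts" means ten or more in the window. That threshold, the function names and the sample data are all our assumptions.

```python
from urllib.parse import urlsplit

def unvisited_sitemap_urls(sitemap_urls, hits_by_path):
    """Sitemap URLs whose path has no verified Googlebot hit in the log window."""
    return sorted(u for u in sitemap_urls if hits_by_path.get(urlsplit(u).path, 0) == 0)

def suggest_lever(status, hits, high_hits=10):
    """Map a Search Console status plus a log hit count to a likely lever."""
    if status == "Crawled - currently not indexed" and hits >= high_hits:
        return "quality: improve or consolidate the content"
    if status == "Discovered - currently not indexed" and hits == 0:
        return "discovery: strengthen internal links or free up crawl budget"
    return "unclear: review by hand"

# Invented sample data.
sitemap = ["https://example.com/a", "https://example.com/b", "https://example.com/c"]
hits = {"/a": 42, "/c": 0}
statuses = {
    "/a": "Crawled - currently not indexed",
    "/c": "Discovered - currently not indexed",
}

print(unvisited_sitemap_urls(sitemap, hits))
for path, status in statuses.items():
    print(path, "->", suggest_lever(status, hits.get(path, 0)))
```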
Step 7. Triage and ship the findings
Pull every finding into a spreadsheet, one row per URL pattern. Four columns alongside the pattern: estimated wasted crawl (as a percentage of total budget), suggested fix (canonical, noindex, robots.txt block, redirect, internal-link change), effort (hours), and owner (developer, SEO, content).
Sort by wasted-crawl percentage descending. The top three usually account for more than half of the entire wasted budget. Ship those three first. Re-run the analysis 30 days later to confirm the changes shipped and Googlebot's behaviour responded. Repeat quarterly.
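The sort and the "top three" check are easy to script if the findings already live in structured form. A minimal sketch follows, with invented findings and column names of our own choosing. It writes the result as CSV text that can be pasted into Google Sheets or Excel.

```python
import csv
import io

def triage(findings):
    """Sort findings by wasted crawl, largest first, and add a running share of all waste found."""
    ordered = sorted(findings, key=lambda f: f["wasted_pct"], reverse=True)
    total = sum(f["wasted_pct"] for f in ordered)
    running = 0.0
    rows = []
    for finding in ordered:
        running += finding["wasted_pct"]
        share = round(100 * running / total, 1) if total else 0.0
        rows.append({**finding, "cumulative_share": share})
    return rows

# Invented findings; wasted_pct is the share of total Googlebot requests.
findings = [
    {"pattern": "/blog/tag/*", "wasted_pct": 6, "fix": "noindex", "effort_hours": 1, "owner": "SEO"},
    {"pattern": "?colour=&sort=", "wasted_pct": 22, "fix": "canonical", "effort_hours": 4, "owner": "developer"},
    {"pattern": "/search?q=", "wasted_pct": 9, "fix": "robots.txt block", "effort_hours": 1, "owner": "developer"},
    {"pattern": "/old-catalogue/*", "wasted_pct": 3, "fix": "redirect", "effort_hours": 2, "owner": "developer"},
    {"pattern": "/print/*", "wasted_pct": 2, "fix": "noindex", "effort_hours": 1, "owner": "SEO"},
]

rows = triage(findings)
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```

The cumulative_share column makes the "ship the top three first" call explicit: in this made-up set the first three rows cover 88.1 percent of the waste found.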
That is the process. Seven steps, two hours of actual work once you have the logs in hand (longer if it is your first time with the tools). The deliverable is a fix list. Not a chart pack. A spreadsheet with URL patterns and named owners.
Common log analysis mistakes
A good log analysis:
- Verifies every Googlebot request via reverse DNS before counting it. Real signal only.
- Looks at 30 to 90 days of data so the day-to-day noise averages out.
- Groups requests by URL pattern and template, not just individual URL, so the picture is interpretable.
- Cross-references the log against the sitemap and Search Console Pages report to find the gaps.
- Outputs a fix list with named URL patterns and estimated wasted-crawl percentage on each row.
A bad log analysis:
- Counts every Googlebot user-agent hit as real Googlebot. Inflates the numbers by 15 to 35 percent.
- Uses a single day of logs and treats the result as a trend.
- Shows a chart of total hits over time with no breakdown by URL pattern. Pretty, useless.
- Skips the sitemap and Search Console cross-references because they are fiddly.
- Reports findings as "Googlebot crawled 4.2 million URLs" without translating that into wasted budget and fixes.
- Ignores response-time outliers because the average looks fine.
If your last log analysis report was a 40-page deck full of pie charts and no fix list, you got the deck, not the analysis. Ask for the spreadsheet.
Tools and checklists
The minimum kit:
- Raw access logs from your stack. Free, if you can get them. Apache or Nginx on most WordPress hosts. CloudFront or Cloudflare logs on a CDN-fronted site. AWS Application Load Balancer logs on a custom build. Confirm the format and timezone before downloading.
- Screaming Frog Log File Analyser. Paid annual licence, modest cost. The default tool for this job. Handles reverse DNS verification, bot lists and the cross-references natively.
- A spreadsheet. Google Sheets or Excel. The output of every analysis is a sheet with URL patterns, hit counts, suggested fixes and owners.
- Search Console. Free. The Crawl Stats report sits alongside the log analysis. Together they give you Google's view of its own crawl and your server's view of the same traffic. Disagreements between the two are usually instructive.
- Our free SEO audit tool. A useful first pass for deciding whether a deeper audit (including log analysis) is needed. Pulls Lighthouse data, indexability checks and basic schema validation for any URL.
The advanced kit for ongoing monitoring on enterprise sites: Splunk or the ELK stack (Elasticsearch, Logstash, Kibana) for live log streaming and alerts; a custom dashboard with Looker Studio pulling from BigQuery for monthly reporting; and a dedicated bot-management layer like Cloudflare's bot rules for filtering the imposter traffic before it reaches your origin.
If you would rather skip running the analysis yourself and have us deliver the spreadsheet, that is the website audit service for one-off engagements, or part of the SEO retainer for ongoing work. We do this analysis on every large-site retainer we run. The findings tend to compound.
Perth and WA context
Log analysis is unusual among technical SEO disciplines in that the small-business case is weak and the mid-to-large case is overwhelming. Most Perth small businesses (the trades, healthcare, professional services in our usual catchment) do not have the URL volume to make log analysis worthwhile. For them, the technical audit and Search Console cover it. Save the log work for the sites that actually need it.
The sites that do need it cluster in a few categories. WA mining-services catalogues are the classic case. Equipment-supply businesses with 8,000 to 40,000 product SKUs, faceted navigation on category pages, and a CMS that generates parameter URLs for every filter combination. The first log analysis on a site like that usually reveals Googlebot spending more than half its time on parameter URLs nobody ever wanted indexed. Two days of canonical and robots.txt work later and the budget gets redirected to the actual product pages. The mining SEO industry page covers the broader pattern, and Karratha SEO sees this most often because the equipment suppliers up there carry the deepest catalogues.
Perth metro e-commerce on Shopify and WooCommerce is the second category. The Shopify URL structure for product variants, sort orders and collection filters generates more crawl waste than most operators realise. WooCommerce on the same scale is worse, partly because the default theme pagination patterns are crawl-budget hostile. A log analysis on a 12,000-SKU Perth fashion store almost always finds 40 to 60 percent of crawl being spent on URLs the business does not benefit from indexing. The e-commerce SEO industry page goes deeper on the pattern.
Real estate platforms across Fremantle, Mandurah and the rest of the Perth metro are the third category. Property listings have a natural churn rate. Listings sold last month should not be in the sitemap. They often still are, and Googlebot still visits them, until somebody actually looks at the logs and notices. The real estate SEO industry page covers the broader playbook.
One Perth metro pattern that recurs across all three categories. The site was built years ago. The original developer is gone. Nobody at the business has ever opened a log file. The hosting account has logs sitting there waiting, the analysis takes two hours, and the findings produce a fix list that wins back a quarter to a third of the crawl budget within a month. The first analysis is almost always the highest-value one. Subsequent quarterly checks are maintenance.
Related guides
- Back to the Technical SEO pillar for the full 12-chapter index.
- Crawl budget explained. The conceptual prequel. Read this before the log analysis if you have not already.
- The technical SEO audit. The nine-stage process that log analysis sits inside on enterprise sites.
- XML sitemaps explained. The sitemap is the cross-reference for step six of the log process.
- Robots.txt and meta robots. The fix for most wasted-crawl findings is a robots.txt rule or a noindex tag.
- Canonical tags explained. The other fix for most parameter-URL waste.
- JavaScript SEO essentials. Useful pair if the log shows Googlebot churning on a JavaScript-heavy site.
- Site migration SEO checklist. Running a log analysis the day before and 30 days after every migration is on the checklist.
- Crawling, indexing and ranking. The conceptual model the log analysis is testing in practice.
- Internal linking strategy. Where most orphan-page findings from the log get fixed.