Crawling A 20 Million Page Website

henry dalziel | SEO Hong kong pro
Tech SEO is a MASSIVE part of Search Engine Optimization

Needless to say, Tech SEO is a vital component of the overall optimization process.

This project, which I’ve been involved with for the last year, has thrown up some challenges and some useful discoveries.

Defining Tech SEO

Crawlers need to be able to understand what your site is all about, and that’s where “Tech SEO” comes into play.

You can have the best content on Planet Earth, but if your site is built badly and confuses Google (or any other search engine, for that matter), your efforts will be wasted.

It turns out that the most important thing to understand with Tech SEO is… crawl budget.

Before anyone says anything – yes, there is such a thing as “Crawl Budget”.

Google does NOT have infinite resources, nor does it want to crawl absolutely every single page on your site, so Googlebot (in all its different flavors) decides where to go based on the roadmaps you give it, such as your internal links, XML sitemaps and robots.txt rules.

Efficient Web Crawling Is Critical

Crawl Objectives

More than anything, the purpose of the crawl was to establish the health of the site.

I’d say that these are the reasons why you need to crawl your site:

  • Give yourself an overall “Health Check”
  • Find opportunities for quick wins
  • Understand your site structure including:
    • Orphan Pages
    • Pages that get nominal traffic
    • Pages that are too many clicks away from the homepage
  • Check your Core Web Vitals and page speed
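
To make "orphan pages" and "too many clicks away" a bit more concrete, here is a minimal sketch of how both can be worked out from a link graph. The toy `links` dictionary, the `known_urls` list and the two-click cut-off are all assumptions for illustration; a real crawler export will look different.

```python
from collections import deque

def click_depths(links, home):
    """Breadth-first search from the homepage; depth = minimum number of clicks."""
    depths = {home: 0}
    queue = deque([home])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in depths:
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths

def orphan_pages(known_urls, depths):
    """Known URLs (e.g. from a sitemap) that no internal link path reaches."""
    return sorted(set(known_urls) - set(depths))

# Hypothetical toy link graph: page -> pages it links to.
links = {
    "/": ["/shoes", "/bags"],
    "/shoes": ["/shoes/running"],
    "/shoes/running": ["/shoes/running/model-x"],
}
# Hypothetical list of URLs we know exist (sitemap, logs, analytics).
known_urls = ["/", "/shoes", "/bags", "/shoes/running",
              "/shoes/running/model-x", "/old-landing-page"]

depths = click_depths(links, "/")
deep_pages = [url for url, depth in depths.items() if depth > 2]  # assumed cut-off
orphans = orphan_pages(known_urls, depths)
```

Here `deep_pages` holds the pages more than two clicks from the homepage and `orphans` holds the known URLs that nothing links to.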

I outline the objectives more clearly below.

Crawl Efficiency

  • Find errors as the bot traverses the site
  • Discover disallowed pages that are truly non-indexable (cross-reference these with GSC)
  • Discover any pagination issues
  • Make sure canonicals are working as expected
  • If product pages receive very low traffic, consider deindexing them and focusing on more important category pages
  • Discover categories that receive very low traffic (cross-check with revenue) and “block” them
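
As an illustration of the canonical check, here is a rough sketch using only Python's standard library. It is not how OnCrawl or Screaming Frog do it; the `canonical_status` helper and its four labels are names I made up for this example.

```python
from html.parser import HTMLParser

class CanonicalParser(HTMLParser):
    """Collects the href of every <link rel="canonical"> tag in a page."""

    def __init__(self):
        super().__init__()
        self.canonicals = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = dict(attrs)
        rels = (attr.get("rel") or "").lower().split()
        if "canonical" in rels and attr.get("href"):
            self.canonicals.append(attr["href"])

def canonical_status(page_url, html):
    """Hypothetical helper: classify a page by its canonical tag."""
    parser = CanonicalParser()
    parser.feed(html)
    found = parser.canonicals
    if not found:
        return "missing"
    if len(found) > 1:
        return "multiple"
    return "self" if found[0] == page_url else "points-elsewhere"

# Made-up page source for the example.
page = '<html><head><link rel="canonical" href="https://example.com/shoes"></head></html>'
status_clean = canonical_status("https://example.com/shoes", page)
status_paged = canonical_status("https://example.com/shoes?page=2", page)
status_none = canonical_status("https://example.com/bags", "<html><head></head></html>")
```

A "points-elsewhere" result is not automatically a problem (a paginated or filtered URL pointing at its parent may be intended), so treat the output as a list of pages to review.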

Internal Linking

  • Use a visualization tool to see if we can redirect link equity to important pages
  • Re-align internal links
  • (Cross-compare with Screaming Frog)

Structured Data

  • Ensure that our product schema is enhanced to improve e-commerce results
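
For anyone who hasn't seen product schema before, this is roughly what a minimal schema.org `Product` block looks like when generated as JSON-LD. The `product_jsonld` function and the sample values are assumptions for illustration, not the markup used on the project site.

```python
import json

def product_jsonld(name, sku, price, currency, url, in_stock=True):
    """Build a minimal schema.org Product with a single Offer, as a JSON-LD string."""
    availability = "https://schema.org/InStock" if in_stock else "https://schema.org/OutOfStock"
    data = {
        "@context": "https://schema.org",
        "@type": "Product",
        "name": name,
        "sku": sku,
        "offers": {
            "@type": "Offer",
            "url": url,
            "price": f"{price:.2f}",
            "priceCurrency": currency,
            "availability": availability,
        },
    }
    return json.dumps(data, indent=2)

# Made-up product values for the example.
snippet = product_jsonld("Trail Runner X", "TRX-001", 89.5, "EUR",
                         "https://example.com/shoes/trail-runner-x")
parsed = json.loads(snippet)
```

The resulting string would normally sit inside a `<script type="application/ld+json">` tag on the product page.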

Log Data

  • Cross-reference pages with few Googlebot hits against GSC pages with low traffic (and exclude those pages)
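
A simple way to get the "Googlebot hits" side of that comparison is to count hits per URL in the server logs. The sketch below assumes logs in the common "combined" format and matches on the user agent string only; the sample lines are made up, and since a user agent can be spoofed a real analysis would also verify the bot (for example via reverse DNS).

```python
import re
from collections import Counter

# Assumes the "combined" access log format used by Apache and Nginx.
LOG_LINE = re.compile(
    r'\S+ \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

def googlebot_hits(lines):
    """Count hits per path for requests whose user agent mentions Googlebot."""
    hits = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match and "Googlebot" in match.group("agent"):
            hits[match.group("path")] += 1
    return hits

def rarely_crawled(urls, hits, max_hits=1):
    """URLs that Googlebot requested at most max_hits times (assumed threshold)."""
    return [url for url in urls if hits[url] <= max_hits]

BOT = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
sample_log = [
    f'66.249.66.1 - - [10/Oct/2023:13:55:36 +0000] "GET /shoes HTTP/1.1" 200 512 "-" "{BOT}"',
    f'66.249.66.1 - - [10/Oct/2023:14:02:10 +0000] "GET /shoes HTTP/1.1" 200 512 "-" "{BOT}"',
    f'66.249.66.1 - - [10/Oct/2023:14:05:44 +0000] "GET /bags HTTP/1.1" 200 734 "-" "{BOT}"',
    '203.0.113.9 - - [10/Oct/2023:14:06:01 +0000] "GET /belts HTTP/1.1" 200 301 "-" "Mozilla/5.0"',
]

hits = googlebot_hits(sample_log)
rare = rarely_crawled(["/shoes", "/bags", "/belts"], hits)
```

The `rare` list is what you would then line up against the low-traffic pages exported from GSC.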

Content Densities

  • Discover whether pages with more content (categories, product pages) get more hits


  • If a particular EU country accounts for less than 5% of all traffic, consider blocking the bot from wasting resources in that country’s even lower-level categories
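
The 5% rule is simple arithmetic, but here is a small sketch of it anyway. The session numbers are invented, and `low_share_countries` is just a name I picked for the example.

```python
def low_share_countries(sessions_by_country, threshold=0.05):
    """Return the countries whose share of total sessions is below the threshold."""
    total = sum(sessions_by_country.values())
    if total == 0:
        return []
    return sorted(
        country
        for country, sessions in sessions_by_country.items()
        if sessions / total < threshold
    )

# Invented session counts per country (10,000 sessions in total).
sessions = {"DE": 5200, "FR": 3100, "NL": 1300, "PT": 250, "FI": 150}
candidates = low_share_countries(sessions)
```

With these numbers, Portugal (2.5%) and Finland (1.5%) fall under the threshold and become candidates for a closer look.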
“Logical” Site Structure Is Vital

Google Search Console

Google Search Console (GSC) is amazing and often underrated.

Literally, GSC gives you most of the answers to the quiz, and one of the best things about it is that you can connect it to Google Data Studio (now Looker Studio) for even better research.

GSC is essentially a dashboard that shows you how Googlebot behaved when it hit your site. I use GSC to investigate errors, and it also has a very easy way to filter URLs by CTR and other important data about how Google is crawling your site.

Paired with Google Analytics, Search Console is a must for all Tech SEO projects, and both are free.

What’s The Difference Between Google Analytics (GA) & Google Search Console (GSC)?

  • GA is for traffic analysis
  • GSC is for crawling analysis (and reporting errors)

Tech SEO Tools That I Researched

These are the tools I researched for this project:

  • Deep Crawl
  • OnCrawl
  • Screaming Frog

Selected Tools: OnCrawl & Screaming Frog


I decided to use OnCrawl because of their support and because the tool is highly recommended by other SEOs that I greatly respect. Plus, I always see the company speaking at events like BrightonSEO and other conferences like that.

The other reason I chose OnCrawl over Deep Crawl was a fantastic feature they offer called “Crawl-on-Crawl”.

Screaming Frog

Screaming Frog doesn’t need much in the way of an introduction.

This Tech SEO tool has been around for a very long time and has a very loyal following, mostly, I’m sure, because there is a pretty generous free version. It has a limited crawl quota, but that should be enough for lean websites.

Unlike OnCrawl, which runs in the cloud, Screaming Frog is a desktop app, so it can chew up quite a bit of RAM.

So, anyway, those were the tools that I settled on.

My Goals

I’ve already touched on why we needed the crawl. However, to justify the investment I had to have a clear set of goals for what we were going to achieve with this entire Tech SEO project.

The purpose of the project was, of course, to get meaningful and actionable data. In summary, that meant being able to demonstrate a clear ROI from the actions taken after the crawl.

Setting Up The Crawl

Robots.txt File

OnCrawl crawls your site using a variety of different Bot User Agents.

Your robots.txt file likely allows all Googlebot crawlers (for example), and as long as it allows crawlers free passage there’s no need to tweak anything in that file.
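
If you want to double-check what your robots.txt actually allows for a given user agent, Python's built-in `urllib.robotparser` can do it. The rules below and the `ExampleCrawler` user agent are made up for the example; swap in your own file and whichever user agent your crawl is configured to use.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt rules for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /checkout/
Disallow: /search

User-agent: Googlebot
Disallow: /checkout/
"""

def allowed(robots_txt, user_agent, url):
    """Check whether user_agent may fetch url under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

# Googlebot has its own group, so only /checkout/ is off limits for it.
googlebot_search = allowed(ROBOTS_TXT, "Googlebot", "https://example.com/search?q=boots")
# "ExampleCrawler" is a hypothetical user agent that falls back to the * group.
other_search = allowed(ROBOTS_TXT, "ExampleCrawler", "https://example.com/search?q=boots")
other_category = allowed(ROBOTS_TXT, "ExampleCrawler", "https://example.com/shoes")
```

Note that the standard-library parser uses simple prefix matching, so results for wildcard rules may differ from how Google interprets them.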


However, be careful if you’re using Cloudflare.

If you are using Cloudflare then be sure to do one, or both, of the following:

  1. Whitelist the IP range that OnCrawl uses;
  2. Ensure that your crawl-blocking (“bot blocking”) rules are either relaxed or customized for the OnCrawl scan.
    • For #2 above, contact customer support and they’ll take care of it for you
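
If you ever need to confirm from your logs that requests really came from a whitelisted range, the check itself is short. The ranges below are placeholders taken from the reserved documentation blocks, NOT OnCrawl's real IPs; get the actual range from their support team.

```python
import ipaddress

# Placeholder ranges (RFC 5737 documentation blocks), not real crawler IPs.
CRAWLER_RANGES = ["203.0.113.0/24", "198.51.100.16/28"]

def is_allowlisted(ip, ranges=CRAWLER_RANGES):
    """Return True if ip falls inside any of the given CIDR ranges."""
    address = ipaddress.ip_address(ip)
    return any(address in ipaddress.ip_network(cidr) for cidr in ranges)

inside_first = is_allowlisted("203.0.113.57")
inside_second = is_allowlisted("198.51.100.20")
outside = is_allowlisted("198.51.100.40")
```

The same idea applies when you write the Cloudflare rule itself: allow by IP range rather than by user agent, because the user agent string is easy to fake.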

The Results

The results from the crawl showed us that there were indeed issues with the crawl budget.

Without getting too technical and sharing too much sensitive information, I can share that we discovered the following:

  • That there were rogue JSON files generating over 200K 404s per day!
  • That unnecessary taxonomy URLs were being crawled
  • That we had thousands of orphan pages
  • That we had important pages buried deep in the site architecture
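
Something like the JSON problem tends to jump out once you group 404s by day and file type. Here is a minimal sketch of that grouping; the `(day, url, status)` records are invented and assume the logs have already been parsed into that shape.

```python
from collections import Counter
from pathlib import PurePosixPath
from urllib.parse import urlsplit

def not_found_by_day_and_type(records):
    """Count 404 responses per (day, file extension) pair."""
    counts = Counter()
    for day, url, status in records:
        if status == 404:
            # Ignore the query string and keep only the extension of the path.
            suffix = PurePosixPath(urlsplit(url).path).suffix or "(none)"
            counts[(day, suffix)] += 1
    return counts

# Invented, already-parsed log records: (day, requested URL, status code).
records = [
    ("2023-10-10", "/api/stock/123.json?v=2", 404),
    ("2023-10-10", "/api/stock/456.json", 404),
    ("2023-10-10", "/shoes", 200),
    ("2023-10-11", "/old-page", 404),
]

not_found = not_found_by_day_and_type(records)
worst = not_found.most_common(1)[0]
```

In this toy data the worst offender is `.json` on 2023-10-10, which is the kind of pattern that would point you at a rogue script or feed.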
