Crawling is technical SEO jargon for automatically “scanning” websites and their data, then gathering and categorizing what was scraped.
If you’re interested in the Tech SEO aspect of crawling, you might find this post useful.
Search engine bots, web crawlers, and spiders download and index online material from the Internet – that’s basically how a search engine works.
To gather information and match relevant URLs to user queries, search engines like Google deploy bots or web crawlers. Without them, they would have no way to index content.
Both good and harmful bots exist, but in most cases “bots and spiders” are not harmful to your website.
For instance, you want Google’s bot to crawl and index your website, and that is a perfect example of a “good bot”.
Bots and spiders, however, can occasionally cause issues and bring in unwanted traffic.
“Good bots” operate in the background and seldom harm other users or websites.
Malicious bots, on the other hand, can breach a website’s security, and large-scale botnets can be used to launch DDoS attacks on powerful organizations (something a single machine could not pull off).
- 1 Should You Block Web Pages?
- 2 Why Would You Prevent Bots Crawling Site?
- 3 Restricting Bad Bot Behavior
- 4 How To Block Specific Assets?
- 5 Additional Thoughts To Blocking Bots
- 6 Expert Advice
- 7 Wrapping Up
Should You Block Web Pages?
As ever in SEO, it depends.
Crawling and indexing are frequently misunderstood in the context of SEO.
When “crawling,” web crawler bots examine a web page’s source code, blog entries, and other material.
On the other hand, “indexing” refers to determining whether a web page qualifies for display in search results.
Googlebot (Google), Bingbot (Bing), and Baidu Spider are some of the best-known examples of “web crawler bots”.
Think of a web crawling bot as a librarian who goes through an unorganized library and builds a card catalog so that users can find information quickly and conveniently.
However, for reasons that we will examine, it’s wise to restrict bots if you don’t want them to crawl and index all of your websites, or every page and directory within them.
Blocking bots would therefore be the best action to take to prevent search engines from indexing automatically created pages, which may only be useful to a small group of visitors.
Why Would You Prevent Bots Crawling Site?
Bad bots can help steal your personal information or bring down a website that is otherwise up and running. And, of course, malicious bots that we are able to detect should be blocked.
Although finding every bot that could crawl your website is difficult, with a little research you can uncover the dangerous ones that you no longer want visiting it.
You might wish to prevent bots from crawling your website for the following frequent reasons:
Protecting Your Important Data
Maybe you discovered that a plugin is drawing a lot of dangerous bots who want to steal your important customer data.
Or perhaps you discovered that a bot added malicious links all over your website by exploiting a security flaw.
Or, someone continues attempting to use a bot to spam your contact form.
At this point, you must take specific actions to safeguard your sensitive data against bot compromise.
If you have a spike in bot traffic, it’s likely that your bandwidth usage will increase as well, resulting in unanticipated overages and fees you’d rather avoid.
In these situations, you must definitely prevent the offending bots from crawling your website.
You don’t want to find yourself shelling out thousands of dollars for bandwidth that is not warranted.
What Do We Mean By Bandwidth?
Bandwidth is the data transferred from your server to the client side (the web browser).
You use bandwidth each time data is delivered over a connection.
Bots might browse your website and use up bandwidth, which could result in overage fees if you go over your monthly bandwidth allotment.
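To see how much bandwidth bots are actually using, one rough approach is to total the bytes served per user agent from your server’s access log. The sketch below assumes the log is in the common “combined” format used by Apache and Nginx. The sample lines, the `ExampleBot` name, and the `bytes_by_user_agent` helper are made up for illustration.

```python
import re
from collections import Counter

# Assumed layout (combined log format):
# ip ident user [time] "request" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'^\S+ \S+ \S+ \[[^\]]+\] "[^"]*" (?P<status>\d{3}) (?P<bytes>\d+|-) '
    r'"[^"]*" "(?P<agent>[^"]*)"$'
)


def bytes_by_user_agent(lines):
    """Sum the response bytes served to each user agent string."""
    totals = Counter()
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if not match:
            continue  # skip lines that do not fit the assumed format
        size = match.group("bytes")
        totals[match.group("agent")] += 0 if size == "-" else int(size)
    return totals


# Hypothetical sample data; in practice you would read your own log file.
SAMPLE_LOG = [
    '203.0.113.5 - - [10/Oct/2023:13:55:36 +0000] "GET / HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; ExampleBot/1.0)"',
    '203.0.113.5 - - [10/Oct/2023:13:55:37 +0000] "GET /about HTTP/1.1" 200 2048 "-" "Mozilla/5.0 (compatible; ExampleBot/1.0)"',
    '198.51.100.7 - - [10/Oct/2023:13:56:01 +0000] "GET / HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"',
]

for agent, total in bytes_by_user_agent(SAMPLE_LOG).most_common():
    print(f"{total:>8} bytes  {agent}")
```

User agent strings can be faked, so treat the totals as a starting point for research rather than proof of who is visiting.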
When you signed up for your hosting service, your host ought to have given you at least some information about your bandwidth allotment.
Restricting Bad Bot Behavior
If a hostile bot starts attacking your website, it is reasonable to rein it in.
You might want to make sure that this bot cannot access your contact forms, for instance, or that it cannot access your website at all.
Act now to prevent the bot from compromising your most important data.
You can stop these bots so they don’t do too much harm by making sure your site is properly secured and locked down.
How To Block Specific Assets?
You can opt to prevent web crawlers and bots from viewing your web pages if you own a business and design specialized landing pages for your marketing campaign or internal activities.
By preventing search engines or web crawler software from seeing parts of your sites, using your information, or learning how you develop your digital marketing tactics, you will avoid being a target for other marketing efforts.
Block “Known Bad Bots” By Robots.txt
One common way to block “known bad bots” is using the robots.txt file. Examples of “known bad bots” would be the SEMRush or AHRefs bots. We know that these bots are not out to damage our sites, but we do know that they are taking our data and essentially selling it to their clients. Tools like SEMRush and AHRefs make a living scraping our backlinks, organic keywords, and a host of other variables, which they sell to clients who then make decisions based on metrics they would otherwise lack.
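As a rough illustration, here is what such a robots.txt might contain, checked with Python’s standard-library robots.txt parser so you can see which crawlers it turns away. I’m assuming `SemrushBot` and `AhrefsBot` are the user-agent tokens those crawlers identify with, so check each vendor’s documentation before relying on them. The `example.com` URL and the `build_parser` helper are placeholders.

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt: refuse two named crawlers, allow everyone else.
# The bot tokens are assumptions; confirm them with each vendor.
ROBOTS_TXT = """\
User-agent: SemrushBot
Disallow: /

User-agent: AhrefsBot
Disallow: /

User-agent: *
Disallow:
"""


def build_parser(robots_txt):
    """Parse robots.txt text the way a well-behaved crawler would."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)
for bot in ("SemrushBot", "AhrefsBot", "Googlebot"):
    allowed = parser.can_fetch(bot, "https://example.com/blog/")
    print(f"{bot}: {'allowed' if allowed else 'blocked'}")
```

This only shows how a crawler that honors robots.txt would read the file; nothing in it forces a bot to comply.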
Block “Known Bad Bots” By .htaccess
The “better way” of blocking bots is to use your .htaccess file. Many webmasters and SEOs advise using .htaccess for bot blocking.
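For a sense of what that looks like, the sketch below generates a small set of Apache mod_rewrite rules that answer with 403 Forbidden when the User-Agent header contains one of the listed names. It assumes an Apache server with mod_rewrite enabled. The `htaccess_block_rules` helper is hypothetical, and the bot names are my assumption of the tokens those crawlers send.

```python
import re


def htaccess_block_rules(bot_names):
    """Build mod_rewrite directives that return 403 to matching user agents."""
    for name in bot_names:
        # Keep names to simple tokens so the generated pattern stays safe.
        if not re.fullmatch(r"[A-Za-z0-9_-]+", name):
            raise ValueError(f"unsupported bot name: {name!r}")
    pattern = "|".join(bot_names)
    return "\n".join([
        "<IfModule mod_rewrite.c>",
        "RewriteEngine On",
        # [NC] makes the User-Agent match case-insensitive.
        f"RewriteCond %{{HTTP_USER_AGENT}} ({pattern}) [NC]",
        # "-" leaves the URL untouched; [F] answers with 403 Forbidden.
        "RewriteRule .* - [F]",
        "</IfModule>",
    ])


rules = htaccess_block_rules(["SemrushBot", "AhrefsBot"])
print(rules)
```

You would paste the printed lines into the .htaccess file at your site root. Test on a staging copy first, because a mistake in .htaccess can take the whole site offline.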
Robots vs htaccess For Bot Blocking?
That’s a great question!
I’d advise using the .htaccess for blocking bots for three reasons:
- Your .htaccess file is hidden from visitors (whereas anyone can view your robots.txt file in a browser)
- It keeps the directives neat and tidy
- The .htaccess file is super-powerful and carries more authority than the robots file: the server enforces it, while robots.txt is only a request
Additional Thoughts To Blocking Bots
For simplicity, consider just using the “noindex” robots meta tag on a standalone page you don’t want to be indexed.
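If you want to confirm that a page really carries the tag, a quick check along these lines can help. It is a minimal sketch using Python’s built-in HTML parser. The sample page and the `has_noindex` helper are invented for illustration, and it only looks at the robots meta tag, not at HTTP headers.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives found in <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = {name: (value or "") for name, value in attrs}
        if attributes.get("name", "").lower() == "robots":
            for directive in attributes.get("content", "").split(","):
                self.directives.add(directive.strip().lower())


def has_noindex(html):
    """Return True if the page asks search engines not to index it."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives


# Hypothetical landing page that should stay out of search results.
PAGE = """
<html>
  <head>
    <title>Internal campaign page</title>
    <meta name="robots" content="noindex, follow">
  </head>
  <body>Hello</body>
</html>
"""

print(has_noindex(PAGE))
```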
Malicious bots are frequently used in targeted internet assaults to get access to crucial data on your website, such as your customers’ financial information. Even if you have security policies for your web server in place, banning hostile bots can provide extra security by preventing unauthorized access.
The following advice will help you stop dangerous bots from targeting your website:
- Bad bot assaults can be deterred by installing security plugins, such as the Wordfence security plugin for WordPress.
- To filter out malicious requests, it’s also a good idea to implement access rules.
- You can name specific search engine crawlers on the User-agent line of your robots.txt file and disallow them if, for example, you don’t want to appear in a search engine tied to a particular country or language.
- Allowing only Googlebot as a user agent in your robots.txt file asks all other search engines not to crawl your pages.
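The last point can be sketched as a robots.txt that welcomes Googlebot and turns everyone else away. As before, Python’s standard-library parser shows how a compliant crawler would interpret it. The `example.com` URL is a placeholder, and the file is an illustration rather than a recommendation for every site.

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt: an empty Disallow lets Googlebot crawl everything,
# while the catch-all group asks every other crawler to stay out.
GOOGLE_ONLY_ROBOTS_TXT = """\
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
"""

google_only = RobotFileParser()
google_only.parse(GOOGLE_ONLY_ROBOTS_TXT.splitlines())

for bot in ("Googlebot", "Bingbot", "Baiduspider"):
    allowed = google_only.can_fetch(bot, "https://example.com/pricing/")
    print(f"{bot}: {'allowed' if allowed else 'blocked'}")
```

Keep in mind that this shuts out every other well-behaved crawler too, including ones you may actually want.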
Expert Advice
I asked this question about bots on Craig Campbell’s excellent SEO show, and his and Mike’s advice is interesting and useful. Bottom line: no, don’t bother, because it creates a footprint that might arouse more suspicion than the benefit is worth.
[At the 35-minute mark]
Henry’s asking, would you ever block AHRefs or SEMRush bots at the robots.txt or .htaccess level? I mean, why share your backlink profiles and data, or should you not bother because who cares?
Mike answered, “[…this is often asked with regards to PBNs…] and I used to talk a lot about blocking this and blocking that just so that it skewed the data and people couldn’t follow everything that I was doing, but the reality is no one cares. Also, it gives a footprint in itself, and you’re an outlier if you’re blocking this data. I wouldn’t worry about it unless it makes sense for the specifics, like if you’re doing things through a redirect that you really don’t want people to see, you might want to block that there or something like that.”
Craig added, “I get where you’re coming from though, why share your backlink profiles and data with all the other douche bags out there, you’re just putting a big target on your back.”
Solid advice from these folks. If you haven’t already subscribed, here’s the show.
Wrapping Up
You probably don’t want all of your web pages to be crawled and indexed, even though you want Google and other search engines to take note of your finest pages to increase traffic, quality leads, and sales.
Sensitive corporate pages, low-quality pages, and pages restricted to authorized users shouldn’t be crawled and indexed by search engines. By blocking bots, you can achieve these objectives.
Is It Important To Do?
Sure – but don’t worry about it.
Despite everything I have said above, if you host with a decent host or you are using Cloudflare, then you’re basically good to go.
Also, I should mention that “truly” nefarious bots will never respect the robots.txt directives.