Some bots excluded from crawling client's domain
-
Hi all!
My client is in healthcare in the US and for HIPAA reasons, blocks traffic from most international sources.
a. I don't think this is good for SEO
b. The site won't allow Moz bot or Screaming Frog bot to crawl it. It's so frustrating.
We can't figure out what mechanism they are utilizing to execute this. Any help as we start down the rabbit hole to remedy is much appreciated.
thank you!
-
The main reason it's not good for SEO is that Google crawls from different data-centers around the world. So one day Google may see the site as up, and the next day a blocked data-center may make it look like the site is gone and down
Typically you use a user-agent lance to pierce these kinds of setups. In Screaming Frog, for example, you can pre-select from a variety of user-agents (including 'Googlebot' and Chrome), but you can also write your own custom user-agent
Write a long one that looks like an encryption key. Tell your client the user-agent you have defined and let them create an exemption for it within their spam-defense system. Then insert that user-agent (which no one else has or uses) into Screaming Frog and use it to let the crawler pierce the defense grid
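To make the idea concrete, here is a minimal sketch in Python of what that custom user-agent looks like in practice. The `sf-crawl-` prefix and the URL are made-up examples; the point is that the token is long, random and shared only between you and the client's ops team, who whitelist it on their side:

```python
# Sketch: crawl while presenting a long, secret, custom user-agent.
# The "sf-crawl-" prefix and example URL are hypothetical placeholders.
import secrets
import urllib.request

# A 64-hex-character random token that "looks like an encryption key".
# Generate it once, give it to the client to exempt, then paste the same
# string into Screaming Frog's custom user-agent field.
custom_ua = "sf-crawl-" + secrets.token_hex(32)

def fetch(url: str, user_agent: str) -> bytes:
    """Fetch a URL while identifying with the given user-agent string."""
    req = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(req) as resp:
        return resp.read()

# Usage (against the client's site, once the exemption is in place):
# html = fetch("https://client-site.example/", custom_ua)
```

Because the token is random and never published, it can't be spoofed the way 'Googlebot' can (see below).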
Typically you would want to exempt 'Googlebot' (as a user-agent) from these defense systems, but that comes with a risk. Anyone with basic scripting knowledge, or who knows how to install a Chrome extension, can easily alter the user-agent of their script (or web browser; it's under the user's control), and it is widely known that many sites make an exception for 'Googlebot', so it becomes a common vulnerability. For example, lots of publishers create URLs which Google can access and index, yet if you are a bog-standard user they ask you to turn off ad-blockers or pay a fee
Download a user-agent switcher extension for Chrome, set your user-agent to 'Googlebot' and sail right through. Not ideal from a defense perspective
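The usual mitigation for that spoofing problem is Google's documented verification procedure: reverse-DNS the requesting IP, check the hostname falls under googlebot.com or google.com, then forward-resolve the hostname and confirm it maps back to the same IP. A minimal sketch of that check, using only the standard library:

```python
# Sketch of Google's documented two-step Googlebot verification:
# reverse DNS lookup, domain check, then forward-confirming lookup.
import socket

def is_verified_googlebot(ip: str) -> bool:
    """Return True only if `ip` reverse-resolves to a googlebot.com /
    google.com hostname that forward-resolves back to the same IP."""
    try:
        host, _aliases, _ips = socket.gethostbyaddr(ip)
    except (socket.herror, socket.gaierror):
        return False  # no reverse DNS record at all
    if not host.endswith((".googlebot.com", ".google.com")):
        return False  # hostname isn't in Google's crawler domains
    try:
        return socket.gethostbyname(host) == ip  # forward-confirm
    except socket.gaierror:
        return False

# A random visitor claiming to be Googlebot fails this check:
# is_verified_googlebot("127.0.0.1")  → False
```

A defense system that exempts 'Googlebot' only after this check passes closes the spoofing hole, at the cost of a DNS lookup per new IP (results are normally cached).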
For this reason I have often wished (and I am really hoping someone from Google might be reading) that in Search Console you could register a custom user-agent string with Google. You could then exempt that string, safe in the knowledge that no one else knows it, and Google would use it to identify themselves when accessing your site and content. Then everyone could be safe, indexable and happy
We're not there yet