Robots.txt pattern matching
-
Hello fellow SEO people!
Site: http://www.sierratradingpost.com
Robots.txt: http://www.sierratradingpost.com/robots.txt
Please see the following line: Disallow: /keycodebypid~*
We are trying to block URLs like this:
http://www.sierratradingpost.com/keycodebypid~8855/for-the-home~d~3/kitchen~d~24/
but we still find them in the Google index.
1. We are not sure whether we need to tell the robot explicitly to use pattern matching.
2. We are not sure whether the format is correct. Should we use Disallow: /keycodebypid*/ or /*keycodebypid/ or even /*keycodebypid~/?
What is even more confusing is that the meta robots tag on these pages says "noindex", yet they still show up: <meta name="robots" content="noindex, follow, noarchive" />
Thank you!
-
Here's a good SEOMoz post about this: http://www.seomoz.org/blog/robot-access-indexation-restriction-techniques-avoiding-conflicts. What's most likely happening is that the disallow in robots.txt is preventing the bots from crawling the page, so they never get to see the meta noindex tag. And if people link to one of these pages externally, the disallow in robots.txt does not prevent the URL from appearing in search results.
The robots.txt syntax you're using now looks correct to me for what you're trying to do.
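To make the matching rules a bit more concrete, here is a rough Python sketch of how Google-style robots.txt patterns are generally documented to behave: a rule is a prefix match on the URL path, `*` stands for any run of characters, and a trailing `$` anchors the end of the path. The function names (`robots_pattern_to_regex`, `is_blocked`) are made up for this illustration, and this is not Googlebot's actual code, just a sketch under those assumptions.

```python
import re


def robots_pattern_to_regex(pattern: str) -> "re.Pattern[str]":
    """Translate a robots.txt path pattern into a compiled regex.

    Assumed rules (hypothetical helper, not an official implementation):
    '*' matches any sequence of characters, a trailing '$' anchors the
    end of the path, and everything else is matched literally.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape each literal chunk and join the chunks with '.*' for each '*'.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_blocked(path: str, pattern: str) -> bool:
    """Return True if the Disallow pattern matches the start of the path."""
    # re.match only matches at the beginning, which gives prefix semantics.
    return robots_pattern_to_regex(pattern).match(path) is not None


EXAMPLE_PATH = "/keycodebypid~8855/for-the-home~d~3/kitchen~d~24/"
CANDIDATES = [
    "/keycodebypid~*",
    "/keycodebypid~",
    "/keycodebypid*/",
    "/*keycodebypid/",
    "/*keycodebypid~/",
]

if __name__ == "__main__":
    for candidate in CANDIDATES:
        print(candidate, is_blocked(EXAMPLE_PATH, candidate))
```

Under these assumptions, /keycodebypid~* and plain /keycodebypid~ both match the example URL (the trailing * adds nothing to a prefix match), and /keycodebypid*/ matches as well. /*keycodebypid/ and /*keycodebypid~/ do not match, because the path never contains "keycodebypid/" or "keycodebypid~/".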
-
Hi,
So you have both the robots.txt rule and the meta tag. I think the meta tag is the better option (http://www.seomoz.org/learn-seo/robotstxt).
Do you have Webmaster Tools set up for your site? You can test your robots.txt file there (http://www.google.com/support/webmasters/bin/answer.py?answer=156449).
-
Well done John!!!

-
Great point! I will remember that. However, I have both the disallow line in the robots.txt file and the noindex meta tag. Yet Google shows 3000 of them!?!?!?!
http://www.google.com/search?q=site%3Awww.sierratradingpost.com+keycodebypid
-
Somehow Google is finding these pages, but you're disallowing the Googlebot from reading the page, so it doesn't know anything about the meta noindex tag on the page. If you have meta noindex tags on all of these pages, you can remove that line in your robots.txt preventing bots from reading these pages, and as Google crawls these pages, they should remove them from their SERPs.
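Here is a small Python sketch of why the two directives conflict. It simulates a well-behaved crawler that checks the Disallow rules first and only parses the page's meta robots tag if it is allowed to fetch the page. Everything here (`RobotsMetaParser`, `blocked`, `directives_seen_by_crawler`, the sample HTML) is a hypothetical illustration, not how Googlebot is actually implemented, and the matcher only handles the `*` wildcard.

```python
import re
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "meta" and (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            for directive in content.split(","):
                if directive.strip():
                    self.directives.add(directive.strip().lower())


def blocked(path, disallow_patterns):
    """Prefix match with '*' as a wildcard (simplified: no '$' support)."""
    for pattern in disallow_patterns:
        if not pattern:
            continue  # an empty Disallow line blocks nothing
        regex = ".*".join(re.escape(part) for part in pattern.split("*"))
        if re.match(regex, path):
            return True
    return False


def directives_seen_by_crawler(path, html, disallow_patterns):
    """Return the meta robots directives the crawler reads, or None if the
    page is disallowed and therefore never fetched at all."""
    if blocked(path, disallow_patterns):
        return None
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


PAGE_PATH = "/keycodebypid~8855/for-the-home~d~3/kitchen~d~24/"
PAGE_HTML = (
    '<html><head><meta name="robots" content="noindex, follow, noarchive" />'
    "</head><body></body></html>"
)

# With the Disallow line in place, the noindex tag is never read.
with_disallow = directives_seen_by_crawler(PAGE_PATH, PAGE_HTML, ["/keycodebypid~*"])
# With the Disallow line removed, the crawler can see noindex and act on it.
without_disallow = directives_seen_by_crawler(PAGE_PATH, PAGE_HTML, [])
```

In this sketch `with_disallow` comes back as None (the page is never fetched), while `without_disallow` contains noindex, follow and noarchive, which is why removing the Disallow line lets the noindex tag do its job.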
-
John, the article was a real eye-opener! Thanks again!
-
OK, not sure if this was already shared, but here is Matt Cutts talking about this same subject:
www.youtube.com/watch?v=I2giR-WKUfY