The Moz Q&A Forum


    Blocking poor quality content areas with robots.txt

    Intermediate & Advanced SEO
Eric_edvisors:

I found an interesting discussion on Search Engine Roundtable where Barry Schwartz and others were discussing using robots.txt to block low-quality content areas affected by Panda.

      http://www.seroundtable.com/google-farmer-advice-13090.html

      The article is a bit dated. I was wondering what current opinions are on this.

We have some dynamically generated content pages that we tried to improve after Panda. Resources have been limited and, alas, they are still there. Until we can officially remove them, I thought it might be a good idea to just block the entire directory. I would also remove them from my sitemaps and resubmit. There are links coming in, but I could redirect the important ones (I was going to do that anyway). Thoughts?

Mark_Ginsberg:

Blocking a page or folder in robots.txt doesn't remove the page from the search engine's index; it just prevents the engine from recrawling it. For pages, folders, or sites that were never crawled in the first place, robots.txt can keep them from being crawled and read. But for pages already in the index, a robots.txt block on its own will not be enough to remove them.
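To see what a Disallow rule does (and doesn't do), here is a minimal sketch using Python's standard-library robots.txt parser. The `/dynamic/` directory is a hypothetical example path, not one from this thread: the rule stops compliant crawlers from fetching those URLs, but it says nothing about removing already-indexed pages.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt blocking a low-quality dynamic directory.
rules = [
    "User-agent: *",
    "Disallow: /dynamic/",
]

parser = RobotFileParser()
parser.parse(rules)

# Compliant crawlers will not fetch anything under /dynamic/ ...
print(parser.can_fetch("Googlebot", "https://example.com/dynamic/page?id=7"))  # False
# ... but the rest of the site remains crawlable.
print(parser.can_fetch("Googlebot", "https://example.com/about/"))  # True
```

Note that `can_fetch` only models crawl permission; whether a URL stays in the index is a separate question, which is exactly the distinction made above.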

        To remove this low quality content, you can do one of two things:

1. Add a meta robots noindex tag to the content you want to remove - this tells the engine to drop the page from the index; in effect, the page is dead to them
2. After blocking the folder via robots.txt, go into Webmaster Tools and use the URL removal tool on the folder or domain.

I usually recommend option 1, because it works across multiple engines, doesn't require a separate webmaster tools account for each engine, is easier to manage, and gives you much finer control over exactly which pages get removed.
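The tag described in option 1 is a one-line addition to each page's `<head>`:

```html
<!-- In the <head> of each low-quality page you want dropped from the index -->
<meta name="robots" content="noindex">
```

One caveat worth making explicit: for the noindex to take effect, the page must remain crawlable. If you block the same pages in robots.txt at the same time, the engine can never fetch them to see the tag, and they can linger in the index.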

But you are on the right track with the sitemaps - don't include links to the noindexed pages in your sitemap.

        Good luck,

        Mark

Eric_edvisors @Mark_Ginsberg:

          Hey Mark - Thank you, this is really helpful.

This is really great advice for deindexing the pages while they still exist.

One more question, though. Once we actually remove them - once the directory no longer exists - there's no point in keeping the robots.txt disallow, right? At that point, if they're still in the index, only the URL removal tool will be useful.

          I read these: https://support.google.com/webmasters/answer/59819?hl=en

While the webmaster guidelines say you need to use robots.txt, I don't see how that's a requirement for pages that no longer exist. Google shouldn't be able to crawl pages once they're gone. Also, if the directory is blocked in robots.txt but there are a few redirects within it, the redirects would not work. I also don't think adding a line to robots.txt every time we remove something is good practice. Thoughts?
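The redirect concern raised here can be made concrete with a hypothetical robots.txt (the `/dynamic/` path is illustrative, not from the thread):

```
User-agent: *
Disallow: /dynamic/
```

With this rule in place, a 301 at, say, /dynamic/old-page pointing to a replacement page would never be fetched by compliant crawlers, so the redirect would not be followed and its link signals would not be consolidated. Any URLs you intend to redirect need to stay outside the blocked directory, or the block needs to be lifted first.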

KaneJamison @Eric_edvisors:

If the page no longer exists and you remove the robots.txt rule for that directory, it shouldn't make much difference. Google could start reporting it as a 404, since it knows the files used to exist and there's no longer a rule telling it to ignore the directory. I don't see any harm in leaving the rule there, but I also don't see many issues arising from removing it.
