Noindexing Duplicate (non-unique) Content
-
Hi khi5,
I think excluding those MLS listings from your site using the robots.txt file would be over kill.
As I'm sure you well know, Google does what it wants. I think tagging the pages you don't want indexed with "noindex follow" AND adding them to the robots.txt file doesn't make the likelihood that Google will respect your wishes any higher. You might want to consider canonicalizing them though, so links to and bookmarks and shares of said pages get credited to your site.
As to how long it takes for Google to deindex said pages, it can take a very long time. In my experience, "a very long time" can run 6-8 months. You do have the option however, of using Google Webmaster Tools > Google Index > Remove URLs to ask to have them deindexed faster. Again, no guarantees that Google will do as you ask, but I've found them to be pretty responsive when I use the tool.
I'd love to hear if anyone else feels differently.
-
thx, Donna. My question was mainly around whether Google will NOT consider MLS pages as duplicate content when I place the "noindex" on. We can all guess, but does anyone have anything concrete on this, to make me understand reality of this. Can we with 90% certainty say "yes, if you place noindex on a duplicate content page, then Google will not consider that duplicate content, hence it will not count towards how Google views duplicate vs unique site content". This is the big question: If we are left in uncertainty, then only way forward may be to password protect such pages and not offer users without creating an account.....
Removal on GWT: I plan to index some of these MLS pages in the future (when I get more unique content on them) and I am concerned if once submitted to GWT for removal, then it is tough to get such pages indexed again.
-
When "noindex" is added to a page, does this ensure Google does not count page as part of their analysis of unique vs duplicate content ratio on a website? Yes, that will tell Google that you understand the pages don't belong in the index. They will not penalize your site for duplicate content if you're explicitly telling Google to noindex them.
Is there a chance that even though Google does not index these pages, Google will still see those pages and think "ah, these are duplicate MLS pages, we are going to let those pages drag down value of entire site and lower ranking of even the unique pages". No, there's no chance these will hurt you if they're set to noindex. That is exactly what the noindex tag is for. You're doing what Google wants you to do.
I like to just use "noindex, follow" on those MLS pages, but would it be safer to add pages to robots.txt as well and that should - in theory - increase likelihood Google will not see such MLS pages as duplicate content on my website? You could add them to your robots.txt but that won't increase your likelihood of Google not penalizing you because there is already no worry about being penalized for pages not being indexed.
On another note: I had these MLS pages indexed and 3-4 weeks ago added "noindex, follow". However, still all indexed and no signs Google is noindexing yet.....
Donna's advice is perfect here. Use the Remove URLs tool. Every time I've used the tool, Google has removed the URLs from the index in less than 12-24 hours. I of course made sure to have a noindex tag in place first. Just make sure you enter everything AFTER the TLD (.com, .net, etc) and nothing before it. Example: You'd want to ask Google to remove /mls/listing122 but not example.com/mls/listing122. The ladder will not work properly because Google automatically adds "example.com" to it (they just don't make this very clear). -
good one, Philip. Last BIG question: if I remove URL's from GWT, is it possible to "unremove" without issue? I am planning to index some of these MLS pages in the future when I have more unique content on.
-
Yep! After you remove the URL or directory of URLs, there is a "Reinclude" button you can get to. You just need to switch your "Show:" view so it shows URLs removed. The default is to show URLs PENDING removal. Once they're removed, they will disappear from that view.
-
thx. I have a URL like this for a REGION: http://www.honoluluhi5.com/oahu/waianae-makaha-condos/ and for a "NEIGHBORHOOD" I have this: http://www.honoluluhi5.com/oahu/waianae-makaha/maili-condos/
As you can see Region has "waianae-makaha-condos" directory, whereas the Neighborhood has "waianae-makaha" without the "condos" for that region directory part.
Question: when I go to GWT and remove can I simply type "oahu/waianae-makaha-condos" and select the directory option and that will ALSO exclude the neighborhood URL? OR, since the region part in the URL within the neighborhood URL is different I have to submit individually?
-
I'm not 100% sure Google will understand you if you leave off the slashes. I've always added them and have never had a problem, so you want to to type: /oahu/waianae-makaha-condos/
Typing that would NOT include the neighborhood URL, in your example. It will only remove everything that exists in the /waianae-makaha-condos/ folder (including that main category page itself).
edit >> To remove the neighborhood URL and everything in that folder as well, type /oahu/waianae-makaha/maili-condos/ and select the option for "directory".
edit #2 >> I just want to add that you should be very careful with this. You don't want to use the directory option unless you're 100% sure there's nothing in that directory that you want to stay indexed.
-
thx, Philip. So you are saying if I use the directory option that will ensure the paginated pages will also be taken out of the index like this page: /oahu/waianae-makaha/maili-condos/page-2
-
so I will remove directory for "/oahu/waianae-makaha/maili-condos/" and that will ensure removal of "/oahu/waianae-makaha/maili-condos/page-2" as well?
-
Yep. Just last week I had an entire website deindexed (on purpose, it's a staging website) by entering just / into the box and selecting directory. By the next morning the entire website was gone from the index

It works for folders/directories too. I've used it many times.
-
removing directory "/oahu/waianae-makaha-condos/" will NOT remove "/oahu/waianae-makaha/maili-condos/" because the silo "waianae-makaha" and "waianae-makaha-condos" are different.
HOWEVER,
removing directory " /oahu/waianae-makaha/maili-condos/" will remove "/oahu/waianae-makaha/maili-condos/page-2" because they share this silo "waianae-makaha"Is that correctly understood?
-
for some MLS result pages I have a BUNCH of pages and I want to remove from index with 1 click as opposed to having to include each paginated page. Example: http://www.honoluluhi5.com/oahu/honolulu/metro/waikiki-condos/page-52 I simply include"/oahu/honolulu/metro/waikiki-condos/" and that will ALSO remove from index this page: http://www.honoluluhi5.com/oahu/honolulu/metro/waikiki-condos/page-52 - is that correct?
-
Yep, you got it.
You can think of it exactly like Windows folders, if that helps you stay focused. If you have C:\Website\folder1 and C:\Website\folder12. "noindexing" \folder1\ would leave \folder12\ alone because they're not in the same directory.
-
Yes. It will remove /page-52 and EVERYTHING that exists in /oahu/honolulu/metro/waikiki-condos/. It will also remove everything that exists in /page-52/ (if anything). It trickles down as far as the folders in that directory will go.
**Go to Google search and type this in: **site:honoluluhi5.com/oahu/honolulu/metro/waikiki-condos/
That will show you everything that's going to be removed from the index.
-
thx, Philip. Most helpful. I will get on it
-
based on Google's own guidelines it appears to be a bad idea to use the removal tool under normal circumstances (which I believe my site falls under): https://support.google.com/webmasters/answer/1269119
It starts with: "The URL removal tool is intended for pages that urgently need to be removed—for example, if they contain confidential data that was accidentally exposed. Using the tool for other purposes may cause problems for your site."
-
Good find. I've never seen this part of the help section. Their resonating reason behind all of the examples seems to be "You don’t need to manually remove URLs; they will drop out naturally over time."
I have never had an issue, nor have I ever heard of anyone having an issue, removing URLs with the Removal Tool. I guess if you don't feel safe doing it, you can wait for Google's crawler to catch up, although it could take over a month. If you're comfortable waiting it out, have no reasons to rush it, AND feel like playing it super safe... you can disregard everything I've said

We all learn something new every day!
-
lol - good answer Philip. I hear you. What makes it difficult is the lack of crystal clear guidelines from search engines....it is almost like they don't know themselves and each case is sort of on a "what feels right" basis.....
-
Remember that if you no-index pages, any link you have on your site pointing to those pages is wasting its link juice.
This looks like a job for Canonical tag
-
Hi Alan, thx for your comment. Let me give you an example and if you have a though that's be great:
- Condos on Island: http://www.honoluluhi5.com/oahu-condos/
- Condos in City: http://www.honoluluhi5.com/oahu/honolulu-condos/
- Condos in Region: http://www.honoluluhi5.com/oahu/honolulu/metro-condos/
Properties on the result page for 3) are all in 2) and all properties within 2) is within 1). Furthermore, for each of those URL, the paginated pages (2 to n) are all different, since each property is different, so using canonical tags would not be accurate. 1 + 2 + 3 are all important keywords.
Here is what I am planning: add some unique content to the first page in the series for each of those URL and include just the 1st page in the serious to the index, but pages 2 to n I will keep "noindex, follow" on. Argument could be "your MLS result pages will look too thin and not rank" but the other way of looking at it is "with potentially 500 or more properties on each URL, a bit of stats on page 1 will not offset all the MLS duplicate data, so even though the page may look thin, only indexing page 1 is best way forward".