If I can't list my canonical URLs in my sitemap, will I be creating duplicate content?
-
You can't solve duplicate content issues with a sitemap. A sitemap only tells search engines which content you have and definitely want them to find. Search engines also crawl your site on their own, discovering content that isn't in your sitemap.
So on Monday they go to /shoes/brown/brand/supershoes.html, and on Tuesday they could follow a different path, /brand/shoes/brown/supershoes.html, which brings them to the same content on a different URL.
The only way to counter this is with rel="canonical", or by putting a noindex on the duplicates.
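As a minimal sketch of the two options above (the domain and paths here are placeholders based on the example URLs in this thread), the tags would go in the `<head>` of the duplicate page:

```html
<!-- On the duplicate page /brand/shoes/brown/supershoes.html,
     point search engines at the preferred URL
     (example.com is a placeholder domain): -->
<link rel="canonical" href="https://example.com/shoes/brown/brand/supershoes.html" />

<!-- Or, to keep the duplicate out of the index entirely
     while still letting crawlers follow its links: -->
<meta name="robots" content="noindex, follow" />
```

You'd normally pick one approach per page: rel="canonical" consolidates signals onto the preferred URL, while noindex simply drops the duplicate from the index.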
Putting the non-vanity URL in the sitemap just adds another URL to the index: the non-vanity one, the vanity one, the duplicates. You'd be adding more junk to the system.
I wouldn't worry too much about the duplicate content issue. Yes, it's a pain; yes, it should be easy to solve. But the only thing that really happens right now is that Google might not display your preferred URL in the SERPs. As long as you're not lifting content from another site and posting it on your own, you're in the clear. Oh, and yes, people might link to the "wrong" URL.
If you can't solve it with rel="canonical" or a noindex in the header within a reasonable time, move on to other stuff! Deal with this issue once you switch to another CMS and do a proper redirection effort.
-
Hi Yannick,
I realize that you can't solve duplicate content issues with a sitemap, but an XML sitemap does (as I understand it) tell Google how the site's pages are organized. And in the case of a site with too many URLs for Google to crawl them all, I imagine it would also influence the priority assigned to indexing a given page: a URL listed in the sitemap would likely not be skipped over by Google, while some random, less important search results page not in the sitemap might be overlooked.
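For reference, here's the kind of sitemap entry I mean (example.com is a placeholder; `<priority>` is an optional element in the sitemap protocol, and search engines treat it as a hint at most):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <!-- list only the one preferred URL per piece of content -->
  <url>
    <loc>https://example.com/shoes/brown/brand/supershoes.html</loc>
    <priority>0.8</priority> <!-- optional hint; engines may ignore it -->
  </url>
</urlset>
```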
Unfortunately I'm not sure that we can so easily shrug off the whole dupe content issue, since we are talking dozens of URLs per page of content. So, so many. There's no way Google can handle all of it, and our SEO link juice has got to be so diluted as to be useless. So we'd really like to fix this if at all possible.
Do you have any advice re: whether to go one direction (XML sitemap with URLs that do not match our "vanity" desired canonicals) or another (rel="canonical" for our pretty vanity SEO'd URLs)? Or both? We've got neither at the moment.
Thanks for your help.