Changing the way SEOmoz Detects Duplicate Content
-
Hey everyone,
I wanted to highlight today's blog post in case you missed it. In short, we're using a different algorithm to detect duplicate pages. http://moz.com/blog/visualizing-duplicate-web-pages
If you see a change in your crawl results and you haven't done anything, this is probably why. Here's more information taken directly from the post:
1. Fewer duplicate page errors: a general decrease in the number of reported duplicate page errors. However, it bears pointing out that:
- **We may still miss some near-duplicates.** Like the current heuristic, only a subset of the near-duplicate pages is reported.
- **Completely identical pages will still be reported.** Two pages that are completely identical will have the same simhash value, and thus a difference of zero as measured by the simhash heuristic. So, all completely identical pages will still be reported.
2. Speed, speed, speed: The simhash heuristic detects duplicates and near-duplicates approximately 30 times faster than the legacy fingerprints code. This means that soon, no crawl will spend more than a day working its way through post-crawl processing, which will facilitate significantly faster delivery of results for large crawls.
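For anyone curious how simhash catches both cases above, here's a minimal sketch in Python. This is just an illustration of the general technique, not Moz's actual implementation; the tokenization (whitespace split), the MD5-based token hash, and the 64-bit width are all assumptions on my part.

```python
import hashlib

def simhash(text, bits=64):
    # Each token casts a +1/-1 "vote" on every bit position based on its
    # own hash. Pages sharing most tokens end up with fingerprints that
    # differ in only a few bit positions.
    vector = [0] * bits
    for token in text.lower().split():
        h = int.from_bytes(hashlib.md5(token.encode()).digest()[:8], "big")
        for i in range(bits):
            vector[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if vector[i] > 0)

def hamming_distance(a, b):
    # Number of differing bits between two fingerprints.
    return bin(a ^ b).count("1")

page_a = "cheap widgets for sale best widgets online"
page_b = "cheap widgets for sale best gadgets online"  # near-duplicate
page_c = page_a  # identical copy

# Identical content always produces the same fingerprint, so exact
# duplicates are guaranteed a distance of zero and are always reported.
assert hamming_distance(simhash(page_a), simhash(page_c)) == 0

# Near-duplicates land a small Hamming distance apart; a threshold on
# that distance decides what gets flagged.
print(hamming_distance(simhash(page_a), simhash(page_b)))
```

The speed win comes from this structure: comparing two fingerprints is a single XOR and popcount instead of comparing page content, and fingerprints close in Hamming distance can be bucketed ahead of time rather than compared pairwise.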
-
That is good news. It should ease the minds of everyone who has been going nuts over the duplicate content reporting. Thanks!