Welcome to the Q&A Forum

katemats

Hi everyone!

I just wanted to add a quick response to shed a bit more light on the situation.

Last year we started on a project to drastically improve our index. The first part of that was to make our crawler discover more of the web - this included crawling deeper on domains, discovering more links faster (freshness), and containing more links overall.

Background

To understand the changes, it might help if I explain how our crawler used to work and how we changed.

Our crawler used to crawl the web (for 3-4 weeks), then we would compute the link graph and create all the lists of links and metrics you see in Open Site Explorer - this is what we called processing (and it would take 2-3 weeks). As part of processing we would select the top 10 billion URLs to crawl, and then start crawling those.

The problem with this system was that the data could be 7-8 weeks old (crawling time + processing + deployment to the API and OSE). It also wasn't recursive - meaning that we would only discover new links when we did the processing of that crawl, so it could take us several months before we would see new links that were deeper in domains.

The changes

We modified our crawler so we were crawling all the time - we crawl sites every day, or week, or month - based on authority. As we crawl those sites, any new links that we find are added to one of the buckets, and will typically be crawled within that same index. This is exciting because we can go deeper, discover more links, and produce a higher quality index. The other benefit is that since we are crawling all the time, we can just take a snapshot of that crawl and run processing - without waiting for the last round of processing to finish - and this means we can update the index more often.
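As a rough illustration of the bucketing described above (not Moz's actual code), here is a minimal Python sketch. The post only says crawl frequency is "based on authority", so the thresholds (70 and 40), the 30-day month, and all function names are hypothetical.

```python
from datetime import date, timedelta

# How often each bucket is recrawled. The bucket names follow the post;
# treating "month" as 30 days is an assumption.
RECRAWL_EVERY = {
    "daily": timedelta(days=1),
    "weekly": timedelta(days=7),
    "monthly": timedelta(days=30),
}


def bucket_for(authority):
    """Pick a crawl bucket from an authority score (hypothetical cutoffs)."""
    if authority >= 70:
        return "daily"
    if authority >= 40:
        return "weekly"
    return "monthly"


def next_crawl(authority, last_crawled):
    """Date on which a URL in this authority band is due to be crawled again."""
    return last_crawled + RECRAWL_EVERY[bucket_for(authority)]


def add_discovered_links(buckets, links):
    """File newly found (url, authority) pairs into a bucket right away,
    instead of waiting for the next processing run to discover them."""
    for url, authority in links:
        buckets[bucket_for(authority)].add(url)
```

The point of the design is in `add_discovered_links`: new links enter a bucket as soon as they are seen, which is what makes the crawl recursive within a single index.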

However, in June, we had a problem with the old crawlers, and we had to roll out our new version of the crawl and index with the OSE launch on July 27th. So even though our testing looked good when we released the new index, and correlations were higher than the old crawl, we got complaints about things that were wrong.

The issues

Binary files were in the index - There are normally only supposed to be links in the index, but because the new crawler went very deep on some domains we started discovering all sorts of binary files, which when parsed, produced lots of weird links. So domains had all these links from sites that didn't link to them. We fixed this issue, and this is the first index with the fix.
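The post does not say how this was fixed. One common guard, shown here purely as an assumption, is to check a response's Content-Type before running link extraction on it. The allowed types and the function name are hypothetical.

```python
# Media types we are willing to parse for links (an illustrative choice).
PARSEABLE_TYPES = {"text/html", "application/xhtml+xml"}


def is_parseable(content_type_header):
    """Return True only when the Content-Type header names an HTML document.

    Parameters after ';' (such as charset) are ignored. A missing or empty
    header is treated as not parseable, which errs on the side of skipping.
    """
    if not content_type_header:
        return False
    media_type = content_type_header.split(";", 1)[0].strip().lower()
    return media_type in PARSEABLE_TYPES
```

A crawler using this sketch would skip link extraction for something like `application/pdf` or `image/png`, so binary bytes never get interpreted as links.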

We went too deep on big domains - There are a lot of knobs to turn on the new crawlers - from the number of sites we crawl daily/weekly/monthly to how many links we keep for different domains. One of the first things we noticed with this new crawl was that we had fewer domains in our index. So we dialed down how many URLs could come from a domain - and this new index also contains that change.
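To make the "how many urls could come from a domain" knob concrete, here is a minimal Python sketch of a per-domain cap. It is an illustration only: the function name and the cap value are hypothetical, the input is assumed to be sorted best-first, and it groups by hostname rather than by registered domain, which is a simplification.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def cap_urls_per_domain(urls, max_per_domain):
    """Keep at most max_per_domain URLs per hostname, preserving input order.

    Because the input is assumed to be in priority order, the URLs dropped
    from a big domain are its lowest-priority ones.
    """
    counts = defaultdict(int)
    kept = []
    for url in urls:
        host = urlsplit(url).netloc.lower()
        if counts[host] < max_per_domain:
            counts[host] += 1
            kept.append(url)
    return kept
```

With a fixed overall index size, lowering `max_per_domain` frees room that would otherwise go to deep pages on a few very large sites, so more distinct domains fit in the index.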

What we are doing

We recognize that all of you depend on this data. And we take the index quality very seriously.

We have already made a lot of other changes, increasing the overall size and adjusting how we crawl. However, since it still takes 2-4 weeks to process an index, some of those changes won't be seen for another 2-4 weeks yet.

We are also working on an updated, higher correlating Page Authority/Domain Authority that should be out in a month or two - but also may jump around a bit.

What you can do

Definitely keep sending us feedback. It really helps us understand where we may have missed in our testing, and what we can do to fix it.

And thanks again for your patience - we really want to deliver the best possible Linkscape for you, and I assure you the team is working nights and weekends to address these concerns.

And if anyone has questions you can always email me or our help team (who tend to respond to emails much faster), as all of us care a lot and really want to hear your feedback.

Thanks again,
Kate

katemats

Hey Mark!

Sorry you weren't satisfied with the answer - hopefully this large amount of detail will help [brace yourself]

First, a little history....

When we first built the web app, we had originally thought about emulating the tool Rank Tracker but in the campaign context. Over 2 years (up to 2009) we had 85,000 keywords across all our customers, and we wanted to offer more functionality in the web app, so we allowed users to track up to 3 search engines for their keywords (this meant instead of just checking one rank, we generally check 3 ranks for every keyword). We also increased the limit to many more keywords and allowed users to cut and paste into the tool.

Since it was so easy to add keywords, and people watched Rand's video, which made the suggestion to just "cut and paste all the words that drive traffic to your site", pretty soon almost all our users were maxing out their keywords.

What we thought was a reasonable number of rankings quickly changed, and we were checking 500,000-800,000 keywords every single night (and keep in mind this is just weekly checks).

The issues with lots of keywords

In order to do this politely (since we want to make sure we aren't abusing the search engines) we have to spread these out throughout the day, and for really heavy days collecting all those SERPs can take almost the entire day!

Since we don't want to prioritize certain users over others we have set the once a week limit, which enables us to evenly distribute our collection over the whole week, and make sure that all rankings are always collected (since sometimes things can break when SERP layouts change, etc.).
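One way to picture "evenly distribute over the whole week" is to give every keyword a fixed slot derived from a hash of its ID. The Python sketch below is only an illustration under that assumption; the post does not describe the real scheduler, and `weekly_slot` and the ID format are hypothetical.

```python
import hashlib

SECONDS_PER_DAY = 24 * 60 * 60
SECONDS_PER_WEEK = 7 * SECONDS_PER_DAY


def weekly_slot(keyword_id):
    """Map a keyword ID to a stable (day_of_week, second_of_day) slot.

    A cryptographic hash spreads IDs roughly uniformly over the week, and
    the same ID always lands in the same slot, so each keyword is checked
    once a week at a consistent time.
    """
    digest = hashlib.sha256(keyword_id.encode("utf-8")).digest()
    offset = int.from_bytes(digest[:8], "big") % SECONDS_PER_WEEK
    return divmod(offset, SECONDS_PER_DAY)
```

Since no user's keywords are favored by the hash, the load is spread without prioritizing anyone, and requests to the search engines arrive as a steady trickle rather than a burst.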

Usability issues with different frequencies

Another challenge that has come up is that it is really important to a lot of users that all of their data updates on the same day. This poses a problem when keywords and rankings are collected off cycle, since then graphs could end up with extra data points and the definition of "weekly change" throughout our UI becomes fuzzy (do you use the week delta, or do you use the last rank you checked?). So we went for simple to make this more clear and usable for everyone.
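The ambiguity is easy to see with two candidate definitions of "weekly change" side by side. This Python sketch is illustrative only; the function names and the `(date, rank)` history format are assumptions, not the web app's data model.

```python
from datetime import date, timedelta


def change_vs_last_check(history):
    """Rank change since the previous data point, whenever that was.

    history is a list of (date, rank) tuples sorted by date. A positive
    result means the page moved up (its rank number got smaller).
    """
    if len(history) < 2:
        return None
    return history[-2][1] - history[-1][1]


def change_vs_week_ago(history):
    """Rank change against the data point exactly 7 days before the latest."""
    if not history:
        return None
    latest_date, latest_rank = history[-1]
    target = latest_date - timedelta(days=7)
    for checked_on, rank in history:
        if checked_on == target:
            return rank - latest_rank
    return None


# One off-cycle check (Aug 5) between two weekly checks (Aug 1 and Aug 8).
history = [(date(2011, 8, 1), 12), (date(2011, 8, 5), 9), (date(2011, 8, 8), 10)]
print(change_vs_week_ago(history))    # 2: up two places over the week
print(change_vs_last_check(history))  # -1: down one place since the last check
```

With a strictly weekly schedule both functions return the same number, which is why a single collection cadence keeps the UI unambiguous.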

So could we add more capacity, a new product, something?

Yes, to all of those questions.

We could make it so we could do a lot more keywords. And we have considered on demand rank checking, or even just tracking keywords outside of the campaign. However, it isn't something we are planning to do right now.

Instead we are focusing on improving the SERPs we do collect to include vertical results - like local, images and video - so you can track your performance in those areas. We are also focusing on adding more functionality to the web app to make it even more valuable for each campaign - including adding social data, and more.

However, we definitely, definitely listen to our customers and use your feedback as a critical factor in all of our product planning efforts. So if you think this is the wrong decision, or you want to tell us to build something - let us know here:

http://seomoz.zendesk.com/forums/293194-seomoz-pro-feature-requests

We couldn't build what we do without your input and it means a lot to us to hear your feedback.

Hope that helps!

Kate

katemats

And btw, we are still investigating the differences between indexes and will continue to update this thread as we have more information.

Thanks!

katemats

Thanks Gyorgy - I am glad you found it useful.

For what it is worth, we have another index update planned in 2-3 weeks, and then another 3 weeks out - each index should get progressively better.

The team is working overtime here though - the hard part is that the changes we make now can take 2+ months to propagate.

All the domains people sent us yesterday helped us identify another bug with our index, so we have a fix for that too. But since it takes 3-5 weeks to crawl, and then another 2-3 weeks to process, you won't be able to see those improvements for another 2+ months. However, by December, the index will be better than it has ever been - with more domains and links.

Thanks again for your patience and all the details - it has really helped us track down issues.

katemats

Sean - I share Rand's sentiments, thanks so much for the suggestion!

We have considered distributed crawling in the past (or even distributed rank checking because then it would be in that user's locale) but there is a whole different set of challenges. For example, you have to handle all the edge cases: what if a user's computer isn't on, or loses connectivity, what if we crawl too fast and the user gets blocked from a site, how do you write all that data securely?

Of course all of these concerns can be overcome, but right now we feel like we have a good handle on the problems, and it will be much faster for us to just fix what we have.

Although, I know all of us are so appreciative of the ideas and support, and we will have something really great soon!


Best posts made by katemats
