How does Google index pagination variables in Ajax snapshots? We're seeing random huge variables.

sitestrux

We're using the Google snapshot method to index dynamic Ajax content. Some of this content is from tables using pagination. The pagination is tracked with a var in the hash, something like:

#!home/?view_3_page=1

We're seeing all sorts of calls from Google now with huge numbers for these URL variables that we are not generating with our snapshots. Like this:

#!home/?view_3_page=10099089

These aren't trivial since each snapshot represents a server load, so we'd like these vars to only represent what's returned by the snapshots.

Is Google generating random numbers going fishing for content? If so, is this something we can control or minimize?

FedeEinhorn

I think you are right: Google is fishing for content. I would make those URLs friendly by removing the hash and using URL rewriting with pushState to paginate that content instead.

Here's a previous question that may help: http://moz.com/community/q/best-way-to-break-down-paginated-content

sitestrux

Hi Federico, thanks for the response.

Unfortunately this is an SEO solution for a third-party JavaScript product, so removing the hash isn't an option.

I'm still interested in knowing if this is a formal Google practice and if there's some way to control or mitigate this.

FedeEinhorn

We also noticed some weird crawls last year using random numbers at the end of the URL. In Google Webmaster Tools, most of those URLs were reported as not found. When we checked where the links came from, Google listed some of our own URLs, but those pages didn't have any links to the URLs Google was trying to fetch. After 2 or 3 months those crawls stopped. We never found out where Google got those URLs...

evolvingSEO

Hi There

I'm an associate here at Moz, and have asked the other associates if they might know the answer, as this one's a little outside of my experience. Please follow up and let us know if you don't hear from anyone.

Thanks!

-Dan

sitestrux

Awesome, thanks for looking into it. We've gotten nowhere with any kind of answer.

randfish

I agree with Federico. I've seen Google go fishing with URL parameters (?param=xyz) and I've seen it with AJAX and hashbangs as well. How far they take this and when they choose to apply it doesn't seem to follow a consistent pattern. You can see some folks on StackExchange discussing this, too: http://webmasters.stackexchange.com/questions/25560/does-the-google-crawler-really-guess-url-patterns-and-index-pages-that-were-neve

Carson-Ward

Google seems to do this only for parameters that it has decided "changes, re-orders, or narrows content." It may also crawl things that look like URLs in JavaScript even when they're part of a function, but it doesn't seem like that's what's happening in this case.

Depending on the setup of the site, you can either manually configure the variable in WMT (don't do this if the parameter is material), write a clever robots.txt rule (e.g. to block anything after a number of digits after the parameter), or (the best solution) re-work the system to generate URLs that don't rely on parameters.

I'm not sure I understand why the server is rendering a page if the URL isn't supposed to exist. Depending on your server config, you may also be able to return a 404 and make a rule for which (valid) pages to render. From there you can just ignore the 404 errors until Google figures it out.
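To make the "rule for which (valid) pages to render" idea concrete, here is a minimal Python sketch. It assumes the snapshot server can see the escaped fragment as a string and knows how many pages the table has. The function name `snapshot_status` and the `total_pages` argument are hypothetical, not part of any product mentioned in this thread.

```python
import re
from urllib.parse import parse_qs

# Assumption: the pagination variable is named as in the example URLs above.
PAGE_PARAM = "view_3_page"


def snapshot_status(fragment: str, total_pages: int) -> int:
    """Return 200 if the fragment names a page that exists, otherwise 404.

    `fragment` is the part after "#!" (or the escaped-fragment value),
    e.g. "home/?view_3_page=2".
    """
    # Everything after the first "?" is treated as the query portion.
    _, _, query = fragment.partition("?")
    values = parse_qs(query, keep_blank_values=True).get(PAGE_PARAM)
    if values is None:
        # No pagination variable at all: this is the default first page.
        return 200
    raw = values[-1]
    # Accept plain ASCII digits only.
    if re.fullmatch(r"[0-9]+", raw) is None:
        return 404
    # Reject over-long digit strings before converting them. This also
    # rejects zero-padded values such as "01" when total_pages < 10.
    if len(raw) > len(str(total_pages)):
        return 404
    return 200 if 1 <= int(raw) <= total_pages else 404
```

With a five-page table, `snapshot_status("home/?view_3_page=2", 5)` gives 200 and `snapshot_status("home/?view_3_page=1000251", 5)` gives 404. The check runs before any rendering, so a bogus page number costs a string comparison rather than a full snapshot.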

I think that's the best I can do without seeing the site.

richardbaxter

100% of my experience in this situation is from using Angular.js with Phantom rendering the snapshot, so I tend to use the meta fragment directive in the page header (because I don't use #!'s). With that said I do think my debugging / test experience might be useful, so I'll splurge it out here just in case.

For the record I don't think this is a simple case of Google fabricating URLs - I think it's worth making sure there's not something happening in-between. The real reason tends to come out in testing.

Have you looked in your log files at requests specifically containing your ?view_3_page= parameter? I'd get a sample of Googlebot requests and look for that parameter. Every time I've come across this problem so far, the cause has been the framework not responding well to the parameter ordering in the URL when combined with the escaped_fragment= parameter.
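As a rough illustration of that log check, the Python sketch below tallies the view_3_page values that appear in Googlebot requests. It assumes a text access log with one request per line and the user agent on the same line. The function name `googlebot_page_values` is made up for this example. Matching on the string "Googlebot" does not verify that the request really came from Google.

```python
import re
from collections import Counter
from typing import Iterable
from urllib.parse import unquote

# Assumption: the pagination variable is named as in the example URLs.
PAGE_RE = re.compile(r"view_3_page=(\d+)")


def googlebot_page_values(log_lines: Iterable[str]) -> Counter:
    """Count the page numbers requested by clients identifying as Googlebot."""
    counts = Counter()
    for line in log_lines:
        if "Googlebot" not in line:
            continue
        # Decode percent-escapes so "view_3_page%3D2" is matched too.
        match = PAGE_RE.search(unquote(line))
        if match:
            # Only the first occurrence is used, which in common log
            # layouts is the request line rather than the referrer.
            counts[int(match.group(1))] += 1
    return counts


# Usage sketch (the file name is hypothetical):
# with open("access.log", encoding="utf-8", errors="replace") as handle:
#     for page, hits in sorted(googlebot_page_values(handle).items()):
#         print(page, hits)
```

If the tally shows only values your snapshots actually link to plus a scattering of very large numbers, that points toward guessing. If it shows a pattern, such as values that follow from a malformed "next" link, the snapshot markup is the more likely source.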

Sometimes, when the request is made by Google with escaped_fragment= in the request URI, you have to be certain that you understand the behavior that particular request URL is likely to trigger.

So when the initial request: yourdomain.com/#!home/?view_3_page=1 is made,

What does: yourdomain.com/?escaped_fragment=home/?view_3_page=1 do?

Side note - it could be: yourdomain.com/?escaped_fragment=home/&view_3_page=1 but as Carson said, without looking at how your site behaves in this situation it's difficult to know, so I'll just put the different outcome options in here in case one of them is close.
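For testing, it can help to generate the crawler-style URL from the hashbang URL yourself and compare it with what appears in the logs. The Python sketch below follows Google's published AJAX crawling scheme as I understand it. In that scheme the parameter is spelled `_escaped_fragment_` with surrounding underscores, and only control characters, space, `#`, `%`, `&`, `+` and non-ASCII bytes are escaped. Treat the exact character set as an assumption to verify against the specification. The function name `to_escaped_fragment_url` is made up for this example.

```python
from urllib.parse import quote

# Assumption: printable ASCII punctuation that stays unescaped. Space, "#",
# "%", "&" and "+" are deliberately absent, so quote() will escape them.
UNESCAPED = "!\"$'()*,/:;<=>?@[\\]^`{|}"


def to_escaped_fragment_url(pretty_url: str) -> str:
    """Map a "#!" URL to the URL a crawler would request for its snapshot."""
    base, marker, fragment = pretty_url.partition("#!")
    if not marker:
        # No hashbang present: nothing to translate.
        return pretty_url
    # Append to an existing query string if the base URL already has one.
    joiner = "&" if "?" in base else "?"
    return f"{base}{joiner}_escaped_fragment_={quote(fragment, safe=UNESCAPED)}"
```

Under these assumptions the example pretty URL maps to the first form above, with the `?` inside the fragment left unescaped. A framework that splits the query string on `?` or `&` before reading the fragment value could therefore see `view_3_page` in a different position than it expects.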

So, check your server logs and look at how the snapshot request URI is formed. Then check those pages out in a browser, making sure (obviously) that you're responding with the right server header response and that the page code makes sense.

What tends to happen (if you've got this far) is that in unusual circumstances (e.g. a chain of parameters with the escaped fragment pre-fetch directive bolted in) you might be serving malformed versions of what you'd hoped would be your perfectly constructed HTML snapshot.

If that's the case, I would spend a lot of time evaluating what Google sees and, therefore, what it attempts to crawl. You might find that if you're serving something a bit strange, Google might be discovering URLs you didn't know you were capable of generating. That should give you enough scope to detect a problem and get a change request assigned to fix it.

If not, then I suppose Google really is making these URLs up - but honestly, I spend a lot of time trawling through log files and it's been a long time since I haven't been able to find an explanation from the actual code.

As a side note: I'd try to avoid hashbangs in the medium / long term. As soon as they're there, you're committed to a lifetime of supporting them. A much more elegant solution is to use PushState (or $location if you're using Angular) but (obviously) continue to serve the snapshot trigger via the meta fragment directive. I'm sure you're quite tired of being told to get rid of hashbangs, though.

Hope that helps?

Richard Baxter
SEOgadget.com

sitestrux

Thanks for the great replies all. Just to clarify, this is the page we're referencing:

http://www.knackhq.com/business-directory-user-demo/?escaped_fragment=

You can see the one pagination var "next" that points here:

http://www.knackhq.com/business-directory-user-demo/?escaped_fragment=home/?view_3_page=2

As you can see this is pretty simple. There's only one potential variable (the "prev" and "next" links) for introducing these huge numbers and that's pretty limited. We tested the Google URLs up and down the app and haven't seen anything that would send it fishing for larger numbers. But Google keeps hammering us with:

GET /business-directory-user-demo/?escaped_fragment=home/?view_3_page=1000251

For now we're trying to respond to those with 404s and hope they eventually die.

Unfortunately we can't avoid hashbangs.
