The Moz Q&A Forum

    What Sources to use to compile an as comprehensive list of pages indexed in Google?

    Intermediate & Advanced SEO
    • sp80
      sp80 last edited by

      As part of a Panda recovery initiative, we are trying to compile as comprehensive a list as possible of the URLs currently indexed by Google.

      Using the site:domain.com operator, Google reports that approximately 21k pages are indexed. Scraping the results, however, stops after 240 links.

      Are there any other sources we could use to make the list more comprehensive? To be clear, we are not looking for external crawlers like the SEOmoz crawl tool, but for sources that would let us confidently determine which URLs are currently held in the Google index.

      Thank you /Thomas

      • sp80
        sp80 last edited by

        Does anyone have any insight on this? If the answer is simply that there is no better approach than looking at the limited data available through the Google UI, that would be helpful to know as well.

        • Dr-Pete
          Dr-Pete last edited by

          If you're willing to piece together multiple sources, I can definitely give you some starting points:

          (1) First, dropping from 21K pages indexed in Google to 240 definitely seems odd. Are you hitting omitted results? You may have to shut off filtering in the URL (&filter=0).

          (2) You can also divide the site up logically and run "site:" on sub-folders, parameters, etc. Say, for example:

          site:example.com/blog

          site:example.com/shop

          site:example.com/uk

          As long as there's some logical structure, you can use it to break the index request down into smaller chunks. Don't forget to use inurl: for URL parameters (filters, pagination, etc.).

          (3) This takes a while, but split up your XML sitemaps into logical clusters - say, one for major pages, one for top-level topics/categories, one for sub-categories, one for products. That way, you'll get a cleaner count of what kinds of pages are indexed, and you'll know where your gaps are.

          (4) Run a desktop crawler on the site, like Xenu or Screaming Frog (Xenu is free, but PC only and harder to use. Screaming Frog has a yearly fee, but it's an excellent tool). This won't necessarily tell you what Google has indexed, but it will help you see how your site is being crawled and where problems are occurring.
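As a rough illustration of points (1) and (2), the sketch below builds one search URL per site section, with filtering switched off via filter=0 and an optional inurl: term. The function name `site_query_url` and the section list are hypothetical, chosen only for this example; the sections mirror the example.com folders mentioned above.

```python
from urllib.parse import urlencode


def site_query_url(domain, path="", inurl=None, unfiltered=True):
    """Build a Google search URL for a site: query on one section of a site.

    Hypothetical helper for illustration only.
    """
    query = f"site:{domain}{path}"
    if inurl:
        # inurl: narrows the query to URLs containing a given string,
        # e.g. a pagination or filter parameter name.
        query += f" inurl:{inurl}"
    params = {"q": query}
    if unfiltered:
        # filter=0 asks Google to include the "omitted results".
        params["filter"] = "0"
    return "https://www.google.com/search?" + urlencode(params)


# One query per logical section of the site (sections are assumptions).
sections = ["/blog", "/shop", "/uk"]
for section in sections:
    print(site_query_url("example.com", section))
```

Each section gets its own result set, so the per-query result limit applies to a smaller slice of the index rather than to the whole domain.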

          I wrote a mega-post a while back on all the different kinds of duplicate content. Sometimes, just seeing examples can help you catch a problem you might be having. It's at:

          http://www.seomoz.org/blog/duplicate-content-in-a-post-panda-world

          • sp80
            sp80 @Dr-Pete last edited by

            Thanks Pete,

            As always very much appreciate your input.

            1/ We aren't using any parameters, and with filter=0 we get the same results. In the test I just ran, I was only able to pull 350 pages out of 18.5k using the web interface. If anyone has any other thoughts on this, please let me know.

            2/ That is a great idea. Unfortunately, most of our pages live in the root directory to keep the URL slugs short, so this one will not help us.

            3/ Another good idea. As I understand it, this approach helps you see how well your wanted pages are covered in the Google index, but it won't help determine which superfluous pages are currently in the index, unless I misunderstood you?

            4/ We are using ScreamingFrog and I agree it's a fantastic tool. The ScreamingFrog crawl shows no more than 300 pages, which is the index size we are ultimately aiming for.
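For reference, the sitemap clustering discussed in point 3 could be sketched roughly as follows. The grouping rule in `cluster_of` (first path segment, with a catch-all bucket for pages in the root directory) is purely an assumption for illustration; a site with a flat URL structure like the one described here would need its own rule, for example one based on page templates.

```python
from collections import defaultdict
from urllib.parse import urlparse


def cluster_of(url):
    """Assign a URL to a cluster (hypothetical rule, adjust per site)."""
    path = urlparse(url).path.strip("/")
    if not path:
        return "home"
    # Pages directly in the root directory share one bucket;
    # everything else is grouped by its first folder.
    return path.split("/")[0] if "/" in path else "root-pages"


def cluster_urls(urls):
    """Group URLs so that each group can go into its own XML sitemap."""
    clusters = defaultdict(list)
    for url in urls:
        clusters[cluster_of(url)].append(url)
    return dict(clusters)


clusters = cluster_urls([
    "https://example.com/",
    "https://example.com/blog/post-1",
    "https://example.com/about",
])
for name, members in clusters.items():
    print(name, len(members))
```

Submitting one sitemap per cluster makes it possible to compare submitted and indexed counts per page type instead of for the site as a whole.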

            Overall, we are seeing continuous yet small drops in index size using our approach of returning 410 response codes for unwanted pages and dedicated sitemaps to speed up delisting. See http://www.seomoz.org/q/panda-recovery-what-is-the-best-way-to-shrink-your-index-and-make-google-aware
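A dedicated sitemap of this kind is just a standard XML sitemap that lists only the URLs now returning 410. A minimal sketch, assuming the unwanted URLs are already available as a Python list (the function name `build_sitemap` and the sample URL are hypothetical):

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Return a sitemap XML string listing the given URLs."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        entry = ET.SubElement(urlset, "url")
        # ElementTree escapes characters such as "&" when serializing.
        ET.SubElement(entry, "loc").text = url
    return ET.tostring(urlset, encoding="unicode")


# URLs that now return 410 (sample value, for illustration only).
removed = ["https://example.com/old-page"]
print(build_sitemap(removed))
```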

            We are just trying to get a more complete list of what's currently in the index to speed up delisting.
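Once a list of indexed URLs has been pieced together from whatever sources are available, the candidates for delisting are simply the indexed URLs that are not in the wanted set (for example, the roughly 300 pages found by the crawler). A small sketch of that comparison follows; the normalization rule (ignore scheme, host case, and trailing slash) is an assumption and may be too loose or too strict for a given site.

```python
from urllib.parse import urlsplit


def normalize(url):
    """Reduce a URL to a comparison key (assumed rule, adjust as needed)."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/") or "/"
    key = f"{parts.netloc.lower()}{path}"
    if parts.query:
        key += "?" + parts.query
    return key


def superfluous(indexed, wanted):
    """Return keys of URLs that are indexed but not in the wanted set."""
    wanted_keys = {normalize(u) for u in wanted}
    indexed_keys = {normalize(u) for u in indexed}
    return sorted(indexed_keys - wanted_keys)


indexed = ["http://Example.com/a/", "http://example.com/b", "http://example.com/old"]
wanted = ["http://example.com/a", "http://example.com/b/"]
print(superfluous(indexed, wanted))
```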

            Thank you for the reference to the Panda post. I remember reading it before and will give it another read right now.

            One final question: in your experience dealing with Panda penalties, have you seen scenarios where the delisting/penalizing of a site seems to have happened only for a particular ccTLD of Google, or only for the homepage? See http://www.seomoz.org/q/panda-penguin-penalty-not-global-but-only-firea-for-specific-google-cctlds This is what we are currently experiencing, and we are trying to see whether other people have observed something similar.

            Best /Thomas

            • Dr-Pete
              Dr-Pete @sp80 last edited by

              We don't usually take private info in public questions, but if you want to, Private Message me the domain (via my profile). I'm really curious about (1) and I'd love to take a peek.
