The Moz Q&A Forum


    Log files vs. GWT: major discrepancy in number of pages crawled

    Technical SEO Issues
    • ufmedia

      Following up on this post, I did a pretty deep dive on our log files using Web Log Explorer. Several things have come to light, but one issue I've spotted is the vast difference between the number of pages Googlebot crawled according to our log files and the number of pages indexed according to GWT (Google Webmaster Tools). Consider:

      • Number of pages crawled per log files: 2,993
      • Crawl frequency (i.e. total number of times those pages were crawled): 61,438
      • Number of pages indexed per GWT: 17,182,818 (yes, that's right - more than 17 million pages)
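
For anyone who wants to reproduce the first two figures without a log analysis tool, here is a minimal sketch. It assumes the server writes the common "combined" access log format and that matching "Googlebot" in the user-agent string is good enough (it can be spoofed, so treat the result as an upper bound). The sample lines, IP address, and the `googlebot_crawl_stats` name are made up for illustration.

```python
import re
from collections import Counter

# Assumed layout: combined log format, with the user agent as the last quoted field.
LOG_LINE = re.compile(
    r'"(?:GET|HEAD) (?P<path>\S+) [^"]*" \d{3} .* "(?P<agent>[^"]*)"$'
)

def googlebot_crawl_stats(lines):
    """Return (unique pages crawled, total crawl hits) for Googlebot requests."""
    hits = Counter()
    for line in lines:
        match = LOG_LINE.search(line.strip())
        # Simplification: trust the user-agent string instead of verifying the IP.
        if match and "Googlebot" in match.group("agent"):
            hits[match.group("path")] += 1
    return len(hits), sum(hits.values())

# Invented sample data, not taken from the site discussed in this thread.
BOT = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
sample = [
    f'66.249.66.1 - - [10/Oct/2014:13:55:36 +0000] "GET /a HTTP/1.1" 200 512 "-" "{BOT}"',
    f'66.249.66.1 - - [10/Oct/2014:13:56:01 +0000] "GET /a HTTP/1.1" 200 512 "-" "{BOT}"',
    f'66.249.66.1 - - [10/Oct/2014:13:57:12 +0000] "GET /b HTTP/1.1" 404 128 "-" "{BOT}"',
    '203.0.113.9 - - [10/Oct/2014:13:58:40 +0000] "GET /a HTTP/1.1" 200 512 "-" "Mozilla/5.0"',
]

unique_pages, total_hits = googlebot_crawl_stats(sample)
print(unique_pages, total_hits)
```

With the numbers in the list above, 61,438 hits over 2,993 pages works out to roughly 20.5 crawls per page, which suggests Googlebot kept revisiting a small set of URLs during the logged period.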

      We have a bunch of XML sitemaps (around 350) that are linked on the main sitemap.xml page; these pages have been crawled fairly frequently, and I think this is where a lot of links have been indexed.  Even so, would that explain why we have relatively few pages crawled according to the logs but so many more indexed by Google?
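
One way to check how many URLs the sitemaps actually declare is to walk the sitemap index and count `<loc>` entries in each child file. The sketch below assumes the files follow the standard sitemaps.org schema and works on XML strings, so fetching the files is left out. The example.com URLs and function names are placeholders.

```python
import xml.etree.ElementTree as ET

# Namespace used by the sitemaps.org protocol.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def child_sitemaps(index_xml):
    """List the child sitemap URLs named in a sitemap index document."""
    root = ET.fromstring(index_xml)
    return [loc.text.strip() for loc in root.findall("sm:sitemap/sm:loc", NS)]

def count_urls(sitemap_xml):
    """Count the page URLs declared in a single sitemap document."""
    root = ET.fromstring(sitemap_xml)
    return len(root.findall("sm:url/sm:loc", NS))

# Placeholder documents for illustration only.
index_doc = """<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>http://example.com/sitemap-1.xml</loc></sitemap>
  <sitemap><loc>http://example.com/sitemap-2.xml</loc></sitemap>
</sitemapindex>"""

sitemap_doc = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.com/page-1</loc></url>
  <url><loc>http://example.com/page-2</loc></url>
</urlset>"""

print(child_sitemaps(index_doc))
print(count_urls(sitemap_doc))
```

Summing `count_urls` over every child sitemap gives a total to compare against what GWT reports as submitted. Keep in mind that Googlebot fetching a sitemap file shows up in the logs as one request, however many URLs that file lists.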

      • danatanseo

        Hi. Interesting question. You had me at "log files." 🙂 So before I give a longer, more detailed answer, I have a follow-up question: Does your site really have 17+ million pages?

          • ufmedia (in reply to @danatanseo)

          Waiting on an answer from our dev team on that now.  In the meantime, here's what I can tell you:

          • Number submitted in XML sitemaps per GWT: 13,882,040 (number of those indexed: 13,204,476, or 95.1%)

          • Total number indexed: 17,182,818

          • Difference (total indexed minus submitted): 3,300,778

          • Number of URLs throwing 404 errors: 2,810,650

          • 2,810,650 / 3,300,778 ≈ 85%
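
The arithmetic behind these figures can be checked with a few lines of Python. The variable names are just labels for the numbers quoted in the list above.

```python
# Figures as reported in GWT, copied from the list above.
submitted = 13_882_040          # URLs submitted in XML sitemaps
submitted_indexed = 13_204_476  # submitted URLs that are indexed
total_indexed = 17_182_818      # total pages indexed
not_found = 2_810_650           # URLs returning 404

difference = total_indexed - submitted
sitemap_index_rate = submitted_indexed / submitted
share_404 = not_found / difference

print(f"{difference:,}")             # 3,300,778
print(f"{sitemap_index_rate:.1%}")   # 95.1%
print(f"{share_404:.1%}")            # 85.2%
```

Note that the 85% only compares the sizes of the two sets. It does not show that the 404 URLs are the same URLs that are indexed but missing from the sitemaps; that would take a URL-level comparison.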

          I'm sure the ridiculous number of 404s on the site (I mentioned them in a separate post here) is at least partially to blame. How much, though? I know that Google says that 404s don't hurt SEO, but the number of 404s being 85% of the difference between the number indexed and the number submitted doesn't look like a coincidence.

          (Apologies if these questions seem a bit dense or elementary.  I've done my share of SEO, but never on a site this massive.)

          • danatanseo (in reply to @ufmedia)

            I'll reserve my answer until you hear from your dev team. A massive site for sure.

            One other question/comment:  just because there are 13 million URLs in your sitemap doesn't necessarily mean there are that many pages on the site. We could be talking about URI versus URL.

            I'm pretty sure you know what I mean by that, but for others reading this who may not know: a URI is the unique Web address of any given resource, while "URL" is often used more loosely to refer to the address of a complete Web page. An example of the difference would be an image. While it certainly has its own unique address on the Web, it most often does not have its very own "page" on a website (although there are certainly exceptions to that).
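
To get a rough feel for how many sitemap entries are pages versus other resources, the addresses can be bucketed by file extension. This is only a crude sketch: it assumes that an address with no recognizable extension, or an HTML one, is a page and that everything else is an asset, which will be wrong for some sites. The `classify` helper and the example.com addresses are invented for illustration.

```python
import mimetypes
from collections import Counter
from urllib.parse import urlsplit

def classify(url):
    """Label a URL 'page' or 'asset' based on its path extension (heuristic)."""
    guessed_type, _ = mimetypes.guess_type(urlsplit(url).path)
    if guessed_type is None or guessed_type == "text/html":
        return "page"
    return "asset"

# Invented addresses, not from the site discussed in this thread.
urls = [
    "http://example.com/about",
    "http://example.com/index.html",
    "http://example.com/images/logo.png",
    "http://example.com/docs/manual.pdf",
]

counts = Counter(classify(url) for url in urls)
print(counts["page"], counts["asset"])
```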

            So, I could see a site having millions of URIs, but very few sites have 17 million+ pages. To put it into perspective, Alibaba and IBM each show roughly 6-7 million pages indexed in Google. Walmart has between 8 and 9 million.

            So where I'm headed in my thinking is major duplicate content issues...but, as I said, I'm going to reserve further comment until you hear back from your developers.
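
If duplicate content is the suspect, one quick test is to normalize the indexed or submitted URLs and see how many collapse onto the same address. The sketch below is one possible approach, not a description of how Google deduplicates. The parameter names in `TRACKING_PARAMS` are hypothetical and would need to be replaced with whatever parameters the site actually appends.

```python
from collections import defaultdict
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical parameters assumed not to change page content.
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "sessionid"}

def normalize(url):
    """Lowercase scheme and host, drop tracking params, trailing slash, and fragment."""
    parts = urlsplit(url)
    query = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query)
        if key.lower() not in TRACKING_PARAMS
    )
    path = parts.path.rstrip("/") or "/"
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, urlencode(query), "")
    )

def group_duplicates(urls):
    """Map each normalized URL to its variants, keeping only groups with 2+ members."""
    groups = defaultdict(list)
    for url in urls:
        groups[normalize(url)].append(url)
    return {key: members for key, members in groups.items() if len(members) > 1}

# Invented addresses for illustration.
urls = [
    "http://Example.com/widgets/",
    "http://example.com/widgets?utm_source=news",
    "http://example.com/widgets?sessionid=abc",
    "http://example.com/gadgets?color=red",
]

dupes = group_duplicates(urls)
for canonical, variants in dupes.items():
    print(canonical, len(variants))
```

If millions of URLs fall into a comparatively small number of groups, that would support the duplicate-content theory; if they don't, the site may really have that many distinct pages.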

            This is a very interesting thread so I want to know more. Cheers!
