The Moz Q&A Forum

    2.3 million 404s in GWT - learn to live with 'em?

    Intermediate & Advanced SEO
    • ufmedia

      So I’m working on optimizing a directory site.  Total size: 12.5 million pages in the XML sitemap.  This is orders of magnitude larger than any site I’ve ever worked on – heck, every other site I’ve ever worked on combined would be a rounding error compared to this.

      Before I was hired, the company brought in an outside consultant to iron out some of the technical issues on the site.  To his credit, he was worth the money: indexation and organic Google traffic have steadily increased over the last six months.  However, some issues remain.  The company has access to a quality (i.e. paid) source of data for directory listing pages, but the last time the data was refreshed some months back, it produced 1.8 million 404s in Google Webmaster Tools (GWT).  That number has kept climbing since; GWT now reports 2.3 million 404s.

      Based on what I’ve been able to determine, links tied to the data feed generally break for one of two reasons: the page just doesn’t exist anymore (i.e. it wasn’t found in the data refresh, so the page was simply deleted), or the URL had to change due to some technical issue (the page still exists, just under a different link).  With other sites I’ve worked on, 404s aren’t that big a deal: set up a 301 redirect in htaccess and problem solved.  In this instance, setting up that many 301 redirects, even if it could somehow be automated, just isn’t an option due to the potential bloat in the htaccess file.
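      As a rough illustration of that split, here is a minimal Python sketch that compares the feed before and after a refresh.  It assumes each listing carries a stable ID in the feed, which may not be true of the real data; the function name, IDs and URLs are all made up for the example.

```python
def classify_changes(old, new):
    """Compare two feed snapshots, each a dict of listing ID -> URL path.

    Returns (removed, moved):
      removed - old URLs whose listing is no longer in the feed
      moved   - dict of old URL -> new URL for listings that still exist
    """
    removed = []
    moved = {}
    for listing_id, old_url in old.items():
        new_url = new.get(listing_id)
        if new_url is None:
            removed.append(old_url)       # page deleted, nothing to redirect to
        elif new_url != old_url:
            moved[old_url] = new_url      # page still exists under a new link
    return sorted(removed), moved


# Hypothetical snapshots before and after a data refresh.
old_feed = {
    "101": "/listing/acme-plumbing",
    "102": "/listing/bobs-diner",
    "103": "/listing/city-gym",
}
new_feed = {
    "101": "/listing/acme-plumbing",
    "103": "/listing/city-gym-downtown",
}

removed, moved = classify_changes(old_feed, new_feed)
print(removed)  # ['/listing/bobs-diner']
print(moved)    # {'/listing/city-gym': '/listing/city-gym-downtown'}
```

      Only the `moved` URLs have a sensible 301 target; the `removed` ones are the pages that are genuinely gone.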

      Based on what I’ve read here and here, 404s in and of themselves don’t really hurt the site indexation or ranking.  And the more I consider it, the really big sites – the Amazons and eBays of the world – have to contend with broken links all the time due to product pages coming and going.  Bottom line, it looks like if we really want to refresh the data on the site on a regular basis – and I believe that is priority one if we want the bot to come back more frequently – we’ll just have to put up with broken links on the site on a more regular basis.

      So here’s where my thought process is leading:

      • Go ahead and refresh the data.  Make sure the XML sitemaps are refreshed as well – hopefully this will help the site stay current in the index.
      • Keep an eye on broken links in GWT.  Implement 301s for really important pages (i.e. content-rich stuff that is really mission-critical).  Otherwise, just learn to live with a certain number of 404s being reported in GWT on more or less an ongoing basis.
      • Watch the overall trend of 404s in GWT.  At least make sure they don’t increase.  Hopefully, if we can make sure that the sitemap is updated when we refresh the data, the 404s reported will decrease over time.
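      On the sitemap side, the sitemaps.org protocol caps each sitemap file at 50,000 URLs, so 12.5 million pages works out to at least 250 files plus a sitemap index.  Below is a minimal sketch of regenerating them from the list of live URLs after a refresh.  The base URL, the file names and the `build_all` helper are assumptions for illustration, not how the site actually does it.

```python
from xml.sax.saxutils import escape

MAX_URLS_PER_SITEMAP = 50_000  # per-file limit in the sitemaps.org protocol
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
XML_HEADER = '<?xml version="1.0" encoding="UTF-8"?>'


def chunk(urls, size=MAX_URLS_PER_SITEMAP):
    """Yield successive lists of at most `size` URLs."""
    for start in range(0, len(urls), size):
        yield urls[start:start + size]


def build_sitemap(urls):
    """Return the XML text for a single sitemap file."""
    entries = "".join(f"<url><loc>{escape(u)}</loc></url>" for u in urls)
    return f'{XML_HEADER}<urlset xmlns="{SITEMAP_NS}">{entries}</urlset>'


def build_index(sitemap_urls):
    """Return the XML text for the sitemap index listing every sitemap file."""
    entries = "".join(
        f"<sitemap><loc>{escape(u)}</loc></sitemap>" for u in sitemap_urls
    )
    return f'{XML_HEADER}<sitemapindex xmlns="{SITEMAP_NS}">{entries}</sitemapindex>'


def build_all(live_urls, base="https://www.example.com/sitemaps/"):
    """Split the live URLs into sitemap files and build the index for them."""
    files = {}
    for n, part in enumerate(chunk(live_urls), start=1):
        files[f"sitemap-{n}.xml"] = build_sitemap(part)
    index = build_index([base + name for name in files])
    return files, index


# Hypothetical run with 120,000 live URLs.
live_urls = [f"https://www.example.com/listing/{i}" for i in range(120_000)]
files, index = build_all(live_urls)
print(len(files))  # 3
```

      Because the files are rebuilt only from URLs that exist after the refresh, deleted pages drop out of the sitemap instead of lingering as 404s.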

      We do have an issue with the site creating some weird pages with content that lives within tabs on specific pages.  Once we can clamp down on those and a few other technical issues, I think keeping the data refreshed should help with our indexation and crawl rates.

      Thoughts?  If you think I’m off base, please set me straight.  🙂

      • TomVolpe

        Hi,

        Sounds like you’ve taken on a massive job with 12.5 million pages, but I think you can implement a simple fix to get things started.

        You’re right to think about that sitemap.  Make sure it’s being dynamically updated as the data refreshes; otherwise it will be responsible for a lot of your 404s.

        I understand you don’t want to add 2.3 million separate redirects to your htaccess, so what about a simple rule - if the request starts with ^/listing/ (one of your directory pages), is not a file and is not a dir, then redirect back to the homepage. Something like this:

        # Enable the rewrite engine (skip this if it is already on)
        RewriteEngine On
        # Does the request start with /listing/ (or whatever structure you are using)?
        RewriteCond %{REQUEST_URI} ^/listing/ [NC]
        # Is it NOT a file and NOT a directory?
        RewriteCond %{REQUEST_FILENAME} !-f
        RewriteCond %{REQUEST_FILENAME} !-d
        # All true? Redirect to the homepage
        RewriteRule .* / [L,R=301]

        This way you can target the URL structure of the pages which tend to turn into 404s.  Any 404s outside of that rule will still serve a 404 code and show your 404 page, so you can fix those problems manually, while the pages which tend to disappear are all redirected back to the homepage if they’re not found.
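        To make the outcomes concrete, here is a small Python model of the decision the rule makes.  It is only a sketch: `respond` is a made-up helper, and the `existing` set stands in for the file and directory checks.  If the listing pages are generated dynamically rather than stored as files, both checks would likely pass for live pages too, so the existence test would have to move into the application.

```python
import re

# Mirrors the ^/listing/ condition with the case-insensitive [NC] flag.
LISTING_PREFIX = re.compile(r"^/listing/", re.IGNORECASE)


def respond(path, existing_paths):
    """Return (status, redirect_target) for a request path.

    `existing_paths` is a stand-in for the -f / -d filesystem checks.
    """
    if path in existing_paths:
        return 200, None      # the page exists, so the rule does not fire
    if LISTING_PREFIX.search(path):
        return 301, "/"       # missing listing page: redirect to the homepage
    return 404, None          # anything else missing still returns a 404


# Hypothetical set of pages that currently exist.
existing = {"/", "/about", "/listing/acme-plumbing"}

for path in ["/listing/acme-plumbing", "/listing/bobs-diner", "/LISTING/old", "/abuot"]:
    print(path, respond(path, existing))
```

        Only the missing `/listing/` URLs are redirected; the mistyped `/abuot` still gets a plain 404.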

        You could still implement your 301s for important pages or simply recreate the page if it’s worth doing so, but you will have dealt with a large chunk of your non-existent pages.

        I think it’s a big job and those missing pages are only part of it, but it should help you to sift through all of the data to get to the important bits: you can mark a lot of URLs as fixed and start giving your attention to the important pages which need some work.

        Hope that helps,

        Tom

        • ufmedia

          I was actually thinking about some type of wildcard rule in htaccess.  This might actually do the trick!  Thanks for the response!
