The Moz Q&A Forum

    I can't crawl the archive of this website with Screaming Frog

    Technical SEO Issues
    • DirkC
      DirkC last edited by

      Did you set up any special filters? I just tried to crawl the site and it seems to work just fine.

      • LoganRay
        LoganRay last edited by

        Try going to File > Default Config > Clear Default Configuration. This happens to me sometimes as well, since I've edited settings over time. Clearing them out and going back to the default settings is usually quicker than clicking through the settings to identify which one is causing the problem.

        • gjergjshala
          gjergjshala @DirkC last edited by

          Hi Dirk

          Thanks a lot for replying. The issue is that Screaming Frog crawls the archive pages (like these examples) but won't crawl the articles that are listed inside these pages.

          The hierarchy of the site goes like this:

          Homepage
           - Categories (with about 20 articles in them)
            - Archive of that category (with all the remaining articles, which in this case means thousands since they are a news website)

          Screaming Frog will crawl the homepage and the categories, but once it reaches the archive it won't crawl the articles inside it. Instead, it only crawls the pagination pages of that archive.

          Thanks again.

          • gjergjshala
            gjergjshala @LoganRay last edited by

            Hi Logan

            I've tried going back to the default configuration but it didn't help. Still, I don't believe Screaming Frog is to blame. I think there is something wrong with the way the site has been developed (they are using a custom CMS), but I can't find the reason why this is happening. As soon as I find the cause, I can ask the people who developed the site to make the necessary changes.

            Thanks a lot.

            • DirkC
              DirkC @gjergjshala last edited by

              I think Screaming Frog is going nuts on the formkey value in the URL, which changes constantly as you move between pages.

              Could you modify the spider settings to respect noindex and respect canonical? It looks like this solves the issue.

              Alternatively, you could rewrite the URL to ignore the formkey (remove the parameter).

              Dirk

              • gjergjshala
                gjergjshala @DirkC last edited by

                I've tried changing the settings to respect noindex and canonical. It stops crawling the archive pages, but it still won't crawl the links inside those pages. (I've added NOINDEX, FOLLOW to all archive pagination pages.)

                What do you mean by rewriting the URL to ignore the formkey? How do you think it should be set up?

                Gjergji

                • DirkC
                  DirkC @gjergjshala last edited by

                  In the 'URL Rewriting' menu you can simply enter the parameters the crawler should ignore (like date, formkey, ...). I removed the formkey parameter and checked the archive pages in Screaming Frog.

                  It is clearly able to detect all the internal links on the page - so will crawl them.
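
For anyone who wants to see what removing a parameter does to a URL, here is a minimal Python sketch. It is not how Screaming Frog implements the feature; `strip_params` is a hypothetical helper written only for illustration, applied to an archive URL of the kind found on this site.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def strip_params(url, drop=("formkey",)):
    """Return url with the query parameters named in `drop` removed.

    Hypothetical helper: it only mimics the effect of a
    "remove parameters" rule, not the crawler's own code.
    """
    parts = urlsplit(url)
    # Keep every key/value pair whose key is not in the drop list.
    kept = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
            if k not in drop]
    return urlunsplit(parts._replace(query=urlencode(kept)))

if __name__ == "__main__":
    url = ("http://zeri.info/arkiva/"
           "?formkey=7301c1be1634ffedb1c3780e5063819b6ec19157&acid=aktuale")
    print(strip_params(url))  # http://zeri.info/arkiva/?acid=aktuale
```

With the session-like formkey gone, two requests for the same archive page collapse to one URL, so the crawler no longer treats them as different pages.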

                  How are you certain that the pages below are not crawled? Could you give a specific example of a page that should be crawled but isn't?

                  Dirk

                  • gjergjshala
                    gjergjshala @DirkC last edited by

                    Dirk, thanks a lot.

                    I just added "formkey" as a parameter to be removed and it seems to be working. I've crawled 1k pages so far and I'm going to monitor how it goes.

                    The site has more than 400k pages, so crawling them all will take time (and I'm going to have to crawl each section so I can create sitemaps for them).

                    Thanks again
                    Gjergj

                    • DirkC
                      DirkC @gjergjshala last edited by

                      Great that it worked. Just a small note: if Screaming Frog is getting confused by all these parameters, it could well be that Googlebot (while more sophisticated) is having the same issues. You could consider excluding the formkey parameter in Search Console (Crawl > URL parameters).

                      Dirk

                      • gjergjshala
                        gjergjshala @DirkC last edited by

                        I can't make it work. After removing the 'formkey' parameter I was able to crawl 1.7k pages and it stopped there. The site has more than 400k pages, so something must be wrong 😞

                        I want to crawl only the root domain without subdomains, and all I can crawl is around 2k pages.

                        Do you have any idea what might be happening?

                        • DirkC
                          DirkC @gjergjshala last edited by

                          I think the issue comes from the way you handle the pagination and/or the way you render archived pages.
                          Example: the first archive page of Aktuale

                          http://zeri.info/arkiva/?formkey=7301c1be1634ffedb1c3780e5063819b6ec19157&acid=aktuale

                          Clicking on page 2 adds the date

                          http://zeri.info/arkiva/?from=2016-06-01&until=2016-06-16&acid=aktuale&formkey=cc0a40ca389eb511b1369a9aa9da915826d6ca44&faqe=2#archive-results => I assume that you're only listing the articles published from June 1st till today.
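
To make the difference between the two URLs above easier to see, here is a small Python sketch that compares their query strings. `query_params` and `changed_params` are hypothetical helper names used only for this illustration.

```python
from urllib.parse import urlsplit, parse_qs

def query_params(url):
    """Map each query parameter to its first value (fragment is ignored)."""
    return {k: v[0] for k, v in parse_qs(urlsplit(url).query).items()}

def changed_params(url_a, url_b):
    """Names of parameters that are missing or different between two URLs."""
    a, b = query_params(url_a), query_params(url_b)
    return sorted(k for k in a.keys() | b.keys() if a.get(k) != b.get(k))

if __name__ == "__main__":
    page1 = ("http://zeri.info/arkiva/"
             "?formkey=7301c1be1634ffedb1c3780e5063819b6ec19157&acid=aktuale")
    page2 = ("http://zeri.info/arkiva/?from=2016-06-01&until=2016-06-16"
             "&acid=aktuale&formkey=cc0a40ca389eb511b1369a9aa9da915826d6ca44"
             "&faqe=2#archive-results")
    print(changed_params(page1, page2))  # ['faqe', 'formkey', 'from', 'until']
```

Besides the page number (faqe) and the ever-changing formkey, page 2 introduces the from/until date window, which is what limits the listing.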

                          If I check all the different section & the number of articles listed in each archive I get approx. 1200 pages - add some additional pages linked on these pages and you get to the 2K pages you mentioned.

                          There seems to be no way to reach the previously published content without executing a search, which Screaming Frog can't do. It's quite possible that this is causing issues for Googlebot as well, so I would try to fix this.

                          If you really want to crawl the full site in the meantime, add another rule in URL rewriting, this time selecting 'regex replace':

                          regex: from=2016-06-01
                          replace: from=2010-01-01 (use the earliest date of publishing)

                          This way, the crawler will request the URL http://zeri.info/arkiva/?from=2010-01-01&until=2016-06-16&acid=kultura&formkey=5932742bd5dd77799524ba31b94928810908fc07&faqe=2 rather than the original one, listing all the articles instead of only the June articles.
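
The effect of the regex replace rule above can be sketched in Python as follows. This is only an illustration of the rewrite, not Screaming Frog's implementation: `widen_from_date` is a hypothetical helper, the pattern is generalised to match any YYYY-MM-DD value rather than the literal 2016-06-01, and 2010-01-01 is a placeholder for the site's real earliest publishing date.

```python
import re

# Matches a "from=YYYY-MM-DD" parameter that directly follows "?" or "&".
FROM_DATE = re.compile(r"(?<=[?&])from=\d{4}-\d{2}-\d{2}")

def widen_from_date(url, earliest="2010-01-01"):
    """Rewrite the from= date so the archive listing starts at `earliest`.

    `earliest` is a placeholder; substitute the actual first publishing date.
    """
    return FROM_DATE.sub(f"from={earliest}", url)

if __name__ == "__main__":
    url = ("http://zeri.info/arkiva/?from=2016-06-01&until=2016-06-16"
           "&acid=kultura&formkey=5932742bd5dd77799524ba31b94928810908fc07"
           "&faqe=2")
    print(widen_from_date(url))
```

URLs without a from= parameter pass through unchanged, so the rule only touches the paginated archive pages.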

                          Hope this helps.

                          Dirk
