The Moz Q&A Forum

    I can't crawl the archive of this website with Screaming Frog

    Technical SEO Issues
    • gjergjshala
      gjergjshala last edited by

      Hi

      I'm trying to crawl this website (http://zeri.info/) with Screaming Frog, but because of some technical issue with the site (I can't find what is causing it) I'm only able to crawl the first page of each category (e.g. http://zeri.info/sport/). The crawler then goes on to crawl each page of the archive (hundreds of thousands of pages), but it won't crawl the links inside those pages.

      Thanks a lot!

      • DirkC
        DirkC last edited by

        Did you set any special filters? I just tried to crawl the site and it seems to work fine.

        • LoganRay
          LoganRay last edited by

          Try going to File > Default Config > Clear Default Configuration. This happens to me sometimes as well, since I've edited settings over time. Clearing the configuration and going back to the default settings is usually quicker than clicking through the settings to identify which one is causing the problem.

          • gjergjshala
            gjergjshala @DirkC last edited by

            Hi Dirk

            Thanks a lot for replying back. The issue is that Screaming Frog is crawling the archive pages (like these examples) but it won't crawl the articles that are listed inside these pages.

            The hierarchy of the site goes like this:

            Homepage
             - Categories (with about 20 articles in them)
              - Archive of that category (with all the remaining articles, which in this case means thousands since they are a news website)

            Screaming Frog will crawl the homepage and the categories, but once it reaches the archive it won't crawl the articles listed inside it. Instead, it only crawls the pagination pages of that archive.

            Thanks again.

            • gjergjshala
              gjergjshala @LoganRay last edited by

              Hi Logan

              I've tried going back to the default configuration but it didn't help. Still, I don't believe Screaming Frog is to blame. I think there is something wrong with the way the site has been developed (they are using a custom CMS), but I can't find the reason why this is happening. As soon as I find the cause, I can ask the people who developed the site to make the necessary changes.

              Thanks a lot.

              • DirkC
                DirkC @gjergjshala last edited by

                I think Screaming Frog is going nuts on the formkey value in the URL, which changes constantly as you move between pages.

                Could you modify the spider settings to respect noindex and respect canonical? That looks like it solves the issue.

                Alternatively, you could rewrite the URL to ignore the formkey (remove the parameter).

                Dirk

                • gjergjshala
                  gjergjshala @DirkC last edited by

                  I've tried changing the settings to respect noindex and canonical. It stops crawling the archive pages, but it still won't crawl the links inside those pages. (I've added NOINDEX, FOLLOW to all archive pagination pages.)

                  What do you mean by rewriting the URL to ignore the formkey? How do you think it should be set up?

                  Gjergji

                  • DirkC
                    DirkC @gjergjshala last edited by

                    In the 'URL Rewriting' menu you can simply enter the parameters the crawler should ignore (like date, formkey, ...). I removed the formkey parameter and checked the archive pages in Screaming Frog.

                    It is clearly able to detect all the internal links on the page, so it will crawl them.
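
To see why a constantly changing formkey confuses a crawler, here is a minimal Python sketch of the "remove parameter" idea. It is not Screaming Frog's implementation, only an illustration; the helper name `strip_params` and the two short formkey values are made up for the example.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode


def strip_params(url, ignored=("formkey",)):
    """Return url with the ignored query parameters removed (hypothetical helper)."""
    parts = urlsplit(url)
    # Keep every query pair whose name is not in the ignore list.
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in ignored
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


# Two URLs for the same archive page; the formkey values are invented.
url_a = "http://zeri.info/arkiva/?formkey=aaa111&acid=aktuale"
url_b = "http://zeri.info/arkiva/?formkey=bbb222&acid=aktuale"

print(strip_params(url_a))                         # http://zeri.info/arkiva/?acid=aktuale
print(strip_params(url_a) == strip_params(url_b))  # True
```

Without the rewrite, the two URLs look like different pages to a crawler; with it, they collapse into one.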

                    How are you certain that the pages below are not crawled? Could you give a specific example of a page that should be crawled but isn't?

                    Dirk

                    • gjergjshala
                      gjergjshala @DirkC last edited by

                      Dirk, thanks a lot.

                      I just added "formkey" as a parameter to be removed and it seems to be working. I've crawled 1k pages so far and I'm going to monitor how it goes.

                      The site has more than 400k pages, so crawling them all will take time (and I'm going to have to crawl each section separately so I can create sitemaps for them).

                      Thanks again
                      Gjergj

                      • DirkC
                        DirkC @gjergjshala last edited by

                        Great that it worked. Just a small note: if Screaming Frog is getting confused by all these parameters, it could well be that Googlebot (while more sophisticated) is having the same issues. You could consider excluding the formkey parameter in Search Console (Crawl > URL parameters).

                        Dirk

                        • gjergjshala
                          gjergjshala @DirkC last edited by

                          I can't make it work. After removing the 'formkey' parameter I was able to crawl 1.7k pages and it stopped there. The site has more than 400k pages, so something must be wrong 😞

                          I want to crawl only the root domain, without subdomains, and all I can crawl is around 2k pages.

                          Do you have any idea what might be happening?

                          • DirkC
                            DirkC @gjergjshala last edited by

                            I think the issue comes from the way you handle the pagination and/or the way you render archived pages.
                            Example: First archive page of Aktuale

                            http://zeri.info/arkiva/?formkey=7301c1be1634ffedb1c3780e5063819b6ec19157&acid=aktuale

                            Clicking on page 2 adds the date

                            http://zeri.info/arkiva/?from=2016-06-01&until=2016-06-16&acid=aktuale&formkey=cc0a40ca389eb511b1369a9aa9da915826d6ca44&faqe=2#archive-results => I assume that you're only listing the articles published from June 1st till today.
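
The difference between the two archive URLs above can be checked mechanically. This is a small Python sketch using only the standard library; the helper names `query_params` and `diff_params` are made up for the illustration.

```python
from urllib.parse import urlsplit, parse_qsl

# The two archive URLs quoted above.
page1 = "http://zeri.info/arkiva/?formkey=7301c1be1634ffedb1c3780e5063819b6ec19157&acid=aktuale"
page2 = (
    "http://zeri.info/arkiva/?from=2016-06-01&until=2016-06-16&acid=aktuale"
    "&formkey=cc0a40ca389eb511b1369a9aa9da915826d6ca44&faqe=2#archive-results"
)


def query_params(url):
    """Return the query string of url as a dict (hypothetical helper)."""
    return dict(parse_qsl(urlsplit(url).query))


def diff_params(first, second):
    """List parameters added in second, and shared ones whose value changed."""
    a, b = query_params(first), query_params(second)
    added = sorted(set(b) - set(a))
    changed = sorted(key for key in set(a) & set(b) if a[key] != b[key])
    return added, changed


added, changed = diff_params(page1, page2)
print("added:", added)      # added: ['faqe', 'from', 'until']
print("changed:", changed)  # changed: ['formkey']
```

So page 2 adds a date window (from, until) and a page number (faqe), and the formkey is different again.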

                            If I check all the different sections and the number of articles listed in each archive, I get approx. 1200 pages. Add some additional pages linked from these pages and you get to the 2k pages you mentioned.

                            There seems to be no way to reach previously published content without executing a search, which Screaming Frog can't do. It's quite possible that this is causing issues for Googlebot as well, so I would try to fix this.

                            If you really want to crawl the full site in the meantime, add another rule in URL rewriting, this time selecting 'regex replace':

                            regex: from=2016-06-01
                            replace: from=2010-01-01 (replace with the earliest date of publishing)

                            This way, the system will call the URL http://zeri.info/arkiva/?from=2010-01-01&until=2016-06-16&acid=kultura&formkey=5932742bd5dd77799524ba31b94928810908fc07&faqe=2 rather than the original one, listing all the articles instead of only the June articles.
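
The effect of that regex replace rule can be sketched in Python with `re.sub`. This only imitates the substitution and is not how Screaming Frog applies it internally. The date 2010-01-01 is the placeholder from the post, not the site's known first publishing date. The second, more general pattern is my own assumption, and the example URL leaves out the formkey for brevity.

```python
import re

# Rule from the post: a fixed start date is swapped for an earlier one.
FIXED_RULE = re.compile(r"from=2016-06-01")
# Assumption, not from the post: match any YYYY-MM-DD start date.
ANY_DATE_RULE = re.compile(r"from=\d{4}-\d{2}-\d{2}")
REPLACEMENT = "from=2010-01-01"  # placeholder for the earliest publishing date


def rewrite(url, rule=FIXED_RULE):
    """Apply one regex-replace rule to a URL (hypothetical helper)."""
    return rule.sub(REPLACEMENT, url)


url = "http://zeri.info/arkiva/?from=2016-06-01&until=2016-06-16&acid=kultura&faqe=2"
print(rewrite(url))
# http://zeri.info/arkiva/?from=2010-01-01&until=2016-06-16&acid=kultura&faqe=2
```

Note that the fixed rule only matches while the site keeps generating from=2016-06-01. If the start date changes, for example at the start of a new month, the rule would need updating, which is what the more general pattern tries to avoid.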

                            Hope this helps.

                            Dirk
