    Crawled page count in Search console

    Intermediate & Advanced SEO
    • Bob_van_Biezen

      Hi Guys,

      I'm working on a project (premium-hookahs.nl) where I've stumbled upon a situation I can't explain. Attached is a screenshot of the crawled pages in Search Console.

      History:

      Due to technical difficulties, this webshop didn't always noindex filter pages, resulting in thousands of duplicate pages. In reality this webshop has fewer than 1,000 individual pages. We took the following steps to resolve this:

      1. Noindex the filter pages.
      2. Exclude those filter pages in Search Console and robots.txt.
      3. Canonicalize the filter pages to the relevant category pages.
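The first and third steps can be sanity-checked directly in a filter page's HTML. A minimal sketch in Python; the page markup and URLs are invented for illustration:

```python
# Hypothetical check for the setup described above: given a filter page's
# HTML, confirm it carries a noindex robots meta tag and a canonical link
# pointing at the category page. All names here are made up.
from html.parser import HTMLParser

class RobotsCanonicalParser(HTMLParser):
    """Collects the robots meta content and the canonical href from <head>."""
    def __init__(self):
        super().__init__()
        self.robots = None
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "meta" and a.get("name", "").lower() == "robots":
            self.robots = a.get("content", "")
        if tag == "link" and a.get("rel", "").lower() == "canonical":
            self.canonical = a.get("href")

html = """
<head>
  <meta name="robots" content="noindex, follow">
  <link rel="canonical" href="https://example.com/category/">
</head>
"""
p = RobotsCanonicalParser()
p.feed(html)
print("noindex" in p.robots)   # True
print(p.canonical)             # https://example.com/category/
```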

      This, however, didn't result in Google crawling fewer pages. Although the implementation wasn't always sound (technical problems during updates), I'm sure this setup has been in place for the last two weeks. Personally I expected a drop in crawled pages, but they are still sky high. I can't imagine Google visits this site 40 times a day.

      To complicate the situation:

      We're running an experiment to gain positions on around 250 long-tail searches. A few filters will be indexed (size, color, number of hoses, and flavors) and three of them can be combined. This results in around 250 extra pages. Meta titles, descriptions, H1s and texts are unique as well.

      Questions:

      1. Excluding pages in robots.txt should result in Google not crawling those pages, right?
      2. Is this number of crawled pages normal for a website with around 1,000 unique pages?
      3. What am I missing?

      [screenshot attached]

    • donford

        Hello Bob,

        Here is some food for thought. If you disallow a page in robots.txt, Google will not crawl that page. That does not, however, mean they will remove it from the index if it had previously been crawled. Google simply treats it as inaccessible and moves on. It will take some time, possibly months, before Google finally says, "we have no fresh crawls of page X, it's time to remove it from the index."

        On the other hand, if you specifically allow Google to crawl those pages and serve a noindex tag on them, Google now has a new directive it can act upon immediately.
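Don's distinction can be seen with Python's standard-library robots.txt parser: a Disallow rule only prevents fetching; it says nothing about the index. The rules and URLs below are illustrative, not the shop's actual robots.txt:

```python
# Sketch of the point above using the stdlib parser. A disallowed URL is
# simply not fetchable; a previously indexed URL under /filter/ is no
# longer re-checked, which is not the same as being removed.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /filter/",
])

# The crawler may not fetch the filter pages...
print(rp.can_fetch("Googlebot", "https://example.com/filter/size-small"))  # False
# ...but category pages remain crawlable.
print(rp.can_fetch("Googlebot", "https://example.com/category/"))          # True
```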

        So my evaluation of the situation would be to do 1 of 2 things.

        1. Remove the disallow from robots.txt and allow Google to crawl the pages again. However, this time use noindex, nofollow tags.

        2. Remove the disallow from robots.txt and allow Google to crawl the pages again, but use canonical tags pointing to the main "filter" page to prevent further indexing of the specific filter pages.

        Which option is best depends on the number of URLs being indexed: for a few thousand, canonicals would be my choice; for a few hundred thousand, noindex would make more sense.

        Whichever option you choose, you will have to ensure Google re-crawls, and then allow them time to re-index appropriately. Not a quick fix, but a fix nonetheless.

        My thoughts and I hope it makes sense,

        Don

        • Bob_van_Biezen @donford

          Hello Don,

          Thanks for your advice. What would your advice be if the main goal were the reduction of crawled pages per day? I think we've got the right pages in the index and the old duplicates are mostly deindexed. At this point I'm mostly worried about Google spending its crawl budget on the right pages. Somehow it still crawls 40,000 pages per day while we only have around 1,000 pages that should be crawled. Looking at the current setup (with almost everything excluded through robots.txt) I can't think of pages it could crawl to reach the 40k. And 40 crawls per page per day sounds like way too many for a normal webshop.

          Hope to hear from you!

          • Bob_van_Biezen @donford

            Hi Don,

            Just wanted to add a quick note: your input made me go through the indexation state of the website again, which was worse than I thought it was. I will take some steps to get this resolved, thanks!

            Would love to hear your input about the number of crawled pages.

            Best regards,

            Bob

            • donford

              Hi Bob,

              You can "suggest" a crawl rate to Google by logging into Google Webmaster Tools and adjusting it there.

              As for indexing pages: I looked at your robots.txt and site. It really looks like you need to employ nofollow on some of your internal linking, specifically on the product page filters. That alone could reduce the total number of URLs that the crawler even attempts to look at.

              Additionally, your sitemap http://premium-hookahs.nl/sitemap.xml shows a change frequency of daily, and it should probably be broken out between pages and images, so you end up using two sitemaps: one for images and one for pages. You may also want to review what is in there. Using Screaming Frog (free), the sitemap I made (link) only shows about 100 URLs.
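Splitting the sitemap as suggested could be sketched like this; the URLs and the page/image classification rule are made up for illustration:

```python
# Rough sketch: partition one URL list into a pages sitemap and an
# images sitemap, using the sitemaps.org <urlset> format. The split
# rule (file extension) is an assumption, not the shop's actual logic.
import xml.etree.ElementTree as ET

NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def build_sitemap(urls, changefreq="weekly"):
    """Return a <urlset> sitemap document for the given URLs."""
    root = ET.Element("{%s}urlset" % NS)
    for u in urls:
        entry = ET.SubElement(root, "{%s}url" % NS)
        ET.SubElement(entry, "{%s}loc" % NS).text = u
        ET.SubElement(entry, "{%s}changefreq" % NS).text = changefreq
    return ET.tostring(root, encoding="unicode")

def count_urls(sitemap_xml):
    """Count <url> entries in a sitemap document."""
    return len(ET.fromstring(sitemap_xml).findall("{%s}url" % NS))

all_urls = [
    "https://example.com/category/",
    "https://example.com/product-1",
    "https://example.com/media/photo.jpg",
]
images = [u for u in all_urls if u.endswith((".jpg", ".png"))]
pages = [u for u in all_urls if u not in images]

page_map = build_sitemap(pages)
image_map = build_sitemap(images)
print(count_urls(page_map), count_urls(image_map))  # 2 1
```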

              Hope it helps,

              Don

              • Bob_van_Biezen @donford

                Hi Don,

                You're right about the sitemap, noted it on the to do list!

                Your point about nofollow is interesting. Doesn't excluding in robots.txt give the same result?

                Before we went ahead with the robots.txt, we didn't implement nofollow because we didn't want any link juice to be lost. Since we added the robots.txt rules, I assume this doesn't matter anymore, since Google won't crawl those pages anyway.

                Best regards,

                Bob

                • donford @Bob_van_Biezen

                  Hi Bob,

                  About nofollow vs. blocked: in the end I suppose you get the same result, but in practice it works a little differently. When you nofollow a link, it tells the crawler, as soon as it encounters the link, not to request or follow that link path. When you block it via robots.txt, the crawler still attempts to access the URL, only to find it inaccessible.

                  Imagine if I said go to the parking lot and collect all the loose change in all the unlocked cars. Now imagine how much easier that task would be if all the locked cars had a sign in the window that said "Locked", you could easily ignore the locked cars and go directly to the unlocked ones. Without the sign you would have to physically go check each car to see if it will open.
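The parking-lot picture can be put in code: a toy model in which nofollow links are discarded before the crawler does any work at all, while robots-blocked URLs still cost a rules lookup each time they are discovered. All names here are invented:

```python
# Toy crawler step illustrating the difference described above.
# Entirely illustrative; real crawlers are far more complex.
from urllib.parse import urlparse

BLOCKED_PREFIXES = ("/filter/",)  # mimics robots.txt Disallow rules

def crawl_step(links):
    """links: list of (url, rel). Returns (fetched, robots_checks)."""
    fetched, robots_checks = [], 0
    for url, rel in links:
        if rel == "nofollow":
            continue  # dropped immediately, like the "Locked" sign
        robots_checks += 1  # every followed URL is validated against robots
        if not urlparse(url).path.startswith(BLOCKED_PREFIXES):
            fetched.append(url)
    return fetched, robots_checks

links = [
    ("https://example.com/category/", None),
    ("https://example.com/filter/size-small", None),       # blocked, still checked
    ("https://example.com/filter/color-red", "nofollow"),  # never considered
]
fetched, checks = crawl_step(links)
print(fetched)  # ['https://example.com/category/']
print(checks)   # 2
```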

                  About link juice: if you have a link, juice will be passed regardless of the type of link (you used to be able to use nofollow to preserve link juice, but no longer). This is a bit unfortunate for sites that use search filters, because they are such a valuable tool for users.

                  Don

                  • Bob_van_Biezen @donford

                    Hi Don,

                    Thanks for the clear explanation. I always thought disallow in robots.txt would give Google a sort of map (at the start of a site crawl) of the pages on the site that shouldn't be crawled, so it therefore didn't have to "check the locked cars".

                    If I understand you correctly, Google checks the robots.txt with every single page load?

                    That could definitely explain the high number of crawled pages per day.

                    Thanks a lot!

                    • donford @Bob_van_Biezen

                      Bob,

                      I doubt that crawlers access the robots.txt file for each request, but they still have to validate any URL they find against the list of blocked ones.
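That behaviour could be sketched as fetching and parsing robots.txt once per host, then validating every discovered URL against the cached rules. The rules and URLs are hypothetical:

```python
# Sketch of per-host robots.txt caching: parse once, check many URLs.
# In reality the lines would be fetched from http://host/robots.txt.
from urllib.robotparser import RobotFileParser

robots_cache = {}

def is_allowed(host, path, robots_lines):
    """Look up (or build) the cached parser for host, then test path."""
    if host not in robots_cache:
        rp = RobotFileParser()
        rp.parse(robots_lines)
        robots_cache[host] = rp
    return robots_cache[host].can_fetch("*", "https://%s%s" % (host, path))

rules = ["User-agent: *", "Disallow: /filter/"]
print(is_allowed("example.com", "/category/", rules))          # True
print(is_allowed("example.com", "/filter/size-small", rules))  # False
print(len(robots_cache))  # 1  (robots.txt parsed only once)
```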

                      Glad to help,

                      Don
