The Moz Q&A Forum


    Block in robots.txt instead of using canonical?

    Intermediate & Advanced SEO
    • YairSpolter

      When I use a canonical tag for pages that are variations of the same page, it basically means that I don't want Google to index this page. But at the same time, spiders will go ahead and crawl the page. Isn't this a waste of my crawl budget? Wouldn't it be better to just disallow the page in robots.txt and let Google focus on crawling the pages that I do want indexed?

      In other words, why should I ever use rel=canonical as opposed to simply disallowing in robots.txt?
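
      Just to make the two options concrete (URLs made up for illustration):

        <!-- Option A: keep the variation crawlable but point it at the main page -->
        <link rel="canonical" href="https://example.com/widget" />

        # Option B: keep crawlers away from the variations entirely (robots.txt)
        User-agent: *
        Disallow: /*?color=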

      • RobertFisher

        Yair

        I think the canonical is the better option. I am unsure about your use of the term "crawl budget": there is no fixed number of times a page or a site will be crawled versus a second, similar site. I have a huge reference site that is crawled every couple of days, and I have small sites of ten pages that are crawled weekly or less. Crawl frequency depends on traffic and the behavior of that traffic (which includes the number of inbound links, etc.) and on things like re-submitting your sitemap.
        The canonical tag was created to tell the search engine which page you consider the relevant one. Go ahead and use it.

        Best

        Robert

        • YairSpolter @RobertFisher

          Thanks for the response, Robert.

          I have read lots of SEO advice on maximizing your "crawl budget" - making sure your internal link system is built well to send the bots to the right pages. According to my research, since bots only spend a certain amount of time on your site when they are crawling, it is important to do whatever you can to ensure that they don't "waste time" on pages that are not important for SEO. Just as one example, see this post from AJ Kohn.

          Do you disagree with this whole approach?

          • Devanur-Rafi @YairSpolter

            Hi, even if you use a robots.txt file to block these pages, Google can still pick up references to them from third-party websites and index the URLs from there. Such pages will not have a description snippet in the search results; instead they will show text that reads:

            A description for this result is not available because of this site's robots.txt.

            So, to fully stop Google from indexing these pages, you can use the page-level meta robots tag along with the robots.txt method; the page-level robots meta tag complements the robots.txt method. By the way, a robots.txt file can definitely save you some crawl budget. I don't think you should be thinking much about crawl budget, though, as long as your website is easy to crawl, with simple text-based internal links, fast servers, and so on.
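
            For reference, that page-level tag goes in the <head> of each page you want kept out of the index (one thing to keep in mind: Google has to be able to crawl a page to see this tag, so it only works on pages that robots.txt does not block):

              <!-- noindex keeps the page out of search results; follow still lets link equity pass -->
              <meta name="robots" content="noindex, follow">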

            Those are my two cents, my friend.

            Best regards,

            Devanur Rafi

            • RobertFisher @YairSpolter

              I don't disagree at all, and I think AJ Kohn is a rock star. In SEO, I have learned over time that there are rarely absolutes like "always do this" or "never do that." I based my answer on how you posed the question.

              If you read AJ's post, you will note that the rel=canonical issue comes up in the comments from others, not in the body of his post. Yes, if a page is superfluous, like a cart page or a contact page, use robots.txt to block the crawl. But if you have a page with rank, links, etc. that help your canonical page, how are you helping yourself by forgoing rel=canonical?

              I think his bigger point was that you want to be aware that the number of times you are crawled is at least partially governed by PageRank, which is governed by all those other things we discussed. If you understand that and keep the crawl focused on your better pages, you help yourself.

              Does that clarify a bit?
              Best

              • TakeshiYoung

                I would go with the canonicals. If there are any links going to these duplicate pages, the canonical prevents the "link juice evaporation" you would get from links that Google can see but cannot follow because of a robots.txt block. Best to let Google crawl the page and see the canonical so that it understands it is a duplicate page.

                Having canonicals on all your pages is good practice anyway, as it can prevent inadvertent duplicate content from things like query parameters.
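
                For example, a self-referencing canonical on every page folds parameterized versions of a URL back into the clean one (example.com is just illustrative):

                  <!-- Served on /shoes, /shoes?sort=price, /shoes?utm_source=news, etc. -->
                  <link rel="canonical" href="https://example.com/shoes" />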

                Crawl budget can be of some concern if you're talking about a massive number of pages, but start by taking a look at Google Webmaster Tools and comparing how many of your pages are being crawled against the total number of pages on your site. As long as that ratio isn't small, you should be good. You can also earn more crawl budget by building links and growing your domain authority.

                • YairSpolter @TakeshiYoung

                  Thanks Takeshi.

                  Maybe I should have explained that I'm talking about a large site: around 400K pages, with more than 1,000 new pages created per week. That's why I am concerned about managing crawl budget. The pages that I'm referring to are not linked to anywhere on the site. Sure, Google can potentially get to them if someone decides to link to them from their own site, but this is unlikely (each is a sub-page of the main profile page, which is where people would naturally link) and it certainly won't happen on a large scale, so I'm not really concerned about link-juice evaporation. According to AJ Kohn here, it's not enough to see in Webmaster Tools that Google has indexed all the pages on our site; there is also the question of how often pages are being crawled, which is what we are trying to optimize for.

                  So it's really a question of balance: if these pages (and there are many of them) are included in the crawl (and in our sitemap), that is potentially a real waste of crawl budget. Doesn't this outweigh the minuscule, far-fetched potential loss?

                  Would love to hear your thoughts...

                  • YairSpolter @RobertFisher

                    Thanks Robert.

                    The pages that I'm talking about disallowing have no rank or links. They are sub-pages of a profile page; if anything, the main page will be linked to, not the sub-pages.

                    As I explained to Takeshi above, this is a large site (around 400K pages, with more than 1,000 new pages created per week), which is why I'm concerned about managing crawl budget. The pages in question are not linked to anywhere on the site, and while Google could reach one if someone linked to it from their own site, that is unlikely and certainly won't happen on a large scale, so I'm not worried about losing PageRank on the main profile page if I disallow them. To be clear: we have many thousands of pages with content that we want to rank; the pages I'm talking about are not important in those terms.

                    So it's really a question of balance: if these pages (and there are many of them) are included in the crawl (and in our sitemap), that is potentially a real waste of crawl budget. Doesn't this outweigh the minuscule, far-fetched potential loss?

                    I understand that Google designed rel=canonical for this scenario, but that does not mean it's necessarily the best way to go considering the other options.
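
                    If we do go the robots.txt route, I'm picturing wildcard rules along these lines (the paths are hypothetical, just to show the pattern):

                      User-agent: *
                      # Block only the sub-pages; /profile/<name> itself stays crawlable
                      Disallow: /profile/*/activity
                      Disallow: /profile/*/followers

                    We would also leave those URLs out of the XML sitemap.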

                    • RobertFisher @YairSpolter

                      With this info, I would go with robots.txt because, as you say, the benefit outweighs any potential loss given the use of these pages and the absence of links.

                      Thanks
