The Moz Q&A Forum

    Robots.txt assistance

    Intermediate & Advanced SEO
    • theLotter

      I want to block all the inner archive news pages of my website in robots.txt. We don't have the R&D capacity to set up rel=next/prev or to create a central page that all the inner pages would carry a canonical back to, so robots.txt is the solution we're left with.

      The first page I want indexed reads:
      http://www.xxxx.news/?p=1

      All subsequent pages, which I want blocked because they don't contain any new content, read:
      http://www.xxxx.news/?p=2
      http://www.xxxx.news/?p=3
      etc....

      There are currently 245 inner archived pages and I would like to set it up so that future pages will automatically be blocked since we are always writing new news pieces. Any advice about what code I should use for this?

      Thanks!

      • Andy.Drinkwater

        I haven't actually done this myself, but I suspect that pattern matching is your solution here.

        However, what you want to be able to do is disallow the whole pattern and then allow just the first page:

        Allow: /?p=1
        Disallow: /?p=*
        

        The thing I don't have the answer to is whether this will work by first allowing page 1 and then blocking all the others. I don't have a method for blocking this via robots.txt, as it is normally handled with the other solutions you mention.

        You can try it though through Webmaster tools:
        https://support.google.com/webmasters/answer/156449?hl=en

        1. On the Webmaster Tools Home page, click the site you want.
        2. Under Crawl, click Blocked URLs.
        3. If it's not already selected, click the Test robots.txt tab.
        4. Copy the content of your robots.txt file, and paste it into the first box.
        5. In the URLs box, list the site to test against.
        6. In the User-agents list, select the user-agents you want.

        -Andy

        • Martijn_Scheijbeler

          I think it has to be the other way around: Disallow: /?p=* first, then Allow: /?p=1, since you want to first disallow everything with the p parameter and then allow the first page. You should test it, but I think that in Andy's example you would still block the first page you've just allowed.
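          For what it's worth, Google documents its precedence as "the longest matching rule wins, and the least restrictive (Allow) wins ties", which would make the order of the lines irrelevant to Googlebot. You can sanity-check both orderings offline with a rough sketch of that documented behavior (this matcher is my own illustration, not an official tool):

```python
import re

# Rough sketch of Google's documented robots.txt precedence:
# the longest matching path pattern wins, and Allow wins ties.
# '*' matches any run of characters; a trailing '$' anchors the end.
def _matches(pattern: str, url_path: str) -> bool:
    anchored = pattern.endswith("$")
    core = pattern[:-1] if anchored else pattern
    regex = "".join(".*" if ch == "*" else re.escape(ch) for ch in core)
    return re.match(regex + ("$" if anchored else ""), url_path) is not None

def allowed(url_path: str, rules) -> bool:
    # rules: list of ("allow" | "disallow", path-pattern) pairs
    verdict, best_len = "allow", -1  # no matching rule means allowed
    for kind, pattern in rules:
        if _matches(pattern, url_path):
            length = len(pattern)
            if length > best_len or (length == best_len and kind == "allow"):
                verdict, best_len = kind, length
    return verdict == "allow"

# Under longest-match rules, the order of the lines makes no difference:
for rules in ([("allow", "/?p=1"), ("disallow", "/?p=*")],
              [("disallow", "/?p=*"), ("allow", "/?p=1")]):
    assert allowed("/?p=1", rules)      # tie on length: allow wins
    assert not allowed("/?p=2", rules)  # only the disallow matches
```

          One caveat this sketch surfaces: a bare Allow: /?p=1 is a prefix rule, so it also matches /?p=10, /?p=11 and so on; anchoring it as Allow: /?p=1$ keeps only the first page allowed.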

          • Andy.Drinkwater @Martijn_Scheijbeler

            Definitely something to test. I'm not sure of the rules that Google will apply with this and which way round works.

            -Andy

            • CleverPhD

              I think you are missing something here if you want to get these pages out of the index. Plus, your use of Robots may harm how Google finds and ranks your actual news items.

              First, you have to add the noindex meta tag to pages 2-N of your pagination, then let Google crawl them and drop them from the index.

              If you just add them to robots.txt, Google will not crawl them, but it will also not remove them from the index.

              Once they are out of the index, keeping those tags in place will prevent them from being reindexed, and you won't need to add them to robots.txt at all.

              More importantly, you want pages 2-N crawled but not indexed. You want Google to crawl your paginated pages so it can find all of your deep content. Otherwise, unless you have an XML or HTML sitemap or some other crawlable navigation aid, you are actually preventing Google from crawling, and therefore ranking, your content.
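              On pages 2-N, that means something like this in the page head (a generic snippet, not code from the site in question), while leaving robots.txt alone so Google can still crawl them:

```html
<!-- On /?p=2 and onward: keep the page crawlable but out of the index.
     "follow" tells crawlers to still follow links to the deeper articles. -->
<meta name="robots" content="noindex, follow">
```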

              Read this Moz post

              http://moz.com/learn/seo/robotstxt

              There is a section titled "Why Meta Robots is Better than Robots.txt" that will confirm my points.

              Lastly, step back a second. If you are a news/content site, this content helps you generate revenue, and you have a bunch of important news pages, then spend some money on development to implement rel=next/prev. It is worth it to get Google crawling your stuff properly.

              Good luck!

              • Andy.Drinkwater @CleverPhD

                If you read the original post again, Sara says "we don't have R&D capacity".

                They wouldn't be able to do all this.

                -Andy

                • CleverPhD @Andy.Drinkwater

                  Thanks Andy. I did see that, and that is why I mentioned at the end that, as a content site generating revenue from this, they should consider investing some money in that direction.

                  If they are short on money/resources/capacity, and the robots.txt solution ends up hurting indexation of the very content that produces and justifies the current level of resources, they could end up in a worse position than where they started, i.e. with even less money/resources/capacity.

                  • Andy.Drinkwater @CleverPhD

                    "I mentioned at the end that being a content site and if that generates revenue that they should consider investing some money in that direction"

                    Absolutely.

                    • theLotter

                      Thanks for all the input and advice!

                      We are a gaming site that publishes industry news 2-3 times a week, but that is not our main source of income.

                      © 2021 - 2026 SEOMoz, Inc., a Ziff Davis company. All rights reserved. Moz is a registered trademark of SEOMoz, Inc.