Help with robots.txt on Magento
-
Hi everybody,
I need your help fixing some HTML and crawling errors generated by Magento on my client's website, www.casabiancheria.it
I have a problem with duplicate meta information, because the filters generate many URLs such as

/stampe-romagnole/tovaglie-con-tovaglioli/colore/beige,marrone,giallo,lilla/show/all.html

/stampe-romagnole/tovaglie-con-tovaglioli/colore/beige,marrone,lilla/show/all.html

These are generated by the /colore/ filter, so they carry duplicate content and duplicate meta information.
I enabled canonical URLs in Magento, but this hasn't fixed the problem yet.
The sitemap contains only one link per product, so the canonicals seem to be working, but both Google Webmaster Tools and SEOmoz still report duplicate content and duplicate meta information.
I would like to solve this by excluding from robots.txt all URLs that contain filter parameters such as /colore/, /price/, /dimensions/, etc. (see the attachment).
I have tried several ways to exclude these links via robots.txt, but none of them worked.
Below is my current robots.txt... can someone help me write the correct version of this file so that all the URLs generated by Magento's filters are finally excluded?
Finally, is it also worth excluding Magento's images? (see the final lines of the robots.txt below).
Thank you very much for your help!
Alberto
User-agent: *
Disallow: /CVS
Disallow: /.svn$
Disallow: /.idea$
Disallow: /.sql$
Disallow: /.tgz$
Disallow: /w1nL1f3L0g1c/
Disallow: /app/
Disallow: /downloader/
Disallow: /errors/
Disallow: /includes/
Disallow: /lib/
Disallow: /pkginfo/
Disallow: /shell/
Disallow: /var/
Disallow: /404/
Disallow: /cgi-bin/
Disallow: /magento/
Disallow: /report/
Disallow: /scripts/
Disallow: /shell/
Disallow: /skin/
Disallow: /stats/
Disallow: /api.php
Disallow: /cron.php
Disallow: /cron.sh
Disallow: /error_log
Disallow: /get.php
Disallow: /install.php
Disallow: /LICENSE.html
Disallow: /LICENSE.txt
Disallow: /LICENSE_AFL.txt
Disallow: /README.txt
Disallow: /RELEASE_NOTES.txt
Disallow: /?dir
Disallow: /?dir=desc
Disallow: /?dir=asc
Disallow: /?limit=all
Disallow: /?mode*
Disallow: /index.php/
Disallow: /?SID=
Disallow: /checkout/
Disallow: /onestepcheckout/
Disallow: /customer/
Disallow: /customer/account/
Disallow: /customer/account/login/
Disallow: /catalogsearch/
Disallow: /catalog/product_compare/
Disallow: /catalog/category/view/
Disallow: /catalog/product/view/
Disallow: /cgi-bin/
Disallow: /cleanup.php
Disallow: /apc.php
Disallow: /memcache.php
Disallow: /phpinfo.php
Disallow: /control/
Disallow: /customize/
Disallow: /newsletter/
Disallow: /poll/
Disallow: /review/
Disallow: /sendfriend/
Disallow: /tag/
Disallow: /wishlist/
Disallow: /catalog/product/gallery/
Disallow: /?*
Disallow: //colore/
Disallow: //price/
Disallow: //misura/
Disallow: //marca/
Disallow: //sort-by/
Disallow: //combinazione/
Disallow: /*/seleziona-colore/
Disallow: /colore/
Disallow: /price/
Disallow: /misura/
Disallow: /marca/
Disallow: /sort-by/
Disallow: /combinazione/
Disallow: /seleziona-colore/
Disallow: /*colore/
Disallow: /*price/
Disallow: /*misura/
Disallow: /*marca/
Disallow: /*sort-by/
Disallow: /*combinazione/
Disallow: /*seleziona-colore/ -
-
Hi,
If the duplicate-content URLs are already in Google's index, excluding them in robots.txt will not remove them; it will only stop Googlebot from crawling them again. You could add a bit of conditional logic to your head.phtml template that checks for the relevant part of the URL and outputs a noindex,follow meta tag on the pages you don't want indexed. This is a more reliable way to make sure they are removed and stay out of the index in the future (be sure to test first!).
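A minimal sketch of what that conditional could look like in a Magento 1 head.phtml (the filter segment names are taken from the URLs in this thread; adapt them to your site and test on a staging copy first):

```php
<?php
// head.phtml -- sketch only; segment names are from this thread.
// Emit noindex,follow when the request URI contains a filter segment.
$requestUri = Mage::app()->getRequest()->getRequestUri();
$filterSegments = array('/colore/', '/price/', '/misura/',
                        '/marca/', '/sort-by/', '/combinazione/');
$isFiltered = false;
foreach ($filterSegments as $segment) {
    if (strpos($requestUri, $segment) !== false) {
        $isFiltered = true;
        break;
    }
}
if ($isFiltered): ?>
<meta name="robots" content="noindex,follow" />
<?php endif; ?>
```

Note that for the noindex tag to be seen at all, the page must remain crawlable, so don't block the same URLs in robots.txt while you are waiting for them to drop out of the index.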