Does having active urls with and without trailing .html impact SEO?
-
A recent update resulted in duplication of urls on our site due to inconsistent url structure:
Example:
- /category2.html and /category2 both active on the site as the same page
Will this hurt and should we create redirects using only one version of the url?
- /category2.html redirect to /category2
-
It may or it may not. Whether it causes duplicate content problems depends on the site, but it always eats into crawl budget, because Google has to crawl both versions of each URL.
I'm going to use trailing slash URLs (a more common issue and consolidation feature) in my example, but it's equally applicable to stripping .html or other non-resource file extensions (that is, anything other than resource files such as PDF, JPG, JS etc.)
Quite a lot of sites, even if they refuse to clean this up, will at least 'canonical' one URL to the other. That lets Google know that one version of the page is canonical and should receive relevant SEO traffic - it avoids duplicate-content penalties or algorithmic devaluations. There are two things it doesn't help Google out with:
- It doesn't tell Google not to crawl both URLs (you might say the canonical tag does that, but keep in mind Google has to have already loaded both URLs to read both canonical tags so... no)
- It doesn't consolidate SEO authority to the same degree that 301 redirects do. Say one page has some nice backlinks and the other one does too: that 'ranking benefit' won't all be consolidated onto one page. The canonical tag will make sure only one page ranks, but it won't gain the 'optimal' benefit of the backlinks for both web-pages (301s do a better job of that, generally)
So as you can see, even if you avoid content duplication issues, there are other problems that could potentially arise. This being the case, it's best to consolidate your URL architecture at any and all levels
My preference is this logic in the htaccess (via 301s):
- Always force a trailing slash for pages (as they may have sub-pages, and can also be directories)
- EXCEPT if the active URL is a file (e.g: somesite.com/some-folder/some-image.jpg) - in which case, do not force a trailing slash (files are never folders / directories)
- But if the file extension is page-based rather than resource based (e.g: .html) then strip the extension and finish with a trailing slash
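To make the three rules above concrete, here is a rough Python sketch of the decision logic only. It is not the htaccess itself, just a way to sanity-check which URL each request should 301 to. The function name preferred_url and the PAGE_EXTENSIONS set are my own assumptions for illustration; adjust the extensions to whatever your site treats as "pages".

```python
import posixpath
from urllib.parse import urlsplit, urlunsplit

# Assumption: extensions treated as page-based rather than resource-based.
PAGE_EXTENSIONS = {".html", ".htm"}


def preferred_url(url: str) -> str:
    """Return the single URL version that a request for `url` should 301 to."""
    parts = urlsplit(url)
    path = parts.path or "/"

    if not path.endswith("/"):
        stem, ext = posixpath.splitext(path)
        if ext.lower() in PAGE_EXTENSIONS:
            # Page-based extension: strip it and finish with a trailing slash.
            path = stem + "/"
        elif not ext:
            # No extension: treat as a page / directory and force the slash.
            path = path + "/"
        # Any other extension is treated as a resource file and left alone.

    # Scheme, host, query string and fragment are passed through unchanged.
    return urlunsplit(parts._replace(path=path))


if __name__ == "__main__":
    for example in ("/category2.html", "/category2", "/some-folder/some-image.jpg"):
        print(example, "->", preferred_url(example))
```

If preferred_url(u) differs from u, that request would be a candidate for a 301 to the returned URL. Note this sketch treats any path ending in a dot-something as a file, so a page slug like /v1.2 would need special handling.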
SEO is about avoiding risk. If there is conflicting information on a subject, pick the tried and tested (safe) method
Note that if you are on an MS / IIS server (rather than Linux / Apache) you may have to modify web.config instead of '.htaccess'