Resolving duplicate content is hard. The trickiest part is wrapping your head around which arrows in the canonicalization & de-duplication quiver to use in which situations to accomplish the twin goals of A) consolidating link popularity and B) deindexing duplicate content. I just finished a series on finding and fixing duplicate content, but I think this decision matrix will help clarify the situation. Before you can begin the process, however, you need to identify the duplicate content.
The first decision, as you look at the list of duplicate content URLs, is: “Is the content at these URLs an exact duplicate?” Answering “Yes” means that different URLs load the exact same content in the exact same order. Answering “No” means that the URL generates some filtering, sorting, pagination, breadcrumb or other variation in the content.
Let’s take the “Yes” path first: all of our URLs load content that is exactly the same as the others identified. The issue is a lack of canonicalization. The next question is: “Are these URL variations required to load the page or track data for analytics?” Answering “No” means that the duplicate URLs can be canonicalized in the SEO ideal manner with 301 redirects. Excellent! Answering “Yes” means that the URLs must continue to exist for some reason, at least in the short term. Canonicalizing in this case will mean applying canonical tags to the head of the file specifying the canonical URL.
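To make that concrete, here is a minimal sketch of a canonical tag using a hypothetical example.com URL. The tag goes in the head of each duplicate variation and names the single URL that should receive the credit:

<head>
<link rel="canonical" href="http://www.example.com/sales/" />
</head>

Every duplicate variation carries the same tag pointing at the same clean URL.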
Now let’s assume that the answer to the first question about exact duplicates is “No.” This issue is called cannibalization: more than one page of content targeting the same keyword theme.
The next question to ask is: “Can the content at this URL be differentiated sufficiently with content optimization to send its own valuable keyword signal?” Answering “Yes” means that the content does have SEO potential and should be optimized to target a unique, relevant keyword. Answering “No” means that the page is of low or no value to natural search and has no real chance to rank or drive traffic. Proceed to the next question.
The next question to ask is: “Does the content at this URL serve usability needs?” Answering “Yes” means that the content needs to remain live and accessible to provide functionality that humans enjoy or need (like sorting results by price). Canonicalizing in this case will mean applying canonical tags to the head of the file specifying the canonical URL. Answering “No” means that the page is of low or no value to both natural search and usability, and has no real chance to rank or drive traffic. Proceed to the next question.
The next question to ask is: “Is this URL string required to drive site functionality differently than the others, or to provide tracking data?” Answering “Yes” means that the URL needs to remain live and accessible to provide functionality. Canonicalizing in this case will mean applying canonical tags to the head of the file specifying the canonical URL. Answering “No” means that the page is of low or no value to natural search, usability or business needs and has no real chance to rank or drive traffic. The duplicate URLs can be canonicalized in the SEO ideal manner with 301 redirects. Excellent!
This duplicate content decision matrix identifies the ideal tactics to use to A) consolidate link popularity and B) deindex duplicate content. In some cases, there will be barriers to 301 redirects, canonical tags or content differentiation via optimization. In those cases, there are other options to deindex content, but they do not consolidate link popularity. This is a critical point to understand. The link popularity accumulated by URLs that are meta noindexed, disallowed or 404’d is wasted. Try, try and try again to remove internal barriers to 301s and canonical tags before resorting to Plan B deindexation tactics like meta noindexes, disallows or 404s.
Today I’m wrapping up the 3-part series on duplicate content & SEO Highlander-style: “There can be only one.” If you missed the first two posts, Highlander SEO refers to the similarities between canonicalization and the centuries-old battle between life-force absorbing immortals popularized in a 1986 film.
We’ve covered how The Highlander is relevant to SEO (in addition to being an awesome movie), and how to find the duplicate content we must battle. Today is all about fixing duplicate content and consolidating all that glorious immortal life force … err link popularity … at a single URL for a single page of content.
Q1: How do you choose the canonical version?
A1: Ideally the canonical version will be the page that has the most link popularity. Typically it’s the version of the URL linked to from the primary navigation of a site, the one without extra parameters and tracking and other gobbledygook appended to it. It’s typically the tidiest looking URL for the simple reason that dynamic systems append extra messy looking stuff in different orders as you click around a site. Ideally if you’ve done URL rewrites or are blessed with a platform that generates clean URLs, the canonical will be a short, static URL that may have a useful keyword or two in it.
Following our example yesterday from Home Depot, the canonical URL would be http://www.homedepot.com/webapp/wcs/stores/servlet/thdsitemap_product_100088778_10053_10051 and all of the others would be canonicalized to it. (Note that the URL is most definitely not canonicalized now, so please do not use this as a positive example to emulate.) Of course I’d much prefer a shorter, rewritten URL to canonicalize to without the intervening platform-required directories: http://www.homedepot.com/p-100088778 or http://www.homedepot.com/c-garden-center/b-toro/p-100088778, but that’s a story for another time.
You’ll also need to choose a canonical protocol, subdomain, domain, file path, file extension, case, and … well … anything else that your server allows to load differently. For example, these 32 fictitious URLs all seem like the same page of content to humans and would load exactly the same content at different URLs (a handful of them are sketched after the list below).
The task sounds simple: Choose a canonical version of the URL and always link to that canonical version. For example, always link to http://www.noncanonical.com/sales/, which is canonicalized to:
nonsecure http protocol (assuming it is nonsecure content)
www subdomain
noncanonical.com domain & TLD (if you own multiple domains/TLDs)
file path without the /content/directory
ending with a trailing slash (not without the trailing slash and not a file+extension)
all lowercase
no parameters
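For illustration, a handful of the hypothetical noncanonical variants that would all need to resolve to http://www.noncanonical.com/sales/ might look like this:

https://www.noncanonical.com/sales/
http://noncanonical.com/sales/
http://www.noncanonical.com/sales
http://www.noncanonical.com/sales/default.html
http://www.noncanonical.com/content/sales/
http://www.noncanonical.org/sales/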
This may look like a ridiculous example, but I just audited a network of sites with all of these (and more) sources of duplication on every site in the network. In the example above I didn’t even add case, tracking parameters, breadcrumb variations, parameter order variations or other URL variations, so this is actually a moderate example.
Can you see how the issue can multiply to hundreds of variations of a single page of content? For these duplicate URLs to exist and be indexed, at least one page is linking to each one. And each link to a duplicate URL is a lost opportunity to consolidate link popularity into a single canonical URL.
Connor MacLeod: “How do you fight such a savage?” Ramirez: “With heart, faith and steel. In the end there can be only one.”
Q2: How do you canonicalize once you’ve chosen the canonical URL?
A2: 301 redirects. Once the canonical URL has been chosen, 301 redirects are the ideal way to canonicalize, reduce content duplication and consolidate link popularity to a single canonical URL. There are other ways, and we’ll go over them in the next section, but only a 301 redirect CONSISTENTLY does all three of these things:
Redirects the user agent to the destination URL;
Passes link popularity to the destination URL;
Deindexes the URL that has been 301 redirected.
For sites with widespread canonicalization issues, pattern-based 301 redirects are the best bet. For instance, regular expressions can be written to 301 redirect every instance of a URL without the www subdomain to the same URL with the www subdomain. So the 301 redirect would detect the missing www in http://noncanonical.com/sales/ and would 301 redirect it to http://www.noncanonical.com/sales/, without having to write a 301 redirect specific to those 2 exact pages. For each element that creates duplicate content, a 301 redirect would be written to canonicalize that element. So the URL http://noncanonical.com/content/sales/default.html would trigger a 301 redirect to add the www subdomain, remove the /content directory and remove the default.html file+extension.
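As a rough sketch of how those pattern-based rules might look in an Apache .htaccess file (hypothetical rules for the fictitious noncanonical.com URLs above; the exact syntax depends on your server):

RewriteEngine On
# add the missing www subdomain
RewriteCond %{HTTP_HOST} ^noncanonical\.com$ [NC]
RewriteRule ^(.*)$ http://www.noncanonical.com/$1 [R=301,L]
# strip the /content/ directory
RewriteRule ^content/(.*)$ /$1 [R=301,L]
# drop the default.html file+extension in favor of the trailing slash
RewriteRule ^(.*/)?default\.html$ /$1 [R=301,L]

Written this way, each rule fires its own 301, which is exactly the multi-hop scenario described next.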
Yes, that’s 3 redirects for one URL. But assuming you change your own navigation to link to the canonical URLs, and assuming that there aren’t a boatload of sites referring monstrous traffic to your noncanonical URLs, this won’t be a long-term issue. The 301 redirects will consolidate the link popularity to the canonical URL and prompt the engines to deindex the undesirable URLs, which in turn will mean that there are fewer and fewer referrals to them. Which means that the 301s won’t be a long-term burden to the servers.
Speaking of which, fix your linking. Links to noncanonical content on your own site will only perpetuate the duplicate content issue, and put an extra burden on your servers as they serve multiple 301s before loading the destination URL.
Q3: What if I can’t do that? Are there other canonicalizing or de-duplication tactics?
A3: Yes, but they’re not as effective. The other tactics to suppress or canonicalize content each lack at least one of the important benefits of 301 redirects. Again, this is very important: Only a 301 redirect CONSISTENTLY:
Redirects the user agent to the destination URL;
Passes link popularity to the destination URL;
De-indexes the URL that has been 301 redirected.
Note that a 302 redirect only redirects the user agent. It does not pass link popularity or deindex the redirected URL. For some reason, 302 redirects are the default on many servers. Always check the redirects with a server header checker to be certain.
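As a hypothetical Apache example (using the fictitious URLs again; syntax varies by server), either of the following sends an explicit 301, whereas leaving off the status code or the R=301 flag quietly produces a 302:

Redirect 301 /old-sales.html http://www.noncanonical.com/sales/
RewriteRule ^old-sales\.html$ http://www.noncanonical.com/sales/ [R=301,L]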
There are instances where 301 redirects are too complicated to be executed confidently, such as when legacy URLs are rewritten to a keyword-rich URL. In some cases there’s no reliable common element between the legacy URL & the rewritten URL to allow a pattern-based 301 redirect to be written.
When planning URL rewrites, always be sure you’ll be able to 301 redirect the legacy URLs. But if they’re already live and competing with the rewritten URLs and it’s too late to turn back, then all you can do is go forward. In this instance, consider 301 redirecting all legacy structures to the most appropriate category URL or the homepage.
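A hypothetical catch-all might look like this, assuming an old /legacy-shop/ directory whose product URLs can’t be mapped one-to-one and a /garden-center/ category that’s the closest match:

RewriteRule ^legacy-shop/.* /garden-center/ [R=301,L]

It’s a blunt instrument, since every legacy URL lands on the same page, but it at least consolidates the stranded link popularity somewhere useful instead of leaving it to 404.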
If you can’t do that, your other options for the undesirable URLs are:
Canonical tags: The latest rage. I’ve only observed Google using canonical tags, and only in some instances. They are a suggestion that, if Google (and eventually Yahoo & Bing) chooses to follow it, will pass link popularity to the canonical URL and devalue the noncanonical version of the URL. If you have difficulty executing 301 redirects because there is no pattern that can reliably detect a match between a legacy URL and a canonical URL, then it’s likely that you won’t be able to match the legacy URL to a canonical URL in a canonical tag either.
404 server header status: Prompts deindexation but does not pass link popularity or the user agent to the canonical URL. Once the URL is deindexed, it will not be crawled again unless links to it remain.
Meta robots noindex: Prompts deindexation (or purgatory-like snippetless indexation) but does not pass link popularity or the user agent to the canonical URL. URLs may remain indexed but snippetless, as bots will continue to crawl some pages to determine whether the meta robots noindex tag is still present. May be used alone or combined with a follow or nofollow directive (sketched after this list).
Robots.txt disallow: The robots.txt file at the root of a domain can be used to block good bots from crawling specified content or directories. Deindexation (or purgatory-like snippetless indexation) will eventually occur over 3-12 months, depending on the number of links into the disallowed content. Does not pass link popularity or the user agent to the canonical URL.
Rel=nofollow: Contrary to popular belief, the rel=nofollow attribute in the anchor tag does not prevent crawling or indexation. It merely prevents link popularity from flowing through that one single link to that one single destination page. Does not deindex, does not pass link popularity, does not redirect the user agent.
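For reference, here is a minimal sketch of the meta robots noindex and rel=nofollow options above, using hypothetical URLs. The meta tag goes in the head of the page you want out of the index, while the nofollow attribute sits on an individual link:

<meta name="robots" content="noindex, follow" />
<a href="http://www.noncanonical.com/sales/?sessionid=abc123" rel="nofollow">Sale items</a>

A robots.txt disallow, by contrast, is just a plain text file at the root of the domain, for example a User-agent: * line followed by Disallow: /content/.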
See this duplicate content decision matrix for resolving duplicate content to learn more about which situations call for which option.
These other options are also useful when redirecting the human user is undesirable because the URL variation triggers a content change that’s valuable to human usability or business needs but not to search engines. For example, tracking parameters in URLs may not work if 301 redirected — the JavaScript doesn’t have time to fire and collect the data for the analytics package before the redirect occurs. In this instance, what’s best for SEO is bad for the ability to make data-driven business decisions. Ideally for SEO, tracking information would not be passed in the URL, so there wouldn’t be a duplicate URL to 301 redirect in the first place. But when there is, a canonical tag is the next line of defense. If that isn’t effective (and if the tracking code still can’t be changed), then the duplicate content can be deindexed with meta noindex, 404s, robots.txt disallows, etc.
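For instance, a hypothetical tracking URL could stay live for the analytics package while still telling the engines which version to credit. The page at http://www.noncanonical.com/sales/?trackingid=spring-email would simply carry this in its head:

<link rel="canonical" href="http://www.noncanonical.com/sales/" />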
Keep in mind, though, that when content is suppressed instead of being canonicalized, the link popularity that would flow to that canonical URL is now going nowhere. It is not being consolidated to the canonical URL to make that one single URL rank more strongly. It is just wasted. But at least the URL isn’t competing against its sister URLs for the same theme. The end result is kind of like The Highlander winning but losing a lot of blood. There is only one (URL), but instead of absorbing the life force of the other URLs with duplicate content, the meta noindex erases it from existence without channeling the life force back to The Highlander.
And so ends the 3-part series on Highlander-style SEO. The key thing to remember is “There can be only one!” That’s one URL for one page of content. Find the duplicate content with a crawler or log files and vanquish it with 301 redirects to consolidate its link popularity into a single, strong, canonical URL. Good luck, immortals.
Welcome back to The Highlander’s tutorial on duplicate content: “There can be only one.” Yesterday’s post on canonicalization Highlander-style compared the battle site owners face with duplicate content to a centuries-old battle between life-force absorbing immortals. Tonight’s post will be more practical, answering some of the questions my clients ask most frequently around canonicalization and duplicate content:
How do you discover the complete universe of duplicate content on a site?
How do you know if you found all the duplicate content?
How do you determine if it’s really an issue for your site?
We’re going to focus on finding the duplication and reasoning through why it matters today. I’ll save the related questions around fixing content duplication for my next post, because this one is huge already:
How do you choose the canonical version?
How do you canonicalize once you’ve chosen the canonical URL?
What if I can’t do that? Are there other canonicalizing or de-duplication tactics?
Good questions, all of them, especially when asked by a client that’s staring down the reality of 1,000+ duplicate URLs for a single product SKU. No joking. It’s not hard to find examples to use because duplicate content is such a widespread issue. In fact, even extremely savvy big-brand sites are stunned when the scope of the issue is defined. Let’s get started.
Q1: How do you discover the complete universe of duplicate content on a site?
A1: Start with a crawl. Log files may work as well if you have access, but I rarely get to work with a client’s log files so I rely on an outside crawler. I use Link Sleuth, a free multithreaded crawler that zips through a site and records every file it finds. (Note that Tilman’s opinions don’t reflect my own, but his crawler is great.) Careful not to use too many simultaneous threads or you may crash your server.
When the crawler is finished, output the file and filter out all the external links, JS/CSS/gif/jpg and other non-web-page files. When you have a nice clean list of all the URLs Link Sleuth could crawl, you can begin to look for patterns that indicate duplicate content (multiple URLs for a single page of content). For instance, if it’s an ecommerce site and the SKU number is in the URL, filter the URLs by that number. Or you can use a keyword or another identifying element in the URL. Failing that, just alpha sort the list and start scanning down it looking for patterns that indicate a different URL for the same content.
Every unique URL listed that loads the same page of content — even if it’s only different by a single character or upper vs lower case or http vs https — is duplicate content to the search engines.
Here’s a real world example from Google’s Home Depot indexation, the Toro Blue Stripe Drip Battery-Operated Hose-End Timer product page:
These are just 6 URLs out of 200+ that Google has indexed. For one product. Multiply that by the number of SKUs that Home Depot sells and holy cow, that’s a lot of duplicate content. In fact, with roughly 210,000 pages indexed in Google, it’s a safe bet that some percentage of Home Depot’s real content isn’t indexed at all. You wouldn’t guess that from looking at the lump sum, but 210,000 only leaves room for 1,000 products with 200 duplicate URLs each. Of course Google doesn’t give true indexation numbers, but it’s enough to make an SEO want to dig deeper and analyze where the issues are.
It’s not really possible to do the same kind of analysis on Yahoo or Bing because they don’t allow you to slice a site: query as finely as Google does. If I had actually crawled the site with Link Sleuth, it would have turned up many more duplicate URLs that Google hasn’t bothered to index.
So with your crawl export you have a list of the URLs, and you can filter and sort to find duplicate URLs. Make notes of the elements within the URL that cause duplication (what’s different in the URL) and save each example so you can go back and examine it again later.
Q2: How do you know if you found all the duplicate content?
A2: You don’t. Not a comforting answer, is it? The truth is, you won’t be able to find every last page of duplicate content, and it’s not critical that you do. Look for the major issues that create big pockets, like incorrect non-www to www canonicalization. Fixing that alone can cut a site’s duplicate content in half. But we’ll talk all about that in my next post. For now, understand that there will be a handful of root causes for the vast majority of a site’s duplicate content, and fixing those root causes resolves the vast majority of duplication.
Q3: How do you determine if it’s really an issue for your site?
A3: Consider the diffused link popularity. Duplicate content is often thought of as just a messiness issue — the spiders don’t want to crawl through your cruft so you’d better clean house. True, but that’s not the most important reason.
Even those who claim that duplicate content isn’t an issue will agree that internal linking is important to SEO. It’s important because internal links pass link popularity throughout a site from the typically most popular homepage all the way down to the lowliest product or content page. Without link popularity to demonstrate a page’s importance to the search engines, a page has less chance of ranking and driving traffic and conversions.
Duplicate content has a MAJOR impact on link popularity. Instead of laser focusing link popularity to a single URL for a page of content, duplicate content diffuses that popularity across multiple pages — 200+ for our poor Toro Blue Stripe Drip Battery-Operated Hose-End Timer.
Every page gets crawled and indexed because it has (or had at one point) at least one link to it. And if Link Sleuth is finding it, there’s an active link somewhere on the site to every URL it records. Each of those links passes some small portion of link popularity. And if each link links to a different URL for the same content, say 200+ URLs for Toro Blue Stripe Drip Battery-Operated Hose-End Timer, that’s 200+ tiny fragments of link popularity that are really intended for the same page. Consolidating that link popularity to a single URL increases that page of content’s chance of ranking at that single canonical URL.
Back to our example. Note the visible PageRank from the Google Toolbar for each of the 6 Toro Blue Stripe Drip Battery-Operated Hose-End Timer pages listed above:
How strong would this product page be if every link to it pointed at the same URL? There’s no numerical formula, but no one can disagree that more links to a single URL means more link popularity, which means a stronger ability to rank and drive traffic and conversions.
The only reasons duplicate content may not be a problem for a site include:
You’ve suppressed the duplicate content (with meta robots tags or robots.txt disallows) so that it’s not indexed. This is equivalent to sweeping a problem under a rug. It’s less visible, but it’s still there. All those suppressed URLs still have links to them, so that link popularity is in essence wasting away instead of strengthening your site.
Natural search isn’t a priority. Paid search and other marketing channels are doing fine without it. OK, that’s a business decision you can make I guess. *shrugs helplessly*
… I can’t think of any other reasons.
Stay tuned for the exciting conclusion of The Highlander’s approach to duplicate content. Until tomorrow, “There can be only one.”
I’ve been on a quest to clearly define the duplicate content issue for clients and help them understand exactly how they can clean it up. My colleague PJ Fusco came up with the perfect analogy for duplicate content: The Highlander. Connor MacLeod of the clan MacLeod does indeed say it best: “There can be only one!”
The Highlander faced a serious challenge: four and a half centuries of fights to the death with other immortal beings. After each battle, the living immortal absorbed the life force of his defeated (and usually decapitated) foe. Our hero, The Highlander, of course prevails and consolidates all the immortal life force into himself.
Wow. It’s amazing that a movie released in 1986 that most people don’t care for sums up the whole duplicate content issue in one iconic phrase. “There can be only one.” You’re right, Connor.
But which one is THE one? And how do you find them all to fight them in the first place? And which weapons will be most effective? And seriously, absorbing the immortal life force?
Excellent questions. I’ll answer them all over the next couple of days, but let’s start with the coolest one — the concept of consolidating life force into a single immortal being by slaying all of its foes. Sure makes canonicalization sound cool, right? And really, it is.
Canonicalization in SEO is the process of choosing a single version of a URL from multiple duplicate versions of the URL, and forcing them all to point back to the single chosen URL — the canonical URL. So basically, pick the strongest URL and point the others to that single strongest URL to make it even stronger. Sounds JUST like The Highlander, right?
I’ll detail the following questions in my next post. Stay tuned.
How do you discover the complete universe of duplicate content on a site?
How do you determine if it’s really an issue?
How do you choose the canonical version?
What options are available to canonicalize duplicate URLs?
What if I can’t do that on my site?
Last but not least, this fine movie is on sale at Amazon for $7.99. Figured I’d give them the link as a photo credit for borrowing the DVD image. Seriously, though, every geek needs to have The Highlander in their DVD library.