Highlander SEO Part 2: Finding Duplicate Content

Welcome back to The Highlander’s tutorial on duplicate content: “There can be only one.” Yesterday’s post on canonicalization Highlander-style compared the battle site owners face with duplicate content to a centuries-old battle between life-force absorbing immortals. Tonight’s post will be more practical, answering some of the questions my clients ask most frequently around canonicalization and duplicate content:

  1. How do you discover the complete universe of duplicate content on a site?
  2. How do you know if you found all the duplicate content?
  3. How do you determine if it’s really an issue for your site?

Today we’re going to focus on finding the duplication and reasoning through why it matters. Because this post is huge already, I’ll save the related questions around fixing content duplication for my next post:

  • How do you choose the canonical version?
  • How do you canonicalize once you’ve chosen the canonical URL?
  • What if I can’t do that? Are there other canonicalization or de-duplication tactics?

Good questions, all of them, especially when asked by a client that’s staring at the reality of 1,000+ duplicate URLs for a single product SKU. No joking. It’s not hard to find examples to use because duplicate content is such a widespread issue. In fact, even extremely savvy big-brand sites are stunned when the scope of the issue is defined. Let’s get started.

Q1: How do you discover the complete universe of duplicate content on a site?

A1: Start with a crawl. Log files may work as well if you have access, but I rarely get to work with a client’s log files so I rely on an outside crawler. I use Link Sleuth, a free multithreaded crawler that zips through a site and records every file it finds. (Note that Tilman’s opinions don’t reflect my own, but his crawler is great.) Careful not to use too many simultaneous threads or you may crash your server.

When the crawler is finished, export the results and filter out all the external links and the JS, CSS, GIF, JPG and other non-web-page files. When you have a nice clean list of all the URLs Link Sleuth could crawl, you can begin to look for patterns that indicate duplicate content (multiple URLs for a single page of content). For instance, if it’s an ecommerce site and the SKU number is in the URL, filter the URLs by that number. Or you can use a keyword or another identifying element in the URL. Failing that, just alpha sort the list and start scanning down it looking for patterns that indicate a different URL for the same content.
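If you would rather script the cleanup than do it by hand in a spreadsheet, the filtering and grouping could look something like the sketch below. It is only an illustration under a few assumptions: the crawl export has already been read into a plain list of URL strings, the asset extension list is a guess you would extend for your own site, and the SKU is assumed to be a 9-digit number as in the Home Depot example further down. The function names are hypothetical.

```python
import re
from collections import defaultdict
from urllib.parse import urlsplit

# Extensions treated as non-page assets (an assumption; extend as needed).
ASSET_EXTENSIONS = (".js", ".css", ".gif", ".jpg", ".jpeg", ".png", ".ico")


def clean_crawl(urls, site_host):
    """Keep only internal URLs that look like web pages."""
    pages = []
    for url in urls:
        parts = urlsplit(url)
        if parts.netloc.lower() != site_host:
            continue  # external link
        if parts.path.lower().endswith(ASSET_EXTENSIONS):
            continue  # script, stylesheet, or image
        pages.append(url)
    return pages


def group_by_sku(urls, sku_pattern=r"\d{9}"):
    """Group URLs by the first SKU-like number found in each URL."""
    groups = defaultdict(list)
    for url in urls:
        match = re.search(sku_pattern, url)
        if match:
            groups[match.group()].append(url)
    # Only SKUs reachable at more than one URL are duplicate candidates.
    return {sku: found for sku, found in groups.items() if len(found) > 1}
```

Any SKU that comes back with more than one URL is worth opening in a browser to confirm that the URLs really do load the same page.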

Every unique URL listed that loads the same page of content — even if it’s only different by a single character or upper vs lower case or http vs https — is duplicate content to the search engines.
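One way to surface these near-identical URLs is to reduce each one to a comparison key that ignores the differences mentioned above, then look for keys shared by more than one URL. The sketch below is a rough illustration, not a definitive test. It assumes the server returns the same page regardless of path case and protocol, which you should verify for your own site, and it also ignores parameter order, which is an extra assumption on my part. The function names are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlsplit, parse_qsl, urlencode


def comparison_key(url):
    """Build a key that ignores scheme, host/path case, and parameter order."""
    parts = urlsplit(url)
    params = sorted(parse_qsl(parts.query, keep_blank_values=True))
    return (parts.netloc.lower(), parts.path.lower(), urlencode(params))


def find_variants(urls):
    """Return groups of distinct URLs that share the same comparison key."""
    groups = defaultdict(set)
    for url in urls:
        groups[comparison_key(url)].add(url)
    return [sorted(group) for group in groups.values() if len(group) > 1]
```

Each group returned is a set of URLs that a search engine would treat as separate pages even though they probably load identical content.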

Here’s a real world example from Google’s Home Depot indexation, the Toro Blue Stripe Drip Battery-Operated Hose-End Timer product page:

  • http://www.homedepot.com/webapp/wcs/stores/servlet/thdsitemap_product_100088778_10053_10051
  • http://www.homedepot.com/webapp/wcs/stores/servlet/ProductDisplay?id=100088778&jspStoreDir=hdus&catalogId=10053&marketID=401&productId=100088778&locStoreNum=8125&langId=-1&linktype=product&storeId=10051&ddkey=THDStoreFinder
  • http://www.homedepot.com/webapp/wcs/stores/servlet/ProductDisplay?productId=100088778&catalogId=10053&storeId=10051&langId=-1
  • http://www.homedepot.com/webapp/wcs/stores/servlet/ProductDisplay?storeId=10051&langId=-1&catalogId=10053&productId=100088778&N=+502214+90401+10000003
  • http://www.homedepot.com/webapp/wcs/stores/servlet/ProductDisplay?storeId=10051&langId=-1&catalogId=10053&productId=100088778&N=10000003+90401+502214
  • http://www.homedepot.com/webapp/wcs/stores/servlet/ProductDisplay?ddkey=THDStoreFinder&id=100088778&jspStoreDir=hdus&marketID=401&productId=100088778&locStoreNum=8125&linktype=product&catalogId=10053&storeId=10051&langId=-1

These are just 6 URLs out of 200+ that Google has indexed. For one product. Multiply that by the number of SKUs that Home Depot sells and holy cow that’s a lot of duplicate content. In fact, with roughly 210,000 pages indexed in Google, it’s a safe bet that some percentage of Home Depot’s content isn’t indexed at all. You wouldn’t guess that from looking at the lump sum, but 210,000 indexed URLs only leaves room for about 1,000 products at 200 duplicate URLs each (210,000 / 200 = 1,050). Of course Google doesn’t give true indexation numbers, but it’s enough to make an SEO want to dig deeper and analyze where the issues are.

It’s not really possible to do the same kind of analysis on Yahoo or Bing because they don’t allow you to slice a site: query as finely as Google does. If I had actually crawled the site with Link Sleuth, it would have turned up many more duplicate URLs that Google hasn’t bothered to index.

So with your crawl export you have a list of the URLs, and you can filter and sort to find duplicate URLs. Make notes of the elements within the URL that cause duplication (what’s different in the URL) and save each example so you can go back and examine it again later.
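To note what differs between duplicate URLs without eyeballing every query string, you could compare their parameters programmatically. The sketch below is a minimal illustration, assuming the duplication lives in the query string rather than in the path. The function name is hypothetical, and `None` is used to mark a parameter that is missing from at least one URL.

```python
from urllib.parse import urlsplit, parse_qsl


def varying_parameters(urls):
    """Map each query parameter that differs across the URLs to the values seen.

    A value of None in the result means the parameter is absent from
    at least one of the URLs.
    """
    queries = [
        dict(parse_qsl(urlsplit(url).query, keep_blank_values=True))
        for url in urls
    ]
    names = set().union(*queries)
    varying = {}
    for name in sorted(names):
        seen = {query.get(name) for query in queries}
        if len(seen) > 1:
            varying[name] = seen
    return varying
```

Run against the third and fourth Home Depot URLs above, this reports only the `N` parameter, because every other parameter carries the same value in both and differs only in order.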

Q2: How do you know if you found all the duplicate content?

A2: You don’t. Not a comforting answer, is it? The truth is, you won’t be able to find every last page of duplicate content, and it’s not critical that you do. Look for the major issues that create big pockets, like incorrect non-www to www canonicalization. Fixing that alone can cut a site’s duplicate content in half. But we’ll talk all about that in my next post. For now, understand that there will be a handful of root causes for the vast majority of a site’s duplicate content, and fixing those root causes resolves the vast majority of duplication.

Q3: How do you determine if it’s really an issue for your site?

A3: Consider the diffused link popularity. Duplicate content is often thought of as just a messiness issue — the spiders don’t want to crawl through your cruft so you’d better clean house. True, but that’s not the most important reason.

Even those who claim that duplicate content isn’t an issue will agree that internal linking is important to SEO. It’s important because internal links pass link popularity throughout a site from the typically most popular homepage all the way down to the lowliest product or content page. Without link popularity to demonstrate a page’s importance to the search engines, a page has less chance of ranking and driving traffic and conversions.

Duplicate content has a MAJOR impact on link popularity. Instead of laser focusing link popularity to a single URL for a page of content, duplicate content diffuses that popularity across multiple pages — 200+ for our poor Toro Blue Stripe Drip Battery-Operated Hose-End Timer.

Every page gets crawled and indexed because it has (or had at one point) at least one link to it. And if Link Sleuth is finding it, there’s an active link somewhere on the site to every URL it records. Each of those links passes some small portion of link popularity. And if each link links to a different URL for the same content, say 200+ URLs for Toro Blue Stripe Drip Battery-Operated Hose-End Timer, that’s 200+ tiny fragments of link popularity that are really intended for the same page. Consolidating that link popularity to a single URL increases that page of content’s chance of ranking at that single canonical URL.
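The fragmentation can be made concrete by counting links two ways: per URL, and per product regardless of which URL variant was linked. The sketch below is illustrative only. A raw count of internal links is a crude stand-in for link popularity, not how any search engine actually scores a page. The link data is made up, the function names are hypothetical, and the `productId` parameter is assumed to identify the product as it does in the Home Depot URLs.

```python
import re
from collections import Counter


def links_per_url(links):
    """Count inbound links for each distinct target URL."""
    return Counter(target for _source, target in links)


def links_per_product(links, id_pattern=r"productId=(\d+)"):
    """Count inbound links per product ID, ignoring the URL variant linked."""
    totals = Counter()
    for _source, target in links:
        match = re.search(id_pattern, target)
        if match:
            totals[match.group(1)] += 1
    return totals


# Hypothetical internal links: (page the link is on, URL it points to).
links = [
    ("/", "/ProductDisplay?productId=100088778&storeId=10051"),
    ("/garden", "/ProductDisplay?storeId=10051&productId=100088778"),
    ("/timers", "/ProductDisplay?productId=100088778&storeId=10051&N=502214"),
]
fragmented = links_per_url(links)
consolidated = links_per_product(links)
```

Here `fragmented` holds three URLs with one link each, while `consolidated` shows a single product with three links. That gap is the popularity a single canonical URL would collect.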

Back to our example. Note the visible PageRank from the Google Toolbar for each of the 6 Toro Blue Stripe Drip Battery-Operated Hose-End Timer pages listed above:

[Figure: Duplicate product URLs and the visible PageRank for each]

How strong would this product page be if every link to it pointed at the same URL? There’s no numerical formula, but no one can disagree that more links to a single URL means more link popularity, which means a stronger ability to rank and drive traffic and conversions.

The only reasons duplicate content may not be a problem for a site are:

  1. You’ve suppressed the duplicate content (with meta robots tags or robots.txt disallows) so that it’s not indexed. This is equivalent to sweeping a problem under a rug. It’s less visible, but it’s still there. All those suppressed URLs still have links to them, so that link popularity is in essence wasting away instead of strengthening your site.
  2. Natural search isn’t a priority. Paid search and other marketing channels are doing fine without it. OK, that’s a business decision you can make I guess. *shrugs helplessly*
  3. … I can’t think of any other reasons.

Stay tuned for the exciting conclusion of The Highlander’s approach to duplicate content. Until tomorrow, “There can be only one.”


Originally posted on Web PieRat.

Highlander SEO: There Can Be Only One

I’ve been on a quest to clearly define the duplicate content issue for clients and help them understand exactly how they can clean it up. My colleague PJ Fusco came up with the perfect analogy for duplicate content: The Highlander. Connor MacLeod of the clan MacLeod does indeed say it best: “There can be only one!”

The Highlander faced a serious challenge: four and a half centuries of fights to the death with other immortal beings. After each battle, the living immortal absorbed the life force of his defeated (and usually decapitated) foe. Our hero, The Highlander, of course prevails and consolidates all the immortal life force into himself.

Wow. It’s amazing that a 1986 movie most people don’t care for sums up the whole duplicate content issue in one iconic phrase. “There can be only one.” You’re right, Connor.

But which one is THE one? And how do you find them all to fight them in the first place? And which weapons will be most effective? And seriously, absorbing the immortal life force?

Excellent questions. I’ll answer them all over the next couple of days, but let’s start with the coolest one — the concept of consolidating life force into a single immortal being by slaying all of its foes. Sure makes canonicalization sound cool, right? And really, it is.

Canonicalization in SEO is the process of choosing a single version of a URL from multiple duplicate versions of the URL, and forcing them all to point back to the single chosen URL — the canonical URL. So basically, pick the strongest URL and point the others to that single strongest URL to make it even stronger. Sounds JUST like The Highlander, right?

I’ll answer the following questions in detail in my next post. Stay tuned.

  1. How do you discover the complete universe of duplicate content on a site?
  2. How do you determine if it’s really an issue?
  3. How do you choose the canonical version?
  4. What options are available to canonicalize duplicate URLs?
  5. What if I can’t do that on my site?

Last but not least, this fine movie is on sale at Amazon for $7.99. Figured I’d give them the link as a photo credit for borrowing the DVD image. Seriously, though, every geek needs to have The Highlander in their DVD library.

