Highlander SEO Part 2: Finding Duplicate Content

Welcome back to The Highlander’s tutorial on duplicate content: “There can be only one.” Yesterday’s post on canonicalization, Highlander-style, compared the fight site owners face with duplicate content to a centuries-old battle between life-force-absorbing immortals. Tonight’s post will be more practical, answering some of the questions my clients ask most frequently around canonicalization and duplicate content:

  1. How do you discover the complete universe of duplicate content on a site?
  2. How do you know if you found all the duplicate content?
  3. How do you determine if it’s really an issue for your site?

Today we’re going to focus on finding the duplication and reasoning through why it matters. I’ll save the related questions around fixing content duplication for my next post, because this one is huge already:

  • How do you choose the canonical version?
  • How do you canonicalize once you’ve chosen the canonical URL?
  • What if I can’t do that? Are there other canonicalization or de-duplication tactics?

Good questions, all of them, especially when asked by a client that’s staring down the reality of 1,000+ duplicate URLs for a single product SKU. No joke. It’s not hard to find examples to use because duplicate content is such a widespread issue. In fact, even extremely savvy big-brand sites are stunned when the scope of the issue is defined. Let’s get started.

Q1: How do you discover the complete universe of duplicate content on a site?

A1: Start with a crawl. Log files may work as well if you have access, but I rarely get to work with a client’s log files, so I rely on an outside crawler. I use Link Sleuth, a free multithreaded crawler that zips through a site and records every file it finds. (Note that Tilman’s opinions don’t reflect my own, but his crawler is great.) Be careful not to use too many simultaneous threads, though, or you may crash the server.

When the crawler is finished, export the results and filter out all the external links, the JS/CSS/GIF/JPG files, and other non-page files. When you have a nice clean list of all the URLs Link Sleuth could crawl, you can begin to look for patterns that indicate duplicate content (multiple URLs for a single page of content). For instance, if it’s an ecommerce site and the SKU number is in the URL, filter the URLs by that number, or by a keyword or another identifying element in the URL. Failing that, just alpha sort the list and start scanning down it looking for patterns that indicate a different URL for the same content.
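If you’d rather script that first pass than eyeball a spreadsheet, here’s a minimal sketch in Python. It assumes you’ve saved the crawled URLs one per line to a urls.txt file; the hostname, the asset extensions, and the digits-only SKU pattern are my assumptions for illustration, so adjust them to your site.

```python
import re
from collections import defaultdict
from urllib.parse import urlparse

# Assumptions: urls.txt holds one crawled URL per line, and SKUs appear
# in the URL as a long run of digits, e.g. 100088778.
ASSET_EXTENSIONS = (".js", ".css", ".gif", ".jpg", ".jpeg", ".png", ".ico")
SITE_HOST = "www.example.com"       # hypothetical: keep only internal URLs
SKU_PATTERN = re.compile(r"\d{6,}")  # hypothetical SKU format: 6+ digits

groups = defaultdict(list)
with open("urls.txt") as f:
    for line in f:
        url = line.strip()
        if not url:
            continue
        parsed = urlparse(url)
        # Drop external links and non-page assets.
        if parsed.netloc != SITE_HOST:
            continue
        if parsed.path.lower().endswith(ASSET_EXTENSIONS):
            continue
        # Group by the first SKU-looking token in the URL.
        match = SKU_PATTERN.search(url)
        if match:
            groups[match.group()].append(url)

# Any SKU with more than one URL is a duplicate-content candidate.
for sku, urls in sorted(groups.items(), key=lambda kv: -len(kv[1])):
    if len(urls) > 1:
        print(f"SKU {sku}: {len(urls)} URLs")
```

Grouping on an identifying token beats pure alpha sorting because it catches duplicate URLs whose differences come early in the string, where a sorted list would scatter them.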

Every unique URL listed that loads the same page of content — even if it’s only different by a single character or upper vs lower case or http vs https — is duplicate content to the search engines.
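If the single-character point seems abstract, here’s a toy illustration: an index keyed on the exact URL string treats each variant as its own page. These example URLs are hypothetical.

```python
# To an index keyed on the exact URL string, each of these is a separate
# page, even though they all return the same content. (Hypothetical URLs.)
urls = {
    "http://www.example.com/widgets",
    "https://www.example.com/widgets",
    "http://www.example.com/Widgets",
    "http://www.example.com/widgets/",
}
print(len(urls))  # 4 distinct index entries for one page of content
```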

Here’s a real-world example from Google’s indexation of Home Depot: the Toro Blue Stripe Drip Battery-Operated Hose-End Timer product page:

  • http://www.homedepot.com/webapp/wcs/stores/servlet/thdsitemap_product_100088778_10053_10051
  • http://www.homedepot.com/webapp/wcs/stores/servlet/ProductDisplay?id=100088778&jspStoreDir=hdus&catalogId=10053&marketID=401&productId=100088778&locStoreNum=8125&langId=-1&linktype=product&storeId=10051&ddkey=THDStoreFinder
  • http://www.homedepot.com/webapp/wcs/stores/servlet/ProductDisplay?productId=100088778&catalogId=10053&storeId=10051&langId=-1
  • http://www.homedepot.com/webapp/wcs/stores/servlet/ProductDisplay?storeId=10051&langId=-1&catalogId=10053&productId=100088778&N=+502214+90401+10000003
  • http://www.homedepot.com/webapp/wcs/stores/servlet/ProductDisplay?storeId=10051&langId=-1&catalogId=10053&productId=100088778&N=10000003+90401+502214
  • http://www.homedepot.com/webapp/wcs/stores/servlet/ProductDisplay?ddkey=THDStoreFinder&id=100088778&jspStoreDir=hdus&marketID=401&productId=100088778&locStoreNum=8125&linktype=product&catalogId=10053&storeId=10051&langId=-1

These are just 6 of the 200+ URLs that Google has indexed. For one product. Multiply that by the number of SKUs Home Depot sells and holy cow, that’s a lot of duplicate content. In fact, with roughly 210,000 pages indexed in Google, it’s a safe bet that some percentage of Home Depot’s content isn’t indexed at all. You wouldn’t guess that from the lump-sum number, but at 200 duplicate URLs per product, 210,000 pages only leaves room for about 1,000 products, and Home Depot sells far more SKUs than that. Of course Google doesn’t give true indexation numbers, but it’s enough to make an SEO want to dig deeper and analyze where the issues are.

It’s not really possible to do the same kind of analysis on Yahoo or Bing because they don’t allow you to slice a site: query as finely as Google does. And if I had actually crawled the site with Link Sleuth, it would have turned up many more duplicate URLs that Google hasn’t bothered to index.

So with your crawl export you have a list of the URLs, and you can filter and sort it to find duplicate URLs. Make note of the elements within the URL that cause the duplication (what’s different in the URL) and save each example so you can go back and examine it later.
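To make those notes precise, you can diff the query strings within a duplicate group and let the differing parameters identify themselves. Here’s a sketch using two of the Home Depot URLs listed above:

```python
from urllib.parse import urlparse, parse_qs

# Two of the duplicate Home Depot URLs from the list above.
urls = [
    "http://www.homedepot.com/webapp/wcs/stores/servlet/ProductDisplay"
    "?productId=100088778&catalogId=10053&storeId=10051&langId=-1",
    "http://www.homedepot.com/webapp/wcs/stores/servlet/ProductDisplay"
    "?storeId=10051&langId=-1&catalogId=10053&productId=100088778"
    "&N=+502214+90401+10000003",
]

# Compare the sets of query parameter names across the group.
param_sets = [frozenset(parse_qs(urlparse(u).query)) for u in urls]
common = frozenset.intersection(*param_sets)
print("shared params:", sorted(common))
for u, params in zip(urls, param_sets):
    extras = params - common
    if extras:
        print("extra params:", sorted(extras), "->", u[:60] + "...")
```

The shared parameters define the product page; the extras are the duplication culprits to write down for later.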

Q2: How do you know if you found all the duplicate content?

A2: You don’t. Not a comforting answer, is it? The truth is, you won’t be able to find every last page of duplicate content, and it’s not critical that you do. Look for the major issues that create big pockets, like missing non-www to www canonicalization. Fixing that alone can cut a site’s duplicate content in half. But we’ll talk all about that in my next post. For now, understand that a handful of root causes will account for the vast majority of a site’s duplicate content, and fixing those root causes resolves the vast majority of the duplication.
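One of those root causes is easy to check from the same crawl export: tally the scheme and hostname combinations that appear. A sketch, reusing the urls.txt assumption from earlier:

```python
from collections import Counter
from urllib.parse import urlparse

# Assumes urls.txt holds one crawled URL per line, as in the earlier sketch.
hosts = Counter()
with open("urls.txt") as f:
    for line in f:
        url = line.strip()
        if url:
            parsed = urlparse(url)
            hosts[(parsed.scheme, parsed.netloc)] += 1

# More than one (scheme, host) pair with real volume usually means a big
# pocket of host-level duplication: www vs. non-www, http vs. https.
for (scheme, host), count in hosts.most_common():
    print(f"{scheme}://{host}: {count} URLs")
```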

Q3: How do you determine if it’s really an issue for your site?

A3: Consider the diffused link popularity. Duplicate content is often thought of as just a messiness issue — the spiders don’t want to crawl through your cruft so you’d better clean house. True, but that’s not the most important reason.

Even those who claim that duplicate content isn’t an issue will agree that internal linking is important to SEO. It’s important because internal links pass link popularity throughout a site from the typically most popular homepage all the way down to the lowliest product or content page. Without link popularity to demonstrate a page’s importance to the search engines, a page has less chance of ranking and driving traffic and conversions.

Duplicate content has a MAJOR impact on link popularity. Instead of laser focusing link popularity to a single URL for a page of content, duplicate content diffuses that popularity across multiple pages — 200+ for our poor Toro Blue Stripe Drip Battery-Operated Hose-End Timer.

Every page gets crawled and indexed because it has (or had at one point) at least one link to it. And if Link Sleuth is finding a URL, there’s an active link to it somewhere on the site. Each of those links passes some small portion of link popularity. And if those links point to different URLs for the same content (say, 200+ URLs for the Toro Blue Stripe Drip Battery-Operated Hose-End Timer), that’s 200+ tiny fragments of link popularity that are really intended for the same page. Consolidating that link popularity onto a single URL increases that page of content’s chance of ranking at that single canonical URL.
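To put a rough number on that fragmentation for your own site, you can tally internal links per URL and then roll them up per product. A sketch, assuming you’ve exported your crawler’s link data to a links.csv of source,target pairs; that file layout and the SKU regex are my assumptions, not a documented Link Sleuth export:

```python
import csv
import re
from collections import Counter

SKU_PATTERN = re.compile(r"\d{6,}")  # hypothetical SKU format, as before

# Assumes links.csv rows look like: source_url,target_url
inlinks = Counter()
with open("links.csv", newline="") as f:
    for _source, target in csv.reader(f):
        inlinks[target] += 1

# For one SKU, compare the fragments against the consolidated total.
sku = "100088778"
fragments = [count for url, count in inlinks.items()
             if (m := SKU_PATTERN.search(url)) and m.group() == sku]
print(f"{len(fragments)} duplicate URLs are splitting {sum(fragments)} "
      f"internal links that could all point at one canonical URL")
```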

Back to our example. Note the visible PageRank from the Google Toolbar for each of the 6 Toro Blue Stripe Drip Battery-Operated Hose-End Timer pages listed above:

[Image: duplicate product URLs and the visible PageRank for each]

How strong would this product page be if every link to it pointed at the same URL? There’s no numerical formula, but no one can disagree that more links to a single URL means more link popularity, which means a stronger ability to rank and drive traffic and conversions.

The only reasons duplicate content may not be a problem for a site include:

  1. You’ve suppressed the duplicate content (with meta robots tags or robots.txt disallows) so that it’s not indexed. This is equivalent to sweeping a problem under a rug. It’s less visible, but it’s still there. All those suppressed URLs still have links to them, so that link popularity is in essence wasting away instead of strengthening your site.
  2. Natural search isn’t a priority. Paid search and other marketing channels are doing fine without it. OK, that’s a business decision you can make, I guess. *shrugs helplessly*
  3. … I can’t think of any other reasons.

Stay tuned for the exciting conclusion of The Highlander’s approach to duplicate content. Until tomorrow, “There can be only one.”



Originally posted on Web PieRat.