A site redesign or switch to a new platform is kind of like a rebirth – it’s one of the most exciting and nerve-wracking times for the entire Internet marketing team. With everyone caught up in the branding, design, usability and technology, the impact on SEO can sometimes be forgotten until the last minute.
I wrote this article on redesigning a site with SEO in mind back in July for MultichannelMerchant.com and gave up looking for it to be published… so I missed its publish date in September. Maybe you did too. Here’s a redux of the original article.
While it’s difficult to determine what the natural search impact will be until working code hits a development server, keeping several mantras in mind and repeating them liberally will keep the team focused on the most critical elements to plan for SEO success. I love these mantras — I actually say them to myself as I’m auditing sites.
SEO Development Mantras
Links must be crawlable with JavaScript, CSS and cookies disabled.
Plain text must be indexable on the page with JavaScript & CSS disabled.
Every page must send a unique keyword signal.
One URL for one page of content.
We’re going to 301 that, right?
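The first mantra can be spot-checked against raw HTML, which is roughly what a crawler sees with JavaScript switched off. Here is a minimal sketch using only Python's standard library. The class name, the function name, and the sample markup are all hypothetical, and the "suspect" rule (no href, a bare "#", or a javascript: pseudo-URL) is a simplifying assumption, not a complete audit.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Sort <a> tags into links a script-free crawler can follow and ones it cannot."""

    def __init__(self):
        super().__init__()
        self.crawlable = []
        self.suspect = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Assumption: an anchor with no href, "#", or javascript: needs scripting to work.
        if not href or href == "#" or href.lower().startswith("javascript:"):
            self.suspect.append(href)
        else:
            self.crawlable.append(href)


def audit_links(raw_html):
    """Return (crawlable_hrefs, suspect_hrefs) for a chunk of raw HTML."""
    parser = LinkCollector()
    parser.feed(raw_html)
    return parser.crawlable, parser.suspect


# Hypothetical markup: one plain link and two that depend on JavaScript.
sample = """
<a href="/sales/">Sales</a>
<a href="javascript:openMenu()">Menu</a>
<a onclick="go('/deals/')">Deals</a>
"""
crawlable, suspect = audit_links(sample)
print("crawlable:", crawlable)
print("suspect:", suspect)
```

Anything that lands in the suspect list is worth a closer look before launch.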
When a site is stable on the development environment and the URLs are ironed out, identify a 301 redirect plan, build the new XML sitemap and make sure you have a plan in place to measure the impact of the relaunch.
Starting up my new SEO blog was an exercise in compromise. For once in my life, I decided not to obsess over the (super fun) details and instead moved straight to writing posts. I grabbed my domain, paid $10 to map a free WordPress blog to it, and slapped my social network icons in the sidebar.
I didn’t have the time to figure out hosting and plugins in addition to writing the content. I decided to focus on the things I had to have (content and a place to put it), and the things I could not change later (domain) without negative impact on the blog’s SEO. The rest (hosting, cool templates, control over SEO elements, plugins) can be added on or changed later after I’ve proven to myself that I have enough to write about.
In the process, I found a couple of important SEO elements missing in the free WordPress-hosted offering:
No 301 Redirects: This is a biggie for SEO. WordPress does not 301 redirect when you move your blog. In addition, WordPress does not 301 redirect your blog’s subdomain to your mapped domain name. Example: http://jillkocher.wordpress.com 302 redirects to https://webpierat.com/w. Consequently, if you want your blog to build link popularity & keep it, map it to a domain right from the start so that you can either keep the same URLs or control the 301 redirects yourself when you decide to move it later. Without the mapped domain, your content will either be stuck at the blog’s subdomain or will have to leave its link popularity behind when it moves.
No Plugins. Plugins are not available when blogs are hosted at wordpress.com. You can’t modify your title tags or meta data, and you can’t use sticky posts or the other great SEO plugins available for custom WordPress installations.
No Google Webmaster Tools Verification. WordPress is working on it, but the former verification method is kaput.
And a couple of social media and usability elements are also missing:
No JavaScript Support. WordPress’s free blogs only support HTML. Many of the cool plugins and widgets require JavaScript. Bummer.
No Social Media Integration in the Templates. If you want to offer quick logos and links to share posts on Twitter or Digg or Facebook … you can’t. That requires JavaScript, or manual coding in every post. There’s also no quick way to list all your social profiles in your sidebar without manual HTML.
That said, WordPress does include some nifty features in the free version:
XML Sitemap: WordPress creates an XML sitemap automatically, and makes it autodiscoverable on the robots.txt file it also generates automatically. Example: https://webpierat.com/w/robots.txt links to https://webpierat.com/w/sitemap.xml. Unfortunately you can’t modify these manually.
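Autodiscovery here just means that robots.txt carries a Sitemap line that bots can read. A small sketch of reading it with Python's standard library (site_maps() needs Python 3.8 or later). The robots.txt text below is a hypothetical stand-in, not the actual file WordPress generates; only the sitemap URL comes from the example above.

```python
from urllib.robotparser import RobotFileParser


def sitemaps_from_robots(robots_txt):
    """Return the sitemap URLs declared in the text of a robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # site_maps() returns None when no Sitemap line is present.
    return parser.site_maps() or []


# Hypothetical robots.txt content for illustration.
example_robots = """\
User-agent: *
Disallow:

Sitemap: https://webpierat.com/w/sitemap.xml
"""

found = sitemaps_from_robots(example_robots)
print(found)
```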
Edit Post URIs + 301 Redirects: Bloggers can edit the default title-based URL while they’re in the post admin. Cool. Better yet, if you publish the post and then want to change the URI after it’s live, WordPress automatically 301 redirects the older versions of the URL to the latest version (even through multiple URL changes).
404 Errors. Deleting a post produces a hard 404 error on that URL, which will prompt deindexation if a URL has been indexed. Would be better to 301 redirect it to harvest any link popularity the URL may have collected, but at least it’s a 404 and not 200 that lives on forever cruftily.
Tags & Categories: It’s nice to have the ability to create tags and categories for posts to easily create a navigational hierarchy. Be careful to choose wisely, though, so you don’t give prominence to non-valuable keywords & phrases. Similarly, don’t name both a tag & category the same thing or you’ll inadvertently set up competition for the same keyword phrase at different URLs.
One day soon I’ll start experimenting with a custom installation, but I’m still too excited about the prospect of playing with 301 redirects and CSS to allow myself to get distracted by it. There are a lot of other pros and cons to starting your blog on WordPress’s free platform, but these are the ones that made an impression on me.
My name doesn’t exist in Spanish. Jill would actually be pronounced Heeee – J’s make an H sound and LL makes a Y sound. Jill works just fine in the American Midwest, but when I moved to California for college I encountered a whole new world of Spanish-speaking folks who were genuinely confused by my ridiculous name.
On a far larger scale, the same issue exists across the internet as sites attempt to expand their reach from their native language to serve other countries in other languages. American English is very close to the Queen’s English, but American English just doesn’t speak to the British in the same way. It’s more than sprinkling in some U’s and swapping Z’s for S’s. In Wisconsin I’d stuff a package in the trunk, but in London they’d place a parcel in the boot. I watch my step, they mind the gap. And we’re not even going to talk about the double meaning of fanny packs, yikes.
Some sites translate their navigation and major headings only, but leave the featured image content or product detail information in the original language. I can’t think of a better way to illustrate that customers speaking that language are not a priority. Sears uses a translating technology that autoconverts English textual content to Spanish, but the featured content images are still in English. Message to the Spanish-speaking population: You don’t matter as much but we’d like you to give us your money.
Would you send your English-speaking founder to the streets of Paris to sell the widgets he’s passionate about and communicates magnificently well in English, armed only with a sheet of common French phrases, and expect him to sell like gangbusters? Of course not. He won’t be able to communicate effectively there, no matter how passionately he believes in his widgets. Similarly, French customers are not going to drop $300 (or euros) on your site if your French content is poorly written and archaically constructed. You wouldn’t write your primary content woodenly, so why on earth would you translate woodenly? It’s ineffective and even insulting. You can’t just slap a French flag on it and call it a day.
In some cases, the navigation is in the original language but some linguistically orphaned content exists deeper in the site. If you can’t read English, how could you navigate the English Virginia.gov site to find the Spanish translation of the preparation for college guide? Lovely that it’s offered, but how will the people who need it find it on the site?
Sites communicate most strongly in the languages of their creators, naturally. When attempts are made to create content for other languages, many sites make the mistake of simply swapping each word they use for its equivalent in the other language. Not so good unless you just really don’t want to make a connection with those readers / customers. In that case, save everyone the time and don’t bother to translate in the first place.
I enjoy babelfish as much as the next gal, but it’s no way to localize content. Localization is about creating content specifically for a geographically defined audience in their preferred language. It’s about carrying your message to them as an important audience equally valuable to the one that speaks the core language of the site, not expecting them to piece together your message and thank you for the opportunity to enjoy your delightful site so clearly not targeted at them despite its translated status.
If you’re serious about expanding your reach to other languages and countries, take the time to hire or contract with someone native to that country and fluent in the dialect required. Don’t ask him to translate the content you’ve already written. Instead, boil it down to a series of key message bullets and allow him to create the content from those key messages in a way that will resonate best. For more on localization and SEO, Andy Atkins-Kruger is an excellent resource at Multilingual Search Blog or WebCertain.com.
Resolving duplicate content is hard. The trickiest part is wrapping your head around which arrows in the canonicalization & de-duplication quiver to use in which situations to accomplish the twin goals of A) consolidating link popularity and B) deindexing duplicate content. I just finished a series on finding and fixing duplicate content, but I think this decision matrix will help clarify the situation. Before you can begin the process, however, you need to identify the duplicate content.
The first decision as you start at the list of duplicate content URLs is: “Is the content at these URLs an exact duplicate?” Answering “Yes” means that different URLs load the exact same content in the exact same order. Answering “No” means that there are some filtering, sorting, pagination, breadcrumb or other variations generated by the URL.
Let’s take the “Yes” path first: all of our URLs load content that is exactly the same as the others identified. The issue is a lack of canonicalization. The next question is: “Are these URL variations required to load the page or track data for analytics?” Answering “No” means that the duplicate URLs can be canonicalized in the SEO ideal manner with 301 redirects. Excellent! Answering “Yes” means that the URLs must continue to exist for some reason at least in the short term. Canonicalizing in this case will mean applying canonical tags to the head of the file specifying the canonical URL.
Now let’s assume that the answer to the first question about exact duplicates is “No.” This issue is called cannibalization: more than one page of content targeting the same keyword theme.
The next question to ask is: “Can the content at this URL be differentiated sufficiently with content optimization to send its own valuable keyword signal?” Answering “Yes” means that the content does have SEO potential and should be optimized to target a unique, relevant keyword theme. Answering “No” means that the page is of low or no value to natural search and has no real chance to rank or drive traffic. Proceed to the next question.
The next question to ask is: “Does the content at this URL serve usability needs?” Answering “Yes” means that the content needs to remain live and accessible to provide functionality that humans enjoy/need (like sorting results by price). Canonicalizing in this case will mean applying canonical tags to the head of the file specifying the canonical URL. Answering “No” means that the page is of low or no value to natural search and usability both, and has no real chance to rank or drive traffic. Proceed to the next question.
The next question to ask is: “Is this URL string required to drive site functionality differently than the others, or to provide tracking data?” Answering “Yes” means that the URL needs to remain live and accessible to provide functionality. Canonicalizing in this case will mean applying canonical tags to the head of the file specifying the canonical URL. Answering “No” means that the page is of low or no value to natural search, usability or business needs and has no real chance to rank or drive traffic. The duplicate URLs can be canonicalized in the SEO ideal manner with 301 redirects. Excellent!
This duplicate content decision matrix identifies the ideal tactics to use to A) consolidate link popularity and B) deindex duplicate content. In some cases, there will be barriers to 301 redirects, canonical tags or content differentiation via optimization. In those cases, there are other options to deindex content, but they do not consolidate link popularity. This is a critical point to understand. The link popularity accumulated by URLs that are meta noindexed, disallowed or 404’d is wasted. Try, try and try again to remove internal barriers to 301s and canonical tags before resorting to Plan B deindexation tactics like meta noindexes, disallows or 404s.
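For anyone who prefers to read the matrix as logic, the questions above can be sketched as a small Python function. The class name, field names, and return strings are my own labels for illustration. The branching follows the questions in the order they are asked above, and the Plan B tactics are deliberately left out because they only apply when these answers can't be acted on.

```python
from dataclasses import dataclass


@dataclass
class DuplicateUrl:
    """Answers to the decision-matrix questions for one duplicate URL (hypothetical model)."""
    exact_duplicate: bool
    required_to_load_or_track: bool = False   # asked on the exact-duplicate path
    can_be_differentiated: bool = False       # asked on the cannibalization path
    serves_usability: bool = False
    required_for_function_or_tracking: bool = False


def recommend(url):
    """Walk the questions in order and return the ideal tactic."""
    if url.exact_duplicate:
        # Lack of canonicalization: 301 unless the variation must stay live.
        return "canonical tag" if url.required_to_load_or_track else "301 redirect"
    # Cannibalization: several pages targeting the same keyword theme.
    if url.can_be_differentiated:
        return "optimize content"
    if url.serves_usability:
        return "canonical tag"
    if url.required_for_function_or_tracking:
        return "canonical tag"
    return "301 redirect"


# A price-sorted results page: not an exact duplicate, but humans need it.
sorted_page = DuplicateUrl(exact_duplicate=False, serves_usability=True)
print(recommend(sorted_page))
```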
Today I’m wrapping up the 3-part series on duplicate content & SEO Highlander-style: “There can be only one.” If you missed the first two posts, Highlander SEO refers to the similarities between canonicalization and the centuries-old battle between life-force absorbing immortals popularized in a 1986 film.
We’ve covered how The Highlander is relevant to SEO (in addition to being an awesome movie), and how to find the duplicate content we must battle. Today is all about fixing duplicate content and consolidating all that glorious immortal life force … err link popularity … at a single URL for a single page of content.
Q1: How do you choose the canonical version?
A1: Ideally the canonical version will be the page that has the most link popularity. Typically it’s the version of the URL linked to from the primary navigation of a site, the one without extra parameters and tracking and other gobbledygook appended to it. It’s typically the tidiest looking URL for the simple reason that dynamic systems append extra messy looking stuff in different orders as you click around a site. Ideally if you’ve done URL rewrites or are blessed with a platform that generates clean URLs, the canonical will be a short, static URL that may have a useful keyword or two in it.
Following our example yesterday from Home Depot, the canonical URL would be http://www.homedepot.com/webapp/wcs/stores/servlet/thdsitemap_product_100088778_10053_10051 and all of the others would be canonicalized to it. (Note that the URL is most definitely not canonicalized now, so please do not use this as a positive example to emulate.) Of course I’d much prefer a shorter, rewritten URL to canonicalize to without the intervening platform-required directories: http://www.homedepot.com/p-100088778 or http://www.homedepot.com/c-garden-center/b-toro/p-100088778, but that’s a story for another time.
You’ll also need to choose a canonical protocol, subdomain, domain, file path, file extension, case, and … well … anything else that your server allows to load differently. For example, these 32 fictitious URLs all seem like the same page of content to humans and would load exactly the same content at different URLs.
The task sounds simple: Choose a canonical version of the URL and always link to that canonical version. For example, always link to http://www.noncanonical.com/sales/, which is canonicalized to:
nonsecure http protocol (assuming it is nonsecure content)
www subdomain
noncanonical.com domain & TLD (if you own multiple domains/TLDs)
file path without the /content/ directory
ending with a trailing slash (not without a trailing slash and not a file+extension)
all lowercase
no parameters
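The rules in that list can be sketched as one Python function that maps any variation onto the canonical form. This is a rough illustration, not a production normalizer. The function name and the default.html / index.html file names are assumptions, and it assumes every page on the site is a directory-style URL, so it would mangle a URL that legitimately ends in a file name.

```python
from urllib.parse import urlsplit, urlunsplit

CANONICAL_SCHEME = "http"                # nonsecure protocol, per the list above
CANONICAL_HOST = "www.noncanonical.com"  # www subdomain, single domain & TLD
DEFAULT_DOCUMENTS = ("default.html", "index.html")  # assumed default file names


def canonicalize(url):
    """Map a URL variation onto the canonical URL for the same page."""
    path = urlsplit(url).path.lower()     # all lowercase
    if path.startswith("/content/"):      # drop the /content/ directory
        path = path[len("/content"):]
    for doc in DEFAULT_DOCUMENTS:         # drop the file+extension
        if path.endswith("/" + doc):
            path = path[: -len(doc)]
    if not path.endswith("/"):            # always end with a trailing slash
        path += "/"
    # Empty query and fragment: no parameters on the canonical URL.
    return urlunsplit((CANONICAL_SCHEME, CANONICAL_HOST, path, "", ""))


messy = "https://NonCanonical.com/content/Sales/default.html?src=email"
print(canonicalize(messy))
```

Running every internal link through a function like this before it is published is one way to make sure you always link to the canonical version.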
This may look like a ridiculous example, but I just audited a network of sites with all of these (and more) sources of duplication on every site in the network. In the example above I didn’t even add case, tracking parameters, breadcrumb variations, parameter order variations or other URL variations, so this is actually a moderate example.
Can you see how the issue can multiply to hundreds of variations of a single page of content? For these duplicate URLs to exist and be indexed, at least one page is linking to each one. And each link to a duplicate URL is a lost opportunity to consolidate link popularity into a single canonical URL.
Connor MacLeod: “How do you fight such a savage?” Ramirez: “With heart, faith and steel. In the end there can be only one.”
Q2: How do you canonicalize once you’ve chosen the canonical URL?
A2: 301 redirects. Once the canonical URL has been chosen, 301 redirects are the ideal way to canonicalize, reduce content duplication and consolidate link popularity to a single canonical URL. There are other ways, and we’ll go over them in the next section, but only a 301 redirect CONSISTENTLY does all three of these things:
Redirect the user agent to the destination URL;
Pass link popularity to the destination URL;
Deindex the URL that has been 301 redirected.
For sites with widespread canonicalization issues, pattern-based 301 redirects are the best bet. For instance, regular expressions can be written to 301 redirect every instance of a URL without the www subdomain to the same URL with the www subdomain. So the 301 redirect would detect the missing www in http://noncanonical.com/sales/ and would 301 redirect it to http://www.noncanonical.com/sales/, without having to write a 301 redirect specific to those 2 exact pages. For each element that creates duplicate content, a 301 redirect would be written to canonicalize that element. So the URL http://noncanonical.com/content/sales/default.html would trigger a 301 redirect to add the www subdomain, remove the /content directory and remove the default.html file+extension.
Yes, that’s 3 redirects for one URL. But assuming you change your own navigation to link to the canonical URLs, and assuming that there aren’t a boatload of sites referring monstrous traffic to your noncanonical URLs, this won’t be a long-term issue. The 301 redirects will consolidate the link popularity to the canonical URL and prompt the engines to deindex the undesirable URLs, which in turn will mean that there are fewer and fewer referrals to them. Which means that the 301s won’t be a long-term burden to the servers.
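To make the three hops concrete, here is a sketch that simulates pattern-based rules in Python. Real rules would live in your web server's rewrite configuration, not in a script. The rule list, the names, and the regular expressions are illustrative assumptions for the noncanonical.com example only.

```python
import re

# (description, pattern, replacement): one rule per source of duplication.
RULES = [
    ("add www subdomain", re.compile(r"^http://noncanonical\.com/"), "http://www.noncanonical.com/"),
    ("remove /content directory", re.compile(r"^(http://[^/]+)/content/"), r"\1/"),
    ("remove default.html", re.compile(r"/default\.html$"), "/"),
]


def redirect_chain(url):
    """Return the list of (status, location) hops a request for url would trigger."""
    hops = []
    matched = True
    while matched:
        matched = False
        for _description, pattern, replacement in RULES:
            target = pattern.sub(replacement, url, count=1)
            if target != url:
                # The first matching rule answers with a 301; the next request starts over.
                hops.append((301, target))
                url = target
                matched = True
                break
    return hops


chain = redirect_chain("http://noncanonical.com/content/sales/default.html")
for status, location in chain:
    print(status, location)
```

A URL that is already canonical matches no rule and produces an empty chain, which is exactly what your own navigation links should do.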
Speaking of which, fix your linking. Links to noncanonical content on your own site will only perpetuate the duplicate content issue, and put an extra burden on your servers as they serve multiple 301s before loading the destination URL.
Q3: What if I can’t do that? Are there other canonicalizing or deduplication tactics?
A3: Yes, but they’re not as effective. The other tactics to suppress or canonicalize content each lack at least one of the important benefits of 301 redirects. Again, this is very important: Only a 301 redirect CONSISTENTLY:
Redirects the user agent to the destination URL;
Passes link popularity to the destination URL;
De-indexes the URL that has been 301 redirected.
Note that a 302 redirect only redirects the user agent. It does not pass link popularity or deindex the redirected URL. For some reason, 302 redirects are the default on many servers. Always check the redirects with a server header checker to be certain.
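If you don't have a header checker handy, a few lines of Python will do the same job. This is a minimal sketch. The function names are hypothetical, the verdict strings simply restate the claims made in this post, and fetch_status() makes a live request, so it is defined here but not called. http.client does not follow redirects, which is what lets you see the raw status code.

```python
import http.client
from urllib.parse import urlsplit


def describe_status(status):
    """Summarize what a status code means for canonicalization, per this post."""
    if status == 301:
        return "301: redirects, passes link popularity, deindexes the old URL"
    if status == 302:
        return "302: redirects the user agent only"
    if 300 <= status < 400:
        return f"{status}: another redirect type, check it by hand"
    return f"{status}: not a redirect"


def fetch_status(url):
    """Send one HEAD request without following redirects; return (status, Location)."""
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=10)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=10)
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    try:
        conn.request("HEAD", path)
        response = conn.getresponse()
        return response.status, response.getheader("Location")
    finally:
        conn.close()


# Example usage (requires network access):
#   status, location = fetch_status("http://noncanonical.com/sales/")
#   print(describe_status(status), "->", location)
print(describe_status(302))
```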
There are instances where 301 redirects are too complicated to be executed confidently, such as when legacy URLs are rewritten to a keyword-rich URL. In some cases there’s no reliable common element between the legacy URL & the rewritten URL to allow a pattern-based 301 redirect to be written.
When planning URL rewrites, always be sure you’ll be able to 301 redirect the legacy URLs. But if they’re already live and competing with the rewritten URLs and it’s too late to turn back, then all you can do is go forward. In this instance, consider 301 redirecting all legacy structures to the most appropriate category URL or the homepage.
If you can’t do that, your other options for the undesirable URLs are:
Canonical tags: The latest rage. I’ve only observed Google use the canonical tags, and only in some instances. They are a suggestion that, if Google (and eventually Yahoo & Bing) chooses to follow it, will pass link popularity to the canonical URL and devalue the noncanonical version of the URL. If you have difficulty executing 301 redirects because there is no pattern that can reliably detect a match between a legacy URL and a canonical URL, then it’s likely that you won’t be able to match the legacy URL and a canonical URL in a canonical tag either.
404 server header status: Prompts deindexation but does not pass link popularity or the user agent to the canonical URL. Once the URL is deindexed, it will not be crawled again unless links to it remain.
Meta robots noindex: Prompts deindexation (or purgatory-like snippetless indexation) but does not pass link popularity or the user agent to the canonical URL. URLs may remain indexed but snippetless, as bots will continue to crawl some pages to determine if the meta robots noindex tag is still present. May be used alone or combined with follow or nofollow conditions.
Robots.txt disallow: The robots.txt file at the root of a domain can be used to block good bots from crawling specified content or directories. Deindexation (or purgatory-like snippetless indexation) will eventually occur over 3-12 months depending on the number of links into the disallowed content. Does not pass link popularity or the user agent to the canonical URL.
Rel=nofollow: Contrary to popular belief, the rel=nofollow attribute in the anchor tag does not prevent the crawl or indexation. It merely prevents link popularity from flowing through that one single link to that one single destination page. Does not deindex, does not pass link popularity, does not redirect the user agent.
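The options above are easier to compare side by side. Here they are as a small Python lookup table. The values only restate what this post says about each tactic, and the type, field names, and value wording are my own shorthand, so treat it as a cheat sheet and not as a statement of how any search engine is documented to behave.

```python
from typing import NamedTuple


class Effect(NamedTuple):
    redirects_user_agent: str
    passes_link_popularity: str
    deindexes_url: str


# Each row paraphrases the description of that tactic in this post.
TACTICS = {
    "301 redirect": Effect("yes", "yes", "yes"),
    "302 redirect": Effect("yes", "no", "no"),
    "canonical tag": Effect("no", "if honored", "devalued if honored"),
    "404 status": Effect("no", "no", "yes"),
    "meta robots noindex": Effect("no", "no", "yes, or snippetless"),
    "robots.txt disallow": Effect("no", "no", "eventually, or snippetless"),
    "rel=nofollow": Effect("no", "no", "no"),
}


def tactics_that_consolidate():
    """Return the tactics that can pass link popularity to the canonical URL."""
    return [name for name, effect in TACTICS.items()
            if effect.passes_link_popularity != "no"]


print(tactics_that_consolidate())
```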
See this duplicate content decision matrix for resolving duplicate content to learn more about which situations call for which option.
These other options are also useful when redirecting the human user is undesirable because the URL variation triggers a content change that’s valuable to human usability or business needs but not to search engines. For example, tracking parameters in URLs may not work if 301 redirected, because the JavaScript doesn’t have time to fire and collect the data for the analytics package before the redirect occurs. In this instance, what’s best for SEO is bad for the ability to make data-driven business decisions. Ideally for SEO, tracking information would not be passed in the URL, so there wouldn’t be a duplicate URL to 301 redirect in the first place. But when there is, a canonical tag is the next line of defense. If that isn’t effective (and if the tracking code still can’t be changed), then the duplicate content can be deindexed with meta noindex, 404s, robots.txt disallows etc.
Keep in mind, though, that when content is suppressed instead of being canonicalized the link popularity that would flow to that canonical URL is now going nowhere. It is not being consolidated to the canonical URL to make that one single URL rank more strongly. It is just wasted. But at least the URL isn’t competing against its sister URLs for the same theme. The end result is kind of like The Highlander winning but losing a lot of blood. There is only one (URL) but instead of absorbing the life force of the other URLs with duplicate content, the meta noindex erases it from existence without channeling the life force back to The Highlander.
And so ends the 3-part series on Highlander-style SEO. The key thing to remember is “There can be only one!” That’s one URL for one page of content. Find the duplicate content with a crawler or log files and vanquish it with 301 redirects to consolidate its link popularity into a single, strong, canonical URL. Good luck, immortals.