How Do You Know If It’s Duplicate Content?

Duplicate content is one of the Great Bogeymen of SEO, striking fear into the hearts of marketers and business owners since the early 2000s. It’s easy to find horror stories about duplicate content disasters. In my experience most of these are campfire tales, shared among people who didn’t actually live through the disasters themselves. And some of the people who swore up and down to me that their duplicate content problems were real failed to provide convincing evidence, if only because so much time had passed since they dealt with the issues.

Because the SEO industry lacks standards, anyone can claim anything. You can use your own definitions, your own metrics, and your own key performance indicators. Who is to say you did anything wrong, or that anyone else could have done better? It’s never as simple as "I’m right, you’re wrong" when it comes to diagnosing a problem between a Website and a search engine.

Unless you have access to the magic software that engineers at Bing and Google use to see why pages do (not) rank for given queries, you must deduce what is probably wrong and what may resolve the issue. Fortunately for many of us, you can sometimes fix the wrong problem and still get the result you’re looking for. And sometimes the problem fixes itself.

Duplicate content never fixes itself. It’s not going to happen. If a site really is serving multiple instances of the same content, it’s going to go right on doing that until someone changes something. If you’re lucky the search engines will canonicalize the duplicate instances for you. But luck isn’t a good business model and it’s not an optimization strategy.

Search engine optimization must support the business decision – which may be to create duplicate or near-duplicate content. And SEO must also improve the relationship between the Website and the search engine. In that context, here’s what everyone must do (in my humble opinion).

1 – Classify the Content Correctly

Each search engine has its own ideas about identifying and classifying duplicate content. And each (major) search engine has more than one algorithm that classifies things as (near-)duplicate content. If you’re looking for all the pages that quote the same string of text, you WANT to see (near-)duplicates in the search results.

What a search algorithm considers to be duplicate depends on what that algorithm evaluates. Some algorithms might include page formatting code. Other algorithms may only include visible text. Some algorithms may ignore embedded media (images, videos, etc.).
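
To make that concrete, here is a toy comparison in Python. It is a deliberately simplified shingle-and-overlap sketch, not a reconstruction of anything Bing or Google actually run, and the two sample pages are made up; it only shows that the same pair of pages can look like near-duplicates or not depending on whether the comparison includes markup or strips it away.

```python
# A toy illustration of why "duplicate" depends on what gets compared.
# This is NOT how any search engine works -- it's a minimal shingle/Jaccard
# sketch showing that two pages can score very differently depending on
# whether you compare raw markup or visible text only.

import re

def shingles(text, size=3):
    """Break text into overlapping word 'shingles' of the given size."""
    words = re.findall(r"\w+", text.lower())
    return {" ".join(words[i:i + size]) for i in range(len(words) - size + 1)}

def jaccard(a, b):
    """Overlap between two shingle sets: 1.0 = identical, 0.0 = nothing shared."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

def strip_tags(html):
    """Crudely remove markup so only the visible text is compared."""
    return re.sub(r"<[^>]+>", " ", html)

# The same article body wrapped in two different page templates.
page_alpha = "<div class='post'><h1>Widget Guide</h1><p>Widgets come in many sizes and colors.</p></div>"
page_beta = "<article><header>Widget Guide</header><section>Widgets come in many sizes and colors.</section></article>"

print("raw markup similarity:", jaccard(shingles(page_alpha), shingles(page_beta)))
print("visible text similarity:", jaccard(shingles(strip_tags(page_alpha)), shingles(strip_tags(page_beta))))
```

An algorithm that reads the raw markup scores these pages as quite different; one that strips the markup scores them as identical.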

It’s a mistake to assume that you’ll always be facing the same discriminatory logic. Here are a few examples of (near-)duplicate content that may create SEO problems.

A) Faceted navigation

Most often used on ecommerce sites, faceted navigation allows the user to sort content within a container (a folder, section, or category) by different criteria, such as size, weight, color, etc.

It’s all the same content, just presented in a different sort sequence that is encoded into the page URLs, à la "/product?w=1&c=2&s=3".
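
One common mitigation is to normalize those faceted URLs down to a single canonical URL. Here is a minimal Python sketch of that idea; the parameter names ("w", "c", "s") simply match the hypothetical example above and would need to be replaced with whatever your platform actually emits.

```python
# A minimal sketch: strip known sort/filter parameters from faceted URLs so
# every variant maps back to one canonical URL. The parameter names below are
# hypothetical -- use the ones your ecommerce platform really generates.

from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

FACET_PARAMS = {"w", "c", "s"}  # hypothetical sort/filter parameters

def canonical_url(url):
    """Return the URL with the known facet parameters removed."""
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query) if k not in FACET_PARAMS]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))

print(canonical_url("https://example.com/product?w=1&c=2&s=3"))
# -> https://example.com/product
print(canonical_url("https://example.com/product?id=42&s=3"))
# -> https://example.com/product?id=42
```

The normalized URL is what you would declare in a rel="canonical" link element (or through your platform’s own canonicalization settings), so the sorted and filtered variants all point back to one primary page.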

B) Syndicated content

If you’re distributing your content to other sites via RSS feeds or manual placement, you’re syndicating the content.

Unfortunately, if someone is scraping your site, you’re still syndicating content (against your will).

And if you’re republishing content from public writing platforms like WordPress.Com, Blogspot.Com, Medium, or The Conversation (to name just a few), you’re syndicating content.

C) Duplicate Meta Data

Google’s infamous "Omitted Results" notice isn’t always triggered by true duplicate content. On sites that use the same page titles and meta descriptions, the search engines often show you only a sample of listings and omit the rest. This is frustrating for people who write otherwise distinctive copy.

And if multiple Websites use the same page titles and meta descriptions, you may find yourself lumped in with the rest of the cruft.
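
If you want to know whether your own site has this problem, a simple audit of crawl data is usually enough. The sketch below is written in Python against a made-up input format; adapt it to whatever your crawler or site export actually produces.

```python
# A quick audit sketch: given (URL, title, meta description) rows from a
# crawl, list the titles and descriptions that appear on more than one page.
# The sample rows are placeholders.

from collections import defaultdict

pages = [
    ("https://example.com/red-widgets", "Widgets | Example", "Buy widgets online."),
    ("https://example.com/blue-widgets", "Widgets | Example", "Buy widgets online."),
    ("https://example.com/about", "About Us | Example", "Who we are and what we do."),
]

def find_shared(rows, field_index):
    """Map each title (field 1) or description (field 2) to the URLs that reuse it."""
    seen = defaultdict(list)
    for row in rows:
        seen[row[field_index]].append(row[0])
    return {text: urls for text, urls in seen.items() if len(urls) > 1}

print("shared titles:", find_shared(pages, 1))
print("shared descriptions:", find_shared(pages, 2))
```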

D) Redirected Content

When you move content from one URL to another, or redirect an entire site to a new host, the search engines may continue listing the old URLs in the SERPs for weeks, months, sometimes even years.

These duplicate listings are harmless but they strike fear into the hearts of many. As long as the searcher lands on the correct destination page after clicking through the search result, everyone is fine.
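
If the lingering listings still make you nervous, the useful check is not whether the old URLs appear in the SERPs but whether they still redirect to the right destinations. Here is a minimal sketch using the third-party requests library; the URL pairs are placeholders for your own old-to-new mapping.

```python
# Verify that old URLs still redirect to their intended destinations.
# Requires the third-party "requests" package; the URLs below are placeholders.

import requests

moves = {
    "https://old.example.com/widgets": "https://www.example.com/widgets",
    "https://old.example.com/about": "https://www.example.com/about-us",
}

for old_url, expected in moves.items():
    response = requests.get(old_url, allow_redirects=True, timeout=10)
    chain = [r.status_code for r in response.history]  # e.g. [301] or [301, 302]
    verdict = "OK" if response.url == expected else "CHECK"
    print(f"{verdict}  {old_url} -> {response.url}  (redirect chain: {chain or 'none'})")
```

As long as each old URL resolves to its intended destination, ideally in a single 301 hop, those stale listings will sort themselves out over time.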

E) Cross-indexed, Mis-classified Content

I’ve seen this problem on sites that utilize CDNs, sites that change host names, and sites that change canonical declarations.

For any of several possible reasons, the search engine crawls Alpha but displays the content as if it were found on Beta. It may have been found on the other URL at some time in the past. Maybe it exists in both places but you’re magically redirecting one copy to the other.

As I write this, Google’s search results continue to display scraped content from one of Reflective Dynamics’ portfolio sites on a domain that was seized by the registrar and which cannot be registered again. I know it can’t be registered because I’ve tried to claim it.

Google’s cache for those pages lists the URLs of the original pages that were scraped. Unfortunately for us, anyone who clicks on those listings is taken to the registrar’s default search portal. In other words, our content appears in the SERPs as if it was published on a domain that isn’t registered, and its cache images correctly display what is found on our site, but if you click on the links in the search results you’re taken to a search portal that only benefits the domain registrar.

F) Manually Republished Content, Multiple URLs

I don’t see this as much as I used to, but people still occasionally show me Websites where the exact same article is pasted into 2 or more URLs. They usually only differ by a folder or section name in the URL.

G) Boilerplate Text or Navigation

Disclaimers, link-saturated page footers, sidebar navigation, and other blocks of generic page copy may be deemed “duplicate content” by some algorithms. If everything else on the page is unique to that page then the page itself is usually not deemed to be a duplicate of anything – although the weight or abundance of the boilerplate content may result in the page being deemed a near-duplicate. See section 3 below for more about that.
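
As a rough illustration of that "weight or abundance" point, the sketch below estimates how much of a page’s visible text is shared template copy versus copy unique to the page. The boilerplate string, the sample pages, and the 50% threshold are all invented for the example; real classifiers are far more sophisticated than a word-overlap count.

```python
# A rough sketch of the "weight of boilerplate" idea: estimate what fraction
# of a page's visible text is shared template copy (disclaimers, footers,
# sidebars) rather than copy unique to that page. Everything here is invented
# for illustration -- the boilerplate, the pages, and the 50% threshold.

BOILERPLATE = (
    "All prices subject to change. See our terms of service for details. "
    "Subscribe to our newsletter for weekly updates."
)

def boilerplate_ratio(page_text, boilerplate=BOILERPLATE):
    """Fraction of the page's words that also appear in the shared boilerplate."""
    page_words = page_text.lower().split()
    boiler_words = set(boilerplate.lower().split())
    if not page_words:
        return 1.0
    shared = sum(1 for word in page_words if word in boiler_words)
    return shared / len(page_words)

thin_page = BOILERPLATE + " Blue widget, size 4."
rich_page = BOILERPLATE + " " + "A detailed hands-on review of the blue widget and how it held up. " * 40

for name, text in [("thin page", thin_page), ("rich page", rich_page)]:
    ratio = boilerplate_ratio(text)
    flag = "near-duplicate risk" if ratio > 0.5 else "probably fine"
    print(f"{name}: {ratio:.0%} boilerplate -> {flag}")
```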

2 – Understand What the Duplicate Content is Doing

One of the great myths of duplicate content still passed around in the SEO community is that "duplicate content splits your PageRank". That’s only true if the duplicate content isn’t connected via navigation.

If you can land on page Alpha and click through the site to reach the duplicate content on page Beta, then you’re not splitting your PageRank. The PageRank-like value flows through the site’s navigation.

If you can land on Alpha but NOT click through to Beta – regardless of whether Beta appears on the same host or a different domain altogether – then it is indeed possible for both copies of the content to earn links and accrue separate PageRank.

Shame on the other guy for not redirecting or canonicalizing to your version of the content; shame on you for not doing it if you control all the copies.

Sometimes people become upset at which URL the search engine displays. If you’re pointing more links to Alpha, and Alpha is linked to from an important page on your site, and/or other people are pointing more links to Alpha, then your canonical link relation declaring Beta to be the primary URL may be ignored. Search engines are under no obligation to follow your canonical suggestions.

So, yes, the PageRank may flow from Alpha to Beta through the site navigation, but Alpha may be seen as the more important copy of the content simply because most of the links point to it.
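
For anyone who wants to see the mechanics, here is a toy PageRank calculation in pure Python. The link graph, the damping factor, and the iteration count are textbook simplifications, not a model of any real search engine; the point is only that when Alpha and Beta are connected through the site’s navigation, the value circulates between them instead of pooling in two isolated copies, and Alpha comes out ahead because more pages link to it.

```python
# A textbook PageRank power iteration on a tiny, made-up link graph.
# Not how any search engine actually scores pages -- it just shows value
# flowing between two connected copies of the same content.

def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = set(links) | {p for targets in links.values() for p in targets}
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1 - damping) / len(pages) for p in pages}
        for page, targets in links.items():
            if not targets:
                continue
            share = damping * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank

# Alpha and Beta hold the same content. The home page links to Alpha, and
# Alpha and Beta link to each other through the site navigation.
site = {
    "home": ["alpha"],
    "alpha": ["beta", "home"],
    "beta": ["alpha", "home"],
}
print({page: round(score, 3) for page, score in pagerank(site).items()})
# Alpha scores higher than Beta because more pages link to it, but Beta
# still receives value through the navigation -- nothing is "split" away.
```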

3 – Don’t Confuse Near-Duplicate Content with Duplicate Content

There are different types of near-duplicates. Some are harmless. Some are evil footsteps into the Dark Side of SEO.

If you’re only changing keywords on a page – madlibbing as it’s often called – you’re not creating duplicate content so much as you’re creating doorway pages. Don’t waste your time arguing semantics. The purpose of this kind of content is obvious. It’s a lazy strategy that is worthy of every penalty or shred of disdain a search engine decides to throw at it.

Someone will always tell you that it works for them. I’ve frequently found such people – years later – adding, “It worked until I was caught.” Your mileage may vary but the end result is usually the same.

Search engines may allow doorway-like content to survive due to mitigating factors – but don’t bet on being able to play a long game with a doorway strategy.

Spun articles are another type of near-duplicate content. Bing and Google are pretty good at figuring out that content is spun, but not always. In fact, someone once brought a site to me that Google had manually penalized for "spinning" even though all the pages were original content, written by a knowledgeable writer. The only thing I could see wrong was that the article titles were awkwardly written. The articles themselves were NOT spun.

Angry as my friend was, I suggested he just wipe the site clean and start over. The penalty was lifted as soon as he filed a reconsideration request (after following my advice). It was a stupid, unnecessary penalty (in my humble opinion) but Google’s spam team isn’t obligated to explain why it gives a penalty.

Complex madlibbing happens often in certain industries that are dominated by directories. Think of real estate and travel sites that often list much the same information because they’re pulling data from databases. To differentiate themselves these sites throw a lot of widgets onto their pages, but it’s all duplicate content by one measure or another.

The search engines tolerate a few of these aggregator sites but not many. Every year I find Google listing new reverse phone number lookup sites because (apparently) the old aggregators fail to accumulate enough signals to earn permanent residency in the SERPs. Or maybe people just grow tired of trying to keep their sites live and they move on to other projects. All I know is that all the good reverse phone lookup sites were killed by Google.

If you put enough boilerplate content on a page – or too little unique, high value content – you’re creating stubs. Search engines may tolerate stubs for a while, especially for large, well-known sites that fill out most of their pages with substantial content. But you don’t want to launch a massive site built only with stubs. I advise people to use “noindex” on their stub pages until they’ve got real content.

4 – Create Value Wherever Possible

Whether you’re dealing with exact duplicate pages or near-duplicate pages, you need to distinguish them from each other as much as possible. If you cannot take what is there and improve it then you have a real challenge. It might even be a problem.

If a page cannot be improved then why does it exist?

What is the unique value to the random visitor who doesn’t know or care about search engine optimization, rankings, and passive income (or direct sales)?

When inexperienced marketers say they don’t know how to create value for autogenerated content, you know their projects are doomed. If you don’t know how to do it, learn fast.

5 – Blog Archives Don’t Need to Be Duplicate Content

Blog archives have earned a far worse reputation than they deserve. Search engines crawl these pages to find deep, older content on large Websites. Many SEO specialists happily slaughter these useful index archives with “noindex” or some misguided attempt to canonicalize them to the home page. Some people incorrectly use (now obsolete) pagination markup to misrepresent all these pages as essentially duplicates.

You can create duplicate content using Author, Category, Date, and Tag archive indexes on a blog. You don’t want to but you can. Just publish full Posts in those archives instead of excerpts. Voila! You have duplicate content.

Many people who rely on RSS feed readers prefer to read the full Posts in their feeds so they don’t have to visit potentially slow, ad-laden Websites. While that preference is understandable, publishers must do a little extra work to satisfy their RSS audiences while publishing only excerpts on their archive pages.

Adding every Post to multiple categories and/or tags may create duplicate content if all or most Posts end up in the same categories and/or tags. Why have multiple categories and tags then?

I recommend that people use Category archives to separate content by topic and Tag archives to index content by important keywords or phrases. But it’s easier to only use Categories or Tags.

Don’t use “noindex” on these pages if you distinguish them from each other (and the other archives). Let the search engines find and crawl these pages to help your older, deeper content. See my article about managing Website subduction for more information about that.

Conclusion

Another popular SEO myth is the infamous keyword cannibalization nonsense. This myth holds that you should only publish 1 article per query – rather like a cheap doorway scheme.

Search engines like Bing and Google don’t mind listing more than 1 page from a host for a given query. They only mind listing useless junk that doesn’t improve their search results.

It’s okay to publish duplicate content, too. It may not appear in the search results but you won’t be penalized for it. It’s not holding you back in some obscure search algorithm’s computations. You can shoot yourself in the foot with (near-)duplicate content but more often I find people blasting their appendages into the void by trying to fix imaginary duplicate content problems.

The bottom line here is that even if you do have duplicate content that is just wasting everyone’s time and resources, your first option is always to improve it before you start chiseling away at what could be a perfectly good Website. The search engines are patient. They’ll wait for you to come up with a clever solution.
