You know duplicate content can have a negative effect on web site rankings. But how do you examine whether a particular web site exhibits this problem, and how do you mitigate or avoid it?
To begin, you can divide duplicate content into two main categories:
Duplicate Content as a Result of Site Architecture
Some examples of site architecture itself leading to duplicate content are as follows:
- Print-friendly pages
- Pages with substantially similar content that can be accessed via different URLs
- Pages with items that are extremely similar, such as a series of differently colored shirts in an e-commerce catalog having similar descriptions
- Pages that are part of an improperly configured affiliate program tracking application
- Pages with duplicate title or meta tag values
- Using URL-based session IDs
- Canonicalization problems
All of these scenarios are discussed at length in this chapter.
To look for duplicate content as a result of site architecture, you can use a “site:example.com” query to examine the URLs of a web site that a search engine has indexed. All major search engines (Google, Yahoo!, Bing) support this feature. Usually this will quickly reveal whether, for example, “print-friendly” pages are being indexed.
Google frequently places content it perceives as duplicate content in the “supplemental index.” This is noted at the bottom of a search engine result with the phrase “supplemental result.” If your web site has many pages in the supplemental index, it may mean that those pages are considered duplicate content, at least by Google. Investigate several pages of URLs if possible, and look for the aforementioned cases. Look especially at the later pages of results. It is extremely easy to create duplicate content problems without realizing it, so viewing your site from the vantage point of a search engine may be useful.
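Once you have copied a list of indexed URLs from a “site:” query, a short script can help spot several of the cases listed earlier. The following is a minimal Python sketch, not a definitive tool: it groups URLs that differ only by a www prefix, a trailing slash, or session and tracking parameters. The parameter names in SESSION_PARAMS and TRACKING_PARAMS, and the “print” heuristic, are assumptions for illustration and should be replaced with whatever your own site uses.

```python
from collections import defaultdict
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical parameter names; adjust them to match your own site.
SESSION_PARAMS = {"phpsessid", "sid", "sessionid", "jsessionid"}
TRACKING_PARAMS = {"aff", "affiliate_id", "ref"}
IGNORED_PARAMS = SESSION_PARAMS | TRACKING_PARAMS


def normalize(url):
    """Reduce a URL to a comparable form so equivalent URLs collide."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]  # treat www and non-www hosts as the same site
    # Drop session and tracking parameters, then sort what remains.
    query = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in IGNORED_PARAMS
    )
    path = parts.path.rstrip("/") or "/"  # ignore trailing slashes
    return urlunsplit((parts.scheme.lower(), host, path, urlencode(query), ""))


def group_duplicates(urls):
    """Map each normalized URL to the indexed URLs that collapse onto it."""
    groups = defaultdict(list)
    for url in urls:
        groups[normalize(url)].append(url)
    return {key: found for key, found in groups.items() if len(found) > 1}


def looks_print_friendly(url):
    """Crude heuristic: 'print' appears somewhere in the path or query."""
    parts = urlsplit(url)
    return "print" in (parts.path + "?" + parts.query).lower()
```

Any group returned by group_duplicates points to one page that is reachable through more than one indexed URL, and is a candidate for the fixes discussed in this chapter. The heuristic in looks_print_friendly will produce false positives (a URL containing “blueprint”, for instance), so treat its output as a list to review by hand.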
Duplicate Content as a Result of Content Theft
Content theft creates an entirely different problem. Just as thieves can steal tangible goods, they can also steal content, hence the name. It creates a similar problem for search engines, because they strive to filter duplicate content from search results, across different web sites as well, and will sometimes make the wrong assumption as to which instance of the content is the original, authoritative one.
This is an insidious problem in some cases, and can have a disastrous effect on rankings. CopyScape (copyscape.com) is a service that helps you find content thieves by scanning other pages for content similar to that of a given page. Sitemaps can also help by getting new content indexed more quickly, thereby removing the ambiguity as to who is the original author.
Unfortunately, fighting content theft is ridiculously time-consuming and expensive, especially if lawyers get involved. Doing so for all instances is probably unrealistic, and search engines generally do accurately assess who the original author is and display that version preferentially. In Google, the illicit duplicates are typically relegated to the supplemental index. However, it may be necessary to take action in the unlikely case that the URLs with the stolen content actually rank better than yours.
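If you already suspect a particular page of copying yours, you can get a rough measure of the overlap yourself. The following Python sketch compares two pieces of text using overlapping word sequences (“shingles”) and the Jaccard index. This is a common textbook technique chosen here for illustration; it is not a description of how CopyScape or any search engine works, and the function names and the default shingle size of 5 are assumptions.

```python
import re


def shingles(text, size=5):
    """Return the set of overlapping word sequences of the given size."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    if len(words) < size:
        # Text shorter than one shingle: use the whole text as one unit.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def similarity(text_a, text_b, size=5):
    """Jaccard index of the two shingle sets: 0.0 (disjoint) to 1.0 (same)."""
    set_a = shingles(text_a, size)
    set_b = shingles(text_b, size)
    if not set_a or not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)
```

A score near 1.0 suggests that one text was copied from the other almost verbatim, while a score near 0.0 suggests the pages share little more than individual words. The threshold at which a page counts as a copy is a judgment call and depends on the shingle size you choose.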
Excluding Duplicate Content
When you have duplicate content on your site, you can often remove it entirely by altering the architecture of the web site. But sometimes a web site has to contain duplicate content, most typically when the business rules that drive the web site require it.
To address this, you can simply exclude the duplicate content from the view of a search engine. Here are the two ways of excluding pages:
Using the Robots Meta Tag
This is addressed first, not because it’s universally the optimal way to exclude content, but rather because it has virtually no limitations as to its application. Using the robots meta tag, you can exclude any HTML-based content from a web site on a page-by-page basis. It is frequently the easier method to use when eliminating duplicate content from a preexisting site for which the source code is available, or when a site contains many complex dynamic URLs.
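To make the page-by-page idea concrete, the following Python sketch builds the robots meta tag for a page template. The tag itself, with its index/noindex and follow/nofollow values, is the standard robots meta tag; the function names robots_meta and render_head and the is_duplicate flag are hypothetical, introduced only for this illustration.

```python
from html import escape


def robots_meta(index=True, follow=True):
    """Build a robots meta tag from two yes/no choices."""
    directives = [
        "index" if index else "noindex",
        "follow" if follow else "nofollow",
    ]
    return '<meta name="robots" content="{}" />'.format(", ".join(directives))


def render_head(title, is_duplicate=False):
    """Render a minimal <head>; duplicate pages get a noindex tag."""
    lines = ["<head>", "<title>{}</title>".format(escape(title))]
    if is_duplicate:
        # Keep the page out of the index but let crawlers follow its links.
        lines.append(robots_meta(index=False, follow=True))
    lines.append("</head>")
    return "\n".join(lines)
```

For example, calling render_head("Blue Shirt (print version)", is_duplicate=True) places a tag with content="noindex, follow" in the head of the print-friendly page, while the primary version of the page is rendered without the tag and remains indexable.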