Google Sets the Duplicate Content Issue Straight
A recent post at Google Webmaster Central made an official statement on duplicate content and how it can affect your site rankings. Here’s a brief rundown of practices Google says may hurt rankings without necessarily triggering a duplicate content penalty:
- Dynamically generated pages whose query values appear in varying order. To avoid this, keep the variables you generate in a strict, consistent order.
- Pushing affiliate content with no formatting changes or alteration to the text.
- Copies of the same page in different locations on your own site, such as a particular page being available at both example.com/stuff/index.html and example.com/stuff.html.
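One way to keep query variables in a strict order is to normalize every link before it is written into a page. The snippet below is a minimal sketch of that idea using Python's standard library; the function name `canonical_url` and the example URLs are illustrative assumptions, not something taken from Google's post.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def canonical_url(url):
    """Return the URL with its query parameters sorted by name, then value."""
    parts = urlsplit(url)
    # parse_qsl keeps repeated keys; keep_blank_values preserves "?flag=" style parameters.
    params = sorted(parse_qsl(parts.query, keep_blank_values=True))
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(params), parts.fragment)
    )


# Two links that differ only in parameter order collapse to one address.
first = canonical_url("http://example.com/list?b=2&a=1")
second = canonical_url("http://example.com/list?a=1&b=2")
print(first == second)
```

Running every generated link through a helper like this means a crawler sees only one address for each combination of values.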
Google goes on to explain that it clusters duplicate content and gives most of the ranking attention to the page its algorithm believes to be the most informative. This is probably based on the text of a page and its location within the site architecture.
In practice, we at Apogee have found a number of duplicate content problems that resemble the ones Google describes, but that are an outgrowth of Google’s need for practicality, memory management, and speed. If there is too much code on your page, Google will only scan to a certain length, which can confuse the search engine about whether two pages are duplicates. Moving as much raw code as possible out to external files can help prevent this, and also improves scalability for your website. The way the search engine detects duplicate content is also a good reason to avoid delivering your unique content through something Google has trouble reading, such as Flash or JavaScript. One part of your site where Google will tell you about duplicate content issues is your meta tags. It’s always a good idea to tailor your keywords to the specific page, and likewise your description and title text.
So why does Google detect duplicate content the way it does? It comes down to the computational cost of determining what duplicate content is. Let’s start at the sentence level. As simple as it sounds, it’s actually very difficult to determine programmatically what a sentence is. From there, we would need to store every sentence we’ve seen and check each new sentence against this very large database of sentences. As we store more and more sentences, this check takes longer and longer. Breaking the problem down further, to the phrase level, requires parsing, which can be computationally intensive. Checking only whole paragraphs requires that they be demarcated with <p> tags, and it makes a match even less likely, because these large sections of natural language rarely repeat exactly. So Google scans a certain distance down the code of each page, checks whether it has ever seen exactly that set of code before, and groups the URL of your page with what it would consider the “type” associated with this chunk of code.
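The grouping described above can be illustrated with a short sketch. This is only a toy model of the behavior we have observed, not Google's actual algorithm: the scan limit of 2048 characters, the use of SHA-1, and the names `fingerprint` and `cluster_pages` are all assumptions made for illustration.

```python
import hashlib
from collections import defaultdict

SCAN_LIMIT = 2048  # hypothetical cutoff; the real scan length is not published


def fingerprint(html, limit=SCAN_LIMIT):
    """Hash only the first `limit` characters of a page's source."""
    prefix = html[:limit].encode("utf-8")
    return hashlib.sha1(prefix).hexdigest()


def cluster_pages(pages, limit=SCAN_LIMIT):
    """Group URLs whose scanned prefix is identical; return groups of two or more."""
    clusters = defaultdict(list)
    for url, html in pages.items():
        clusters[fingerprint(html, limit)].append(url)
    return [urls for urls in clusters.values() if len(urls) > 1]


# Two pages share a 3000-character template header, so their unique text
# falls past the cutoff and they land in the same group.
template = "<!-- inline script and styles -->" + "x" * 3000
pages = {
    "example.com/a.html": template + "<p>Unique text for page A</p>",
    "example.com/b.html": template + "<p>Unique text for page B</p>",
    "example.com/c.html": "<p>A short page with its own text</p>",
}
print(cluster_pages(pages))
```

In this model, a page whose first stretch of source is all shared template code looks identical to its siblings, which is why moving bulky code into external files lets the unique text appear before the cutoff.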
For most webmasters, this is splitting hairs a little bit. According to Google, if you’re writing content yourself and not leaning heavily on a content management system or a very large template for the design of your page, duplicate content shouldn’t be something that scares you. This means you don’t have to worry about using colloquialisms in your writing, or even about quoting large blocks of text from another source. But as with so much in optimization, the value of the content you produce is the most important multiplier in any on- or off-page effort to improve your rankings.













