Latent semantic indexing adds an important step to the document indexing process.
A look at the history of LSI shows that the term came to prominence in 2003, when Google purchased a company called Applied Semantics, whose software technology extracted and organized information from websites in a manner similar to the way a human might. The purpose was to help Google match AdSense advertisers with the appropriate web pages on which their ads could be shown. AdSense matched keywords on a page to keywords in the ads, so that a website owner could earn money for every click an ad shown on his site received. A problem soon arose, however: millions of pages were being generated simply to contain relevant long-tail keyword phrases, in order to capture traffic from Google that resulted in profitable clicks on the ads. Content on these machine-generated pages was virtually non-existent, and it was very difficult for a person doing a search to spot the spam. Google's problem at the time was that it could not differentiate between these generated pages and sites that actually contained valuable content.
Other techniques were used as well, such as keyword stuffing and reciprocal linking strategies. Reciprocal linking, however, was soon discounted by Google, and partially by Yahoo and MSN, when one-way inbound links became more important to rankings than before.
In addition to recording which keywords a document contains, LSI examines the document collection as a whole to see which other documents contain some of those same words. It considers documents that have many words in common to be semantically close, and documents with few words in common to be semantically distant. This simple method correlates well with how a human being, looking at the content, might classify a document collection. Although the LSI algorithm understands nothing about what the words mean, the patterns it notices can make it seem intelligent.
When you search an LSI-indexed database, the search engine looks at similarity values it has calculated for every content word, and returns the document that best fits the query. Because two documents may be semantically very close even if they do not share a particular keyword, LSI does not require an exact match to return useful results. Where a normal keyword search will fail if there is no exact match, LSI will return relevant documents that don't contain the keyword at all.
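To make this concrete, here is a minimal sketch of the idea in Python. It is not Google's implementation; it simply builds a small term-document matrix, reduces it with a truncated singular value decomposition (the standard mathematical machinery behind LSI), and scores a query against the documents in the reduced latent space. The toy corpus, the rank k=2, the stop list, and the function name are all illustrative assumptions. Notice that the query "automobile" still scores the car document well, even though that document never uses the word, because "car" and "automobile" appear in the same context ("driven", "road") elsewhere in the collection.

```python
import numpy as np

# Toy corpus -- entirely made up for illustration.
docs = [
    "the car is driven on the road",
    "the truck is driven on the highway",
    "an automobile is driven on the road",
]
stop_words = {"the", "is", "on", "an"}  # a tiny illustrative stop list

# Tokenise and drop stop words.
tokenised = [[w for w in d.split() if w not in stop_words] for d in docs]

# Build the vocabulary and the term-document count matrix A
# (rows = terms, columns = documents).
vocab = sorted({w for doc in tokenised for w in doc})
index = {w: i for i, w in enumerate(vocab)}
A = np.zeros((len(vocab), len(docs)))
for j, doc in enumerate(tokenised):
    for w in doc:
        A[index[w], j] += 1

# Truncated SVD: A ~= U_k S_k V_k^T. Keeping only k dimensions is what
# merges terms like "car" and "automobile" into shared latent concepts.
k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
Uk, sk, Vtk = U[:, :k], s[:k], Vt[:k, :]

def query_scores(words):
    """Fold a query into the latent space and cosine-score each document."""
    q = np.zeros(len(vocab))
    for w in words:
        if w in index:
            q[index[w]] += 1
    q_lat = (Uk.T @ q) / sk   # standard LSI query folding
    doc_lat = Vtk.T           # one latent vector per document
    return doc_lat @ q_lat / (
        np.linalg.norm(doc_lat, axis=1) * np.linalg.norm(q_lat) + 1e-12
    )

# "automobile" never appears in docs[0], yet docs[0] still scores highly
# because it shares its context ("driven", "road") with docs[2].
print(query_scores(["automobile"]))
```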
The method works by removing common stop words and building a database of the relevant keywords along with all of the words in the relevant documents. Another method involves an algorithm called multidimensional scaling, which takes the data and moves it around, calculating after each movement whether the projection has become more or less accurate.
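The sketch below illustrates that iterative "move and check" idea in the simplest possible form: each document starts at a random point in two dimensions, and the points are repeatedly nudged to reduce the stress, i.e. the mismatch between the low-dimensional distances and the original document-to-document distances. The distance matrix D, the step size, and the iteration count are illustrative assumptions; real multidimensional scaling implementations use more sophisticated optimizers than this plain gradient descent.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical pairwise document distances (symmetric, zero diagonal);
# in practice these would come from term overlap or cosine distance.
D = np.array([
    [0.0, 0.3, 0.9],
    [0.3, 0.0, 0.8],
    [0.9, 0.8, 0.0],
])

n = D.shape[0]
X = rng.normal(size=(n, 2))  # random initial 2-D positions
lr = 0.05                    # illustrative step size

for _ in range(500):
    # Current distances between the 2-D points.
    diff = X[:, None, :] - X[None, :, :]
    d = np.linalg.norm(diff, axis=2) + np.eye(n)  # eye avoids divide-by-zero on the diagonal
    # Move each point downhill on the stress  sum_ij (d_ij - D_ij)^2.
    grad = ((d - D) / d)[:, :, None] * diff
    X -= lr * grad.sum(axis=1)

print(X)  # documents with small original distances end up close together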
Since latent semantic indexing gives weight to related words in a site's content, spam sites that are simply stuffed with keywords and reciprocal links, without any substantive information, can actually be pushed down in the rankings. Web pages that are tightly focused on a single keyword or phrase may rank worse in search engines that use LSI. Inbound links that use various synonyms and tenses of a page's keywords, together with a wider variety of keywords and phrases throughout the content, work well with this more human-seeming approach to ranking. LSI is therefore an important factor to consider in reaching and maintaining a high rank in Google and other search engines.
Black Hat SEO Techniques
Black Hat search engine optimization is defined as the use of techniques that try to gain higher search rankings in an unethical manner.

Defining Pay-Per-Click Advertising
You may not be able to define pay-per-click easily, but you are probably aware of what sponsored links are. You will see these links on sites where you read articles, scan news clips, or even download free photos.