While there are different ways to organize web content, every crawling search engine has the same basic parts:
Crawler (or Spider)
The crawler does just what its name implies: it scours the web, following links, updating pages, and adding new pages when it comes across them. Each search engine has periods of deep crawling and periods of shallow crawling. There is also a scheduler mechanism to prevent a spider from overloading servers and to tell the spider which documents to crawl next and how frequently to crawl them.

Rapidly changing or highly important documents are more likely to get crawled frequently. The frequency of crawling typically has little effect on search relevancy; it simply helps the search engines keep fresh content in their indexes. A popular, rapidly growing forum might get crawled a few dozen times each day, while a static site with little link popularity and rarely changing content might only get crawled once or twice a month. The biggest benefit of having a frequently crawled page is that you can get your new sites, pages, or projects crawled quickly by linking to them from a powerful or frequently updated page.
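As a toy illustration of the scheduling idea (a hypothetical sketch, not how any particular engine actually works), a priority queue keyed by next-due crawl time naturally makes fast-changing documents get recrawled more often than static ones:

```python
import heapq

class CrawlScheduler:
    """Toy crawl scheduler: documents that change often get shorter
    recrawl intervals, so they are fetched more frequently. A real
    scheduler would also enforce per-host politeness delays."""

    def __init__(self):
        self._queue = []  # heap of (next_due_time, url, interval)

    def add(self, url, interval, now=0.0):
        heapq.heappush(self._queue, (now, url, interval))

    def next_to_crawl(self):
        """Pop the most overdue URL and reschedule it one interval later."""
        due, url, interval = heapq.heappop(self._queue)
        heapq.heappush(self._queue, (due + interval, url, interval))
        return due, url

# Hypothetical URLs: a busy forum recrawled hourly, a static page monthly.
scheduler = CrawlScheduler()
scheduler.add("http://example.com/busy-forum", interval=1.0)
scheduler.add("http://example.com/static-page", interval=30.0)

# Simulate the next ten crawl slots; the forum dominates the schedule.
crawls = [scheduler.next_to_crawl()[1] for _ in range(10)]
```

Out of ten simulated crawl slots, the frequently changing forum is fetched nine times and the static page only once, mirroring the crawl-frequency behavior described above.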
The Index
The index is where the spider-collected data are stored. When you perform a search on a major search engine, you are not searching the web, but the cache of the web provided by that search engine’s index.
Reverse Index - Search engines organize their content in what is called a reverse index. A reverse index sorts web documents by words. When you search Google and it displays results 1-10 out of 143,000, it means that there are approximately 143,000 web pages that either contain the words from your search or have inbound links containing them. Also note that search engines do not store punctuation, just words.

Storing Attributes - Since search engines view pages from their source code in a linear format, it is best to move JavaScript and other extraneous code to external files to help move the page copy higher in the source code. Some people also use Cascading Style Sheets (CSS) or a blank table cell to place the page content ahead of the navigation. As far as how search engines evaluate which words come first, they look at how the words appear in the source code. I have not done significant testing to determine if it is worth the effort to make your unique page code appear ahead of the navigation, but if it does not take much additional effort, it is probably worth doing. Link analysis (discussed in depth later) is far more important than page copy to most search algorithms, but every little bit can help.

As well as storing the position of a word, search engines can also store how the data are marked up. For example, is the term in the page title? Is it a heading? What type of heading? Is it bold? Is it emphasized? Is it part of a list? Is it in link text? Words that are in a heading or are set apart from normal text in other ways may be given additional weighting in many search algorithms. However, keep in mind that it may look unnatural for your keyword phrases to appear many times in bold and headings without occurring anywhere in the regular body copy. Also, if a page looks like it is aligned too perfectly with a topic, it may get a lower relevancy score than a page with a lower keyword density and more natural copy.
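The reverse index and the stored attributes can be sketched together in a few lines. This is a toy illustration under assumed inputs (a made-up page model of heading plus body), not any engine's actual data structure:

```python
import re
from collections import defaultdict

def tokenize(text):
    # Lowercase and strip punctuation, keeping only word tokens --
    # mirroring the note that engines store words, not punctuation.
    return re.findall(r"[a-z0-9]+", text.lower())

def build_reverse_index(docs):
    """Map each word to postings of (doc_id, position, in_heading).
    `docs` is {doc_id: (heading, body)} -- a hypothetical page model."""
    index = defaultdict(list)
    for doc_id, (heading, body) in docs.items():
        heading_words = tokenize(heading)
        for pos, word in enumerate(heading_words):
            index[word].append((doc_id, pos, True))  # heading terms flagged
        for pos, word in enumerate(tokenize(body)):
            index[word].append((doc_id, len(heading_words) + pos, False))
    return index

docs = {
    "page1": ("Swimming Lessons", "Learn to swim at our pool."),
    "page2": ("Pool Maintenance", "Keep your pool clean."),
}
index = build_reverse_index(docs)
# "pool" appears in both pages; the page2 occurrence carries the
# in_heading flag, which a ranking function could weight more heavily.
```

Looking up a word returns every page containing it, along with position and markup attributes, which is the shape of data a relevancy algorithm works from.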
Search Interface
The search algorithm and search interface are used to find the most relevant documents in the index for a given search query. First the search engine tries to determine user intent by looking at the words the searcher typed in.

These terms can be stripped down to their root level and checked against a lexical database to see what concepts they represent. Terms that are a near match can help you rank for other similarly related terms. For example, using the word swims could help you rank well for swim or swimming.

Search engines can try to match keyword vectors with each of the specific terms in a query. If the search terms occur near each other frequently, the search engine may understand the phrase as a single unit and return documents related to that phrase.
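A crude sketch of treating frequently co-occurring terms as a unit might count adjacent word pairs across a corpus. This is a toy bigram counter with made-up example queries, not a real engine's phrase model:

```python
from collections import Counter

def frequent_bigrams(corpus, threshold=2):
    """Count adjacent word pairs across documents; pairs seen at least
    `threshold` times are treated as a single phrase unit."""
    counts = Counter()
    for doc in corpus:
        words = doc.lower().split()
        counts.update(zip(words, words[1:]))
    return {pair for pair, n in counts.items() if n >= threshold}

# Hypothetical mini-corpus: "new york" co-occurs often enough to
# be recognized as one phrase; other pairs appear only once.
corpus = [
    "new york hotels",
    "cheap new york flights",
    "york minster history",
]
phrases = frequent_bigrams(corpus)
```

Once a pair clears the threshold, a query containing it can be matched against documents about the phrase as a whole rather than the two words independently.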
WordNet is the most popular lexical database. At the end of this chapter there is a link to a Porter Stemmer tool if you need help conceptualizing how stemming works.
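The Porter Stemmer mentioned above is considerably more elaborate, but a deliberately simplified suffix-stripper gives a feel for how terms get reduced to a common root (a toy sketch, not the actual Porter algorithm):

```python
def stem(word):
    """Toy stemmer: strip a common suffix, then collapse a doubled
    final consonant (so 'swimming' -> 'swimm' -> 'swim'). The real
    Porter algorithm applies many more rules in ordered steps."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[: -len(suffix)]
            break
    if len(word) >= 4 and word[-1] == word[-2] and word[-1] not in "aeiouls":
        word = word[:-1]
    return word

# Both forms reduce to the same root, so a page using one form
# can match a query using the other.
roots = [stem(w) for w in ("swims", "swimming", "swim")]
```

All three forms map to the root "swim", which is how a document using "swims" can rank for the query "swimming".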