The Microsoft Search (MS Search) team creates technologies that perform
three main functions:
Crawling, the process by which a search engine locates and parses text in
different sources, such as file system information, HTML pages, e-mail stores, documents
stored on file shares, and databases
Indexing, in which each word encountered during crawling is arranged in
an index, where it's linked to its occurrences in any of the original sources
Ranking, the process of displaying a list of search results so that the
most relevant sources are displayed first.
Crawling
When first building an index, a search engine performs a full crawl, a
fairly slow process in which information is gathered from every source that the user wants
included in the results. Subsequently, the engine conducts periodic incremental crawls,
in which it updates the index only for documents that have been updated or deleted. More
recent Microsoft products, such as SharePoint Portal Server (SPS) and MSN Desktop Search,
use adaptive crawling, which identifies material that is more likely to be updated
frequently and crawls only those sources likely to have been updated since the last crawl.
This is quicker than incremental crawling, which checks the time stamp for every source in
the index.
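The difference between these crawl types can be sketched in a few lines of Python. This is a minimal illustration rather than MS Search code: the `Index` class, the use of a modification time per source, and the `min_rate` threshold in `adaptive_candidates` are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Index:
    # Hypothetical bookkeeping: source path -> modification time seen last crawl.
    seen: dict = field(default_factory=dict)


def incremental_crawl(index, sources):
    """Compare every source's time stamp with the one recorded in the index.

    `sources` maps each source path to its current modification time.
    Returns (sources to re-index, sources to remove from the index).
    """
    changed = [path for path, mtime in sources.items()
               if index.seen.get(path) != mtime]
    deleted = [path for path in index.seen if path not in sources]
    index.seen = dict(sources)
    return sorted(changed), sorted(deleted)


def adaptive_candidates(change_counts, crawl_count, min_rate=0.5):
    """Pick only sources that changed in at least `min_rate` of past crawls.

    This heuristic is an assumption; the article does not say how adaptive
    crawling decides which sources are likely to have been updated.
    """
    return sorted(path for path, changes in change_counts.items()
                  if changes / crawl_count >= min_rate)


index = Index()
# With an empty index, the first pass behaves like a full crawl.
first = incremental_crawl(index, {"a.doc": 1, "b.htm": 1})
# Later passes report only what changed or disappeared.
second = incremental_crawl(index, {"a.doc": 2, "c.txt": 1})
# An adaptive pass would check only the frequently changing source.
likely = adaptive_candidates({"news.htm": 9, "archive.doc": 1}, crawl_count=10)
```

Note that `incremental_crawl` still touches every source to read its time stamp, which is the cost that adaptive crawling avoids by narrowing the candidate list first.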
Two important technologies used in crawling are protocol handlers and
filters (also called IFilters).
Protocol handlers enable a product to access data over a particular
protocol or in a particular type of store. Commonly used protocol handlers include the
file protocol, the Messaging API used by Exchange (MAPI), Hypertext Transfer Protocol
(HTTP), and Web Distributed Authoring and Versioning (WebDAV, an extension to HTTP that
allows information accessed over HTTP to behave similarly to material accessed on a local
file server).
IFilters extract textual information from particular document
formats. SPS 2003, which is the Microsoft business product containing the most recent
version of MS Search technology, has built-in IFilters for Office, HTML, Exchange, Lotus,
and many other types of data. All filters are written to a public API, enabling third
parties to create their own: Adobe offers an IFilter for PDF files, and Corel offers one
for WordPerfect files.
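The division of labor between the two can be sketched as a pair of lookup tables: one keyed by protocol, one keyed by document format. The Python below is only an analogy and does not reproduce the actual protocol handler or IFilter APIs; the in-memory stores, the addresses, and every function name are hypothetical.

```python
import posixpath
import re
from urllib.parse import urlparse

# Hypothetical in-memory stores standing in for a file share and a Web server.
FILE_STORE = {"/docs/readme.txt": b"Plain text about search"}
WEB_STORE = {"/index.htm": b"<html><body><h1>Search</h1> basics</body></html>"}

# Protocol handlers: one function per protocol, each returning raw bytes.
PROTOCOL_HANDLERS = {
    "file": lambda url: FILE_STORE[url.path],
    "http": lambda url: WEB_STORE[url.path],
}


def filter_text(data):
    """Plain text needs no processing beyond decoding."""
    return data.decode("utf-8")


def filter_html(data):
    """Strip tags and collapse whitespace, keeping only the readable text."""
    without_tags = re.sub(r"<[^>]+>", " ", data.decode("utf-8"))
    return " ".join(without_tags.split())


# Filters: one function per document format, each returning extracted text.
FILTERS = {".txt": filter_text, ".htm": filter_html}


def crawl_one(address):
    """Fetch a source with the right handler, then extract its text."""
    url = urlparse(address)
    data = PROTOCOL_HANDLERS[url.scheme](url)
    extension = posixpath.splitext(url.path)[1]
    return FILTERS[extension](data)
```

Supporting a new store or a new file format then means registering one more entry in the relevant table, which mirrors how a third party can add a filter without changing the crawler itself.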
Indexing
After gathering information, the searching application prepares an index
in which each word is mapped back to every occurrence in the sources. This has several
benefits:
By checking an index, the searching application need not open and inspect
documents every time a search is conducted, a slow process
Search result relevance can be ranked using statistics and probabilistic
formulas (see the "Ranking" section below for more details).
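A structure of this kind is commonly called an inverted index. The sketch below shows the idea under simplifying assumptions: words are split on white space only, and the document names and the `build_index` and `search` helpers are invented for the example.

```python
from collections import defaultdict


def build_index(documents):
    """Map each word to the (document, position) pairs where it occurs."""
    index = defaultdict(list)
    for name, text in documents.items():
        for position, word in enumerate(text.lower().split()):
            index[word].append((name, position))
    return index


def search(index, word):
    """Answer a one-word query from the index alone, without opening documents."""
    return sorted({name for name, _ in index.get(word.lower(), [])})


docs = {"memo.doc": "Search ranks results", "faq.htm": "How search works"}
index = build_index(docs)
hits = search(index, "Search")
```

Because each entry records where a word occurs and how often, the same structure also supplies the counts that ranking formulas rely on.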
Important technologies used in indexing include word-breaking,
which lets the index builder recognize individual words within text, and word stemming,
which lets the index builder recognize related forms of a word (e.g., run, running, ran).
These technologies must be applied to each different language supported by a searching
product: SPS 2001 supported 13 languages, SPS 2003 added four more, and SQL Server 2005 will
support 23.
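A toy version of both steps for English shows why the work has to be repeated per language. The rules below (a regular expression for word-breaking, three suffixes, and a one-entry table of irregular forms) are deliberately naive assumptions and are far simpler than a production word breaker or stemmer.

```python
import re

# Irregular forms cannot be derived by stripping suffixes, so they are listed.
IRREGULAR = {"ran": "run"}


def break_words(text):
    """Split text into lowercase words, dropping punctuation."""
    return re.findall(r"[a-z0-9]+", text.lower())


def stem(word):
    """Reduce a word to a rough root form using a few English-only rules."""
    if word in IRREGULAR:
        return IRREGULAR[word]
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            root = word[:-len(suffix)]
            # Undo a doubled final consonant ("runn" -> "run"), but leave
            # roots such as "fall" or "pass" alone.
            if root[-1] == root[-2] and root[-1] not in "aeiouls":
                root = root[:-1]
            return root
    return word


stems = [stem(word) for word in break_words("Run, running; ran!")]
```

None of these rules carries over to a language that does not separate words with spaces or that inflects words differently, which is why each supported language needs its own word breaker and stemmer.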
Ranking
When users perform a search query, they expect the most relevant
information to appear first. Microsoft products apply two categories of ranking
technologies:
Algorithmic technologies use formulas to determine which sources are
most relevant to a particular query. These technologies are regularly updated and refined
by the MS Search team, with input from Microsoft Research. Because product groups
generally take snapshots of MS Search technology and customize it for their own needs, the
algorithms used by particular products vary.
For example, SPS 2003 looks at the following factors when computing
relevance:
Frequency across all indexed material, with rarer terms weighted more
heavily than more common ones; for example, if a user conducted a search on the words
"Ford Festiva" on an automotive Web site, SPS would weight the rare term
"Festiva" more heavily than the more common term "Ford" and would favor
sources that feature Festiva prominently
Term frequency, or how often a term is mentioned within a specific
document: the more times a searched term appears within a source, the higher that source
is ranked
Document length: a 100-word document that has five instances of a term is
ranked more highly than a 10,000-word document with five instances of the same term
Term position: a source in which the term appears in a prominent place
(such as the title) will be ranked more highly than a source in which it appears less
prominently.
In other cases, results are not ranked according to relevance; for
example, full-text search in Exchange groups results by conversation and arranges them by
date.
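To see how the four relevance factors described for SPS 2003 might interact, the sketch below combines them into a single score. It is not the formula SPS 2003 uses, which the source does not give; the logarithmic rarity weight, the `title_boost` value, and the document layout (a dictionary with `title` and `body`) are all assumptions chosen for illustration.

```python
import math


def tokens(doc):
    """All lowercase words in a document's title and body."""
    return (doc["title"] + " " + doc["body"]).lower().split()


def score(query_terms, doc, corpus, title_boost=2.0):
    """Combine rarity, term frequency, document length, and term position."""
    words = tokens(doc)
    title_words = doc["title"].lower().split()
    total = 0.0
    for term in query_terms:
        term = term.lower()
        containing = sum(1 for other in corpus if term in tokens(other))
        # Rarity: terms found in fewer documents get a larger weight.
        rarity = math.log((1 + len(corpus)) / (1 + containing)) + 1.0
        # Term frequency divided by length, so short documents are favored.
        frequency = words.count(term) / max(len(words), 1)
        # Position: a term that appears in the title counts for more.
        position = title_boost if term in title_words else 1.0
        total += rarity * frequency * position
    return total


corpus = [
    {"title": "Ford Festiva review", "body": "the festiva is a small ford"},
    {"title": "Ford trucks", "body": "ford builds trucks and ford sells trucks"},
    {"title": "Ford history", "body": "ford was founded in detroit"},
]
ranked = sorted(corpus, key=lambda d: score(["Ford", "Festiva"], d, corpus),
                reverse=True)
```

In this small corpus every document mentions "ford", so that term contributes little to separating the results, while the document that mentions "festiva" in both title and body rises to the top of `ranked`.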
Managed or editorial results. Other technologies allow
administrators or data owners to manually enhance results for particular search queries.
For example, SPS 2003 has a thesaurus that lets administrators define synonyms for a
common term (e.g., "WinXP" for "Windows XP") and a Best Bets feature
that lets an administrator specify particular sources that will appear in a special
category at the top of the results page.
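Both features amount to lookup tables that an administrator maintains alongside the index. The following sketch assumes a simple dictionary for each; the table contents, the intranet address, and the function names are hypothetical and do not reflect how SPS 2003 stores or applies this data.

```python
# Administrator-defined synonyms: canonical term -> equivalent spellings.
SYNONYMS = {"windows xp": {"winxp"}}

# Administrator-defined Best Bets: canonical term -> hand-picked sources.
# The address below is a made-up placeholder.
BEST_BETS = {"windows xp": ["http://intranet/xp-deployment-guide"]}


def canonical(query):
    """Normalize a query and map any known synonym to its canonical term."""
    normalized = " ".join(query.lower().split())
    for term, synonyms in SYNONYMS.items():
        if normalized == term or normalized in synonyms:
            return term
    return normalized


def expand(query):
    """Return every spelling that should be looked up for this query."""
    term = canonical(query)
    return sorted({term} | SYNONYMS.get(term, set()))


def results_page(query, ranked_results):
    """Place Best Bets in their own category ahead of the ranked results."""
    return {
        "best_bets": BEST_BETS.get(canonical(query), []),
        "results": list(ranked_results),
    }


page = results_page("WinXP", ["setup-notes.doc"])
```

Keeping the Best Bets separate from the ranked list, rather than merging them into it, matches the description of a special category at the top of the results page.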