Important Microsoft Search Technologies
Feb. 21, 2005
The Microsoft Search (MS Search) team creates technologies that perform three main functions:
Crawling. When first building an index, a search engine performs a full crawl, a fairly slow process in which information is gathered from every source that the user wants included in the results. Subsequently, the engine conducts periodic incremental crawls, in which it updates the index only for documents that have been updated or deleted. More recent Microsoft products, such as SharePoint Portal Server (SPS) and MSN Desktop Search, use adaptive crawling, which identifies material that is more likely to be updated frequently and crawls only those sources likely to have been updated since the last crawl. This is quicker than incremental crawling, which checks the time stamp for every source in the index.
Two important technologies used in crawling are protocol handlers and filters (also called IFilters). Protocol handlers enable a product to access data over a particular protocol or in a particular type of store. Commonly used protocol handlers include the file protocol, the Messaging API used by Exchange (MAPI), Hypertext Transfer Protocol (HTTP), and Web Distributed Authoring and Versioning (WebDAV, an extension to HTTP that allows information accessed over HTTP to behave similarly to material accessed on a local file server).
IFilters extract textual information from particular document formats. SPS 2003, which is the Microsoft business product containing the most recent version of MS Search technology, has built-in IFilters for Office, HTML, Exchange, Lotus, and many other types of data. All filters are written to a public API, enabling third parties to create their own: Adobe offers an IFilter for PDF files, and Corel offers one for WordPerfect files.
Indexing. After gathering information, the searching application prepares an index in which each word is mapped back to every occurrence in the sources. This has several benefits:
Important technologies used in indexing include word-breaking, which lets the index builder recognize individual words within text, and word stemming, which lets the index builder recognize related forms of a word (e.g., run, running, ran). These technologies must be applied to each different language supported by a searching product: SPS 2001 supported 13 languages, SPS 2003 added four more, and SQL Server 2005 will support 23.
Ranking. When users perform a search query, they expect the most relevant information to appear first. Microsoft products apply two categories of ranking technologies:
Algorithmic technologies use formulas to determine which sources are most relevant to a particular query. These technologies are regularly updated and refined by the MS Search team, with input from Microsoft Research. Because product groups generally take snapshots of MS Search technology and customize it for their own needs, the algorithms used by particular products vary. For example, SPS 2003, the Microsoft business product with the most recent version of MS Search technology, looks at the following factors when computing relevance:
In other cases, results are not ranked according to relevance. For example, full-text search in Exchange groups results by conversation and arranges them by date.
Managed or editorial results. Other technologies allow administrators or data owners to manually enhance results for particular search queries. For example, SPS 2003 has a thesaurus that lets administrators define synonyms for a common term (e.g., "WinXP" for "Windows XP") and a Best Bets feature that lets an administrator specify particular sources that will appear in a special category at the top of the results page.
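
To make the managed-results idea concrete, here is a minimal Python sketch of how a synonym thesaurus and a Best Bets table could sit on top of a simple word-to-document index. It is not how SPS 2003 is implemented: the function names, the sample documents, and the dictionary layouts are all hypothetical, the word-breaker is a naive regular expression, and the non-pinned results are ordered by document id rather than by relevance.

```python
import re
from collections import defaultdict


def break_words(text):
    """Naive word-breaker: lowercase alphanumeric runs (illustrative only)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in break_words(text):
            index[word].add(doc_id)
    return index


def expand_query(query, thesaurus):
    """Return the query plus any administrator-defined synonym phrases."""
    return [query] + list(thesaurus.get(query.lower(), []))


def match_all_words(index, phrase):
    """Ids of documents that contain every word of the phrase."""
    words = break_words(phrase)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


def search(query, index, thesaurus, best_bets):
    """Return (pinned, others): Best Bets first, then the remaining matches."""
    variants = expand_query(query, thesaurus)
    matched = set()
    for variant in variants:
        matched |= match_all_words(index, variant)
    pinned = []
    for variant in variants:
        for doc_id in best_bets.get(variant.lower(), []):
            if doc_id not in pinned:
                pinned.append(doc_id)
    # Remaining matches are sorted by id here; a real engine would rank them.
    others = sorted(matched - set(pinned))
    return pinned, others


# Hypothetical sample data, invented for this sketch.
docs = {
    "kb1": "Windows XP setup guide",
    "kb2": "WinXP driver notes",
    "kb3": "Windows Server overview",
    "portal": "Official Windows XP support home",
}
thesaurus = {"winxp": ["Windows XP"]}      # synonym defined by an administrator
best_bets = {"windows xp": ["portal"]}     # source pinned for this query

index = build_index(docs)
print(search("WinXP", index, thesaurus, best_bets))
# prints (['portal'], ['kb1', 'kb2'])
```

In this sketch the thesaurus widens the query before it reaches the index, while Best Bets bypasses matching entirely and places the administrator's chosen source in its own category ahead of the other results.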