"Tahoe" Document Management Server in Beta
Nov. 20, 2000

Microsoft’s new document management server, code-named Tahoe and released in public beta in November, provides powerful document management and search and retrieval functions currently delivered in much more expensive and specialized software packages. Because it uses Web standards, Tahoe allows organizations to build intranet portals that let users access knowledge more effectively. Until recently, Microsoft applications were designed to help individuals create and manage their own documents; collaboration on Office documents was done mainly by saving them to file servers where others could access them. Tahoe aims to make publishing and retrieval of knowledge much easier and more flexible than ever before, providing organizations with much more value than additional Office bells and whistles. This article gives readers a first look at how Tahoe addresses knowledge management needs.

The Challenges of Knowledge Management

Organizations have historically struggled with getting accumulated knowledge into the hands of workers who need it. Even when someone captures important organizational knowledge in a document, letting others know of its existence and making it easily available to them is no simple task. In the paper-based world, organizations have had librarians, clerks, and assistants filing, categorizing, and cataloguing information. Without this costly and labor-intensive step, paper records were nearly useless. With the advent of personal computers and networks, document authors have largely taken over the filing role, storing their documents in a folder hierarchy on network servers. This practice raises four problems:

Difficulty enabling group collaboration on documents. In most organizations today, several people contribute to a document’s lifecycle. Documents often have creators, editors, reviewers, and approvers, and organizations usually have established processes that must be followed in publishing information to readers. Although all popular network operating systems support access control lists to limit who can read or write to a document, and can lock documents so that other people cannot make changes while they are open, users need more. They need to

  • See a document’s audit trail ("who did what, when?")
  • Know that authorized approval processes were followed before information was published
  • Be notified when changes have occurred in documents of interest or when they need to supply an approval
  • Be able to comment on a document even if they don’t have the authority to change it

Difficulty organizing and protecting information. File system folders comprise a single hierarchy, analogous to a paper filing system without cross-indexes. However, information has many attributes, such as

  • Who created it and when
  • Who last edited it and when
  • Who can read or edit it
  • The topic or subject
  • The kind of information it contains
  • The department it belongs to
  • Its version or revision number
  • Its stage of development (draft or final version)

Documents can be stored in a file hierarchy based on a few of these attributes, depending on how the authors intend readers to find and use the information. Unfortunately, different authors will have different ideas on how to organize their file hierarchies, which leads to inconsistencies within an organization.

An additional complication is that those same authors are rarely skilled at setting file and folder permissions such that the documents are safe from unauthorized viewing or editing.

Difficulty locating information. Although authors may know the standards and conventions of a file system hierarchy, readers are usually lost. Finding a document buried deep in a file system tree on an unspecified server is an onerous task, even when the searcher knows the name of the file he is seeking. Without the file name, the task becomes nearly impossible. Even if the searcher knows where to look, long filenames rarely fully describe a document, and users must often open many incorrect documents with applications like Word before finding the desired information. This situation is akin to going to a library but not having access to a card catalog.

As organizations increasingly share information on an intranet Web site, the primary means of locating information should also be through a well-known portal. Furthermore, some types of information should find the users who need it, without flooding them with unwanted notifications.

Difficulty setting up and managing information. Organizations want information to be easily accessible, but want to minimize the expense of cataloging and publishing information without returning to the days of file clerks. They especially want to minimize the time that high-salaried system administrators and Web developers spend on day-to-day content management tasks. Even with simple tools such as FrontPage, a corporate intranet site requires a skilled administrator to organize the file hierarchy and create menus or rebuild search indexes, making it impossible for individuals and teams to self-publish.

Microsoft has built some document management functionality into previous products, such as Office, Site Server 3.0, and Exchange (see sidebar "Earlier Document Management Efforts"), but none really fit the bill as an adequate knowledge management solution. Sophisticated third-party document management and Web publishing applications for the Windows NT/2000 platform are available from vendors such as FileNet and Documentum. These products have made major inroads in vertical markets such as the legal, pharmaceutical, and insurance industries, but their cost and complexity have prevented them from becoming mainstream. Although Tahoe will undoubtedly overlap those products to some extent, Microsoft is targeting Tahoe at more generic document management and collaboration needs, for organizations of any size, from small businesses on up.

Tahoe’s main functions make it easier for users to organize, share, and find document-centric information within their organization. Tahoe achieves this through a Web-accessible document database and a flexible document indexing and search system. Tahoe can index many types of content sources and document types throughout an organization’s servers, or even other Internet Web sites.

Tahoe Document Management and Publishing

For documents stored in the Tahoe document database, Tahoe provides many features that enhance document collaboration, security, version control, approval workflow, publishing, change notification, and document discussion. These features and the concepts on which they are based are explained below.

Core Concepts of Tahoe Document Management

Before discussing the features of Tahoe’s document management, readers should first understand a few key Tahoe concepts:

Document metadata. Tahoe works with many types of document content, not just Hypertext Markup Language (HTML) files. Although Office and many other applications embed information about some of a document’s properties inside the file itself, Tahoe stores the metadata (data that describes data) separately from the document. This Tahoe metadata includes standard attributes such as author, creation date, title, and keywords. But Tahoe offers more than fixed file properties, allowing administrators to assign custom metadata attributes to various classes of documents. Thus, the metadata relevant to a fax may be completely different from the properties of a legal pleading.

Folders. Tahoe folders, while similar in concept to file system folders, are used primarily to implement document security and policy rather than as a means of organizing or classifying documents. Nested folders automatically inherit the attributes of their parent folders, but the inheritance can be broken if the creator wants the policy of a subfolder to differ from that of the parent. The creator of a folder also designates it as either "standard" or "enhanced." Standard folders are intended for documents that do not require document control, such as editing, review, or approval, before being published. Enhanced folders are used when a more structured content collaboration and approval process is required.

One minor gap in Microsoft’s overall document management story is that Tahoe enhanced folders cannot support compound HTML documents (known as "thickets"), which consist of multiple files and folders and are produced when Office applications or Internet Explorer (IE) save content in HTML format. Microsoft has indicated that the next release of Office will be able to save an HTML document as a single file, at least resolving this problem for Office applications.

Roles. Tahoe deals with security in a simpler and more intuitive way than by using native NT/Windows 2000 file and directory permissions. Users needing access to documents in a folder can be assigned to four possible roles for that folder: coordinator, author, reader, or approver. Coordinators are essentially Tahoe administrators for the folder. Authors can create and edit documents, but readers can access only documents that have been published. Approvers, if assigned, must okay a folder’s documents before they can be published. Coordinators need not deal with the intricacies of editing folder and document access control lists. They only need to map roles in Tahoe folders to NT or Windows 2000 user or group accounts, and Tahoe takes care of the low-level security details.
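The role model described above can be sketched in a few lines of Python. This is an illustration only, not Tahoe's API: the Role enum, the FOLDER_ROLES mapping, the account names, and the helper functions are all hypothetical. Two points are assumptions rather than statements from this article: that coordinators can edit documents, and that approvers can open unpublished documents in order to review them.

```python
from enum import Enum


class Role(Enum):
    COORDINATOR = "coordinator"
    AUTHOR = "author"
    APPROVER = "approver"
    READER = "reader"


# Hypothetical mapping of Windows group accounts to roles for one folder.
FOLDER_ROLES = {
    "CORP\\LegalAdmins": Role.COORDINATOR,
    "CORP\\LegalStaff": Role.AUTHOR,
    "CORP\\Partners": Role.APPROVER,
    "CORP\\Everyone": Role.READER,
}


def roles_for(user_groups):
    """Return the set of roles a user holds through group membership."""
    return {role for group, role in FOLDER_ROLES.items() if group in user_groups}


def can_edit(user_groups):
    """Authors may create and edit; coordinators are assumed to as well."""
    return bool(roles_for(user_groups) & {Role.AUTHOR, Role.COORDINATOR})


def can_view(user_groups, published):
    """Readers see only published documents; other roles also see drafts."""
    roles = roles_for(user_groups)
    if published:
        return bool(roles)
    # Assumption: approvers can open drafts in order to review them.
    return bool(roles & {Role.AUTHOR, Role.COORDINATOR, Role.APPROVER})
```

The point of the sketch is that a coordinator edits only the small group-to-role table; every permission decision is then derived from the role rather than from per-file access control lists.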

Profiles. Tahoe allows administrators to create custom document profiles, which define what type of metadata is associated with documents using that profile. Each folder is assigned one or more profiles that can be used by documents stored in that folder. When a user first brings a document into the Tahoe database, she must assign it one of the folder’s available profiles. For example, a profile for a news article might contain a metadata field indicating which section the article belongs to, e.g., business, editorial, entertainment, or sports. Users can later use a profile’s metadata fields to search for relevant documents.
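As a rough sketch of the idea, a profile can be modeled as a set of metadata fields, some of which accept only certain values. Everything named here (NEWS_ARTICLE_PROFILE, profile_problems, search) is hypothetical and is not part of Tahoe; the sketch also assumes that every field in a profile is mandatory, which the beta may or may not enforce.

```python
# Hypothetical profile: field name -> set of allowed values, or None for free text.
NEWS_ARTICLE_PROFILE = {
    "title": None,
    "author": None,
    "section": {"business", "editorial", "entertainment", "sports"},
}


def profile_problems(metadata, profile):
    """Return a list of problems; an empty list means the metadata fits."""
    problems = []
    for field, allowed in profile.items():
        value = metadata.get(field)
        if not value:
            # Assumption: every profile field must be filled in.
            problems.append(f"missing field: {field}")
        elif allowed is not None and value not in allowed:
            problems.append(f"invalid value for {field}: {value}")
    return problems


def search(documents, field, value):
    """Return the documents whose metadata field equals the given value."""
    return [doc for doc in documents if doc.get(field) == value]
```

For example, calling search(documents, "section", "business") would return only the documents whose section field was set to business when they were profiled.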

Categories. Categories provide a method for classifying, organizing, describing, and browsing documents. Unlike other metadata attributes, categories are hierarchical and are not tied to a profile or a folder. The administrator can define any number of levels within the category structure. A document may belong to multiple categories, and even to a parent and child category at the same time. Categories provide a much more flexible means of classifying documents than placing them into a rigid hierarchical folder structure.

Tahoe’s Category Assistant can be "trained" from a sample of hand-categorized documents to automatically categorize documents imported in bulk into Tahoe. Coordinators and authors can later re-categorize any document that the Category Assistant improperly classified.

Best bets. The coordinator of a folder can assign certain "best bet" categories or keywords to all the documents in a folder. Best bets give extra weight to documents matching best bet categories or keywords in searches, and they are marked with a special icon at the beginning of search results. Authors or editors can assign or change a document’s categories or keywords, but they cannot assign best bets to a folder.

Document Management and Workflow Features

Tahoe uses the concepts described earlier to implement the following document management and workflow features:

Document browsing and searching. Tahoe allows users to browse documents stored in its Web Store by folder or by category. It also allows them to search on either a document’s profile metadata or on the textual content of the document itself. This search and retrieval is very fast, because the full content of any document that Tahoe can read is automatically indexed (more on this later).

Check-out/check-in. Tahoe provides a check-out/check-in feature that helps make the document revision process more organized and manageable. Authors wanting to edit documents stored in enhanced folders must first check them out. When a document is checked out, only the person who checked it out can edit it. The document may remain in the Web Store during editing, or it can be copied locally to the user’s PC. Other authors can see who has the document checked out and can open a read-only copy. When editing is complete, the document must be checked in before another author can edit it or it can be published.
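The check-out rule amounts to a single-holder lock on each document. The following minimal sketch shows that logic; the ManagedDocument class and CheckOutError exception are hypothetical names chosen for illustration and do not correspond to any Tahoe interface.

```python
class CheckOutError(Exception):
    """Raised when a check-out or check-in request is not allowed."""


class ManagedDocument:
    def __init__(self, name):
        self.name = name
        self.checked_out_by = None  # None means the document is checked in

    def check_out(self, user):
        """Give the user the exclusive right to edit, if nobody else holds it."""
        if self.checked_out_by is not None:
            raise CheckOutError(
                f"{self.name} is checked out by {self.checked_out_by}"
            )
        self.checked_out_by = user

    def can_edit(self, user):
        """Only the person who checked the document out may edit it."""
        return self.checked_out_by is not None and self.checked_out_by == user

    def check_in(self, user):
        """Release the check-out so another author can take it."""
        if self.checked_out_by != user:
            raise CheckOutError(
                f"{user} does not hold the check-out for {self.name}"
            )
        self.checked_out_by = None
```

Because the holder's name is kept on the document, other authors can see who has it checked out, which matches the behavior described above.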

Publishing. When a document is published, it is available for readers to view and will appear in their search results; until then only authors and coordinators can pull it up in searches. Published documents may also be edited and then re-published.

Versioning. When more than one individual works on a document, keeping track of a document’s change history is vital to avoid confusion and to protect prior revisions from being lost. Unlike the versioning system in Word, Tahoe introduces versioning that can be used with any type of document. Every time someone edits or publishes a document, it gets a new version number. Tahoe uses a dotted notation where the number to the left of the decimal point is the major version and the number to the right of the decimal point is the minor version. Each publishing cycle increments the major version number, and each checkout cycle increments the minor number. Document properties show the complete version history, and authors can open up read-only copies of earlier versions. In the first release of Tahoe, rolling back to an earlier version requires saving it as a new document.

Because Tahoe stores complete copies of each version rather than just the changes, documents with many versions can have sizeable storage requirements. Earlier versions cannot be pruned to conserve disk space. Unfortunately, the Tahoe client does not override Word’s native versioning system, which can confuse Word users if both versioning systems are used simultaneously. Microsoft plans to address these limitations in the next versions of Office and Tahoe.
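The dotted numbering scheme can be illustrated with a short sketch. The Version class is hypothetical, and two details are assumptions that this article does not confirm: that a new document starts at 0.0, and that publishing resets the minor number to zero.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Version:
    major: int = 0  # incremented by each publishing cycle
    minor: int = 0  # incremented by each check-out/check-in cycle

    def after_check_in(self):
        """A completed check-out cycle bumps the minor number."""
        return Version(self.major, self.minor + 1)

    def after_publish(self):
        """A publishing cycle bumps the major number.

        Assumption: the minor number resets to 0 on publish.
        """
        return Version(self.major + 1, 0)

    def __str__(self):
        return f"{self.major}.{self.minor}"
```

Under these assumptions, a document that is edited twice, published, and then edited once more would move through 0.1, 0.2, 1.0, and 1.1.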

Approval workflow. A document folder can be configured to require one or more approvals before any of its documents are published. The approval workflow can be in series or in parallel, depending on how the folder is configured. Approvers are sent e-mail to notify them of documents that need their review and approval.
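The difference between the two routing modes comes down to who is asked for approval at any given moment. In the sketch below, the pending_approvers function and the mode names are hypothetical, and the sketch assumes that every listed approver must sign off before the document is published in either mode.

```python
def pending_approvers(approvers, approved, mode):
    """Return the approvers who should currently be asked to review.

    approvers: ordered list of approver names for the folder
    approved:  set of names that have already approved
    mode:      "serial" (one at a time, in order) or "parallel" (all at once)
    """
    remaining = [name for name in approvers if name not in approved]
    if mode == "serial":
        return remaining[:1]  # only the next person in line
    if mode == "parallel":
        return remaining  # everyone who has not yet responded
    raise ValueError(f"unknown mode: {mode}")


def ready_to_publish(approvers, approved):
    """Assumption: publication requires approval from every approver."""
    return all(name in approved for name in approvers)
```

In serial mode each approver is notified only after the previous one has signed off, while in parallel mode all notifications go out together and the slowest reviewer determines when the document can be published.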

Subscriptions. Tahoe can alert readers, by e-mail or a notice on the Tahoe portal, whenever documents in a folder or a category change. Users administer their own subscriptions to such alerts.

Discussions. Tahoe supports discussions about documents, similar to discussions hosted by Office Server Extensions. Discussions are comparable to Word’s "comments" in that they allow multiple reviewers to offer input on a document. However, discussions differ from Word comments in the following ways:

  • They are stored separately from the document, and thus are recorded even if someone else has the document open.
  • They refer to the document as a whole, not a specific part of the document.
  • They are threaded, so that reviewers can have a more orderly dialogue about the document.
  • Reviewers do not need write permission for the document to add a discussion entry.

Limitations of Tahoe Document Management

Although the Tahoe portal could be the sole intranet site for a small or medium company, it is not intended to replace large, sophisticated intranet Web farms. In its first edition, Tahoe does not support any type of content replication and does not provide any means to publish and distribute documents to a farm of Tahoe servers. However, an intranet Web page can easily redirect users to a Tahoe portal site.

Tahoe’s workflow capabilities are very simple, limited to gaining approval before publishing a document. Administrators cannot define a sequence of editing steps or any other type of procedural logic. Organizations needing sophisticated workflow will need to use a third-party solution like FileNet.

Tahoe’s Search Functions Extend to Other Servers

Tahoe contains a powerful indexing and searching engine that is not limited to documents in the local Tahoe Web Store. Tahoe can also index and search the following content sources:

  • Web URLs (with anonymous access)
  • File shares (on any network operating system to which the Tahoe indexing server can connect)
  • Exchange 5.5 and Exchange 2000 public folders
  • Lotus Notes version 4.6a and R5 databases
  • Other Tahoe servers

Although most of these other sources do not have Tahoe metadata, like categories, Tahoe can index the full text of their content and extract any embedded file property attributes of these sources. The search engine can be configured to crawl these sources and build indexes used in searching. These indexes are not stored in the Web Store, but reside in a separate database.

Building indexes from a variety of source formats is more complex than one may think. Tahoe includes two indexing components that make this possible: IFilters and word breakers.

IFilters. IFilters read the document, remove formatting, and extract the text and any file property information embedded in the source file. Tahoe includes IFilters for the following:

  • Text files
  • Office documents (Word, Excel, PowerPoint)
  • HTML files
  • Tagged Image File Format (TIFF) files

Since TIFF files are graphics files, the TIFF IFilter normally just looks at the file properties. However, if the file is a fax or scanned textual document, the filter can perform optical character recognition (OCR) on it and parse and index the text.

WordPerfect and Adobe Acrobat (.PDF) IFilters should be available by ship time, and other application vendors can create IFilters to crack their document file formats. Microsoft has made the Tahoe SDK available on the Tahoe Web site (see System Requirements and Resources section for the Web address).

Word breakers. A word breaker is a component that determines where the word boundaries are in the stream of characters of the document being crawled. Because Tahoe must extract the root forms of the words and ignore "noise" words like "the" or "an", the word breaker does much more than parse text on the white spaces and punctuation marks. Tahoe must also handle documents in many different languages and be able to break each one down according to its own language rules. This is especially difficult with Asian languages that use Unicode characters, but between Tahoe’s own word breakers and those in the Windows 2000 Server Indexing Service, it can currently index documents in 13 languages, including Chinese and Japanese.
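To make the idea concrete, here is a deliberately naive English-only sketch that splits text into words and drops noise words. It is not how Tahoe's word breakers work: the NOISE_WORDS list and the index_terms function are invented for illustration, the sketch does no root-form extraction, and it would not work for languages that do not separate words with spaces.

```python
import re

# Hypothetical noise-word list; a real one would be much longer
# and specific to each language.
NOISE_WORDS = {"a", "an", "and", "of", "the", "to"}


def index_terms(text):
    """Split text into lowercase words and drop the noise words."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [word for word in words if word not in NOISE_WORDS]
```

Even this toy version shows why the index is smaller and more useful than the raw text: common words that carry no search value never reach it.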

Security issues also come into play when dealing with indexing and searching. The indexing service can only access sources for which the Tahoe indexing service account has read permissions. When accessing foreign file systems, like NFS on Unix, Tahoe’s credentials must be mapped to those of the foreign file system. Furthermore, when users perform searches, they must only see results for which they have read permissions. Tahoe is able to map most existing security schemes into the indexes so that only the appropriate results get returned to the user.
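The requirement that users see only results they are allowed to read can be pictured as a filter applied to the hit list before it is returned. The sketch below is only an illustration: the trim_results function, the shape of each hit, and the group names are hypothetical, and it assumes that read permissions were captured alongside each item when the index was built.

```python
def trim_results(hits, user_groups):
    """Keep only the hits the user may read, preserving the rank order.

    hits:        list of dicts with a "url" and a "readers" set of group names
    user_groups: set of group names the searching user belongs to
    """
    return [hit for hit in hits if hit["readers"] & user_groups]


# Hypothetical hit list, already ranked by the search engine.
SAMPLE_HITS = [
    {"url": "http://intranet/hr/salaries.doc", "readers": {"HR"}},
    {"url": "http://intranet/news/picnic.htm", "readers": {"Everyone"}},
    {"url": "http://intranet/legal/merger.doc", "readers": {"Legal", "Executives"}},
]
```

With this sample data, a user who belongs only to the Everyone group would get back just the picnic announcement, while a member of both Everyone and Legal would also see the merger document.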

Once Tahoe builds the initial indexes, administrators can configure it to use one of four update methods for each source:

Full updates. As the name implies, this method completely re-indexes all of the source’s content, but at the expense of time, network bandwidth, and indexing server load. Full updates may be scheduled to run on a periodic basis.

Incremental updates. Only new, changed, and deleted items are indexed, making incremental updates much faster than full updates. Periodic incremental updates may also be scheduled.

Adaptive updates. Adaptive updates use a statistical formula to improve performance of incremental updates by identifying content that is updated most frequently and scheduling the update to occur automatically.

Notification updates. If the source can notify Tahoe when any content is changed, Tahoe can immediately begin an update of the new or changed content. Notifications are available only for Tahoe servers and for file shares located on Windows 2000 or NT file system partitions.
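The saving from an incremental update comes from comparing what the index already holds with what the source holds now. The following sketch shows one simple way to work out that difference, using last-modified timestamps. The plan_incremental_update function and its data shapes are hypothetical, and the article does not say how Tahoe actually detects changes.

```python
def plan_incremental_update(indexed, current):
    """Work out which items an incremental crawl needs to touch.

    indexed: dict of item path -> last-modified time recorded in the index
    current: dict of item path -> last-modified time seen at the source now
    """
    new = sorted(path for path in current if path not in indexed)
    changed = sorted(
        path
        for path in current
        if path in indexed and current[path] != indexed[path]
    )
    deleted = sorted(path for path in indexed if path not in current)
    return {"new": new, "changed": changed, "deleted": deleted}
```

A full update, by contrast, would re-read every item in the source regardless of whether it had changed, which is why it costs more time, bandwidth, and server load.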

Tahoe Architecture and Scalability

Tahoe is built on Windows 2000’s built-in Web server, Internet Information Services (IIS), and on the Web Store technology first introduced in Exchange 2000 (see the sidebar "Web Storage System" and the illustration "Tahoe Architecture"). Tahoe is a standalone product that does not require or exploit Active Directory, Exchange 2000, or SQL Server, nor does it install an embedded version of SQL or the Microsoft Data Engine. It replaces the Windows 2000 Server’s Search service with an updated Tahoe-specific version. This lack of dependencies makes it easier for organizations to adopt Tahoe in a gradual manner; they can start at the departmental level without the need to make enterprise-scale commitments.

Each Tahoe server contains one or more "workspaces," which can be thought of as multiple virtual Tahoe servers residing on the same machine. Each workspace contains its own profiles, folders, categories, indexes, documents, security settings, and coordinators, allowing multiple departments to each have a workspace configured for their needs without requiring separate Tahoe servers. Depending on the hardware and processing loads, a single server could host up to 10 workspaces, but a single workspace cannot span multiple Tahoe servers.

Although Tahoe can index documents outside of the Web Store, and even across WAN links, the contents of source documents must be copied across the network to the Tahoe indexing server when it builds indexes. Many large documents and slow WAN links could dramatically slow indexing and saturate a network.

Indexing and searching are processing-intensive activities. As the numbers of users and documents grow, a single Tahoe server can quickly become overwhelmed. Fortunately, the Tahoe document management, indexing, and searching functions can each be offloaded to dedicated servers, allowing it to scale to support larger environments. Given sufficiently powerful hardware, Microsoft anticipates that a Tahoe server could service up to a thousand simultaneous users. However, at least in the first release, a single Tahoe site is not intended to support all of the knowledge management needs of a large enterprise, since it is not a truly distributed solution. Microsoft expects that large organizations needing enterprise-wide scalability will turn to third-party companies offering document management products that support content replication.

Tahoe Supports Multiple User Interfaces

Users access Tahoe documents in one of three ways: through Windows Explorer, from within Office, or with a browser accessing the Tahoe portal.

Windows Explorer. Once a user uses Network Neighborhood to map a Web Folder to a Tahoe workspace, he has access to all of Tahoe’s document management functions. Users can work with files in the Web Folder as though it were a normal folder: they can drag and drop files, double-click, and right-click on files in the folder. Web Folders are available to the "Open" and "Save As" dialog boxes in any Windows application.

The Tahoe search dialog box is not available in the standard Explorer and Start menu search options. Today, the only way to access Tahoe’s search engine is through the Tahoe portal.

Office 2000 applications. Once the Tahoe client software is installed on a client PC, Word, Excel, and PowerPoint display additional File menu entries that expose some of Tahoe’s document management functions. The next version of Office will have more extensive access to Tahoe, especially for searching.

Web browser to the Tahoe portal. Browser users get access to a special Tahoe portal Web page. Each Tahoe workspace hosts a Web portal that gives Tahoe readers, authors, and coordinators access to all of the workspace’s features and functions. The Tahoe portal will be Microsoft’s first product to ship utilizing the XML-based Digital Dashboard 2.0 technology. (For more information on Digital Dashboards, see System Requirements and Resources at the end of this article.) Digital Dashboard technology makes it simpler to construct intranet Web portals by snapping in pre-built "Web Part" components. The Tahoe portal comes standard with Web Parts that give users access to all of the Tahoe features and functions described earlier. It also includes additional Web parts that make it simple to add news, announcements, and quick links to other Web sites. The Digital Dashboard resource kit includes even more Web Parts that can create links to Exchange 2000, file shares, or SQL Server.

Management and Operations

Although folder coordinators do most of the day-to-day management activities in Tahoe, Tahoe administrators will have two important tasks: initial setup and routine backup.

Tahoe setup. Matching Tahoe’s configuration to the needs of an organization is critical to its success. The most difficult task for the Tahoe administrator will be creating the appropriate workspaces, document profiles, attributes, category hierarchies, document sources, approval workflows, and root-level folder role assignments. This cannot be done randomly and will take special knowledge of the organization’s document management needs. Once documents are linked to folders and profiles, the links cannot be changed without losing some metadata associated with the documents.

Backup and restore. Unlike Windows 2000 or NT file systems, the Tahoe Web Store and the index are databases and thus are always open. As with the Web Stores owned by Exchange 2000 servers, special backup and restore considerations come into play. Tahoe includes a backup script that dumps the Tahoe databases to a disk file, which can subsequently be safely copied by a conventional tape backup utility. Tahoe also includes a similar restore script that can restore the server back to its state at the time the backup script was run. It is likely that backup vendors such as CommVault or Veritas will offer Tahoe backup agents that leverage their Exchange 2000 solutions and automate the entire process.

In the beta version, restoring a Tahoe server is an all-or-nothing proposition. Although restoring the entire server is somewhat straightforward, there is no easy way to restore a single document or a document folder. The only option is to restore the complete Tahoe image to a different machine and copy the documents back into the original Tahoe server, losing the metadata in the process. Unresolved, this drawback seriously limits Tahoe’s viability as a file system replacement, but it is likely that Microsoft and backup vendors will have this issue addressed before the released product ships.

System Requirements and Resources

Tahoe will initially be released in six language versions: English, French, German, Italian, Japanese, and Spanish. Microsoft has not yet released pricing or licensing information and has not indicated whether Tahoe users will require a special Client Access License.

Server requirements. Although Active Directory is not a requirement, Tahoe does need Windows 2000 Server (with Service Pack 1) with at least 256 MB of memory. As the number of users and managed documents increases, memory needs will also grow. The beta does not run on Windows 2000 Advanced Server, and thus cannot use the Cluster Service as Exchange 2000 can. Microsoft has not indicated whether the released product will run on Advanced Server. The beta can be installed on an Exchange 2000 server, but this offers no particular benefit and imposes restrictions on de-installing either product.

Client requirements. If users need access to the Tahoe server from Windows Explorer or from within Office, Tahoe’s client-side software must first be installed on the user’s PC. This software requires either Windows 2000, Windows NT 4.0 SP6A, Windows Millennium Edition, or Windows 98. Browsers on Windows platforms must be IE 4.0 or higher or Netscape Navigator 4.7. Macintosh and Solaris users can access the Tahoe portal if they are running IE 5. All browsers must be configured to support JavaScript.

For more information on Tahoe and to obtain the Tahoe beta or SDK, see www.microsoft.com/servers/tahoe.

For more information on Digital Dashboard 2.0, see www.microsoft.com/DirectAccess/Products/SBS/CRK/files/digital_dashboard/CD.