| Data Integration Improved in SQL Server 2005 |
| Nov. 29, 2004 |
SQL Server 2005 will ship with Integration Services, a new utility for building data-integration processes, such as gathering and aggregating data from operational databases to populate a data warehouse. Such processes are a key component of many companies’ business intelligence functions, which also typically include utilities for data analysis and reporting. Supporting such functions has been important for SQL Server—it has allowed Microsoft to position the product as more than just a data storage and management engine and helped it gain a foothold inside companies that use competing products (such as Oracle) for database management. Compared to the existing Data Transformation Services (DTS) utility it replaces, Integration Services promises simpler programming, better performance, and improved management capabilities. These improvements could entice SQL Server 2000 customers to upgrade to SQL Server 2005. In addition, Microsoft hopes to attract new customers with complex, enterprise-class data integration needs, whose options today are limited to costly third-party products such as Informatica’s PowerCenter. However, even experienced DTS developers will require some ramp-up time to take full advantage of Integration Services; furthermore, Integration Services is not fully backward-compatible with DTS, and Microsoft's planned migration tools will only work for some DTS packages. What Is Integration Services? A component of SQL Server 2005, Integration Services is a collection of development tools, run-time components, and management utilities that help businesses build and manage data-integration processes generically referred to as extract, transform, and load (ETL). ETL processes extract data from one or more data sources, transform or process those data (for example, casting them in a common format or correlating data elements from different sources), and then load the transformed data into a separate destination source. Perhaps the most important and common ETL process is the periodic update of data warehouses, special databases designed to aid business decision-making in organizations. (For more information on the ETL process and data warehouses, see the sidebar "ETL and Data Warehousing: A Complex Challenge".) Although Integration Services can be used separately from other SQL Server components, customers will typically combine it with Analysis Services (SQL Server’s OLAP and data mining component) and Reporting Services (a component for design and delivery of reports) to support data analysis, warehousing, ad-hoc reporting, and managed reporting. Together, these components form Microsoft’s business intelligence (BI) platform. (See the illustration "Business Intelligence in SQL Server 2005".) The components of this platform are data-source independent—they can work not only with SQL Server databases but also with Oracle and IBM DB2 databases, flat files, and a variety of other data sources. Consequently, many companies use SQL Server’s BI platform (or individual components of the platform) with database management systems other than SQL Server. Thus, improving SQL Server’s ETL capabilities and other components of the BI platform not only strengthens the BI platform for SQL Server database customers but also creates further sales opportunities among businesses using competing database management systems. Architecture Overhaul, New Tools Although Integration Services and past iterations of DTS support the same business processes (e.g., integrating and moving data among various systems), Integration Services is almost a complete rewrite of DTS and bears little architectural and technical resemblance to its predecessors. Among many changes, Integration Services includes a completely rewritten run-time engine for managing package-execution and data-transformation functions, new management utilities and facilities for event handling and logging, and a new development and debugging environment based on Visual Studio 2005. Combined, these changes make packages significantly easier to write, monitor, and manage and yield performance gains that allow a level of data-processing complexity that would have been prohibitive in DTS. (For an overview of Integration Services’ logical architecture and terminology, see the illustration "SQL Server 2005 Integration Services Architecture".) Improved Performance, Scalability Integration Services’ run-time engine shares almost no code with its predecessor, the DTS execution engine. The new engine, written in C++, emphasizes performance and scalability, central to which is a custom memory manager that is optimized for large data transfers and complex data transformations. For example, the Integration Services memory manager takes advantage of the increasing availability (and decreasing) cost of computers with large amounts of memory to minimize disk I/O during data transfer and transformation steps. While customers can expect speedier raw throughput with Integration Services—with no intermediate data transformation, Microsoft has demonstrated gains on the order of 10% compared with DTS—performance gains become more pronounced as data transformation increases; Microsoft has suggested that Integration Services packages with data transformations could run upwards of five times faster than their DTS counterparts. These gains could redress many of the compromises frequently made in data warehousing applications. For example, to process enormous volumes of data in a reasonable period of time, companies sometimes sample only a fraction of records from their operational data sources or reduce their DTS packages to little more than bulk data transfer tools. Better data transformation performance and support for 64-bit servers (DTS did not support 64-bit) could also let companies push higher-level BI functions into the ETL process, saving time and improving database performance. For example, companies could insert data mining algorithms in the ETL process to spot important trends or conditions (for example, conditions that indicated credit card fraud) in a timelier manner. Such analysis is typically done today only after the time-consuming data warehouse loading process. In addition, memory-intensive operations (such as sorting large, complex datasets) are today typically done on the data warehouse database server following the ETL process. Such operations can tax the database server considerably, often interfering with users attempting to query the database. Integration Services' improved performance, coupled with its support for 64-bit servers, could allow such operations to be moved into the ETL stage, and thus alleviate possible contention. Better Management Tools Integration Services includes several updated features for managing and executing packages: New management service. Integration Services introduces a utility called the Integration Services Service for managing package storage and monitoring package execution. The service runs within the SQL Server 2005 Management Studio, a consolidated suite of management tools that replaces many of the stand-alone tools of previous SQL Server versions. (For example, the SQL Server Management Studio includes capabilities previously provided by SQL Server Enterprise Manager and Query Analyzer.) Although the service is similar to the DTS package administration features in SQL Server 2000’s Enterprise Manager, it offers several refinements. For instance, the service introduces a storage folder concept that will help administrators organize and group packages. Improved package recoverability. In the event of failure, DTS packages could be restarted near the point of failure. This avoided rerunning packages from scratch, often an unacceptable option in ETL processes with large volumes of data. However, implementing reliable package recovery required developers to write a significant amount of custom code (for example, to log and track package state and for intermediate storage of data). Integration Services allows developers to design for package recoverability with less custom code. For example, Integration Services introduces the concept of checkpoints, which can be inserted in packages to store state and thus define package recovery points. Additionally, a multicast data transformation allows developers to load multiple destinations with a package’s output, effectively providing a data backup mechanism in the event of a failure in a package’s primary data destination. Faster Development, Debugging A new integrated development environment, called the BI Development Studio, is based on Visual Studio 2005 and bears only superficial resemblance to the simple stand-alone designer that developers used to create DTS packages in SQL Server 2000. (The BI Development Studio also provides tools and utilities for building Analysis Services and Reporting Services projects.) The new environment gives developers more programming features, improved structure for package creation and layout, and additional built-in data transformations, which should speed package development and ease project collaboration. In addition, the Development Studio can be used to extend Integration Services and could fuel a secondary market for Integration Services add-ons (such as custom data transformations or adapters), thus presenting an opportunity for partners. (The interface used to build Integration Services projects is depicted in the illustration "BI Studio Eases Package Development".) Compared with the DTS designer, the new development environment offers the following improvements: Built-in support for basic programming constructs. Integration Services provides built-in support for fundamental programming constructs such as conditional branches and looping. Programming a loop in DTS, for example, required embedding scripts in a package that modified the workflow logic of the package as it executed, which was effectively a form of self-modifying code, an error-prone programming technique generally avoided by most developers. Native support for branching and looping eliminates the need to resort to dubious programming tactics and will yield packages that are more stable and easier to develop, read, and debug than their DTS counterparts. New data access and transformation capabilities. Like DTS, Integration Services can connect to various data sources. It also adds support for several new sources. For example, Integration Services allows developers to extract and manipulate data from XML sources, such as a large online product catalog stored as an XML document or a call-center tracking application that stores customer call logs in XML and provides access to them via a Web service. Perhaps the most significant addition developers will notice is the long list of built-in data-processing and transformation features included in Integration Services. Integration Services will contain more than 30 distinct data transformations compared to fewer than 10 DTS transformations, most of which were simple string-manipulation functions. New transformations cover a broad spectrum, from relatively simple to much more complex transformations, such as:
Having built-in, advanced data-transformation capabilities will reduce the amount of custom code developers must build (or buy from third parties) and reduce the development time for complex packages. Modular design and development. Integration Services packages can be organized, developed, and tested in a more logical, modular manner than those in DTS. For example, developers can design, develop, and debug package control-flow logic separately from data-flow logic. This means that setup tasks can be developed, tuned, and debugged independently from data transformation tasks, for instance, allowing these tasks to be developed concurrently by different developers and later integrated or made available for use in other packages. Along with support for features such as source control built into Visual Studio, these new features make Integration Services more amenable to collaborative project development than DTS—a key consideration for organizations building relatively large and complex ETL processes. Integrated debugging. Debugging capabilities in DTS were limited. Typically, a developer’s main option was to run packages (or parts of packages) against sample data sets and examine the results for errors or problems. In contrast, the BI Development Studio allows developers to run packages from within the development environment, which provides standard debugging features, such as breakpoints, and the ability to watch variables and step-through code. In addition, developers can plug data viewers into data flow tasks to view the results of data processing tasks as a package executes in the debugger; for example, row-set results can be displayed on grids or pairs of variables can be plotted against one another in a scatter plot. Broader extensibility. Although developers could extend DTS to include custom tasks and data transformations (to implement proprietary, high-speed pattern recognition or sorting algorithms, for example), developers were limited in their choice of language. For example, custom data transformations in DTS could only be coded in Microsoft C++. Furthermore, DTS did not let developers create custom data source adapters (which could be required to extract data from legacy systems or data sources with proprietary storage or logging formats, for instance). Integration Services developers can write managed-code tasks and transformations in the .NET language of their choice, which gives those developers the inherent stability and security advantages of managed code along with the advanced development environment, code editors, and debuggers built into Visual Studio .NET. Integration Services also introduces an API that lets developers create custom data adapters. Partners could also benefit: compared to DTS, Integration Services’ more fully realized set of APIs, and support for Visual Studio .NET and the .NET Framework, creates a considerably more capable platform on which to build add-ons and tools. Compatibility, Other Considerations While Integration Services could prove a major boon to companies using DTS today and will likely generate new sales opportunities for SQL Server, certain caveats apply: Expanded feature set could threaten some partners. Companies that offered alternatives to the relatively limited capabilities of DTS (such as Ascential Software, Business Objects, and Informatica) could see their market share erode as the more powerful and flexible Integration Services becomes generally available and evolves. As it has done with Reporting Services, Microsoft may release additional Integration Services features between major SQL Server releases; for example, Microsoft could release additional data-mining algorithms or data source adapters for Integration Services. Furthermore, while the extensibility of Integration Services offers opportunities for partners (to supply custom data transformations or adapters, for instance), partners pursuing that strategy should proceed cautiously. Microsoft will likely compete aggressively for any Integration Services add-on business that it believes has applicability across a range of vertical markets or represents a substantial opportunity within a specific vertical. Clustering, load balancing not supported. With the release of SQL Server 2005, Microsoft moves Integration Services closer to the large-scale ETL space occupied by products such as Informatica’s PowerCenter and Ascential Software’s DataStage. However, several limitations could challenge Integration Services’ ability to tackle extremely large-scale ETL problems. Specifically, a given Integration Services package cannot run on multiple servers for failover clustering or load-balancing purposes. Although Integration Services packages can be restarted from near a point of failure, and architects can design ETL processes that involve multiple packages running on a number of servers (for example, packages can trigger execution of other packages running on separate servers and pass messages to one another), the inability to run individual packages in a clustered configuration could dampen adoption of Integration Services among companies with very large-scale ETL needs. Major changes imply learning curve. Integration Services includes many features that were not available in DTS, and the tools and procedures for creating Integration Services packages differ from those used to create DTS packages. Although these differences will ultimately result in faster package development, even expert DTS developers should factor some ramp-up time into their early Integration Services project schedules. Furthermore, because of the scope of the rewrite—Integration Services is effectively a version 1.0 release—users could see a degree of instability. Upgrading from previous versions is nontrivial. DTS packages are not compatible with Integration Services. Microsoft will ship a migration wizard that could help make some DTS packages compatible with Integration Services, but many complex DTS packages will need to be rewritten. For example, certain DTS tasks and data connections have direct counterparts in Integration Services—DTS packages containing only such elements can be directly migrated to Integration Services. However, packages that contain self-modifying loops, for instance, will not run in Integration Services and will thus need to be rewritten. Similarly, custom DTS tasks must be rewritten to be used in Integration Services packages. To soften the blow of package incompatibility, Microsoft will do two things. First, it will include SQL Server 2000 DTS in the SQL Server 2005 product and users will have the option of installing DTS when upgrading to SQL Server 2005. This will allow users to upgrade a server to SQL Server 2005 without immediately having to migrate or rewrite their exiting DTS packages. Second, Integration Services will ship with a special task that executes DTS packages, which will effectively let developers call DTS packages from Integration Services packages. However, the advanced management, development, and debugging capabilities available to Integration Services packages will not be available to DTS packages. Availability and Resources Microsoft released the second beta of SQL Server 2005 in July 2004 and plans to release the product in the third quarter of 2005. (A third beta will be available in the first quarter of 2005.) Integration Services will be included with SQL Server 2005 at no additional charge and customers will be able to install and operate the Integration Services run-time components independently of the core SQL Server database engine. However, a SQL Server license is required for any server running Integration Services even if the core SQL Server database services are not installed. The SQL Server 2005 site is www.microsoft.com/sql/2005/. The SQL Server 2005 BI platform is described at www.microsoft.com/technet/prodtechnol/sql/2005/evaluate/dwsqlsy.mspx. New features for developers in SQL Server 2005 are outlined in "SQL Server Gets .NET, XML" on page 5 of the Jan. 2004 Update. BizTalk 2004 architecture, terminology, and use scenarios are covered in "BizTalk Server Engine Reworked" on page 3 of the Apr. 2004 Update. |