Data Warehousing & Mining


Data Warehouse Architecture:

Architecture, in the context of an organization's data warehousing efforts, is a conceptualization of how the data warehouse is built. There is no right or wrong architecture. The worthiness of the architecture can be judged by how well the conceptualization aids in the building, maintenance, and usage of the data warehouse. One possible simple conceptualization of a data warehouse architecture consists of the following interconnected layers:

- Operational database layer: the source data for the data warehouse. An organization's ERP systems fall into this layer.
- Informational access layer: the data accessed for reporting and analysis, and the tools for reporting and analyzing data. BI tools fall into this layer, and the Inmon-Kimball differences about design methodology, discussed later in this article, have to do with this layer.
- Data access layer: the interface between the operational and informational access layers. Tools to extract, transform, and load data into the warehouse fall into this layer.
- Metadata layer: the data directory. This is usually more detailed than an operational system data directory. There are dictionaries for the entire warehouse and sometimes dictionaries for the data that can be accessed by a particular reporting and analysis tool.

Normalized versus dimensional approach for storage of data

There are two leading approaches to storing data in a data warehouse: the dimensional approach and the normalized approach. In the dimensional approach, transaction data are partitioned into "facts", which are generally numeric transaction data, and "dimensions", which are the reference information that gives context to the facts. For example, a sales transaction can be broken up into facts such as the number of products ordered and the price paid for the products, and into dimensions such as order date, customer name, product number, order ship-to and bill-to locations, and salesperson responsible for receiving the order. A key advantage of the dimensional approach is that the data warehouse is easier for the user to understand and to use. Also, the retrieval of data from the data warehouse tends to operate very quickly. The main disadvantages of the dimensional approach are: 1) in order to maintain the integrity of facts and dimensions, loading the data warehouse with data from different operational systems is complicated, and 2) it is difficult to modify the data warehouse structure if the organization adopting the dimensional approach changes the way in which it does business.

In the normalized approach, the data in the data warehouse are stored following, to a degree, Codd's normalization rules. Tables are grouped together by subject areas that reflect general data categories (e.g., data on customers, products, finance, etc.). The main advantage of this approach is that it is straightforward to add information into the database. A disadvantage is that, because of the number of tables involved, it can be difficult for users both to 1) join data from different sources into meaningful information and then 2) access the information without a precise understanding of the sources of data and of the data structure of the data warehouse. These approaches are not exact opposites of each other; dimensional approaches can involve normalizing data to a degree.
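To make the dimensional split concrete, the following minimal T-SQL sketch lays out the sales example above as one fact table surrounded by dimension tables. All table and column names are illustrative assumptions, not part of any particular product schema.

    -- Dimension tables hold the descriptive context for the facts.
    CREATE TABLE Dim_Customer (
        Customer_Id   int IDENTITY(1,1) PRIMARY KEY,
        Customer_Name nvarchar(100)
    );

    CREATE TABLE Dim_Product (
        Product_Id   int IDENTITY(1,1) PRIMARY KEY,
        Product_Name nvarchar(100)
    );

    CREATE TABLE Dim_Date (
        Date_Id  int PRIMARY KEY,        -- e.g. stored as yyyymmdd
        Day_Date smalldatetime
    );

    -- The fact table holds the numeric measures, keyed by its dimensions.
    CREATE TABLE Fact_Sales (
        Date_Id       int NOT NULL REFERENCES Dim_Date (Date_Id),
        Customer_Id   int NOT NULL REFERENCES Dim_Customer (Customer_Id),
        Product_Id    int NOT NULL REFERENCES Dim_Product (Product_Id),
        Units_Ordered int,
        Price_Paid    money
    );

Under the normalized approach, the same information would instead be spread over normalized tables grouped by subject area (customers, products, orders, and so on).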

Evolution in organization use of data warehouses

Organizations generally start off with relatively simple use of data warehousing. Over time, more sophisticated use of data warehousing evolves. The following general stages of use of the data warehouse can be distinguished:

- Off-line Operational Databases: Data warehouses in this initial stage are developed by simply copying the data of an operational system to another server where the processing load of reporting against the copied data does not impact the operational system's performance.
- Off-line Data Warehouse: Data warehouses at this stage are updated from data in the operational systems on a regular basis, and the data warehouse data is stored in a data structure designed to facilitate reporting.
- Real-time Data Warehouse: Data warehouses at this stage are updated every time an operational system performs a transaction (e.g., an order, a delivery, or a booking).
- Integrated Data Warehouse: Data warehouses at this stage are updated every time an operational system performs a transaction. The data warehouses then generate transactions that are passed back into the operational systems.

Fact table:

In data warehousing, a fact table consists of the measurements, metrics or facts of a business process. It is often located at the centre of a star schema, surrounded by dimension tables. Fact tables provide the (usually) additive values which act as independent variables by which dimensional attributes are analyzed. Fact tables are often defined by their grain. The grain of a fact table represents the most atomic level by which the facts may be defined. The grain of a SALES fact table might be stated as "Sales volume by Day by Product by Store".
Each record in this fact table is therefore uniquely defined by a day, product, and store. Other dimensions might be members of this fact table (such as location/region), but these add nothing to the uniqueness of the fact records. These "affiliate dimensions" allow for additional slices of the independent facts but generally provide insights at a higher level of aggregation (a region is made up of many stores).
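The stated grain can be enforced directly in the table definition. The sketch below, with invented column names, declares one row per day, product, and store; an affiliate attribute such as region would live in the store dimension rather than widen the key.

    CREATE TABLE Fact_Store_Sales (
        Date_Id      int NOT NULL,
        Product_Id   int NOT NULL,
        Store_Id     int NOT NULL,
        Sales_Volume int,
        -- The composite key states the grain: Sales volume by Day by Product by Store.
        CONSTRAINT PK_Fact_Store_Sales PRIMARY KEY (Date_Id, Product_Id, Store_Id)
    );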

A data warehouse dimension provides the means to "slice and dice" data in a data warehouse. Dimensions provide structured labeling information to otherwise unordered numeric measures. For example, "Customer", "Date", and "Product" are all dimensions that could be applied meaningfully to a sales receipt. A dimensional data element is similar to a categorical variable in statistics. The primary function of dimensions is threefold: to provide filtering, grouping and labeling. For example, in a data warehouse where each person is categorized as having a gender of male, female or unknown, a user of the data warehouse would then be able to filter or categorize each presentation or report either by filtering on the gender dimension or by displaying results broken out by gender.

Star Schema:

The star schema (sometimes referenced as star join schema) is the simplest style of data warehouse schema. The star schema consists of a few "fact tables" (possibly only one, justifying the name) referencing any number of "dimension tables". The star schema is considered an important special case of the snowflake schema.

Example

Consider a database of sales, perhaps from a store chain, classified by date, store and product. The schema is a star schema version of the sample schema provided in the snowflake schema article. Fact_Sales is the fact table and there are three dimension tables: Dim_Date, Dim_Store and Dim_Product. Each dimension table has a primary key on its Id column, relating to one of the columns of the Fact_Sales table's three-column primary key (Date_Id, Store_Id, Product_Id). The non-primary key Units_Sold column of the fact table represents a measure or metric that can be used in calculations and analysis. The non-primary key columns of the dimension tables represent additional attributes of the dimensions (such as the Year of the Dim_Date dimension). An example query against this schema would extract how many TV sets have been sold, for each brand and country, in a given year.
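A sketch of that query follows. The table names and the Year column come from the description above; the Brand, Country and Product_Category columns and the filter values are assumptions made for illustration, since the original example's SQL and its year filter were not preserved.

    DECLARE @Year int;
    SET @Year = 2000;                      -- placeholder; any year of interest

    SELECT   P.Brand,
             S.Country,
             SUM(F.Units_Sold) AS Units_Sold
    FROM     Fact_Sales  AS F
             JOIN Dim_Date    AS D ON F.Date_Id    = D.Id
             JOIN Dim_Store   AS S ON F.Store_Id   = S.Id
             JOIN Dim_Product AS P ON F.Product_Id = P.Id
    WHERE    D.Year = @Year
             AND P.Product_Category = 'TV Set'   -- assumed attribute and value
    GROUP BY P.Brand, S.Country;

Note that only three joins are needed no matter which dimension attributes are filtered or grouped on, which is why queries against a star schema stay simple as the warehouse grows.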

Normalization:

Database normalization, sometimes referred to as canonical synthesis, is a technique for designing relational database tables to minimize duplication of information and, in so doing, to safeguard the database against certain types of logical or structural problems, namely data anomalies. For example, when multiple instances of a given piece of information occur in a table, the possibility exists that these instances will not be kept consistent when the data within the table is updated, leading to a loss of data integrity. A table that is sufficiently normalized is less vulnerable to problems of this kind, because its structure reflects the basic assumptions for when multiple instances of the same information should be represented by a single instance only. Higher degrees of normalization typically involve more tables and create the need for a larger number of joins, which can reduce performance. Accordingly, more highly normalized tables are typically used in database applications involving many isolated transactions (e.g. an automated teller machine), while less normalized tables tend to be used in database applications that need to map complex relationships between data entities and data attributes (e.g. a reporting application, or a full-text search application).

Database theory describes a table's degree of normalization in terms of normal forms of successively higher degrees of strictness. A table in third normal form (3NF), for example, is consequently in second normal form (2NF) as well, but the reverse is not necessarily the case. Although the normal forms are often defined informally in terms of the characteristics of tables, rigorous definitions of the normal forms are concerned with the characteristics of mathematical constructs known as relations. Whenever information is represented relationally, it is meaningful to consider the extent to which the representation is normalized.

Materialized view:

In a database management system following the relational model, a view is a virtual table representing the result of a database query. Whenever an ordinary view's table is queried or updated, the DBMS converts these into queries or updates against the underlying base tables. A materialized view takes a different approach: the query result is cached as a concrete table that may be updated from the original base tables from time to time. This enables much more efficient access, at the cost of some data being potentially out-of-date. It is most useful in data warehousing scenarios, where frequent queries of the actual base tables can be extremely expensive. In addition, because the view is manifested as a real table, anything that can be done to a real table can be done to it, most importantly building indexes on any column, enabling drastic speedups in query time. In a normal view, it is typically only possible to exploit indexes on columns that come directly from (or have a mapping to) indexed columns in the base tables; often this functionality is not offered at all. Materialized views were implemented first by the Oracle database. There are three types of materialized views:

1) Read only: cannot be updated; complex materialized views are supported.
2) Updateable: can be updated even when disconnected from the master site; refreshed on demand; consumes fewer resources; requires the Advanced Replication option to be installed.
3) Writeable: created with the FOR UPDATE clause; changes are lost when the view is refreshed; requires the Advanced Replication option to be installed.
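As a rough illustration of the idea, the following Oracle-style sketch creates a read-only materialized view over the sales example and refreshes it on demand. Object names are invented, and a real deployment would tune the build and refresh options to its needs.

    -- Cache an aggregate query result as a real table (Oracle syntax).
    CREATE MATERIALIZED VIEW mv_sales_by_product
      BUILD IMMEDIATE
      REFRESH COMPLETE ON DEMAND
    AS
      SELECT p.product_id, SUM(s.units_sold) AS units_sold
      FROM   fact_sales s
             JOIN dim_product p ON s.product_id = p.product_id
      GROUP BY p.product_id;

    -- Later, re-sync the cached result from the base tables.
    BEGIN
      DBMS_MVIEW.REFRESH('mv_sales_by_product');
    END;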

Data Warehouses, OLTP, OLAP, and Data Mining

A relational database is designed for a specific purpose. Because the purpose of a data warehouse differs from that of an OLTP system, the design characteristics of a relational database that supports a data warehouse differ from the design characteristics of an OLTP database.

Data warehouse database:
- Designed for analysis of business measures by categories and attributes
- Optimized for bulk loads and large, complex, unpredictable queries that access many rows per table
- Loaded with consistent, valid data; requires no real-time validation
- Supports few concurrent users relative to OLTP

OLTP database:
- Designed for real-time business operations
- Optimized for a common set of transactions, usually adding or retrieving a single row at a time per table
- Optimized for validation of incoming data during transactions; uses validation data tables
- Supports thousands of concurrent users

A Data Warehouse Supports OLTP

A data warehouse supports an OLTP system by providing a place for the OLTP database to offload data as it accumulates, and by providing services that would complicate and degrade OLTP operations if they were performed in the OLTP database. Without a data warehouse to hold historical information, data is archived to static media such as magnetic tape, or allowed to accumulate in the OLTP database.

If data is simply archived for preservation, it is not available or organized for use by analysts and decision makers. If data is allowed to accumulate in the OLTP database so it can be used for analysis, the OLTP database continues to grow in size and requires more indexes to service analytical and report queries. These queries access and process large portions of the continually growing historical data and add a substantial load to the database. The large indexes needed to support these queries also tax the OLTP transactions with additional index maintenance. These queries can also be complicated to develop due to the typically complex OLTP database schema.

A data warehouse offloads the historical data from the OLTP system, allowing the OLTP system to operate at peak transaction efficiency. High-volume analytical and reporting queries are handled by the data warehouse and do not load the OLTP system, which does not need additional indexes for their support. As data is moved to the data warehouse, it is also reorganized and consolidated so that analytical queries are simpler and more efficient.

OLAP is a Data Warehouse Tool

Online analytical processing (OLAP) is a technology designed to provide superior performance for ad hoc business intelligence queries. OLAP is designed to operate efficiently with data organized in accordance with the common dimensional model used in data warehouses. A data warehouse provides a multidimensional view of data in an intuitive model designed to match the types of queries posed by analysts and decision makers. OLAP organizes data warehouse data into multidimensional cubes based on this dimensional model, and then preprocesses these cubes to provide maximum performance for queries that summarize data in various ways. For example, a query that requests the total sales income and quantity sold for a range of products in a specific geographical region for a specific time period can typically be answered in a few seconds or less, regardless of how many hundreds of millions of rows of data are stored in the data warehouse database.

OLAP is not designed to store large volumes of text or binary data, nor is it designed to support high-volume update transactions. The inherent stability and consistency of historical data in a data warehouse enables OLAP to provide its remarkable performance in rapidly summarizing information for analytical queries.

In SQL Server 2000, Analysis Services provides tools for developing OLAP applications and a server specifically designed to service OLAP queries.

Data Mining is a Data Warehouse Tool

Data mining is a technology that applies sophisticated and complex algorithms to analyze data and expose interesting information for analysis by decision makers. Whereas OLAP organizes data in a model suited for exploration by analysts, data mining performs analysis on data and provides the results to decision makers. Thus, OLAP supports model-driven analysis and data mining supports data-driven analysis.

Data mining has traditionally operated only on raw data in the data warehouse database or, more commonly, text files of data extracted from the data warehouse database. In SQL Server 2000, Analysis Services provides data mining technology that can analyze data in OLAP cubes, as well as data in the relational data warehouse database. In addition, data mining results can be incorporated into OLAP cubes to further enhance model-driven analysis by providing an additional dimensional viewpoint into the OLAP model. For example, data mining can be used to analyze sales data against customer attributes and create a new cube dimension to assist the analyst in the discovery of the information embedded in the cube data. For more information and details about data mining in SQL Server 2000, see Chapter 24, "Effective Strategies for Data Mining," in the SQL Server 2000 Resource Kit.

Designing a Data Warehouse: Prerequisites

Before embarking on the design of a data warehouse, it is imperative that the architectural goals of the data warehouse be clear and well understood. Because the purpose of a data warehouse is to serve users, it is also critical to understand the various types of users, their needs, and the characteristics of their interactions with the data warehouse.

Data Warehouse Architecture Goals

A data warehouse exists to serve its users, analysts and decision makers. A data warehouse must be designed to satisfy the following requirements:

- Deliver a great user experience; user acceptance is the measure of success
- Function without interfering with OLTP systems
- Provide a central repository of consistent data
- Answer complex queries quickly
- Provide a variety of powerful analytical tools, such as OLAP and data mining

Most successful data warehouses that meet these requirements have these common characteristics:

- Are based on a dimensional model
- Contain historical data
- Include both detailed and summarized data
- Consolidate disparate data from multiple sources while retaining consistency
- Focus on a single subject, such as sales, inventory, or finance

Data warehouses are often quite large. However, size is not an architectural goal; it is a characteristic driven by the amount of data needed to serve the users.

Data Warehouse Users

The success of a data warehouse is measured solely by its acceptance by users. Without users, historical data might as well be archived to magnetic tape and stored in the basement. Successful data warehouse design starts with understanding the users and their needs. Data warehouse users can be divided into four categories: Statisticians, Knowledge Workers, Information Consumers, and Executives. Each type makes up a portion of the user population as illustrated in this diagram.

Figure 1. The User Pyramid

Statisticians: There are typically only a handful of sophisticated analysts, statisticians and operations research types, in any organization. Though few in number, they are some of the best users of the data warehouse; their work can contribute to closed-loop systems that deeply influence the operations and profitability of the company. It is vital that these users come to love the data warehouse. Usually that is not difficult; these people are often very self-sufficient and need only to be pointed to the database and given some simple instructions about how to get to the data and what times of the day are best for performing large queries to retrieve data to analyze using their own sophisticated tools. They can take it from there.

Knowledge Workers: A relatively small number of analysts perform the bulk of new queries and analyses against the data warehouse. These are the users who get the "Designer" or "Analyst" versions of user access tools. They will figure out how to quantify a subject area. After a few iterations, their queries and reports typically get published for the benefit of the Information Consumers. Knowledge Workers are often deeply engaged with the data warehouse design and place the greatest demands on the ongoing data warehouse operations team for training and support.

Information Consumers: Most users of the data warehouse are Information Consumers; they will probably never compose a true ad hoc query. They use static or simple interactive reports that others have developed. It is easy to forget about these users, because they usually interact with the data warehouse only through the work product of others. Do not neglect these users! This group includes a large number of people, and published reports are highly visible. Set up a great communication infrastructure for distributing information widely, and gather feedback from these users to improve the information sites over time.

Executives: Executives are a special case of the Information Consumers group. Few executives actually issue their own queries, but an executive's slightest musing can generate a flurry of activity among the other types of users. A wise data warehouse designer/implementer/owner will develop a very cool digital dashboard for executives, assuming it is easy and economical to do so. Usually this should follow other data warehouse work, but it never hurts to impress the bosses.

How Users Query the Data Warehouse

Information for users can be extracted from the data warehouse relational database or from the output of analytical services such as OLAP or data mining. Direct queries to the data warehouse relational database should be limited to those that cannot be accomplished through existing tools, which are often more efficient than direct queries and impose less load on the relational database. Reporting tools and custom applications often access the database directly. Statisticians frequently extract data for use by special analytical tools. Analysts may write complex queries to extract and compile specific information not readily accessible through existing tools. Information consumers do not interact directly with the relational database but may receive reports or access web pages that expose data from the relational database. Executives use standard reports or ask others to create specialized reports for them. When using the Analysis Services tools in SQL Server 2000, Statisticians will often perform data mining, Analysts will write MDX queries against OLAP cubes and use data mining, and Information Consumers will use interactive reports designed by others.

Developing a Data Warehouse: Details

The phases of a data warehouse project listed below are similar to those of most database projects, starting with identifying requirements and ending with deploying the system:

- Identify and gather requirements
- Design the dimensional model

- Develop the architecture, including the Operational Data Store (ODS)
- Design the relational database and OLAP cubes
- Develop the data maintenance applications
- Develop analysis applications
- Test and deploy the system

Identify and Gather Requirements

Identify sponsors. A successful data warehouse project needs a sponsor in the business organization and usually a second sponsor in the Information Technology group. Sponsors must understand and support the business value of the project.

Understand the business before entering into discussions with users. Then interview and work with the users, not the data: learn the needs of the users and turn these needs into project requirements. Find out what information they need to be more successful at their jobs, not what data they think should be in the data warehouse; it is the data warehouse designer's job to determine what data is necessary to provide the information. Topics for discussion are the users' objectives and challenges and how they go about making business decisions. Business users should be closely tied to the design team during the logical design process; they are the people who understand the meaning of existing data. Many successful projects include several business users on the design team to act as data experts and "sounding boards" for design concepts. Whatever the structure of the team, it is important that business users feel ownership of the resulting system.

Interview data experts after interviewing several users. Find out from the experts what data exists and where it resides, but only after you understand the basic business needs of the end users. Information about available data is needed early in the process, before you complete the analysis of the business needs, but the physical design of existing data should not be allowed to have much influence on discussions about business needs.

Communicate with users often and thoroughly; continue discussions as requirements continue to solidify so that everyone participates in the progress of the requirements definition.

Design the Dimensional Model

User requirements and data realities drive the design of the dimensional model, which must address business needs, grain of detail, and what dimensions and facts to include. The dimensional model must suit the requirements of the users and support ease of use for direct access. The model must also be designed so that it is easy to maintain and can adapt to future changes. The model design must result in a relational database that supports OLAP cubes to provide "instantaneous" query results for analysts.

An OLTP system requires a normalized structure to minimize redundancy, provide validation of input data, and support a high volume of fast transactions. A transaction usually involves a single business event, such as placing an order or posting an invoice payment.

An OLTP model often looks like a spider web of hundreds or even thousands of related tables. In contrast, a typical dimensional model uses a star or snowflake design that is easy to understand and relate to business needs, supports simplified business queries, and provides superior query performance by minimizing table joins. For example, contrast the very simplified OLTP data model in the first diagram below with the data warehouse dimensional model in the second diagram. Which one better supports the ease of developing reports and simple, efficient summarization queries?

Figure 2. Flow Chart

Figure 3. Star Diagram

Dimensional Model Schemas

The principal characteristic of a dimensional model is a set of detailed business facts surrounded by multiple dimensions that describe those facts. When realized in a database, the schema for a dimensional model contains a central fact table and multiple dimension tables. A dimensional model may produce a star schema or a snowflake schema.

Star Schemas

A schema is called a star schema if all dimension tables can be joined directly to the fact table. The following diagram shows a classic star schema.

Figure 4. Classic star schema, sales

The following diagram shows a clickstream star schema.

Figure 5. Clickstream star schema

Snowflake Schemas

A schema is called a snowflake schema if one or more dimension tables do not join directly to the fact table but must join through other dimension tables. For example, a dimension that describes products may be separated into three tables (snowflaked) as illustrated in the following diagram.

Figure 6. Snowflake, three tables
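A snowflaked product dimension of this kind might be sketched in T-SQL as below; the table and column names are invented. The indexed view at the end anticipates the "virtual star" idea discussed under "Star or Snowflake" below, collapsing the three tables back into a single star-style dimension. (SQL Server imposes further requirements on indexed views, such as specific SET options, that are omitted from this sketch.)

    -- Product dimension snowflaked into three related tables.
    CREATE TABLE dbo.Product_Category (
        Category_Key  int PRIMARY KEY,
        Category_Name nvarchar(50) NOT NULL
    );
    CREATE TABLE dbo.Product_Subcategory (
        Subcategory_Key  int PRIMARY KEY,
        Subcategory_Name nvarchar(50) NOT NULL,
        Category_Key     int NOT NULL REFERENCES dbo.Product_Category (Category_Key)
    );
    CREATE TABLE dbo.Product (
        Product_Key     int PRIMARY KEY,
        Product_Name    nvarchar(100) NOT NULL,
        Subcategory_Key int NOT NULL REFERENCES dbo.Product_Subcategory (Subcategory_Key)
    );
    GO

    -- Indexed view presenting the snowflaked dimension as one virtual star dimension.
    CREATE VIEW dbo.v_Product_Star
    WITH SCHEMABINDING
    AS
    SELECT p.Product_Key, p.Product_Name, s.Subcategory_Name, c.Category_Name
    FROM   dbo.Product p
           JOIN dbo.Product_Subcategory s ON p.Subcategory_Key = s.Subcategory_Key
           JOIN dbo.Product_Category    c ON s.Category_Key = c.Category_Key;
    GO
    CREATE UNIQUE CLUSTERED INDEX IX_v_Product_Star ON dbo.v_Product_Star (Product_Key);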

A snowflake schema with multiple heavily snowflaked dimensions is illustrated in the following diagram.

Figure 7. Many dimension snowflake

Star or Snowflake

Both star and snowflake schemas are dimensional models; the difference is in their physical implementations. Snowflake schemas support ease of dimension maintenance because they are more normalized. Star schemas are easier for direct user access and often support simpler and more efficient queries. The decision to model a dimension as a star or snowflake depends on the nature of the dimension itself, such as how frequently it changes and which of its elements change, and often involves evaluating tradeoffs between ease of use and ease of maintenance.

It is often easiest to maintain a complex dimension by snowflaking the dimension. By pulling hierarchical levels into separate tables, referential integrity between the levels of the hierarchy is guaranteed. Analysis Services reads from a snowflaked dimension as well as, or better than, from a star dimension. However, it is important to present a simple and appealing user interface to business users who are developing ad hoc queries on the dimensional database. It may be better to create a star version of the snowflaked dimension for presentation to the users. Often, this is best accomplished by creating an indexed view across the snowflaked dimension, collapsing it to a virtual star, as in the sketch above.

Dimension Tables

Dimension tables encapsulate the attributes associated with facts and separate these attributes into logically distinct groupings, such as time, geography, products, customers, and so forth. A dimension table may be used in multiple places if the data warehouse contains multiple fact tables or contributes data to data marts. For example, a product dimension may be used with a sales fact table and an inventory fact table in the data warehouse, and also in one or more departmental data marts. A dimension such as customer, time, or product that is used in multiple schemas is called a conforming dimension if all copies of the dimension are the same. Summarization data and reports will not correspond if different schemas use different versions of a dimension table. Using conforming dimensions is critical to successful data warehouse design.

User input and evaluation of existing business reports help define the dimensions to include in the data warehouse. A user who wants to see data "by sales region" and "by product" has just identified two dimensions (geography and product).

Business reports that group sales by salesperson or sales by customer identify two more dimensions (salesforce and customer). Almost every data warehouse includes a time dimension.

In contrast to a fact table, dimension tables are usually small and change relatively slowly. Dimension tables are seldom keyed to date. The records in a dimension table establish one-to-many relationships with the fact table. For example, there may be a number of sales to a single customer, or a number of sales of a single product. The dimension table contains attributes associated with the dimension entry; these attributes are rich and user-oriented textual details, such as product name or customer name and address. Attributes serve as report labels and query constraints. Attributes that are coded in an OLTP database should be decoded into descriptions. For example, product category may exist as a simple integer in the OLTP database, but the dimension table should contain the actual text for the category. The code may also be carried in the dimension table if needed for maintenance. This denormalization simplifies and improves the efficiency of queries and simplifies user query tools. However, if a dimension attribute changes frequently, maintenance may be easier if the attribute is assigned to its own table to create a snowflake dimension.

It is often useful to have a pre-established "no such member" or "unknown member" record in each dimension to which orphan fact records can be tied during the update process. Business needs and the reliability of consistent source data will drive the decision as to whether such placeholder dimension records are required.

Hierarchies

The data in a dimension is usually hierarchical in nature. Hierarchies are determined by the business need to group and summarize data into usable information. For example, a time dimension often contains the hierarchy elements (all time), Year, Quarter, Month, Day or (all time), Year, Quarter, Week, Day. A dimension may contain multiple hierarchies; a time dimension often contains both calendar and fiscal year hierarchies. Geography is seldom a dimension of its own; it is usually a hierarchy that imposes a structure on sales points, customers, or other geographically distributed dimensions. An example geography hierarchy for sales points is: (all), Country or Region, Sales-region, State or Province, City, Store.

Note that each hierarchy example has an "(all)" entry such as (all time), (all stores), (all customers), and so forth. This top-level entry is an artificial category used for grouping the first-level categories of a dimension and permits summarization of fact data to a single number for a dimension. For example, if the first level of a product hierarchy includes product line categories for hardware, software, peripherals, and services, the question "What was the total amount for sales of all products last year?" is equivalent to "What was the total amount for the combined sales of hardware, software, peripherals, and services last year?" The concept of an "(all)" node at the top of each hierarchy helps reflect the way users want to phrase their questions. OLAP tools depend on hierarchies to categorize data; Analysis Services will create an "(all)" entry by default for a hierarchy used in a cube if none is specified.

A hierarchy may be balanced, unbalanced, ragged, or composed of parent-child relationships such as an organizational structure. For more information about hierarchies in OLAP cubes, see SQL Server Books Online.
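Pulling several of these points together, here is a minimal T-SQL sketch of a store dimension whose geography hierarchy levels are simply attribute columns, with a surrogate key, the original source key carried alongside, and a pre-established "unknown member" row. All names are illustrative assumptions.

    CREATE TABLE Dim_Store (
        Store_Key      int IDENTITY(1,1) PRIMARY KEY,  -- surrogate key
        Store_Code     nvarchar(20),                   -- original key from the source system
        Store_Name     nvarchar(100),
        City           nvarchar(50),
        State_Province nvarchar(50),
        Sales_Region   nvarchar(50),
        Country_Region nvarchar(50)
        -- The "(all)" level is not stored; OLAP tools supply the top-level member.
    );

    -- Placeholder member to which orphan fact records can be tied during loads.
    SET IDENTITY_INSERT Dim_Store ON;
    INSERT INTO Dim_Store (Store_Key, Store_Code, Store_Name, City, State_Province, Sales_Region, Country_Region)
    VALUES (-1, N'N/A', N'Unknown store', N'Unknown', N'Unknown', N'Unknown', N'Unknown');
    SET IDENTITY_INSERT Dim_Store OFF;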

Surrogate Keys

A critical part of data warehouse design is the creation and use of surrogate keys in dimension tables. A surrogate key is the primary key for a dimension table and is independent of any keys provided by source data systems. Surrogate keys are created and maintained in the data warehouse and should not encode any information about the contents of records; automatically increasing integers make good surrogate keys. The original key for each record is carried in the dimension table but is not used as the primary key. Surrogate keys provide the means to maintain data warehouse information when dimensions change. Special keys are used for date and time dimensions, but these keys differ from surrogate keys used for other dimension tables.

GUID and IDENTITY Keys

Avoid using GUIDs (globally unique identifiers) as keys in the data warehouse database. GUIDs may be used in data from distributed source systems, but they are difficult to use as table keys. GUIDs use a significant amount of storage (16 bytes each), cannot be efficiently sorted, and are difficult for humans to read. Indexes on GUID columns may be relatively slower than indexes on integer keys because GUIDs are four times larger. The Transact-SQL NEWID function can be used to create GUIDs for a column of uniqueidentifier data type, and the ROWGUIDCOL property can be set for such a column to indicate that the GUID values in the column uniquely identify rows in the table, but uniqueness is not enforced. Because a uniqueidentifier data type cannot be sorted, the GUID cannot be used in a GROUP BY statement, nor can occurrences of the uniqueidentifier GUID be distinctly counted; both GROUP BY and COUNT DISTINCT operations are very common in data warehouses. The uniqueidentifier GUID cannot be used as a measure in an Analysis Services cube. The IDENTITY property and IDENTITY function can be used to create identity columns in tables and to manage series of generated numeric keys. IDENTITY functionality is more useful in surrogate key management than uniqueidentifier GUIDs.

Date and Time Dimensions

Each event in a data warehouse occurs at a specific date and time, and data is often summarized by a specified time period for analysis. Although the date and time of a business fact is usually recorded in the source data, special date and time dimensions provide more effective and efficient mechanisms for time-oriented analysis than the raw event time stamp. Date and time dimensions are designed to meet the needs of the data warehouse users and are created within the data warehouse. A date dimension often contains two hierarchies: one for calendar year and another for fiscal year.

Time Granularity

A date dimension with one record per day will suffice if users do not need time granularity finer than a single day. A date-by-day dimension table will contain 365 records per year (366 in leap years). A separate time dimension table should be constructed if a fine time granularity, such as minute or second, is needed. A time dimension table of one-minute granularity will contain 1,440 rows for a day, and a table of seconds will contain 86,400 rows for a day. If exact event time is needed, it should be stored in the fact table.

When a separate time dimension is used, the fact table contains one foreign key for the date dimension and another for the time dimension. Separate date and time dimensions simplify many filtering operations. For example, summarizing data for a range of days requires joining only the date dimension table to the fact table. Analyzing cyclical data by time period within a day requires joining just the time dimension table. The date and time dimension tables can both be joined to the fact table when a specific time range is needed.

For hourly time granularity, the hour breakdown can be incorporated into the date dimension or placed in a separate dimension. Business needs influence this design decision. If the main use is to extract contiguous chunks of time that cross day boundaries (for example, from late on 11/24 to early on 11/25), then it is easier if the hour and day are in the same dimension. However, it is easier to analyze cyclical and recurring daily events if they are in separate dimensions. Unless there is a clear reason to combine date and hour in a single dimension, it is generally better to keep them in separate dimensions.

Date and Time Dimension Attributes

It is often useful to maintain attribute columns in a date dimension to provide additional convenience or business information that supports analysis. For example, one or more columns in the time-by-hour dimension table can indicate peak periods in a daily cycle, such as meal times for a restaurant chain or heavy usage hours for an Internet service provider. Peak period columns may be Boolean, but it is better to "decode" the Boolean yes/no into a brief description, such as "peak"/"off-peak". In a report, the decoded values will be easier for business users to read than multiple columns of "yes" and "no".

These are some possible attribute columns that may be used in a date table. Fiscal year versions are the same, although values such as quarter numbers may differ.

Column name       Data type      Format/Example   Comment
date_key          int            yyyymmdd
day_date          smalldatetime
day_of_week       char           Monday
week_begin_date   smalldatetime
week_num          tinyint        1 to 52 or 53    Week 1 defined by business rules
month_num         tinyint        1 to 12
month_name        char           January
month_short_name  char           Jan
month_end_date    smalldatetime                   Useful for days in the month
days_in_month     tinyint                         Alternative for, or in addition to, month_end_date
yearmo            int            yyyymm
quarter_num       tinyint        1 to 4
quarter_name      char           1Q2000
year              smallint
weekend_ind       bit                             Indicates weekend
workday_ind       bit                             Indicates work day
weekend_weekday   char           weekend          Alternative for weekend_ind and workday_ind; can make reports more readable
holiday_ind       bit                             Indicates a holiday
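A date dimension of this kind is typically generated once by a small script rather than loaded from source data. The following T-SQL sketch creates and populates an abbreviated version; the column list is shortened and the date range is arbitrary.

    CREATE TABLE Dim_Date (
        date_key    int PRIMARY KEY,    -- yyyymmdd
        day_date    smalldatetime,
        day_of_week char(10),
        month_num   tinyint,
        month_name  char(10),
        quarter_num tinyint,
        [year]      smallint,
        weekend_ind bit
    );

    DECLARE @d smalldatetime, @end smalldatetime;
    SET @d   = '2000-01-01';            -- arbitrary start of the generated range
    SET @end = '2004-12-31';

    WHILE @d <= @end
    BEGIN
        INSERT INTO Dim_Date (date_key, day_date, day_of_week, month_num, month_name, quarter_num, [year], weekend_ind)
        VALUES (
            CONVERT(int, CONVERT(char(8), @d, 112)),      -- yyyymmdd key
            @d,
            DATENAME(weekday, @d),
            MONTH(@d),
            DATENAME(month, @d),
            DATEPART(quarter, @d),
            YEAR(@d),
            CASE WHEN DATENAME(weekday, @d) IN ('Saturday', 'Sunday') THEN 1 ELSE 0 END
        );
        SET @d = DATEADD(day, 1, @d);
    END;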

Hardware & I/O considerations:

Overview of Hardware and I/O Considerations in Data Warehouses

I/O performance should always be a key consideration for data warehouse designers and administrators. The typical workload in a data warehouse is especially I/O intensive, with operations such as large data loads and index builds, creation of materialized views, and queries over large volumes of data. The underlying I/O system for a data warehouse should be designed to meet these heavy requirements. In fact, one of the leading causes of performance issues in a data warehouse is poor I/O configuration. Database administrators who have previously managed other systems will likely need to pay more careful attention to the I/O configuration for a data warehouse than they may have previously done for other environments.

This chapter provides the following five high-level guidelines for data warehouse I/O configurations:

- Configure I/O for Bandwidth not Capacity
- Stripe Far and Wide
- Use Redundancy
- Test the I/O System Before Building the Database
- Plan for Growth

The I/O configuration used by a data warehouse will depend on the characteristics of the specific storage and server capabilities, so the material in this chapter is only intended to provide guidelines for designing and tuning an I/O system.

Configure I/O for Bandwidth not Capacity

Storage configurations for a data warehouse should be chosen based on the I/O bandwidth that they can provide, and not necessarily on their overall storage capacity. Buying storage based solely on capacity has the potential for making a mistake, especially for systems less than 500GB in total size. The capacity of individual disk drives is growing faster than the I/O throughput rates provided by those disks, leading to a situation in which a small number of disks can store a large volume of data, but cannot provide the same I/O throughput as a larger number of small disks.

As an example, consider a 200GB data mart. Using 72GB drives, this data mart could be built with as few as six drives in a fully mirrored environment (six 72GB drives supply roughly 432GB of raw capacity, or about 216GB after mirroring). However, six drives might not provide enough I/O bandwidth to handle a medium number of concurrent users on a 4-CPU server. Thus, even though six drives provide sufficient storage, a larger number of drives may be required to provide acceptable performance for this system.

While it may not be practical to estimate the I/O bandwidth that will be required by a data warehouse before a system is built, it is generally practical, with the guidance of the hardware manufacturer, to estimate how much I/O bandwidth a given server can potentially utilize, and to ensure that the selected I/O configuration will be able to successfully feed the server. There are many variables in sizing the I/O systems, but one basic rule of thumb is that your data warehouse system should have multiple disks for each CPU (at least two disks for each CPU at a bare minimum) in order to achieve optimal performance.

Stripe Far and Wide

The guiding principle in configuring an I/O system for a data warehouse is to maximize I/O bandwidth by having multiple disks and channels access each database object. You can do this by striping the datafiles of the Oracle database. A striped file is a file distributed across multiple disks. This striping can be managed by software (such as a logical volume manager) or within the storage hardware. The goal is to ensure that each tablespace is striped across a large number of disks (ideally, all of the disks) so that any database object can be accessed with the highest possible I/O bandwidth.

Use Redundancy

Because data warehouses are often the largest database systems in a company, they have the most disks and thus are also the most susceptible to the failure of a single disk. Therefore, disk redundancy is a requirement for data warehouses to protect against a hardware failure. Like disk striping, redundancy can be achieved in many ways using software or hardware. A key consideration is that occasionally a balance must be struck between redundancy and performance. For example, a storage system in a RAID-5 configuration may be less expensive than a RAID-0+1 configuration, but it may not perform as well, either. Redundancy is necessary for any data warehouse, but the approach to redundancy may vary depending upon the performance and cost constraints of each data warehouse.

Test the I/O System Before Building the Database

The most important time to examine and tune the I/O system is before the database is even created. Once the database files are created, it is more difficult to reconfigure the files. Some logical volume managers may support dynamic reconfiguration of files, while other storage configurations may require that files be entirely rebuilt in order to reconfigure their I/O layout. In both cases, considerable system resources must be devoted to this reconfiguration. When creating a data warehouse on a new system, the I/O bandwidth should be tested before creating all of the database datafiles to validate that the expected I/O levels are being achieved. On most operating systems, this can be done with simple scripts to measure the performance of reading and writing large test files.

Plan for Growth

A data warehouse designer should plan for future growth of a data warehouse. There are many approaches to handling the growth in a system, and the key consideration is to be able to grow the I/O system without compromising on the I/O bandwidth. You cannot, for example, add four disks to an existing system of 20 disks and grow the database by adding a new tablespace striped across only the four new disks. A better solution would be to add new tablespaces striped across all 24 disks, and over time also convert the existing tablespaces striped across 20 disks to be striped across all 24 disks.

Storage Management

Two features to consider for managing disks are Oracle Managed Files and Automatic Storage Management. Without these features, a database administrator must manage the database files, which, in a data warehouse, can be hundreds or even thousands of files. Oracle Managed Files simplifies the administration of a database by providing functionality to automatically create and manage files, so the database administrator no longer needs to manage each database file. Automatic Storage Management provides additional functionality for managing not only files but also the disks. With Automatic Storage Management, the database administrator would administer a small number of disk groups; Automatic Storage Management handles the tasks of striping and providing disk redundancy, including rebalancing the database files when new disks are added to the system.
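As a rough Oracle-flavored sketch of these two features (the disk paths and object names are purely illustrative, and the disk group statement is issued against the ASM instance rather than the database instance):

    -- Automatic Storage Management: a disk group with normal (two-way) redundancy.
    CREATE DISKGROUP dwh_data NORMAL REDUNDANCY
      DISK '/devices/disk01', '/devices/disk02',
           '/devices/disk03', '/devices/disk04';

    -- Oracle Managed Files: point file creation at the disk group, so tablespaces
    -- can be created without naming individual datafiles.
    ALTER SYSTEM SET db_create_file_dest = '+DWH_DATA';
    CREATE TABLESPACE sales_data;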

Data parallelism:

Data parallelism (also known as loop-level parallelism) is a form of parallelization of computing across multiple processors in parallel computing environments. Data parallelism focuses on distributing the data across different parallel computing nodes. It contrasts with task parallelism as another form of parallelism. In a multiprocessor system executing a single set of instructions (SIMD), data parallelism is achieved when each processor performs the same task on different pieces of distributed data. In some situations, a single execution thread controls operations on all pieces of data. In others, different threads control the operation, but they execute the same code.

For instance, if we are running code on a 2-processor system (CPUs A and B) in a parallel environment, and we wish to do a task on some data D, it is possible to tell CPU A to do that task on one part of D and CPU B on another part simultaneously, thereby reducing the runtime of the execution. The data can be assigned using conditional statements. As a specific example, consider adding two matrices. In a data-parallel implementation, CPU A could add all elements from the top half of the matrices, while CPU B could add all elements from the bottom half of the matrices. Since the two processors work in parallel, the job of performing matrix addition would take one half the time of performing the same operation in serial using one CPU alone. Data parallelism emphasizes the distributed (parallelized) nature of the data, as opposed to the processing (task parallelism). Most real programs fall somewhere on a continuum between task parallelism and data parallelism.

Data Extraction, Transformation, and Loading Techniques

The chapter "Data Warehouse Design Considerations" discussed the use of dimensional modeling to design databases for data warehousing. In contrast to the complex, highly normalized, entity-relationship schemas of online transaction processing (OLTP) databases, data warehouse schemas are simple and denormalized. Regardless of the specific design or technology used in a data warehouse, its implementation must include mechanisms to migrate data into the data warehouse database. This process of data migration is generally referred to as the extraction, transformation, and loading (ETL) process. Some data warehouse experts add an additional term, management, to ETL, expanding it to ETLM. Others use the M to mean meta data. Both refer to the management of the data as it flows into the data warehouse and is used in the data warehouse. The information used to manage data consists of data about data, which is the definition of meta data.

The topics in this chapter describe the elements of the ETL process and provide examples of procedures that address common ETL issues such as managing surrogate keys, slowly changing dimensions, and meta data. The code examples in this chapter are also available on the SQL Server 2000 Resource Kit CD-ROM, in the file \Docs\ChapterCode\CH19Code.txt. For more information, see Chapter 39, "Tools, Samples, ebooks, and More."

Introduction

During the ETL process, data is extracted from an OLTP database, transformed to match the data warehouse schema, and loaded into the data warehouse database. Many data warehouses also incorporate data from non-OLTP systems, such as text files, legacy systems, and spreadsheets; such data also requires extraction, transformation, and loading.

In its simplest form, ETL is the process of copying data from one database to another. This simplicity is rarely, if ever, found in data warehouse implementations; in reality, ETL is often a complex combination of process and technology that consumes a significant portion of the data warehouse development effort and requires the skills of business analysts, database designers, and application developers.

When defining ETL for a data warehouse, it is important to think of ETL as a process, not a physical implementation. ETL systems vary from data warehouse to data warehouse and even between department data marts within a data warehouse. A monolithic application, regardless of whether it is implemented in Transact-SQL or a traditional programming language, does not provide the flexibility for change necessary in ETL systems. A mixture of tools and technologies should be used to develop applications that each perform a specific ETL task.

The ETL process is not a one-time event; new data is added to a data warehouse periodically. Typical periodicity may be monthly, weekly, daily, or even hourly, depending on the purpose of the data warehouse and the type of business it serves. Because ETL is an integral, ongoing, and recurring part of a data warehouse, ETL processes must be automated and operational procedures documented. ETL also changes and evolves as the data warehouse evolves, so ETL processes must be designed for ease of modification. A solid, well-designed, and documented ETL system is necessary for the success of a data warehouse project.

Data warehouses evolve to improve their service to the business and to adapt to changes in business processes and requirements. Business rules change as the business reacts to market influences; the data warehouse must respond in order to maintain its value as a tool for decision makers. The ETL implementation must adapt as the data warehouse evolves. Microsoft SQL Server 2000 provides significant enhancements to existing performance and capabilities, and introduces new features that make the development, deployment, and maintenance of ETL processes easier and simpler, and their performance faster.

ETL Functional Elements

Regardless of how they are implemented, all ETL systems have a common purpose: they move data from one database to another. Generally, ETL systems move data from OLTP systems to a data warehouse, but they can also be used to move data from one data warehouse to another. An ETL system consists of four distinct functional elements:

- Extraction
- Transformation
- Loading
- Meta data

Extraction

The ETL extraction element is responsible for extracting data from the source system. During extraction, data may be removed from the source system, or a copy may be made and the original data retained in the source system. It is common to move historical data that accumulates in an operational OLTP system to a data warehouse to maintain OLTP performance and efficiency. Legacy systems may require too much effort to implement such offload processes, so legacy data is often copied into the data warehouse, leaving the original data in place. Extracted data is loaded into the data warehouse staging area (a relational database usually separate from the data warehouse database) for manipulation by the remaining ETL processes.

Data extraction is generally performed within the source system itself, especially if it is a relational database to which extraction procedures can easily be added. It is also possible for the extraction logic to exist in the data warehouse staging area and query the source system for data using ODBC, OLE DB, or other APIs. For legacy systems, the most common method of data extraction is for the legacy system to produce text files, although many newer systems offer direct query APIs or accommodate access through ODBC or OLE DB. Data extraction processes can be implemented using Transact-SQL stored procedures, Data Transformation Services (DTS) tasks, or custom applications developed in programming or scripting languages.
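As a minimal illustration of the extraction element, the following T-SQL sketch copies newly modified rows from a source OLTP table into a staging table. The source database, table, and LastModified change-tracking column are assumptions made for the example.

    -- Staging table, typically in a database separate from the warehouse itself.
    CREATE TABLE Staging_Orders (
        OrderID     int,
        CustomerID  int,
        OrderDate   datetime,
        OrderAmount money,
        ExtractedAt datetime DEFAULT (GETDATE())
    );

    -- Incremental extract: copy rows modified since the previous extraction.
    DECLARE @LastExtract datetime;
    SELECT @LastExtract = MAX(ExtractedAt) FROM Staging_Orders;

    INSERT INTO Staging_Orders (OrderID, CustomerID, OrderDate, OrderAmount)
    SELECT o.OrderID, o.CustomerID, o.OrderDate, o.OrderAmount
    FROM   SourceDB.dbo.Orders AS o                       -- hypothetical source OLTP table
    WHERE  o.LastModified > COALESCE(@LastExtract, '19000101');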

Transformation

The ETL transformation element is responsible for data validation, data accuracy, data type conversion, and business rule application. It is the most complicated of the ETL elements. It may appear to be more efficient to perform some transformations as the data is being extracted (inline transformation); however, an ETL system that uses inline transformations during extraction is less robust and flexible than one that confines transformations to the transformation element. Transformations performed in the OLTP system impose a performance burden on the OLTP database. They also split the transformation logic between two ETL elements and add maintenance complexity when the ETL logic changes.

Tools used in the transformation element vary. Some data validation and data accuracy checking can be accomplished with straightforward Transact-SQL code. More complicated transformations can be implemented using DTS packages. The application of complex business rules often requires the development of sophisticated custom applications in various programming languages. You can use DTS packages to encapsulate multi-step transformations into a single task. Listed below are some basic examples that illustrate the types of transformations performed by this element:

- Data Validation: Check that all rows in the fact table match rows in dimension tables to enforce data integrity.
- Data Accuracy: Ensure that fields contain appropriate values, such as only "off" or "on" in a status field.
- Data Type Conversion: Ensure that all values for a specified field are stored the same way in the data warehouse regardless of how they were stored in the source system. For example, if one source system stores "off" or "on" in its status field and another source system stores "0" or "1" in its status field, then a data type conversion transformation converts the content of one or both of the fields to a specified common value such as "off" or "on".
- Business Rule Application: Ensure that the rules of the business are enforced on the data stored in the warehouse. For example, check that all customer records contain values for both FirstName and LastName fields.

Loading

The ETL loading element is responsible for loading transformed data into the data warehouse database. Data warehouses are usually updated periodically rather than continuously, and large numbers of records are often loaded to multiple tables in a single data load. The data warehouse is often taken offline during update operations so that data can be loaded faster and SQL Server 2000 Analysis Services can update OLAP cubes to incorporate the new data. BULK INSERT, bcp, and the Bulk Copy API are the best tools for data loading operations. The design of the loading element should focus on efficiency and performance to minimize the data warehouse offline time. For more information and details about performance tuning, see Chapter 20, "RDBMS Performance Tuning Guide for Data Warehousing."
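To make the transformation examples listed above and the loading step concrete, here is a hedged T-SQL sketch; the staging tables, columns, file path, and status codes are all illustrative assumptions.

    -- Data validation: find staged fact rows with no matching product dimension row.
    SELECT f.*
    FROM   Staging_Fact_Sales AS f
           LEFT JOIN Dim_Product AS p ON f.Product_Code = p.Product_Code
    WHERE  p.Product_Code IS NULL;

    -- Data type conversion: normalize differently coded status values to a common form.
    UPDATE Staging_Orders
    SET    OrderStatus = CASE OrderStatus WHEN '1' THEN 'on'
                                          WHEN '0' THEN 'off'
                                          ELSE OrderStatus END;

    -- Business rule application: customer records must carry both name fields.
    SELECT CustomerID
    FROM   Staging_Customer
    WHERE  FirstName IS NULL OR LastName IS NULL;

    -- Loading: bulk-load a transformed flat file into the warehouse fact table.
    BULK INSERT Fact_Sales
    FROM 'C:\etl\fact_sales.txt'
    WITH (FIELDTERMINATOR = '|', TABLOCK);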

24 24 Meta Data The ETL meta data functional element is responsible for maintaining information (meta data) about the movement and transformation of data, and the operation of the data warehouse. It also documents the data mappings used during the transformations. Meta data logging provides possibilities for automated administration, trend prediction, and code reuse. Examples of data warehouse meta data that can be recorded and used to analyze the activity and performance of a data warehouse include: Data Lineage, such as the time that a particular set of records was loaded into the data warehouse. Schema Changes, such as changes to table definitions. Data Type Usage, such as identifying all tables that use the "Birthdate" userdefined data type. Transformation Statistics, such as the execution time of each stage of a transformation, the number of rows processed by the transformation, the last time the transformation was executed, and so on. DTS Package Versioning, which can be used to view, branch, or retrieve any historical version of a particular DTS package. Data Warehouse Usage Statistics, such as query times for reports. ETL Design Considerations Regardless of their implementation, a number of design considerations are common to all ETL systems: Modularity ETL systems should contain modular elements that perform discrete tasks. This encourages reuse and makes them easy to modify when implementing changes in response to business and data warehouse changes. Monolithic systems should be avoided. Consistency ETL systems should guarantee consistency of data when it is loaded into the data warehouse. An entire data load should be treated as a single logical transaction either the entire data load is successful or

25 25 the entire load is rolled back. In some systems, the load is a single physical transaction, whereas in others it is a series of transactions. Regardless of the physical implementation, the data load should be treated as a single logical transaction. Flexibility ETL systems should be developed to meet the needs of the data warehouse and to accommodate the source data environments. It may be appropriate to accomplish some transformations in text files and some on the source data system; others may require the development of custom applications. A variety of technologies and techniques can be applied, using the tool most appropriate to the individual task of each ETL functional element. Speed ETL systems should be as fast as possible. Ultimately, the time window available for ETL processing is governed by data warehouse and source system schedules. Some data warehouse elements may have a huge processing window (days), while others may have a very limited processing window (hours). Regardless of the time available, it is important that the ETL system execute as rapidly as possible. Heterogeneity ETL systems should be able to work with a wide variety of data in different formats. An ETL system that only works with a single type of source data is useless. Meta Data Management ETL systems are arguably the single most important source of meta data about both the data in the data warehouse and data in the source system. Finally, the ETL process itself generates useful meta data that should be retained and analyzed regularly. Meta data is discussed in greater detail later in this chapter. ETL Architectures Before discussing the physical implementation of ETL systems, it is important to understand the different ETL architectures and how they relate to each other. Essentially, ETL systems can be classified in two architectures: the homogenous architecture and the heterogeneous architecture. Homogenous Architecture A homogenous architecture for an ETL system is one that involves only a single source system and a single target system. Data flows from the single source of data through the ETL processes and is loaded into the data warehouse, as shown in the following diagram.
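In practice, much of the ETL meta data described above ends up in a small audit table that each task writes to. The following Transact-SQL sketch is hypothetical; the table name, columns, and literal values are only meant to illustrate the shape of lineage and transformation statistics, and in a real system each step would supply its own row counts and timestamps (for example, from @@ROWCOUNT and GETDATE()).

-- Hypothetical ETL audit table capturing lineage and transformation statistics.
CREATE TABLE EtlRunLog (
    RunID          int IDENTITY(1,1) PRIMARY KEY,
    PackageName    varchar(128) NOT NULL,
    StepName       varchar(128) NOT NULL,
    RowsProcessed  int          NULL,
    StartTime      datetime     NOT NULL,
    EndTime        datetime     NULL,
    Succeeded      bit          NULL
)

-- Example of the row a load step might record when it finishes.
INSERT INTO EtlRunLog (PackageName, StepName, RowsProcessed, StartTime, EndTime, Succeeded)
VALUES ('NightlyWarehouseLoad', 'Load FactSales', 15230, '2001-03-14 01:00', '2001-03-14 01:07', 1)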

26 26 Most homogenous ETL architectures have the following characteristics: Single data source: Data is extracted from a single source system, such as an OLTP system. Rapid development: The development effort required to extract the data is straightforward because there is only one data format for each record type. Light data transformation: No data transformations are required to achieve consistency among disparate data formats, and the incoming data is often in a format usable in the data warehouse. Transformations in this architecture typically involve replacing NULLs and other formatting transformations. Light structural transformation: Because the data comes from a single source, the amount of structural changes such as table alteration is also very light. The structural changes typically involve denormalization efforts to meet data warehouse schema requirements. Simple research requirements: The research efforts to locate data are generally simple: if the data is in the source system, it can be used. If it is not, it cannot. The homogeneous ETL architecture is generally applicable to data marts, especially those focused on a single subject matter. Heterogeneous Architecture A heterogeneous architecture for an ETL system is one that extracts data from multiple sources, as shown in the following diagram. The complexity of this architecture arises from the fact that data from more than one source must be merged, rather than from the fact that data may be formatted differently in the different sources. However, significantly different storage formats and database schemas do provide additional complications.

27 27 Most heterogeneous ETL architectures have the following characteristics: Multiple data sources. More complex development: The development effort required to extract the data is increased because there are multiple source data formats for each record type. Significant data transformation: Data transformations are required to achieve consistency among disparate data formats, and the incoming data is often not in a format usable in the data warehouse. Transformations in this architecture typically involve replacing NULLs, additional data formatting, data conversions, lookups, computations, and referential integrity verification. Precomputed calculations may require combining data from multiple sources, or data that has multiple degrees of granularity, such as allocating shipping costs to individual line items. Significant structural transformation: Because the data comes from multiple sources, the amount of structural changes, such as table alteration, is significant. Substantial research requirements to identify and match data elements. Heterogeneous ETL architectures are found more often in data warehouses than in data marts. ETL Development ETL development consists of two general phases: identifying and mapping data, and developing functional element implementations. Both phases should be carefully documented and stored in a central, easily accessible location, preferably in electronic form.
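The consistency consideration described earlier, treating an entire data load as a single logical transaction, can be sketched in Transact-SQL as follows. The table names are hypothetical, and the explicit transaction with @@ERROR checks is only one possible physical implementation of that logical requirement.

DECLARE @err int

BEGIN TRANSACTION

-- Step 1: load the fact table from staging.
INSERT INTO FactSales (DateKey, ProductKey, StoreKey, SalesAmount)
SELECT DateKey, ProductKey, StoreKey, SalesAmount
FROM stg_FactSales
SET @err = @@ERROR

-- Step 2: load a dimension table, but only if step 1 succeeded.
IF @err = 0
BEGIN
    INSERT INTO DimCustomer (CustomerCode, CustomerName)
    SELECT CustomerCode, CustomerName
    FROM stg_DimCustomer
    SET @err = @@ERROR
END

-- Either everything is committed, or the whole load is rolled back.
IF @err = 0
    COMMIT TRANSACTION
ELSE
    ROLLBACK TRANSACTION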

28 28 Identify and Map Data This phase of the development process identifies sources of data elements, the targets for those data elements in the data warehouse, and the transformations that must be applied to each data element as it is migrated from its source to its destination. High level data maps should be developed during the requirements gathering and data modeling phases of the data warehouse project. During the ETL system design and development process, these high level data maps are extended to thoroughly specify system details. Identify Source Data For some systems, identifying the source data may be as simple as identifying the server where the data is stored in an OLTP database and the storage type (SQL Server database, Microsoft Excel spreadsheet, or text file, among others). In other systems, identifying the source may mean preparing a detailed definition of the meaning of the data, such as a business rule, a definition of the data itself, such as decoding rules (O = On, for example), or even detailed documentation of a source system for which the system documentation has been lost or is not current. Identify Target Data Each data element is destined for a target in the data warehouse. A target for a data element may be an attribute in a dimension table, a numeric measure in a fact table, or a summarized total in an aggregation table. There may not be a one-to-one correspondence between a source data element and a data element in the data warehouse because the destination system may not contain the data at the same granularity as the source system. For example, a retail client may decide to roll data up to the SKU level by day rather than track individual line item data. The level of item detail that is stored in the fact table of the data warehouse is called the grain of the data. If the grain of the target does not match the grain of the source, the data must be summarized as it moves from the source to the target. Map Source Data to Target Data A data map defines the source fields of the data, the destination fields in the data warehouse and any data modifications that need to be accomplished to transform the data into the desired format for the data warehouse. Some transformations require aggregating the source data to a coarser granularity, such as summarizing individual item sales into daily sales by SKU. Other transformations involve altering the source data itself as it moves from the source to the target. Some transformations decode data into human readable form, such as replacing "1" with "on" and "0" with "off" in a status field. If two source systems encode data destined for the same target differently (for example, a second source system uses Yes and No for status), a separate transformation for each source system must be defined. Transformations must be documented and maintained in the data maps. The relationship between the source and target systems is maintained in a map that is referenced to execute the transformation of the data before it is loaded in the data warehouse. Develop Functional Elements Design and implementation of the four ETL functional elements, Extraction, Transformation, Loading,

29 29 and meta data logging, vary from system to system. There will often be multiple versions of each functional element. Each functional element contains steps that perform individual tasks, which may execute on one of several systems, such as the OLTP or legacy systems that contain the source data, the staging area database, or the data warehouse database. Various tools and techniques may be used to implement the steps in a single functional area, such as Transact-SQL, DTS packages, or custom applications developed in a programming language such as Microsoft Visual Basic. Steps that are discrete in one functional element may be combined in another. Extraction The extraction element may have one version to extract data from one OLTP data source, a different version for a different OLTP data source, and multiple versions for legacy systems and other sources of data. This element may include tasks that execute SELECT queries from the ETL staging database against a source OLTP system, or it may execute some tasks on the source system directly and others in the staging database, as in the case of generating a flat file from a legacy system and then importing it into tables in the ETL database. Regardless of methods or number of steps, the extraction element is responsible for extracting the required data from the source system and making it available for processing by the next element. Transformation Frequently a number of different transformations, implemented with various tools or techniques, are required to prepare data for loading into the data warehouse. Some transformations may be performed as data is extracted, such as an application on a legacy system that collects data from various internal files as it produces a text file of data to be further transformed. However, transformations are best accomplished in the ETL staging database, where data from several data sources may require varying transformations specific to the incoming data organization and format. Data from a single data source usually requires different transformations for different portions of the incoming data. Fact table data transformations may include summarization, and will always require surrogate dimension keys to be added to the fact records. Data destined for dimension tables in the data warehouse may require one process to accomplish one type of update to a changing dimension and a different process for another type of update. Transformations may be implemented using Transact-SQL, as is demonstrated in the code examples later in this chapter, DTS packages, or custom applications. Regardless of the number and variety of transformations and their implementations, the transformation element is responsible for preparing data for loading into the data warehouse. Loading The loading element typically has the least variety of task implementations. After the data from the various data sources has been extracted, transformed, and combined, the loading operation consists of inserting records into the various data warehouse database dimension and fact tables. Implementation

may vary in the loading tasks, such as using BULK INSERT, bcp, or the Bulk Copy API. The loading element is responsible for loading data into the data warehouse database tables.

Meta Data Logging
Meta data is collected from a number of the ETL operations. The meta data logging implementation for a particular ETL task will depend on how the task is implemented. For a task implemented by using a custom application, the application code may produce the meta data. For tasks implemented by using Transact-SQL, meta data can be captured with Transact-SQL statements in the task processes. The meta data logging element is responsible for capturing and recording meta data that documents the operation of the ETL functional areas and tasks, which includes identification of data that moves through the ETL system as well as the efficiency of ETL tasks.

Common Tasks
Each ETL functional element should contain tasks that perform the following functions, in addition to tasks specific to the functional area itself:
Confirm Success or Failure. A confirmation should be generated on the success or failure of the execution of the ETL processes. Ideally, this mechanism should exist for each task so that rollback mechanisms can be implemented to allow for incremental responses to errors.
Scheduling. ETL tasks should include the ability to be scheduled for execution. Scheduling mechanisms reduce repetitive manual operations and allow for maximum use of system resources during recurring periods of low activity.

Data Mining
Data Mining is an analytic process designed to explore data (usually large amounts of data - typically business or market related) in search of consistent patterns and/or systematic relationships between variables, and then to validate the findings by applying the detected patterns to new subsets of data. The ultimate goal of data mining is prediction - and predictive data mining is the most common type of data mining and the one with the most direct business applications. The process of data mining consists of three stages: (1) the initial exploration, (2) model building or pattern identification with validation/verification, and (3) deployment (i.e., the application of the model to new data in order to generate predictions).

Stage 1: Exploration. This stage usually starts with data preparation, which may involve cleaning data, data transformations, selecting subsets of records and - in the case of data sets with large numbers of variables ("fields") - performing some preliminary feature selection operations to bring the number of variables to a manageable range (depending on the statistical methods which are being considered). Then, depending on the nature of the analytic problem, this first stage of the process

31 31 of data mining may involve anywhere between a simple choice of straightforward predictors for a regression model, to elaborate exploratory analyses using a wide variety of graphical and statistical methods (see Exploratory Data Analysis (EDA)) in order to identify the most relevant variables and determine the complexity and/or the general nature of models that can be taken into account in the next stage. Stage 2: Model building and validation. This stage involves considering various models and choosing the best one based on their predictive performance (i.e., explaining the variability in question and producing stable results across samples). This may sound like a simple operation, but in fact, it sometimes involves a very elaborate process. There are a variety of techniques developed to achieve that goal - many of which are based on so-called "competitive evaluation of models," that is, applying different models to the same data set and then comparing their performance to choose the best. These techniques - which are often considered the core of predictive data mining - include: Bagging (Voting, Averaging), Boosting, Stacking (Stacked Generalizations), and Meta-Learning. Stage 3: Deployment. That final stage involves using the model selected as best in the previous stage and applying it to new data in order to generate predictions or estimates of the expected outcome. The concept of Data Mining is becoming increasingly popular as a business information management tool where it is expected to reveal knowledge structures that can guide decisions in conditions of limited certainty. Recently, there has been increased interest in developing new analytic techniques specifically designed to address the issues relevant to business Data Mining (e.g., Classification Trees), but Data Mining is still based on the conceptual principles of statistics including the traditional Exploratory Data Analysis (EDA) and modeling and it shares with them both some components of its general approaches and specific techniques. However, an important general difference in the focus and purpose between Data Mining and the traditional Exploratory Data Analysis (EDA) is that Data Mining is more oriented towards applications than the basic nature of the underlying phenomena. In other words, Data Mining is relatively less concerned with identifying the specific relations between the involved variables. For example, uncovering the nature of the underlying functions or the specific types of interactive, multivariate dependencies between variables are not the main goal of Data Mining. Instead, the focus is on producing a solution that can generate useful predictions. Therefore, Data Mining accepts among others a "black box" approach to data exploration or knowledge discovery and uses not only the traditional Exploratory Data Analysis (EDA) techniques, but also such techniques as Neural Networks which can generate valid predictions but are not capable of identifying the specific nature of the interrelations between the variables on which the predictions are based. Data Mining is often considered to be "a blend of statistics, AI [artificial intelligence], and data base research" (Pregibon, 1997, p. 8), which until very recently was not commonly recognized as a field of interest for statisticians, and was even considered by some "a dirty word in Statistics" (Pregibon, 1997, p. 8). 
Due to its applied importance, however, the field emerges as a rapidly growing and major area (also in statistics) where important theoretical advances are being made (see, for example, the recent annual International Conferences on Knowledge Discovery and Data Mining, co-hosted by the American Statistical Association). For information on Data Mining techniques, please review the summary topics included below in this chapter of the Electronic Statistics Textbook. There are numerous books that review the theory and practice of data mining; the following books offer a representative sample of recent general books on

data mining, representing a variety of approaches and perspectives:

Berry, M. J. A., & Linoff, G. S. (2000). Mastering data mining. New York: Wiley.
Edelstein, H. A. (1999). Introduction to data mining and knowledge discovery (3rd ed.). Potomac, MD: Two Crows Corp.
Fayyad, U. M., Piatetsky-Shapiro, G., Smyth, P., & Uthurusamy, R. (1996). Advances in knowledge discovery & data mining. Cambridge, MA: MIT Press.
Han, J., & Kamber, M. (2000). Data mining: Concepts and techniques. New York: Morgan Kaufmann.
Hastie, T., Tibshirani, R., & Friedman, J. H. (2001). The elements of statistical learning: Data mining, inference, and prediction. New York: Springer.
Pregibon, D. (1997). Data mining. Statistical Computing and Graphics, 7, 8.
Weiss, S. M., & Indurkhya, N. (1997). Predictive data mining: A practical guide. New York: Morgan Kaufmann.
Westphal, C., & Blaxton, T. (1998). Data mining solutions. New York: Wiley.
Witten, I. H., & Frank, E. (2000). Data mining. New York: Morgan Kaufmann.

Crucial Concepts in Data Mining

Bagging (Voting, Averaging)
The concept of bagging (voting for classification, averaging for regression-type problems with continuous dependent variables of interest) applies to the area of predictive data mining, to combine the predicted classifications (prediction) from multiple models, or from the same type of model for different learning data. It is also used to address the inherent instability of results when applying complex models to relatively small data sets. Suppose your data mining task is to build a model for predictive classification, and the dataset from which to train the model (learning data set, which contains observed classifications) is relatively small. You could repeatedly sub-sample (with replacement) from the dataset, and apply, for example, a tree classifier (e.g., C&RT or CHAID) to the successive samples. In practice, very different trees will often be grown for the different samples, illustrating the instability of models often evident with small datasets. One method of deriving a single prediction (for new observations) is to use all trees found in the different samples, and to apply some simple voting: the final classification is the one most often predicted by the different trees. Note that some weighted combination of predictions (weighted vote, weighted average) is also possible, and commonly used. A sophisticated (machine learning) algorithm for generating weights for weighted prediction or voting is the Boosting procedure.

Boosting
The concept of boosting applies to the area of predictive data mining, to generate multiple models or classifiers (for prediction or classification), and to derive weights to combine the predictions from those models into a single prediction or predicted classification (see also Bagging). A simple algorithm for boosting works like this: Start by applying some method (e.g., a tree classifier such as C&RT or CHAID) to the learning data, where each observation is assigned an equal weight.

Compute the predicted classifications, and apply weights to the observations in the learning sample that are inversely proportional to the accuracy of the classification. In other words, assign greater weight to those observations that were difficult to classify (where the misclassification rate was high), and lower weights to those that were easy to classify (where the misclassification rate was low). In the context of C&RT, for example, different misclassification costs (for the different classes) can be applied, inversely proportional to the accuracy of prediction in each class. Then apply the classifier again to the weighted data (or with different misclassification costs), and continue with the next iteration (application of the analysis method for classification to the re-weighted data). Boosting will generate a sequence of classifiers, where each consecutive classifier in the sequence is an "expert" in classifying observations that were not well classified by those preceding it. During deployment (for prediction or classification of new cases), the predictions from the different classifiers can then be combined (e.g., via voting, or some weighted voting procedure) to derive a single best prediction or classification. Note that boosting can also be applied to learning methods that do not explicitly support weights or misclassification costs. In that case, random sub-sampling can be applied to the learning data in the successive steps of the iterative boosting procedure, where the probability for selection of an observation into the subsample is inversely proportional to the accuracy of the prediction for that observation in the previous iteration.

CRISP
See Models for Data Mining.

Data Preparation (in Data Mining)
Data preparation and cleaning is an often neglected but extremely important step in the data mining process. The old saying "garbage-in-garbage-out" is particularly applicable to typical data mining projects, where large data sets collected via some automatic method (e.g., via the Web) serve as the input into the analyses. Often, the method by which the data were gathered was not tightly controlled, and so the data may contain out-of-range values (e.g., Income: -100), impossible data combinations (e.g., Gender: Male, Pregnant: Yes), and the like. Analyzing data that has not been carefully screened for such problems can produce highly misleading results, in particular in predictive data mining.

Data Reduction (for Data Mining)
The term Data Reduction in the context of data mining is usually applied to projects where the goal is to aggregate or amalgamate the information contained in large datasets into manageable (smaller) information nuggets. Data reduction methods can include simple tabulation, aggregation (computing descriptive statistics), or more sophisticated techniques like clustering, principal components analysis, etc. See also predictive data mining, drill-down analysis.

Deployment
The concept of deployment in predictive data mining refers to the application of a model for prediction or classification to new data. After a satisfactory model or set of models has been identified (trained) for a particular application, one usually wants to deploy those models so that predictions or predicted classifications can quickly be obtained for new data.
For example, a credit card company may want to deploy a trained model or set of models (e.g., neural networks, meta-learner) to quickly identify transactions which have a high probability of being fraudulent.
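To summarize the bagging and boosting procedures described above in symbols (a schematic restatement, not notation taken from this text): suppose B classifiers f_1, ..., f_B have been fit to different (re-)samples of the learning data. Bagging combines them by a simple vote (classification) or an average (regression), while boosting combines M classifiers with weights alpha_m that reward accurate classifiers. The weight formula shown is the AdaBoost choice, one common instantiation of the general idea, written here for a two-class problem coded as +1/-1.

\hat{y}_{\mathrm{vote}}(x) = \arg\max_{c} \sum_{b=1}^{B} \mathbf{1}\{ f_b(x) = c \},
\qquad
\hat{y}_{\mathrm{avg}}(x) = \frac{1}{B} \sum_{b=1}^{B} f_b(x)

\hat{y}_{\mathrm{boost}}(x) = \mathrm{sign}\!\left( \sum_{m=1}^{M} \alpha_m f_m(x) \right),
\qquad
\alpha_m = \frac{1}{2} \ln \frac{1 - \varepsilon_m}{\varepsilon_m}

where \varepsilon_m is the weighted misclassification rate of classifier f_m on the re-weighted learning sample.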

Drill-Down Analysis
The concept of drill-down analysis applies to the area of data mining, to denote the interactive exploration of data, in particular of large databases. The process of drill-down analysis begins by considering some simple break-downs of the data by a few variables of interest (e.g., gender, geographic region, etc.). Various statistics, tables, histograms, and other graphical summaries can be computed for each group. Next one may want to "drill down" to expose and further analyze the data "underneath" one of the categorizations; for example, one might want to further review the data for males from the mid-west. Again, various statistical and graphical summaries can be computed for those cases only, which might suggest further break-downs by other variables (e.g., income, age, etc.). At the lowest ("bottom") level are the raw data: for example, you may want to review the addresses of male customers from one region, for a certain income group, etc., and to offer to those customers some particular services of particular utility to that group. (A small query sketch of this kind of successive breakdown appears after the Machine Learning entry below.)

Feature Selection
One of the preliminary stages in predictive data mining, when the data set includes more variables than could be included (or would be efficient to include) in the actual model building phase (or even in initial exploratory operations), is to select predictors from a large list of candidates. For example, when data are collected via automated (computerized) methods, it is not uncommon that measurements are recorded for thousands or hundreds of thousands (or more) of predictors. The standard analytic methods for predictive data mining, such as neural network analyses, classification and regression trees, generalized linear models, or general linear models, become impractical when the number of predictors exceeds a few hundred variables. Feature selection selects a subset of predictors from a large list of candidate predictors without assuming that the relationships between the predictors and the dependent or outcome variables of interest are linear, or even monotone. Therefore, it is used as a pre-processor for predictive data mining, to select manageable sets of predictors that are likely related to the dependent (outcome) variables of interest, for further analyses with any of the other methods for regression and classification.

Machine Learning
Machine learning, computational learning theory, and similar terms are often used in the context of Data Mining to denote the application of generic model-fitting or classification algorithms for predictive data mining. Unlike traditional statistical data analysis, which is usually concerned with the estimation of population parameters by statistical inference, the emphasis in data mining (and machine learning) is usually on the accuracy of prediction (predicted classification), regardless of whether or not the "models" or techniques that are used to generate the predictions are interpretable or open to simple explanation. Good examples of this type of technique often applied to predictive data mining are neural networks or meta-learning techniques such as boosting, etc. These methods usually involve the fitting of very complex "generic" models that are not related to any reasoning or theoretical understanding of underlying causal processes; instead, these techniques can be shown to generate accurate predictions or classifications in cross-validation samples.
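The drill-down sequence described above maps naturally onto successive queries. A minimal Transact-SQL sketch follows; the Customers table and its columns (Region, Gender, IncomeBand, and so on) are hypothetical.

-- Top level: break down customers by region and gender.
SELECT Region, Gender, COUNT(*) AS CustomerCount, AVG(Income) AS AvgIncome
FROM Customers
GROUP BY Region, Gender

-- Drill down: male customers in the Midwest, further broken down by income band.
SELECT IncomeBand, COUNT(*) AS CustomerCount
FROM Customers
WHERE Region = 'Midwest' AND Gender = 'M'
GROUP BY IncomeBand

-- Bottom level: the raw rows for one subgroup of interest.
SELECT CustomerName, Address, City
FROM Customers
WHERE Region = 'Midwest' AND Gender = 'M' AND IncomeBand = '50-75K'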
Meta-Learning The concept of meta-learning applies to the area of predictive data mining, to combine the predictions from multiple models. It is particularly useful when the types of models included in the project are very different. In this context, this procedure is also referred to as Stacking (Stacked Generalization).

Suppose your data mining project includes tree classifiers, such as C&RT and CHAID, linear discriminant analysis (e.g., see GDA), and Neural Networks. Each computes predicted classifications for a cross-validation sample, from which overall goodness-of-fit statistics (e.g., misclassification rates) can be computed. Experience has shown that combining the predictions from multiple methods often yields more accurate predictions than can be derived from any one method (e.g., see Witten and Frank, 2000). The predictions from different classifiers can be used as input into a meta-learner, which will attempt to combine the predictions to create a final best predicted classification. So, for example, the predicted classifications from the tree classifiers, linear model, and the neural network classifier(s) can be used as input variables into a neural network meta-classifier, which will attempt to "learn" from the data how to combine the predictions from the different models to yield maximum classification accuracy. One can apply meta-learners to the results from different meta-learners to create "meta-meta" learners, and so on; however, in practice such an exponential increase in the amount of data processing, in order to derive an accurate prediction, will yield less and less marginal utility.

Models for Data Mining
In the business environment, complex data mining projects may require the coordinated efforts of various experts, stakeholders, or departments throughout an entire organization. In the data mining literature, various "general frameworks" have been proposed to serve as blueprints for how to organize the process of gathering data, analyzing data, disseminating results, implementing results, and monitoring improvements.

One such model, CRISP (Cross-Industry Standard Process for data mining), was proposed in the mid-1990s by a European consortium of companies to serve as a non-proprietary standard process model for data mining. This general approach postulates a (perhaps not particularly controversial) general sequence of steps for data mining projects, running from business and data understanding through data preparation, modeling, and evaluation to deployment.

Another approach - the Six Sigma methodology - is a well-structured, data-driven methodology for eliminating defects, waste, or quality control problems of all kinds in manufacturing, service delivery, management, and other business activities. This model has recently become very popular (due to its successful implementations) in various American industries, and it appears to be gaining favor worldwide. It postulates a sequence of so-called DMAIC steps (Define, Measure, Analyze, Improve, Control) that grew up from the manufacturing, quality improvement, and process control traditions and is

36 36 particularly well suited to production environments (including "production of services," i.e., service industries). Another framework of this kind (actually somewhat similar to Six Sigma) is the approach proposed by SAS Institute called SEMMA - - which is focusing more on the technical activities typically involved in a data mining project. All of these models are concerned with the process of how to integrate data mining methodology into an organization, how to "convert data into information," how to involve important stake-holders, and how to disseminate the information in a form that can easily be converted by stake-holders into resources for strategic decision making. Some software tools for data mining are specifically designed and documented to fit into one of these specific frameworks. The general underlying philosophy of StatSoft's STATISTICA Data Miner is to provide a flexible data mining workbench that can be integrated into any organization, industry, or organizational culture, regardless of the general data mining process-model that the organization chooses to adopt. For example, STATISTICA Data Miner can include the complete set of (specific) necessary tools for ongoing company wide Six Sigma quality control efforts, and users can take advantage of its (still optional) DMAIC-centric user interface for industrial data mining tools. It can equally well be integrated into ongoing marketing research, CRM (Customer Relationship Management) projects, etc. that follow either the CRISP or SEMMA approach - it fits both of them perfectly well without favoring either one. Also, STATISTICA Data Miner offers all the advantages of a general data mining oriented "development kit" that includes easy to use tools for incorporating into your projects not only such components as custom database gateway solutions, prompted interactive queries, or proprietary algorithms, but also systems of access privileges, workgroup management, and other collaborative work tools that allow you to design large scale, enterprise-wide systems (e.g., following the CRISP, SEMMA, or a combination of both models) that involve your entire organization. Predictive Data Mining The term Predictive Data Mining is usually applied to identify data mining projects with the goal to identify a statistical or neural network model or set of models that can be used to predict some response of interest. For example, a credit card company may want to engage in predictive data mining, to derive a (trained) model or set of models (e.g., neural networks, meta-learner) that can quickly identify transactions which have a high probability of being fraudulent. Other types of data mining projects may be more exploratory in nature (e.g., to identify cluster or segments of customers), in which case drill-down descriptive and exploratory methods would be applied. Data reduction is another possible objective for data mining (e.g., to aggregate or amalgamate the information in very large data sets into useful and manageable chunks). SEMMA See Models for Data Mining. Stacked Generalization See Stacking.
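In the same schematic notation used for bagging and boosting above, the meta-learning (stacking) combination described in the Meta-Learning entry can be written as a second-level model applied to the first-level predictions; this is a summary of the description in the text, not a formula taken from it.

\hat{y}(x) = g\big( f_1(x), f_2(x), \ldots, f_K(x) \big)

where f_1, ..., f_K are the base classifiers (for example, trees, a linear discriminant model, and neural networks) and the meta-learner g is itself fit to the pairs ((f_1(x_i), ..., f_K(x_i)), y_i) computed on cross-validation samples, so that g learns to combine predictions the base models made for cases they did not see during their own training.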

37 37 Stacking (Stacked Generalization) The concept of stacking (short for Stacked Generalization) applies to the area of predictive data mining, to combine the predictions from multiple models. It is particularly useful when the types of models included in the project are very different. Suppose your data mining project includes tree classifiers, such as C&RT or CHAID, linear discriminant analysis (e.g., see GDA), and Neural Networks. Each computes predicted classifications for a cross validation sample, from which overall goodness-of-fit statistics (e.g., misclassification rates) can be computed. Experience has shown that combining the predictions from multiple methods often yields more accurate predictions than can be derived from any one method (e.g., see Witten and Frank, 2000). In stacking, the predictions from different classifiers are used as input into a metalearner, which attempts to combine the predictions to create a final best predicted classification. So, for example, the predicted classifications from the tree classifiers, linear model, and the neural network classifier(s) can be used as input variables into a neural network meta-classifier, which will attempt to "learn" from the data how to combine the predictions from the different models to yield maximum classification accuracy. Other methods for combining the prediction from multiple models or methods (e.g., from multiple datasets used for learning) are Boosting and Bagging (Voting). Text Mining While Data Mining is typically concerned with the detection of patterns in numeric data, very often important (e.g., critical to business) information is stored in the form of text. Unlike numeric data, text is often amorphous, and difficult to deal with. Text mining generally consists of the analysis of (multiple) text documents by extracting key phrases, concepts, etc. and the preparation of the text processed in that manner for further analyses with numeric data mining techniques (e.g., to determine co-occurrences of concepts, key phrases, names, addresses, product names, etc.). Data Transformation Services (DTS) in SQL Server 2000 Most organizations have multiple formats and locations in which data is stored. To support decisionmaking, improve system performance, or upgrade existing systems, data often must be moved from one data storage location to another. Microsoft SQL Server 2000 Data Transformation Services (DTS) provides a set of tools that lets you extract, transform, and consolidate data from disparate sources into single or multiple destinations. By using DTS tools, you can create custom data movement solutions tailored to the specialized needs of your organization, as shown in the following scenarios: You have deployed a database application on an older version of SQL Server or another platform, such as Microsoft Access. A new version of your application requires SQL Server 2000, and requires you to change your database schema and convert some data types.

38 38 To copy and transform your data, you can build a DTS solution that copies database objects from the original data source into a SQL Server 2000 database, while at the same time remapping columns and changing data types. You can run this solution using DTS tools, or you can embed the solution within your application. You must consolidate several key Microsoft Excel spreadsheets into a SQL Server database. Several departments create the spreadsheets at the end of the month, but there is no set schedule for completion of all the spreadsheets. To consolidate the spreadsheet data, you can build a DTS solution that runs when a message is sent to a message queue. The message triggers DTS to extract data from the spreadsheet, perform any defined transformations, and load the data into a SQL Server database. Your data warehouse contains historical data about your business operations, and you use Microsoft SQL Server 2000 Analysis Services to summarize the data. Your data warehouse needs to be updated nightly from your Online Transaction Processing (OLTP) database. Your OLTP system is in-use 24-hours a day, and performance is critical. You can build a DTS solution that uses the file transfer protocol (FTP) to move data files onto a local drive, loads the data into a fact table, and aggregates the data using Analysis Services. You can schedule the DTS solution to run every night, and you can use the new DTS logging options to track how long this process takes, allowing you to analyze performance over time. What Is DTS? DTS is a set of tools you can use to import, export, and transform heterogeneous data between one or more data sources, such as Microsoft SQL Server, Microsoft Excel, or Microsoft Access. Connectivity is provided through OLE DB, an open-standard for data access. ODBC (Open Database Connectivity) data sources are supported through the OLE DB Provider for ODBC. You create a DTS solution as one or more packages. Each package may contain an organized set of tasks that define work to be performed, transformations on data and objects, workflow constraints that define task execution, and connections to data sources and destinations. DTS packages also provide services, such as logging package execution details, controlling transactions, and handling global variables. These tools are available for creating and executing DTS packages: The Import/Export Wizard is for building relatively simple DTS packages, and supports data migration and simple transformations. The DTS Designer graphically implements the DTS object model, allowing you to create DTS packages with a wide range of functionality. DTSRun is a command-prompt utility used to execute existing DTS packages.

39 39 DTSRunUI is a graphical interface to DTSRun, which also allows the passing of global variables and the generation of command lines. SQLAgent is not a DTS application; however, it is used by DTS to schedule package execution. Using the DTS object model, you also can create and run packages programmatically, build custom tasks, and build custom transformations. What's New in DTS? Microsoft SQL Server 2000 introduces several DTS enhancements and new features: New DTS tasks include the FTP task, the Execute Package task, the Dynamic Properties task, and the Message Queue task. Enhanced logging saves information for each package execution, allowing you to maintain a complete execution history and view information for each process within a task. You can generate exception files, which contain rows of data that could not be processed due to errors. You can save DTS packages as Microsoft Visual Basic files. A new multiphase data pump allows advanced users to customize the operation of data transformations at various stages. Also, you can use global variables as input parameters for queries. You can use parameterized source queries in DTS transformation tasks and the Execute SQL task. You can use the Execute Package task to dynamically assign the values of global variables from a parent package to a child package. Using DTS Designer DTS Designer graphically implements the DTS object model, allowing you to graphically create DTS packages. You can use DTS Designer to: Create a simple package containing one or more steps. Create a package that includes complex workflows that include multiple steps using conditional logic, event-driven code, or multiple connections to data sources. Edit an existing package. The DTS Designer interface consists of a work area for building packages, toolbars containing package elements that you can drag onto the design sheet, and menus containing workflows and package

management commands.

Figure 1: DTS Designer interface

By dragging connections and tasks onto the design sheet, and specifying the order of execution with workflows, you can easily build powerful DTS packages using DTS Designer. The following sections define tasks, workflows, connections, and transformations, and illustrate the ease of using DTS Designer to implement a DTS solution.

Tasks: Defining Steps in a Package
A DTS package usually includes one or more tasks. Each task defines a work item that may be performed during package execution. You can use tasks to transform data, to copy and manage data, and to run tasks as jobs from within a package.

You also can create custom tasks programmatically, and then integrate them into DTS Designer using the Register Custom Task command. To illustrate the use of tasks, here is a simple DTS package with two tasks: a Microsoft ActiveX Script task and a Send Mail task:

Figure 2: DTS Package with two tasks

The ActiveX Script task can host any ActiveX scripting engine, including Microsoft Visual Basic Scripting Edition (VBScript), Microsoft JScript, or ActiveState ActivePerl, which you can download from the ActiveState website. The Send Mail task may send a message indicating that the package has run. Note that there is no order to these tasks yet. When the package executes, the ActiveX Script task and the Send Mail task run concurrently.

Workflows: Setting Task Precedence
When you define a group of tasks, there is usually an order in which the tasks should be performed. When tasks have an order, each task becomes a step of a process. In DTS Designer, you manipulate tasks on the DTS Designer design sheet and use precedence constraints to control the sequence in which the tasks execute. Precedence constraints sequentially link tasks in a package. The following list shows the types of precedence constraints you can use in DTS.

On Completion (blue arrow): If you want Task 2 to wait until Task 1 completes, regardless of the outcome, link Task 1 to Task 2 with an On Completion precedence constraint.
On Success (green arrow): If you want Task 2 to wait until Task 1 has successfully completed, link Task 1 to Task 2 with an On Success precedence constraint.
On Failure (red arrow): If you want Task 2 to begin execution only if Task 1 fails to execute successfully, link Task 1 to Task 2 with an On Failure precedence constraint.

The following illustration shows the ActiveX Script task and the Send Mail task with an On Completion

43 43 precedence constraint. When the Active X Script task completes, with either success or failure, the Send Mail task runs. Figure 3: ActiveX Script task and the Send Mail task with an On Completion precedence constraint You can configure separate Send Mail tasks, one for an On Success constraint and one for an On Failure constraint. The two Send Mail tasks can send different messages based on the success or failure of the ActiveX script. Figure 4: Mail tasks You also can issue multiple precedence constraints on a task. For example, the Send Mail task "Admin Notification" could have both an On Success constraint from Script #1 and an On Failure constraint from Script #2. In these situations, DTS assumes a logical "AND" relationship. Therefore, Script #1 must successfully execute and Script #2 must fail for the Admin Notification message to be sent. Figure 5: Example of multiple precedence constraints on a task Connections: Accessing and Moving Data To successfully execute DTS tasks that copy and transform data, a DTS package must establish valid connections to its source and destination data and to any additional data sources, such as lookup tables.

When creating a package, you configure connections by selecting a connection type from a list of available OLE DB providers and ODBC drivers. The types of connections that are available are Microsoft Data Access Components (MDAC) drivers, Microsoft Jet drivers, and other drivers; DTS allows you to use any OLE DB connection. The icons on the Connections toolbar provide easy access to common connections. The following illustration shows a package with two connections. Data is being copied from an Access database (the source connection) into a SQL Server production database (the destination connection).

Figure 6: Example of a package with two connections

The first step in this package is an Execute SQL task, which checks to see if the destination table already exists. If so, the table is dropped and re-created. On the success of the Execute SQL task, data is copied to the SQL Server database in Step 2. If the copy operation fails, an e-mail message is sent in Step 3.

The Data Pump: Transforming Data
The DTS data pump is a DTS object that drives the import, export, and transformation of data. The data pump is used during the execution of the Transform Data, Data Driven Query, and Parallel Data Pump tasks. These tasks work by creating rowsets on the source and destination connections, then creating an instance of the data pump to move rows between the source and destination.
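The Execute SQL task in Step 1 above typically contains a statement along the following lines; the ImportedOrders table and its columns are illustrative, not taken from the chapter.

-- Drop and re-create the destination table so the copy starts from a clean slate.
IF EXISTS (SELECT * FROM sysobjects WHERE name = 'ImportedOrders' AND type = 'U')
    DROP TABLE ImportedOrders

CREATE TABLE ImportedOrders (
    OrderID    int      NOT NULL,
    CustomerID nchar(5) NOT NULL,
    OrderDate  datetime NULL,
    Freight    money    NULL
)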

Transformations occur on each row as the row is copied. In the following illustration, a Transform Data task is used between the Access DB task and the SQL Production DB task in Step 2. The Transform Data task is the gray arrow between the connections.

Figure 7: Example of a Transform Data task

To define the data gathered from the source connection, you can build a query for the transformation tasks. DTS supports parameterized queries, which allow you to define query values when the query is executed. You can type a query into the task's Properties dialog box, or use the Data Transformation Services Query Designer, a tool for graphically building queries for DTS tasks. In the following illustration, the Query Designer is used to build a query that joins three tables in the pubs database.

Figure 8: Data Transformation Services Query Designer interface

In the transformation tasks, you also define any changes to be made to data. The following list describes the built-in transformations that DTS provides.

Copy Column: Use to copy data directly from source to destination columns, without any transformations applied to the data.
ActiveX Script: Use to build custom transformations. Note that since the transformation occurs on a row-by-row basis, an ActiveX script can affect the execution speed of a DTS package.
DateTime String: Use to convert a date or time in a source column to a different format in the destination column.
Lowercase String: Use to convert a source column to lowercase characters and, if necessary, to the destination data type.
Uppercase String: Use to convert a source column to all uppercase characters and, if necessary, to the destination data type.
Middle of String: Use to extract a substring from the source column, transform it, and copy the result to the destination column.
Trim String: Use to remove leading, trailing, and embedded white space from a string in the source column and copy the result to the destination column.
Read File: Use to open the contents of a file, whose name is specified in a source column, and copy the contents into a destination column.
Write File: Use to copy the contents of a source column (data column) to a file whose path is specified by a second source column (file name column).

You can also create your own custom transformations programmatically. The quickest way to build custom transformations is to use the Active Template Library (ATL) custom transformation template, which is included in the SQL Server 2000 DTS sample programs.

Data Pump Error Logging
A new method of logging transformation errors is available in SQL Server 2000. You can define three exception log files for use during package execution: an error text file, a source error rows file, and a destination error rows file. General error information is written to the error text file. If a transformation fails, then the source row is in error, and that row is written to the source error rows file.

If an insert fails, then the destination row is in error, and that row is written to the destination error rows file. The exception log files are defined in the tasks that transform data. Each transformation task has its own log files.

Data Pump Phases
By default, the data pump has one phase: row transformation. That phase is what you configure when mapping column-level transformations in the Transform Data task, Data Driven Query task, and Parallel Data Pump task, without selecting a phase. Multiple data pump phases are new in SQL Server 2000. By selecting the multiphase data pump option in SQL Server Enterprise Manager, you can access the data pump at several points during its operation and add functionality. When copying a row of data from source to a destination, the data pump follows the basic process shown in the following illustration.

Figure 9: Data pump process

After the data pump processes the last row of data, the task is finished and the data pump operation terminates. Advanced users who want to add functionality to a package so that it supports any data pump phase can do so by writing an ActiveX script phase function for each data pump phase to be customized. If you use ActiveX script functions to customize data pump phases, no additional code outside of the package is required.
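As an illustration of the parameterized source queries mentioned earlier, a Transform Data task against the pubs sample database might use a query like the following, where the ? placeholder is bound to a global variable or supplied value when the package runs. The exact join is an assumption for illustration, since the chapter does not reproduce the query behind Figure 8.

-- Parameterized source query joining three pubs tables.
SELECT a.au_lname, a.au_fname, t.title, t.pubdate
FROM authors AS a
JOIN titleauthor AS ta ON ta.au_id = a.au_id
JOIN titles AS t ON t.title_id = ta.title_id
WHERE t.pubdate >= ?    -- parameter value supplied at execution time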


OLAP Introduction and Overview 1 CHAPTER 1 OLAP Introduction and Overview What Is OLAP? 1 Data Storage and Access 1 Benefits of OLAP 2 What Is a Cube? 2 Understanding the Cube Structure 3 What Is SAS OLAP Server? 3 About Cube Metadata

More information

Benefits of Automating Data Warehousing

Benefits of Automating Data Warehousing Benefits of Automating Data Warehousing Introduction Data warehousing can be defined as: A copy of data specifically structured for querying and reporting. In most cases, the data is transactional data

More information

A Star Schema Has One To Many Relationship Between A Dimension And Fact Table

A Star Schema Has One To Many Relationship Between A Dimension And Fact Table A Star Schema Has One To Many Relationship Between A Dimension And Fact Table Many organizations implement star and snowflake schema data warehouse The fact table has foreign key relationships to one or

More information

This module presents the star schema, an alternative to 3NF schemas intended for analytical databases.

This module presents the star schema, an alternative to 3NF schemas intended for analytical databases. Topic 3.3: Star Schema Design This module presents the star schema, an alternative to 3NF schemas intended for analytical databases. Star Schema Overview The star schema is a simple database architecture

More information

Data Warehouse and Data Mining

Data Warehouse and Data Mining Data Warehouse and Data Mining Lecture No. 07 Terminologies Naeem Ahmed Email: naeemmahoto@gmail.com Department of Software Engineering Mehran Univeristy of Engineering and Technology Jamshoro Database

More information

WKU-MIS-B10 Data Management: Warehousing, Analyzing, Mining, and Visualization. Management Information Systems

WKU-MIS-B10 Data Management: Warehousing, Analyzing, Mining, and Visualization. Management Information Systems Management Information Systems Management Information Systems B10. Data Management: Warehousing, Analyzing, Mining, and Visualization Code: 166137-01+02 Course: Management Information Systems Period: Spring

More information

Data Warehouse Design Using Row and Column Data Distribution

Data Warehouse Design Using Row and Column Data Distribution Int'l Conf. Information and Knowledge Engineering IKE'15 55 Data Warehouse Design Using Row and Column Data Distribution Behrooz Seyed-Abbassi and Vivekanand Madesi School of Computing, University of North

More information

Rocky Mountain Technology Ventures

Rocky Mountain Technology Ventures Rocky Mountain Technology Ventures Comparing and Contrasting Online Analytical Processing (OLAP) and Online Transactional Processing (OLTP) Architectures 3/19/2006 Introduction One of the most important

More information

Information Management course

Information Management course Università degli Studi di Milano Master Degree in Computer Science Information Management course Teacher: Alberto Ceselli Lecture 05(b) : 23/10/2012 Data Mining: Concepts and Techniques (3 rd ed.) Chapter

More information

The strategic advantage of OLAP and multidimensional analysis

The strategic advantage of OLAP and multidimensional analysis IBM Software Business Analytics Cognos Enterprise The strategic advantage of OLAP and multidimensional analysis 2 The strategic advantage of OLAP and multidimensional analysis Overview Online analytical

More information

Decision Support, Data Warehousing, and OLAP

Decision Support, Data Warehousing, and OLAP Decision Support, Data Warehousing, and OLAP : Contents Terminology : OLAP vs. OLTP Data Warehousing Architecture Technologies References 1 Decision Support and OLAP Information technology to help knowledge

More information

Oracle Database 11g: Data Warehousing Fundamentals

Oracle Database 11g: Data Warehousing Fundamentals Oracle Database 11g: Data Warehousing Fundamentals Duration: 3 Days What you will learn This Oracle Database 11g: Data Warehousing Fundamentals training will teach you about the basic concepts of a data

More information

Summary of Last Chapter. Course Content. Chapter 2 Objectives. Data Warehouse and OLAP Outline. Incentive for a Data Warehouse

Summary of Last Chapter. Course Content. Chapter 2 Objectives. Data Warehouse and OLAP Outline. Incentive for a Data Warehouse Principles of Knowledge Discovery in bases Fall 1999 Chapter 2: Warehousing and Dr. Osmar R. Zaïane University of Alberta Dr. Osmar R. Zaïane, 1999 Principles of Knowledge Discovery in bases University

More information

Managing Data Resources

Managing Data Resources Chapter 7 Managing Data Resources 7.1 2006 by Prentice Hall OBJECTIVES Describe basic file organization concepts and the problems of managing data resources in a traditional file environment Describe how

More information

1. Attempt any two of the following: 10 a. State and justify the characteristics of a Data Warehouse with suitable examples.

1. Attempt any two of the following: 10 a. State and justify the characteristics of a Data Warehouse with suitable examples. Instructions to the Examiners: 1. May the Examiners not look for exact words from the text book in the Answers. 2. May any valid example be accepted - example may or may not be from the text book 1. Attempt

More information

HANA Performance. Efficient Speed and Scale-out for Real-time BI

HANA Performance. Efficient Speed and Scale-out for Real-time BI HANA Performance Efficient Speed and Scale-out for Real-time BI 1 HANA Performance: Efficient Speed and Scale-out for Real-time BI Introduction SAP HANA enables organizations to optimize their business

More information

Decision Support Systems aka Analytical Systems

Decision Support Systems aka Analytical Systems Decision Support Systems aka Analytical Systems Decision Support Systems Systems that are used to transform data into information, to manage the organization: OLAP vs OLTP OLTP vs OLAP Transactions Analysis

More information

Star Schema מחסני נתונים. Star Schema Example 1. Star Schema

Star Schema מחסני נתונים. Star Schema Example 1. Star Schema Star Schema In a star schema, each dimension table has a single-part primary key that links to one part of the multipart primary key in the fact table. מחסני נתונים תכנון לוגי של מסד נתונים רב מימדי באמצעות

More information

The Six Principles of BW Data Validation

The Six Principles of BW Data Validation The Problem The Six Principles of BW Data Validation Users do not trust the data in your BW system. The Cause By their nature, data warehouses store large volumes of data. For analytical purposes, the

More information

REPORTING AND QUERY TOOLS AND APPLICATIONS

REPORTING AND QUERY TOOLS AND APPLICATIONS Tool Categories: REPORTING AND QUERY TOOLS AND APPLICATIONS There are five categories of decision support tools Reporting Managed query Executive information system OLAP Data Mining Reporting Tools Production

More information

Data Mining. Data warehousing. Hamid Beigy. Sharif University of Technology. Fall 1394

Data Mining. Data warehousing. Hamid Beigy. Sharif University of Technology. Fall 1394 Data Mining Data warehousing Hamid Beigy Sharif University of Technology Fall 1394 Hamid Beigy (Sharif University of Technology) Data Mining Fall 1394 1 / 22 Table of contents 1 Introduction 2 Data warehousing

More information

Guide Users along Information Pathways and Surf through the Data

Guide Users along Information Pathways and Surf through the Data Guide Users along Information Pathways and Surf through the Data Stephen Overton, Overton Technologies, LLC, Raleigh, NC ABSTRACT Business information can be consumed many ways using the SAS Enterprise

More information

DHANALAKSHMI COLLEGE OF ENGINEERING, CHENNAI

DHANALAKSHMI COLLEGE OF ENGINEERING, CHENNAI DHANALAKSHMI COLLEGE OF ENGINEERING, CHENNAI Department of Information Technology IT6702 Data Warehousing & Data Mining Anna University 2 & 16 Mark Questions & Answers Year / Semester: IV / VII Regulation:

More information

Full file at

Full file at Chapter 2 Data Warehousing True-False Questions 1. A real-time, enterprise-level data warehouse combined with a strategy for its use in decision support can leverage data to provide massive financial benefits

More information

Segregating Data Within Databases for Performance Prepared by Bill Hulsizer

Segregating Data Within Databases for Performance Prepared by Bill Hulsizer Segregating Data Within Databases for Performance Prepared by Bill Hulsizer When designing databases, segregating data within tables is usually important and sometimes very important. The higher the volume

More information

Overview. Introduction to Data Warehousing and Business Intelligence. BI Is Important. What is Business Intelligence (BI)?

Overview. Introduction to Data Warehousing and Business Intelligence. BI Is Important. What is Business Intelligence (BI)? Introduction to Data Warehousing and Business Intelligence Overview Why Business Intelligence? Data analysis problems Data Warehouse (DW) introduction A tour of the coming DW lectures DW Applications Loosely

More information

Pro Tech protechtraining.com

Pro Tech protechtraining.com Course Summary Description This course provides students with the skills necessary to plan, design, build, and run the ETL processes which are needed to build and maintain a data warehouse. It is based

More information

CT75 DATA WAREHOUSING AND DATA MINING DEC 2015

CT75 DATA WAREHOUSING AND DATA MINING DEC 2015 Q.1 a. Briefly explain data granularity with the help of example Data Granularity: The single most important aspect and issue of the design of the data warehouse is the issue of granularity. It refers

More information

Advanced Data Management Technologies Written Exam

Advanced Data Management Technologies Written Exam Advanced Data Management Technologies Written Exam 02.02.2016 First name Student number Last name Signature Instructions for Students Write your name, student number, and signature on the exam sheet. This

More information

Question Bank. 4) It is the source of information later delivered to data marts.

Question Bank. 4) It is the source of information later delivered to data marts. Question Bank Year: 2016-2017 Subject Dept: CS Semester: First Subject Name: Data Mining. Q1) What is data warehouse? ANS. A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile

More information

Q1) Describe business intelligence system development phases? (6 marks)

Q1) Describe business intelligence system development phases? (6 marks) BUISINESS ANALYTICS AND INTELLIGENCE SOLVED QUESTIONS Q1) Describe business intelligence system development phases? (6 marks) The 4 phases of BI system development are as follow: Analysis phase Design

More information

ALTERNATE SCHEMA DIAGRAMMING METHODS DECISION SUPPORT SYSTEMS. CS121: Relational Databases Fall 2017 Lecture 22

ALTERNATE SCHEMA DIAGRAMMING METHODS DECISION SUPPORT SYSTEMS. CS121: Relational Databases Fall 2017 Lecture 22 ALTERNATE SCHEMA DIAGRAMMING METHODS DECISION SUPPORT SYSTEMS CS121: Relational Databases Fall 2017 Lecture 22 E-R Diagramming 2 E-R diagramming techniques used in book are similar to ones used in industry

More information

Data Warehousing and OLAP

Data Warehousing and OLAP Data Warehousing and OLAP INFO 330 Slides courtesy of Mirek Riedewald Motivation Large retailer Several databases: inventory, personnel, sales etc. High volume of updates Management requirements Efficient

More information

COGNOS (R) 8 GUIDELINES FOR MODELING METADATA FRAMEWORK MANAGER. Cognos(R) 8 Business Intelligence Readme Guidelines for Modeling Metadata

COGNOS (R) 8 GUIDELINES FOR MODELING METADATA FRAMEWORK MANAGER. Cognos(R) 8 Business Intelligence Readme Guidelines for Modeling Metadata COGNOS (R) 8 FRAMEWORK MANAGER GUIDELINES FOR MODELING METADATA Cognos(R) 8 Business Intelligence Readme Guidelines for Modeling Metadata GUIDELINES FOR MODELING METADATA THE NEXT LEVEL OF PERFORMANCE

More information

Microsoft SQL Server Training Course Catalogue. Learning Solutions

Microsoft SQL Server Training Course Catalogue. Learning Solutions Training Course Catalogue Learning Solutions Querying SQL Server 2000 with Transact-SQL Course No: MS2071 Two days Instructor-led-Classroom 2000 The goal of this course is to provide students with the

More information

collection of data that is used primarily in organizational decision making.

collection of data that is used primarily in organizational decision making. Data Warehousing A data warehouse is a special purpose database. Classic databases are generally used to model some enterprise. Most often they are used to support transactions, a process that is referred

More information

DATA MINING TRANSACTION

DATA MINING TRANSACTION DATA MINING Data Mining is the process of extracting patterns from data. Data mining is seen as an increasingly important tool by modern business to transform data into an informational advantage. It is

More information

Business Intelligence and Decision Support Systems

Business Intelligence and Decision Support Systems Business Intelligence and Decision Support Systems (9 th Ed., Prentice Hall) Chapter 8: Data Warehousing Learning Objectives Understand the basic definitions and concepts of data warehouses Learn different

More information

IT DATA WAREHOUSING AND DATA MINING UNIT-2 BUSINESS ANALYSIS

IT DATA WAREHOUSING AND DATA MINING UNIT-2 BUSINESS ANALYSIS PART A 1. What are production reporting tools? Give examples. (May/June 2013) Production reporting tools will let companies generate regular operational reports or support high-volume batch jobs. Such

More information

Logical Design A logical design is conceptual and abstract. It is not necessary to deal with the physical implementation details at this stage.

Logical Design A logical design is conceptual and abstract. It is not necessary to deal with the physical implementation details at this stage. Logical Design A logical design is conceptual and abstract. It is not necessary to deal with the physical implementation details at this stage. You need to only define the types of information specified

More information

A Multi-Dimensional Data Model

A Multi-Dimensional Data Model A Multi-Dimensional Data Model A Data Warehouse is based on a Multidimensional data model which views data in the form of a data cube A data cube, such as sales, allows data to be modeled and viewed in

More information

Designing Data Warehouses. Data Warehousing Design. Designing Data Warehouses. Designing Data Warehouses

Designing Data Warehouses. Data Warehousing Design. Designing Data Warehouses. Designing Data Warehouses Designing Data Warehouses To begin a data warehouse project, need to find answers for questions such as: Data Warehousing Design Which user requirements are most important and which data should be considered

More information

Data Warehouse and Data Mining

Data Warehouse and Data Mining Data Warehouse and Data Mining Lecture No. 02 Introduction to Data Warehouse Naeem Ahmed Email: naeemmahoto@gmail.com Department of Software Engineering Mehran Univeristy of Engineering and Technology

More information

Business Intelligence. You can t manage what you can t measure. You can t measure what you can t describe. Ahsan Kabir

Business Intelligence. You can t manage what you can t measure. You can t measure what you can t describe. Ahsan Kabir Business Intelligence You can t manage what you can t measure. You can t measure what you can t describe Ahsan Kabir A broad category of applications and technologies for gathering, storing, analyzing,

More information

Data Warehousing and OLAP Technologies for Decision-Making Process

Data Warehousing and OLAP Technologies for Decision-Making Process Data Warehousing and OLAP Technologies for Decision-Making Process Hiren H Darji Asst. Prof in Anand Institute of Information Science,Anand Abstract Data warehousing and on-line analytical processing (OLAP)

More information

Outline. Managing Information Resources. Concepts and Definitions. Introduction. Chapter 7

Outline. Managing Information Resources. Concepts and Definitions. Introduction. Chapter 7 Outline Managing Information Resources Chapter 7 Introduction Managing Data The Three-Level Database Model Four Data Models Getting Corporate Data into Shape Managing Information Four Types of Information

More information

Oracle 1Z0-515 Exam Questions & Answers

Oracle 1Z0-515 Exam Questions & Answers Oracle 1Z0-515 Exam Questions & Answers Number: 1Z0-515 Passing Score: 800 Time Limit: 120 min File Version: 38.7 http://www.gratisexam.com/ Oracle 1Z0-515 Exam Questions & Answers Exam Name: Data Warehousing

More information

Data Warehouse. Asst.Prof.Dr. Pattarachai Lalitrojwong

Data Warehouse. Asst.Prof.Dr. Pattarachai Lalitrojwong Data Warehouse Asst.Prof.Dr. Pattarachai Lalitrojwong Faculty of Information Technology King Mongkut s Institute of Technology Ladkrabang Bangkok 10520 pattarachai@it.kmitl.ac.th The Evolution of Data

More information

Chapter 3. The Multidimensional Model: Basic Concepts. Introduction. The multidimensional model. The multidimensional model

Chapter 3. The Multidimensional Model: Basic Concepts. Introduction. The multidimensional model. The multidimensional model Chapter 3 The Multidimensional Model: Basic Concepts Introduction Multidimensional Model Multidimensional concepts Star Schema Representation Conceptual modeling using ER, UML Conceptual modeling using

More information

Data Strategies for Efficiency and Growth

Data Strategies for Efficiency and Growth Data Strategies for Efficiency and Growth Date Dimension Date key (PK) Date Day of week Calendar month Calendar year Holiday Channel Dimension Channel ID (PK) Channel name Channel description Channel type

More information

Database design View Access patterns Need for separate data warehouse:- A multidimensional data model:-

Database design View Access patterns Need for separate data warehouse:- A multidimensional data model:- UNIT III: Data Warehouse and OLAP Technology: An Overview : What Is a Data Warehouse? A Multidimensional Data Model, Data Warehouse Architecture, Data Warehouse Implementation, From Data Warehousing to

More information

Performance of relational database management

Performance of relational database management Building a 3-D DRAM Architecture for Optimum Cost/Performance By Gene Bowles and Duke Lambert As systems increase in performance and power, magnetic disk storage speeds have lagged behind. But using solidstate

More information

Data Warehouses and Deployment

Data Warehouses and Deployment Data Warehouses and Deployment This document contains the notes about data warehouses and lifecycle for data warehouse deployment project. This can be useful for students or working professionals to gain

More information

1Z0-526

1Z0-526 1Z0-526 Passing Score: 800 Time Limit: 4 min Exam A QUESTION 1 ABC's Database administrator has divided its region table into several tables so that the west region is in one table and all the other regions

More information

SOME TYPES AND USES OF DATA MODELS

SOME TYPES AND USES OF DATA MODELS 3 SOME TYPES AND USES OF DATA MODELS CHAPTER OUTLINE 3.1 Different Types of Data Models 23 3.1.1 Physical Data Model 24 3.1.2 Logical Data Model 24 3.1.3 Conceptual Data Model 25 3.1.4 Canonical Data Model

More information

MIS2502: Data Analytics Dimensional Data Modeling. Jing Gong

MIS2502: Data Analytics Dimensional Data Modeling. Jing Gong MIS2502: Data Analytics Dimensional Data Modeling Jing Gong gong@temple.edu http://community.mis.temple.edu/gong Where we are Now we re here Data entry Transactional Database Data extraction Analytical

More information

TDWI strives to provide course books that are contentrich and that serve as useful reference documents after a class has ended.

TDWI strives to provide course books that are contentrich and that serve as useful reference documents after a class has ended. Previews of TDWI course books offer an opportunity to see the quality of our material and help you to select the courses that best fit your needs. The previews cannot be printed. TDWI strives to provide

More information

Analytics: Server Architect (Siebel 7.7)

Analytics: Server Architect (Siebel 7.7) Analytics: Server Architect (Siebel 7.7) Student Guide June 2005 Part # 10PO2-ASAS-07710 D44608GC10 Edition 1.0 D44917 Copyright 2005, 2006, Oracle. All rights reserved. Disclaimer This document contains

More information

Xcelerated Business Insights (xbi): Going beyond business intelligence to drive information value

Xcelerated Business Insights (xbi): Going beyond business intelligence to drive information value KNOWLEDGENT INSIGHTS volume 1 no. 5 October 7, 2011 Xcelerated Business Insights (xbi): Going beyond business intelligence to drive information value Today s growing commercial, operational and regulatory

More information

Built for Speed: Comparing Panoply and Amazon Redshift Rendering Performance Utilizing Tableau Visualizations

Built for Speed: Comparing Panoply and Amazon Redshift Rendering Performance Utilizing Tableau Visualizations Built for Speed: Comparing Panoply and Amazon Redshift Rendering Performance Utilizing Tableau Visualizations Table of contents Faster Visualizations from Data Warehouses 3 The Plan 4 The Criteria 4 Learning

More information

The Data Organization

The Data Organization C V I T F E P A O TM The Data Organization Best Practices Metadata Dictionary Application Architecture Prepared by Rainer Schoenrank January 2017 Table of Contents 1. INTRODUCTION... 3 1.1 PURPOSE OF THE

More information

The Evolution of Data Warehousing. Data Warehousing Concepts. The Evolution of Data Warehousing. The Evolution of Data Warehousing

The Evolution of Data Warehousing. Data Warehousing Concepts. The Evolution of Data Warehousing. The Evolution of Data Warehousing The Evolution of Data Warehousing Data Warehousing Concepts Since 1970s, organizations gained competitive advantage through systems that automate business processes to offer more efficient and cost-effective

More information

Test bank for accounting information systems 1st edition by richardson chang and smith

Test bank for accounting information systems 1st edition by richardson chang and smith Test bank for accounting information systems 1st edition by richardson chang and smith Chapter 04 Relational Databases and Enterprise Systems True / False Questions 1. Three types of data models used today

More information

QM Chapter 1 Database Fundamentals Version 10 th Ed. Prepared by Dr Kamel Rouibah / Dept QM & IS

QM Chapter 1 Database Fundamentals Version 10 th Ed. Prepared by Dr Kamel Rouibah / Dept QM & IS QM 433 - Chapter 1 Database Fundamentals Version 10 th Ed Prepared by Dr Kamel Rouibah / Dept QM & IS www.cba.edu.kw/krouibah Dr K. Rouibah / dept QM & IS Chapter 1 (433) Database fundamentals 1 Objectives

More information

Introduction to Data Science

Introduction to Data Science UNIT I INTRODUCTION TO DATA SCIENCE Syllabus Introduction of Data Science Basic Data Analytics using R R Graphical User Interfaces Data Import and Export Attribute and Data Types Descriptive Statistics

More information

Handout 12 Data Warehousing and Analytics.

Handout 12 Data Warehousing and Analytics. Handout 12 CS-605 Spring 17 Page 1 of 6 Handout 12 Data Warehousing and Analytics. Operational (aka transactional) system a system that is used to run a business in real time, based on current data; also

More information

Data Warehouse and Data Mining

Data Warehouse and Data Mining Data Warehouse and Data Mining Lecture No. 04-06 Data Warehouse Architecture Naeem Ahmed Email: naeemmahoto@gmail.com Department of Software Engineering Mehran Univeristy of Engineering and Technology

More information

CoE CENTRE of EXCELLENCE ON DATA WAREHOUSING

CoE CENTRE of EXCELLENCE ON DATA WAREHOUSING in partnership with Overall handbook to set up a S-DWH CoE: Deliverable: 4.6 Version: 3.1 Date: 3 November 2017 CoE CENTRE of EXCELLENCE ON DATA WAREHOUSING Handbook to set up a S-DWH 1 version 2.1 / 4

More information

DATA WAREHOUSE EGCO321 DATABASE SYSTEMS KANAT POOLSAWASD DEPARTMENT OF COMPUTER ENGINEERING MAHIDOL UNIVERSITY

DATA WAREHOUSE EGCO321 DATABASE SYSTEMS KANAT POOLSAWASD DEPARTMENT OF COMPUTER ENGINEERING MAHIDOL UNIVERSITY DATA WAREHOUSE EGCO321 DATABASE SYSTEMS KANAT POOLSAWASD DEPARTMENT OF COMPUTER ENGINEERING MAHIDOL UNIVERSITY CHARACTERISTICS Data warehouse is a central repository for summarized and integrated data

More information

What is a Data Warehouse?

What is a Data Warehouse? What is a Data Warehouse? COMP 465 Data Mining Data Warehousing Slides Adapted From : Jiawei Han, Micheline Kamber & Jian Pei Data Mining: Concepts and Techniques, 3 rd ed. Defined in many different ways,

More information

Managing Data Resources

Managing Data Resources Chapter 7 OBJECTIVES Describe basic file organization concepts and the problems of managing data resources in a traditional file environment Managing Data Resources Describe how a database management system

More information

CSPP 53017: Data Warehousing Winter 2013! Lecture 7! Svetlozar Nestorov! Class News!

CSPP 53017: Data Warehousing Winter 2013! Lecture 7! Svetlozar Nestorov! Class News! CSPP 53017: Data Warehousing Winter 2013! Lecture 7! Svetlozar Nestorov! Class News! Make-up class on Saturday, Mar 9 in Gleacher 203 10:30am 1:30pm.! Last 15 minute in-class quiz (6:30pm) on Mar 5.! Covers

More information

Data Mining & Data Warehouse

Data Mining & Data Warehouse Data Mining & Data Warehouse Associate Professor Dr. Raed Ibraheem Hamed University of Human Development, College of Science and Technology (1) 2016 2017 1 Points to Cover Why Do We Need Data Warehouses?

More information

Meaning & Concepts of Databases

Meaning & Concepts of Databases 27 th August 2015 Unit 1 Objective Meaning & Concepts of Databases Learning outcome Students will appreciate conceptual development of Databases Section 1: What is a Database & Applications Section 2:

More information

Efficiency Gains in Inbound Data Warehouse Feed Implementation

Efficiency Gains in Inbound Data Warehouse Feed Implementation Efficiency Gains in Inbound Data Warehouse Feed Implementation Simon Eligulashvili simon.e@gamma-sys.com Introduction The task of building a data warehouse with the objective of making it a long-term strategic

More information

Data Mining and Warehousing

Data Mining and Warehousing Data Mining and Warehousing Sangeetha K V I st MCA Adhiyamaan College of Engineering, Hosur-635109. E-mail:veerasangee1989@gmail.com Rajeshwari P I st MCA Adhiyamaan College of Engineering, Hosur-635109.

More information

BI Assignment ISDS 4180 Fall 2016

BI Assignment ISDS 4180 Fall 2016 BI Assignment ISDS 4180 Fall 2016 1. What is the difference in efficiency and effectiveness? Be sure to supply an example. The fundamental difference between efficiency and effectiveness lies in the intended

More information

After completing this course, participants will be able to:

After completing this course, participants will be able to: Designing a Business Intelligence Solution by Using Microsoft SQL Server 2008 T h i s f i v e - d a y i n s t r u c t o r - l e d c o u r s e p r o v i d e s i n - d e p t h k n o w l e d g e o n d e s

More information

A Mathematical Model For Treatment Selection Literature

A Mathematical Model For Treatment Selection Literature A Mathematical Model For Treatment Selection Literature G. Duncan and W. W. Koczkodaj Abstract Business Intelligence tools and techniques, when applied to a data store of bibliographical references, can

More information