Data Warehouse and Data Mining
Lecture No. 03: Architecture of DW
Naeem Ahmed
Email: naeemmahoto@gmail.com
Department of Software Engineering
Mehran University of Engineering and Technology, Jamshoro
Basic Architecture Architecture of DW
Data Warehouse Architecture
Operational Source Systems
These are the operational systems of record that capture the transactions of the business.
These systems lie outside the data warehouse, which has no control over the content and format of their data.
The source systems maintain little historical data.
They generate operational data that is detailed, current, and subject to change.
Data Staging Area
The data staging area can be divided into three phases:
- Extraction (E)
- Transformation (T)
- Loading (L)
Extraction: reading and understanding the source data and copying the data needed for the data warehouse into the staging area for further manipulation (i.e. transformation).
Data Staging Area
Loading: populating the data warehouse with the data that has been extracted from the operational systems. Two types of load generally take place in a data warehouse environment:
- Initial load
- Incremental load
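The two load types can be sketched as follows. This is a minimal illustration, not a prescribed method: it assumes an in-memory SQLite database stands in for the warehouse, that each source row carries an `updated_at` date, and that the latest loaded date serves as a watermark. The table name `fact_orders`, the sample rows, and both function names are hypothetical.

```python
import sqlite3

# Hypothetical operational source rows: (order_id, amount, updated_at).
SOURCE_ROWS = [
    (1, 100.0, "2024-01-01"),
    (2, 250.0, "2024-01-02"),
    (3, 75.0, "2024-01-03"),
]


def initial_load(conn, rows):
    """Populate the empty warehouse table with every row supplied."""
    conn.execute(
        "CREATE TABLE fact_orders ("
        "order_id INTEGER PRIMARY KEY, amount REAL, updated_at TEXT)"
    )
    conn.executemany("INSERT INTO fact_orders VALUES (?, ?, ?)", rows)
    conn.commit()


def incremental_load(conn, rows):
    """Apply only the rows changed since the last load (the watermark)."""
    watermark = conn.execute(
        "SELECT COALESCE(MAX(updated_at), '') FROM fact_orders"
    ).fetchone()[0]
    # ISO dates compare correctly as strings.
    changed = [r for r in rows if r[2] > watermark]
    # INSERT OR REPLACE overwrites a row that has the same primary key.
    conn.executemany("INSERT OR REPLACE INTO fact_orders VALUES (?, ?, ?)", changed)
    conn.commit()
    return len(changed)


conn = sqlite3.connect(":memory:")
initial_load(conn, SOURCE_ROWS[:2])          # first population of the warehouse
loaded = incremental_load(conn, SOURCE_ROWS)  # later run picks up new rows only
total = conn.execute("SELECT COUNT(*) FROM fact_orders").fetchone()[0]
print(loaded, total)
```

The initial load runs once against an empty table, while the incremental load can be repeated on a schedule because it skips rows that are already at or before the watermark.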
Data Staging Area
Transformation: the transformation phase applies a series of rules or functions to the extracted/loaded data. This may include some or all of the following:
- Select only certain columns to load (or, if you prefer, select null columns not to load)
- Translate coded values
- Derive a new calculated value (e.g. sale_amount = qty * unit_price)
- Denormalize the data to fit the data warehouse schema
- Summarize multiple rows of data (e.g. total sales for each region)
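Several of the transformation rules listed above can be sketched in a few lines. The example assumes extracted rows arrive as dictionaries; the field names other than `sale_amount`, `qty`, and `unit_price`, the region codes, and the sample values are made up for illustration.

```python
from collections import defaultdict

# Hypothetical extracted rows; field names and region codes are made up.
extracted = [
    {"id": 1, "region_code": "N", "qty": 2, "unit_price": 10.0, "note": "x"},
    {"id": 2, "region_code": "S", "qty": 1, "unit_price": 5.0, "note": "y"},
    {"id": 3, "region_code": "N", "qty": 3, "unit_price": 4.0, "note": "z"},
]

REGION_NAMES = {"N": "North", "S": "South"}         # lookup for coded values
KEEP = ("id", "region_code", "qty", "unit_price")   # columns to load


def transform(row):
    """Apply the row-level rules to one extracted row."""
    out = {k: row[k] for k in KEEP}                        # select columns
    out["region"] = REGION_NAMES[out.pop("region_code")]   # translate code
    out["sale_amount"] = out["qty"] * out["unit_price"]    # derive value
    return out


transformed = [transform(r) for r in extracted]

# Summarize multiple rows: total sales for each region.
totals = defaultdict(float)
for r in transformed:
    totals[r["region"]] += r["sale_amount"]

print(dict(totals))
```

Row-level rules (select, translate, derive) are kept in one function and the summary is computed afterwards, because summarizing needs the whole set of rows rather than one row at a time.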
Data Staging Area
The data staging area:
- Is both a storage area and a process area (the ETL process)
- Represents everything that happens between the operational source systems and the data presentation area
The key architectural requirement for the data staging area is that it is off-limits to business users and does not provide query and presentation services. It should be accessible only to skilled professionals.
ETL versus ELT
ETL (the traditional approach): ETL (extract, transform, and load) is a process in data warehousing that involves:
- Extracting data from outside sources
- Transforming it to fit business needs
- Ultimately loading it into the data warehouse
ELT (the Teradata approach): the ELT (extract, load, and transform) strategy extracts and loads the data into a Teradata Database first, then uses the power and performance of the Teradata warehouse to perform the transformation.
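The difference in ordering can be sketched as below. This is only an illustration under stated assumptions: an in-memory SQLite database stands in for the warehouse (it is not Teradata), and the tables `raw_sales` and `sales`, the sample rows, and the function names are hypothetical.

```python
import sqlite3

# Hypothetical source rows: (qty, unit_price).
source = [(2, 10.0), (1, 5.0), (3, 4.0)]


def etl(rows):
    """ETL: transform in the staging code, then load the finished rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sales (qty INTEGER, unit_price REAL, sale_amount REAL)"
    )
    shaped = [(q, p, q * p) for q, p in rows]                       # transform
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", shaped)  # load
    return conn


def elt(rows):
    """ELT: load the raw rows first, then transform with the database's SQL."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE raw_sales (qty INTEGER, unit_price REAL)")
    conn.executemany("INSERT INTO raw_sales VALUES (?, ?)", rows)   # load
    conn.execute(
        "CREATE TABLE sales AS "
        "SELECT qty, unit_price, qty * unit_price AS sale_amount "
        "FROM raw_sales"
    )                                                               # transform
    return conn


query = "SELECT SUM(sale_amount) FROM sales"
etl_total = etl(source).execute(query).fetchone()[0]
elt_total = elt(source).execute(query).fetchone()[0]
print(etl_total, elt_total)
```

Both paths produce the same `sales` table. What changes is where the transformation runs: in the staging code for ETL, inside the database engine for ELT.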
Data Presentation Area
Extended relational DBMS (ROLAP servers):
- Data is stored in a relational database
- Star-join schemas
- Support for SQL extensions (e.g. CUBE)
- Index structures (bitmap, join)
Multidimensional DBMS (MOLAP servers):
- Data is stored in n-dimensional arrays
- Direct access to the array data structure
- Poor storage utilization, especially when the data is sparse
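The storage trade-off between the two server types can be illustrated with a toy cube. The sketch assumes a naive, uncompressed array for the MOLAP side and a plain list of rows for the ROLAP side; the dimension sizes, the sample facts, and the helper `molap_offset` are hypothetical.

```python
# Hypothetical cube: 10 products x 5 stores x 30 days, with few cells filled.
DIMS = (10, 5, 30)

# Only these (product, store, day) -> sales facts actually occurred.
facts = {(0, 0, 0): 12.5, (3, 2, 10): 40.0, (9, 4, 29): 8.0}

# ROLAP-style: one fact-table row per event that occurred.
fact_table = [(p, s, d, v) for (p, s, d), v in facts.items()]


def molap_offset(p, s, d, dims=DIMS):
    """Row-major position of a cell in the flattened n-dimensional array."""
    _, n_stores, n_days = dims
    return (p * n_stores + s) * n_days + d


# MOLAP-style: every cell of the array is allocated, filled or not.
molap_cells = DIMS[0] * DIMS[1] * DIMS[2]
cube = [0.0] * molap_cells
for (p, s, d), v in facts.items():
    cube[molap_offset(p, s, d)] = v

# Direct access: the cell address is computed, so no search or join is needed.
value = cube[molap_offset(3, 2, 10)]
utilization = len(fact_table) / molap_cells
print(len(fact_table), molap_cells, value, utilization)
```

The array answers a cell lookup with simple arithmetic, but it reserves space for every combination of dimension values, so most of it stays empty when few combinations occur.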
Data Presentation Area
The data presentation area is where data is organized, stored, and made available for queries, report writers, and other analytical processing.
As far as the business community is concerned, this area is the warehouse.
Data Access Tools
- Analysis / OLAP / DSS tools
- Querying / reporting tools
- Data mining
Warehouse components
Component: Operational Data
The data for the data warehouse is supplied from the following sources:
- Mainframe systems that hold data in the traditional network and hierarchical formats
- Relational DBMSs such as Oracle and Informix
- External data: in addition to these internal sources, operational data includes data obtained from commercial databases and from databases associated with suppliers and customers
Component: Load Manager
The load manager (also called the front-end component) performs all the operations associated with extracting data and loading it into the data warehouse.
These operations include simple transformations that prepare the data for entry into the warehouse.
The size and complexity of this component vary between data warehouses. It may be constructed using a combination of vendor data loading tools and custom-built programs.
Component: Warehouse Manager
The warehouse manager performs all the operations associated with the management of data in the warehouse.
This component is built using vendor data management tools and custom-built programs.
The operations performed by the warehouse manager include:
- Analyzing data to ensure consistency
- Transforming and merging the source data from temporary storage into data warehouse tables
- Creating indexes and views on the base tables
- Generating denormalizations
- Generating aggregations
- Backing up and archiving data
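A few of these operations can be sketched with plain SQL. The example assumes an in-memory SQLite database as the warehouse, and every table, index, and view name in it (`staging_sales`, `sales`, `idx_sales_region`, `north_sales`, `sales_by_region`) is hypothetical. It covers the consistency check, the merge from temporary storage, index and view creation, and aggregation; backup is left out.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Temporary storage filled by the load manager (hypothetical schema).
    CREATE TABLE staging_sales (sale_id INTEGER, region TEXT, amount REAL);
    INSERT INTO staging_sales VALUES
        (1, 'North', 20.0), (2, 'South', 5.0), (3, 'North', 12.0),
        (4, NULL, 9.0);

    -- Base table of the warehouse.
    CREATE TABLE sales (
        sale_id INTEGER PRIMARY KEY, region TEXT NOT NULL, amount REAL
    );
    """
)

# Consistency analysis: count staged rows that have no region.
rejected = conn.execute(
    "SELECT COUNT(*) FROM staging_sales WHERE region IS NULL"
).fetchone()[0]

conn.executescript(
    """
    -- Merge the consistent rows from temporary storage into the base table.
    INSERT INTO sales SELECT * FROM staging_sales WHERE region IS NOT NULL;

    -- Index and view on the base table.
    CREATE INDEX idx_sales_region ON sales (region);
    CREATE VIEW north_sales AS SELECT * FROM sales WHERE region = 'North';

    -- Aggregation: a summary table generated from the detailed rows.
    CREATE TABLE sales_by_region AS
        SELECT region, SUM(amount) AS total FROM sales GROUP BY region;
    """
)

summary = dict(conn.execute("SELECT region, total FROM sales_by_region"))
north_rows = conn.execute("SELECT COUNT(*) FROM north_sales").fetchone()[0]
print(rejected, summary, north_rows)
```

Rows that fail the consistency check stay in the staging table, so they can be inspected or corrected before a later merge.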
Warehouse Manager: Detailed Data
This area of the warehouse stores all the detailed data in the database schema.
In most cases the detailed data is not stored online but is aggregated to the next level of detail.
However, detailed data is added regularly to the warehouse to supplement the aggregated data.
Warehouse Manager: Lightly and Highly Summarized Data
This area of the warehouse stores all the predefined lightly and highly summarized (aggregated) data generated by the warehouse manager.
It is transient, because it is subject to change on an ongoing basis in order to respond to changing query profiles.
The purpose of the summarized information is to speed up query performance.
The summarized data is updated continuously as new data is loaded into the warehouse.
Warehouse Manager: Archive and Back-up Data
This area of the warehouse stores detailed and summarized data for the purpose of archiving and back-up.
The data is transferred to storage archives such as magnetic tapes or optical disks.
Warehouse Manager: Meta Data
The data warehouse also stores all the meta data (data about data) definitions used by all processes in the warehouse.
Meta data is used for a variety of purposes, including:
- The extraction and loading process: meta data is used to map data sources to a common view of information within the warehouse.
- The warehouse management process: meta data is used to automate the production of summary tables.
- The query management process: meta data is used to direct a query to the most appropriate data source.
The structure of the meta data differs from process to process, because its purpose differs.
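The query management use of meta data can be sketched as a small routing function. This is one possible reading of "most appropriate data source", assumed here to mean the most aggregated table that still contains every column the query needs. The table names, the `grain` field, and `route_query` are hypothetical.

```python
# Hypothetical meta data: each table the warehouse holds and its grain,
# i.e. the columns a query can use when reading from that table.
# Entries are ordered from most aggregated to most detailed.
TABLE_METADATA = [
    {"table": "sales_by_region_year", "grain": {"region", "year"}},
    {"table": "sales_by_region_month", "grain": {"region", "year", "month"}},
    {
        "table": "sales_detail",
        "grain": {"region", "year", "month", "day", "product"},
    },
]


def route_query(needed_columns, metadata=TABLE_METADATA):
    """Return the most aggregated table whose grain covers the query."""
    needed = set(needed_columns)
    for entry in metadata:
        if needed <= entry["grain"]:   # every needed column is available
            return entry["table"]
    raise LookupError(f"no table can answer a query on {sorted(needed)}")


print(route_query(["region", "year"]))
print(route_query(["region", "month"]))
print(route_query(["product"]))
```

A query that only needs yearly totals is sent to the small summary table, and the detailed table is read only when no summary has the required columns.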
Component: Query Manager
The query manager (also called the back-end component) performs all the operations associated with the management of user queries.
This component is usually constructed using vendor end-user access tools, data warehouse monitoring tools, database facilities, and custom-built programs.
The complexity of a query manager is determined by the facilities provided by the end-user access tools and the database.
Component: End-user Access Tools
The principal purpose of a data warehouse is to provide information to business managers for strategic decision-making.
These users interact with the warehouse through end-user access tools.
Examples of end-user access tools include:
- Reporting and query tools
- Application development tools
- Executive information systems tools
- Online analytical processing tools
- Data mining tools