Data Mining Concepts & Techniques Lecture No. 01 Databases, Data warehouse Naeem Ahmed Email: naeemmahoto@gmail.com Department of Software Engineering Mehran University of Engineering and Technology Jamshoro
Database A database is a large, integrated collection of data A database management system (DBMS) is a software system designed to store, manage and facilitate access to the database A schema is a description of a database
What is Data warehouse? Basically a very large database Not all very large databases are data warehouses, but all data warehouses are pretty large databases Nowadays a warehouse is considered to start at around 800 GB and goes up to several TB It spans over several servers and needs an impressive amount of computing power
What is Data warehouse? More specifically, a collective data repository Containing snapshots of the operational data (history) Obtained through data cleansing in an ETL (Extract-Transform-Load) process Useful for analytics
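The Extract-Transform-Load steps named above can be sketched as three small functions. This is only an illustrative sketch: the in-memory source rows, the field names, and the cleansing rules (trim names, parse amounts, drop rows without a customer) are all assumptions, not part of the lecture.

```python
from datetime import date

# Hypothetical rows from an operational system (made-up data).
operational_rows = [
    {"customer": " Alice ", "amount": "120.50"},
    {"customer": "bob", "amount": "80"},
    {"customer": "", "amount": "15"},  # dirty row: missing customer
]

def extract():
    """Extract: read rows from the (here, in-memory) operational source."""
    return list(operational_rows)

def transform(rows):
    """Transform: cleanse rows (trim names, parse amounts, drop incomplete rows)."""
    cleaned = []
    for row in rows:
        name = row["customer"].strip().title()
        if not name:
            continue  # assumed rule: skip rows without a customer
        cleaned.append({"customer": name, "amount": float(row["amount"])})
    return cleaned

def load(warehouse, rows, snapshot_date):
    """Load: append the cleaned rows to the warehouse as a dated snapshot."""
    for row in rows:
        warehouse.append({**row, "snapshot_date": snapshot_date})

warehouse = []
load(warehouse, transform(extract()), date(2014, 1, 20))
```

Each load appends rows tagged with a snapshot date, which is how the warehouse accumulates history instead of only the current state.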
What is Data warehouse? Compared to other solutions it Is suitable for tactical/strategic focus Implies a small number of transactions Implies large transactions spanning over a long period of time
Some Definitions Ralph Kimball: a copy of transaction data specifically structured for query and analysis Bill Inmon (father of data warehousing, in 1993): A Data Warehouse is a subject-oriented, integrated, non-volatile, time-variant collection of data in support of management's decisions
Data Warehouse Subject oriented: Data is arranged by subject area rather than by application. Data is organized so that all the data elements relating to the same real-world event or object are linked together Typical subject areas in DWs are Customer, Product, Order, Claim, Account, ...
Data Warehouse Subject oriented: Example: customer as subject in a DW DW is organized in this case by the customer It may consist of 10, 100 or more physical tables, all related
Data Warehouse Integrated: Data is collected from multiple, diverse sources among an organization's operational systems and is stored in a consistent form E.g., gender coding, units of measurement, conflicting keys, ...
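As a rough illustration of integration, the sketch below maps different source encodings of gender onto one warehouse code and converts a measurement to a single unit. The specific source codes, the target codes "M"/"F"/"UNKNOWN", and the choice of centimetres are assumptions made for this example.

```python
# Assumed source encodings mapped onto one warehouse-wide code.
GENDER_MAP = {
    "m": "M", "male": "M", "1": "M",
    "f": "F", "female": "F", "0": "F",
}

def normalize_gender(value):
    """Return the warehouse code for a source-specific gender value."""
    return GENDER_MAP.get(str(value).strip().lower(), "UNKNOWN")

def to_centimetres(value, unit):
    """Convert a length to centimetres so all sources share one unit."""
    factors = {"cm": 1.0, "in": 2.54, "m": 100.0}
    return value * factors[unit]

# Two hypothetical operational systems describing customers differently.
source_a = {"gender": "male", "height": 70, "unit": "in"}
source_b = {"gender": 0, "height": 1.6, "unit": "m"}

integrated = [
    {"gender": normalize_gender(s["gender"]),
     "height_cm": to_centimetres(s["height"], s["unit"])}
    for s in (source_a, source_b)
]
```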
Data Warehouse Non-volatile: Data in the data warehouse is never over-written or deleted - once committed, the data is static, read-only, and retained for future reporting. Data is loaded, but not updated When subsequent changes occur, a new snapshot record is written.
Data Warehouse Time-variant: The changes to the data in the data warehouse are tracked and recorded so that reports can be produced showing changes over time. Different environments have different time horizons associated While for operational systems a 60-to-90 day time horizon is normal, data warehouse has a 5-to-10 year horizon
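The non-volatile and time-variant properties can be sketched together as an append-only store in which a change produces a new snapshot row instead of an update. The class name `SnapshotStore`, its methods, and the sample customer data are hypothetical and chosen only for illustration.

```python
from datetime import date

class SnapshotStore:
    """Append-only store: rows are loaded, never updated or deleted."""

    def __init__(self):
        self._rows = []

    def load(self, key, attributes, snapshot_date):
        # A change is recorded as a new snapshot row, not an overwrite.
        self._rows.append({"key": key, "snapshot_date": snapshot_date, **attributes})

    def history(self, key):
        """All snapshots for a key, oldest first."""
        rows = [r for r in self._rows if r["key"] == key]
        return sorted(rows, key=lambda r: r["snapshot_date"])

    def as_of(self, key, when):
        """The snapshot that was valid on a given date, or None."""
        candidates = [r for r in self.history(key) if r["snapshot_date"] <= when]
        return candidates[-1] if candidates else None

store = SnapshotStore()
store.load("C1", {"city": "Jamshoro"}, date(2013, 1, 1))
store.load("C1", {"city": "Karachi"}, date(2014, 1, 20))
```

Because older rows are retained, a report can ask for the state as of any past date, for example `store.as_of("C1", date(2013, 6, 1))`.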
Data Warehouse vs. Operational Database
Data Warehouse | Operational Database
Subject oriented | Application oriented
Integrated | Multiple diverse sources
Non-volatile | Updateable
Time-variant | Real-time, current
OnLine Transaction Processing OLTP (OnLine Transaction Processing): Also known as operational data, it represents day-to-day operational business activities: Purchasing, sales, production, distribution, ... Typically used for data entry and retrieval transaction processing Reflects only the current state of the data
OnLine Analytical Processing OLAP (OnLine Analytical Processing): Represents front-end analytics based on a DW repository It provides information for activities like: Resource planning, capital budgeting, marketing initiatives,... It is decision oriented
OLTP vs. DW Properties
Operational DB | DW
Mostly updates | Mostly reads
Many small transactions | Queries long, complex
MB-TB of data | GB-PB of data
Raw data | Summarized data
Clerical users | Decision makers
Up-to-date data | May be slightly outdated
OLTP vs. DW
Property | OLTP | Data Warehouse
users | clerk, IT professional | knowledge worker
function | day to day operations | decision support
DB design | application-oriented | subject-oriented
data | current, up-to-date, detailed, flat relational, isolated | historical, summarized, multidimensional, integrated, consolidated
usage | repetitive | ad-hoc
access | read/write, index/hash on prim. key | lots of scans
unit of work | short, simple transaction | complex query
# records accessed | tens | millions
# users | thousands | hundreds
DB size | 100MB-GB | 100GB-TB
metric | transaction throughput | query throughput, response
Applications of DW A DW is the base repository for front-end analytics: OLAP, KDD, Data visualization, Reporting KDD (Knowledge Discovery in Databases): a data mining process
Lifecycle of DW Classical SDLC vs. DW SDLC DW SDLC is almost the opposite of classical SDLC
Lifecycle of DW Classical SDLC vs. DW SDLC Because it is the opposite of SDLC, DW SDLC is also called CLDS
Basic Architecture Architecture of DW
Data Warehouse Architecture
Data Mart A data mart is a special purpose subset of enterprise data for a particular function or application (it may contain detail or summary data or both). Data Mart types: Independent: created directly from operational systems to a separate physical data store. Logical: exists as a subset of an existing data warehouse. Dependent: created from the data warehouse to a separate physical data store.
Phases
Data Modeling
Conceptual Design: Transforms data requirements to a conceptual model. The conceptual model describes data entities, relationships, constraints, etc. at a high level. Does not contain any implementation details. Independent of the software and hardware used.
Logical Design: Maps the conceptual data model to the logical data model used by the DBMS, e.g., relational model, dimensional model, ... The technology-independent conceptual model is adapted to the DBMS software used.
Physical Design: Creates the internal structures needed to efficiently store/manage data: table spaces, indexes, access paths, ... Depends on the hardware and DBMS software used.
Data Modeling Conceptual Modeling: DW Modeling Multidimensional Entity Relationship (ME/R) Model Multidimensional UML (muml) Logical Modeling: Cubes, Dimensions, Hierarchies Physical Modeling: Star, Snowflake, Array storage
DW Modeling Components Facts: a fact is a focus of interest for decision-making, e.g., sales, shipments, ... Measures: attributes that describe facts from different points of view, e.g., each sale is measured by its revenue Dimensions: discrete attributes which determine the granularity adopted to represent facts, e.g., product, store, date Hierarchies: made up of dimension attributes; they determine how facts may be aggregated and selected, e.g., day → month → quarter → year
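To make these components concrete, here is a small sketch of a star-style layout using Python's built-in `sqlite3` module: one fact table (sales) with a measure (revenue) and two dimension tables, one of which carries the day → month → quarter → year hierarchy. The table names, column names, and data are assumptions for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: discrete attributes that set the granularity of facts.
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, product_group TEXT);
-- The date dimension carries the hierarchy day -> month -> quarter -> year.
CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, day TEXT, month TEXT, quarter TEXT, year INTEGER);
-- Fact table: one row per sale, with revenue as the measure.
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    date_id INTEGER REFERENCES dim_date(date_id),
    revenue REAL
);
INSERT INTO dim_product VALUES (1, 'Laptop', 'Computers'), (2, 'Cell phone', 'Phones');
INSERT INTO dim_date VALUES (1, '2014-01-20', '2014-01', '2014-Q1', 2014),
                            (2, '2014-02-03', '2014-02', '2014-Q1', 2014);
INSERT INTO fact_sales VALUES (1, 1, 900.0), (2, 1, 300.0), (1, 2, 950.0);
""")

# Aggregate the measure at the quarter level of the date hierarchy.
revenue_by_quarter = conn.execute("""
    SELECT d.quarter, SUM(f.revenue)
    FROM fact_sales AS f JOIN dim_date AS d ON f.date_id = d.date_id
    GROUP BY d.quarter
""").fetchall()

# Aggregate the same measure at the finer month level.
revenue_by_month = conn.execute("""
    SELECT d.month, SUM(f.revenue)
    FROM fact_sales AS f JOIN dim_date AS d ON f.date_id = d.date_id
    GROUP BY d.month ORDER BY d.month
""").fetchall()
```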
OLAP A decision support system (DSS) that supports ad hoc querying, i.e., enables managers and analysts to interactively manipulate data. Analysis of information in a database for the purpose of making management decisions The idea is to allow users to easily and quickly manipulate and visualize the data through multidimensional views (i.e., different perspectives) OLAP (OnLine Analytical Processing) analyzes historical data (terabytes) using complex queries
OLAP Council definition: OLAP A category of software technology that enables analysts, managers and executives to gain insight into data through fast, consistent, interactive access to a wide variety of possible views of information that has been transformed from raw data to reflect the real dimensionality of the enterprise as understood by the user OLAP is implemented in a multi-user client/server mode and offers consistently rapid response to queries, regardless of database size and complexity.
OLAP OLAP primarily involves aggregating large amounts of diverse data OLAP functionality provides dynamic multidimensional analysis, supporting analytical and navigational activities OLAP functionality is provided by the OLAP Server OLAP Council defines OLAP Server as: A high capacity, multi-user data manipulation engine specifically designed to support and operate on multidimensional data structures.
OLTP vs. OLAP
OLTP | OLAP
Operational processing | Informational processing
Transaction-oriented | Analysis-oriented
For operational staff | For managers, executives & analysts
Daily operations | Decision support
Current, up-to-date data | Historical data
Primitive, highly detailed data | Summarized, consolidated data
Detailed, flat relational views | Summarized, multi-dimensional views
Short, simple transactions | Complex aggregate queries
Read/write | Mostly read only
Index on keys | Many scans
Many users | Small number of users
Large databases | Very large databases
OLTP vs. OLAP On-Line Transaction Processing Transfer $100 balance from my saving account to my checking account On-Line Analytical Processing What is the average balance of accounts by customer groups, account types, areas, account managers, and their combinations?
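The two requests above can be contrasted in a short sketch using Python's built-in `sqlite3` module: the transfer is a short read/write transaction touching two rows, while the analytical question is a read-only aggregate over all rows. The `accounts` table, its columns, and the balances are made up for this example, and only two of the grouping attributes from the slide are shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE accounts (
    id INTEGER PRIMARY KEY, owner TEXT,
    customer_group TEXT, account_type TEXT, balance REAL
);
INSERT INTO accounts VALUES
    (1, 'me',   'retail',    'saving',   500.0),
    (2, 'me',   'retail',    'checking',  50.0),
    (3, 'acme', 'corporate', 'checking', 1000.0);
""")

def transfer(conn, from_id, to_id, amount):
    """OLTP-style work: a short transaction that updates two rows."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute("UPDATE accounts SET balance = balance - ? WHERE id = ?", (amount, from_id))
        conn.execute("UPDATE accounts SET balance = balance + ? WHERE id = ?", (amount, to_id))

transfer(conn, 1, 2, 100.0)  # move $100 from saving to checking

# OLAP-style work: a read-only aggregate over the whole table.
avg_by_group_and_type = conn.execute("""
    SELECT customer_group, account_type, AVG(balance)
    FROM accounts
    GROUP BY customer_group, account_type
    ORDER BY customer_group, account_type
""").fetchall()
```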
DW Queries DW queries are big queries They involve a large portion of the data Read-only queries: no updates Redundancy is a necessity: materialized views, special-purpose indexes, de-normalized schemas Data is refreshed periodically, e.g., daily or weekly Their purpose is to analyze data: OLAP (OnLine Analytical Processing)
Typical OLAP operations: Roll-up, Drill-down, Slice and dice, Pivot (rotate) Other operations: Aggregate functions, Ranking and comparing, Drill-across, Drill-through
Roll-up Roll-up (drill-up) Taking the current aggregation level of fact values and doing a further aggregation Summarize data by Climbing up hierarchy (hierarchical roll-up) By dimensional reduction Or by a mix of these 2 techniques Used for obtaining an increased generalization E.g., from Time.Week to Time.Year
Roll-up Hierarchical roll-ups Performed on the fact table and some dimension tables by climbing up the attribute hierarchies E.g., climbed the Time hierarchy to Quarter and Article hierarchy to Prod. group
Roll-up Dimensional roll-ups Are done solely on the fact table by dropping one or more dimensions E.g., drop the Client dimension
Roll-up Climbing above the top in hierarchical roll-up In an ultimate case, hierarchical roll-up above the top level of an attribute hierarchy (attribute ALL ) can be viewed as converting to a dimensional roll-up
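The two kinds of roll-up can be sketched with one small aggregation function: climbing a hierarchy maps a dimension value to its parent level, and a dimensional roll-up simply leaves a dimension out of the grouping key. The fact rows, the dimension names, and the simplistic week-to-year mapping are assumptions for illustration.

```python
from collections import defaultdict

DIMENSIONS = ("week", "article", "client")

# Made-up fact rows: (week, article, client, amount).
facts = [
    ("2013-W51", "Laptop", "C1", 10),
    ("2013-W52", "Laptop", "C2", 5),
    ("2013-W52", "Laptop", "C1", 4),
    ("2014-W01", "Laptop", "C1", 7),
    ("2014-W01", "CellP", "C2", 3),
]

def week_to_year(week):
    """Climb the Time hierarchy from week to year (simplified: takes the label prefix)."""
    return week.split("-")[0]

def roll_up(rows, keep, climb=None):
    """Sum amounts grouped by the dimensions in `keep`.

    `climb` maps a dimension name to a function that lifts a value
    to a higher hierarchy level before grouping.
    """
    climb = climb or {}
    totals = defaultdict(int)
    for *dims, amount in rows:
        record = dict(zip(DIMENSIONS, dims))
        key = tuple(climb.get(d, lambda v: v)(record[d]) for d in keep)
        totals[key] += amount
    return dict(totals)

# Hierarchical roll-up: Time.Week -> Time.Year, other dimensions unchanged.
hierarchical = roll_up(facts, keep=DIMENSIONS, climb={"week": week_to_year})
# Dimensional roll-up: drop the Client dimension.
dimensional = roll_up(facts, keep=("week", "article"))
# Rolling up above every hierarchy (ALL) leaves a single grand total.
grand_total = roll_up(facts, keep=())
```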
Drill-down (roll-down) Reverse of Roll-up Drill-down represents a de-aggregate operation From a higher level of summary to a lower level of summary or to detailed data, or by introducing new dimensions Requires the existence of materialized finer-grained data One can't drill down without having the data
Roll-up drill-down example
Roll-up drill-down example
Slice Slice: a subset of the multi-dimensional array corresponding to a single value for one or more dimensions and projecting on the rest of the dimensions E.g., project on Geo (store) and Time from values corresponding to Laptops in the product dimension: π_{StoreId, TimeId, Amount}(σ_{ArticleId = LaptopId}(Sales))
Slice Amounts to an equality select condition (a WHERE clause in SQL) E.g., slice Laptops
Slice Slicing means taking a slice out of a cube, given a certain set of selected dimension values, e.g., sales where city = Karachi and date = 20/1/2014 [Figure: sales cube over products (p1, p2), stores (s1, s2, s3) and time (day 1, day 2); the slice TIME = day 1 keeps the product-by-store table for day 1]
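A slice can be sketched as an equality selection on one dimension followed by a projection onto the remaining ones. The cube below is a plain dictionary keyed by (product, store, time); the cell values and their placement are assumptions for illustration, not a reconstruction of the slide's figure.

```python
DIMS = ("product", "store", "time")

# Hypothetical sales cube: (product, store, time) -> amount.
cube = {
    ("p1", "s1", "day 1"): 12,
    ("p1", "s3", "day 1"): 50,
    ("p2", "s1", "day 1"): 11,
    ("p2", "s2", "day 1"): 8,
    ("p1", "s1", "day 2"): 44,
    ("p1", "s2", "day 2"): 4,
}

def slice_cube(cube, fixed):
    """Select cells matching `fixed` (dimension -> value), then project on the other dimensions."""
    position = {name: i for i, name in enumerate(DIMS)}
    remaining = [i for i, name in enumerate(DIMS) if name not in fixed]
    return {
        tuple(key[i] for i in remaining): amount
        for key, amount in cube.items()
        if all(key[position[name]] == value for name, value in fixed.items())
    }

day_one = slice_cube(cube, {"time": "day 1"})      # product-by-store table
p1_day_two = slice_cube(cube, {"product": "p1", "time": "day 2"})  # per-store amounts
```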
Dice Dice: amounts to a range select condition on one dimension, or to an equality select condition on more than one dimension E.g., Range SELECT: π_{StoreId, TimeId, Amount}(σ_{ArticleId ∈ {Laptop, CellP}}(Sales))
Dice E.g., Equality SELECT on 2 dimensions, Product and Time: π_{StoreId, Amount}(σ_{ArticleId = Laptop ∧ MonthID = December}(Sales))
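Both dice variants can be sketched with one filter that accepts a set of allowed values per dimension: a set with several values plays the role of the range select, and single-value sets on two dimensions give the equality select. The rows and field names below are made up for illustration.

```python
# Hypothetical Sales rows.
sales = [
    {"article": "Laptop",  "store": "S1", "month": "December", "amount": 10},
    {"article": "CellP",   "store": "S1", "month": "December", "amount": 4},
    {"article": "Laptop",  "store": "S2", "month": "November", "amount": 7},
    {"article": "Printer", "store": "S2", "month": "December", "amount": 2},
]

def dice(rows, **allowed):
    """Keep rows whose value on each named dimension is in the allowed set."""
    return [
        row for row in rows
        if all(row[dim] in values for dim, values in allowed.items())
    ]

def project(rows, *columns):
    """Relational projection onto the given columns."""
    return [tuple(row[c] for c in columns) for row in rows]

# Range select on one dimension: ArticleId in {Laptop, CellP}.
range_dice = project(dice(sales, article={"Laptop", "CellP"}), "store", "month", "amount")
# Equality select on two dimensions: ArticleId = Laptop and MonthID = December.
equality_dice = project(dice(sales, article={"Laptop"}, month={"December"}), "store", "amount")
```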
Pivot A pivot is a two-dimensional layout of the summary data The x and y axes are the dimensions, and the intersection cell for any two dimension values contains the value of the measure [Figure: data cube with axes Date (3/1, 3/2, 3/3, 3/4), Product (Juice, Cola, Milk, Cream) and Region; sample cell values 10, 47, 30, 12]
Pivot Pivot (rotate): re-arranging data for viewing purposes The simplest view of pivoting is that it selects two dimensions to aggregate the measure The aggregated values are often displayed in a grid where each point in the (x, y) coordinate system corresponds to an aggregated value of the measure The x and y coordinate values are the values of the selected two dimensions The result of pivoting is also called cross tabulation
Pivot Consider pivoting the following data
Pivoting on City and Day Pivot
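Pivoting on two dimensions such as City and Day can be sketched as a cross tabulation: the measure is summed into a grid indexed by the two chosen dimension values. The rows below are made-up sample data, not the table from the slide.

```python
from collections import defaultdict

# Hypothetical rows: (city, day, amount).
rows = [
    ("Karachi", "Mon", 10),
    ("Karachi", "Tue", 5),
    ("Jamshoro", "Mon", 3),
    ("Karachi", "Mon", 2),
]

def pivot(rows):
    """Cross-tabulate: grid[city][day] holds the summed amount."""
    grid = defaultdict(lambda: defaultdict(int))
    for city, day, amount in rows:
        grid[city][day] += amount
    return {city: dict(days) for city, days in grid.items()}

def rotate(grid):
    """Swap the two axes, so the grid is indexed by day first, then city."""
    rotated = defaultdict(dict)
    for city, days in grid.items():
        for day, amount in days.items():
            rotated[day][city] = amount
    return dict(rotated)

by_city_and_day = pivot(rows)
by_day_and_city = rotate(by_city_and_day)
```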
OLAP query languages Getting from OLAP operations to the data As in the relational model, through queries In OLTP one has SQL as the standard query language However, OLAP operations are hard to express in SQL There is no standard query language for OLAP Choices are: SQL-99 for ROLAP, with the GROUPING SETS, ROLLUP and CUBE operators MDX (Multidimensional Expressions) for both MOLAP and ROLAP: similar to SQL, used especially in MOLAP solutions; in ROLAP it is mapped to SQL
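To give a feel for what the CUBE operator computes, the sketch below emulates it in plain Python: it aggregates the measure for every subset of the grouping dimensions. This is an emulation, not SQL; the "ALL" marker for a rolled-up dimension (SQL reports NULL there), the row layout, and the data are assumptions for illustration.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical fact rows.
rows = [
    {"store": "S1", "article": "Laptop", "amount": 10},
    {"store": "S1", "article": "CellP",  "amount": 5},
    {"store": "S2", "article": "Laptop", "amount": 7},
]

def cube(rows, dims, measure):
    """Sum `measure` for every subset of `dims`, like GROUP BY CUBE(dims).

    Dimensions that are not part of a grouping are reported as "ALL".
    """
    result = {}
    for size in range(len(dims), -1, -1):
        for group in combinations(dims, size):
            totals = defaultdict(int)
            for row in rows:
                key = tuple(row[d] if d in group else "ALL" for d in dims)
                totals[key] += row[measure]
            result.update(totals)
    return result

totals = cube(rows, ("store", "article"), "amount")
```

ROLLUP would produce only the groupings (store, article), (store) and the grand total, whereas CUBE also includes (article) on its own.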