Database Technologies for E-Business. Dongmei CUI


Abstract

In today's fast-paced business environment, business processes such as designing products, obtaining suppliers, selling, fulfilling orders, and providing services are performed through the extensive use of computer and communication technologies and computerized data. In this "e" environment, many companies are already collecting and refining vast amounts of data. By analyzing the data, they understand customer expectations and optimize operations to unprecedented degrees, which leads to success. These companies, called analytics competitors, compete on analytics, and three key attributes that they share have been identified. This paper investigates database technologies from a data perspective, focusing on the requirements for exploiting analytic techniques in today's e-business.

Key words: Relational data model, Multidimensional data model, Data warehouse, ETL, Online analytical processing, Data mining, Analytics competitor

1. INTRODUCTION

In today's competitive business environment, characterized by globalization, short product life cycles, short spans of distribution, and diverse customer needs, many innovations have appeared on both the business side and the technology side of organizations (Hammer 2001; Crainer 2000). On the business side, we have seen business process reengineering (BPR) and the balanced scorecard (BSC), the management philosophies of customer relationship and supply chain management, electronic commerce, and business-to-business trading exchanges. On the technology side, there is a move from standalone information systems to large-scale information systems, such as enterprise resource planning (ERP) systems, knowledge management systems (KMS), different information technology solutions for enterprise application integration (e.g. customer relationship management (CRM) systems), and interorganizational systems (IOS), as well as standardization efforts such as the Internet and electronic data interchange (EDI).

These technologies are indispensable for business processes such as designing products, obtaining suppliers, selling, fulfilling orders, and providing services. Business conducted in this way, using computer and communication technologies and computerized data, is called e-business (Alter 2002). In this "e" environment, many companies are already collecting and refining vast amounts of data in databases and data warehouses. By analyzing the data, they understand customer expectations and optimize operations to unprecedented degrees. Several well-known cases, such as electronic reservations at American Airlines, online ordering at American Hospital, and online book selling at Amazon, have been studied intensively. Such companies are called analytics competitors, and recent research (Davenport 2006) identifies the key attributes they share: they make wide use of data modeling and optimization technologies; they apply analytic techniques across the whole enterprise; and their senior executives promote analytics as a basis for competing. Needless to say, the exploitation of analytic techniques has brought these companies competitive advantage, not just because they are able to analyze, but also because they must keep competing on analytics.

This paper gives an overview of database technologies from a data perspective, focusing on the requirements for applying analytic techniques in e-business. The paper is organized as follows. First, section 2 investigates how data is designed in a database or data warehouse. Second, section 3 discusses data warehousing and online analytical processing in detail, within the data warehouse architecture, for the purpose of analyzing multidimensional data. Finally, section 4 presents a conclusion.

2. RELATIONAL AND MULTIDIMENSIONAL DATA MODEL

In database terminology, a data model is defined as a set of constructs that describe the structure of data, together with a set of operations for manipulating the data. In this section, we use the term data model to refer to a structure imposed on the data by design, not to a structure discovered within the data, which is usually captured by the entity-relationship (ER) modeling method to represent the semantic data model of entities and their relationships.

2.1 Relational Data Model

Generally, the data stored in a database is organized in tabular form. A database can contain multiple tables. The schema of a table consists of the table name and a set of named columns; the column names are also referred to as attributes, and a row of a table is called a tuple. An instance of the schema (namely, the actual table) is called a relation. Each table entry in the column for a given attribute is a value within the domain of that attribute.
For example, suppose we have a database containing five tables that store data about the supply process, as in the following figure.

Figure 1. Example of a database containing five relations: supply, part, project, inventory and supply. QOH and QTY stand for quantity on hand and quantity of supply respectively.

We now describe relational data more formally. Let $R$ be a relation schema, that is, a set of attributes $\{A_1, \ldots, A_i, \ldots, A_n\}$, where each attribute $A_i$ has an associated domain $\mathrm{Dom}(A_i)$. A row over the schema $R$ is a mapping $t : R \to \bigcup_i \mathrm{Dom}(A_i)$, where $t(A_i) \in \mathrm{Dom}(A_i)$. A table (relation) over the schema $R$ is a collection of rows over $R$. A database schema $\mathbf{R}$ is a collection $\{R_1, \ldots, R_i, \ldots, R_m\}$ of relation schemas, and a database $\mathbf{r}$ over the schema $\mathbf{R}$ consists of relations over $R_i$ for each $i = 1, \ldots, m$. A database consisting of such tables is called a relational database (Codd 1970).

Here, each instance of $R_i$, $i = 1, \ldots, m$, can be seen as a basic relation of the database. Moreover, new relations can be derived from the basic relations or from previously derived relations. Figure 2 shows that a user can operate on the database in two ways: by accessing the basic relations (tables) directly, or by operating on the basic relations indirectly through derived relations. Two kinds of relations can be derived: views and snapshots. A view is a set of data satisfying a constraint, a collection of some attributes (columns), or the result of a relational operation. It is dynamic and is also called a virtual relation because it does not store data, so the user accesses the basic relations (tables) of the database according to the definition of the view. A snapshot, on the other hand, is a copy of the data in the database as of some period. It does not vary over time; in other words, it is a static relation. For example, we can create a snapshot on the final day of every month to produce a monthly sales report.
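The difference between a view and a snapshot can be illustrated with a minimal sketch using Python's built-in sqlite3 module. The table and column names (sales, v_monthly, snap_monthly) are hypothetical and not taken from the paper, and SQLite has no dedicated snapshot construct, so the snapshot is emulated here by copying a query result into an ordinary table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Basic relation (hypothetical example data).
    CREATE TABLE sales (sale_id INTEGER PRIMARY KEY, month TEXT, amount REAL);
    INSERT INTO sales (month, amount) VALUES ('2006-09', 100.0), ('2006-09', 50.0);

    -- View: a stored query definition; it holds no rows of its own.
    CREATE VIEW v_monthly AS
        SELECT month, SUM(amount) AS total FROM sales GROUP BY month;

    -- Snapshot (emulated): a physical copy of the result as of this moment.
    CREATE TABLE snap_monthly AS
        SELECT month, SUM(amount) AS total FROM sales GROUP BY month;
""")

# A later change to the basic relation.
conn.execute("INSERT INTO sales (month, amount) VALUES ('2006-09', 25.0)")

# The view is re-evaluated against the basic relation; the snapshot is not.
view_total = conn.execute("SELECT total FROM v_monthly").fetchone()[0]
snap_total = conn.execute("SELECT total FROM snap_monthly").fetchone()[0]
print(view_total, snap_total)
```

After the extra insert, the view reflects the new row while the snapshot still holds the total from the moment it was created, which is the dynamic versus static distinction drawn above.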
As illustrated in Figure 2, users can use the SQL (Structured Query Language) interface to access R1 directly, and they can also get the data in R1 through V3. Further, to obtain data from R3, they can operate on V3 and V1, since V3 is defined over V1 and V1 is derived from R3. Here R1, R2, and R3 denote basic relations, and V1, V2, and V3 denote derived relations, each of which could be a view or a snapshot.

Figure 2. Basic relations and derived relations

In the relational data model, each relation is normalized according to the concept of functional dependency, and entries of a given relation can cross-reference other entries of the same relation or entries of a different relation through the primary-foreign key relationship (Codd 1970). In other words, the primary key (a column or combination of columns) identifies the rows of a relation and can be referenced from the same relation or from other relations. A foreign key, in contrast, refers to the primary key of another relation and thereby connects the two relations.

2.2 Multidimensional Data Model

Although data can be modeled relationally, it is more intuitive to think in terms of dimensions and facts when aggregating or summarizing data, for example the sales of a product in the month of September over all stores. Information about the dimension values is maintained in dimension tables; usually, one dimension table is created for each dimension. Information about the facts is organized in a fact table. Each row contains one fact, which is represented by references to the dimensions and by the measures (e.g. sales). Technically, each dimension table holds a primary key, which is also included in the fact table as a foreign key. The combination of all foreign keys becomes the primary key of the fact table.

We now consider what a schema expressing facts over multiple dimensions looks like. Most commonly, the multidimensional data model is mapped onto a star schema, which consists of a fact table and several dimension tables. The fact table has measure attributes that record the facts and dimension attributes that form foreign keys to the dimension tables.
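A star schema of this kind can be sketched with Python's built-in sqlite3 module. The table names, column names, and sample rows below are illustrative assumptions loosely modeled on the sales example, not a reproduction of any figure in the paper. The fact table's primary key is the combination of its foreign keys, and the query answers the "sales for a product in September over all stores" question mentioned above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension tables: one per dimension, each with its own primary key.
    CREATE TABLE time_dim (time_key INTEGER PRIMARY KEY, month TEXT, year INTEGER);
    CREATE TABLE item_dim (item_key INTEGER PRIMARY KEY, item_name TEXT);
    CREATE TABLE location_dim (location_key INTEGER PRIMARY KEY, city TEXT);

    -- Fact table: foreign keys to every dimension plus the measure.
    CREATE TABLE sales_fact (
        time_key INTEGER REFERENCES time_dim(time_key),
        item_key INTEGER REFERENCES item_dim(item_key),
        location_key INTEGER REFERENCES location_dim(location_key),
        dollars_sold REAL,
        PRIMARY KEY (time_key, item_key, location_key)
    );

    INSERT INTO time_dim VALUES (1, 'September', 2006), (2, 'October', 2006);
    INSERT INTO item_dim VALUES (1, 'printer'), (2, 'monitor');
    INSERT INTO location_dim VALUES (1, 'Tokyo'), (2, 'Osaka');
    INSERT INTO sales_fact VALUES
        (1, 1, 1, 300.0), (1, 1, 2, 200.0), (1, 2, 1, 150.0), (2, 1, 1, 400.0);
""")

# Sales of one product in September, summed over all locations.
september_printer_sales = conn.execute("""
    SELECT SUM(f.dollars_sold)
    FROM sales_fact AS f
    JOIN time_dim AS t ON t.time_key = f.time_key
    JOIN item_dim AS i ON i.item_key = f.item_key
    WHERE t.month = 'September' AND i.item_name = 'printer'
""").fetchone()[0]
print(september_printer_sales)
```

Each dimension referenced in the query costs one join against the fact table, which is why the number of dimension tables matters for query processing.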
Picture the fact table in the middle, encircled by several dimension tables. Figure 3 gives a concrete example from (Han & Kamber 2006), showing a star schema on the left side. As shown on the right side of Figure 3, the values of a dimension can be grouped in a hierarchical tree structure, so that the analyst can view data at different levels along the hierarchies of the time and location dimensions.

Figure 3. A star schema and the hierarchy of dimension

Different aggregation functions can be applied to lower-level data in order to obtain data at higher levels, along the direction of the arrows (see Figure 3). In business analysis, sum, count, min, max, and average are commonly used aggregation functions. For instance, the aggregation function "sum" can be applied to the sales values within each month to obtain the total sales figure at the monthly level.

In order to reduce the number of dimension tables (that is, joins) during query processing, the dimension tables in a star schema are de-normalized. The alternative, in which each dimension table is normalized, is the snowflake schema. In a snowflake schema, a dimension table is normalized by splitting its information into several tables. As with the star schema described above, picture a large fact table in the middle surrounded by dimension tables; now, however, each dimension table may in turn be surrounded by a number of smaller tables. As shown in Figure 4, the item dimension table is normalized into a supplier dimension table, and the location dimension table into a city dimension table. That is, after the item dimension table is normalized, the attribute supplier_key in the item dimension table is a foreign key, while the attribute supplier_key in the supplier dimension table is the primary key. The same change applies between the location dimension table and the city dimension table.

Figure 4. A snowflake schema

2.3 Representation of Multidimensional Data Model

Just as the n-tuples of the tables described in section 2.1 are represented as arrays in the computer (Codd 1970), facts with n dimensions can be organized as an n-dimensional cube stored as an n-dimensional array (Gray, Chaudhuri, Bosworth et al. 1997) (see Figure 5).
All the dimensions together are assumed to determine the measure uniquely. In our example from (Han & Kamber 2006), a particular time, item, location, and branch give a unique sale. Thus time t, item i, location j, and branch k give a cell [t][i][j][k] that records the measures for that sale. Each non-empty cell in the cube corresponds to a row in the fact table. This leads to a serious sparseness problem in the n-dimensional cube representation, because most combinations of dimension values have no associated measure (see Figure 5). For example, not every item is sold at every branch at every time. Several compressed variants, such as the iceberg cube (Fang, Shivakumar, Garcia-Molina et al. 1998), are used to avoid storing large, mostly empty arrays.

Multidimensional data cube technology was influenced by the success of spreadsheet programs in business analysis. Nowadays, however, the model based on relational technology is the most widely used, since it can leverage the know-how and software that already exist in relational database systems.
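The sparseness problem can be made concrete with a small sketch in plain Python. The dimension values and measures are invented for illustration. The dictionary keeps only the cells that have a measure, while the nested list is the dense [t][i][j][k] array. The final filter is only a simple threshold in the spirit of iceberg queries, not the algorithm of the cited paper.

```python
# Hypothetical dimension values for a 4-D cube.
times = ["Q1", "Q2"]
items = ["printer", "monitor", "phone"]
locations = ["Tokyo", "Osaka"]
branches = ["B1", "B2"]

# Sparse form: only combinations that actually have a measure are stored.
sparse_cube = {
    ("Q1", "printer", "Tokyo", "B1"): 300.0,
    ("Q1", "monitor", "Osaka", "B2"): 150.0,
    ("Q2", "printer", "Tokyo", "B1"): 400.0,
}

# Dense form: one cell [t][i][j][k] per combination; empty cells hold None.
dense_cube = [[[[sparse_cube.get((t, i, j, k)) for k in branches]
                for j in locations]
               for i in items]
              for t in times]

total_cells = len(times) * len(items) * len(locations) * len(branches)
filled_cells = len(sparse_cube)
print(f"{filled_cells} of {total_cells} cells carry a measure")

# Threshold filter: keep only cells whose measure reaches a minimum value.
iceberg = {key: value for key, value in sparse_cube.items() if value >= 200.0}
```

Even in this toy cube most of the dense array is empty, and the gap widens quickly as dimensions and dimension values are added.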

Figure 5. Lattice of cuboids, making up a 4-D data cube for the dimensions time, item, location, and branch. Each cuboid represents a different degree of summarization (Han & Kamber 2006)

3. DATA WAREHOUSING AND ONLINE ANALYTICAL PROCESSING

Data warehousing and online analytical processing have become increasingly important for the comprehensive analysis of current and historical data, in order to extract key insights from the vast amounts of data being collected. The purpose of these systems is to give users fast analysis, so that they can interactively analyze the data to understand business patterns.

Generally, the relational database described in section 2.1 is designed mainly to maintain data for everyday operations. A bank database is a typical example: it contains information about accounts and runs every day behind a network of ATMs. Such a database mainly supports transactions, which are operations that access and change (e.g. insert or update) the data in the database. This is called OnLine Transaction Processing (OLTP). An OLTP system uses the primary-foreign key relationship to relate tables to each other, and it is usually created for a specific use, such as banking, order processing, ticket tracking, or personnel file systems.

However, once the emphasis shifted towards comprehensive analysis of current and historical data in order to understand customer expectations and business patterns, data processing was required to summarize large amounts of low-level data (see Figure 3) and to relate different aspects of the business to find interesting correlations. Database access therefore needs to be based on complex queries, which is called OnLine Analytical Processing (OLAP) (Kimball & Strehlo 1995).
In order to process complex queries efficiently when analyzing vast amounts of data from multiple dimensions, the data has to be collected in advance of the queries. That is, data needs to be extracted from many sources and collected either in a database holding information about subjects spanning the entire organization, called a data warehouse, or in multiple small databases each holding a subset of the corporation-wide data (e.g. marketing data), called data marts. Figure 6 shows a multi-tier data warehouse architecture (McFadden 1996; Han & Kamber 2006). Usually, the data stored in a data warehouse (or data mart) is copied from multiple OLTP databases in order to keep a history of the data (sets of snapshots). To obtain the data while allowing the source systems to continue working normally, it is necessary to watch out for redundant, missing, or heterogeneous data. In data warehousing systems, a variety of data extraction and cleaning tools, together with load and refresh utilities, are used to populate warehouses during the extracting, transforming, and loading process (Inmon 2002; Kimball & Ross 2002). Data extraction from "foreign" sources is usually implemented via gateways and standard interfaces.

Figure 6. Data warehouse: a multi-tier architecture

Not surprisingly, there is a high probability of errors and anomalies in the data, since large volumes of data from multiple sources are involved. Data cleaning therefore involves several tasks: filling in missing entries; identifying outliers and smoothing out noisy data (e.g. incorrect attribute values caused by random error or variance in a measured variable); correcting inconsistent data (e.g. inconsistent value assignments, inconsistent field lengths, inconsistent descriptions); and resolving redundancy caused by data integration (e.g. schema integration: A.cust-id = B.cust-#). Discrepancies in the data are usually detected by checking for field overloading, by checking the uniqueness rule, consecutive rule, and null rule, by using metadata such as domain, range, or dependency, and by applying tools such as the following.

Data scrubbing tools use simple domain knowledge (e.g. postal addresses, spell-checking) to detect and correct errors in the data. Parsing and fuzzy matching techniques are often used to scrub data from multiple sources.

Data auditing tools scan the data and discover rules and relationships in order to detect violators, for example by analyzing correlation and clustering to find outliers. Such tools may thus be considered variants of data mining tools. For instance, a tool may discover, based on statistical analysis, the suspicious pattern that a certain car dealer has never received any complaints.

Data migration tools and ETL (extraction/transformation/loading) tools support the migration and integration of data. Data migration tools allow simple transformation rules to be specified, for example, replacing the string "gender" by "sex", while ETL tools give users a graphical user interface for specifying transformations.

After the data has been extracted, cleaned, and transformed, batch load utilities are typically used to populate the warehouse.
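Some of the cleaning and transformation tasks above can be sketched in a few lines of plain Python. This is an illustrative sketch, not the behavior of any particular tool: the function name clean, the field names, and the choice of the median as the fill value are all assumptions made for the example.

```python
from statistics import median

def clean(records):
    """Apply a few illustrative cleaning rules to a list of dict records."""
    # Schema integration and a simple migration rule: map source-specific
    # field names onto one agreed name.
    renames = {"cust-#": "cust_id", "cust-id": "cust_id", "gender": "sex"}
    unified = [{renames.get(k, k): v for k, v in r.items()} for r in records]

    # Fill in missing amounts with the median of the known amounts.
    known = [r["amount"] for r in unified if r.get("amount") is not None]
    fill_value = median(known)
    for r in unified:
        if r.get("amount") is None:
            r["amount"] = fill_value

    # Resolve redundancy: keep the first record seen for each customer id.
    seen, deduplicated = set(), []
    for r in unified:
        if r["cust_id"] not in seen:
            seen.add(r["cust_id"])
            deduplicated.append(r)
    return deduplicated

# Records from two hypothetical sources, A (cust-id) and B (cust-#).
raw = [
    {"cust-id": 1, "gender": "F", "amount": 120.0},
    {"cust-#": 2, "gender": "M", "amount": None},
    {"cust-#": 1, "gender": "F", "amount": 120.0},
    {"cust-id": 3, "gender": "M", "amount": 80.0},
]
cleaned = clean(raw)
for row in cleaned:
    print(row)
```

Real scrubbing and auditing tools apply far richer rules (fuzzy matching, outlier detection); the sketch only shows where such rules sit between extraction and loading.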
Loading involves several steps: checking integrity constraints; sorting; creating the derived tables stored in the warehouse by summarization, aggregation, and other computation; building indices and other access paths; and partitioning the data across multiple target storage areas. Furthermore, a load utility must allow the system administrator to monitor status and to cancel, suspend, and resume a load. If periodic checkpoints are taken and a failure occurs during the load, the loading process can be restarted from the last checkpoint. In practice, pipelined and partitioned parallelism are typically used to keep loads from taking a very long time; a sequential load of a terabyte of data, for example, may take weeks or months. Even with parallelism, however, loading may still take too long. Incremental loading can therefore be used during refresh to reduce the volume of data that has to be incorporated into the warehouse: only the updated tuples are inserted.

Refreshing a warehouse consists of propagating updates on the source data so that the data stored in the warehouse is updated correspondingly. Usually, the warehouse is refreshed periodically (e.g. daily or weekly). The refresh policy is set by the warehouse administrator, depending on factors such as user needs and traffic. Most contemporary database systems provide replication servers that support incremental techniques for propagating updates from a primary database to one or more replicas. Such replication servers can be used to refresh a warehouse incrementally when the sources change.

The data stored in a data warehouse is modeled in the multidimensional data model described in section 2.2. How to compute and organize the data cube is an important issue in OLAP. Multidimensional data can be stored and organized in different ways. In the OLAP engine tier shown in Figure 6, there are two contrasting approaches, called relational OLAP (ROLAP) and multidimensional OLAP (MOLAP). In a ROLAP system, the data is stored in relational tables, and the analytical engine is built on top of a relational database system, using the standard SQL interface to access the multidimensional data in tables that are commonly mapped onto a star schema or a snowflake schema. In MOLAP systems, on the other hand, the data is stored in a specialized form, such as the multidimensional arrays described in section 2.3. Since ROLAP uses well-developed relational database technology (e.g. query processing and optimization), it can coexist with other data sources based on relational database technology and does not need any specialized storage mechanism, whereas MOLAP computes and organizes the data cube in an n-dimensional array, which can lead to fast multidimensional analysis.
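The contrast between the two storage approaches can be sketched by computing the same summary both ways in Python. The data and names are hypothetical, and the sketch ignores everything that makes real OLAP engines fast (indexing, precomputation, compression). It only shows that a ROLAP-style engine aggregates rows through SQL, while a MOLAP-style engine aggregates over array positions.

```python
import sqlite3

items = ["printer", "monitor"]
months = ["September", "October"]
facts = [
    ("printer", "September", 500.0),
    ("monitor", "September", 150.0),
    ("printer", "October", 400.0),
]

# ROLAP style: facts live in a relational table and are aggregated with SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales_fact (item TEXT, month TEXT, dollars_sold REAL)")
conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?)", facts)
rolap_totals = dict(conn.execute(
    "SELECT item, SUM(dollars_sold) FROM sales_fact GROUP BY item"))

# MOLAP style: facts live in an array indexed by dimension positions.
cube = [[0.0 for _ in months] for _ in items]
for item, month, dollars in facts:
    cube[items.index(item)][months.index(month)] = dollars
molap_totals = {item: sum(cube[row]) for row, item in enumerate(items)}

print(rolap_totals)
print(molap_totals)
```

Both paths yield the same per-item totals; the approaches differ in where the data lives and in which existing technology can be reused.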
The benefits of both can be combined in hybrid OLAP (HOLAP). For example, Microsoft SQL Server 2000 supports a HOLAP server, which can store large volumes of detail data in a relational database while aggregations are kept in a separate MOLAP store. According to recent reports from vendors (Ault 2003; Oracle Presentation 2005), there are two main moves among them: building a specialized multidimensional engine, and attempting to push OLAP functionality into relational databases.

At the front end of the data warehouse architecture illustrated in Figure 6, users use front-end tools to make complex queries, modify information in a report, switch between aggregated and detail data, select part of the data, and so forth. They do so through OLAP operations that explore the multidimensional data cube: moving up the dimension hierarchy (roll up), moving down (drill down), restricting to a dimension value (slice), selecting an aggregated sub-space (dice), and cross-tabulating (pivot).

Like online analytical processing, data mining is one of the most important approaches to multidimensional analysis in data warehouses. Data mining is the extraction of interesting (nontrivial, implicit, previously unknown, and potentially useful) information or patterns from data in large databases (Fayyad, Piatetsky-Shapiro, Smyth, & Uthurusamy 1996; Han & Kamber 2006); it tries to generate hypotheses by uncovering hidden patterns. Motivated by the popularity of OLAP technology, Han developed an online analytical mining (OLAM) mechanism to integrate OLAP with multidimensional data mining (Han 1997; Han & Kamber 2006). OLAM provides facilities for data mining on different subsets of data and at different levels of abstraction by drilling, pivoting, filtering, dicing, and slicing on a data cube. Together with visualization tools, this can greatly enhance the power and flexibility of exploratory data mining (Aggarwal 2002).

4. CONCLUSION

Since the advent of information technology, businesses have been collecting vast amounts of data about their daily transactions, refining the systems that produce transaction data, making data from multiple sources available in warehouses, selecting and implementing analytic tools, and assembling the hardware and communication environment. From a data perspective, we discussed the database technologies associated with the exploitation of analytic techniques: multidimensional data modeling, data warehousing, and online analytical processing, which are indispensable technological requirements for being an analytics competitor. The purpose of these systems is to give users fast analysis, so that they can interactively analyze the data to understand business patterns such as customer behavior, product movement, employee performance, and financial reactions. Building such a system raises many challenges, including data modeling, schema design, loading, maintenance, and query processing. For ease of use, simpler deployment, and optimal value, a trend is emerging in which data collection, storage, processing, and other issues specific to analytics are incorporated into the overall system design.

REFERENCES

Aggarwal, Charu C. (2002) Towards Effective and Interpretable Data Mining by Visual Interaction, SIGKDD Explorations, Vol. 3, Issue 2, pp. 11-22

Alter, Steven (2002) Information Systems: The Foundation of E-Business, Fourth Edition, Prentice Hall, pp. 3-35

Ault, Mike (2003) Oracle Data Warehouse Management: Secrets of Oracle Data Warehousing, Rampant TechPress

Codd, E. F. (1970) A Relational Model of Data for Large Shared Data Banks, Communications of the ACM, Vol. 13, No. 6, June

Crainer, Stuart (2000) The Management Century: A Critical Review of 20th Century Thought & Practice, Booz Allen & Hamilton Inc., Japanese translation, pp. 240-296

Davenport, Thomas H. (2006) Competing on Analytics, Harvard Business Review, Jan.

Fang, M., Shivakumar, N., Garcia-Molina, H., Motwani, R., and Ullman, J. D. (1998) Computing Iceberg Queries Efficiently, Proceedings of Very Large Data Bases, pp. 299-310, New York, Aug.

Fayyad, U., Piatetsky-Shapiro, G., Smyth, P., and Uthurusamy, R. (1996) Advances in Knowledge Discovery and Data Mining, Menlo Park, CA: AAAI Press

Giudici, P. (2003) Applied Data Mining: Statistical Methods for Business and Industry, England, Wiley & Sons

Gray, J., Chaudhuri, S., Bosworth, A., Layman, A., Reichart, D., Venkatrao, M., Pellow, F., and Pirahesh, H. (1997) Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab and Sub-Total, Data Mining and Knowledge Discovery, No. 1, pp. 29-54

Hammer, Michael (2001) The Agenda: What Every Business Must Do to Dominate the Decade, Three Rivers Press

Han, J. W. and Kamber, M. (2006) Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers

Han, J. (1997) OLAP Mining: An Integration of OLAP with Data Mining, Proceedings of the 1997 IFIP Conference on Data Semantics, Oct.

IDG Japan (2004) Business Innovation Powered by Oracle E-Business Suite, ISBN

Inmon, W. H. (2002) Building the Data Warehouse (3rd Ed.), New York, Wiley & Sons

Kimball, R. and Ross, M. (2002) The Data Warehouse Toolkit (2nd Ed.), New York, Wiley & Sons

Kimball, R. and Strehlo, K. (1995) Why Decision Support Fails and How to Fix It, SIGMOD Record, 24(3), pp. 92-97

Knightsbridge (2005) Top 10 Trends in Business Intelligence and Data Warehousing for 2005, White Paper, Knightsbridge Solutions LLC, Jan.

McFadden, Fred R. (1996) Data Warehouse for EIS: Some Issues and Impacts, Proceedings of the Hawaii International Conference on Systems Sciences

Oracle Presentation (2005) Oracle Database 10g Release 2: The Exploitation of Data Warehouse, Oracle Corporation
