On Supporting the Data Warehouse Design by Data Mining Techniques 1


C. Sapia, G. Höfling
FORWISS Knowledge Bases Group, Munich, Germany

M. Müller, C. Hausdorf, H. Stoyan
FORWISS Knowledge Acquisition Group, Erlangen, Germany

U. Grimmer
DaimlerChrysler AG, Research & Technology, FT3/KL, Ulm, Germany

Abstract

Integration of data warehousing and data mining by applying the latter as a front end technology to the former is a known approach. Nevertheless, the data warehouse design process can also be seen as an area of application for data mining techniques. In this paper, we compile the requirements of the data warehouse design process concerning data source analysis, structural integration of data sources, data cleansing, data modeling, and physical data warehouse design. Based on this, we demonstrate how currently existing data mining techniques can support these phases and identify directions for future research. The paper contains a collection of ideas that can serve as a basis for systematic practical and research work in this promising field.

1 Introduction

Today, large IT projects focusing on the integration of distributed, heterogeneous databases are initiated in nearly every big company. This paper is inspired by a scenario at DaimlerChrysler AG (DC), which was formed by the merger of the former companies Chrysler Corporation and Daimler-Benz AG on November 17th. If not marked otherwise, references will mainly focus on application areas from the former Daimler-Benz AG. Being one of the largest global providers of automotive and transportation products and services implies having closely linked units like production facilities, administration, R&D centers, and representations at various locations. Within each of those units, data is generated, transferred, stored, and analyzed. Typical categories of data are technical/product, financial, and organizational data.
Traditionally, operational database systems running on mainframe computers have been used for managing these data at DC. Although supported by Management Information Systems, analyses still required a lot of manual work regarding data access and transformation tasks. With the ever increasing amount of electronically accessible data over the last few years, there has been a strong demand for more efficient data management applications, both in terms of improved application support and technological advances. The data warehousing (DW) methodology seems to provide the appropriate means for such advanced data management.

In contrast to the classical data warehouse approach, where one central warehouse contains all corporate data, DC is following the approach of building so-called data marts, which are application specific database systems for homogeneous user groups. Examples of data marts can be found at DC in areas like car quality improvement or marketing applications where customer and product data are linked. Quality and availability of such data marts will be one of the main differentiating factors for DC's future business success. Given the large number and different types of existing operational databases at DC, the design and implementation of any data mart is a challenging and time-consuming process. DC sees potential for applying data mining (DM) methods, as shown in this article, to increase both implementation speed and quality of future data marts and warehouse solutions.

Basically, there are three ways to combine data warehouse and data mining technologies:

1. Integration on the front end level combining OLAP and data mining tools into a homogeneous GUI (e.g., [HCC98]).
2. Data warehouse technology supporting the data mining process by providing efficient database technology (e.g., [CS97]) and a documented high-quality data source (e.g., [JQJ98]).

Most of the pragmatic and scientific efforts are currently invested into topics 1 and 2.
In this paper, we show that a third way of integration seems beneficial:

3. Data mining techniques supporting the data warehouse design process.

We investigate the data warehouse design process as an area of application for data mining and statistical methods. To provide a solid foundation for practical and research work in this area, we systematically compile the requirements of the DW process. For each step of the process, we investigate if and which current data mining techniques can support the work of a DW designer. If some requirements cannot be met by current methods, we derive directions for future research.

The design phase is the most costly and critical part of a data warehouse implementation (roughly 80% of the effort). It currently involves a lot of manual work because of the lack of adequate tools to support the designer's work. To understand where data mining can support this phase, we systematically cover the different phases of this process:

- Data Source Analysis (section 4)
- Structural Integration of Data Sources (section 5)
- Data Cleansing (section 6)
- Multidimensional Data Modeling (section 7)
- Physical DW Design (section 8)

1 Part of this work was supported by DaimlerChrysler AG, Research & Technology.

FIELD  Description               Type     Unit      Example
VID    ID number of the vehicle  integer
VTP    vehicle type              string   -         C240
ET     engine                    integer  -         3
ST     steering wheel type       integer  -         1
CTY    country                   string   -         D
POW    power                     integer  h.p.      170
WGT    weight                    integer  kg        1345
PD     production date           date     -         87/02
MIL    mileage                   integer  km/miles  2319
DAT    date of the repair        date     -         99/02
PRC    repair price              integer  DM/       1235
TX     taxes                     float
CN     customer name             string   -         Müller
CA     customer age              integer  years     45
CB     customer birthday         date

Figure 1: The scheme of the data source table VehicleRepair used in our example

2 Terminology and Scope

Being active areas of research, both the data mining and the data warehouse community developed their own terminology. In this section, we present our terminology for this paper to avoid inconsistencies. The goal of a data warehouse system is to provide the analyst with an integrated and consistent view on all enterprise data which are relevant for the analysis tasks. A data warehouse constitutes the single point of truth in an enterprise. Therefore, a high standard of data quality has to be assured.
The warehouse system includes components that extract data from the sources, transform and clean data, load them into the warehouse database, and allow users to query data according to their needs. Our focus in this paper is the design process of this system. This process is a special case of the general software engineering process.

We treat the terms knowledge discovery in databases (KDD) and data mining (DM) synonymously. Data mining can be looked at from a process oriented perspective (e.g., CRISP-DM process model [CC98]) and from a method oriented perspective. In this paper, we take the method oriented point of view as we focus on which methods and algorithms can be beneficial during data warehouse design. There is no common definition of when a data analysis algorithm is to be called a data mining algorithm. This stems partly from the fact that data mining is related to many research areas including machine learning, statistics, artificial intelligence, knowledge management, and database technology. In this paper, we define data mining methods in a wider sense, i.e., data mining methods in a narrower sense like association rule methods, machine learning methods like decision tree induction, statistical methods like cluster analysis, and soft computing methods like neural nets are subsumed under this term.

3 The Vehicle Repair Scenario

Throughout the rest of this paper, we use the following example to demonstrate the usefulness and limitations of data mining methods during the data warehouse design process. The design of this fictional example is based on experiences from several real world projects in the automotive application area (e.g., [HBD+97]). Nevertheless, it is a simplification in order to be able to demonstrate the basic concepts without assuming deeper knowledge of the automotive domain. Let us assume that a car manufacturer like DaimlerChrysler wants to build a data warehouse which contains information about vehicle repairs.
Typical analysis tasks in this context are the analysis of the quality of products (cars), processes (handling of warranty claims), and services (e.g., garage repair details), the evaluation and redefinition of warranty policies, the prediction of warranty costs, and the collection and analysis of data on wear and damages of cars. The information about the individual repairs is gathered by the individual garages using a repair accounting system. Due to the distributed nature of the large organization, different repair accounting systems are used in different countries or organizational entities (e.g., as a consequence of the merger).

Figure 1 shows an excerpt of the data source systems (for reasons of simplicity, we assume that the database has already been transformed into a relational form). The database contains information about cars, customers, and car repairs done in the garage. In Figure 1, we added a description of the database fields that explains the semantics. This information is not necessarily available to the warehouse designer. Figure 2 shows example vehicle repair cases from data source A (Germany), and Figure 3 those from data source B (Great Britain).

Figure 2: Extract of the example table for data source A (columns: VID, VTP, ET, ST, CTY, POW, WGT, PD, MIL, DAT, PRC, TX, CN, CA, CB; some entries are marked "?")

Figure 3: Extract of the example table for data source B (columns: VID, VTP, ET, ST, CTY, POW, WGT, PD, MIL, DAT, PRC, CN, CA, CB; no TX column; some entries are marked "unknown")

4 Data Source Analysis

After having identified the information needs and the corresponding relevant sources of data covering these needs in an enterprise (e.g., in our scenario the different underlying databases of the country specific repair accounting systems), the first step of the warehouse design cycle is the analysis of these sources. During this step, data sources have to be analyzed regarding structure, semantics, and data quality (correctness, consistency etc.). Data sources that are relevant for data warehousing (mostly legacy systems) are often not adequately documented. That makes a reverse engineering approach necessary. To this end, data mining algorithms can be used to discover the implicit information about the semantics of the data structures provided by the data stored in the system. In the following sections, the main issues are sketched where data mining algorithms can support the preparation of a data source selected for building up a data warehouse.
Discovering the Meaning of Attributes. Often, the exact meaning of an attribute cannot be deduced from its name and data type. In this case, the existing data can help to discover functional dependencies which give more information about the semantics. For example, association rules as introduced in [AIS93] are suited for this purpose. Given a set of transactions, where each transaction is a set of items (e.g., items of country like "D" or "GB" or items of steering wheel type like "left" or "right"), an association rule is an expression of the form X ⇒ Y, where X and Y are sets of items. The intuitive meaning of such a rule is that transactions of the database which contain X tend to contain Y. An example of such a rule is: 90% of repaired cars from GB also have steering wheel type 2 (right), and 20% of all repaired cars have both properties. In this example, 90% is called the confidence and 20% the support of the rule. Both sides X and Y of such a rule can contain sets of properties. Normally, the result of data mining algorithms for finding association rules consists of all rules with a support greater than a given minimum support and a confidence greater than a given minimum confidence.

A generalization of this algorithm for mining association rules is to apply it to data in relational form instead of data consisting of a set of transactions. Such rules have the form

$$(A_{i_1} = a_{i_1}) \wedge (A_{i_2} = a_{i_2}) \wedge \dots \Rightarrow (A_{j_1} = a_{j_1}) \wedge (A_{j_2} = a_{j_2}) \wedge \dots$$

where the $A_{x_y}$ are attributes (column names) and the $a_{x_y}$ are attribute values (cell contents). Applied to the example above and the table in Figure 3, such a rule could be: (CTY = GB) ⇒ (ST = 2): 90% of repaired cars from GB also have steering wheel type 2 (right).

Let us consider a relation describing component or equipment characteristics of automobiles like engine and steering wheel type. From strong and weak association rules discovered from this kind of data, one can conclude which attribute models which type of component or equipment. This also requires background knowledge about allowed and probable combinations of equipment. In our example, one can conclude with the appropriate background knowledge that the rather unintelligible attribute name ST models the different steering wheel types.
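As a minimal sketch of how the support and confidence of a relational rule such as (CTY = GB) ⇒ (ST = 2) could be computed, consider the following Python fragment. The function name `rule_stats` and the toy rows are assumptions made for illustration; the rows are constructed so that the figures match the 90% / 20% example in the text and are not taken from the source tables.

```python
from typing import Dict, List, Tuple

Row = Dict[str, object]


def matches(row: Row, condition: Row) -> bool:
    """True if the row has every attribute value listed in the condition."""
    return all(row.get(attr) == value for attr, value in condition.items())


def rule_stats(rows: List[Row], antecedent: Row, consequent: Row) -> Tuple[float, float]:
    """Return (support, confidence) of the rule antecedent => consequent."""
    n_total = len(rows)
    n_antecedent = sum(1 for r in rows if matches(r, antecedent))
    n_both = sum(
        1 for r in rows if matches(r, antecedent) and matches(r, consequent)
    )
    support = n_both / n_total if n_total else 0.0
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return support, confidence


# Hypothetical toy data: 20 repairs from GB (18 with ST = 2), 70 from D.
rows: List[Row] = (
    [{"CTY": "GB", "ST": 2}] * 18
    + [{"CTY": "GB", "ST": 1}] * 2
    + [{"CTY": "D", "ST": 1}] * 70
)

support, confidence = rule_stats(rows, {"CTY": "GB"}, {"ST": 2})
print(f"support={support:.2f} confidence={confidence:.2f}")
```

A rule miner in the sense of [AIS93] would enumerate candidate rules and keep those above the minimum support and confidence; the sketch only evaluates one given rule.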

Other data mining techniques, e.g., decision tree and rule induction, and statistical methods, e.g., multivariate regression or Bayesian networks, can also produce useful hypotheses in this context. In summary, the task of reconstructing the meaning of attributes would be optimally supported by dependency modeling using data mining techniques and mapping this model against expert knowledge, e.g., business models.

Discovering Encoding Schemes. Many attribute values are (numerically) encoded. Different vehicle properties (e.g., the installed engine type) might be encoded into the vehicle ID number (e.g., the 4th and 5th digits are "03" for a specific fuel injection engine). Discovering the encoding scheme is an important step in understanding the semantics of this field. Identifying inter-field dependencies helps to build hypotheses about encoding schemes when the semantics of some fields are known (e.g., if the field steering wheel type of an automobile record nearly always contains "2" if the country is "GB", this value may encode a right steering wheel). Analogously to the previous subsection, domain knowledge about the attribute values has to be considered.

To make it more complicated, encoding schemes change over time (e.g., some former car components encoded in the VID are no longer part of modern cars whereas some new components have to be encoded; in this case, the digits of the components that no longer exist are possibly used to encode the new components). Data mining algorithms are useful to identify changes in encoding schemes, the time when they took place, and the part of the code that is affected. As described below, new data mining methods to discover those patterns have to be developed and current methods have to be adapted and extended.

Let us assume that the segmentation of the VID code is known and codes for old components are not used for encoding new ones. The 4th and 5th digits of the vehicle ID number encode the installed engine type.
In older vehicle cases, engine types with the code "03" occur. From a certain point in time on, this engine type was no longer installed and, thus, "03" no longer occurs in current cases. A possible sequence of codes for vehicle cases is (03, 02, 03, 03, 01, 02, 03, 02, 02, 01), associated with the following sequence of relative frequencies of the code "03": (1.00, 0.50, 0.66, 0.75, 0.60, 0.50, 0.57, 0.50, 0.44, 0.40). A data mining method can search for the point in time from which on the relative frequency of a code, e.g., "03", is monotonically decreasing. In the example above, this point corresponds to the seventh case. The sequence of relative frequencies above can be seen as a time series. With advanced data mining techniques for time series ([AR99]), more complex types of patterns can be discovered.

When codes for old components like "03" can be reused to encode new components, the approach above is not applicable. One has to search in the space of partitions of the cases (e.g., partition 1: case 1 to case i; partition 2: case i+1 to case n) so that the following rules can be found: partition 1: (code = 03) ⇒ (ET = old engine type) with 100%; partition 2: (code = 03) ⇒ (ET = new engine type) with 100%. When even the code segmentation is not known, the search has to be expanded to possible types of code segmentation.

Furthermore, methods which use data sets to learn a model of normal behavior can be adapted to the task. The model learned can be used to evaluate significant changes. Neural nets, for example, are used to detect credit card fraud. [AAR96] proposes an algorithm that identifies elements of a list (e.g., numbers or texts) that do not fit into the context of the list. [GMV96] describes an incremental learning algorithm that identifies patterns which are not explicable with the previously learned model. To identify the record number where the encoding scheme changes, one has to look where the most informative patterns occur.
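The relative-frequency search from the ten-case example above can be sketched in a few lines of Python. This is only an illustration under the stated assumptions (known code segmentation, no reuse of old codes); the function names `running_frequency` and `monotone_decrease_start` are hypothetical and not part of any method cited in the text.

```python
from typing import List, Sequence


def running_frequency(codes: Sequence[str], target: str) -> List[float]:
    """Relative frequency of `target` among the first 1, 2, ..., n cases."""
    freqs = []
    hits = 0
    for position, code in enumerate(codes, start=1):
        if code == target:
            hits += 1
        freqs.append(hits / position)
    return freqs


def monotone_decrease_start(freqs: Sequence[float]) -> int:
    """1-based index of the earliest case from which the sequence is
    strictly decreasing up to the last case."""
    start = len(freqs) - 1  # 0-based; a single last value is trivially a run
    for i in range(len(freqs) - 1, 0, -1):
        if freqs[i] < freqs[i - 1]:
            start = i - 1  # extend the decreasing suffix one step to the left
        else:
            break
    return start + 1


# Code sequence from the example in the text.
codes = ["03", "02", "03", "03", "01", "02", "03", "02", "02", "01"]
freqs = running_frequency(codes, "03")
print([round(f, 2) for f in freqs])
print(monotone_decrease_start(freqs))  # the seventh case, as stated in the text
```

On real data the frequency would rarely decrease strictly from one exact case on, so a practical variant would probably need a tolerance or a windowed frequency; that choice is outside what the text specifies.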
A further approach would be to partition the data set, to build models on these partitions applying the same data mining algorithms, and to compare the differences between these models.

Discovery of Integrity Constraints. Integrity constraints are useful for understanding a data source, but can also be applied to ensure data quality because detailed integrity constraints help to identify incorrect records. Integrity constraints can be domains of attributes (e.g., vehicle type ∈ {C180, C220, C240, C250, E290, E430, ...}, power ∈ {122, 129, 170, 229, ...}, or weight ∈ [1000, 2000]) or constraints on different attributes (e.g., the production date of a car precedes dates of repair). Furthermore, inter-relation integrity constraints carry semantics useful for reverse engineering (e.g., a foreign key relationship in a normalized database, like from the vehicle repair relation of our example to an engine relation modeling all characteristics of existing engine types).

Data mining and statistical methods can be used to induce integrity constraint candidates from the data. These include, for example, visualization methods to identify distributions for finding domains of attributes or methods for dependency modeling. In our example, a possible result of a data mining method would be the finding that vehicle type, power, and weight correlate. Other data mining methods can find intervals of attribute values which are rather compact and cover a high percentage of the existing values, e.g., in 99% of the cases: WGT ∈ [1000, 2000]. The trade-off between the compactness and the coverage can be expressed in a specific interestingness measure for these types of findings, e.g., the finding above would be more interesting than the following finding: in 100% of the cases: WGT ∈ [1000, 20000]. These results can be seen as hypotheses for attribute domains. Besides, they give hints for noise in the data. In our example, the 1% of the cases with WGT ∉ [1000, 2000] could have incorrect weight values.
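One simple way to look for a compact, high-coverage interval such as WGT ∈ [1000, 2000] is to slide a window over the sorted values and keep the narrowest window that still contains the required share of cases. The sketch below assumes this sliding-window strategy, which the text does not prescribe; `tightest_interval`, `outside`, and the weight values are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def tightest_interval(values: Sequence[float], coverage: float) -> Tuple[float, float]:
    """Narrowest closed interval containing at least `coverage` of the values."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0.0 < coverage <= 1.0:
        raise ValueError("coverage must be in (0, 1]")
    ordered = sorted(values)
    n = len(ordered)
    # Number of values the interval must contain; the small epsilon guards
    # against floating point round-off in coverage * n.
    k = max(1, math.ceil(coverage * n - 1e-9))
    best = min(range(n - k + 1), key=lambda i: ordered[i + k - 1] - ordered[i])
    return ordered[best], ordered[best + k - 1]


def outside(values: Sequence[float], low: float, high: float) -> List[float]:
    """Values not covered by the interval: candidates for noise."""
    return [v for v in values if not low <= v <= high]


# Hypothetical weights: 99 plausible values and one suspicious entry.
weights = [1000 + 10 * i for i in range(99)] + [20000]

low, high = tightest_interval(weights, 0.99)
print(low, high)
print(outside(weights, low, high))
```

Lowering the coverage makes the interval more compact but leaves more cases outside it, which is the trade-off an interestingness measure would have to weigh.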

5 Structural Integration of Data Sources

Once each single data source is understood, content and structural integration follows. This step involves resolving different kinds of structural and semantic conflicts. This problem has been researched in the context of scheme integration for federated database systems (e.g., [CHS+97]). The task is to integrate two given schemes into a global one. In a data warehouse environment, this situation occurs, e.g., if two data sources contain data about the component structure of a car or if data sources containing information about vehicle repairs from different countries have to be consolidated, as in our example. Among others, the following types of conflicts can be distinguished:

- Description conflicts (e.g., naming conflicts, domain conflicts, scaling conflicts) occur if the same real world objects are modeled in different schemes and are described differently.
- Structural conflicts occur if the same real world entity is contained in different schemes and represented using different modeling constructs. In our scenario, for example, data source A contains a field taxes which is not present in data source B.
- Data conflicts can be the result of wrong data. These include specifically incorrect inputs or outdated data. Another source of such conflicts is different forms of representation on the data level. These include conflicts caused by different expressions, different measures, or different accuracy.

To a certain degree, data mining methods can be used to identify and resolve these conflicts. To illustrate this, we have to extend our example. Let us assume that the same vehicle repair cases are stored in two different databases. For example, one database contains the mileage in kilometers, the other in miles (as shown in Figures 2 and 3). A further example is that one price is stored with taxes and the other without taxes, or that one price is stored in German marks and the other in pounds.
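For the mileage conflict just described, a least-squares line fitted to the paired values of the same repair cases already exposes the conversion factor as its slope. The following Python sketch assumes that the matching cases in both sources have been paired beforehand; the function `fit_line` and the sample mileages are hypothetical illustrations, not data from the paper.

```python
from typing import Sequence, Tuple


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equally long sequences with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical mileages of the same repair cases: source B records miles,
# source A records kilometers (1 mile = 1.609344 km).
miles_b = [1441.0, 2319.0, 15500.0, 48210.0, 60875.0]
km_a = [m * 1.609344 for m in miles_b]

slope, intercept = fit_line(miles_b, km_a)
print(f"km = {slope:.4f} * miles + {intercept:.4f}")
```

A slope close to a known unit or currency factor together with an intercept near zero supports the hypothesis of a pure scaling conflict; the same fit applied to prices with and without taxes would yield the tax factor instead.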
In both cases, data mining methods can discover those functional relationships when they are not too complex. A linear regression method would discover the corresponding conversion factors. If the type of functional dependency (linear, quadratic, exponential etc.) is not known a priori, model search instead of parameter search has to be applied. Furthermore, attributes can contain the exact value (e.g., weight) or an abstraction of the value (e.g., a code for a range of weights). Data mining methods can discover that a group is always associated with a homogeneous interval of values. Here, association rule methods can be deployed if there are not too many single values, e.g., (WGT = 800) ⇒ (WGT_GROUP = light): 100% of repaired cars with weight 800 also have weight group light. Grouping those rules by their right hand sides, the attribute value groups can be inferred.

6 Data Cleansing

Data cleansing is a non-trivial task in data warehouse environments. It is crucial for the data quality, which is in turn a key factor for the success of a warehouse. The main focus is the identification of missing or incorrect data (noise) and conflicts between data of different sources, and the correction of these problems. The employed algorithms are very problem type and data source specific.

Completion of Missing Values. Typically, missing values are indicated by either blank fields (" ") or special attribute values (e.g., "?", "unknown"). A way to handle these records is to replace the missing values by the mean or most frequent value, or by the value which is most common among similar objects. Advanced data mining methods for completion could be similarity based methods or methods for dependency modeling to get hypotheses for missing values.

Correction of Noise and Incorrect Data. On the other hand, data entries might be wrong or noisy. Consider a 10-year-old car with a mileage of 123 miles, which is not very likely to be realistic. Those records could be identified by tree or rule induction methods.
For example, a 99% association rule gives hints to data errors, e.g., (AGE > 10) ⇒ (MIL > ): 99% of repaired cars older than 10 years also have a mileage over kilometers.

Resolving Conflicts between Data Sources. Different sources can contain ambiguous or inconsistent data, e.g., different values for the address of a customer or the price of the same article. With similarity based methods you might find records which describe the same real-world entity but differ in some attributes which have to be cleaned.

7 Multidimensional Data Modeling

Using the multidimensional paradigm (e.g., [SBH99]) to model views on warehouse data is a prerequisite for using OLAP tools for data analysis. Data mining can support the following aspects of this design process:

Identification of Orthogonal Dimensions. In our scenario, it is certainly not sensible to model all the fields as dimensions of the cube as some fields are functionally dependent (e.g., the customer birthday and age) and other fields do not strongly influence the measures (e.g., the steering wheel type might not show a strong influence on the number of repairs). Data mining methods can help to rank the variables according to their importance in the domain. Of course, this ranking has to be matched with the user analysis requirements, but a mining algorithm can be used to drive or validate this ranking process. Non-correlated sets of attributes (e.g., vehicle type and mileage) can be found with correlation analysis. Furthermore, using data mining methods to drive this cube design seems promising as they can help to identify (possibly weak) functional dependencies which indicate non-orthogonal dimension attributes.

Identification of Sparse and Dense Areas of the Resulting Cube. To find sparse regions in a data cube, the method proposed in [LHM+97] can be deployed, e.g., the combination of vehicle type E230, date values during the year 90, and country GB may not contain values because this vehicle type was not yet sold in this country during this time. Sparse regions should be avoided during modeling. Where this is not possible, knowledge about the sparse regions can be used to fine tune the data management of the OLAP tool. Using special cluster methods, typical (or representative) data points can be identified as the centers of dense regions. This information is useful for modeling but also for physical design decisions.

Handling of Continuous Attributes. The multidimensional paradigm demands that dimensions are of discrete data type. Therefore, attributes with a continuous domain have to be mapped to discrete values if such an attribute is to be modeled as a dimension. Algorithms which find meaningful intervals in numeric attributes help to get discrete values, e.g., building weight classes for the vehicles of our example scenario.

8 Physical DW Design

Tuning the physical scheme of a data warehouse database is a non-trivial task. To find the optimal physical structure including clustering schemes, indexing schemes, or pre-aggregation, it is indispensable to possess detailed knowledge about the anticipated query behavior of the user. OLAP tools using a multidimensional query formalism are the predominant front end tools for knowledge workers to access the data warehouse. Using these tools, the business user can interactively formulate queries. This interaction possesses several characteristics that make it feasible to look for navigational patterns [Sap99]:

- session oriented: A warehouse user typically works on a task. Normally, the task is to find an answer to a business question (e.g., "How can we make our warranty policies more efficient?"). In order to do so, he executes several multidimensional queries against the system. We call the sequence of queries that are executed to answer a business query a session.
- explorative, interactive, navigational: The user starts with a query (which often corresponds to a predefined business report) and then successively applies multidimensional operations to the query results. That way, he incrementally explores the multidimensional data space.
- task and user specific: Each individual session executed by a certain user has a different individual sequence of queries even if the users are working on the same business question or task (e.g., warranty policy analysis). Nevertheless, all sessions corresponding to the same task or all sessions executed by the same user contain specific patterns. For a warranty analysis, e.g., it might be common to start with a report listing the number of warranty cases per vehicle type and country. After identifying a country to look at (which may involve different queries), the user typically changes his view, analyzing the price of warranty cases according to the mileage.

In this context, data mining algorithms can be used to identify tasks and to find query and interaction patterns that are typical for certain tasks or users. Thus, the input to the data mining process is a set of sessions (each consisting of a sequence of multidimensional queries) that can be extracted from a log file of the data warehouse system.

Figure 4: Prototypes subsume conceptually equal queries (labels: query space with individual queries, e.g., number of repairs for all garages in Germany for year 1998; query prototype space with structural query prototypes, e.g., number of repairs, garages per country, year; a distance metric on each space, e.g., number of navigation steps between prototypes)
The first step during the data preparation phase of the knowledge discovery process is to find a formal representation of an individual query that contains the interesting properties of this query. This involves mapping different representations of conceptually equal queries to the same query prototype. Figure 4 shows a graphical visualization of this process.
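A minimal sketch of such a mapping is given below. It assumes one particular abstraction, namely that a prototype keeps the measure, the grouping dimensions, and the names of the restricted dimensions while dropping the restriction values. The `Query` and `Prototype` types and all field names are hypothetical and only serve the illustration.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, List, NamedTuple, Tuple


class Query(NamedTuple):
    """One logged multidimensional query (hypothetical representation)."""
    measure: str
    group_by: Tuple[str, ...]
    filters: Dict[str, object]  # restricted dimension -> selected value


class Prototype(NamedTuple):
    """Structural abstraction of a query: the filter values are omitted."""
    measure: str
    group_by: FrozenSet[str]
    filtered_dims: FrozenSet[str]


def to_prototype(query: Query) -> Prototype:
    """Map a query to the representative of its equivalence class."""
    return Prototype(
        query.measure, frozenset(query.group_by), frozenset(query.filters)
    )


def group_by_prototype(log: List[Query]) -> Dict[Prototype, List[Query]]:
    """Partition a query log into classes of structurally equal queries."""
    classes: Dict[Prototype, List[Query]] = defaultdict(list)
    for query in log:
        classes[to_prototype(query)].append(query)
    return dict(classes)


log = [
    Query("number_of_repairs", ("garage",), {"country": "Germany", "year": 1998}),
    Query("number_of_repairs", ("garage",), {"country": "GB", "year": 1997}),
    Query("repair_price", ("mileage",), {"country": "Germany"}),
]

for prototype, queries in group_by_prototype(log).items():
    print(prototype.measure, sorted(prototype.group_by), len(queries))
```

Other abstractions (for instance keeping the values but dropping the user) would be defined the same way; which properties the prototype retains depends on the patterns one wants to discover.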

A database administrator might be interested in patterns that concern the structural composition of queries (e.g., that certain dimensions are often queried together in certain phases of the analysis process). Among others, this information can be used to plan the indexing structure. In this case, a query prototype would contain information about the structure of the query, omitting the values used (e.g., year = 1997). That means that all queries possessing the same structure are represented by the same prototype regardless of the actual values. Mathematically, this first step corresponds to the definition of equivalence classes on the space of all queries. The actual design of the representation and abstraction function is mainly influenced by the type of patterns that should be discovered.

The second step is to represent the transformed log file information as a structure that can serve as an input for an existing data mining method. The type of this algorithmic mapping can be used to classify the approaches.

Metric Space Approach. A first approach is to map query prototypes to an abstract metric space by defining a distance function for queries. For example, [Sap99] proposes the following distance metric: the number of user interactions necessary to navigate from query 1 to query 2 defines the distance between the two queries (see Figure 4 for a graphical visualization). Following this approach, it is possible to deploy spatial data mining algorithms that work on abstract metric spaces. For example, [EKS+96] describes an algorithm to find clusters in metric spaces. This approach can only discover knowledge about the frequency and distribution of individual queries (not taking the navigational characteristic of OLAP sessions into account). Nevertheless, this knowledge is highly valuable for the physical data warehouse designer. Let us assume that a relational database system is used for storage.
For example, if dimensions (modeled as attributes in the relational representation) are often queried together, this is an indication to cluster the values together on the secondary storage. Furthermore, the metric space approach can be used to partition the users into user groups by means of the queries they execute. This can be done by building a classification of users such that the queries issued by the members of a group form a cluster in the metric space. This information can be used to partition the database into different data marts. Each mart only contains the information mainly relevant for the user group accessing the data mart. It is possible to extend this approach from single queries to sessions by defining meaningful distance metrics for different sessions.

Hypertext Approach. Discovering typical interaction patterns for World Wide Web environments is currently an active topic of research (e.g., [SF99], [MJH+96]). OLAP systems and WWW applications share the common characteristic of navigational and explorative data access. Therefore, techniques from this area can be used for the analysis of navigational patterns in OLAP systems. [Sap99] shows a possible transformation such that all possible query prototypes for a given multidimensional scheme form a graph structure containing the query prototypes as nodes and possible user interactions as edges. Thus, an OLAP session corresponds to a path inside this graph. Using this transformation, it is possible to deploy the methods proposed for WWW mining (e.g., WUM: Web Utilization Miner [SF99]) for the analysis of the log data. The results of this approach are patterns concerning the navigational behavior of the user, i.e., knowledge about the correlation of successive queries. This information can be used for prediction purposes.
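To make the idea of sessions as paths in a prototype graph concrete, the following Python sketch counts how often one prototype directly follows another in logged sessions and uses these counts to rank likely successors. It is a deliberately simple first-order counting model chosen for illustration, not the WUM method of [SF99]; the prototype labels and the functions `count_transitions` and `predict_next` are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Sequence


def count_transitions(sessions: Sequence[Sequence[str]]) -> Dict[str, Counter]:
    """Count, for every prototype, which prototypes directly follow it."""
    transitions: Dict[str, Counter] = defaultdict(Counter)
    for session in sessions:
        for current, following in zip(session, session[1:]):
            transitions[current][following] += 1
    return dict(transitions)


def predict_next(transitions: Dict[str, Counter], current: str, k: int = 1) -> List[str]:
    """The k most frequently observed successors of the current prototype."""
    successors = transitions.get(current)
    if successors is None:
        return []
    return [prototype for prototype, _ in successors.most_common(k)]


# Hypothetical sessions; each entry is the label of a query prototype.
sessions = [
    ["repairs_by_type_country", "repairs_by_type", "price_by_mileage"],
    ["repairs_by_type_country", "repairs_by_type", "price_by_mileage"],
    ["repairs_by_type_country", "repairs_by_garage"],
]

transitions = count_transitions(sessions)
print(predict_next(transitions, "repairs_by_type_country"))
```

A predictor of this kind only looks at the last query; taking a longer prefix of the ongoing session into account would require richer patterns than pairwise counts.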
An algorithm that predicts possible next queries of the user on the basis of the discovered knowledge and the actual prefix of an ongoing session can improve the caching and prefetching behavior of the OLAP system at runtime. Furthermore, this profile information can in turn be used to drive the user interface design (e.g., supporting or automating frequent interaction sequences) and the design of the conceptual scheme.

Sequence of Events Approach. A query can also be seen as an event in time and a session as a sequence of queries. This time sequence can be task or user specific (if only the queries and sessions connected with certain tasks are included in the event sequence). Using this interpretation, it is possible to apply algorithms that were designed for mining time series data. For example, association rule algorithms on sequence data (e.g., [MT95]) can be used to identify typical sequences of queries. [RS99] contains an up-to-date bibliography of temporal data mining approaches. The result of this approach is knowledge about the correlation in time (intra-session patterns), i.e., general patterns inherent in all sessions (e.g., after more than three drill-down operations in a certain dimension, it is likely that a roll-up operation occurs).

Summing up, an adequate representation enables the use of temporal and spatial data mining methods in order to find groups of users that are similar concerning the time of their queries (e.g., between 8am and 10am), the contents of their queries, their navigational behavior, the granularity of their analysis, or the complexity of their analysis.

9 Conclusions and Future Work

In the previous sections we showed that data mining methods can support the most important and costly tasks of data warehouse design. Thus, data mining methods incorporated into a warehouse designer workbench tool could make the design process more efficient. For those areas where current algorithms do not meet the requirements, research work can be devoted to the extension of existing methods.

References

[AAR96] Arning, A.; Agrawal, R.; Raghavan, P.: A Linear Method for Deviation Detection in Large Databases, in Proc. of the 2nd International Conference on Knowledge Discovery & Data Mining, AAAI Press, August.
[AIS93] Agrawal, R.; Imielinski, T.; Swami, A.: Mining Association Rules between Sets of Items in Large Databases, in Proc. of the ACM SIGMOD Conference on Management of Data.
[AR99] Abraham, T.; Roddick, J. F.: Incremental Meta-mining from Large Temporal Data Sets, in Kambayashi, Y.; Lee, D. K.; Lim, E.-P. et al. (eds.): Advances in Database Technologies, Proc. of the 1st International Workshop on Data Warehousing and Data Mining, DWDM 98, Lecture Notes in Computer Science Vol. 1552, Springer Verlag, Berlin.
[CC98] Chapman, P.; Clinton, J.: The Current CRISP-DM Process Model for Data Mining, Discussion Paper.
[CHS+97] Conrad, S.; Höding, M.; Saake, G.; Schmitt, I.; Türker, C.: Scheme Integration with Integrity Constraints, BNCOD.
[CS97] Choenni, S.; Siebes, A.: Query Optimization to Support Data Mining, in Proc. of the 8th International Conference and Workshops on Database and Expert Systems Applications, IEEE Computer Society Press.
[EKS+96] Ester, M.; Kriegel, H.-P.; Sander, J.; Xu, X.: A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise, in Proc. of the 2nd International Conference on Knowledge Discovery and Data Mining (KDD 96), Portland, Oregon, AAAI Press.
[GMV96] Guyon, I.; Matic, N.; Vapnik, V.: Discovering Informative Patterns and Data Cleaning, in Fayyad, U.; Piatetsky-Shapiro, G.; Smyth, P.; Uthurusamy, R.
(eds.): Advances in Knowledge Discovery and Data Mining, AAAI Press.
[HBD+97] Höfling, G.; Blaschka, M.; Dinter, B.; Spiegel, P.; Ringel, T.: Data Warehouse Technology for the Management of Diagnosis Data (in German), in Dittrich, Geppert (eds.): Datenbanksysteme in Büro, Technik und Wissenschaft (BTW), Springer Verlag.
[HCC98] Han, J.; Chee, S.; Chiang, J. Y.: Issues for On-line Analytical Mining of Data Warehouses, in Proc. of the 1998 Workshop on Research Issues on Data Mining and Knowledge Discovery (DMKD 98), Washington, June.
[JQJ98] Jeusfeld, M. A.; Quix, C.; Jarke, M.: Design and Analysis of Quality Information for Data Warehouses, in Proc. of the 17th International Conference on Conceptual Modeling (ER 98), Singapore.
[LHM+97] Liu, B.; Hsu, W.; Mun, L.-F.; Lee, H.-Y.: Identifying Interesting Missing Patterns, in Lu, H.; Motoda, H.; Liu, H. (eds.): Proc. of the 1st Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD 97), World Scientific, Singapore.
[MT95] Mannila, H.; Toivonen, H.: Discovering Generalized Episodes in Sequences, in Proc. of the 1st International Conference on Knowledge Discovery and Data Mining (KDD 95), Canada.
[MJH+96] Mobasher, B.; Jain, N.; Han, E.-H.; Srivastava, J.: Web Mining: Pattern Discovery from World Wide Web Transactions, Technical Report TR96-050, Department of Computer Science, University of Minnesota.
[RS99] Roddick, J. F.; Spiliopoulou, M.: A Bibliography of Temporal, Spatial, and Spatio-temporal Data Mining Research, SIGKDD Vol. 1(1).
[Sap99] Sapia, C.: On Modeling and Prediction Query Behavior for OLAP Systems, in Proc. of the CaiSE 99 Workshop on Design and Management of Data Warehouses (DMDW 99), Heidelberg.
[SBH99] Sapia, C.; Blaschka, M.; Höfling, G.: An Overview of Multidimensional Data Models for OLAP, FORWISS Technical Report, February.
[SF99] Spiliopoulou, M.; Faulstich, L. C.: WUM: A Tool for Web Utilization Analysis, in extended version of Proc.
EDBT Workshop WebDB 98, LNCS 1590, Springer Verlag.
[SFW99] Spiliopoulou, M.; Faulstich, L. C.; Winkler, K.: A Data Miner Analyzing the Navigational Behavior of Web Users, in Proc. of the Workshop on Machine Learning in User Modeling of the ACAI 99 International Conference, Crete, Greece, July.
