On Supporting the Data Warehouse Design by Data Mining Techniques 1


C. Sapia, G. Höfling
FORWISS Knowledge Bases Group, Munich, Germany

M. Müller, C. Hausdorf, H. Stoyan
FORWISS Knowledge Acquisition Group, Erlangen, Germany

U. Grimmer
DaimlerChrysler AG, Research & Technology, FT3/KL, Ulm, Germany

Abstract

Integration of data warehousing and data mining by applying the latter as a front end technology to the former is a known approach. Nevertheless, the data warehouse design process can also be seen as an area of application for data mining techniques. In this paper, we compile the requirements of the data warehouse design process concerning data source analysis, structural integration of data sources, data cleansing, data modeling, and physical data warehouse design. Based on this, we demonstrate how currently existing data mining techniques can support these phases and identify directions for future research. The paper contains a collection of ideas that can serve as a basis for systematic practical and research work in this promising field.

1 Introduction

Today, large IT projects focusing on the integration of distributed, heterogeneous databases are initiated in nearly every big company. This paper is inspired by a scenario at DaimlerChrysler AG (DC), which was formed by the merger of the former companies Chrysler Corporation and Daimler-Benz AG on November 17th. If not marked otherwise, references will mainly focus on application areas from the former Daimler-Benz AG. Being one of the largest global providers of automotive and transportation products and services implies having closely linked units such as production facilities, administration, R&D centers, and representations at various locations. Within each of those units, data is generated, transferred, stored, and analyzed. Typical categories of data are technical/product, financial, and organizational data.
Traditionally, operational database systems running on mainframe computers have been used for managing these data at DC. Although supported by Management Information Systems, analyses still required a lot of manual work regarding data access and transformation tasks. With the ever increasing amount of electronically accessible data over the last few years, there has been a strong demand for more efficient data management applications, both in terms of improved application support and technological advances. The data warehousing (DW) methodology seems to provide the appropriate means for such advanced data management. In contrast to the classical data warehouse approach, where one central warehouse contains all corporate data, DC is following the approach of building so-called data marts, which are application-specific database systems for homogeneous user groups. Examples of data marts can be found at DC in areas like car quality improvement or marketing applications where customer and product data are linked. Quality and availability of such data marts will be one of the main differentiating factors for DC's future business success.

Given the large number and different types of existing operational databases at DC, the design and implementation of any data mart is a challenging and time-consuming process. DC sees potential for applying data mining (DM) methods as shown in this article to increase both implementation speed and quality of future data marts and warehouse solutions.

Basically, there are three ways to combine data warehouse and data mining technologies:

1. Integration on the front end level combining OLAP and data mining tools into a homogeneous GUI (e.g., [HCC98]).
2. Data warehouse technology supporting the data mining process by providing efficient database technology (e.g., [CS97]) and a documented high-quality data source (e.g., [JQJ98]).

Most of the pragmatic and scientific efforts are currently invested into topics 1 and 2.
In this paper, we show that a third way of integration seems beneficial:

3. Data mining techniques supporting the data warehouse design process.

We investigate the data warehouse design process as an area of application for data mining and statistical methods. To provide a solid foundation for practical and research work in this area, we systematically compile the requirements of the DW process. For each step of the process, we investigate if and which current data mining techniques can support the work of a DW designer. If some requirements cannot be met by current methods, we derive directions for future research.

The design phase is the most costly and critical part of a data warehouse implementation (roughly 80% of the effort). It currently involves a lot of manual work because of the lack of adequate tools to support the designer's work. To understand where data mining can support this phase, we systematically cover the different phases of this process:

- Data Source Analysis (section 4)
- Structural Integration of Data Sources (section 5)
- Data Cleansing (section 6)
- Multidimensional Data Modeling (section 7)
- Physical DW Design (section 8)

Footnote 1: Part of this work was supported by DaimlerChrysler AG, Research & Technology.

FIELD  Description               Type     Unit      Example
VID    ID number of the vehicle  integer
VTP    vehicle type              string   -         C240
ET     engine                    integer  -         3
ST     steering wheel type       integer  -         1
CTY    country                   string   -         D
POW    power                     integer  h.p.      170
WGT    weight                    integer  kg        1345
PD     production date           date     -         87/02
MIL    mileage                   integer  km/miles  2319
DAT    date of the repair        date     -         99/02
PRC    repair price              integer  DM/       1235
TX     taxes                     float
CN     customer name             string   -         Müller
CA     customer age              integer  years     45
CB     customer birthday         date

Figure 1: The scheme of the data source table VehicleRepair used in our example

2 Terminology and Scope

Being active areas of research, both the data mining and the data warehouse community developed their own terminology. In this section, we present our terminology for this paper to avoid inconsistencies. The goal of a data warehouse system is to provide the analyst with an integrated and consistent view on all enterprise data which are relevant for the analysis tasks. A data warehouse constitutes the single point of truth in an enterprise. Therefore, a high standard of data quality has to be assured.
The warehouse system includes components that extract data from the sources, transform and clean data, load them into the warehouse database, and allow users to query data according to their needs. Our focus in this paper is the design process of this system. This process is a special case of the general software engineering process.

We treat the terms knowledge discovery in databases (KDD) and data mining (DM) synonymously. Data mining can be looked at from a process oriented perspective (e.g., CRISP-DM process model [CC98]) and from a method oriented perspective. In this paper, we take the method oriented point of view as we focus on which methods and algorithms can be beneficial during data warehouse design. There is no common definition of when a data analysis algorithm is to be called a data mining algorithm. This stems partly from the fact that data mining is related to a lot of research areas including machine learning, statistics, artificial intelligence, knowledge management, and database technology. In this paper, we define data mining methods in a wider sense, i.e., data mining methods in a narrower sense like association rule methods, machine learning methods like decision tree induction, statistical methods like cluster analysis, and soft computing methods like neural nets are subsumed under this term.

3 The Vehicle Repair Scenario

Throughout the rest of this paper, we use the following example to demonstrate the usefulness and limitations of data mining methods during the data warehouse design process. The design of this fictional example is based on experiences from several real world projects in the automotive application area (e.g., [HBD+97]). Nevertheless, it is a simplification in order to be able to demonstrate the basic concepts without assuming deeper knowledge of the automotive domain. Let us assume that a car manufacturer like DaimlerChrysler wants to build a data warehouse which contains information about vehicle repairs.
Typical analysis tasks in this context are the analysis of the quality of products (cars), processes (handling of warranty claims), and services (e.g., garage repair details), the evaluation and redefinition of warranty policies, the prediction of warranty costs, and the collection and analysis of data on wear and damages of cars. The information about the individual repairs is gathered by the individual garages using a repair accounting system. Due to the distributed nature of the large organization, different repair accounting systems are used in different countries or organizational entities (e.g., as a consequence of the merger). Figure 1 shows an excerpt of the data source systems (for reasons of simplicity, we assume that the database has already been transformed into a relational form). The database contains information about cars, customers, and car repairs done in the garage. In Figure 1, we added a description of the database fields that explains the semantics. This information is not necessarily available to the warehouse designer. Figure 2 shows example vehicle repair cases from data source A (Germany), Figure 3 those from data source B (Great Britain).

Figure 2: Extract of the example table for data source A (columns: VID, VTP, ET, ST, CTY, POW, WGT, PD, MIL, DAT, PRC, TX, CN, CA, CB)

Figure 3: Extract of the example table for data source B (columns as in Figure 2, without TX)

4 Data Source Analysis

After having identified the information needs and the corresponding relevant sources of data covering these needs in an enterprise (e.g., in our scenario the different underlying databases of the country specific repair accounting systems), the first step of the warehouse design cycle is the analysis of these sources. During this step, data sources have to be analyzed regarding structure, semantics, and data quality (correctness, consistency etc.). Data sources that are relevant for data warehousing (mostly legacy systems) are often not adequately documented. That makes a reverse engineering approach necessary. To this end, data mining algorithms can be used to discover the implicit information about the semantics of the data structures provided by the data stored in the system. In the following, we sketch the main issues where data mining algorithms can support the preparation of a data source selected for building up a data warehouse.
Discovering the Meaning of Attributes. Often, the exact meaning of an attribute cannot be deduced from its name and data type. In this case, the existing data can help to discover functional dependencies which give more information about the semantics. For example, association rules as introduced in [AIS93] are suited for this purpose. Given a set of transactions, where each transaction is a set of items (e.g., items of country like D or GB or items of steering wheel type like left or right), an association rule is an expression of the form X ⇒ Y, where X and Y are sets of items. The intuitive meaning of such a rule is that transactions of the database which contain X tend to contain Y. An example of such a rule is: 90% of repaired cars from GB also have steering wheel type 2 (right); 20% of all repaired cars have both properties. In this example, 90% is called the confidence and 20% the support of the rule. Both sides X and Y of such a rule can contain sets of properties. Normally, the result of data mining algorithms for finding association rules consists of all rules with a support greater than a given minimum support and a confidence greater than a given minimum confidence.

A generalization of this algorithm for mining association rules is to apply it to data in relational form instead of data consisting of a set of transactions. Such rules have the form (A_i1 = a_i1) ∧ (A_i2 = a_i2) ∧ ... ⇒ (A_j1 = a_j1) ∧ (A_j2 = a_j2) ∧ ..., where the A_xy are attributes (column names) and the a_xy are attribute values (cell contents). Applied to the example above and the table in Figure 3, such a rule could be: (CTY = GB) ⇒ (ST = 2): 90% of repaired cars from GB also have steering wheel type 2 (right).

Let us consider a relation describing component or equipment characteristics of automobiles like engine and steering wheel type. From strong and weak association rules discovered from this kind of data, you can conclude in which attribute which type of component or equipment is modeled.
However, this also requires considering background knowledge about allowed and probable combinations of equipment. In our example, you can conclude with the appropriate background knowledge that the rather unintelligible attribute name ST models the different steering wheel types.
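The support and confidence of a relational rule such as "country GB implies steering wheel type 2" can be computed directly from the rows. The following is a minimal sketch, not a full association rule miner: it evaluates a single given rule, and the function name `rule_stats` as well as the five toy rows are assumptions made for illustration (they are not taken from Figures 2 or 3).

```python
from typing import Dict, List, Tuple

Row = Dict[str, object]


def rule_stats(rows: List[Row], antecedent: Row, consequent: Row) -> Tuple[float, float]:
    """Return (support, confidence) of the rule antecedent => consequent.

    support    = share of all rows matching both sides
    confidence = share of antecedent-matching rows that also match the consequent
    """
    def matches(row: Row, condition: Row) -> bool:
        return all(row.get(attr) == value for attr, value in condition.items())

    n_total = len(rows)
    n_antecedent = sum(1 for r in rows if matches(r, antecedent))
    n_both = sum(1 for r in rows if matches(r, antecedent) and matches(r, consequent))
    support = n_both / n_total if n_total else 0.0
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return support, confidence


# Hypothetical repair cases, reduced to the two attributes of interest.
rows = [
    {"CTY": "GB", "ST": 2},
    {"CTY": "GB", "ST": 2},
    {"CTY": "GB", "ST": 1},
    {"CTY": "D", "ST": 1},
    {"CTY": "D", "ST": 1},
]
support, confidence = rule_stats(rows, {"CTY": "GB"}, {"ST": 2})
print(f"support={support:.2f}, confidence={confidence:.2f}")
```

A rule miner in the sense of [AIS93] would enumerate candidate rules and keep those above a minimum support and confidence; the sketch only shows how the two measures are defined for one candidate.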

Other data mining techniques, e.g., decision tree and rule induction, and statistical methods, e.g., multivariate regression or Bayesian networks, can also produce useful hypotheses in this context. In summary, the task of reconstructing the meaning of attributes would be optimally supported by dependency modeling using data mining techniques and mapping this model against expert knowledge, e.g., business models.

Discovering Encoding Schemes. Many attribute values are (numerically) encoded. Different vehicle properties (e.g., the installed engine type) might be encoded into the vehicle ID number (e.g., the 4th and 5th digits are 03 for a specific fuel injection engine). Discovering the encoding scheme is an important step in understanding the semantics of this field. Identifying inter-field dependencies helps to build hypotheses about encoding schemes when the semantics of some fields are known (e.g., if the field steering wheel type of an automobile record nearly always contains 2 if the country is GB, this value may encode right steering wheel). Analogously to the previous subsection, domain knowledge about the attribute values has to be considered.

To make it more complicated, encoding schemes change over time (e.g., some former car components encoded in the VID are no longer part of modern cars whereas some new components have to be encoded; in this case, the digits of the components that no longer exist are possibly used to encode the new components). Data mining algorithms are useful to identify changes in encoding schemes, the time when they took place, and the part of the code that is affected. As described below, new data mining methods to discover those patterns have to be developed and current methods have to be adapted and extended.

Let us assume that the segmentation of the VID code is known and codes for old components are not used for encoding new ones. The 4th and 5th digits of the vehicle ID number encode the installed engine type.
In older vehicle cases, the engine type with the code 03 occurs. From a certain time on, this engine type was no longer installed and, thus, 03 no longer occurs in current cases. A possible sequence of codes for vehicle cases is (03, 02, 03, 03, 01, 02, 03, 02, 02, 01), associated with the following sequence of relative frequencies of the code 03: (1.00, 0.50, 0.67, 0.75, 0.60, 0.50, 0.57, 0.50, 0.44, 0.40). A data mining method can search for the time from which on the relative frequency of a code, e.g., 03, is monotonically decreasing. In the example above, we see that this time corresponds to the seventh case. The sequence of relative frequencies above can be seen as a time series. With advanced data mining techniques for time series ([AR99]), more complex types of patterns can be discovered.

When codes for old components like 03 can be used to encode new components, the approach above is not applicable. You have to search in the space of partitions of the cases (e.g., partition 1: case 1 to case i, partition 2: case i+1 to case n) so that the following rules can be found: partition 1: (code = 03) ⇒ (ET = old engine type) with 100%, partition 2: (code = 03) ⇒ (ET = new engine type) with 100%. When even the code segmentation is not known, the search has to be expanded to possible types of code segmentation.

Furthermore, methods which use data sets to learn a model of normal behavior can be adapted to the task. The model learned can be used to evaluate significant changes. Neural nets, for example, are used to discover credit card fraud. [AAR96] proposes an algorithm that identifies elements of a list (e.g., numbers or texts) that do not fit into the context of the list. [GMV96] describes an incremental learning algorithm that identifies patterns which are not explicable with the previously learned model. To identify the record number where the encoding scheme changes, you have to look where the most informative patterns occur.
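The search for the case from which the relative frequency of a code only decreases can be sketched in a few lines. This is an illustrative reading of the worked example, not a published algorithm: it computes the running relative frequency of one code and then walks backwards from the last case as long as each step is a strict decrease. The function names are hypothetical.

```python
from typing import List, Optional, Sequence


def running_frequency(codes: Sequence[str], target: str) -> List[float]:
    """Relative frequency of `target` among the first i cases, for i = 1..n."""
    freqs: List[float] = []
    hits = 0
    for i, code in enumerate(codes, start=1):
        if code == target:
            hits += 1
        freqs.append(hits / i)
    return freqs


def decline_start(freqs: Sequence[float]) -> Optional[int]:
    """1-based index of the earliest case from which the frequencies
    strictly decrease until the end; None if the final step is not a decrease."""
    i = len(freqs) - 1
    while i > 0 and freqs[i] < freqs[i - 1]:
        i -= 1
    if i == len(freqs) - 1:
        return None
    return i + 1


# Code sequence from the example in the text.
codes = ["03", "02", "03", "03", "01", "02", "03", "02", "02", "01"]
freqs = running_frequency(codes, "03")
start = decline_start(freqs)
print([round(f, 2) for f in freqs], start)
```

On real data a strict monotonicity test would be brittle, since a single late occurrence of the old code resets the result; a tolerance or a proper change point test would likely be needed.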
A further approach would be to partition the data set, to build models on these partitions applying the same data mining algorithms, and to compare the differences between these models.

Discovery of Integrity Constraints. Integrity constraints are useful for understanding a data source, but can also be applied to ensure data quality because detailed integrity constraints help to identify incorrect records. Integrity constraints can be domains of attributes (e.g., vehicle type ∈ {C180, C220, C240, C250, E290, E430, ...}, power ∈ {122, 129, 170, 229, ...}, or weight ∈ [1000, 2000]) or constraints on different attributes (e.g., the production date of a car precedes dates of repair). Furthermore, inter-relation integrity constraints carry semantics useful for reverse engineering (e.g., a foreign key relationship in a normalized database, like from the vehicle repair relation of our example to an engine relation modeling all characteristics of existing engine types).

Data mining and statistical methods can be used to induce integrity constraint candidates from the data. These include, for example, visualization methods to identify distributions for finding domains of attributes or methods for dependency modeling. In our example, a possible result of a data mining method would be the finding that vehicle type, power, and weight correlate. Other data mining methods can find intervals of attribute values which are rather compact and cover a high percentage of the existing values, e.g., in 99% of the cases: WGT ∈ [1000, 2000]. The trade-off between the compactness and the coverage can be expressed in a specific interestingness measure for these types of findings, e.g., the finding above would be more interesting than the following finding: in 100% of the cases: WGT ∈ [1000, 20000]. These results can be seen as hypotheses for attribute domains. Besides, they give hints for noise in the data. In our example, the 1% of the cases with WGT ∉ [1000, 2000] could have incorrect weight values.
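One simple way to trade compactness against coverage for a numeric attribute such as the weight is to look for the narrowest interval that still contains a required share of the values. The sketch below does this with a sliding window over the sorted values. It is an assumption about how such an interval search could look, not the specific method the text has in mind, and the weights are invented for illustration.

```python
import math
from typing import List, Tuple


def narrowest_interval(values: List[float], coverage: float) -> Tuple[float, float]:
    """Narrowest closed interval [lo, hi] containing at least
    `coverage` (a fraction in (0, 1]) of the values."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    n = len(ordered)
    # Number of values the interval has to contain (guard against float noise).
    k = max(1, math.ceil(coverage * n - 1e-9))
    best = (ordered[0], ordered[k - 1])
    for i in range(1, n - k + 1):
        lo, hi = ordered[i], ordered[i + k - 1]
        if hi - lo < best[1] - best[0]:
            best = (lo, hi)
    return best


# Hypothetical weights in kg; the last value looks like a data entry error.
weights = [1100, 1200, 1250, 1300, 1345, 1400, 1500, 1600, 1800, 12000]
lo, hi = narrowest_interval(weights, coverage=0.9)
suspicious = [w for w in weights if not lo <= w <= hi]
print((lo, hi), suspicious)
```

The values left outside the interval are only candidates for incorrect entries; whether 12000 is an error or a legitimate heavy vehicle still has to be decided with domain knowledge.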

5 Structural Integration of Data Sources

Once each single data source is understood, content and structural integration follows. This step involves resolving different kinds of structural and semantic conflicts. This problem has been researched in the context of scheme integration for federated database systems (e.g., [CHS+97]). The task is to integrate two given schemes into a global one. In a data warehouse environment, this situation occurs, e.g., if two data sources contain data about the component structure of a car or if data sources containing information about vehicle repairs from different countries have to be consolidated, as in our example. Among others, the following types of conflicts can be distinguished:

- Description conflicts (e.g., naming conflicts, domain conflicts, scaling conflicts) occur if the same real world objects are modeled in different schemes and are described differently.
- Structural conflicts occur if the same real world entity is contained in different schemes and represented using different modeling constructs. In our scenario, for example, data source A contains a field taxes which is not present in data source B.
- Data conflicts can be the result of wrong data. These include specifically incorrect inputs or outdated data. Another source for such conflicts are different forms of representation on data level. These include conflicts caused by different expressions, different measures, or different accuracy.

To a certain degree, data mining methods can be used to identify and resolve these conflicts. To illustrate this, we have to extend our example. Let us assume that the same vehicle repair cases are stored in two different databases. Take the example that one database contains the mileage in kilometers, another in miles (as shown in Figures 2 and 3). A further example is that one price is stored with taxes and the other without taxes, or that one price is stored in German marks and the other in pounds.
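For the kilometers versus miles conflict, a conversion factor can be estimated from paired values of the same repair cases with an ordinary least-squares line. The sketch below assumes that matching records have already been paired up (for example via the vehicle ID), which is itself a non-trivial step. The function name `fit_line` and the sample mileages are hypothetical.

```python
from typing import Sequence, Tuple


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit of y = slope * x + intercept."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two paired observations")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be equal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical mileages of the same cars: source B in miles, source A in km.
miles = [1000.0, 25000.0, 60000.0, 120000.0]
kilometers = [m * 1.609344 for m in miles]
slope, intercept = fit_line(miles, kilometers)
print(f"km = {slope:.6f} * miles + {intercept:.6f}")
```

A slope close to 1.609 with an intercept near zero suggests a pure unit conversion; a clearly non-zero intercept or large residuals would hint at a different kind of relationship, which is where model search rather than parameter search comes in.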
In both cases, data mining methods can discover those functional relationships when they are not too complex. A linear regression method would discover the corresponding conversion factors. If the type of functional dependency (linear, quadratic, exponential etc.) is not known a priori, model search instead of parameter search has to be applied. Furthermore, attributes can contain the exact value (e.g., weight) or an abstraction of the value (e.g., a code for a range of weights). Data mining methods can discover that a group is always associated with a homogeneous interval of values. Here, association rule methods can be deployed if there are not too many single values, e.g., (WGT = 800) ⇒ (WGT_GROUP = light): 100% of repaired cars with weight 800 also have weight group light. Grouping those rules by their right hand sides, the attribute value groups can be inferred.

6 Data Cleansing

Data cleansing is a non-trivial task in data warehouse environments. It is crucial for the data quality, which is in turn a key factor for the success of a warehouse. The main focus is the identification of missing or incorrect data (noise) and conflicts between data of different sources, and the correction of these problems. The employed algorithms are very problem type and data source specific.

Completion of Missing Values. Typically, missing values are indicated by either blank fields or special attribute values (e.g., "?", "unknown"). A way to handle these records is to replace the missing values by the mean or most frequent value or the value which is most common to similar objects. Advanced data mining methods for completion could be similarity based methods or methods for dependency modeling to get hypotheses for missing values.

Correction of Noise and Incorrect Data. On the other hand, data entries might be wrong or noisy. Consider a 10-year-old car with a mileage of 123 miles: this is not very likely to be realistic. Those records could be identified by tree or rule induction methods.
For example, a 99% association rule gives hints to data errors, e.g., (AGE > 10) ⇒ (MIL > ...): 99% of repaired cars older than 10 years also have a mileage over ... kilometers.

Resolving Conflicts between Data Sources. Different sources can contain ambiguous or inconsistent data, e.g., different values for the address of a customer or the price of the same article. With similarity based methods you might find records which describe the same real-world entity but differ in some attributes which have to be cleaned.

7 Multidimensional Data Modeling

Using the multidimensional paradigm (e.g., [SBH99]) to model views on warehouse data is a prerequisite for using OLAP tools for data analysis. Data mining can support the following aspects of this design process:

Identification of Orthogonal Dimensions. In our scenario, it is certainly not sensible to model all the fields as dimensions of the cube as some fields are functionally dependent (e.g., the customer birthday and age) and other fields do not strongly influence the measures (e.g., the steering wheel type might not show a strong influence on the number of repairs). Data mining methods can help to rank the variables according to their importance in the domain. Of course, this ranking has to be matched with the user analysis requirements, but a mining algorithm can be used to drive or validate this ranking process. Non-correlated sets of attributes (e.g., vehicle type and mileage) can be found with correlation analysis. Furthermore, using data mining methods to drive this cube design seems promising as they can help to identify (possibly weak) functional dependencies which indicate non-orthogonal dimension attributes.

Identification of Sparse and Dense Areas of the Resulting Cube. To find sparse regions in a data cube, the method proposed in [LHM+97] can be deployed, e.g., the combination of vehicle type E230, date values during the year 90, and country GB may not contain values because this vehicle type was not yet sold in this country during this time. Sparse regions should be avoided during modeling. Where this is not possible, knowledge about the sparse regions can be used to fine tune the data management of the OLAP tool. Using special cluster methods, typical (or representative) data points can be identified as the centers of dense regions. This information is useful for modeling but also for physical design decisions.

Handling of Continuous Attributes. The multidimensional paradigm demands that dimensions are of discrete data type. Therefore, attributes with a continuous domain have to be mapped to discrete values if this attribute is to be modeled as a dimension. Algorithms which find meaningful intervals in numeric attributes help to get discrete values, e.g., building weight classes for the vehicles of our example scenario.

8 Physical DW Design

Tuning the physical scheme of a data warehouse database is a non-trivial task. To find the optimal physical structure including clustering schemes, indexing schemes, or pre-aggregation, it is indispensable to possess detailed knowledge about the anticipated query behavior of the user. OLAP tools using a multidimensional query formalism are the predominant front end tools for knowledge workers to access the data warehouse.

[Figure 4 labels: query space; query prototype space; distance metric (e.g., number of navigation steps between prototypes); structural query prototype (e.g., number of repairs, garages per country, year); individual query (e.g., number of repairs for all garages in Germany for year 1998)]
Using these tools, the business user can interactively formulate queries. This interaction possesses several characteristics that make it feasible to look for navigational patterns [Sap99]:

- session oriented: A warehouse user typically works on a task. Normally, the task is to find an answer to a business question (e.g., "How can we make our warranty policies more efficient?"). In order to do so, he executes several multidimensional queries against the system. We call the sequence of queries that are executed to answer a business question a session.
- explorative, interactive, navigational: The user starts with a query (which often corresponds to a predefined business report) and then successively applies multidimensional operations to the query results. That way, he incrementally explores the multidimensional data space.
- task and user specific: Each individual session executed by a certain user has a different individual sequence of queries even if the users are working on the same business question or task (e.g., warranty policy analysis). Nevertheless, all sessions corresponding to the same task or all sessions executed by the same user contain specific patterns. For a warranty analysis, e.g., it might be common to start with a report listing the number of warranty cases per vehicle type and country. After identifying a country to look at (which may involve different queries), the user typically changes his view, analyzing the price of warranty cases according to the mileage.

Figure 4: Prototypes subsume conceptually equal queries

In this context, data mining algorithms can be used to identify tasks and to find query and interaction patterns that are typical for certain tasks or users. Thus, the input to the data mining process is a set of sessions (each consisting of a sequence of multidimensional queries) that can be extracted from a log file of the data warehouse system.
The first step during the data preparation phase of the knowledge discovery process is to find a formal representation of an individual query that contains the interesting properties of this query. This involves mapping different representations of conceptually equal queries to the same query prototype. Figure 4 shows a graphical visualization of this process.
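The mapping from individual queries to prototypes can be illustrated with a small sketch. It assumes one particular notion of "conceptually equal", namely that two queries are equal if they use the same measure, the same grouping dimensions, and restrict the same dimensions, regardless of the constants used. The `Query` structure and its field names are hypothetical and far simpler than a real multidimensional query formalism.

```python
from collections import Counter
from dataclasses import dataclass
from typing import FrozenSet, Tuple

Prototype = Tuple[str, FrozenSet[str], FrozenSet[str]]


@dataclass(frozen=True)
class Query:
    measure: str                          # e.g. "number_of_repairs"
    group_by: Tuple[str, ...]             # dimensions in the result
    filters: Tuple[Tuple[str, str], ...]  # (dimension, constant) restrictions


def to_prototype(query: Query) -> Prototype:
    """Drop the filter constants; keep only the structure of the query."""
    restricted = frozenset(dimension for dimension, _ in query.filters)
    return (query.measure, frozenset(query.group_by), restricted)


# Hypothetical log entries: the first two differ only in the constants used.
log = [
    Query("number_of_repairs", ("garage",), (("country", "D"), ("year", "1998"))),
    Query("number_of_repairs", ("garage",), (("country", "GB"), ("year", "1997"))),
    Query("repair_price", ("mileage",), (("country", "D"),)),
]
prototype_counts = Counter(to_prototype(q) for q in log)
print(prototype_counts.most_common(1))
```

Counting prototypes instead of raw queries is what makes frequency patterns visible at all, since two sessions rarely contain literally identical queries.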

A database administrator might be interested in patterns that concern the structural composition of queries (e.g., that certain dimensions are often queried together in certain phases of the analysis process). Among others, this information can be used to plan the indexing structure. In this case, a query prototype would contain information about the structure of the query omitting the values used (e.g., year = 1997). That means that all queries possessing the same structure are represented by the same prototype regardless of the actual values. Mathematically, this first step corresponds to the definition of equivalence classes on the space of all queries. The actual design of the representation and abstraction function is mainly influenced by the type of patterns that should be discovered.

The second step is to represent the transformed log file information as a structure that can serve as an input for an existing data mining method. The type of this algorithmic mapping can be used to classify the approaches.

Metric Space Approach. A first approach is to map query prototypes to an abstract metric space by defining a distance function for queries. For example, [Sap99] proposes the following distance metric: the number of user interactions necessary to navigate from query 1 to query 2 defines the distance between the two queries (see Figure 4 for a graphical visualization). Following this approach, it is possible to deploy spatial data mining algorithms that work on abstract metric spaces. For example, [EKS+96] describes an algorithm to find clusters in metric spaces. This approach can only discover knowledge about the frequency and distribution of individual queries (not taking the navigational characteristic of OLAP sessions into account). Nevertheless, this knowledge is highly valuable for the physical data warehouse designer. Let us assume that a relational database system is used for storage.
For example, if dimensions (modeled as attributes in the relational representation) are often queried together, this is an indication to cluster the values together on the secondary storage. Furthermore, the metric space approach can be used to partition the users into user groups by means of the queries they execute. This can be done by building a classification of users such that the queries issued by the members of a group form a cluster in the metric space. This information can be used to partition the database into different data marts. Each mart only contains the information mainly relevant for the user group accessing the data mart. It is possible to extend this approach from single queries to sessions by defining meaningful distance metrics for different sessions. Hypertext Approach. Discovering typical interaction patterns for world wide web environments is currently an active topic of research (e.g., [SF99], [MJH+96]). OLAP systems and WWW applications share the common characteristic of navigational and explorative data access. Therefore, techniques from this area can be used for analysis of navigational patterns in OLAP systems. [Sap99] shows a possible transformation such that all possible query prototypes for a given multidimensional scheme form a graph structure containing the query prototypes as nodes and possible user interactions as edges. Thus, an OLAP session corresponds to a path inside this graph. Using this transformation it is possible to deploy the methods proposed for WWW mining (e.g., WUM: Web Utilization Miner [SF99]) for the analysis of the log data. The results of this approach are patterns concerning the navigational behavior of the user, i.e., knowledge about the correlation of successive queries. This information can be used for prediction purposes. 
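Once sessions are viewed as paths through a graph of query prototypes, a very simple form of navigational knowledge is the count of observed transitions between successive prototypes. The sketch below uses these counts to suggest the most frequent follow-up query. It is a first-order simplification chosen for brevity and is not the WUM method cited above; the prototype labels and sessions are invented.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Optional, Sequence


def count_transitions(sessions: Sequence[Sequence[str]]) -> Dict[str, Counter]:
    """Count how often each prototype is directly followed by another one."""
    transitions: Dict[str, Counter] = defaultdict(Counter)
    for session in sessions:
        for current, following in zip(session, session[1:]):
            transitions[current][following] += 1
    return transitions


def predict_next(transitions: Dict[str, Counter], current: str) -> Optional[str]:
    """Most frequently observed successor of `current`, or None if unseen."""
    followers = transitions.get(current)
    if not followers:
        return None
    return followers.most_common(1)[0][0]


# Hypothetical sessions, each a sequence of query prototype labels.
sessions: List[List[str]] = [
    ["cases_by_type_country", "cases_by_type", "price_by_mileage"],
    ["cases_by_type_country", "price_by_mileage"],
    ["cases_by_type_country", "price_by_mileage", "price_by_mileage_year"],
]
transitions = count_transitions(sessions)
print(predict_next(transitions, "cases_by_type_country"))
```

A predictor that conditions on the whole session prefix rather than only the last query would capture longer patterns, at the cost of needing considerably more log data.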
An algorithm that predicts possible next queries of the user on the basis of the discovered knowledge and the actual prefix of an ongoing session can improve the caching and prefetching behavior of the OLAP system at runtime. Furthermore, this profile information can in turn be used to drive the user interface design (e.g., supporting or automating frequent interaction sequences) and the design of the conceptual schema.

Sequence of Events Approach. A query can also be seen as an event in time and a session as a sequence of queries. This time sequence can be task or user specific (if only the queries and sessions connected with certain tasks are included in the event sequence). Using this interpretation, it is possible to apply algorithms that were designed for mining time series data. For example, association rule algorithms on sequence data (e.g., [MT95]) can be used to identify typical sequences of queries. [RS99] contains an up-to-date bibliography of temporal data mining approaches. The result of this approach is knowledge about the correlation in time (intra-session patterns), i.e., general patterns inherent in all sessions (e.g., after more than three drill-down operations in a certain dimension it is likely that a roll-up operation occurs).

Summing up, an adequate representation enables the use of temporal and spatial data mining methods in order to find groups of users that are similar concerning the time of their queries (e.g., between 8am and 10am), the contents of their queries, their navigational behavior, the granularity of their analysis, or the complexity of their analysis.

9 Conclusions and Future Work

In the previous sections we showed that data mining methods can support the most important and costly tasks of data warehouse design. Thus, data mining methods incorporated into a warehouse designer workbench tool could make the design process more efficient. For those areas where current algorithms do not meet the requirements, research work can be devoted to extending existing methods.

References

[AAR96] Arning, A.; Agrawal, R.; Raghavan, P.: A Linear Method for Deviation Detection in Large Databases, in Proc. of the 2nd International Conference on Knowledge Discovery & Data Mining, AAAI Press, August.

[AIS93] Agrawal, R.; Imielinski, T.; Swami, A.: Mining Association Rules between Sets of Items in Large Databases, in Proc. of the ACM SIGMOD Conference on Management of Data.

[AR99] Abraham, T.; Roddick, J. F.: Incremental Meta-mining from Large Temporal Data Sets, in Kambayashi, Y.; Lee, D. K.; Lim, E.-P. et al. (eds.): Advances in Database Technologies, Proc. of the 1st International Workshop on Data Warehousing and Data Mining, DWDM 98, Lecture Notes in Computer Science Vol. 1552, Springer Verlag, Berlin.

[CC98] Chapman, P.; Clinton, J.: The Current CRISP-DM Process Model for Data Mining, Discussion Paper.

[CHS+97] Conrad, S.; Höding, M.; Saake, G.; Schmitt, I.; Türker, C.: Scheme Integration with Integrity Constraints, BNCOD.

[CS97] Choenni, S.; Siebes, A.: Query Optimization to Support Data Mining, in Proc. of the 8th International Conference and Workshops on Database and Expert Systems Applications, IEEE Computer Society Press.

[EKS+96] Ester, M.; Kriegel, H.-P.; Sander, J.; Xu, X.: A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise, in Proc. of the 2nd International Conference on Knowledge Discovery and Data Mining (KDD 96), Portland, Oregon, AAAI Press.

[GMV96] Guyon, I.; Matic, N.; Vapnik, V.: Discovering Informative Patterns and Data Cleaning, in Fayyad, U.; Piatetsky-Shapiro, G.; Smyth, P.; Uthurusamy, R. (eds.): Advances in Knowledge Discovery and Data Mining, AAAI Press.

[HBD+97] Höfling, G.; Blaschka, M.; Dinter, B.; Spiegel, P.; Ringel, T.: Data Warehouse Technology for the Management of Diagnosis Data (in German), in Dittrich, Geppert (eds.): Datenbanksysteme in Büro, Technik und Wissenschaft (BTW), Springer Verlag.

[HCC98] Han, J.; Chee, S.; Chiang, J. Y.: Issues for On-line Analytical Mining of Data Warehouses, in Proc. of the 1998 Workshop on Research Issues on Data Mining and Knowledge Discovery (DMKD 98), Washington, June.

[JQJ98] Jeusfeld, M. A.; Quix, C.; Jarke, M.: Design and Analysis of Quality Information for Data Warehouses, in Proc. of the 17th International Conference on Conceptual Modeling (ER 98), Singapore.

[LHM+97] Liu, B.; Hsu, W.; Mun, L.-F.; Lee, H.-Y.: Identifying Interesting Missing Patterns, in Lu, H.; Motoda, H.; Liu, H. (eds.): Proc. of the 1st Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD 97), World Scientific, Singapore.

[MT95] Mannila, H.; Toivonen, H.: Discovering Generalized Episodes in Sequences, in Proc. of the 1st International Conference on Knowledge Discovery and Data Mining (KDD 95), Canada.

[MJH+96] Mobasher, B.; Jain, N.; Han, E.-H.; Srivastava, J.: Web Mining: Pattern Discovery from World Wide Web Transactions, Technical Report TR96-050, Department of Computer Science, University of Minnesota.

[RS99] Roddick, J. F.; Spiliopoulou, M.: A Bibliography of Temporal, Spatial, and Spatio-temporal Data Mining Research, SIGKDD Vol. 1(1).

[Sap99] Sapia, C.: On Modeling and Predicting Query Behavior for OLAP Systems, in Proc. of the CAiSE 99 Workshop on Design and Management of Data Warehouses (DMDW 99), Heidelberg.

[SBH99] Sapia, C.; Blaschka, M.; Höfling, G.: An Overview of Multidimensional Data Models for OLAP, FORWISS Technical Report, February.

[SF99] Spiliopoulou, M.; Faulstich, L. C.: WUM: A Tool for Web Utilization Analysis, in extended version of Proc. EDBT Workshop WebDB 98, LNCS 1590, Springer Verlag.

[SFW99] Spiliopoulou, M.; Faulstich, L. C.; Winkler, K.: A Data Miner Analyzing the Navigational Behavior of Web Users, in Proc. of the Workshop on Machine Learning in User Modeling of the ACAI 99 International Conference, Crete, Greece, July.


More information

Knowledge Modelling and Management. Part B (9)

Knowledge Modelling and Management. Part B (9) Knowledge Modelling and Management Part B (9) Yun-Heh Chen-Burger http://www.aiai.ed.ac.uk/~jessicac/project/kmm 1 A Brief Introduction to Business Intelligence 2 What is Business Intelligence? Business

More information

Data Mining Concepts & Techniques

Data Mining Concepts & Techniques Data Mining Concepts & Techniques Lecture No. 01 Databases, Data warehouse Naeem Ahmed Email: naeemmahoto@gmail.com Department of Software Engineering Mehran Univeristy of Engineering and Technology Jamshoro

More information

Data Mining. Yi-Cheng Chen ( 陳以錚 ) Dept. of Computer Science & Information Engineering, Tamkang University

Data Mining. Yi-Cheng Chen ( 陳以錚 ) Dept. of Computer Science & Information Engineering, Tamkang University Data Mining Yi-Cheng Chen ( 陳以錚 ) Dept. of Computer Science & Information Engineering, Tamkang University Why Mine Data? Commercial Viewpoint Lots of data is being collected and warehoused Web data, e-commerce

More information

Data warehousing in telecom Industry

Data warehousing in telecom Industry Data warehousing in telecom Industry Dr. Sanjay Srivastava, Kaushal Srivastava, Avinash Pandey, Akhil Sharma Abstract: Data Warehouse is termed as the storage for the large heterogeneous data collected

More information

GUJARAT TECHNOLOGICAL UNIVERSITY MASTER OF COMPUTER APPLICATIONS (MCA) Semester: IV

GUJARAT TECHNOLOGICAL UNIVERSITY MASTER OF COMPUTER APPLICATIONS (MCA) Semester: IV GUJARAT TECHNOLOGICAL UNIVERSITY MASTER OF COMPUTER APPLICATIONS (MCA) Semester: IV Subject Name: Elective I Data Warehousing & Data Mining (DWDM) Subject Code: 2640005 Learning Objectives: To understand

More information

Data Warehouse and Data Mining

Data Warehouse and Data Mining Data Warehouse and Data Mining Lecture No. 02 Lifecycle of Data warehouse Naeem Ahmed Email: naeemmahoto@gmail.com Department of Software Engineering Mehran Univeristy of Engineering and Technology Jamshoro

More information

1 Dulcian, Inc., 2001 All rights reserved. Oracle9i Data Warehouse Review. Agenda

1 Dulcian, Inc., 2001 All rights reserved. Oracle9i Data Warehouse Review. Agenda Agenda Oracle9i Warehouse Review Dulcian, Inc. Oracle9i Server OLAP Server Analytical SQL Mining ETL Infrastructure 9i Warehouse Builder Oracle 9i Server Overview E-Business Intelligence Platform 9i Server:

More information

DATA WAREHOUSING AND MINING UNIT-V TWO MARK QUESTIONS WITH ANSWERS

DATA WAREHOUSING AND MINING UNIT-V TWO MARK QUESTIONS WITH ANSWERS DATA WAREHOUSING AND MINING UNIT-V TWO MARK QUESTIONS WITH ANSWERS 1. NAME SOME SPECIFIC APPLICATION ORIENTED DATABASES. Spatial databases, Time-series databases, Text databases and multimedia databases.

More information

C-NBC: Neighborhood-Based Clustering with Constraints

C-NBC: Neighborhood-Based Clustering with Constraints C-NBC: Neighborhood-Based Clustering with Constraints Piotr Lasek Chair of Computer Science, University of Rzeszów ul. Prof. St. Pigonia 1, 35-310 Rzeszów, Poland lasek@ur.edu.pl Abstract. Clustering is

More information

Knowledge Discovery. Javier Béjar URL - Spring 2019 CS - MIA

Knowledge Discovery. Javier Béjar URL - Spring 2019 CS - MIA Knowledge Discovery Javier Béjar URL - Spring 2019 CS - MIA Knowledge Discovery (KDD) Knowledge Discovery in Databases (KDD) Practical application of the methodologies from machine learning/statistics

More information

Tribhuvan University Institute of Science and Technology MODEL QUESTION

Tribhuvan University Institute of Science and Technology MODEL QUESTION MODEL QUESTION 1. Suppose that a data warehouse for Big University consists of four dimensions: student, course, semester, and instructor, and two measures count and avg-grade. When at the lowest conceptual

More information

After completing this course, participants will be able to:

After completing this course, participants will be able to: Designing a Business Intelligence Solution by Using Microsoft SQL Server 2008 T h i s f i v e - d a y i n s t r u c t o r - l e d c o u r s e p r o v i d e s i n - d e p t h k n o w l e d g e o n d e s

More information

TDWI Data Modeling. Data Analysis and Design for BI and Data Warehousing Systems

TDWI Data Modeling. Data Analysis and Design for BI and Data Warehousing Systems Data Analysis and Design for BI and Data Warehousing Systems Previews of TDWI course books offer an opportunity to see the quality of our material and help you to select the courses that best fit your

More information

2. Data Preprocessing

2. Data Preprocessing 2. Data Preprocessing Contents of this Chapter 2.1 Introduction 2.2 Data cleaning 2.3 Data integration 2.4 Data transformation 2.5 Data reduction Reference: [Han and Kamber 2006, Chapter 2] SFU, CMPT 459

More information

Data Mining: Concepts and Techniques. (3 rd ed.) Chapter 1

Data Mining: Concepts and Techniques. (3 rd ed.) Chapter 1 Data Mining: Concepts and Techniques (3 rd ed.) Chapter 1 Jiawei Han, Micheline Kamber, and Jian Pei University of Illinois at Urbana-Champaign & Simon Fraser University 2013 Han, Kamber & Pei. All rights

More information

CS423: Data Mining. Introduction. Jakramate Bootkrajang. Department of Computer Science Chiang Mai University

CS423: Data Mining. Introduction. Jakramate Bootkrajang. Department of Computer Science Chiang Mai University CS423: Data Mining Introduction Jakramate Bootkrajang Department of Computer Science Chiang Mai University Jakramate Bootkrajang CS423: Data Mining 1 / 29 Quote of the day Never memorize something that

More information

Handout 12 Data Warehousing and Analytics.

Handout 12 Data Warehousing and Analytics. Handout 12 CS-605 Spring 17 Page 1 of 6 Handout 12 Data Warehousing and Analytics. Operational (aka transactional) system a system that is used to run a business in real time, based on current data; also

More information

Data Mining Concepts

Data Mining Concepts Data Mining Concepts Outline Data Mining Data Warehousing Knowledge Discovery in Databases (KDD) Goals of Data Mining and Knowledge Discovery Association Rules Additional Data Mining Algorithms Sequential

More information

DATA MINING TRANSACTION

DATA MINING TRANSACTION DATA MINING Data Mining is the process of extracting patterns from data. Data mining is seen as an increasingly important tool by modern business to transform data into an informational advantage. It is

More information

Materialized Data Mining Views *

Materialized Data Mining Views * Materialized Data Mining Views * Tadeusz Morzy, Marek Wojciechowski, Maciej Zakrzewicz Poznan University of Technology Institute of Computing Science ul. Piotrowo 3a, 60-965 Poznan, Poland tel. +48 61

More information

The Data Mining usage in Production System Management

The Data Mining usage in Production System Management The Data Mining usage in Production System Management Pavel Vazan, Pavol Tanuska, Michal Kebisek Abstract The paper gives the pilot results of the project that is oriented on the use of data mining techniques

More information

Data warehouse and Data Mining

Data warehouse and Data Mining Data warehouse and Data Mining Lecture No. 14 Data Mining and its techniques Naeem A. Mahoto Email: naeemmahoto@gmail.com Department of Software Engineering Mehran Univeristy of Engineering and Technology

More information

The Procedure Proposal of Manufacturing Systems Management by Using of Gained Knowledge from Production Data

The Procedure Proposal of Manufacturing Systems Management by Using of Gained Knowledge from Production Data The Procedure Proposal of Manufacturing Systems Management by Using of Gained Knowledge from Production Data Pavol Tanuska Member IAENG, Pavel Vazan, Michal Kebisek, Milan Strbo Abstract The paper gives

More information

INSTITUTE OF AERONAUTICAL ENGINEERING (Autonomous) Dundigal, Hyderabad

INSTITUTE OF AERONAUTICAL ENGINEERING (Autonomous) Dundigal, Hyderabad INSTITUTE OF AERONAUTICAL ENGINEERING (Autonomous) Dundigal, Hyderabad - 500 043 INFORMATION TECHNOLOGY DEFINITIONS AND TERMINOLOGY Course Name : DATA WAREHOUSING AND DATA MINING Course Code : AIT006 Program

More information