INTEGRATING DATA CUBE COMPUTATION AND EMERGING PATTERN MINING FOR MULTIDIMENSIONAL DATA ANALYSIS


INTEGRATING DATA CUBE COMPUTATION AND EMERGING PATTERN MINING FOR MULTIDIMENSIONAL DATA ANALYSIS

by

Wei Lu

A report submitted in partial fulfillment of the requirements for the SFU-ZU dual degree of Bachelor of Science in the School of Computing Science, Simon Fraser University, and the College of Computer Science and Technology, Zhejiang University.

© Wei Lu 2010
SIMON FRASER UNIVERSITY AND ZHEJIANG UNIVERSITY
April 2010

All rights reserved. This work may not be reproduced in whole or in part, by photocopy or other means, without the permission of the author.

APPROVAL

Name: Wei Lu
Degree: Bachelor of Science
Title of Report: Integrating Data Cube Computation and Emerging Pattern Mining for Multidimensional Data Analysis

Examining Committee:
Dr. Jian Pei, Associate Professor, Computing Science, Simon Fraser University (Supervisor)
Dr. Qianping Gu, Professor, Computing Science, Simon Fraser University (Supervisor)
Dr. Ramesh Krishnamurti, Professor, Computing Science, Simon Fraser University (SFU Examiner)

Date Approved:

Abstract

Online analytical processing (OLAP) in multidimensional text databases has recently become an effective tool for analyzing text-rich data such as web documents. In this capstone project, we follow the trend of using OLAP and the data cube to analyze web documents, but address a new problem from the data mining perspective. In particular, we wish to find contrast patterns in documents of different classes and then use those patterns in OLAP-style analysis of text data and web documents. To this end, we propose to integrate the data cube with an important kind of contrast pattern, the emerging pattern, to build a new data model for solving the document analysis problem. Specifically, this novel data model is implemented on top of the traditional data cube by seamlessly integrating the bottom-up cubing (BUC) algorithm with two different emerging pattern mining algorithms, the Border-Differential and the DPMiner. The processes of cube construction and emerging pattern mining are merged and carried out simultaneously; patterns are stored in the cube as cell measures. Moreover, we study and compare the performance of the two integrations through experiments on datasets derived from the Frequent Itemset Mining Implementations Repository (FIMI). Finally, we suggest improvements and optimizations for future work.

To my family

"For those who believe, no proof is necessary; for those who don't believe, no proof is possible."

— Stuart Chase, Writer and Economist, 1888

Acknowledgments

First of all, I would like to express my deepest appreciation to Dr. Jian Pei for his support and guidance during my studies at Simon Fraser University. In the various courses I took with him, and particularly in this capstone project, Dr. Pei showed me his broad knowledge and deep insights in the area of data management and mining, as well as his great personality and patience with a research beginner like me. In his precious time, he provided me with much help and advice on the project and other concerns (especially my graduate school applications). This work would not have been possible without his supervision.

I would love to thank Dr. Qianping Gu and Dr. Ramesh Krishnamurti for reviewing my report and directing the capstone projects for this amazing dual degree program. My gratitude also goes to Dr. Ze-Nian Li, Dr. Stella Atkins, Dr. Greg Mori and Dr. Ted Kirkpatrick for the wonderful classes I took at SFU and their good advice for my studies and career development. Thanks also to Dr. Guozhu Dong at Wright State University and Dr. Guimei Liu at the National University of Singapore for making useful resources available for my work. I would also like to thank Mr. Thusjanthan Kubendranathan at SFU for his time and help in our discussions about this project.

Deepest gratitude to my family and friends, who make my life enjoyable. In particular, I am greatly indebted to my beloved parents for their unconditional support and encouragement. Their love accompanies me wherever I go. This work is dedicated to them, and I hope they are proud of me, as I am always proud of them.

Contents

Approval
Abstract
Dedication
Quotation
Acknowledgments
Contents
List of Tables
List of Figures

1 Introduction
  1.1 Overview of Text Data Mining
  1.2 Related Work on Multidimensional Text Data Analysis
  1.3 Contrast Pattern Based Document Analysis
  1.4 Structure of the Report

2 Literature Review
  2.1 Data Cubes and Online Analytical Processing
    2.1.1 An Example of The Data Cube
    2.1.2 Data Cubing Algorithms
    2.1.3 BUC: Bottom-Up Computation for Data Cubing
  2.2 Frequent Pattern Mining
  2.3 Emerging Pattern Mining
    2.3.1 The Border-Differential Algorithm
    2.3.2 The DPMiner Algorithm
  2.4 Summary

3 Motivation
  3.1 Motivation for Mining Contrast Patterns
  3.2 Motivation for Utilizing The Data Cube
  3.3 Summary

4 Our Methodology
  4.1 Problem Formulation
    4.1.1 Normalizing Data Schema for Text Databases
    4.1.2 Problem Modeling with Normalized Data Schema
  4.2 Processing Framework
    4.2.1 Integrating Data
    4.2.2 Integrating Algorithms
  4.3 Implementations
    4.3.1 BUC with DPMiner
    4.3.2 BUC with Border-Differential and PADS
  4.4 Summary

5 Experimental Results and Performance Study
  5.1 The Test Dataset
  5.2 Comparative Performance Study and Analysis
    5.2.1 Evaluating the BUC Implementation
    5.2.2 Comparing Border-Differential with DPMiner
  5.3 Summary

6 Conclusions
  6.1 Summary of The Project
  6.2 Limitations and Future Work

Bibliography

Index

List of Tables

2.1 A base table storing sales data [15]
2.2 Aggregates computed by group-by Branch
2.3 The full data cube based on Table 2.1
2.4 A sample transaction database [21]
3.1 A multidimensional text database concerning Olympic news
4.1 A normalized dataset derived from the Olympic news database
4.2 A normalized dataset reproduced from Table 4.1
5.1 Sizes of synthetic datasets for experiments
5.2 The complete experimental results

List of Figures

2.1 BUC Algorithm [5, 27]
2.2 An example of FP-tree based on Table 2.4 [21]
5.1 Running time and cube size of our BUC implementation
5.2 Comparing running time of the two integration algorithms

Chapter 1

Introduction

1.1 Overview of Text Data Mining

Analysis of documents in text databases and on the World Wide Web has been attracting researchers from various areas, such as data mining, machine learning, information retrieval, database systems, and natural language processing. In general, studies in different areas have different emphases. Traditional information retrieval techniques (e.g., the inverted index and the vector-space model) have proven efficient and effective in retrieving relevant documents to answer unstructured keyword-based queries. Machine learning approaches are also widely used in text mining, providing effective solutions to various problems. For example, the Naive Bayes model and Support Vector Machines (SVMs) are used in document classification; K-means and the Expectation-Maximization (EM) algorithm are used in document clustering. The textbook by Manning et al. [19] covers the topics summarized above and much more in both traditional information retrieval and machine learning based document analysis.

On the other hand, data warehousing and data mining also play important roles in analyzing documents, especially those stored in a special kind of database called the multidimensional text database (one with both relational dimensions and text fields). While information retrieval mainly addresses searching for documents, and for information within documents, according to users' information needs, the goal of text mining differs in the following sense: it focuses on finding and extracting useful patterns and hidden knowledge from the information in documents and/or text databases, so as to improve decision making processes based on the text information.

Currently, many real-life business, administration and scientific databases are multidimensional text databases, containing both structured attributes and unstructured text attributes. An example of such a database can be found in Table 3.1. Since data warehousing and online analytical processing (OLAP) have proven their great usefulness in managing and mining multidimensional data of varied granularities [11], they have recently become important tools in analyzing such text databases [6, 17, 24, 26].

1.2 Related Work on Multidimensional Text Data Analysis

A data warehouse is a subject-oriented, integrated, time-varying, non-volatile collection of data that is used primarily in organizational decision making [13]. Online analytical processing (OLAP), which is dominated by stylized queries that involve group-by and aggregate operators [27], is a powerful tool in data warehousing. As a multidimensional data model with various features, the data cube [10] has become an essential OLAP facility in data warehousing. Conceptually, the data cube is an extended database with aggregates at multiple levels and in multiple dimensions [15]. It generalizes the group-by operator by precomputing and storing group-bys with regard to all possible combinations of dimensions. Data cubes are widely used in data warehousing for analyzing multidimensional data.

Applying OLAP techniques, especially data cubes, to analyzing documents in multidimensional text databases has made significant advances. Important information retrieval measures, namely term frequencies and inverted indexes, have been integrated into the traditional data cube, leading to the text cube [17]. It explores both the dimension hierarchy and the term hierarchy in the text data, and is able to answer OLAP queries by navigating to a specific cell via roll-up and drill-down operations.
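The way a cube generalizes group-by, precomputing an aggregate for every combination of dimensions, can be sketched in a few lines. The following is a minimal Python illustration, not taken from the cited works: it naively materializes a full cube over a toy sales table, with "*" marking a dimension that has been aggregated away. Real cubing algorithms such as BUC avoid this exhaustive per-tuple enumeration.

```python
from itertools import combinations
from collections import defaultdict

# Toy base table (illustrative): dimensions Branch, Product, Season
# and a Sales measure.
rows = [
    ("B1", "P1", "spring", 6),
    ("B1", "P2", "spring", 12),
    ("B2", "P1", "fall", 9),
]
NDIMS = 3

def full_cube(rows):
    """Naively materialize AVG(Sales) for every group-by over
    all 2^NDIMS subsets of the dimensions."""
    acc = defaultdict(lambda: [0, 0])  # cell -> [sum, count]
    for *vals, sales in rows:
        for k in range(NDIMS + 1):
            for keep in combinations(range(NDIMS), k):
                cell = tuple(v if i in keep else "*" for i, v in enumerate(vals))
                acc[cell][0] += sales
                acc[cell][1] += 1
    return {cell: s / c for cell, (s, c) in acc.items()}

cube = full_cube(rows)
print(len(cube))               # 18 non-empty cells for this toy table
print(cube[("B1", "*", "*")])  # 9.0: average over all B1 sales
```

For this three-tuple table the cube holds 18 non-empty cells; a cell such as ("B1", "*", "*") answers the group-by on Branch alone.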
More recently, the work in [6] proposes a query answering technique called TopCells to address top-k query answering in the text cube. Given a keyword query, TopCells is able to find the top-k ranked cells containing aggregated documents that are most relevant to the query. Another OLAP-based model dealing with multidimensional text data is the topic cube [26], which combines OLAP with probabilistic topic modeling. It explores the topic hierarchy of documents and stores probability-based measures learned through a probabilistic model. Moreover, text cubes and topic cubes have been applied to information network analysis, where they are combined into an information-network-enhanced text cube called iNextCube

[24]. Most previous works emphasize data warehousing more than data mining. They mainly deal with problems such as how to explore and establish dimensional hierarchies within the text data, and how to efficiently answer OLAP queries using cubes built on text data.

1.3 Contrast Pattern Based Document Analysis

We follow the trend of using data cubes to analyze documents in multidimensional text databases. But as the previous works are more data warehousing oriented, we intend to address a more data mining oriented problem called contrast pattern based document analysis. More specifically, we wish to find contrast patterns in documents of different classes and then use those patterns in OLAP-style document analysis (like the work in [6, 17]). This application is promising and has real-life demands. For example, from a large collection of documents containing information and reviews of laptop computers of various brands, a user interested in comparing Dell and Sony laptops might wish to find text information describing Dell's special features that do not characterize Sony. These features contrast the two brands effectively, and would probably make the user's decision to select Dell easier.

To achieve this goal, we propose to integrate frequent pattern mining, especially emerging pattern mining, with data cubing in an efficient and effective way. Frequent pattern mining [2] aims to find itemsets, that is, sets of items that frequently occur in a dataset. Furthermore, patterns that can contrast different classes of data must intuitively be frequent in one class but comparatively infrequent in the others. One important class of contrast patterns, called emerging patterns [7], is defined as itemsets whose supports increase significantly from dataset D1 to dataset D2. That is, such patterns are frequent in D2 but infrequent in D1.
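The "significant support increase" idea can be made concrete with a deliberately naive sketch. The toy classes D1 and D2 and the thresholds min_sup and min_growth below are illustrative assumptions (a minimum growth rate of support is one common way to quantify the increase); real miners such as the Border-Differential and DPMiner avoid this brute-force enumeration.

```python
from itertools import combinations

# Two toy classes of transactions (illustrative data only).
D1 = [{"a", "b"}, {"b", "c"}, {"a", "c"}, {"b"}]
D2 = [{"a", "d"}, {"a", "d", "e"}, {"a", "d"}, {"d", "e"}]

def support(itemset, dataset):
    """Fraction of transactions containing the itemset."""
    return sum(itemset <= t for t in dataset) / len(dataset)

def emerging_patterns(D1, D2, min_growth=2.0, min_sup=0.5):
    """Brute force: itemsets frequent in D2 whose support grows
    from D1 to D2 by at least min_growth (growth is treated as
    infinite when the itemset is absent from D1)."""
    items = set().union(*D2)
    found = []
    for k in range(1, len(items) + 1):
        for cand in combinations(sorted(items), k):
            s1, s2 = support(set(cand), D1), support(set(cand), D2)
            if s2 >= min_sup and (s1 == 0 or s2 / s1 >= min_growth):
                found.append((cand, s1, s2))
    return found

for pat, s1, s2 in emerging_patterns(D1, D2):
    print(pat, s1, s2)
```

Here {"a"} is frequent in both classes (its support grows only 1.5-fold) and is filtered out, while itemsets such as {"a", "d"} never occur in D1 and emerge in D2.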
Because of the sharp change of their supports across different datasets, such patterns meet our need to show contrasts between different classes of web documents.

Our Contributions

To tackle the contrast pattern based document analysis problem, we propose a novel data model integrating efficient emerging pattern mining algorithms (e.g., the Border-Differential [7]

and the state-of-the-art DPMiner [16]) with the traditional data cube. This integrated model is novel, but also preserves the features of traditional data cubes:

1. It is based on the data cube, and is constructed through a classical data cubing algorithm called BUC (Bottom-Up Computation for data cubing) [5].

2. It contains multidimensional text data and multiple-granularity aggregates of such data, in order to support fast OLAP operations (such as roll-up and drill-down) and query answering.

3. Each cell in the cube contains the set of documents in the multidimensional text database aggregated under the cell's matching dimension attributes.

4. The measure of each cell is the set of emerging patterns whose supports rise rapidly from the documents not aggregated in the cell to the documents aggregated in the cell.

In this capstone project, we implement this integrated data model by incorporating emerging pattern mining seamlessly into the data cubing process. We choose BUC as our cubing algorithm to build the cube on structured dimensions. While aggregating documents and materializing cells, we simultaneously mine emerging patterns in the documents aggregated in each particular cell, and store such patterns as the measure of that cell. Two widely used emerging pattern mining algorithms, the Border-Differential and the DPMiner, are integrated with BUC cubing so as to compare their performance. We tested these two integrations on synthetic datasets to evaluate their performance on different sizes of input data. The datasets are derived from the Frequent Itemset Mining Implementations Repository (FIMI) [9]. Experimental results show that the state-of-the-art emerging pattern mining algorithm, the DPMiner, is a better choice than the Border-Differential.

Our cube-based model shares similarities with the text cube [17] and the topic cube [26] at the level of data structure, since all three cubes are built on multidimensional text data.
This structural similarity allows the OLAP query answering techniques developed in [6, 17, 24, 26] to be directly applied to our cube. In that sense, point queries (seeking a cell), sub-cube queries (seeking an entire group-by) and top-k queries (seeking the k most relevant cells) can all be answered in contrast pattern based document analysis using our model.

Major Differences from Existing Works

This cube-based data model with emerging patterns as cell measures differs from all previous related work. It is unlike traditional data cubes, which use simple aggregate functions as cell measures and are adequate only for relational databases. Our approach also differs from the text cube, which uses term frequencies and inverted indexes as cell measures, and the topic cube, which uses probabilistic measures. Most importantly, to the best of our knowledge, our data model is novel in comparison to previous applications of emerging patterns in OLAP. Specifically, a previous work [20] used the Border-Differential algorithm to perform cube comparisons and capture trend changes between two precomputed data cubes. However, that work is of limited use and cannot be applied to multidimensional text data analysis. First, their approach worked on datasets different in kind from ours: it only works on traditional data cubes built upon relational databases with categorical dimension attributes, while ours is designed for multidimensional text databases. Second, their approach finds cells whose supports grow significantly from one cube to another, while ours determines emerging patterns for every single cell in the cube. Last but not least, their approach performs the Border-Differential algorithm after two data cubes have been completely built, whereas our approach introduces a seamless integration: data cubing and emerging pattern mining are carried out simultaneously.

1.4 Structure of the Report

The rest of this capstone project report is organized as follows: Chapter 2 conducts a literature review of previous work and background knowledge that lays the foundation for this project. Chapter 3 motivates the contrast pattern based document analysis problem. Chapter 4 describes our methodology for tackling the problem.
This chapter formulates the problem and proposes algorithms for constructing the integrated data model. Chapter 5 reports experimental results and studies the performance of our algorithms. Lastly, Chapter 6 concludes this capstone project and suggests improvements and optimizations for future work.

Chapter 2

Literature Review

This chapter reviews three categories of previous research related to this capstone project: data cubes and OLAP, frequent pattern mining, and emerging pattern mining. In Section 2.1 we cover fundamentals of data warehousing, online analytical processing (OLAP), and data cubing, highlighting BUC [5], a bottom-up approach to data cubing. Section 2.2 introduces frequent pattern mining and an important mining algorithm called FP-Growth [12]. Section 2.3 reviews the emerging pattern mining algorithms (Border-Differential [7] and DPMiner [16]) that are particularly useful to our work.

2.1 Data Cubes and Online Analytical Processing

A data warehouse is a subject-oriented, integrated, time-varying, non-volatile collection of data in support of management's decision-making process [13]. A powerful tool for exploiting data warehouses is online analytical processing (OLAP). Typically, OLAP systems are dominated by stylized queries involving many group-by and aggregation operations [27]. The data cube was introduced in [10] to facilitate answering OLAP queries on multidimensional data stored in data warehouses. A data cube can be viewed as an extended multi-level and multidimensional database with aggregates at multiple granularities [15]. The term data cubing refers to the process of constructing a data cube based on a relational database table, often referred to as the base table. In a cubing process, cells with non-empty aggregates are materialized. Given a base table, we precompute group-bys and the corresponding aggregate values with respect to all possible combinations

of dimensions in this table. Each group-by corresponds to a set of cells, and the aggregate value for the group-by is stored as the measure of each cell. Cell measures provide a good and concise summary of the information aggregated in the cube.

In light of the above, the data cube is a powerful data model allowing fast retrieval and analysis of multidimensional data for decision making processes based on data warehouses. It generalizes the group-by operator in SQL (Structured Query Language), and enables data analysts to avoid long and complicated SQL queries when searching for unusual data patterns in multidimensional databases [10].

2.1.1 An Example of The Data Cube

Example (Data Cube): Table 2.1 is a sample base table in a marketing management data warehouse [15]. It shows data organized under the schema (Branch, Product, Season, Sales).

Branch  Product  Season  Sales
B1      P1       spring  6
B1      P2       spring  12
B2      P1       fall    9

Table 2.1: A base table storing sales data [15].

To build a data cube upon this table, group-bys are computed on the three dimensions Branch, Product and Season. Aggregate values of Sales are the cell measures; we choose Average(Sales) as the aggregate function for this example. Since most intermediate steps of a data cubing process are basically computing group-bys and aggregate values to form cells, we illustrate the two cells computed by group-by Branch in Table 2.2.

Cell No.  Branch  Product  Season  AVG(Sales)
1         B1      *        *       9
2         B2      *        *       9

Table 2.2: Aggregates computed by group-by Branch.

In the same manner, the full data cube contains all possible group-bys on Branch, Product and Season. It is shown in Table 2.3. Note that cells 1, 2 and 3 are derived from the least aggregated group-by: group-by Branch, Product, Season. Such cells are

called base cells. On the other hand, cell 18, (*, *, *), is the apex cuboid aggregating all tuples in the base table.

Cell No.  Branch  Product  Season  AVG(Sales)
1         B1      P1       spring  6
2         B1      P2       spring  12
3         B2      P1       fall    9
4         B1      P1       *       6
5         B1      P2       *       12
6         B1      *        spring  9
7         B2      P1       *       9
8         B2      *        fall    9
9         *       P1       spring  6
10        *       P1       fall    9
11        *       P2       spring  12
12        *       *        spring  9
13        *       *        fall    9
14        B1      *        *       9
15        B2      *        *       9
16        *       P1       *       7.5
17        *       P2       *       12
18        *       *        *       9

Table 2.3: The full data cube based on Table 2.1.

2.1.2 Data Cubing Algorithms

Efficient and scalable data cubing is challenging: when a base table has a large number of dimensions and each dimension has high cardinality, time and space complexity grow exponentially. In general, there are three approaches to cubing, in terms of the order in which cells are materialized: top-down, bottom-up, and a mix of both. A top-down approach (e.g., Multiway Array Aggregation [28]) constructs the cube from the least aggregated base cells towards the most aggregated apex cuboid. On the contrary, a bottom-up approach such as BUC [5] computes cells in the opposite order. Other methods, such as Star-Cubing [23], combine the top-down and bottom-up mechanisms to carry out the cubing process.

On fast computation of multidimensional aggregates, [11] summarizes the following optimization principles: (1) sorting or hashing dimension attributes to cluster related tuples

that are likely to be aggregated together in certain group-bys; (2) computing higher-level aggregates from previously computed lower-level aggregates, and caching intermediate results in memory to reduce expensive I/O operations; (3) computing a group-by from the smallest previously computed group-by; and (4) mapping dimension attributes of various formats to integers ranging between zero and the cardinality of the dimension. Many other heuristics have also been proposed to improve the efficiency of data cubing [1, 5, 11].

2.1.3 BUC: Bottom-Up Computation for Data Cubing

BUC [5] constructs the data cube bottom-up: from the most aggregated apex cuboid to group-bys on a single dimension, then on a pair of dimensions, and so on. It also uses many of the optimization techniques introduced in the previous section. Figure 2.1 illustrates the processing tree and the partition method used by BUC on a 4-dimensional base table. Subfigure (b) shows the recursive nature of BUC: after sorting and partitioning the data on dimension A, we deal with the partition (a1, *, *, *) first and recursively partition it on dimension B to proceed to its parent cell (a1, b1, *, *), then the ancestor (a1, b1, c1, *), and so on. After dealing with partition a1, BUC continues to process partitions a2, a3 and a4 in the same manner until all cells are materialized.

Figure 2.1: BUC Algorithm [5, 27].

The depth-first search process for building our integrated data model (covered in Chapter

4) follows the basic framework of BUC.

2.2 Frequent Pattern Mining

Frequent patterns are patterns (sets of items, sequences, etc.) that occur frequently in a database [2]; their supports must exceed a pre-defined minimal support threshold. Frequent pattern mining has been studied extensively over the past two decades. It lays the foundation for many data mining tasks such as association rule mining [3] and emerging pattern mining. Although its definition is concise, the mining algorithms are non-trivial. Two notable algorithms are Apriori [3] and FP-Growth [12]. FP-Growth is more important to our work, as efficient emerging pattern mining algorithms such as [4, 16] use the FP-tree proposed in FP-Growth as their data structure.

FP-Growth addresses the limitations of the breadth-first-search-based Apriori, such as multiple database scans and the large costs of candidate generation and support counting. It is a depth-first search algorithm. The first scan of the database finds all frequent items, ranks them in frequency-descending order, and puts them into a header table. FP-Growth then compresses the database into a prefix tree called the FP-tree. The complete set of frequent patterns can be mined by recursively constructing projected databases and the FP-trees based on them. For example, given the transaction database in Table 2.4 [21], we can build the FP-tree shown in Figure 2.2.

TID  Items                     (Ordered) Frequent Items
100  f, a, c, d, g, i, m, p    f, c, a, m, p
200  a, b, c, f, l, m, o       f, c, a, b, m
300  b, f, h, j, o             f, b
400  b, c, k, s, p             c, b, p
500  a, f, c, e, l, p, m, n    f, c, a, m, p

Table 2.4: A sample transaction database [21].

Next, we define three special types of frequent patterns that are closely related to emerging pattern mining: maximal frequent patterns (max-patterns for short), closed frequent patterns, and frequent generators.
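The first scan and prefix-tree compression described above can be sketched as follows. This is a simplified illustration only: the header table and node-links that FP-Growth needs for mining are omitted, min_support = 3 is an assumed threshold, and frequency ties are broken alphabetically here, so the resulting item order (c before f) differs from the ordering shown in Table 2.4.

```python
from collections import Counter

class Node:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}

def build_fp_tree(transactions, min_support=3):
    """First scan: count items, keep the frequent ones, rank them in
    descending frequency. Second scan: insert each transaction's
    frequent items, in rank order, as a prefix path."""
    freq = Counter(i for t in transactions for i in t)
    freq = {i: c for i, c in freq.items() if c >= min_support}
    order = sorted(freq, key=lambda i: (-freq[i], i))  # ties: alphabetical
    rank = {i: r for r, i in enumerate(order)}
    root = Node(None, None)
    for t in transactions:
        node = root
        for item in sorted((i for i in t if i in rank), key=rank.__getitem__):
            node = node.children.setdefault(item, Node(item, node))
            node.count += 1
    return root, order

# The five transactions of Table 2.4, one character per item.
transactions = [
    list("facdgimp"), list("abcflmo"), list("bfhjo"),
    list("bcksp"), list("afcelpmn"),
]
root, order = build_fp_tree(transactions)
print(order)  # frequent items in descending-frequency order
```

Shared prefixes collapse into shared paths: four of the five transactions start with the same most-frequent item, so the tree is much smaller than the database it compresses.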
Definition (Max-Pattern): An itemset X is a maximal frequent pattern, or max-pattern, in dataset D if X is frequent in D, and for every proper super-itemset Y such that

X ⊂ Y, Y is infrequent in D [11].

Definition (Closed Pattern and Generator): An itemset X is closed in dataset D if there exists no proper super-itemset Y s.t. X ⊂ Y and support(X) = support(Y) in D. X is a closed frequent pattern in D if it is both closed and frequent in D [11]. An itemset Z is a generator in D if there exists no proper sub-itemset Z' such that Z' ⊂ Z and support(Z') = support(Z) [18].

Figure 2.2: An example of FP-tree based on Table 2.4 [21].

The state-of-the-art max-pattern mining algorithm is the Pattern-Aware Dynamic Search (PADS) [25]. The DPMiner, the state-of-the-art emerging pattern mining algorithm, is also the most powerful algorithm for mining closed frequent patterns and frequent generators.

2.3 Emerging Pattern Mining

Emerging patterns [7] are patterns whose supports increase significantly from one class of data to another. Mathematical details can be found in Section 4.1 (Problem Formulation) of this report and in [4, 7, 8, 16]. The original work on emerging patterns [7] gives an algorithm called the Border-Differential for mining such patterns. It uses borders to succinctly represent patterns and mines the patterns by manipulating the borders only. The work in [4]

used the FP-tree introduced in [12] for emerging pattern mining. Following that, the work in [16] improved the FP-tree-based algorithm by simultaneously generating closed frequent patterns and frequent generators to form emerging patterns. This algorithm is called the DPMiner and is considered the state of the art for emerging pattern mining.

2.3.1 The Border-Differential Algorithm

Border-Differential uses borders to represent patterns. It involves mining max-patterns and manipulating the borders initiated by those patterns to derive the border representation of emerging patterns. A border is an ordered pair ⟨L, R⟩, where L and R are the left and right bounds of the border, respectively. Both L and R are collections of itemsets, but are much smaller than the original pattern collections in size. The emerging patterns represented by ⟨L, R⟩ are the intervals of ⟨L, R⟩, defined as [L, R] = {Y | ∃X ∈ L, ∃Z ∈ R, s.t. X ⊆ Y ⊆ Z}. For example, suppose [L, R] = {{1}, {1,2}, {1,3}, {1,2,3}, {2,3}, {2,3,4}}; it has the border with L = {{1}, {2,3}} and R = {{1,2,3}, {2,3,4}}. Itemsets other than those in L and R (e.g., {1,3}) are intervals of ⟨L, R⟩.

Given a pair of borders ⟨{∅}, R1⟩ and ⟨{∅}, R2⟩ whose left bounds are initially empty, the differential border ⟨L1, R1⟩ is derived to satisfy [L1, R1] = [{∅}, R1] − [{∅}, R2]. This operation is the so-called Border-Differential. Furthermore, given two datasets D1 and D2, to determine emerging patterns using the Border-Differential operation, we first find the max-patterns U1 of D1 and U2 of D2 using PADS, and initiate two borders ⟨{∅}, U1⟩ and ⟨{∅}, U2⟩. Then we take the differential between those two borders. Let U1 = {X1, X2, ..., Xn} and U2 = {Y1, Y2, ..., Ym}, where the Xi and Yj are itemsets; the left bound of the differential border is computed as L1 = ∪_{i=1..n} (PowerSet(Xi) − ∪_{j=1..m} PowerSet(Yj)), of which only the minimal itemsets are kept. The right bound U1 remains the same.
Lastly, we form the border ⟨L1, U1⟩; the set of intervals [L1, U1] of ⟨L1, U1⟩ are the emerging patterns in D1. As the sizes of the datasets grow, the Border-Differential becomes problematic because it involves set enumerations, resulting in exponential computational costs. The work in [8], a more recent version of [7], proposed several optimization techniques to improve the efficiency of Border-Differential. However, the complexity of finding emerging patterns is MAX SNP-hard, which means that polynomial-time approximation schemes do not exist unless P = NP [22].
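The border manipulation above can be illustrated with a deliberately naive sketch that materializes the power sets explicitly, i.e., exactly the exponential set enumeration that the optimized algorithms avoid. The max-pattern collections U1 and U2 below are hypothetical inputs, not taken from the report (in practice they would come from a max-pattern miner such as PADS).

```python
from itertools import chain, combinations

def powerset(s):
    """All subsets of s as frozensets, including the empty set."""
    s = sorted(s)
    return {frozenset(c) for c in chain.from_iterable(
        combinations(s, k) for k in range(len(s) + 1))}

def border_differential(U1, U2):
    """Left bound L1 of the border <L1, U1> representing
    [{}, U1] - [{}, U2], via explicit power-set enumeration."""
    covered = set().union(*(powerset(Y) for Y in U2))
    diff = set().union(*(powerset(X) for X in U1)) - covered
    # A border's left bound keeps only the minimal itemsets.
    return {s for s in diff if not any(t < s for t in diff)}

# Hypothetical max-pattern borders (illustrative values only).
U1 = [{1, 2, 3}, {2, 3, 4}]
U2 = [{2, 3}, {4}]
L1 = border_differential(U1, U2)
print(sorted(sorted(s) for s in L1))  # [[1], [2, 4], [3, 4]]
```

Every itemset lying between L1 and U1 (e.g., {1, 2}, a superset of {1}) belongs to the differential, so the two small bounds represent the whole pattern collection.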

2.3.2 The DPMiner Algorithm

The work in [4] used the FP-tree and pattern-growth methods to mine emerging patterns, but it still needs to call Border-Differential to find them. The DPMiner (which stands for Discriminative Pattern Miner) [16] also uses the FP-tree but mines emerging patterns in a different way. It finds closed frequent patterns and frequent generators simultaneously to form equivalent classes of such patterns, and then determines emerging patterns as non-redundant δ-discriminative equivalent classes [16].

An equivalent class EC is a set of itemsets that always occur together in some transactions of dataset D [16]. It can be uniquely represented by its set of frequent generators G and closed frequent patterns C, in the form EC = [G, C]. Suppose D can be divided into various classes, denoted D = D1 ∪ D2 ∪ ... ∪ Dn. Let δ be a small integer (usually 1 or 2) and θ a minimal support threshold. An equivalent class EC is a δ-discriminative equivalent class provided that the support of its closed pattern C is greater than θ in D1 but smaller than δ in D − D1 = D2 ∪ ... ∪ Dn. Furthermore, EC is a non-redundant δ-discriminative equivalent class if and only if (1) it is δ-discriminative, and (2) there exists no ÊC such that Ĉ ⊂ C, where Ĉ and C are the closed patterns of ÊC and EC, respectively. The closed frequent patterns of a non-redundant δ-discriminative equivalent class are emerging patterns in D1.

Data Structures and Computational Steps of The DPMiner

The high efficiency of the DPMiner is mainly attributed to its revised FP-tree structure. Unlike traditional FP-trees, it does not store items that appear in every transaction and hence have full support in D; these items are removed because they cannot form generators. This modification results in a much smaller FP-tree compared to the original. The computational framework of the DPMiner consists of the following five steps: (1).
Given k classes of data D1, D2, ..., Dk as input, take their union D = D1 ∪ D2 ∪ ... ∪ Dk. Also specify a minimal support threshold θ and a maximal threshold δ (thus, patterns with supports above θ in Di but below δ in D − Di are candidate emerging patterns in Di). (2). Construct an FP-tree based on D and run a depth-first search on the tree to find frequent generators and closed patterns simultaneously. For each search path along the tree, the search terminates whenever a δ-discriminative equivalent class is reached.

(3). Determine the class label distribution for every closed pattern, i.e., find the class in which each closed pattern has the highest support. This step is necessary because patterns are not mined separately for each D_i (1 ≤ i ≤ k), but rather on the entire D. (4). Pair up generators and closed frequent patterns to form δ-discriminative equivalence classes. (5). Output the non-redundant δ-discriminative equivalence classes as emerging patterns. If a pattern is labeled as i (1 ≤ i ≤ k), then it is an emerging pattern in D_i.

2.4 Summary

In this chapter, we discussed previous research addressing data cubing, frequent pattern mining and emerging pattern mining, all of which are essential for our project. The algorithms most closely related to our work (Bottom-Up Cubing, the Border-Differential and the DPMiner) have been described in detail.

Chapter 3

Motivation

In this chapter, we motivate the problem of contrast pattern based document analysis. We explain why contrast patterns (in particular, emerging patterns) are useful, and why data cubes should be used to analyze documents in multidimensional text databases.

3.1 Motivation for Mining Contrast Patterns

This section answers the following two questions: (1) Why do we need to mine and use contrast patterns to analyze web documents? (2) How useful are those patterns? In other words, can they make a significant contribution to a good text mining application? We answer these questions by introducing motivating scenarios from real life.

Example (Contrast Patterns in Documents) Before the Vancouver 2010 Winter Olympics, Canada had not hosted the Olympic Games for 22 years, since the Calgary 1988 Olympic Winter Games. People may therefore want to know the most attractive and discriminative features of the Vancouver 2010 Winter Olympics compared to all previous Olympic Games. Indeed, there are exciting and touching stories in almost all Olympics, and Vancouver certainly has its unique moments. For example, the Canadian figure skater Joannie Rochette won a bronze medal under the keenly felt pain of losing her mother a day before her event started. Suppose a user searches the web and Google returns her a collection of documents on the Olympics, consisting of many online sports news articles and commentaries. There may be too much information for her to read through to find unique stories about Vancouver 2010. Although there is no doubt that Joannie Rochette's accomplishment occurs frequently in articles related to Vancouver 2010, a user previously unaware of Rochette may

not be able to learn about her quickly from the search results. Similar situations arise when users compare products online by searching and reading reviews from previous buyers. Here is an example we have seen in Section 1.3: suppose a user is comparing Dell's laptop computers with Sony's. She probably wants to know the special features of Dell's laptops that Sony's do not have. For example, many reviewers may speak in favor of Dell by commenting on its "high performance-price ratio" but would not say the same of Sony. Then "high performance-price ratio" is a pattern contrasting Dell laptops with Sony laptops. Letting users determine such contrast patterns manually is not feasible. Therefore, given a collection of documents, ideally pre-classified and stored in a multidimensional text database, we need to develop efficient data models and corresponding algorithms to determine contrast patterns in documents of different classes. As mentioned in Section 1.3, we choose the emerging pattern [7] since it is a representative class of contrast patterns widely used in data mining, and good algorithms [4, 7, 16] are available for mining such patterns efficiently. Moreover, emerging patterns can contribute to other problems in text mining. A novel document classifier could be constructed based on these patterns, as they have been shown useful in building accurate classifiers [8]. Also, since emerging patterns capture the discriminative features of a class of data, they may help extract keywords to summarize a given text.

3.2 Motivation for Utilizing the Data Cube

In many real-life database applications, documents and the text data within them are stored in multidimensional text databases [24]. These databases are distinct from the traditional data sources we deal with, such as relational databases, transaction databases, and text corpora.
Formally, a multidimensional text database is defined as a relational database with text fields. A sample text database is shown in Table 3.1. The first three dimensions (Event, Time, and Publisher) are standard dimensions, just like those in relational databases. The last column is the text dimension, containing documents made up of text terms. Text databases provide structured attributes of documents, and users' information needs vary in ways that can be modeled hierarchically. This makes OLAP and data cubes applicable. For instance (using Table 3.1), if a user wants to read news on the ice hockey games reported by the Vancouver Sun on February 20, 2010, then two documents d_1

Event           Time       Publisher        ...  Text Data: Documents
Ice hockey      2010/2/20  Vancouver Sun    ...  d_1 = {t_1, t_2, t_3, t_4}
Ice hockey      2010/2/23  Global and Mail  ...  d_2 = {t_2, t_3, t_7, t_8}
Ice hockey      2010/2/20  Vancouver Sun    ...  d_3 = {t_1, t_2, t_3, t_6}
Figure skating  2010/2/20  Global and Mail  ...  d_4 = {t_2, t_4, t_6, t_7}
Figure skating  2010/2/20  Vancouver Sun    ...  d_5 = {t_1, t_3, t_5, t_7}
Curling         2010/2/23  New York Times   ...  d_6 = {t_2, t_5, t_7, t_9}
Curling         2010/2/28  Global and Mail  ...  d_7 = {t_3, t_6, t_8, t_9}

Table 3.1: A multidimensional text database concerning Olympic news.

and d_3 matching the query {Event = Ice hockey, Time = 2010/2/20, Publisher = Vancouver Sun} will be returned to her. If another user wants to skim all Olympic news reported by the Vancouver Sun on that day, we roll up to the query {Event = *, Time = 2010/2/20, Publisher = Vancouver Sun} and return documents d_1, d_3 and d_5 to her. The opposite of roll-up is the drill-down operation; in fact, roll-up and drill-down are two OLAP operations of great importance [11]. Therefore, to meet different levels of information needs, it is natural to apply the data cube to model and extend such a text database, which is exactly what the previous work in [17, 24, 26] did.

3.3 Summary

In light of the above, this chapter shows that contrast patterns are useful in analyzing large-scale text data and can give concise information about the data. Also, the nature of multidimensional text databases makes OLAP and its most essential tool, the data cube, particularly suitable for modeling and analyzing text data in documents.

Chapter 4

Our Methodology

In this chapter, we describe our methodology for tackling contrast pattern based document analysis: building a novel integrated data model through BUC data cubing [5] and two emerging pattern mining algorithms, the Border-Differential [7] and the DPMiner [16]. Section 4.1 formulates the problem we address in this work. Section 4.2 describes the processing framework and our algorithms, at both the data integration level and the algorithm integration level. Section 4.3 discusses implementation issues.

4.1 Problem Formulation

Normalizing the Data Schema for Text Databases

Suppose a collection of web documents is stored in a multidimensional text database. The text data in the documents are collected under a schema containing a set of standard non-text dimensions {SD_1, SD_2, ..., SD_n} and a set of text dimensions (terms) {TD_1, TD_2, ..., TD_m}, where m is the number of distinct text terms in the collection. For simplicity, text terms can be mapped to items, so documents can be mapped to transactions, or itemsets (sets of items that appear together). This mapping is similar to the bag-of-words model, which represents text data as an unordered collection of words, disregarding word order and count. In that sense, a multidimensional text database can be mapped to a relational base table with a transaction database. Under this mapping, each tuple in a text database corresponds to a document, in the form ⟨S, T⟩, where S is the set of standard dimension attributes

and T is a transaction. The dimension attributes can be learned through a classifier or labeled manually. Words in the document are tokenized, and each distinct token is treated as an item in the transaction. For example, the tuple corresponding to the first row in Table 3.1 is ⟨Ice hockey, 2010/2/20, Vancouver Sun, ..., d_1 = {t_1, t_2, t_3, t_4}⟩, with d_1 = {t_1, t_2, t_3, t_4} being the transaction. Furthermore, we normalize text database tuples to derive a simplified data schema. We map standard dimensions to letters, e.g., Event to A, Time to B and Publisher to C, to unify them. Likewise, dimension attributes are mapped to items in the same manner: Ice hockey is mapped to a_1, Figure skating to a_2, and so on. Table 4.1 shows a normalized dataset derived from the Olympic news database (Table 3.1).

A    B    C    ...  Transactions
a_1  b_1  c_1  ...  d_1 = {t_1, t_2, t_3, t_4}
a_1  b_2  c_2  ...  d_2 = {t_2, t_3, t_7, t_8}
a_1  b_1  c_1  ...  d_3 = {t_1, t_2, t_3, t_6}
a_2  b_1  c_2  ...  d_4 = {t_2, t_4, t_6, t_7}
a_2  b_1  c_1  ...  d_5 = {t_1, t_3, t_5, t_7}
a_3  b_2  c_3  ...  d_6 = {t_2, t_5, t_7, t_9}
a_3  b_3  c_2  ...  d_7 = {t_3, t_6, t_8, t_9}

Table 4.1: A normalized dataset derived from the Olympic news database.

Problem Modeling with the Normalized Data Schema

Given a normalized dataset as a base table, we build our integrated cube-based data model by computing a full data cube grouped by all standard dimensions (e.g., {A, B, C} in the table above). In the data cubing process, every subset of {A, B, C} is visited to form a group-by corresponding to a set of cells. Emerging patterns in each cell are mined simultaneously and stored as cell measures. When materializing each cell, we aggregate the tuples whose dimension attributes match this particular cell. The transactions of the matched tuples form the target class (or positive class), denoted as T C.
We also virtually aggregate all unmatched tuples and extract their transactions to form the background class (or negative class), denoted as BC. The membership in T C and BC varies from cell to cell; both classes are dynamically computed and formed for each cell.
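The per-cell formation of T C and BC described above can be sketched in a few lines. The following is only an illustration (the function name, the "*" convention for aggregated dimensions, and the data layout are ours, not the project's implementation), using the first four tuples of Table 4.1 and the cell (a_1, *, *):

```python
# Hypothetical sketch: partition tuples into the target class TC (tuples
# matching the cell on every non-aggregated dimension) and the background
# class BC (all remaining tuples). '*' marks an aggregated dimension.

def split_classes(tuples, cell):
    """tuples: list of (dimension-attributes, transaction) pairs.
    Returns (TC, BC) as lists of transactions."""
    tc, bc = [], []
    for dims, transaction in tuples:
        matches = all(c == "*" or c == d for c, d in zip(cell, dims))
        (tc if matches else bc).append(transaction)
    return tc, bc

table = [
    (("a1", "b1", "c1"), {"t1", "t2", "t3", "t4"}),
    (("a1", "b2", "c2"), {"t2", "t3", "t7", "t8"}),
    (("a1", "b1", "c1"), {"t1", "t2", "t3", "t6"}),
    (("a2", "b1", "c2"), {"t2", "t4", "t6", "t7"}),
]
tc, bc = split_classes(table, ("a1", "*", "*"))
print(len(tc), len(bc))  # 3 1
```

In a full implementation BC would be formed virtually rather than materialized, as noted above, since it changes for every cell.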

A transaction T is the full itemset of a tuple. A pattern X is a sub-itemset of T with non-zero support (i.e., the number of times X appears) in the given dataset. Let θ be the minimal support threshold for T C and δ be the maximal support threshold for BC. A pattern X is an emerging pattern in T C if and only if support(X, T C) ≥ θ and support(X, BC) ≤ δ. In other words, the support of X grows significantly from BC to T C, meeting the minimal growth rate threshold ρ = θ/δ. Mathematically, growth_rate(X) = support(X, T C) / support(X, BC) ≥ ρ. Note that δ can be 0, hence ρ = θ/δ = ∞. If growth_rate(X) = ∞, X is a jumping emerging pattern [7], which does not appear in BC at all. Given predefined support thresholds θ and δ, for each cell in this cube-based model, we mine all patterns whose support is at least θ in the target class T C and at most δ in its background class BC. Such patterns automatically meet the minimal growth rate threshold ρ and become measures of the cell. Once all cells and their corresponding emerging patterns are obtained, the model building process is complete. The entire process is based on data cubing and requires a seamless integration of cubing and emerging pattern mining. Example: Consider a simple example over the base table in Table 4.1. Let θ = 2 and δ = 1. Suppose at a certain stage we carry out the group-by operation on dimension A. We get three cells: (a_1, *, *), (a_2, *, *) and (a_3, *, *). Cell (a_1, *, *) aggregates the first three tuples in Table 4.1, so T C = {d_1, d_2, d_3} and BC = {d_4, d_5, d_6, d_7}. Now consider the pattern X = (t_1, t_2, t_3). It appears twice in T C (in d_1 and d_3) but not at all in BC, so support(X, T C) ≥ θ and support(X, BC) < δ. Hence X = (t_1, t_2, t_3) is a (jumping) emerging pattern in T C and becomes a measure of cell (a_1, *, *).
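The emerging-pattern test in this example can be replayed directly. The following minimal sketch (helper names are ours for illustration) checks the pattern X = (t_1, t_2, t_3) against T C = {d_1, d_2, d_3} and BC = {d_4, ..., d_7} with θ = 2 and δ = 1:

```python
# Support of a pattern = number of transactions containing all of its items.
def support(pattern, transactions):
    return sum(1 for t in transactions if pattern <= t)

TC = [{"t1", "t2", "t3", "t4"}, {"t2", "t3", "t7", "t8"},
      {"t1", "t2", "t3", "t6"}]
BC = [{"t2", "t4", "t6", "t7"}, {"t1", "t3", "t5", "t7"},
      {"t2", "t5", "t7", "t9"}, {"t3", "t6", "t8", "t9"}]

X = {"t1", "t2", "t3"}
theta, delta = 2, 1
s_tc, s_bc = support(X, TC), support(X, BC)
print(s_tc, s_bc)                       # 2 0
# Emerging-pattern test: support at least theta in TC, at most delta in BC.
print(s_tc >= theta and s_bc <= delta)  # True
```

Since support(X, BC) = 0, the growth rate is infinite and X is in fact a jumping emerging pattern, matching the discussion above.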
4.2 Processing Framework

To recapitulate, Chapter 1 introduced contrast pattern based document analysis in multidimensional text databases. We follow the idea of using data cubes and OLAP to analyze multidimensional text data, and propose to merge the BUC data cubing process with two different emerging pattern mining algorithms (the Border-Differential and the DPMiner) to build an integrated, cube-based data model designed to support contrast pattern based document analysis. In this section, following the problem formulation in Section 4.1, we propose our algorithm for integrating emerging pattern mining into data cubing. The processing framework includes both data integration and algorithm integration.

Integrating Data

To begin with, we reproduce Table 4.1 (with slight revisions) as Table 4.2 to make the following discussion clear. It shows the standard, idealized format that simplifies a multidimensional text database. The data used in our testing strictly follow this format: each row of a dataset D is a tuple of the form ⟨S, T⟩, where S is the set of dimension attributes and T is a transaction.

Tuple No.  A    B    C    F    Transactions
1          a_1  b_1  c_1  f_1  d_1 = {t_1, t_2, t_3, t_4}
2          a_1  b_2  c_2  f_1  d_2 = {t_2, t_3, t_7, t_8}
3          a_1  b_1  c_2  f_2  d_3 = {t_1, t_2, t_3, t_6}
4          a_2  b_1  c_2  f_2  d_4 = {t_2, t_4, t_6, t_7}
5          a_2  b_1  c_1  f_1  d_5 = {t_1, t_3, t_5, t_7}
6          a_3  b_2  c_3  f_3  d_6 = {t_2, t_5, t_7, t_9}
7          a_3  b_3  c_2  f_3  d_7 = {t_3, t_6, t_8, t_9}
8          a_4  b_2  c_3  f_1  d_8 = {t_6, t_8, t_11, t_12}

Table 4.2: A normalized dataset reproduced from Table 4.1.

The integration of data is indispensable because of the nature of the multidimensional text mining problem. Moreover, data cubing and emerging pattern mining algorithms originally work with data from heterogeneous sources: data cubing mainly deals with relational base tables in data warehouses, while emerging pattern mining concerns transaction databases (see the example in Table 2.4). Therefore, we first unify the heterogeneous data and then develop algorithms for a seamless integration. Thus, we model the text database and its normalized schema (Table 4.2) by appending transaction database tuples to relational base table tuples. For the integrated data, we also apply one of the optimization techniques discussed in Section 2.1.2: mapping dimension attributes of various formats to integers between zero and the cardinality of the dimension [11]. For example, in Table 4.2, dimension A has cardinality |A| = 4, so in implementation and testing, we map a_1 to 0, a_2 to 1, a_3 to 2 and a_4 to 3.
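This attribute-to-integer mapping can be sketched as follows, using column A of Table 4.2 (the helper name `encode_column` is ours; the report does not prescribe an API):

```python
# Hypothetical sketch of the normalization step: map each distinct
# dimension-attribute value to an integer in [0, cardinality), in
# first-seen order.

def encode_column(values):
    """Returns (encoded column, value-to-code mapping)."""
    codes = {}
    for v in values:
        codes.setdefault(v, len(codes))  # assign next code on first sight
    return [codes[v] for v in values], codes

column_a = ["a1", "a1", "a1", "a2", "a2", "a3", "a3", "a4"]
encoded, mapping = encode_column(column_a)
print(encoded)        # [0, 0, 0, 1, 1, 2, 2, 3]
print(mapping["a2"])  # 1
```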
Similarly, items in transactions are mapped to integers ranging from one to the total number of distinct items in the dataset. For instance, if the items in a dataset are labeled t_1 to t_100, we represent them by the integers 1 to 100. This mapping facilitates sorting and hashing in data cubing. In particular, for BUC, it allows the use of a linear-time counting sort to reorder input tuples

efficiently.

Integrating Algorithms

Our algorithm integrates data cubing and emerging pattern mining seamlessly. It carries out a depth-first search (DFS) to build the data cube and mine emerging patterns as cell measures simultaneously. The algorithm is designed to work on any valid integrated dataset like Table 4.2 (for every tuple, both the dimension attributes and the transaction must be non-empty). We outline the algorithm in the following pseudo-code (adapted from [5]).

Algorithm: Procedure BottomUpCubeWithDPMiner(data, dim, theta, delta)
Inputs:
  data: the dataset upon which we build our integrated model.
  dim: the number of standard dimensions in the input data.
  theta: the minimal support threshold of candidate emerging patterns in the target class.
  delta: the maximal support threshold of candidate emerging patterns in the background class.
Outputs: cells with their measures (emerging patterns)
Method:
 1: aggregate(data);
 2: if (data.count == 1) then
 3:   writeAncestors(data, dim);
 4:   return;
 5: endif
 6: for each dimension d (from 0 to (dim - 1)) do
 7:   C := cardinality(d);
 8:   newData := partition(data, d); // counting sort.
 9:   for each partition i (from 0 to (C - 1)) do
10:     cell := createEmptyCell();
11:     posData := newData.gatherPositiveTransactions();
12:     negData := newData.gatherNegativeTransactions();
13:     isDuplicate := determineCoverage(posData, negData);
14:     if (!isDuplicate) then
</gr-replace>
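The counting-sort partition invoked at line 8 of the pseudo-code is made possible by the integer mapping described above. The following runnable sketch (an illustration under our own naming, not the project's code) reorders integer-coded tuples by one dimension in linear time and returns the boundaries of each partition:

```python
# Hypothetical sketch of partition(data, d): stable counting sort of tuples
# on dimension d, whose attribute values are integers in [0, cardinality).

def partition(data, d, cardinality):
    """Returns (tuples reordered by dimension d, per-value (start, end)
    slice boundaries into the reordered list)."""
    counts = [0] * cardinality
    for row in data:
        counts[row[d]] += 1
    starts, total = [], 0
    for c in counts:                 # prefix sums give partition starts
        starts.append(total)
        total += c
    out, positions = [None] * len(data), list(starts)
    for row in data:                 # stable placement in input order
        out[positions[row[d]]] = row
        positions[row[d]] += 1
    bounds = [(starts[v], starts[v] + counts[v]) for v in range(cardinality)]
    return out, bounds

data = [(0, 1), (2, 0), (0, 0), (1, 1), (2, 1)]
ordered, bounds = partition(data, 0, 3)
print(ordered)  # [(0, 1), (0, 0), (1, 1), (2, 0), (2, 1)]
print(bounds)   # [(0, 2), (2, 3), (3, 5)]
```

Each `(start, end)` slice then corresponds to one partition i in the loop at line 9, i.e., one candidate cell on dimension d.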


More information

Novel Materialized View Selection in a Multidimensional Database

Novel Materialized View Selection in a Multidimensional Database Graphic Era University From the SelectedWorks of vijay singh Winter February 10, 2009 Novel Materialized View Selection in a Multidimensional Database vijay singh Available at: https://works.bepress.com/vijaysingh/5/

More information

Data Mining Part 3. Associations Rules

Data Mining Part 3. Associations Rules Data Mining Part 3. Associations Rules 3.2 Efficient Frequent Itemset Mining Methods Fall 2009 Instructor: Dr. Masoud Yaghini Outline Apriori Algorithm Generating Association Rules from Frequent Itemsets

More information

Data Warehousing and Decision Support

Data Warehousing and Decision Support Data Warehousing and Decision Support [R&G] Chapter 23, Part A CS 4320 1 Introduction Increasingly, organizations are analyzing current and historical data to identify useful patterns and support business

More information

FlowCube: Constructing RFID FlowCubes for Multi-Dimensional Analysis of Commodity Flows

FlowCube: Constructing RFID FlowCubes for Multi-Dimensional Analysis of Commodity Flows FlowCube: Constructing RFID FlowCubes for Multi-Dimensional Analysis of Commodity Flows Hector Gonzalez Jiawei Han Xiaolei Li University of Illinois at Urbana-Champaign, IL, USA {hagonzal, hanj, xli10}@uiuc.edu

More information

Basics of Dimensional Modeling

Basics of Dimensional Modeling Basics of Dimensional Modeling Data warehouse and OLAP tools are based on a dimensional data model. A dimensional model is based on dimensions, facts, cubes, and schemas such as star and snowflake. Dimension

More information

Improving the Performance of OLAP Queries Using Families of Statistics Trees

Improving the Performance of OLAP Queries Using Families of Statistics Trees Improving the Performance of OLAP Queries Using Families of Statistics Trees Joachim Hammer Dept. of Computer and Information Science University of Florida Lixin Fu Dept. of Mathematical Sciences University

More information

Correlation Based Feature Selection with Irrelevant Feature Removal

Correlation Based Feature Selection with Irrelevant Feature Removal Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 3, Issue. 4, April 2014,

More information

Implementation Techniques

Implementation Techniques V Implementation Techniques 34 Efficient Evaluation of the Valid-Time Natural Join 35 Efficient Differential Timeslice Computation 36 R-Tree Based Indexing of Now-Relative Bitemporal Data 37 Light-Weight

More information

A Novel Approach of Data Warehouse OLTP and OLAP Technology for Supporting Management prospective

A Novel Approach of Data Warehouse OLTP and OLAP Technology for Supporting Management prospective A Novel Approach of Data Warehouse OLTP and OLAP Technology for Supporting Management prospective B.Manivannan Research Scholar, Dept. Computer Science, Dravidian University, Kuppam, Andhra Pradesh, India

More information

3 No-Wait Job Shops with Variable Processing Times

3 No-Wait Job Shops with Variable Processing Times 3 No-Wait Job Shops with Variable Processing Times In this chapter we assume that, on top of the classical no-wait job shop setting, we are given a set of processing times for each operation. We may select

More information

FM-WAP Mining: In Search of Frequent Mutating Web Access Patterns from Historical Web Usage Data

FM-WAP Mining: In Search of Frequent Mutating Web Access Patterns from Historical Web Usage Data FM-WAP Mining: In Search of Frequent Mutating Web Access Patterns from Historical Web Usage Data Qiankun Zhao Nanyang Technological University, Singapore and Sourav S. Bhowmick Nanyang Technological University,

More information

CHAPTER-13. Mining Class Comparisons: Discrimination between DifferentClasses: 13.4 Class Description: Presentation of Both Characterization and

CHAPTER-13. Mining Class Comparisons: Discrimination between DifferentClasses: 13.4 Class Description: Presentation of Both Characterization and CHAPTER-13 Mining Class Comparisons: Discrimination between DifferentClasses: 13.1 Introduction 13.2 Class Comparison Methods and Implementation 13.3 Presentation of Class Comparison Descriptions 13.4

More information

Theorem 2.9: nearest addition algorithm

Theorem 2.9: nearest addition algorithm There are severe limits on our ability to compute near-optimal tours It is NP-complete to decide whether a given undirected =(,)has a Hamiltonian cycle An approximation algorithm for the TSP can be used

More information

Xcelerated Business Insights (xbi): Going beyond business intelligence to drive information value

Xcelerated Business Insights (xbi): Going beyond business intelligence to drive information value KNOWLEDGENT INSIGHTS volume 1 no. 5 October 7, 2011 Xcelerated Business Insights (xbi): Going beyond business intelligence to drive information value Today s growing commercial, operational and regulatory

More information

OLAP2 outline. Multi Dimensional Data Model. A Sample Data Cube

OLAP2 outline. Multi Dimensional Data Model. A Sample Data Cube OLAP2 outline Multi Dimensional Data Model Need for Multi Dimensional Analysis OLAP Operators Data Cube Demonstration Using SQL Multi Dimensional Data Model Multi dimensional analysis is a popular approach

More information

Infrequent Weighted Itemset Mining Using SVM Classifier in Transaction Dataset

Infrequent Weighted Itemset Mining Using SVM Classifier in Transaction Dataset Infrequent Weighted Itemset Mining Using SVM Classifier in Transaction Dataset M.Hamsathvani 1, D.Rajeswari 2 M.E, R.Kalaiselvi 3 1 PG Scholar(M.E), Angel College of Engineering and Technology, Tiruppur,

More information

Question Bank. 4) It is the source of information later delivered to data marts.

Question Bank. 4) It is the source of information later delivered to data marts. Question Bank Year: 2016-2017 Subject Dept: CS Semester: First Subject Name: Data Mining. Q1) What is data warehouse? ANS. A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile

More information

Data Mining with Oracle 10g using Clustering and Classification Algorithms Nhamo Mdzingwa September 25, 2005

Data Mining with Oracle 10g using Clustering and Classification Algorithms Nhamo Mdzingwa September 25, 2005 Data Mining with Oracle 10g using Clustering and Classification Algorithms Nhamo Mdzingwa September 25, 2005 Abstract Deciding on which algorithm to use, in terms of which is the most effective and accurate

More information

DATA WAREHOUING UNIT I

DATA WAREHOUING UNIT I BHARATHIDASAN ENGINEERING COLLEGE NATTRAMAPALLI DEPARTMENT OF COMPUTER SCIENCE SUB CODE & NAME: IT6702/DWDM DEPT: IT Staff Name : N.RAMESH DATA WAREHOUING UNIT I 1. Define data warehouse? NOV/DEC 2009

More information

A Fast Algorithm for Data Mining. Aarathi Raghu Advisor: Dr. Chris Pollett Committee members: Dr. Mark Stamp, Dr. T.Y.Lin

A Fast Algorithm for Data Mining. Aarathi Raghu Advisor: Dr. Chris Pollett Committee members: Dr. Mark Stamp, Dr. T.Y.Lin A Fast Algorithm for Data Mining Aarathi Raghu Advisor: Dr. Chris Pollett Committee members: Dr. Mark Stamp, Dr. T.Y.Lin Our Work Interested in finding closed frequent itemsets in large databases Large

More information

Data Warehousing. Overview

Data Warehousing. Overview Data Warehousing Overview Basic Definitions Normalization Entity Relationship Diagrams (ERDs) Normal Forms Many to Many relationships Warehouse Considerations Dimension Tables Fact Tables Star Schema Snowflake

More information

Data Mining Concepts & Techniques

Data Mining Concepts & Techniques Data Mining Concepts & Techniques Lecture No. 01 Databases, Data warehouse Naeem Ahmed Email: naeemmahoto@gmail.com Department of Software Engineering Mehran Univeristy of Engineering and Technology Jamshoro

More information

Effectiveness of Freq Pat Mining

Effectiveness of Freq Pat Mining Effectiveness of Freq Pat Mining Too many patterns! A pattern a 1 a 2 a n contains 2 n -1 subpatterns Understanding many patterns is difficult or even impossible for human users Non-focused mining A manager

More information

Data Preprocessing. Slides by: Shree Jaswal

Data Preprocessing. Slides by: Shree Jaswal Data Preprocessing Slides by: Shree Jaswal Topics to be covered Why Preprocessing? Data Cleaning; Data Integration; Data Reduction: Attribute subset selection, Histograms, Clustering and Sampling; Data

More information

This paper proposes: Mining Frequent Patterns without Candidate Generation

This paper proposes: Mining Frequent Patterns without Candidate Generation Mining Frequent Patterns without Candidate Generation a paper by Jiawei Han, Jian Pei and Yiwen Yin School of Computing Science Simon Fraser University Presented by Maria Cutumisu Department of Computing

More information

Answering Aggregate Queries Over Large RDF Graphs

Answering Aggregate Queries Over Large RDF Graphs 1 Answering Aggregate Queries Over Large RDF Graphs Lei Zou, Peking University Ruizhe Huang, Peking University Lei Chen, Hong Kong University of Science and Technology M. Tamer Özsu, University of Waterloo

More information

Notes for Lecture 24

Notes for Lecture 24 U.C. Berkeley CS170: Intro to CS Theory Handout N24 Professor Luca Trevisan December 4, 2001 Notes for Lecture 24 1 Some NP-complete Numerical Problems 1.1 Subset Sum The Subset Sum problem is defined

More information

2. (a) Briefly discuss the forms of Data preprocessing with neat diagram. (b) Explain about concept hierarchy generation for categorical data.

2. (a) Briefly discuss the forms of Data preprocessing with neat diagram. (b) Explain about concept hierarchy generation for categorical data. Code No: M0502/R05 Set No. 1 1. (a) Explain data mining as a step in the process of knowledge discovery. (b) Differentiate operational database systems and data warehousing. [8+8] 2. (a) Briefly discuss

More information

Cache-Oblivious Traversals of an Array s Pairs

Cache-Oblivious Traversals of an Array s Pairs Cache-Oblivious Traversals of an Array s Pairs Tobias Johnson May 7, 2007 Abstract Cache-obliviousness is a concept first introduced by Frigo et al. in [1]. We follow their model and develop a cache-oblivious

More information

ETL and OLAP Systems

ETL and OLAP Systems ETL and OLAP Systems Krzysztof Dembczyński Intelligent Decision Support Systems Laboratory (IDSS) Poznań University of Technology, Poland Software Development Technologies Master studies, first semester

More information

Lectures for the course: Data Warehousing and Data Mining (IT 60107)

Lectures for the course: Data Warehousing and Data Mining (IT 60107) Lectures for the course: Data Warehousing and Data Mining (IT 60107) Week 1 Lecture 1 21/07/2011 Introduction to the course Pre-requisite Expectations Evaluation Guideline Term Paper and Term Project Guideline

More information

FlowCube: Constructing RFID FlowCubes for Multi-Dimensional Analysis of Commodity Flows

FlowCube: Constructing RFID FlowCubes for Multi-Dimensional Analysis of Commodity Flows FlowCube: Constructing RFID FlowCubes for Multi-Dimensional Analysis of Commodity Flows ABSTRACT Hector Gonzalez Jiawei Han Xiaolei Li University of Illinois at Urbana-Champaign, IL, USA {hagonzal, hanj,

More information

TDWI Data Modeling. Data Analysis and Design for BI and Data Warehousing Systems

TDWI Data Modeling. Data Analysis and Design for BI and Data Warehousing Systems Data Analysis and Design for BI and Data Warehousing Systems Previews of TDWI course books offer an opportunity to see the quality of our material and help you to select the courses that best fit your

More information

Improved Frequent Pattern Mining Algorithm with Indexing

Improved Frequent Pattern Mining Algorithm with Indexing IOSR Journal of Computer Engineering (IOSR-JCE) e-issn: 2278-0661,p-ISSN: 2278-8727, Volume 16, Issue 6, Ver. VII (Nov Dec. 2014), PP 73-78 Improved Frequent Pattern Mining Algorithm with Indexing Prof.

More information

SCHEME OF COURSE WORK. Data Warehousing and Data mining

SCHEME OF COURSE WORK. Data Warehousing and Data mining SCHEME OF COURSE WORK Course Details: Course Title Course Code Program: Specialization: Semester Prerequisites Department of Information Technology Data Warehousing and Data mining : 15CT1132 : B.TECH

More information

EFFICIENT AND EFFECTIVE AGGREGATE KEYWORD SEARCH ON RELATIONAL DATABASES

EFFICIENT AND EFFECTIVE AGGREGATE KEYWORD SEARCH ON RELATIONAL DATABASES EFFICIENT AND EFFECTIVE AGGREGATE KEYWORD SEARCH ON RELATIONAL DATABASES by Luping Li B.Eng., Renmin University, 2009 a Thesis submitted in partial fulfillment of the requirements for the degree of MASTER

More information

Interestingness Measurements

Interestingness Measurements Interestingness Measurements Objective measures Two popular measurements: support and confidence Subjective measures [Silberschatz & Tuzhilin, KDD95] A rule (pattern) is interesting if it is unexpected

More information

Institutionen för datavetenskap Department of Computer and Information Science

Institutionen för datavetenskap Department of Computer and Information Science Institutionen för datavetenskap Department of Computer and Information Science Final thesis K Shortest Path Implementation by RadhaKrishna Nagubadi LIU-IDA/LITH-EX-A--13/41--SE 213-6-27 Linköpings universitet

More information

Mining Generalised Emerging Patterns

Mining Generalised Emerging Patterns Mining Generalised Emerging Patterns Xiaoyuan Qian, James Bailey, Christopher Leckie Department of Computer Science and Software Engineering University of Melbourne, Australia {jbailey, caleckie}@csse.unimelb.edu.au

More information

Overview. Introduction to Data Warehousing and Business Intelligence. BI Is Important. What is Business Intelligence (BI)?

Overview. Introduction to Data Warehousing and Business Intelligence. BI Is Important. What is Business Intelligence (BI)? Introduction to Data Warehousing and Business Intelligence Overview Why Business Intelligence? Data analysis problems Data Warehouse (DW) introduction A tour of the coming DW lectures DW Applications Loosely

More information

On the Near-Optimality of List Scheduling Heuristics for Local and Global Instruction Scheduling

On the Near-Optimality of List Scheduling Heuristics for Local and Global Instruction Scheduling On the Near-Optimality of List Scheduling Heuristics for Local and Global Instruction Scheduling by John Michael Chase A thesis presented to the University of Waterloo in fulfillment of the thesis requirement

More information

Mining Frequent Itemsets in Time-Varying Data Streams

Mining Frequent Itemsets in Time-Varying Data Streams Mining Frequent Itemsets in Time-Varying Data Streams Abstract A transactional data stream is an unbounded sequence of transactions continuously generated, usually at a high rate. Mining frequent itemsets

More information

Constructing a Multidimensional Topic-Concept Cube for OLAP on Large-Scale Search Logs

Constructing a Multidimensional Topic-Concept Cube for OLAP on Large-Scale Search Logs Constructing a Multidimensional Topic-Concept Cube for OLAP on Large-Scale Search Logs Zhen Liao Daxin Jiang Jian Pei Dongyeop Kang Xiaohui Sun Ho-Jin Choi Hang Li Microsoft Research Asia Simon Fraser

More information

Data warehouse architecture consists of the following interconnected layers:

Data warehouse architecture consists of the following interconnected layers: Architecture, in the Data warehousing world, is the concept and design of the data base and technologies that are used to load the data. A good architecture will enable scalability, high performance and

More information

Principles of Algorithm Design

Principles of Algorithm Design Principles of Algorithm Design When you are trying to design an algorithm or a data structure, it s often hard to see how to accomplish the task. The following techniques can often be useful: 1. Experiment

More information

A Roadmap to an Enhanced Graph Based Data mining Approach for Multi-Relational Data mining

A Roadmap to an Enhanced Graph Based Data mining Approach for Multi-Relational Data mining A Roadmap to an Enhanced Graph Based Data mining Approach for Multi-Relational Data mining D.Kavinya 1 Student, Department of CSE, K.S.Rangasamy College of Technology, Tiruchengode, Tamil Nadu, India 1

More information

A Survey Of Different Text Mining Techniques Varsha C. Pande 1 and Dr. A.S. Khandelwal 2

A Survey Of Different Text Mining Techniques Varsha C. Pande 1 and Dr. A.S. Khandelwal 2 A Survey Of Different Text Mining Techniques Varsha C. Pande 1 and Dr. A.S. Khandelwal 2 1 Department of Electronics & Comp. Sc, RTMNU, Nagpur, India 2 Department of Computer Science, Hislop College, Nagpur,

More information

Building Fuzzy Blocks from Data Cubes

Building Fuzzy Blocks from Data Cubes Building Fuzzy Blocks from Data Cubes Yeow Wei Choong HELP University College Kuala Lumpur MALAYSIA choongyw@help.edu.my Anne Laurent Dominique Laurent LIRMM ETIS Université Montpellier II Université de

More information

International Journal of Scientific Research & Engineering Trends Volume 4, Issue 6, Nov-Dec-2018, ISSN (Online): X

International Journal of Scientific Research & Engineering Trends Volume 4, Issue 6, Nov-Dec-2018, ISSN (Online): X Analysis about Classification Techniques on Categorical Data in Data Mining Assistant Professor P. Meena Department of Computer Science Adhiyaman Arts and Science College for Women Uthangarai, Krishnagiri,

More information

In the recent past, the World Wide Web has been witnessing an. explosive growth. All the leading web search engines, namely, Google,

In the recent past, the World Wide Web has been witnessing an. explosive growth. All the leading web search engines, namely, Google, 1 1.1 Introduction In the recent past, the World Wide Web has been witnessing an explosive growth. All the leading web search engines, namely, Google, Yahoo, Askjeeves, etc. are vying with each other to

More information

REPORTING AND QUERY TOOLS AND APPLICATIONS

REPORTING AND QUERY TOOLS AND APPLICATIONS Tool Categories: REPORTING AND QUERY TOOLS AND APPLICATIONS There are five categories of decision support tools Reporting Managed query Executive information system OLAP Data Mining Reporting Tools Production

More information

Carnegie Mellon Univ. Dept. of Computer Science /615 DB Applications. Data mining - detailed outline. Problem

Carnegie Mellon Univ. Dept. of Computer Science /615 DB Applications. Data mining - detailed outline. Problem Faloutsos & Pavlo 15415/615 Carnegie Mellon Univ. Dept. of Computer Science 15415/615 DB Applications Lecture # 24: Data Warehousing / Data Mining (R&G, ch 25 and 26) Data mining detailed outline Problem

More information

Materialized Data Mining Views *

Materialized Data Mining Views * Materialized Data Mining Views * Tadeusz Morzy, Marek Wojciechowski, Maciej Zakrzewicz Poznan University of Technology Institute of Computing Science ul. Piotrowo 3a, 60-965 Poznan, Poland tel. +48 61

More information