Value Added Association Rules

T.Y. Lin
San Jose State University
drlin@sjsu.edu

Glossary

Association Rule Mining. Association rule mining is an exploratory learning task to discover hidden dependency relationships among items, such as state and action, in a database.

Relation. A relation is a tuple (H, B) with H, the header, and B, the body, a set of tuples that all have the domain H. Such a relation closely corresponds to what is usually called the extension of a predicate in first-order logic, except that here we identify the places in the predicate with attribute names. Usually in the relational model a database schema is said to consist of a set of relation names, the headers associated with these names, and the constraints that should hold for every instance of the database schema.

Relational Database. A relational database is a finite set of relation schemas (called a database schema) and a corresponding set of relation instances (called a database instance). The relational database model represents data as two-dimensional tables called relations and consists of three basic components: a set of domains and a set of relations, operations on relations, and integrity rules.

Data Mining. Data mining (sometimes called data or knowledge discovery) is the process of analyzing data from different perspectives and summarizing it into useful information - information that can be used to increase revenue, cut costs, or both.

Summary

Association rule mining finds interesting associations and/or correlation relationships among large sets of data items. Association rules show attribute-value conditions that occur frequently together in a given dataset. A typical and widely used example of association rule mining is Market Basket Analysis. For example, data are collected using bar-code scanners in supermarkets. Such market basket databases consist of a large number of transaction records; each record lists all items bought by a customer in a single purchase transaction. Managers would be interested to know whether certain groups of items are consistently purchased together. They could use these data for adjusting store layouts (placing items optimally with respect to each other), for cross-selling, for promotions, for catalog design, and to identify customer segments based on buying patterns.

Association rules provide information of this type in the form of if-then statements. These rules are computed from the data and, unlike the if-then rules of logic, association rules are probabilistic in nature. In addition to the antecedent (the if part) and the consequent (the then part), an association rule has two numbers that express the degree of uncertainty about the rule. In association analysis the antecedent and consequent are sets of items (called itemsets) that are disjoint (they have no items in common).

The first number is called the support for the rule. The support is simply the number of transactions that include all items in the antecedent and consequent parts of the rule. (The support is sometimes expressed as a percentage of the total number of records in the database.) The other number is known as the confidence of the rule. Confidence is the ratio of the number of transactions that include all items in the consequent as well as the antecedent (namely, the support) to the number of transactions that include all items in the antecedent. For example, if a supermarket database has 100,000 point-of-sale transactions, out of which 2,000 include both items A and B and 800 of these also include item C, the association rule "if A and B are purchased then C is purchased on the same trip" has a support of 800 transactions (alternatively, 0.8% = 800/100,000) and a confidence of 40% (= 800/2,000). One way to think of support is that it is the probability that a randomly selected transaction from the database will contain all items in the antecedent and the consequent, whereas the confidence is the conditional probability that a randomly selected transaction will include all the items in the consequent given that the transaction includes all the items in the antecedent.

Lift is one more parameter of interest in association analysis. Lift is the ratio of confidence to expected confidence. Expected confidence, in the above example, is the confidence we would see if buying A and B did not enhance the probability of buying C: it is the number of transactions that include the consequent divided by the total number of transactions. Suppose the total number of transactions that include C is 5,000. Then the expected confidence is 5,000/100,000 = 5%, and for our supermarket example the lift = confidence/expected confidence = 40%/5% = 8. Hence lift is a value that tells us how much the probability of the then part (consequent) increases given the if part (antecedent).
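
The arithmetic above is easy to mechanize. Below is a minimal Python sketch using the counts from the worked example; the variable names are ours, not the paper's.

    # Worked example from the text: 100,000 transactions, 2,000 contain
    # both A and B, 800 of those also contain C, and 5,000 contain C overall.
    total_transactions = 100_000
    count_antecedent = 2_000       # transactions with A and B
    count_rule = 800               # transactions with A, B and C
    count_consequent = 5_000       # transactions with C

    support = count_rule / total_transactions                    # 0.008 -> 0.8%
    confidence = count_rule / count_antecedent                   # 0.40  -> 40%
    expected_confidence = count_consequent / total_transactions  # 0.05  -> 5%
    lift = confidence / expected_confidence                      # 8.0

    print(f"support={support:.1%} confidence={confidence:.0%} lift={lift:g}")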

Abstract

Value added product is an industrial term referring to a minor addition to some major product. In this paper, we borrow the term to denote a minor semantic addition to the well-known association rules. We consider the addition of numerical values to the attribute values, such as sale price, profit, degree of fuzziness, level of security, and so on. Such additions lead to the notion of random variables (as added values to the attributes) in the data model, and hence to probabilistic considerations in data mining.

1 Introduction

Association rules are mined from transaction databases with the goal of improving sales and services. Two standard measures, called support and confidence, are used for mining association rules. However, neither measure is directly linked to the use of association rules in the context of marketing. To resolve this problem, many proposals have been made that add market semantics to the data model. Using first-order logic, one can add semantics by functions and/or relations (function symbols or predicates). Barber and Hamilton, and Lu et al., considered semantics/constraints prescribed by binary relations (= neighborhood systems) and predicates, respectively [7,?,?]. With the introduction of such semantics or constraints, the mined association rules are more suitable for marketing purposes. In this paper, we consider a framework for value added association rules by attaching numerical values to itemsets, representing the profits, importance, or benefits of itemsets. Within the proposed framework, we re-examine some fundamental issues and open doors to a probabilistic approach to data mining.

2 Semantics and Relational Data Model

Relational database theory assumes that the universe is a classical set; the data are discrete and no additional structures are embedded. In practice, additional semantics often exist. For example, there are, say, monetary values attached to objects, similarities among events, distances between locations, and so on. To express the additional semantics, we need to extend the expressive power of the relational model. This may be achieved by adopting first-order logic, which uses relations and functions, or predicates and function symbols, to capture such additional semantic information.

2.1 Structure Added by Relations

There are many studies on semantic modeling of relationships between objects in a database. Typically, the relationships are expressed in terms of some specific relations or predicates in the logic view of databases. Details of such models can be found in [7,6,5,9,11].

2.2 Value Added by Functions

In this paper, we focus on the data model with values added by functions. For an attribute A_j, we assume that there exists a non-negative real-valued function, f_j : Dom(A_j) -> R+, called a value added function, where Dom(A_j) is the domain of the attribute. An attribute can itself be regarded as a map, A_j : U -> Dom(A_j). By composing f_j with A_j, we have X_j = f_j o A_j : U -> R+, which maps an object to a non-negative real number. For simplicity, we write the inverse image as X_ju = (X_j)^-1(X_j(u)). It consists of all objects having the same value on A_j, and is called a granule, namely the equivalence class containing u. The counting probability P(X_ju) = |X_ju| / |U| gives:

Proposition 1. X_j is a random variable.

A random variable is not a variable that varies randomly; it is merely a function whose numerical values are determined by chance. In other words, the chance (probability) of the function taking each of its individual values is known. See [2] (page 88) for connections between the mathematical notion and its intuition.
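
To make the construction concrete, here is a small Python sketch of an attribute map A_j, a value added function f_j, the composite X_j = f_j o A_j, and the granule and counting probability of an object. It is our own illustration with assumed toy values, not code from the paper.

    # Toy universe of five objects with one attribute A_j (assumed data).
    U = ["u1", "u2", "u3", "u4", "u5"]
    A_j = {"u1": "milk", "u2": "bread", "u3": "milk", "u4": "milk", "u5": "bread"}

    # Value added function f_j on Dom(A_j), e.g. unit profit (assumed values).
    f_j = {"milk": 0.40, "bread": 0.25}

    def X_j(u):
        # X_j = f_j o A_j : U -> R+
        return f_j[A_j[u]]

    def granule(u):
        # All objects with the same A_j-value as u: the equivalence class
        # containing u, i.e. the inverse image (X_j)^-1(X_j(u)).
        return [v for v in U if A_j[v] == A_j[u]]

    def counting_probability(u):
        # P(X_ju) = |X_ju| / |U|
        return len(granule(u)) / len(U)

    print(granule("u1"), counting_probability("u1"))  # ['u1', 'u3', 'u4'] 0.6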

Definition 1.

1. The system (U, A_j, Dom(A_j), j = 1, 2, ..., n) is called a granular data model (GDM). This model allows one to generate automatically all possible attributes (features), including concept hierarchies [4].

2. The system (U, A_j, X_j, j = 1, 2, ..., n) is called a value added granular data model (VA-GDM).

We will work in the value added granular data model (U, A_j, X_j, j = 1, 2, ..., n). An itemset is a sub-tuple in a relation. In terms of the GDM, a sub-tuple corresponds to a finite intersection of elementary granules. By abuse of notation, we will use the same symbol to denote both an attribute value and the corresponding elementary granule, so a sub-tuple b = (b_1, b_2, ..., b_q) can also mean the finite intersection, b = b_1 ∩ b_2 ∩ ... ∩ b_q, of elementary granules.

3 Value Added Association Rules

The value function f_j may be associated with intuitive interpretations such as profits. It seems intuitively natural to compute profit additively, namely f(A) = Σ_{i in A} f(i) for an itemset A in association rule mining. In general, however, the value may not be additive. For example, in security, the level of security of an itemset is often computed by f(A) = Max_{i in A} f(i), and integrity by f(A) = Min_{i in A} f(i). We will use the semantically neutral term and call f a value function.
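
As a quick illustration of these three combination modes, here is a short sketch of our own, with assumed per-item values:

    # Three ways Section 3 combines item values into an itemset value:
    # additive (profit-like), Max (security-like), Min (integrity-like).
    item_value = {"A": 0.8, "B": 0.5, "C": 0.9}  # assumed f(i) per item

    def f_additive(itemset):
        return sum(item_value[i] for i in itemset)

    def f_max(itemset):
        return max(item_value[i] for i in itemset)

    def f_min(itemset):
        return min(item_value[i] for i in itemset)

    print(f_additive({"A", "B"}), f_max({"A", "B"}), f_min({"A", "B"}))
    # 1.3 0.8 0.5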

Definition 2. Large value itemsets (LVA-itemsets); by abuse of language, we may refer to them as value added association rules (not in rule form). Let B be a subset of the attributes A, let f be a real-valued function that assigns a value to each itemset, and let s_q be a given threshold value for q-itemsets, q = 1, 2, ...

1. Sum-version: A granule b = (b_1 ∩ b_2 ∩ ... ∩ b_q), namely a sub-tuple b = (b_1, b_2, ..., b_q), is a large value q-VA-itemset if Sum(b) >= s_q, where

   Sum(b) = Σ_j x_0^j · p(x_0^j) = (Σ_j f_j(b_j)) · |b| / |U|,   (1)

   and x_0^j = f_j(b_j).

2. Min-version: A granule b = (b_1 ∩ b_2 ∩ ... ∩ b_q) is a large value q-VA-itemset if Min(b) >= s_q, where

   Min(b) = Min_j x_0^j · p(x_0^j) = (Min_{j=1..q} f_j(b_j)) · |b| / |U|.   (2)

3. Max-version: A granule b = (b_1 ∩ b_2 ∩ ... ∩ b_q) is a large value q-VA-itemset if Max(b) >= s_q, where

   Max(b) = Max_j x_0^j · p(x_0^j) = (Max_{i=1..q} f_i(b_i)) · |b| / |U|.   (3)

4. Traditional version: The Max- and Min-versions become the traditional one iff the profit function is the constant 1.

5. Mean version: It captures the mean trends of the data. Two attributes A_j1, A_j2 are mean associated if |E(X_j1) - E(X_j2)| >= s_q, where E(.) is the expected value and |.| is the absolute value.

An LVA-itemset is an association rule without direction, since we have used only the support. One can easily derive value added and directed association rules from LVA-itemsets.
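
The three measures are straightforward to compute once a granule's support count is known. A minimal sketch of Definition 2 follows (our own illustration; the numbers are assumed):

    # A granule b is described by the component values f_j(b_j) and by its
    # support count |b| in a universe of |U| objects.
    def sum_measure(values, count, universe_size):
        # Sum(b) = (f_1(b_1) + ... + f_q(b_q)) * |b| / |U|, equation (1)
        return sum(values) * count / universe_size

    def min_measure(values, count, universe_size):
        # Min(b) = (min_j f_j(b_j)) * |b| / |U|, equation (2)
        return min(values) * count / universe_size

    def max_measure(values, count, universe_size):
        # Max(b) = (max_j f_j(b_j)) * |b| / |U|, equation (3)
        return max(values) * count / universe_size

    values = [0.8, 0.5, 0.9]                      # f_j(b_j) for a 3-sub-tuple
    print(round(sum_measure(values, 2, 10), 2))   # (0.8+0.5+0.9) * 2/10 = 0.44
    print(round(min_measure(values, 2, 10), 2))   # 0.5 * 2/10 = 0.1
    print(round(max_measure(values, 2, 10), 2))   # 0.9 * 2/10 = 0.18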

3.1 Algorithms for the Sum-version

An immediate thought would be to mimic the classical theory. Unfortunately, apriori may not always be applicable. Note that counting plays the major role in classical association rules, whereas in the value added case the function values are the main concern: thresholds are compared against the sum, max, min, or average of the function values. Thus, the results are quite different. Consider the case q = 2. Assume s_1 = s_2 and f is not the constant 1. Let b = b_1 ∩ b_2 be a 2-large granule. We have

   Sum(b_1) = f(b_1) · |b_1| / |U|,   Sum(b_2) = f(b_2) · |b_2| / |U|,   (4)

   Sum(b) = Sum(b_1) + Sum(b_2) >= s_2.   (5)

In the classical case, b ⊆ b_i, i = 1, 2, and apriori exploits this relationship. In the value added case, no such relationship is available, so the apriori criteria are not useful. The algorithm for finding value added association rules is a brute-force exhaustive search: each q is computed independently.

3.2 Algorithm for the Max- and Min-versions

As above, the key question is: can we conclude any relationship among M(b_1), M(b_2), and M(b), where M = Max or Min? There is nothing for Max, but for Min we do have

   Min(b) = Min(f(b_1), f(b_2)) · |b| / |U| <= f(b_i) · |b_i| / |U| = Min(b_i),   i = 1, 2,   (6)

since Min(f(b_1), f(b_2)) <= f(b_i) and |b| <= |b_i|. Hence, we have apriori algorithms for the Min-version.
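
A sketch of the resulting search strategy follows (our own illustration, with an assumed data layout). The Min-version would additionally admit apriori-style pruning via inequality (6); the sketch shows only the brute-force Sum-version search, where each q is processed independently.

    from itertools import combinations

    # table: list of tuples over attributes 0..n-1; f[j][v] is f_j(v).
    def granule_support(table, subtuple):
        # |b|: rows matching every (attribute, value) pair of the sub-tuple
        return sum(all(row[j] == v for j, v in subtuple.items()) for row in table)

    def sum_large_itemsets(table, f, q, s_q):
        # Brute-force Sum-version: every q-sub-tuple that occurs in the data
        # is checked on its own, since no apriori relation links Sum(b) to
        # the sums of its parts.
        n, total = len(table[0]), len(table)
        found = []
        for attrs in combinations(range(n), q):
            for row in {tuple(r[j] for j in attrs) for r in table}:
                b = dict(zip(attrs, row))
                value = sum(f[j][v] for j, v in b.items())
                if value * granule_support(table, b) / total >= s_q:
                    found.append(b)
        return found

    table = [("a", "x"), ("a", "y"), ("a", "x")]
    f = [{"a": 0.8}, {"x": 0.5, "y": 0.2}]
    print(sum_large_itemsets(table, f, 2, 0.5))  # [{0: 'a', 1: 'x'}]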

3.3 Experiments

This section reports the experimental results of the algorithms for LVA-itemsets. There are three basic routines: generating the candidates (potential LVA-itemsets), counting the candidates, and finally selecting the LVA-itemsets that exceed the threshold. Finding the LVA-itemsets is an exhaustive search, conducted from the longest length to the shortest. For each q, we search all q-tuples. The generated raw data set has 8 attributes and 500 tuples. The threshold for selecting an itemset is 5.7. Two potential LVA-itemsets are embedded in the data. Each granule is represented by an (attribute, value) pair.

1. The LVA-itemset (LVA-granule) of length 6 is ((C175), (C490), (C524), (C661), (C752), (C84)). Its frequency is 2, and the sum of its weights is (0.8 + 0.7 + 0.2 + 0.3 + 0.7 + 0.2) * 2 = 5.8.

2. The LVA-itemset (LVA-granule) of length 8 is ((C175), (C246), (C323), (C445), (C556), (C679), (C779), (C817)). Its frequency is 1, and the sum of its weights is (0.8 + 0.5 + 0.9 + 0.6 + 0.3 + 0.9 + 0.8) * 1 = 5.7.

The results of finding LVA-itemsets based on weights are summarized as follows:

   Length   Candidates   Generation time   Count time   LVA-itemsets
   1             110          0.01            0.0              88
   2            4491         11.20            0.561           416
   3           22540        322.173           3.585           342
   4           34200        601.505           6.559            42
   5           27926        327.671           6.269            14
   6           13997         43.562           3.606             1
   7            4000          1.112           1.172             0
   8            5000          0.020           0.170             6
   Total      107764                                          909

In this table, the first column is the length of the itemsets, and the second column is the number of candidates in the given data (for length 8 it is the table length, 5000 rows). The third and fourth columns are the times needed to generate the candidates and to count their support; the fifth column is the number of LVA-itemsets found (those meeting the criteria). Generating candidates dominates the runtime of the algorithm. Since the dataset is converted to granules, counting the candidates is fast. The runtime is independent of the threshold: the number of candidates to be checked is the same regardless of the threshold value. LVA-itemsets are found at arbitrary lengths; as the length-6 case shows, longer LVA-itemsets are not implied by shorter ones. The algorithms could be improved, but performance is not our focus here.
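
The embedded itemsets can be checked against the threshold with one line of arithmetic. A sketch of our own, mirroring the weighted-sum criterion as reported in the experiment (which appears to use the raw frequency rather than |b|/|U|):

    weights = [0.8, 0.7, 0.2, 0.3, 0.7, 0.2]     # length-6 itemset weights
    frequency = 2
    print(round(sum(weights) * frequency, 1))     # 5.8 >= threshold 5.7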

4 Probabilistic Data Mining Theory

The VA-GDM (U, A_j, X_j, j = 1, 2, ..., n) provides a framework for probabilistic considerations. The model naturally produces a numerical information table, so we can immediately apply techniques from numerical databases. Let Y_i = X_{j_i}, i = 1, 2, ..., m, with m <= n, be the reduct [10], that is, the smallest functionally independent subset. The collection V = {(Y_1(u), Y_2(u), ..., Y_m(u)) : u in U} is a finite set of points in Euclidean space. Since U, and hence V, is finite, a functional dependency can take polynomial form, so the rest of the X_j are polynomials over the Y_i. We will regard them as random variables over V. By combining the work of [4] and [3], we can express all possible numerical attributes (features) by finitely many polynomials over the Y_i. In other words, we will be able to search for association rules over all possible attributes, not restricted to the given attributes, using probability theory. We will report this study in the near future.

5 Conclusions

Value added association rules extend standard association rules by taking the semantics of the data into consideration. The value added granular data model allows us to import probability theory into data mining. In general, there are no apriori criteria for the value added cases. However, if we require the thresholds to increase with the lengths, that is, s_q >= Max(s_1, s_2, ..., s_q), there are apriori criteria: q-large implies that all sub-tuples are (q - i)-large, where i >= 0. This paper reports our preliminary findings, and more results will be presented in the near future.