Probabilistic Abstraction Lattices: A Computationally Efficient Model for Conditional Probability Estimation

Daniel Lowd
January 14, 2004

1 Introduction

Probabilistic models have become increasingly popular in machine learning, data mining, and artificial intelligence in general. Probabilistic approaches have been applied successfully to collaborative filtering, robot navigation, text analysis, medical diagnosis, and many other areas. In general, we wish to represent the joint probability distribution among a set of variables efficiently in order to solve these and other problems. Most current approaches do so using a graph with a single node for each modeled variable, hidden or observed. While these models have proven effective in many domains, they also have certain limitations.

We focus on Bayesian networks, since they are the most popular and among the most general probabilistic graphical models. In Bayesian networks, exact inference may take time exponential in the number of variables modeled. Approximation algorithms such as Gibbs sampling can be effective, but they are only a partial solution: iterative sampling is itself slow. This is acceptable for problems with few variables, special structure, or no time constraints, but inference time is an issue in domains such as collaborative filtering, where a typical problem may have thousands of variables and require a split-second response. Even ignoring time constraints, a graphical model may not be the most effective representation for some problems.

As an alternative representation, association rules remain popular for studying market basket data. An association rule specifies the probability that a user purchasing a certain set of items will also purchase another specific item (Agrawal et al., 1993). For example, an association rule might state that customers who buy milk are likely to buy bread, with 70% probability, or that customers who buy pens and pencils are likely to buy paper, with 60% probability. Such rules can be mined efficiently from data using an algorithm such as APriori (Agrawal and Srikant, 1994). The hope is that some association rules will provide valuable insights into customer behavior and aid in the development of business strategies.

Of course, association rules alone do not provide a full joint probability distribution over all variables. They are also somewhat limited, in that they generally involve only conjunctions of true boolean variables. If these limitations could be overcome, we could construct a new type of probabilistic model in which each model component is not a variable but an abstraction that applies to a certain portion of the data, just as an association rule applies to a certain subset of the observations. In addition to providing rapid inference and perhaps better accuracy in certain domains, an abstraction-based probabilistic model could be easier to understand: while traditional graphical methods depend on an understanding of conditional probability, association rules require only an understanding of co-occurrence.
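To make the rule semantics concrete, the following minimal Python sketch estimates the confidence of a rule such as "milk ⇒ bread" from transaction data; the tiny transaction list and the resulting 2/3 figure are invented for illustration and are not taken from the paper.

    # Hypothetical market-basket transactions, each a set of purchased items.
    transactions = [
        {"milk", "bread"},
        {"milk", "bread", "paper"},
        {"milk"},
        {"pens", "pencils", "paper"},
    ]

    def confidence(antecedent, consequent, transactions):
        """Confidence of the rule antecedent => consequent: the estimated
        probability of the consequent among transactions containing the antecedent."""
        matching = [t for t in transactions if antecedent <= t]
        if not matching:
            return 0.0
        return sum(consequent <= t for t in matching) / len(matching)

    print(confidence({"milk"}, {"bread"}, transactions))   # 2 of 3 milk baskets -> 0.667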

2 Modeling with Abstractions

Every day, we use abstraction as a tool for dealing with the complexities of the world around us. Based on prior experience, learning, and cultural influence, we label certain observations as trees or houses or conference papers. Each such label is an abstract way of considering a collection of related observations in order to effectively learn, reason, and communicate with each other about the world around us. The underlying observation may actually be the effect of neurons responding to photons at certain wavelengths reflected from organic molecules composing a tree-like structure, but this is rarely a useful way to see things. By calling a tree a tree, we can infer properties based on prior experiences with trees.

Data mining algorithms are also faced with many observations, each typically represented by a collection of variables with particular values: {x_1, x_2, ..., x_n}. In this context, an abstraction would be any predicate on those values. For example, if all x_i were boolean variables, possible abstractions could include (x_1 ∧ x_2), (x_2 ∨ x_3), (x_1), (x_2 ∧ ¬x_1), or even true, the abstraction that applies to all observations. As before, multiple abstractions may apply to any given observation, and some abstractions are strictly more specific than others. For example, the abstraction (x_1 ∧ x_2) is strictly more specific than (x_1), which in turn is strictly more specific than true. This concept of an abstraction extends naturally to multi-valued or continuous data as well. Note that, by this definition, an abstraction does not specify anything else a priori about the values of the observations to which it applies.

We can model any joint probability distribution as a mixture of abstractions given three sets of information: the abstractions to use (A), the probability that each abstraction will generate the next observation (P(A_i)), and a probability distribution for each variable given each abstraction (P(x_j | A_i)). The overall probability of any observation under this generative model is:

    P(o) = Σ_{A_i ∈ A} P(A_i) Π_{j=1..n} P(x_j = o_j | A_i)

In the discrete case, we can perfectly represent any probability distribution by simply having one abstraction per possible observation and letting the probability of that abstraction be the frequency with which that observation occurs. Since this would require an exponential number of abstractions, we hope that the structure inherent in most problems will allow them to be effectively modeled with a relatively small number of abstractions. Note that Bayesian networks face a similar worst-case scenario: if all variables depend directly on all others, then the network will have an exponential number of edges and require conditional probability tables of exponential size.
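As a concrete illustration of this generative model, the sketch below evaluates P(o) for boolean observations under a small hand-specified mixture; the abstraction weights and per-variable probabilities are invented for illustration and are not taken from any experiment in this paper.

    import itertools

    # A toy mixture over n = 3 boolean variables. Each abstraction has a prior
    # weight P(A_i) and a table of per-variable probabilities P(x_j = True | A_i).
    abstractions = [
        {"weight": 0.5, "p_true": [0.9, 0.9, 0.5]},   # roughly "x_1 and x_2"
        {"weight": 0.3, "p_true": [0.5, 0.2, 0.8]},   # roughly "x_3 and not x_2"
        {"weight": 0.2, "p_true": [0.5, 0.5, 0.5]},   # catch-all abstraction "true"
    ]

    def joint_probability(obs, abstractions):
        """P(o) = Σ_i P(A_i) Π_j P(x_j = o_j | A_i)."""
        total = 0.0
        for a in abstractions:
            p = a["weight"]
            for value, p_true in zip(obs, a["p_true"]):
                p *= p_true if value else (1.0 - p_true)
            total += p
        return total

    print(joint_probability((True, True, False), abstractions))
    # The mixture defines a proper distribution: the probabilities of all 2^3
    # possible observations sum to 1.
    print(sum(joint_probability(o, abstractions)
              for o in itertools.product([True, False], repeat=3)))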
3 Learning with Abstractions

We have argued that abstractions can be effectively applied to the problem of modeling a joint probability distribution. In this section, we present the details of our algorithm for learning a Probabilistic Abstraction Lattice (PAL) from data based on these ideas. For this first exploration and set of experiments, we have focused solely on positive, conjunctive abstractions of boolean variables. That is, all abstractions are of the form (x_i ∧ x_j ∧ x_k ∧ ...). This subclass of abstractions is both easier to work with and easier to find, and such abstractions have already proven their usefulness as the basis for association rule mining.

Given a positive, conjunctive abstraction A_i, P(x_j | A_i) = 1 if x_j appears in A_i's conjunctive definition. Similarly, P(¬x_j | A_i) = 0 if x_j appears in A_i's conjunct. But what if A_i does not reference x_j at all?

3.1 Model One

The simplest approach is to assume a uniform distribution over x_j. In the boolean case, P(x_j | A_i) = P(¬x_j | A_i) = 0.5. We can generalize this to assume any constant distribution for all unspecified variables; again, in the boolean case, P(x_j | A_i) = q and P(¬x_j | A_i) = 1 - q. We refer to this simple distribution as model one.

3.2 Model Two

Model two allows each variable x_j to have a different default distribution. That is, when A_i does not reference x_j, P(x_j | A_i) = P(x_j). This introduces the additional flexibility that different variables may have very different marginal distributions, though this flexibility could also permit greater overfitting.

3.3 Model Three

In model three, an unspecified variable's distribution depends on the assumed abstraction as well. To accomplish this, each P(x_j | A_i) is a model parameter that may be learned from data. Again, the greater flexibility allows for more finely tuned abstractions but could also permit more overfitting. Model three, in addition to being the most general, can be seen as a version of the naïve Bayes classifier in which the class node is never observed and corresponds to the index of the abstraction chosen.
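To make the distinction among the three models concrete, here is a small sketch of how the factor P(x_j | A_i) might be looked up in each case for a positive conjunctive abstraction; the data structures (a set of variable indices for the conjunct, the default parameters) are illustrative assumptions rather than the paper's implementation.

    def p_var_given_abstraction(j, value, conjunct, model,
                                q=0.5, marginals=None, learned=None):
        """Return P(x_j = value | A_i) for a positive conjunctive abstraction A_i.

        conjunct:  set of variable indices appearing in A_i's definition.
        model:     1, 2, or 3, matching models one through three in the text.
        q:         shared default probability of True (model one).
        marginals: per-variable default probabilities P(x_j) (model two).
        learned:   per-abstraction table of learned P(x_j = True | A_i) (model three).
        """
        if j in conjunct:                  # variable is specified by the abstraction
            return 1.0 if value else 0.0
        if model == 1:                     # one constant default distribution
            p_true = q
        elif model == 2:                   # per-variable default distribution
            p_true = marginals[j]
        else:                              # model three: parameter learned from data
            p_true = learned[j]
        return p_true if value else 1.0 - p_true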

3.4 Selecting Abstractions

We took advantage of existing software implementations of the APriori algorithm to find frequent item sets: positive conjunctions that occur with at least a certain minimum frequency (Agrawal and Srikant, 1994). Each frequent item set became a candidate abstraction. Since association rules and frequent item sets describe certain aspects of a dataset's structure, we would expect them to be useful abstractions as well. We also included a catch-all abstraction that applies to all observations. This ensures that every observation matches at least one abstraction.

3.5 Learning Abstraction Weights

We train all model parameters using the Expectation Maximization (EM) algorithm (Dempster et al., 1977). The EM algorithm guarantees convergence to a locally optimal model with respect to the likelihood of the training data. It consists of two steps: an E-step, in which the values of all hidden data are estimated using the current model, and an M-step, in which new model parameters are chosen to maximize the likelihood of the complete data, both observed and hidden.

3.5.1 E-Step

For a Probabilistic Abstraction Lattice, the hidden variable estimated in the E-step is the abstraction from which each training example was generated. We can compute a probability distribution over this hidden variable by applying Bayes' rule and the law of total probability:

    P(A_i | o) = P(o | A_i) P(A_i) / P(o)                                 (1)
               = P(o | A_i) P(A_i) / Σ_j P(o | A_j) P(A_j)                (2)
               = P(A_i) Π_k P(o_k | A_i) / Σ_j P(o | A_j) P(A_j)          (3)

Though each observation is assumed to be generated by a single abstraction, in this step we fractionally assign each observation to any number of abstractions, based on its probability distribution over abstractions.

3.5.2 M-Step

For all models, we must recompute all P(A_i). For model one, we must also compute q; for model two, all P(x_j); and for model three, all P(x_j | A_i). All of these can be computed easily from the fractions of examples assigned to each abstraction. For example, in model three, the new value of P(x_j | A_i) is the sum of all fractional observations assigned to A_i in which x_j was true, divided by the sum of all fractional observations assigned to A_i:

    P(x_j | A_i) = Σ_{o ∈ O : o_j = true} P(A_i | o) / Σ_{o ∈ O} P(A_i | o)

Models one and two are handled similarly, taking care to count a variable's occurrences only when it is unspecified in the assigned abstraction.
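The E-step and M-step above can be written in a few lines. The sketch below performs one EM iteration for model three, reusing the weight/p_true representation of the earlier toy example; for brevity it treats every P(x_j | A_i) as a free parameter, whereas in the paper's model three the variables appearing in A_i's conjunct stay clamped at probability 1. It is an illustrative sketch, not the implementation used in the experiments.

    def em_step(data, abstractions):
        """One EM iteration: data is a list of boolean tuples; abstractions is a
        list of dicts with "weight" (P(A_i)) and "p_true" (P(x_j = True | A_i))."""
        n_vars = len(data[0])

        # E-step: responsibilities[o][i] = P(A_i | o), computed by Bayes' rule.
        responsibilities = []
        for obs in data:
            unnorm = []
            for a in abstractions:
                p = a["weight"]
                for value, p_true in zip(obs, a["p_true"]):
                    p *= p_true if value else (1.0 - p_true)
                unnorm.append(p)
            z = sum(unnorm)
            responsibilities.append([u / z for u in unnorm])

        # M-step: re-estimate P(A_i) and each P(x_j | A_i) from fractional counts.
        for i, a in enumerate(abstractions):
            frac = [r[i] for r in responsibilities]
            total = sum(frac)
            if total == 0.0:
                continue                   # no fractional examples assigned to A_i
            a["weight"] = total / len(data)
            a["p_true"] = [
                sum(f for f, obs in zip(frac, data) if obs[j]) / total
                for j in range(n_vars)
            ]
        return abstractions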

4 Inference with Abstractions

Given a Probabilistic Abstraction Lattice, we can readily compute any conditional probability in time that is linear in the number of variables and the number of abstractions. Since the conditional probability P(X | Y) is simply the ratio P(X ∧ Y) / P(Y), it is sufficient to compute marginal probabilities P(Z) in linear time. This is done as follows:

    P(Z) = Σ_i P(Z | A_i) P(A_i)

When some variable x_j is unspecified in Z, we take the corresponding factor P(Z_j | A_i) to be 1, since an unspecified variable covers all possible values. If there are n variables and m abstractions in the model, then computing P(Z | A_i) is O(n), so computing P(Z) is O(mn). This is a dramatic improvement over exact inference in Bayesian networks.

5 Empirical Evaluation

To assess the effectiveness of PALs relative to existing methods, we compared their performance to that of a Bayesian network toolkit on two versions of a web session dataset, MSWeb. We chose the WinMine Toolkit from Microsoft Research (Chickering, 2002) because its algorithms have been applied to the MSWeb dataset before (Breese et al., 1998).

Unfortunately, while performing the experiments it became clear that MSWeb's sparsity was weakening them: with only a few query variables and a few conditioning variables, most test examples simply listed web sites that were not visited. Neither model could predict very well, because there was not much information to work with. For this reason, we selected the 11 most visited web sites and a subset of the examples, so that the density was artificially increased to about 50%. Statistics for both datasets are summarized in Table 1.

    Name          Variables  Density  Training  Test
    MSWeb         293        0.01     32,711    5,000
    Dense MSWeb   11         0.52     1,782     267

Table 1: Statistics for the web session datasets used.

5.1 Likelihood Experiments

For both the MSWeb and Dense MSWeb datasets, we computed the average log likelihood over all test examples using all three PAL models, one Bayesian network, and the marginal distribution. While the Bayesian network produced by WinMine did best on MSWeb, the third PAL model did best on Dense MSWeb. For both datasets, more flexible PAL models outperformed less flexible ones, suggesting that overfitting is not yet a primary concern. The results are summarized in Table 2.

    Model         MSWeb      Dense MSWeb
    WinMine       -9.707*    -6.046
    PAL Model 3   -9.984     -5.923*
    PAL Model 2   -10.299    -6.167
    PAL Model 1   -11.0845   -6.358
    Marginal      -11.358    -6.759

Table 2: Log likelihood results for each model on both datasets. The best result for each dataset is marked with an asterisk.

To further investigate the relationship between the most effective PAL model and WinMine's Bayesian network, we varied the number of training examples used by each and found that WinMine did even better with fewer training examples, as shown in Figure 1. This suggests that a PAL might match WinMine's performance given enough training examples.

Figure 1: Dependency of the third PAL model and the WinMine model on the number of training examples used, relative to the marginal model (average log likelihood vs. fraction of training data provided).
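For concreteness, the average log likelihood reported in Table 2 and Figure 1 would be computed along these lines, reusing the joint_probability sketch from Section 2; the log base is not specified in the paper, and the natural log is assumed here.

    import math

    def average_log_likelihood(test_examples, abstractions):
        """Mean log likelihood of complete test observations under a PAL."""
        return sum(math.log(joint_probability(obs, abstractions))
                   for obs in test_examples) / len(test_examples)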

5.2 Conditional Likelihood Experiments

While Bayesian networks can efficiently compute the likelihood of a complete observation, computing a conditional probability or the likelihood of a partial observation is considerably harder. Since exact inference requires exponential time, in many cases a lengthy, iterative approximation is the best method available.

We compared the performance of Bayesian networks to that of PAL model three for conditional likelihood using the Dense MSWeb dataset. We would have liked to use the original MSWeb dataset as well, but ran into trouble running Gibbs sampling on the Bayesian network: in order to compute accurate likelihood estimates, we had to sample for 30 seconds per test example. Since we are using 5,000 test examples, a single experiment would take over 40 hours. Due to time constraints, these experiments had to be omitted. Nevertheless, this example serves to show that the running time of Gibbs sampling on a Bayesian network is significant, even problematic. The PAL model was able to perform the same task in less than a minute.

On the Dense MSWeb dataset, we ran one experiment for each possible number of conditioning and query variables. In a first test run, we simply hid portions of the original 267 test examples for the conditioning and query variables. However, with only 267 examples, the sampling variance from randomly choosing which variables to hide resulted in a jagged graph. By duplicating the original examples 100 times before doing the hiding, we were able to decrease the variance and smooth the graph. Our results for Dense MSWeb are shown in Figure 2. PAL model three consistently outperforms the WinMine model, and it does so by an increasingly large margin as the number of query or conditioning variables increases.

Figure 2: Conditional likelihood performance of the third PAL model (solid lines) and WinMine Bayesian networks with 1000 iterations of Gibbs sampling (dashed lines), plotted as conditional log likelihood vs. number of query variables. Each line holds the number of conditioning variables constant at some value from 0 through 10. For either model, the lines with higher conditional likelihood are those with more conditioning variables.
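The conditional queries used in these experiments can be answered with two linear-time marginal evaluations, as described in Section 4. The sketch below follows that description using the same toy representation as the earlier examples; partial assignments are given as {variable index: boolean} dictionaries, an illustrative choice rather than the paper's interface.

    def marginal(assignment, abstractions):
        """P(Z) = Σ_i P(A_i) Π_{j in Z} P(x_j = z_j | A_i).
        Variables not mentioned in the assignment contribute a factor of 1."""
        total = 0.0
        for a in abstractions:
            p = a["weight"]
            for j, value in assignment.items():
                p_true = a["p_true"][j]
                p *= p_true if value else (1.0 - p_true)
            total += p
        return total

    def conditional(query, evidence, abstractions):
        """P(query | evidence) = P(query ∧ evidence) / P(evidence), in O(mn) time."""
        joint = {**evidence, **query}      # combine the two partial assignments
        return marginal(joint, abstractions) / marginal(evidence, abstractions)

    # Hypothetical usage: probability that site x_0 is visited given x_2 was visited.
    # conditional({0: True}, {2: True}, abstractions)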

6 Conclusion

We have presented a novel alternative to traditional graphical probabilistic models, based on a mixture of abstractions rather than a graph of conditional dependencies. On one of the datasets we used, our new approach outperformed one of the leading Bayesian network algorithms. For conditional likelihood estimation, it was faster by several orders of magnitude while achieving greater accuracy.

For future work, we plan to continue exploring the application of abstraction-based models to other domains in order to determine where their true strengths and weaknesses lie. One of the next steps will be to learn abstractions inductively rather than relying on a frequent set algorithm, and then to experiment with multi-valued and continuous data.

References

Agrawal, R., Imielinski, T., and Swami, A. N. (1993). Mining association rules between sets of items in large databases. In Buneman, P. and Jajodia, S., editors, Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, pages 207-216, Washington, D.C.

Agrawal, R. and Srikant, R. (1994). Fast algorithms for mining association rules. In Bocca, J. B., Jarke, M., and Zaniolo, C., editors, Proceedings of the 20th International Conference on Very Large Data Bases (VLDB), pages 487-499. Morgan Kaufmann.

Breese, J. S., Heckerman, D., and Kadie, C. (1998). Empirical analysis of predictive algorithms for collaborative filtering. In Proceedings of the 14th Annual Conference on Uncertainty in Artificial Intelligence, pages 43-52.

Chickering, D. M. (2002). The WinMine Toolkit. Technical Report MSR-TR-2002-103, Microsoft Research, Redmond, WA.

Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1):1-38.