Concurrent Apriori Data Mining Algorithms


Vassil Halatchev
Department of Electrical Engineering and Computer Science
York University, Toronto
October 8, 2015

Outline
Why it is important
Introduction to Association Rule Mining (a Data Mining technique)
Overview of the sequential Apriori algorithm
The 3 parallel Apriori algorithm implementations
Future work

What is Data Mining?
Mining knowledge from data.
Data mining [Han, 2001]: the process of extracting interesting (non-trivial, implicit, previously unknown and potentially useful) knowledge or patterns from data in large databases.
Objectives of data mining:
Discover knowledge that characterizes general properties of data
Discover patterns in previous and current data in order to make predictions on future data
Source: Data Mining CSE6412

Big Data Era
Term introduced by Roger Magoulas in 2010
"A massive volume of both structured and unstructured data that is so large it is difficult to process using traditional database and software techniques" - Webopedia
Multicore machines allow efficient concurrent computation which, given proper synchronization techniques, can significantly reduce task completion times

Big Data Era
45 zettabytes (45 x 1000^4 gigabytes) of data produced in 2020

Why Mine Association Rules?
Source: Data Mining CSE6412

Association Rule Mining Applications
Market basket analysis (e.g. stock market, shopping patterns)
Medical diagnosis (e.g. causal effect relationships)
Census data (e.g. population demographics)
Bio-sequences (e.g. DNA, protein)
Web logs (e.g. fraud detection, web page traversal patterns)

What Kind of Databases?
Source: Data Mining CSE6412

Definition of Association Rule
Source: Data Mining CSE6412

Support and Confidence: Example
Source: Data Mining CSE6412
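To make the two measures concrete, here is a minimal sketch using the standard definitions: the support of an itemset is the fraction of transactions that contain it, and the confidence of a rule X => Y is count(X and Y together) divided by count(X). The toy transactions and the function names are hypothetical and are not taken from the slides.

```python
# Hypothetical toy database: each transaction is a set of items.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]

def count(itemset, db):
    """Number of transactions in db that contain every item of itemset."""
    items = set(itemset)
    return sum(1 for t in db if items <= t)

def support(itemset, db):
    """Fraction of transactions that contain the itemset."""
    return count(itemset, db) / len(db)

def confidence(antecedent, consequent, db):
    """Of the transactions containing the antecedent, the share that
    also contain the consequent."""
    both = set(antecedent) | set(consequent)
    return count(both, db) / count(antecedent, db)

# Rule {bread} => {milk}: 3 of 5 transactions contain both items,
# and 3 of the 4 transactions with bread also contain milk.
print(support({"bread", "milk"}, transactions))
print(confidence({"bread"}, {"milk"}, transactions))
```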

Mining Association Rules
Source: Data Mining CSE6412

How to Mine Association Rules
Source: Data Mining CSE6412

Candidate Generation
How to Generate Candidates? (i.e. how to generate C_{k+1} from L_k)
Example of Candidate Generation
Source: Data Mining CSE6412
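A minimal sketch of how C_{k+1} can be generated from L_k, assuming the usual Apriori join-and-prune scheme: two frequent k-itemsets that agree on their first k - 1 items are joined, and the result is kept only if all of its k-item subsets are frequent. The function name and the sample L_3 are illustrative assumptions.

```python
from itertools import combinations

def generate_candidates(frequent_k):
    """Build C_{k+1} from L_k. Itemsets are handled as sorted tuples."""
    frequent = {tuple(sorted(s)) for s in frequent_k}
    candidates = set()
    for a, b in combinations(sorted(frequent), 2):
        # Join step: a and b differ only in their last item.
        if a[:-1] == b[:-1]:
            cand = a + (b[-1],)
            # Prune step: every k-item subset of cand must be frequent.
            if all(sub in frequent for sub in combinations(cand, len(a))):
                candidates.add(cand)
    return candidates

# Illustrative L_3. abc + abd joins to abcd, which survives pruning.
# acd + ace joins to acde, which is pruned because ade is not frequent.
L3 = [("a", "b", "c"), ("a", "b", "d"), ("a", "c", "d"),
      ("a", "c", "e"), ("b", "c", "d")]
print(generate_candidates(L3))
```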

Apriori Algorithm
Proposed by Agrawal and Srikant in 1994
Apriori Algorithm (Flow Chart)
Apriori Algorithm Example
Source: Data Mining CSE6412
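The flow chart and example did not survive extraction, so the following is only a compact sketch of the sequential level-wise loop as it is commonly described: count 1-itemsets, then repeatedly generate candidates from the previous frequent set, count them in one database pass, and keep those meeting the minimum support. The function name, the absolute `min_count` threshold and the toy database are assumptions.

```python
from itertools import combinations

def apriori(transactions, min_count):
    """Return {itemset (sorted tuple): support count} for all frequent itemsets."""
    db = [frozenset(t) for t in transactions]
    # Pass 1: count single items.
    counts = {}
    for t in db:
        for item in t:
            counts[(item,)] = counts.get((item,), 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_count}
    result = dict(frequent)
    while frequent:
        prev = sorted(frequent)
        k = len(prev[0])
        # Candidate generation: join on a shared prefix, prune by k-subsets.
        candidates = []
        for a, b in combinations(prev, 2):
            if a[:-1] == b[:-1]:
                cand = a + (b[-1],)
                if all(s in frequent for s in combinations(cand, k)):
                    candidates.append(cand)
        # One pass over the database to count the candidates.
        counts = {c: sum(1 for t in db if t.issuperset(c)) for c in candidates}
        frequent = {s: c for s, c in counts.items() if c >= min_count}
        result.update(frequent)
    return result

# Hypothetical four-transaction database, minimum support count 2.
demo_db = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
demo_result = apriori(demo_db, 2)
print(sorted(demo_result.items()))
```

The loop stops as soon as a pass yields no frequent itemsets, which is the same termination test each processor applies independently in the parallel variants.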

My Paper
Rakesh Agrawal and John C. Shafer. Parallel mining of association rules: Design, implementation and experience. Technical report, IBM, 1996.
Rakesh Agrawal and John C. Shafer. Parallel mining of association rules. IEEE Transactions on Knowledge and Data Engineering, (6):962-969, 1996.
Source: Google Scholar, Rakesh Agrawal

3 Parallel Apriori Algorithms
IMPORTANT: the algorithms are implemented on a shared-nothing multiprocessor communicating via a Message Passing Interface (MPI).
Count Distribution: each processor calculates its candidate set counts from its local database and, at the end of each pass, sends its candidate set counts to all other processors.
Data Distribution: each processor is assigned a mutually exclusive partition of the candidate set, on which it computes the counts; at the end of each pass it sends its candidate set tuples to all other processors.
Candidate Distribution: both the candidate set and the database are partitioned during some pass k, so that each processor can operate independently.

Notations
Source: My Paper

Count Distribution Algorithm
Pass k = 1:
1. Processor P_i scans over its data partition D_i, reading one transaction tuple (i.e. (TID, X)) at a time, and builds its local C_1^i, storing it in a hash table (a new entry is created if necessary).
2. At the end of the pass every P_i loads the contents of its local C_1^i into a buffer and sends it out to all other processors.
3. At the same time each P_i receives the send buffer from another processor and increments the count of every element in its local C_1^i hash table that is present in the buffer; for a buffered element not yet in the table, a new entry is created.
4. P_i now has the entire candidate set C_1 with global support counts for each candidate itemset.
Steps 2 and 3 require synchronization.

Count Distribution Algorithm Cont. (Pass k = 1 Example)

Processor/Node 1
Itemset   Support
{a}       15
{b}       5
{c}       7
{d}       2

Processor/Node 2 and Processor/Node 3
Itemset   Node 2 support   Node 3 support
{a}       5                2
{b}       2                1
{c}       1                4
{d}       3                9
{e}       6 (held by only one of the two nodes; the flattened layout does not show which)

Processor/Node 1 at end of pass
Itemset   Support
{a}       22
{b}       8
{c}       12
{d}       14
{e}       6
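The merge that Node 1 performs in steps 3 and 4 can be sketched with `collections.Counter`, using the numbers from the pass k = 1 example. This is an in-process illustration only, with no message passing. Placing {e} on Node 3 is an assumption, since the source does not make clear which of the two remote nodes holds it, and the function name is hypothetical.

```python
from collections import Counter

# Local pass-1 counts taken from the example.
node1 = Counter({"a": 15, "b": 5, "c": 7, "d": 2})
node2 = Counter({"a": 5, "b": 2, "c": 1, "d": 3})
# Assumption: {e} is placed on node 3; the source leaves this ambiguous.
node3 = Counter({"a": 2, "b": 1, "c": 4, "d": 9, "e": 6})

def merge_counts(local, received_buffers):
    """Add every received buffer into a copy of the local hash table.
    Items missing locally (here {e}) get a new entry."""
    merged = Counter(local)
    for buf in received_buffers:
        merged.update(buf)  # Counter.update adds counts, it does not overwrite
    return merged

global_counts = merge_counts(node1, [node2, node3])
print(dict(global_counts))
```

Because addition is commutative, every node ends up with the same table regardless of the order in which buffers arrive.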

Count Distribution Algorithm Cont.
Pass k > 1:
1. Every processor P_i generates C_k using the frequent itemsets L_{k-1} created at pass k - 1.
2. Processor P_i goes over its local database partition D_i and develops local support counts for the candidates in C_k.
3. Processor P_i exchanges its local C_k counts with all other processors to develop global C_k counts. Processors are forced to synchronize in this step.
4. Each processor P_i now computes L_k from C_k.
5. Each processor P_i decides whether to continue to the next pass or terminate (the decision will be identical, as the processors all have identical L_k).
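Steps 2 to 4 of a pass k > 1 can be simulated in a single process, as in the sketch below. A plain sum over per-processor counters stands in for the synchronized exchange that the real implementation performs over MPI. The function names, the partitions and the threshold are assumptions.

```python
from collections import Counter

def local_counts(partition, candidates):
    """Step 2: one processor counts every candidate over its own partition only."""
    return Counter({c: sum(1 for t in partition if set(c) <= t)
                    for c in candidates})

def count_distribution_pass(partitions, candidates, min_count):
    """Steps 2-4 for all simulated processors; returns L_k with global counts."""
    per_processor = [local_counts(p, candidates) for p in partitions]
    # Step 3: stand-in for the all-to-all exchange of local counts.
    global_counts = sum(per_processor, Counter())
    # Step 4: every processor would derive the same L_k from the global counts.
    return {c: n for c, n in global_counts.items() if n >= min_count}

# Hypothetical database split over two processors, with C_2 already generated.
partitions = [[{1, 3, 4}, {2, 3, 5}], [{1, 2, 3, 5}, {2, 5}]]
C2 = [(1, 2), (1, 3), (1, 5), (2, 3), (2, 5), (3, 5)]
L2 = count_distribution_pass(partitions, C2, min_count=2)
print(L2)
```

Only counts cross processor boundaries here and never transactions, which is why this variant keeps data transfer low at the price of every processor counting the full candidate set.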

Data Distribution Algorithm
Pass k = 1: same as the Count Distribution algorithm.
Pass k > 1:
1. Processor P_i generates C_k from L_{k-1}, retaining only 1/N-th of the itemsets; these form the candidate subset C_k^i that it will count. The C_k^i sets are all disjoint and the union of all C_k^i sets is the original C_k.
2. Processor P_i develops support counts for the itemsets in its local candidate set C_k^i using both local data pages and data pages received from other processors.
3. At the end of the pass, each processor P_i calculates L_k^i using the local C_k^i. Again, all L_k^i sets are disjoint and the union of all L_k^i is L_k.
4. Processors exchange L_k^i so that every processor has the complete L_k to generate C_{k+1} for the next pass. Processors are forced to synchronize in this step.
5. Each processor P_i can independently (but identically) decide whether to terminate or continue.
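A single-process sketch of a Data Distribution pass follows. The round-robin split of C_k is an assumption, since the slides only say each processor keeps 1/N-th of the candidates, and concatenating all partitions stands in for the broadcast of data pages. Names and data are hypothetical.

```python
def data_distribution_pass(partitions, candidates, min_count):
    """Simulate steps 1-4; returns (the C_k^i split, the complete L_k)."""
    n = len(partitions)
    ordered = sorted(candidates)
    # Step 1: processor i keeps every n-th candidate, so the C_k^i are disjoint.
    local_candidates = [ordered[i::n] for i in range(n)]
    # Step 2: each processor sees its own pages plus those broadcast by
    # the others, i.e. the whole database.
    all_pages = [t for p in partitions for t in p]
    local_frequent = []
    for cands in local_candidates:
        counts = {c: sum(1 for t in all_pages if set(c) <= t) for c in cands}
        # Step 3: L_k^i is computed from the local C_k^i alone.
        local_frequent.append({c: v for c, v in counts.items() if v >= min_count})
    # Step 4: exchanging the L_k^i gives every processor the complete L_k.
    complete = {}
    for part in local_frequent:
        complete.update(part)
    return local_candidates, complete

# Hypothetical database split over two processors, with C_2 already generated.
dd_partitions = [[{1, 3, 4}, {2, 3, 5}], [{1, 2, 3, 5}, {2, 5}]]
dd_C2 = [(1, 2), (1, 3), (1, 5), (2, 3), (2, 5), (3, 5)]
dd_split, dd_L2 = data_distribution_pass(dd_partitions, dd_C2, min_count=2)
print(dd_split)
print(dd_L2)
```

Each simulated processor holds only a share of the candidates, which is the memory advantage, but it has to scan every transaction, which is the communication cost.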

Candidate Distribution Algorithm
Pass k < m: use either the Count or the Data Distribution algorithm.
Pass k = m:
1. Partition L_{k-1} among the N processors such that the L_{k-1} sets are well balanced. Important: for each itemset, remember which processor it was assigned to.
2. Processor P_i generates C_k^i using only the L_{k-1} partition assigned to it.
3. P_i develops global counts for the candidates in C_k^i, and the database is repartitioned into DR_i at the same time.
4. After P_i has processed its local data and the data received from other processors, it posts N - 1 asynchronous receive buffers to receive L_k^j from all other processors; these are needed for pruning C_{k+1}^i in the prune step of candidate generation.
5. Processor P_i computes L_k^i from C_k^i and asynchronously broadcasts it to the other N - 1 processors using N - 1 asynchronous sends.
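Step 1 of pass k = m could look like the sketch below. The slides only require a balanced partition and a record of which processor owns each itemset. Grouping itemsets by their shared prefix (so that a processor can run the join step on its own partition) and balancing greedily are assumptions of this sketch, as are the names and the sample L_3.

```python
from collections import defaultdict

def partition_frequent(frequent_prev, n_processors):
    """Assign each itemset of L_{k-1} to a processor.
    Returns (partitions, owner), where owner maps itemset -> processor index."""
    # Group itemsets that share all but their last item, so joining within
    # a group never needs itemsets held by another processor.
    groups = defaultdict(list)
    for itemset in sorted(tuple(sorted(s)) for s in frequent_prev):
        groups[itemset[:-1]].append(itemset)
    partitions = [[] for _ in range(n_processors)]
    owner = {}
    # Greedy balancing: largest group first, onto the least-loaded processor.
    for prefix in sorted(groups, key=lambda p: (-len(groups[p]), p)):
        target = min(range(n_processors), key=lambda i: len(partitions[i]))
        partitions[target].extend(groups[prefix])
        for itemset in groups[prefix]:
            owner[itemset] = target  # remembered for the later pruning checks
    return partitions, owner

# Illustrative L_3 over items a-e, split across two processors.
cd_L3 = ["abc", "abd", "abe", "acd", "ace", "bcd", "bce", "bde", "cde"]
cd_parts, cd_owner = partition_frequent(cd_L3, 2)
print(cd_parts)
print(cd_owner[("a", "b", "c")], cd_owner[("a", "c", "d")])
```

The `owner` table is what a processor would consult in later passes to decide whose frequent itemsets it must have received before pruning a candidate.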

Candidate Distribution Algorithm Cont.
Pass k > m:
1. Processor P_i collects all frequent itemsets sent by other processors. They are used for the pruning step. Itemsets from some processor j may not be of length k - 1, because processors run at different speeds, but P_i keeps track of the longest length of itemsets received from every single processor.
2. P_i generates C_k^i using its local L_{k-1}^i. P_i has to be careful during pruning, because it may not yet have received L_{k-1}^j from all other processors. So when examining whether a candidate should be pruned, it needs to go back to pass k = m, find out which processor was assigned the current itemset when its length was m - 1, and check whether L_{k-1}^j has been received from this processor. (E.g. let m = 2 and L_4 = {abcd, abce, abde}, and suppose we are looking at itemset {abcd}; then we have to go back to when the itemset was {ab} (i.e. at pass k = m) to determine which processor was assigned this itemset.)
3. P_i makes a pass over DR_i and counts C_k^i. From C_k^i it computes L_k^i and broadcasts it to every other processor via N - 1 asynchronous sends.

Pros and Cons of the Algorithms
Count Distribution
Pro: minimizes heavy data transfer between processors
Con: redundant candidate set counting
Data Distribution
Pro: utilizes aggregate memory by assigning each processor a mutually exclusive subset of the candidate set
Con: requires a good communication network (high bandwidth, low latency) because of the large amount of data that must be broadcast at each pass
Candidate Distribution
Pro: maximizes use of aggregate memory while limiting communication to a single redistribution pass; eliminates the synchronization costs that Count and Data Distribution must pay at the end of every pass
Con (post testing): it turns out the single redistribution pass takes its toll on the system

Looking Ahead
Plan:
Implement all three algorithms
Compare their performance (with each other; with sequential Apriori; with other sequential frequent pattern mining algorithms)
Find out the synchronization capabilities of MPI (Message Passing Interface) in a multithreaded environment
Find out what synchronization modifications are needed to implement the algorithms on a system that does not have a shared-nothing multiprocessor infrastructure

Thank You! Questions?