Concurrent Apriori Data Mining Algorithms

Vassil Halatchev
Department of Electrical Engineering and Computer Science
York University, Toronto
October 8, 2015

Outline
Why it is important
Introduction to Association Rule Mining (a Data Mining technique)
Overview of the sequential Apriori algorithm
The 3 parallel Apriori algorithm implementations
Future work

What is Data Mining?
Mining knowledge from data
Data mining [Han, 2001]: Process of extracting interesting (non-trivial, implicit, previously unknown and potentially useful) knowledge or patterns from data in large databases
Objectives of data mining:
Discover knowledge that characterizes general properties of data
Discover patterns in the previous and current data in order to make predictions on future data
Source: Data Mining CSE6412

Big Data Era
Term introduced by Roger Magoulas in 2010
"A massive volume of both structured and unstructured data that is so large it is difficult to process using traditional database and software techniques" - Webopedia
Multicore machines allow for efficient concurrent computations, which require proper synchronization techniques and can significantly reduce task completion times

Big Data Era
45 zettabytes (45 x 1000^4 gigabytes) of data produced in 2020

Why Mine Association Rules?
Source: Data Mining CSE6412

Association Rule Mining Applications
Market basket analysis (e.g. Stock market, Shopping patterns)
Medical diagnosis (e.g. Causal effect relationship)
Census data (e.g. Population Demographics)
Bio-sequences (e.g. DNA, Protein)
Web Log (e.g. Fraud detection, Web page traversal patterns)

What Kind of Databases?
Source: Data Mining CSE6412

Definition of Association Rule
Source: Data Mining CSE6412

Support and Confidence: Example
Source: Data Mining CSE6412
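
As a minimal sketch, assuming the standard definitions (the support of an itemset X is the fraction of transactions that contain X, and the confidence of a rule X => Y is support(X ∪ Y) / support(X)), the two measures could be computed as follows. The toy database and the function names are illustrative and are not taken from the course slides.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """support(X ∪ Y) / support(X) for the rule X => Y.

    Assumes the antecedent occurs at least once (no zero-division guard).
    """
    both = frozenset(antecedent) | frozenset(consequent)
    return support(both, transactions) / support(antecedent, transactions)


# Hypothetical toy database of four transactions.
transactions = [
    frozenset({"a", "b"}),
    frozenset({"a", "c", "d"}),
    frozenset({"b", "c", "d"}),
    frozenset({"a", "b", "c"}),
]

sup_ab = support({"a", "b"}, transactions)             # 2 of 4 transactions
conf_a_b = confidence({"a"}, {"b"}, transactions)      # 0.5 / 0.75
```

Here {a, b} appears in two of the four transactions, so its support is 0.5, and the rule a => b holds in two of the three transactions that contain a.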

Mining Association Rules
Source: Data Mining CSE6412

How to Mine Association Rules
Source: Data Mining CSE6412

Candidate Generation
How to Generate Candidates? (i.e. How to Generate C_{k+1} from L_k)
Example of Candidate Generation
Source: Data Mining CSE6412
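
One common way to generate C_{k+1} from L_k is the join-and-prune procedure from the Apriori paper. The sketch below assumes itemsets are stored as sorted tuples; the name `apriori_gen` and the sample L_3 are illustrative.

```python
from itertools import combinations


def apriori_gen(frequent_k):
    """Build C_{k+1} from L_k, where L_k is a collection of sorted k-tuples."""
    frequent_k = set(frequent_k)
    ordered = sorted(frequent_k)
    candidates = set()
    for i, a in enumerate(ordered):
        for b in ordered[i + 1:]:
            # Join step: merge two itemsets that share their first k-1 items.
            if a[:-1] != b[:-1]:
                continue
            cand = a + (b[-1],)
            # Prune step: every k-subset of the candidate must be frequent.
            if all(sub in frequent_k for sub in combinations(cand, len(a))):
                candidates.add(cand)
    return candidates


# Illustrative L_3.
L3 = {("a", "b", "c"), ("a", "b", "d"), ("a", "c", "d"),
      ("a", "c", "e"), ("b", "c", "d")}
C4 = apriori_gen(L3)
```

The join step produces abcd and acde. The prune step removes acde because its subset ade is not in L_3, so only abcd remains.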

Apriori Algorithm
Proposed by Agrawal and Srikant in 1994
Apriori Algorithm (Flow Chart)
Apriori Algorithm Example
Source: Data Mining CSE6412
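
A compact sketch of the sequential algorithm, assuming an absolute minimum support count and itemsets stored as sorted tuples. It is a plain level-wise loop without the hash-tree optimization of the original paper, and the sample database is illustrative.

```python
from collections import Counter
from itertools import combinations


def apriori_gen(frequent_k):
    """Join-and-prune candidate generation (C_{k+1} from L_k)."""
    frequent_k = set(frequent_k)
    ordered = sorted(frequent_k)
    out = set()
    for i, a in enumerate(ordered):
        for b in ordered[i + 1:]:
            if a[:-1] == b[:-1]:
                cand = a + (b[-1],)
                if all(s in frequent_k for s in combinations(cand, len(a))):
                    out.add(cand)
    return out


def apriori(transactions, min_count):
    """Return {itemset: support count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]
    # Pass 1: count single items.
    counts = Counter((item,) for t in transactions for item in t)
    frequent = {c: n for c, n in counts.items() if n >= min_count}
    result = dict(frequent)
    # Pass k > 1: generate candidates, count them, keep the frequent ones.
    while frequent:
        candidates = apriori_gen(frequent)
        counts = Counter()
        for t in transactions:
            for c in candidates:
                if t.issuperset(c):
                    counts[c] += 1
        frequent = {c: n for c, n in counts.items() if n >= min_count}
        result.update(frequent)
    return result


# Illustrative database with a minimum support count of 2.
db = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
frequent_itemsets = apriori(db, min_count=2)
```

On this database the loop stops after pass 3, where {2, 3, 5} is the only frequent 3-itemset and no 4-candidates can be formed.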

My Paper
Rakesh Agrawal and John C. Shafer. Parallel mining of association rules: Design, implementation and experience. Technical report, IBM, 1996.
Rakesh Agrawal and John C. Shafer. Parallel mining of association rules. IEEE Transactions on Knowledge and Data Engineering, (6):962-969, 1996.
Source: Google Scholar, Rakesh Agrawal

3 Parallel Apriori Algorithms
IMPORTANT: The algorithms are implemented on a shared-nothing multiprocessor communicating via a Message Passing Interface (MPI)
Count Distribution: Each processor calculates its Candidate Set counts from its local database and, at the end of each pass, sends out its Candidate Set counts to all other processors.
Data Distribution: Each processor is assigned a mutually exclusive partition of the Candidate Set on which it computes the counts and, at the end of the pass, sends out its Candidate Set tuples to all other processors.
Candidate Distribution: Both the Candidate Set and the database are partitioned during some pass k, so that each processor can operate independently.

Notations
Source: My Paper

Count Distribution Algorithm
Pass k = 1:
1. Processor P^i scans its data partition D^i, reading one transaction tuple (i.e. (TID, X)) at a time, building its local C_1^i and storing it in a hash table (a new entry is created if necessary).
2. At the end of the pass, every P^i loads the contents of its local C_1^i into a buffer and sends it out to all other processors.
3. At the same time, each P^i receives the send buffer from another processor and, for every element in the buffer, adds its count to the matching entry in the local C_1^i hash table, creating a new entry if none exists.
4. P^i now has the entire candidate set C_1 with global support counts for each candidate/element/itemset.
Steps 2 and 3 require synchronization.

Count Distribution Algorithm Cont. (Pass k = 1 Example)

Local support counts before the exchange:

Itemset   Node 1   Node 2   Node 3
{a}       15       5        2
{b}       5        2        1
{c}       7        1        4
{d}       2        3        9
{e}       -        -        6

Processor/Node 1 at end of pass:

Itemset   Support
{a}       22
{b}       8
{c}       12
{d}       14
{e}       6
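
The merge in step 3 of pass 1 can be sketched with ordinary dictionaries standing in for the send and receive buffers. No MPI calls are made here; the three nodes are simulated in one process, and `merge_into` is an illustrative name.

```python
from collections import Counter

# Local C_1 counts of the three nodes from the example.
local_counts = {
    1: Counter({"a": 15, "b": 5, "c": 7, "d": 2}),
    2: Counter({"a": 5, "b": 2, "c": 1, "d": 3}),
    3: Counter({"a": 2, "b": 1, "c": 4, "d": 9, "e": 6}),
}


def merge_into(local, received):
    """Add the counts of a received buffer; unseen itemsets get a new entry."""
    for itemset, count in received.items():
        local[itemset] += count
    return local


# Node 1 receives the buffers of nodes 2 and 3.
node1 = Counter(local_counts[1])
for other in (2, 3):
    merge_into(node1, local_counts[other])
```

After both buffers are merged, node 1 holds the global counts, including a new entry for {e}, which it never saw in its own partition.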

Count Distribution Algorithm Cont.
Pass k > 1:
1. Every processor P^i generates the complete C_k using the frequent itemsets L_{k-1} created at pass k - 1.
2. Processor P^i goes over its local database partition D^i and develops local support counts for the candidates in C_k.
3. Processor P^i exchanges its local C_k counts with all other processors to develop the global C_k counts. Processors are forced to synchronize in this step.
4. Each processor P^i now computes L_k from C_k.
5. Each processor P^i decides to continue to the next pass or terminate (the decision will be identical, as the processors all have identical L_k).
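
The pass k > 1 steps can be simulated in a single process. In this sketch each "processor" is just an entry in a list of database partitions, and the exchange in step 3 is modelled by summing the local counters, where a real implementation would use MPI communication. All names and the sample data are illustrative.

```python
from collections import Counter
from itertools import combinations


def apriori_gen(frequent_k):
    """Join-and-prune candidate generation (C_{k+1} from L_k)."""
    frequent_k = set(frequent_k)
    ordered = sorted(frequent_k)
    out = set()
    for i, a in enumerate(ordered):
        for b in ordered[i + 1:]:
            if a[:-1] == b[:-1]:
                cand = a + (b[-1],)
                if all(s in frequent_k for s in combinations(cand, len(a))):
                    out.add(cand)
    return out


def count_distribution_pass(partitions, prev_frequent, min_count):
    """One simulated pass k > 1 of Count Distribution."""
    # Step 1: every processor derives the same complete C_k.
    candidates = apriori_gen(prev_frequent)
    # Step 2: each processor counts C_k over its own partition only.
    local = []
    for part in partitions:
        counts = Counter()
        for t in part:
            for c in candidates:
                if t.issuperset(c):
                    counts[c] += 1
        local.append(counts)
    # Step 3: exchange of local counts, simulated by summing them.
    global_counts = sum(local, Counter())
    # Step 4: every processor computes the same L_k.
    return {c: n for c, n in global_counts.items() if n >= min_count}


# Two simulated processors, two transactions each.
partitions = [[{1, 3, 4}, {2, 3, 5}], [{1, 2, 3, 5}, {2, 5}]]
L1 = {(1,), (2,), (3,), (5,)}
L2 = count_distribution_pass(partitions, L1, min_count=2)
```

Only counts cross the simulated processor boundary; the transactions themselves never leave their partition, which is the property the algorithm is named for.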

Data Distribution Algorithm
Pass k = 1: Same as the Count Distribution Algorithm
Pass k > 1:
1. Processor P^i generates C_k from L_{k-1}, retaining only 1/N-th of the itemsets; these form the candidate subset C_k^i that it will count. The C_k^i sets are all disjoint and their union is the original C_k.
2. Processor P^i develops support counts for the itemsets in its local candidate set C_k^i using both local data pages and data pages received from other processors.
3. At the end of the pass, each processor P^i calculates L_k^i using its local C_k^i. Again, all L_k^i sets are disjoint and their union is L_k.
4. Processors exchange L_k^i so that every processor has the complete L_k to generate C_{k+1} for the next pass. Processors are forced to synchronize in this step.
5. Each processor P^i can independently (but identically) decide whether to terminate or continue.
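
A single-process sketch of one Data Distribution pass follows. It assumes a round-robin split of the sorted candidates, which is one simple way to give each processor 1/N-th of C_k, and it models the data-page exchange by letting every simulated processor read all partitions. All names and data are illustrative.

```python
from collections import Counter
from itertools import combinations


def apriori_gen(frequent_k):
    """Join-and-prune candidate generation (C_{k+1} from L_k)."""
    frequent_k = set(frequent_k)
    ordered = sorted(frequent_k)
    out = set()
    for i, a in enumerate(ordered):
        for b in ordered[i + 1:]:
            if a[:-1] == b[:-1]:
                cand = a + (b[-1],)
                if all(s in frequent_k for s in combinations(cand, len(a))):
                    out.add(cand)
    return out


def data_distribution_pass(partitions, prev_frequent, min_count):
    """One simulated pass k > 1 of Data Distribution."""
    n_procs = len(partitions)
    candidates = sorted(apriori_gen(prev_frequent))
    # Step 1: processor i keeps every n_procs-th candidate (assumed split).
    local_candidates = [candidates[i::n_procs] for i in range(n_procs)]
    # Step 2: each processor sees its own pages plus all received pages,
    # i.e. the whole database in this simulation.
    all_data = [t for part in partitions for t in part]
    local_frequent = []
    for cands in local_candidates:
        counts = Counter()
        for t in all_data:
            for c in cands:
                if t.issuperset(c):
                    counts[c] += 1
        # Step 3: local L_k from the local candidate subset.
        local_frequent.append(
            {c: n for c, n in counts.items() if n >= min_count})
    # Step 4: exchange of the disjoint local L_k sets, simulated by a union.
    frequent = {}
    for part_result in local_frequent:
        frequent.update(part_result)
    return local_candidates, frequent


partitions = [[{1, 3, 4}, {2, 3, 5}], [{1, 2, 3, 5}, {2, 5}]]
L1 = {(1,), (2,), (3,), (5,)}
local_cands, L2 = data_distribution_pass(partitions, L1, min_count=2)
```

Each simulated processor holds only half of the six candidates, but has to scan the entire database, which illustrates why this scheme depends on a fast network.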

Candidate Distribution Algorithm
Pass k < m: Use either the Count or the Data Distribution algorithm.
Pass k = m:
1. Partition L_{k-1} among the N processors such that the L_{k-1} sets are well balanced. Important: for each itemset, remember which processor was assigned to it.
2. Processor P^i generates C_k^i using only the L_{k-1} partition assigned to it.
3. P^i develops global counts for the candidates in C_k^i, and the database is repartitioned into DR^i at the same time.
4. After P^i has processed local data and data received from other processors, it posts N - 1 asynchronous receive buffers to receive L_k^j from all other processors, needed for pruning C_{k+1}^i in the prune step of candidate generation.
5. Processor P^i computes L_k^i from C_k^i and asynchronously broadcasts it to the other N - 1 processors using N - 1 asynchronous sends.
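
Step 1 of pass k = m does not fix a particular partitioning method. The sketch below shows one possible heuristic: itemsets that share their first k-2 items are kept together, so that the join step of candidate generation finds both parents on the same processor, and each group goes to the currently least-loaded processor. The function name, the grouping rule and the greedy balancing are assumptions for illustration.

```python
from collections import defaultdict


def partition_frequent(prev_frequent, n_procs):
    """Assign each itemset of L_{k-1} to a processor; return {itemset: owner}."""
    # Group itemsets by their common prefix (all items but the last).
    groups = defaultdict(list)
    for itemset in sorted(prev_frequent):
        groups[itemset[:-1]].append(itemset)
    loads = [0] * n_procs
    owner = {}
    # Place the largest groups first, each on the least-loaded processor.
    ordered = sorted(groups.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    for _prefix, members in ordered:
        target = loads.index(min(loads))
        for itemset in members:
            owner[itemset] = target   # remembered for later pruning checks
        loads[target] += len(members)
    return owner


# Illustrative L_2 split over two processors.
L2 = {("a", "b"), ("a", "c"), ("a", "d"),
      ("b", "c"), ("b", "d"), ("c", "d")}
owner = partition_frequent(L2, n_procs=2)
```

In this example the three itemsets starting with a go to processor 0 and the remaining three go to processor 1, so both processors hold three itemsets. The returned map is the record of which processor was assigned to each itemset.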

Candidate Distribution Algorithm Cont.
Pass k > m:
1. Processor P^i collects all frequent itemsets sent by other processors. They are used for the pruning step. Itemsets from some processor j may not be of length k - 1, because processors run at different speeds, but P^i keeps track of the longest length of itemsets received from every processor.
2. P^i generates C_k^i using its local L_{k-1}^i. P^i has to be careful during pruning, as it may not yet have received L_{k-1}^j from all other processors. So when examining whether a candidate should be pruned, it needs to go back to pass k = m, find out which processor was assigned to the current itemset when its length was m - 1, and check whether L_{k-1}^j has been received from this processor. (e.g. Let m = 2; L_4 = {abcd, abce, abde} and we are looking at itemset {abcd}; then we have to go back to when the itemset was {ab} (i.e. at pass k = m) to determine which processor was assigned to this itemset.)
3. P^i makes a pass over DR^i and counts C_k^i. From C_k^i it computes L_k^i and broadcasts it to every other processor via N - 1 asynchronous sends.
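
The careful pruning of step 2 can be sketched as a three-way decision for each (k-1)-subset of a candidate: known frequent, known infrequent, or not yet known because its owner's itemsets of that length have not arrived. The sketch assumes ownership is looked up by the first m - 1 items of the subset; `can_prune` and all parameter names are hypothetical.

```python
from itertools import combinations


def can_prune(candidate, m, owner_of_prefix, known_frequent,
              longest_received, me):
    """Return True only if some (k-1)-subset is definitely infrequent."""
    k = len(candidate)
    for subset in combinations(candidate, k - 1):
        if subset in known_frequent:
            continue                      # frequent, no reason to prune
        owner = owner_of_prefix.get(subset[:m - 1])
        if owner is None:
            return True                   # prefix was never frequent
        if owner == me or longest_received.get(owner, 0) >= k - 1:
            return True                   # owner's L_{k-1} is known: infrequent
        # Otherwise the owner's L_{k-1} has not arrived yet: undecided.
    return False


# Illustrative setup: redistribution happened at pass m = 2.
owner_of_prefix = {("a",): 0, ("b",): 1, ("c",): 1}
known = {("a", "b"), ("a", "c")}          # 2-itemsets known to processor 0
cand = ("a", "b", "c")

# Processor 1 has only sent itemsets of length 1 so far: cannot decide on bc.
early = can_prune(cand, 2, owner_of_prefix, known, {1: 1}, me=0)
# Processor 1's 2-itemsets have arrived and bc is not among them: prune.
late = can_prune(cand, 2, owner_of_prefix, known, {1: 2}, me=0)
```

In the first call the candidate is kept, because the absence of bc might only mean that processor 1 is running behind. In the second call the same absence is conclusive, so the candidate can be pruned.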

Pros and Cons of the Algorithms
Count Distribution
Pro: Minimizes heavy data transfer between processors
Con: Redundant Candidate Set counting
Data Distribution
Pro: Utilizes aggregate memory by assigning each processor a mutually exclusive subset of the Candidate Set
Con: Requires a good communication network (high bandwidth/low latency) due to the large size of data needed to be broadcast at each pass
Candidate Distribution
Pro: Maximizes use of aggregate memory while limiting communication to a single redistribution pass. Eliminates synchronization costs that Count and Data must pay at the end of every pass
Con (post testing): It turns out the single redistribution pass takes its toll on the system

Looking Ahead
Plan:
Implement all three algorithms
Compare their performance (with each other; with sequential Apriori; with other sequential frequent pattern mining algorithms)
Find out the synchronization capabilities of MPI (Message Passing Interface) in a multithreaded environment
Find out what synchronization modifications are needed to implement the algorithms on a system that does not have a shared-nothing multiprocessor infrastructure

Thank You! Questions?