Discovering Relational Patterns across Multiple Databases


Xingquan Zhu (1,3) and Xindong Wu (2)

(1) Dept. of Computer Science & Eng., Florida Atlantic University, Boca Raton, FL 33431, USA
(2) Dept. of Computer Science, University of Vermont, Burlington, VT 05405, USA
(3) Graduate University, Chinese Academy of Sciences, Beijing 100080, China
xqzhu@cse.fau.edu; xwu@cs.uvm.edu

(This research has been supported by the US National Science Foundation (NSF) under Grant No. CCF-05489 and the National Science Foundation of China (NSFC) under Grant No. 6067409.)

Relational patterns across multiple databases can reveal special pattern relationships hidden inside data collections. Existing research in data mining has made significant efforts in discovering different types of patterns from single or multiple databases, but how to find patterns that have a higher support in database A than in database B with a given support threshold α is still an open problem. We propose in this paper DRAMA, a systematic framework for Discovering Relational patterns Across Multiple databases. More specifically, given a series of data collections, we try to discover patterns from different databases with pattern relationships satisfying user specified constraints. Our method seeks to build a Hybrid Frequent Pattern tree (HFP-tree) from multiple databases, and mines patterns from the HFP-tree by integrating the users' constraints into the pattern mining process.

1. Introduction

Many real-world applications involve the collection and management of multiple databases. Examples include market basket transaction data from different branches of a wholesale store, data collections of a particular branch in different time periods, census data of different states in a particular year, and data of a certain state in different years. For years, knowledge discovery and data mining (also referred to as KDD) [1-2] has been proven to be an effective tool for searching for novel and actionable patterns and relationships that exist in the data. When patterns take the form of association rules, existing research in the area has made significant efforts in discovering patterns (frequent itemsets, closed patterns or sequential patterns) from different types of data environments, with solutions roughly falling into the following three categories: (1) finding patterns from a single (large volume) database; (2) finding patterns from multiple databases; and (3) finding patterns from continuous data streams. The essential goal is to enhance mining algorithms such that they can scale up well to large volumes of (centralized, distributed or continuous) data.

To find patterns from multiple databases, a common concern is to discover knowledge which does not exist unless one unifies all data collections into a single view. For this purpose, existing research has mainly targeted discovering global patterns, with the assistance of a local data mining process. Collective data mining [3] is one of the most representative efforts in the area, with the objective of unambiguous local analysis that can be used as a building block for generating the correct global results. A common practice is to conduct data mining on each single database, and then forward promising meta patterns to a central place for analysis [4].

The problem of finding global patterns is surely important in reality, as it reveals knowledge which is unavailable from each single database's point of view. There is, however, another problem involved in pattern mining from multiple databases: discovering relational patterns and their relationships across databases.
Taking a retail store with two branches A and B as an example, if a store manager were organizing data from these two branches for intelligent analysis, he/she may easily raise concerns like (1) what are the frequent patterns in both A and B? i.e., (A ≥ α) & (B ≥ α), where α is the threshold in finding frequent patterns, and A ≥ α means that a pattern's support value in database A should be no less than the value α; (2) what are the frequent patterns which appear more often in A than in B, i.e., A > B ≥ α; and (3) what are the patterns whose support differences in these two stores are no less than the value α, i.e., |A-B| ≥ α. There are possibly many other concerns in this regard, but unfortunately, no systematic solution has been proposed to address this issue in an effective way, such that the discovered relational patterns can be used to support efficient and effective data and knowledge management.

In reality, when users are exposed to data collected from multiple sources, it is natural to turn to a contrast study for knowledge and pattern discovery. Examples include national census data analysis, network intrusion detection, and molecular genetic data analysis. We list here two motivating examples.

Example 1: Consider a data expert who is interested in studying residents of the north-eastern states of America (i.e., the so-called New England area, including the states of Connecticut (CT), Maine (ME), Massachusetts (MA), New Hampshire (NH), Rhode Island (RI), and Vermont (VT)). This expert may also be interested in finding the similarities/differences between residents in this area and residents on the West Coast, say California (CA). For these purposes, the following queries are likely to be raised by the expert.

Query 1. Finding patterns that are frequent with a support level of α in all of the New England states, but significantly infrequent with a support level of β in California, i.e., {(CT ≥ α) & (ME ≥ α) & (MA ≥ α) & (NH ≥ α) & (RI ≥ α) & (VT ≥ α)} & {CA < β}.

Query 2. Finding patterns that are frequent with a support level of α in the New England area, w.r.t. all states, i.e., {(CT+ME+MA+NH+RI+VT) ≥ α}.

Query 3. Finding patterns that are frequent with a support level of α in all New England states, but with their supports declining from northern to southern states, i.e., {ME > (NH ∨ VT) > MA > (CT ∨ RI) ≥ α}.

Example 2: Recent developments in microbiology and bioinformatics have made it possible to extract gene expression data for molecular genetic analysis. One of the most important applications is to use such gene expression data for genetic disease profiling, for example, molecular cancer classification [5].

In order to detect signature patterns for Leukemia, say Acute Myeloid Leukemia (AML) and Acute Lymphoblastic Leukemia (ALL), a microbiologist can split the underlying data into four datasets, with D1 containing gene expression data of normal tissues, D2 containing data of AML tissues, D3 containing all ALL tissues, and D4 containing all other cancer tissues. Queries of the following types can then be used to capture the signature patterns for cancer classification.

Query 1: Finding the patterns that are frequent with a support level of α in any of the cancer datasets D2, D3, or D4, but are significantly infrequent in D1, i.e., {(D2 ∨ D3 ∨ D4) ≥ α} & {(D1 < β)}.

Query 2: Finding the patterns that are frequent with a support level of α in all cancer datasets, but with support in the Leukemia tissues higher than in the other cancer tissues, i.e., {(D2 ∨ D3) ≥ D4 ≥ α}.

There are many other applications, aside from the above two examples, where users have to deal with data from different sources. In addition, it is often the case that users know some basic features of these data collections, such as the date and time each database was collected, or the region or entity each database may represent. What remains unclear is the relationship of the patterns hidden across multiple data collections. As a result, the need to compare patterns from different datasets and understand their relationships is emerging. For example, store managers may want to find gradually increasing shopping patterns of their customers in a certain period of time, or a microbiologist may want to find patterns of diseases along an evolving order. For these purposes, discovering relational patterns across multiple databases can be a very important part of the KDD process. Although well motivated, the solution to this end requires an efficient mechanism for complex querying and mining on multiple databases.

2. Simple Solutions and Challenges

In a naive sense, the problem of discovering relational patterns across multiple databases can be solved by three simple solutions: (1) Sequential Pattern Verification (SPV); (2) Parallel Pattern Mining (PPM); and (3) Collaborative Pattern Mining (CPM). SPV starts pattern mining from a seed database (which can be a subset of a database in the query) and then passes the discovered patterns on to the second database for verification. Such a sequential process repeats until the patterns have been verified by all the databases involved in the query. For example, to answer Query 1 in Example 1, SPV may start from the CT database to find frequent patterns, then pass the patterns on to database ME to find patterns frequent in both CT and ME. Any patterns which do not satisfy the query will be pruned out immediately. This process repeats until all the databases in the query have verified the pattern. Instead of verifying patterns in a sequential way, PPM concurrently discovers patterns from each single database, and then forwards all frequent patterns (from each single database's point of view) to a central place to find the ones which satisfy the query constraints. For example, to answer Query 1 in Example 1, PPM concurrently discovers patterns from each single database (CT, ME, ..., and VT), and then checks whether a pattern satisfies the query or not. One should be aware that it is technically infeasible to find patterns which satisfy CA < β by using database CA only, because no deterministic pruning rule will hold and one would have to list all candidates if he/she did intend to do so. Therefore, PPM will concurrently mine patterns from all the other parts (CT, ME, ..., and VT), and then pass the patterns on to CA to verify whether they satisfy CA < β.
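As a rough illustration of the verification pipeline just described, and not the paper's actual implementation, the following C++ sketch shows how an SPV-style pass might be organized: candidates mined from a seed database are re-checked against each subsequent database and dropped as soon as one database's support falls below the threshold. The transaction layout, the support helper, and the assumption that candidates are already available from the seed database are illustrative simplifications.

    #include <algorithm>
    #include <set>
    #include <vector>

    using Item = int;
    using Itemset = std::set<Item>;
    using Database = std::vector<Itemset>;   // one transaction per entry

    // Fraction of transactions in db that contain every item of the pattern.
    double support(const Database& db, const Itemset& pattern) {
        if (db.empty()) return 0.0;
        int hits = 0;
        for (const Itemset& t : db)
            if (std::includes(t.begin(), t.end(), pattern.begin(), pattern.end()))
                ++hits;
        return static_cast<double>(hits) / db.size();
    }

    // SPV-style check for a query of the form (D1 >= a) & (D2 >= a) & ... :
    // candidates mined from the seed database are verified against the
    // remaining databases one by one and pruned as soon as they fail.
    std::vector<Itemset> spvVerify(std::vector<Itemset> candidates,
                                   const std::vector<Database>& otherDbs,
                                   double alpha) {
        for (const Database& db : otherDbs) {
            std::vector<Itemset> survivors;
            for (const Itemset& p : candidates)
                if (support(db, p) >= alpha)     // rescan db for every candidate
                    survivors.push_back(p);
            candidates.swap(survivors);          // only survivors move on
        }
        return candidates;
    }

The repeated full scan of each database inside the inner loop is exactly the rescanning cost that Section 2 identifies as the main weakness of these simple solutions.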
Both SPV and PPM rely on the results discovered from a single database for pattern verification, where the mining process (candidate generation and pruning) at each single site does not consider the existence of other databases at all (unless the patterns were forwarded to other databases for verification). As we will discuss later, this single-database based framework forbids both SPV and PPM from answering some complex queries. However, this disadvantage can be overcome by CPM, which unifies all databases in the query into one view for candidate generation and verification. The theme of CPM is to generate length-l candidates from each single database, with all candidates forwarded to a central place for candidate justification, such that only candidates satisfying certain conditions are redispatched to each database for the next round of pattern growing (length-l+1). This procedure repeats until no more candidates can be generated.

All three methods above can somewhat fulfill the goal of finding relational patterns across multiple databases, although not necessarily for all types of queries. For example, SPV and PPM cannot possibly answer Query 2 in Example 1, because a pattern satisfying {(CT+ME+MA+NH+RI+VT) ≥ α} can take any support value in each single database. For example, if a pattern's support was 0 in CT, ME, MA, NH, and RI, but α in VT, it would still satisfy the query. To find such patterns, SPV and PPM have to list all possible candidates (by setting each database's threshold value to 0), which is technically infeasible. In fact, the most serious disadvantage of all three methods lies in the fact that they are all Apriori-based, where pattern generation and database rescanning for verification significantly reduce their speed in finding relational patterns. It is commonly recognized that database rescanning for pattern verification can be very time consuming, especially when the underlying data volumes are large. Therefore, we need a fundamentally different design which should take the following concerns into consideration in discovering relational patterns. (1) Being able to unify all databases in the query to fulfill the pattern discovery process. In other words, conducting pattern mining from a single database without considering all other databases is not an option for us. (2) Being able to meet all queries listed in the above two examples. In Section 4, we will formally define our problem and queries, which should also be addressed by our solutions. (3) Being able to scale well to large data volumes and to be easily extended to discover other types of relational patterns beyond frequent itemsets.

In this paper, we take the above concerns into consideration and propose a hybrid frequent pattern (HFP) tree based solution. Our method seeks to build a single HFP-tree for each query, where pattern generation and verification unify the underlying databases to speed up the pruning process. Experimental comparisons on both synthetic and real-world databases will demonstrate that this framework can significantly enhance the speed of finding relational patterns, where the improvement can be as much as over 100 times better than the simple solutions.

3. Related Work

The problem of handling data from multiple databases is a nontrivial task in reality, and it often raises concerns like how to compare or unify different parts of the data to achieve a common goal. Domains of application include classification [6], frequent itemset mining [7-8], clustering [9], and OLAP [7]. For example, Yin et al. [6] have previously proposed a CrossMiner for classification from multiple databases. The problem of association rule mining from distributed databases has also been well studied [10-14], where count distribution, data distribution, and candidate distribution are three basic mechanisms for effective mining from multiple databases [14]. However, among all these research activities, the focus has typically been on mining a single database (whether it is distributed or centralized), with the objective of unifying patterns discovered from each single database into new knowledge and patterns. In comparison, our research focuses on finding patterns and their relationships across multiple databases.

When the underlying data involve multiple (distributed/centralized) sources, one of the most important tasks is to assess the similarity between the datasets, such that the structural information among the databases can be provided for analysis such as clustering. [15] and [16] have previously addressed the problem of database similarity assessment by comparing association rules from each component database, e.g., how many of those rules are identical, and what are the numbers of instances covered by those identical rules? In comparison, we are interested in finding patterns across multiple databases.

The importance of finding differences between databases has been noticed by many researchers in the area [17-20], with the main focus on exploring differences between two databases at a time. Webb et al. [18] proposed a rule based method to explore a contrast set between two databases. Ji et al. [20] have proposed methods to explore minimal distinguishing subsequence patterns between two datasets, where the patterns take the form of frequent in database A, but significantly less frequent in database B, i.e., {(A ≥ α) & (B ≤ β)}. All those methods are interested in finding differences (in terms of data items or patterns) between two datasets, but cannot support the complex queries we mentioned in Example 1.

The research in database queries has made significant efforts in supporting data mining operations [21-23], with extensions of the database query languages to support mining tasks, but most of these efforts focus on a single database with relatively simple query conditions. Among them, the most relevant work related to this research is the complex mining optimization system proposed by Jin and Agrawal [22]. They presented an SQL-based mechanism for querying frequent patterns across multiple databases, with the objective of optimizing the users' queries to find qualified patterns. There are, however, essential differences between their work and what we propose here. (1) The efforts in [22] only focus on the problem of enumerating different query plans and choosing the one with the least cost. The pattern mining methods they adopted are actually the simple solutions discussed in Section 2. Instead of optimizing queries, our research proposes a data mining framework supporting users' queries to find relational patterns. (2) Because of the limitations of their pattern mining framework (relying on each single database), the solution in [22] can only answer simple queries like {(A ≥ α1) & (B ≥ α2) & (C ≤ β)}, i.e., each element of such a query must explicitly specify one single database and its corresponding threshold value.
Their method, however, cannot answer a complex query like Queries 2 and 3 in Example 1, and therefore its applicability is limited in reality.

Table 1. Two toy datasets D1 and D2

  Database D1                     Database D2
  Trans ID  Items                 Trans ID  Items
  1         {a, b, d}             1         {c, f, g}
  2         {a, d, f, g}          2         {a, b, d, g}
  3         {a, b, c, d}          3         {a, b, c}
  4         {a, c, d, g}          4         {a, b, d}
  5         {b, d, f}             5         {a, c}
  6         {a, b, d, g}          6         {e, c, d}
  7         {e, f, d}             7         {a, c, d, f, g}
  8         {a, b, c, e, g}

4. Problem Definition

A pattern, P, discussed in this paper takes the form of an itemset, i.e., a set of items which satisfy the user specified constraint(s). The support of the pattern P in database D, denoted by Sup_D^P, represents the ratio between the number of appearances of P in D and the total transaction number in D. Unless specified otherwise, we always use this ratio to denote a pattern's support. The user's constraints specify the patterns they intend to discover from the database. For example, a user can specify {D ≥ α} to indicate that he/she intends to find patterns from database D, with all qualified patterns' support larger than the given threshold α. A user can specify multiple databases in their constraints, for example {A ≥ B ≥ α}, which indicates a pattern with its support values in A and B both larger than α, and in addition, the pattern's support in A should be larger than its support in B. In this paper, we define the following two types of relationship factors and four operators to describe a user's constraints.

Relationship factors:
  X ≥ α (X > α) indicates that X is no less than α (X is larger than α)
  X ≤ α (X < α) indicates that X is no larger than α (X is less than α)

Operators:
  X + Y indicates the operation of summing up the support values in both X and Y
  X - Y indicates the operation of subtracting the support in Y from the support in X
  X & Y (X ∨ Y) indicates the operation of X and Y (X or Y)
  |X| indicates the absolute support value in X

Notice that '+' directly sums up support values from the participant databases. The results of this operator do not reveal the patterns' support values in the union of the participant databases. This operator is helpful when a data manager intends to find the average support of the patterns over multiple databases.

A user's query is simply the user's constraints, taking the form of a combination of the above relationship factors and operators, in finding relational patterns across multiple databases. More specifically, a query should involve at least one database and one relationship factor, say {A ≥ α}. A query may also involve multiple relationship factors and multiple operators, which is often the case in reality, such as the query {ME > (NH ∨ VT) > MA > (CT ∨ RI) ≥ α} in Example 1.
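To make the constraint notation concrete, the sketch below evaluates a few of the relationship factors and operators defined in this section against the per-database supports of a single pattern. The support values, the database names A and B, and the small helper functions are hypothetical; they only mirror the semantics stated above.

    #include <cmath>
    #include <cstdio>
    #include <map>
    #include <string>

    // Per-database supports of one pattern, keyed by database name.
    using Supports = std::map<std::string, double>;

    // X + Y : sum of the supports in the participant databases.
    double plus2(const Supports& s, const std::string& x, const std::string& y) {
        return s.at(x) + s.at(y);
    }
    // X - Y : support in X minus support in Y.
    double minus2(const Supports& s, const std::string& x, const std::string& y) {
        return s.at(x) - s.at(y);
    }

    int main() {
        // Hypothetical supports of one pattern in two branch databases A and B.
        Supports s = {{"A", 0.42}, {"B", 0.25}};
        double alpha = 0.20;

        // (A >= alpha) & (B >= alpha): frequent in both branches.
        bool bothFrequent = s.at("A") >= alpha && s.at("B") >= alpha;

        // A >= B >= alpha: frequent in both, and more frequent in A than in B.
        bool aDominatesB = s.at("A") >= s.at("B") && s.at("B") >= alpha;

        // (A - B) >= alpha and |A - B| >= alpha: support gap between A and B.
        bool gap    = minus2(s, "A", "B") >= alpha;
        bool absGap = std::fabs(minus2(s, "A", "B")) >= alpha;

        // (A + B) >= alpha: summed support across the two databases.
        bool sumOk = plus2(s, "A", "B") >= alpha;

        std::printf("both=%d dominance=%d gap=%d absGap=%d sum=%d\n",
                    bothFrequent, aDominatesB, gap, absGap, sumOk);
        return 0;
    }

Checking a single pattern this way is straightforward; the difficulty addressed by DRAMA is enumerating the qualifying patterns without testing every possible itemset against every database.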

A pattern which satisfies the user's query is called a relational pattern. Due to the limitation of the pattern mining process, a user's query cannot take an arbitrary form; instead, we require that a query must involve at least one relationship factor (≥ or >) with a numerical threshold value immediately following this factor. A query which complies with this requirement is called a valid query. For example, {A ≥ B ≥ C} is not a valid query, but {A ≥ B ≥ C ≥ α} is. The reason we define a valid query is that without a specific threshold α, it is technically infeasible to find all patterns satisfying {A ≥ B ≥ C}.

The procedure of discovering relational patterns across multiple databases is an interactive process, where a user provides a query and the system finds all patterns satisfying the query in an effective way. In this paper, we only deal with the problem of pattern discovery. We assume that users' queries and the underlying databases are immediately available. The problems of effective/efficient user interaction and data privacy/security are not our concern at this stage.

5. Hybrid Frequent Pattern Tree Construction

The frequent pattern tree (FP-tree) [7] is a well-known data structure for mining frequent itemsets. The merits of the FP-tree lie in the fact that it stores the set of frequent items of each transaction in a compact structure, which avoids repeatedly scanning the original database during the mining process. In this section, we propose a solution that joins multiple databases together to build a single Hybrid Frequent Pattern tree (HFP-tree), which will be used to discover relational patterns at a later stage. Different from the traditional FP-tree, which works on a single database, the purpose of an HFP-tree is to find the set of frequent itemsets from the transactions in all databases. For this purpose, changes and extensions have been made accordingly. As one of the major changes, each node of the HFP-tree takes the form {x | y1:y2:...:yn}, where x is the name of the item stored at the current node (denoted by item_name), and y1, y2, ..., yn are the numbers of times that a particular itemset has appeared in databases D1, D2, ..., Dn respectively.

Take D1 and D2 in Table 1 as our example databases. Assuming they are joining together to construct an HFP-tree, each node in the tree will take the form {x | y1:y2}, with y1 and y2 denoting the numbers of times that the itemset, with items starting from the Root and ending at the current node x, has appeared in databases D1 and D2 respectively. If D1 and D2 are joining together to build a tree, they must agree, in advance, on the order of the items listed in the tree. Here, we assume D1 and D2 agreed to list their items according to the alphabetic order (we will discuss the generation of this list in Section 5.1). We also discard any threshold value at this stage, and therefore all items will be added into the HFP-tree. Given a transaction in D1, say Trans T1={a, b, d} where items have been sorted according to the alphabetic order, the HFP-tree construction starts from the first item, a, and checks whether any child node of the Root has the same item_name. Since we know the HFP-tree is empty at this stage, a is not a child of Root. As a result, we construct a new child node ϑ1 = {a | 1:0} for Root, which specifies that a is a child of Root, with a appearing once in D1 and zero times in D2. After that, we move to the second item b in T1, and check whether b is a child of the recently built node ϑ1 = {a | 1:0}. It is obvious that ϑ1 currently has no child, so we build another node ϑ2 = {b | 1:0}, and set this node as the child of ϑ1. It means that itemset {ab} has appeared once in D1 but zero times in D2.
Finally, we move to the third item d in T1. We find d is not a child of the recently built node ϑ2, so we build another new node ϑ3 = {d | 1:0} and set it as the child of ϑ2. Again, it means that itemset {abd} has appeared once in D1 but still zero times in D2. For any other transaction in D1 or D2, we repeat the same procedure. Take the third transaction in D2, T3={a, b, c}, as an example. We first check whether Root has any child node named a, and since we have previously constructed such a node, we know for sure that it does exist. Denoting this node by x, we increase x's frequency count for database D2 by 1; then we check whether x has any child node named b, i.e., the second item of T3. We increase the frequency count for D2 by 1 if such a node indeed exists; otherwise, we simply add a new member {b | 0:1} as the child node of x. We recursively repeat the above procedure until we finish the last item in T3. The constructed HFP-tree for D1 and D2 is shown in Figure 1.

To speed up the tree traversal, a header table is built for all items ever listed in the HFP-tree, as shown in Figure 1 (for a clearer presentation, we only list the header table for items a, c, e, and g). For each item, say g, its list records all the locations where g has ever appeared in the HFP-tree. The purpose of the header table is to facilitate access to the itemsets ending with the same item letter. For example, in Figure 1, if we want to find the sets ending with letter c, we may simply go through all records of c's header list, and at each location, tracking upwards to the Root will produce an itemset associated with item c. In Figure 2, we list detailed information on building an HFP-tree from multiple databases. But before we go any further, we would like to solve a particular issue raised by multiple databases.

[Figure 1: Example of the constructed HFP-tree for D1 and D2 in Table 1]

5.1 Joint Ranking List

In the above example, we assume that all parties participating in the HFP-tree construction use the same predefined item list (the alphabetic order of the items). In reality, the order of the list plays an important role in building a compact HFP-tree. Take a dataset containing four transactions {d}, {c, d}, {b, c, d}, and {a, b, c, d} as an example. A frequent pattern tree built by using the items' alphabetic order, i.e., a, b, c, and d, will have 10 interior nodes (excluding the Root). On the other hand, if items were previously ranked in the descending order of their frequency, i.e., d, c, b, and a,

the corresponding FP-tree will have 4 interior nodes only, which is about a 60% reduction in tree size. Reducing the tree size will eventually lead to dramatic time savings in building the pattern tree. To solve the problem, the original FP-tree algorithm [7] scans the database beforehand to produce the ranking list, and then uses this list to build the FP-tree. When several databases join together to build an HFP-tree, a simple solution is to use a predefined item list to build the HFP-tree. This, however, will significantly deteriorate the system performance, because any list that does not take item frequency into consideration will lead to an inferior solution and eventually raise the cost of tree construction. For this purpose, we propose a rank-join based ranking mechanism. Given M databases D1, D2, ..., DM for HFP-tree construction, assume I1, I2, ..., IN are the union of the items in the databases. For any database Di, we scan it and rank all items in Di in descending order of their frequency. Denoting by R_j^i the ranking order of item I_j in database Di (with the first item in the list denoted by 1), Eq. (1) represents the average ranking order of each item I_j. The final ranking list for all items is constructed by ranking R_j, j=1, ..., N, in ascending order, where items with the least average ranking are listed at the top.

    R_j = (1/M) * sum_{i=1..M} R_j^i                      (1)

The above mechanism joins the ranks of each item in all databases together to produce the final ranking. By doing so, we assume that the databases are equally weighted, and the rank in each database plays an equal role in deciding an item's final ranking. In reality, the sizes (numbers of transactions) of the databases involved in the query may vary significantly, and a database containing more transactions should carry more weight in deciding the final ranking of a particular item. For this purpose, we revise Eq. (1) by taking the size of each database into consideration. Assume S_i is the number of transactions in Di; then S = S_1 + S_2 + ... + S_M denotes the total number of transactions. The weighted average ranking order is then represented in Eq. (2).

    R_j = sum_{i=1..M} (S_i / S) * R_j^i                  (2)

5.2 HFP-tree Construction

Figure 2 lists the algorithm details of building an HFP-tree from multiple databases D1, ..., DM. We assume here that each database Di comes with a minimal support threshold for determining its frequent items. In the next section, we will explain the details on how to parse a user's query to generate such threshold values.

Figure 2. Hybrid Frequent Pattern tree construction
  Input: Databases D1, ..., DM, and their minimal support thresholds α1, ..., αM.
  Output: Hybrid Frequent Pattern tree, HFP-tree.
  1. Initialize an empty HFP-tree with node Root only.
  2. Scan each database D1, ..., DM once, and calculate the ranking order of each item I1, ..., IN in each single database (items with their support less than the corresponding threshold αi are eliminated).
  3. Use Eq. (2) to produce a joint ranking list L.
  4. For each transaction T_k in Di, sort the items in T_k according to the list L. Denote the sorted T_k by T'_k, with items in T'_k denoted by I1, ..., IK.
  5. ϑ ← Root; κ ← 1.
  6. For all children x of ϑ:
     a. If x.item_name = I_κ.item_name, increase the corresponding frequency count by 1 (the one corresponding to Di). ϑ ← x, κ ← κ+1. Repeat step 5 until κ=K.
     b. If no child of ϑ has item_name I_κ.item_name, create a new node y with y.item_name = I_κ.item_name. Initialize y's frequencies to zero except for Di, which is set to 1. Insert y as ϑ's child. ϑ ← y, κ ← κ+1. Repeat step 5 until κ=K.
  7. Repeat step 4 for all databases D1, ..., DM, then return the constructed HFP-tree.
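The following C++ sketch restates, under simplifying assumptions, the node layout and the insertion loop of Figure 2. Node counters are kept per database, transactions are assumed to be pre-sorted by the joint ranking list L, and the joint rank of Eq. (1) is computed as a plain average of per-database ranks (items missing from a database are simply skipped, which is only one possible convention). The class and function names are illustrative, not the names used in DRAMA's implementation.

    #include <algorithm>
    #include <cstddef>
    #include <map>
    #include <memory>
    #include <string>
    #include <utility>
    #include <vector>

    // One HFP-tree node: an item name plus one count per participating database,
    // i.e. the {x | y1:y2:...:yn} form described in Section 5.
    struct HFPNode {
        std::string item;
        std::vector<long> counts;                       // counts[i] = frequency in D_{i+1}
        std::map<std::string, std::unique_ptr<HFPNode>> children;
        HFPNode(std::string name, std::size_t numDbs)
            : item(std::move(name)), counts(numDbs, 0) {}
    };

    // Eq. (1): average the per-database ranks of each item (rank 1 = most frequent).
    // ranks[i] maps item -> rank in database D_{i+1}.
    std::vector<std::string> jointRankingList(
            const std::vector<std::map<std::string, int>>& ranks) {
        std::map<std::string, double> sum;
        std::map<std::string, int> seen;
        for (const auto& db : ranks)
            for (const auto& [item, r] : db) { sum[item] += r; ++seen[item]; }
        std::vector<std::pair<double, std::string>> avg;
        for (const auto& [item, s] : sum)
            avg.push_back({s / seen[item], item});
        std::sort(avg.begin(), avg.end());              // smallest average rank first
        std::vector<std::string> order;
        for (const auto& [r, item] : avg) order.push_back(item);
        return order;
    }

    // Steps 5-6 of Figure 2: walk or extend the path for one transaction that has
    // already been sorted according to the joint ranking list L, incrementing only
    // the counter of the database dbIndex the transaction came from.
    void insertTransaction(HFPNode& root, const std::vector<std::string>& sortedItems,
                           std::size_t dbIndex, std::size_t numDbs) {
        HFPNode* cur = &root;
        for (const std::string& item : sortedItems) {
            auto it = cur->children.find(item);
            if (it == cur->children.end())
                it = cur->children.emplace(item,
                         std::make_unique<HFPNode>(item, numDbs)).first;
            it->second->counts[dbIndex] += 1;           // count for this database only
            cur = it->second.get();
        }
    }

Because the per-database counts live on the shared path nodes, a single tree can later answer which database contributed how often to any prefix, which is what the header-table based HFP-growth of Section 6 relies on.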
6. Discovering Relational Patterns Using the HFP-Tree

6.1 User Query Decomposition

As we have discussed in Section 4, a user's query may involve multiple relationship factors and operators. When submitting such a complex query to the data mining model, it is often the case that not all parts of the query comply with the downward closure property, i.e., that the subsets of a frequent itemset must also be frequent. For example, the ≤ and < relationship factors normally do not comply with the downward closure property: even if a pattern in B, say {abc}, does not satisfy B ≤ β, its superset, say {abcd}, may still comply with B ≤ β. Therefore, the mining process must preprocess a user's query and explicitly decompose it into a set of subqueries which do comply with the downward closure property, such that the mining model can use these subqueries to facilitate the candidate pruning process. For this purpose, we list five properties here, and will use these properties to decompose each query before it is submitted to the data mining model. All decomposed subqueries (which comply with the downward closure property) are placed into a Down Closure (DC) subset, and meanwhile the original query is still kept to check a pattern's validity at the final stage.

Property 6.1.1. If a subquery has a single database and a threshold value α listed on the left and right side of the relationship factor ≥ or > respectively, then this subquery complies with the downward closure property.

This property is based directly on the Apriori rule in frequent itemset mining. If a pattern P's support in a database is less than a given threshold α, then any superset of P (the patterns growing from P) will also have its support less than α. Therefore, if a query involves multiple databases, factors ≥ or >, and a single threshold value α, we may decompose this query into a set of subqueries with each single database and the threshold value α listed on the left and right sides of the factor. For example, the query {A ≥ B ≥ C ≥ α} can be decomposed into three subqueries (A ≥ α), (B ≥ α), and (C ≥ α), and placed into the DC set. It is obvious that if a pattern P violates any one of these three subqueries, there is no way for P, or any of P's supersets, to be a qualified pattern. It is worth noting that the subqueries in the DC set are merely for pattern pruning purposes, and one should not use them to replace the original query. The original query will still be used to verify the patterns at the final stage (as we will discuss in the next subsection).

Property 6.1.2. If a subquery has the sum ('+') of multiple databases and a threshold value α listed on the left and right side of factor ≥ or > respectively, then this subquery complies with the downward closure property.

For example, a subquery like {(A+B+C) ≥ α} complies with the downward closure property, and can be directly put into the DC set. The proof of this property is trivial. Given a pattern P and any of its subpatterns Q, assuming P's and Q's supports in A, B and C are P1, P2, P3 and Q1, Q2, Q3 respectively, it is obvious that Q1 ≥ P1, Q2 ≥ P2, and Q3 ≥ P3. If (P1+P2+P3) ≥ α, then it is obvious that (Q1+Q2+Q3) ≥ (P1+P2+P3) ≥ α. Therefore, the property is true. This property states that if a subquery sums up multiple databases and is followed by factor ≥ or > and a threshold value α, then it should be placed into the DC set for pattern pruning.

Property 6.1.3. If a subquery has the support difference of two databases, say (A-B), and a threshold value α listed on the left and right side of factor ≥ or > respectively, then this subquery can be further transformed into a subquery like A ≥ α, which still complies with the downward closure property.

It is obvious that if (A-B) ≥ α, then A ≥ (B+α). Since a pattern's support in a database cannot be negative, we have A ≥ α.

Property 6.1.4. If a subquery has the absolute support difference of two databases, say |A-B|, and a threshold value α listed on the left and right side of factor ≥ or > respectively, then this query can be transformed into a subquery like {(A ≥ α) ∨ (B ≥ α)}, which still complies with the downward closure property.

It is obvious that if |A-B| ≥ α, then we have (A-B) ≥ α or (A-B) ≤ -α, which lead to the inequalities A ≥ (B+α) or B ≥ (A+α), i.e., {(A ≥ α) ∨ (B ≥ α)}. For any pattern P, if its supports in A and B are both less than α, there is no way for P's superset to have a support higher than α. Therefore, it still complies with the downward closure property.

Property 6.1.5. A subquery involving relationship factor ≤ or < will most likely not comply with the downward closure property, and therefore cannot be placed into the DC set.

With the above five properties, we can decompose most complex queries into a set of subqueries which comply with the downward closure property, and use the DC set to support efficient pattern pruning. For example, Query 3 in Example 1, {ME > (NH ∨ VT) > MA > (CT ∨ RI) ≥ α}, can be decomposed into a set of subqueries like ME ≥ α, (NH ∨ VT) ≥ α, MA ≥ α, and (CT ∨ RI) ≥ α, which will be used to check all candidates during the pattern growing process.
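A minimal sketch of how the decomposition properties above might be applied is given below. It only covers the four rewritable subquery shapes of Properties 6.1.1 through 6.1.4; the query representation, the operator names, and the string form of the output are assumptions for illustration rather than DRAMA's actual query parser.

    #include <string>
    #include <vector>

    // A toy subquery: an operator over database names and a lower bound,
    // e.g. {"AND", {"A","B","C"}, 0.3} for the query {A >= B >= C >= 0.3}.
    struct SubQuery {
        std::string op;                    // "AND", "SUM", "DIFF", "ABSDIFF"
        std::vector<std::string> dbs;      // participating databases
        double alpha;                      // threshold on the right of >= / >
    };

    // Decompose one subquery into down-closure-safe subqueries (the DC set).
    std::vector<std::string> toDCSet(const SubQuery& q) {
        std::vector<std::string> dc;
        if (q.op == "AND") {
            // Property 6.1.1: {A >= B >= C >= a} -> (A >= a), (B >= a), (C >= a).
            for (const auto& d : q.dbs)
                dc.push_back(d + " >= " + std::to_string(q.alpha));
        } else if (q.op == "SUM") {
            // Property 6.1.2: {(A + B + C) >= a} is itself down-closed.
            std::string lhs;
            for (const auto& d : q.dbs) lhs += (lhs.empty() ? "" : " + ") + d;
            dc.push_back("(" + lhs + ") >= " + std::to_string(q.alpha));
        } else if (q.op == "DIFF") {
            // Property 6.1.3: {(A - B) >= a} implies A >= a.
            dc.push_back(q.dbs[0] + " >= " + std::to_string(q.alpha));
        } else if (q.op == "ABSDIFF") {
            // Property 6.1.4: {|A - B| >= a} implies (A >= a) OR (B >= a).
            dc.push_back("(" + q.dbs[0] + " >= " + std::to_string(q.alpha) +
                         ") OR (" + q.dbs[1] + " >= " + std::to_string(q.alpha) + ")");
        }
        // Property 6.1.5: subqueries built on <= or < are not added to the DC set.
        return dc;
    }

As the surrounding text stresses, the strings produced here would serve pruning only; the original query still has to be re-checked against each surviving pattern at the final stage.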
6.2 Relational Pattern Discovery Using the HFP-tree

The construction of the HFP-tree ensures that the set of frequent itemsets for the transactions in all databases can be enclosed in a compact tree structure, but this does not automatically produce the relational patterns to meet our needs. In this subsection, we introduce the HFP-tree based mining process for discovering relational patterns. Figure 4 gives the pseudo code of the mining process, which mainly consists of two procedures: HFP-mining and HFP-growth. In the main procedure, HFP-mining, an input query Q is first decomposed into a set of subqueries (DC). Then the system recursively calls HFP-growth to discover relational patterns from the HFP-tree, where the DC set is used to prune out unnecessary candidates on the fly, and the query Q is used at the final stage to assert the validity of the patterns.

Given an HFP-tree built from the multiple databases, HFP-growth first checks each node a in the header table of the tree. Because the header table has recorded the locations where a has ever appeared in the tree, we can start from each of a's locations l_j and track upwards towards the Root, which will produce a hybrid prefix path HPP_j for a (w.r.t. the current location l_j). Figure 3 pictorially demonstrates the concept of a hybrid prefix path for item g of the HFP-tree in Figure 1 (for simplicity, we only show branches involving g).

[Figure 3: A running example of hybrid prefix paths and meta HFP-trees for item g: (a) the hybrid prefix paths of g; (b) the meta HFP-tree for T={g}; (c) the meta HFP-tree for T={gd}]

In Figure 3, g's header table has recorded six locations (denoted by the digital numbers 1 to 6). For each location, say location 1, tracking from g upwards towards the Root will produce a set {ecba}. We replace the support of each item in the set by the current support of g, which produces a path {e 1:0, c 1:0, b 1:0, a 1:0}, called a hybrid prefix path (HPP) for g. It is understandable that an HPP records the items (and their frequencies w.r.t. each database) which co-occur with g and have a higher rank than g in the list L. Parsing all the HPPs of g should produce the frequent itemsets associated with g (HFP-growth starts from the item with the lowest rank for pattern growth). For this purpose, for any item in the hybrid prefix paths of g, we sum up its frequencies (w.r.t. each database) from all locations, which directly indicates whether this item is frequently associated with g or not. For example, the other five hybrid prefix paths in Figure 3 are {d 1:1, b 1:1, a 1:1}, {f 0:1, d 0:1, c 0:1, a 0:1}, {d 1:0, c 1:0, a 1:0}, {f 1:0, d 1:0, a 1:0}, and {f 0:1, c 0:1}. The total frequencies of the items in g's hybrid prefix paths are Freq_g = {a 4:2, b 2:1, c 2:2, d 3:2, e 1:0, f 1:2}. Dividing all the frequency values by the total number of transactions in each database (|D1|=8 and |D2|=7) produces the support values of each item, Sup_g = {a 0.5:0.29, b 0.25:0.14, c 0.25:0.29, d 0.38:0.29, e 0.13:0, f 0.13:0.29}. Given a query Q={D1 ≥ D2 ≥ 0.25}, the query decomposition process will produce a DC set like DC={(D1 ≥ 0.25) AND (D2 ≥ 0.25)}. Comparing all items' support values in Sup_g with the DC set explicitly indicates that none of the items {b 0.25:0.14}, {e 0.13:0}, and {f 0.13:0.29} can form an itemset with g that satisfies the query Q. Therefore, we can prune out those unqualified items directly, with the filtered HPPs of g denoted by {c 1:0, a 1:0}, {d 1:1, a 1:1}, {d 0:1, c 0:1, a 0:1}, {d 1:0, c 1:0, a 1:0}, {d 1:0, a 1:0}, and {c 0:1}. After that, we take each filtered HPP as a meta-transaction, and build a meta HFP-tree for g, as shown in Figure 3(b).

At any stage, if a meta HFP-tree, hfp', has more than one path, we have to recursively call the HFP-growth procedure to check each node in the header table of hfp', and build a meta HFP-tree for the node. The mining process recursively calls the

HFP-growth procedure, until the meta HFP-tree eventually contains only one path. In Figure 3(b), because the meta HFP-tree of g, hfp_g, contains more than one path, we recursively call HFP-growth to build a meta HFP-tree for each of the nodes in hfp_g (i.e., item d). For this purpose, HFP-growth pushes the current item g into a base set T={g 4:3} (which records the frequent items so far), and conducts recursive pattern growth. The recursive HFP-growth process will eventually lead to a meta HFP-tree containing one or zero paths. At this stage, there is no need to grow patterns any further; instead, we can directly produce patterns by enumerating all the combinations of the nodes in the tree and appending any of the combinations to the underlying base set T to generate a pattern P, as indicated on line e of the HFP-growth procedure (Figure 4). Meanwhile, P's final supports are the minimal supports of all involved items (w.r.t. each database), i.e.,

    P_Sup = min_{k=1..K}{Sup^1_{P[k]}} : min_{k=1..K}{Sup^2_{P[k]}},

where Sup^i_{P[k]} means the support value of the k-th item in P (w.r.t. database Di) and K is the number of items in P. For example, Figure 3(c) shows a one-path (actually one-node) meta HFP-tree built for the base set T={g 4:3, d 3:2}. Appending this only node to the current base set T produces a pattern P = T ∪ {a 3:2}. The final supports of P are the minimal support values of the items in P, which are {3:2}, i.e., P_Sup = {0.38:0.28}.

As we have analyzed in Section 6.1, the DC set is not equivalent to the original query, but rather serves pattern pruning purposes only. Therefore, a pattern P which is generated by using the downward closure rules in the DC set does not necessarily comply with the original query Q. A validity check must be conducted to assert whether P indeed complies with the query Q or not. This can be easily achieved by comparing pattern P's support P_Sup with the original query Q. It is obvious that the supports of P={g 4:3, d 3:2, a 3:2} are P_Sup={0.38:0.28}, which satisfy Q={D1 ≥ D2 ≥ 0.25}, so pattern P is eventually appended to the relational-pattern set RP.

Figure 4. Relational-pattern mining using the HFP-tree
  Input: an HFP-tree hfp built from M databases, ranking list L, and the original query Q
  Output: Relational-pattern set, RP

  Procedure HFP-Mining (HFP-tree, Q)
  1. Down Closure Set (DC) ← Query-Decomposition (Q)
  2. RP ← {}, T ← {}
  3. HFP-growth (HFP-tree, T, RP, DC, Q, L)

  Procedure HFP-growth (hfp, T, RP, DC, Q, L)
  For each node n in the header table of hfp (in inverse order of the ranking list L)
    a. S ← {}; T ← T ∪ n. The supports of T are the minimal support values of all the nodes in T (w.r.t. each database)
    b. For each of n's locations a_j in the header table of hfp:
       1. Build a hybrid prefix path, HPP_j, for a_j
       2. S ← S ∪ HPP_j
    c. Prune items in S based on the downward closure rules in the DC set
    d. Build a meta HFP-tree, hfp', based on the remaining items in S and the ranking list L
    e. If hfp' contains a single path PS:
       1. For each combination (denoted by π) of the nodes in the path PS:
          Generate pattern P ← T ∪ π; the supports of P are the minimal support values of the nodes in π (w.r.t. each database), P_Sup = min_{k=1..K}{Sup^1_{P[k]}} : ... : min_{k=1..K}{Sup^M_{P[k]}}, where Sup^i_{P[k]} means the support value of the k-th item in P (w.r.t. database Di) and K is the number of items in P
          Check whether P complies with the query Q; if it does, RP ← RP ∪ P
    f. Else:
       1. HFP-growth (hfp', T, RP, DC, Q, L)
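To illustrate the final step of HFP-growth (line e of Figure 4), the sketch below derives a generated pattern's per-database supports as the minimum over its items' supports and then re-checks the original query Q. The item names and numbers loosely follow the running example for T={g, d, a}, and the query lambda and helper struct are assumptions for illustration only.

    #include <algorithm>
    #include <cstdio>
    #include <vector>

    // Per-database support values of one item inside a pattern being grown.
    struct ItemSupport {
        const char* item;
        std::vector<double> sup;     // sup[i] = support w.r.t. database D_{i+1}
    };

    // P_Sup = min_k Sup^1_{P[k]} : min_k Sup^2_{P[k]} : ... as used on line e
    // of the HFP-growth procedure in Figure 4.
    std::vector<double> patternSupport(const std::vector<ItemSupport>& items) {
        std::vector<double> mins(items.front().sup.size(), 1.0);
        for (const ItemSupport& it : items)
            for (std::size_t d = 0; d < mins.size(); ++d)
                mins[d] = std::min(mins[d], it.sup[d]);
        return mins;
    }

    int main() {
        // Supports of items g, d, a in D1 and D2, roughly following the running
        // example of Section 6.2 (values are illustrative).
        std::vector<ItemSupport> pattern = {
            {"g", {0.50, 0.43}}, {"d", {0.38, 0.29}}, {"a", {0.38, 0.28}}};

        std::vector<double> pSup = patternSupport(pattern);   // -> {0.38, 0.28}

        // Original query Q = {D1 >= D2 >= 0.25}: checked only at this final stage,
        // because the DC set used for pruning is weaker than Q itself.
        auto Q = [](const std::vector<double>& s) {
            return s[0] >= s[1] && s[1] >= 0.25;
        };
        std::printf("pattern {g,d,a} supports %.2f:%.2f, satisfies Q: %d\n",
                    pSup[0], pSup[1], Q(pSup));
        return 0;
    }

The separation shown here, a cheap minimum computation per candidate followed by a single evaluation of the full query, is what lets DRAMA avoid rescanning any database during verification.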
7. Experimental Evaluation

In this section, we report experimental evaluations and a comparative study with two simple-solution based relational-pattern discovery mechanisms. Our test datasets are collected from two sources: (1) synthetic databases generated by using the IBM Quest data generator [8][24]; and (2) the IPUMS (Integrated Public Use Microdata Series) 2000 USA census micro-data with a 1% sampling rate [25]. All experiments are performed on a 2.0 GHz Pentium PC machine with 512MB main memory. All the programs are written in C++, with the integration of an STL-like C++ tree class [26] to fulfill the tree construction and access. Although it is possible for DRAMA to reuse a previously constructed HFP-tree to answer multiple queries, for fairness in comparison, DRAMA initiates HFP-tree construction and HFP-mining for each query. In the following tables and figures, unless specified otherwise, the runtime always means the total execution time, i.e., the tree construction time plus the mining time.

For a comparative study, we implement two simple solutions, SPV and CPM, as discussed in Section 2. While SPV sequentially mines and verifies patterns from each database, CPM generates candidates from each component database, and refers to the collaborative mining process for candidate pruning. For SPV, we use the FP-tree instead of the Apriori algorithm to mine patterns from the first database. Because CPM needs candidates generated at each single database for collaborative mining, we apply the traditional Apriori algorithm on each database. The runtime of CPM is the pattern mining time of the database with the largest time expense plus the time for collaborative mining and pattern verification.

Because real-world databases can vary significantly in size, we generate four synthetic databases with different sizes, as shown in Table 2. The explanations of the database description can be found in [8]. In short, T10.I6.D300k.N1000.L2000 means a database with 300,000 transactions and 1,000 items, where each transaction contains 10 items, and each pattern contains 6 items on average. It is understandable that the runtime of the systems crucially relies on the underlying queries. For an objective assessment, we define five queries, as shown in Table 3, and will demonstrate the average system runtime performances in answering these queries.

Table 2. Synthetic database characteristics

  Database   Database description
  D1         T10.I6.D300k.N1000.L2000
  D2         T10.I6.D200k.N1000.L2000
  D3         T10.I6.D100k.N1000.L2000
  D4         T10.I6.D50k.N1000.L2000

Table 3. Query plan description

  Query   Query constraints
  Q1      {D1 ≥ D2 ≥ D3 ≥ α}
  Q2      {(D1 + D2) ≥ α} ∨ {(D3 + D4) ≥ α}
  Q3      {(D1 ∨ D2) ≥ (D3 ∨ D4) ≥ α}
  Q4      {D1 ≥ (D2 ∨ D3) ≥ α} & {D4 ≤ β}
  Q5      {|D1 - D2| ≥ (D3 + D4) ≥ α}

7.1 HFP-tree Construction Results

In Section 5.1 we proposed a joint ranking list which ranks items from different databases for HFP-tree construction. We report in this section the performance of this ranking mechanism in facilitating the tree construction and pattern growth processes. We apply Q1 in Table 3 to the synthetic databases, and use both the joint ranking list and a fixed ranking list to build HFP-trees. We report the results in Figure 6, where Figure 6(a) shows the comparison of the HFP-tree construction time, Figure 6(b) the comparison of the total number of HFP-tree interior nodes, and Figure 6(c) the comparison of the HFP-growth time. In all figures, the x-axis denotes the support threshold α in Q1, and the y-axis denotes the results of the different measures. The meaning of each curve in Figure 6 is explained in Figure 5.

As shown in Figure 6(a), the proposed joint ranking list can dramatically reduce the time for building an HFP-tree from multiple databases, and the lower the support threshold α, the more significant the observed improvement. When α=1%, it costs the fixed ranking list and the joint ranking list about 98 seconds and 60 seconds respectively to build the HFP-tree; on the other hand, when α becomes significantly low, say 0.01%, the cost of the joint ranking list increases to about 98.5 seconds, which is about 3.5 times less than the time of the fixed ranking list (364.8 seconds). A low α value makes most items in the database frequent, and they are therefore added into the HFP-tree. This can be very time consuming if the inserting process does not take item frequency information into consideration, because each item needs to check against the existing HFP-tree to find whether the current path already contains this item or not. The more frequent items there are, the fatter the HFP-tree, and the more time is spent on this process. On the other hand, a ranking order which unifies the item frequency information from all databases can significantly reduce the time for inserting each transaction into the HFP-tree, because each item a will have a smaller search space when verifying whether the current node (of the HFP-tree) already contains a or not. In addition, since the joint ranking list has items sorted by their frequencies before they are inserted into the HFP-tree, it has a better chance, compared to the fixed ranking list, of forcing the items of a frequent itemset to follow a single path, and consequently reduces the size of the constructed HFP-tree. As shown in Figure 6(b), the interior node number of the HFP-tree built from the joint ranking list is about 12% to 20% less than that of the tree built from the fixed ranking list. Because of the HFP-tree quality improvement (more compact and fewer interior nodes), the HFP-growth process consequently grows faster in finding frequent patterns, as shown in Figure 6(c).

Since the joint ranking list unifies the ranking order of each item from different databases, one may ask why we don't just treat all items as if they were from one single database, e.g., D=D1+D2+D3, and then rank the items according to their total frequencies (with infrequent items in each database removed beforehand), just like the traditional FP-tree method does. However, such a global ranking list views items as if they come from a single database without considering their frequencies in each single database, which may produce a list inferior to the one from the joint ranking list. For example, suppose the frequencies of items {a, b, c} in D1 and D2 are {3000, 1000, 900} and {100, 1000, 1000} respectively. The global ranking list will sum up each item's frequency and produce the list L=abc; on the other hand, the joint ranking list will produce the list L=bac.
Considering that the most probable frequent itemsets in D1 and D2 are {bc} rather than {ac} or {ab}, the joint ranking list may lead to better results in reality. Figure 6(a) also reports the HFP-tree construction time of the global ranking list, which further supports our analysis. The HFP-growth on the tree built from the global ranking list also needs more time than the one built from the joint ranking list, and we therefore omit the results for this mechanism in Figures 6(b) and 6(c).

[Figure 5: The meanings of the curves in Figure 6 (joint ranking list, fixed ranking list, global ranking list)]

[Figure 6: HFP-tree construction comparisons on Query Q1 in Table 3, with the x-axis giving the support threshold (%): (a) the HFP-tree construction time; (b) the total number of HFP-tree interior nodes; and (c) the HFP-growth time]

7.2 Query Runtime Comparison

Figure 7 reports a detailed runtime performance comparison between DRAMA and the two simple solutions (SPV and CPM) on Q1 in Table 3, where the x-axis denotes the support threshold value α and the y-axis represents the system runtime in seconds. For a detailed comparison, we also list the actual value of each method in the figure. When the threshold value is relatively small, say 0.05% or 0.01%, the runtimes of SPV and CPM are extremely large, which makes no sense for comparison (the empty cells). Overall, DRAMA responds linearly to the threshold value α and does an excellent job in answering the query Q1. When the value of α is larger than 1.5%, we notice that DRAMA is inferior to both SPV and CPM. A further study shows that for large α values, the time for HFP-tree construction becomes significant,

compared to the time for HFP-growth. For example, when α=1.5%, DRAMA spends about 68 seconds on building the HFP-tree; however, it only costs about 9 seconds for the HFP-growth to mine the patterns. At this support level, SPV applies an FP-tree based algorithm on D1, which outputs only 96 patterns for D2 to verify. So the performance of SPV at α=1.5% is really just the runtime of the FP-tree mining on D1. On the other hand, when the threshold value decreases, the number of patterns generated from D1 can increase significantly, which leads to a huge runtime expense for D2 to verify these patterns (notice that database scanning for pattern verification can be very expensive, especially for large databases). For example, when α=0.1%, D1 generates about eighty thousand patterns which need to be verified by D2, among which about ten thousand patterns further need to be verified by D3. As shown in Figure 7, the sequential verification mechanism of SPV needs more than ten thousand seconds to check all those patterns. For DRAMA, although the tree construction at this level (α=0.1%) costs about 96 seconds, the integrated pattern pruning mechanism significantly reduces the HFP-growth time to only about 10 seconds. So in total, DRAMA can answer Q1 in about 106 seconds, which is a huge improvement compared to SPV.

Although our analysis in Section 2 suggests that Collaborative Pattern Mining (CPM) might outperform SPV because of its underlying collaborative candidate-pruning process, the results in Figure 7 indicate that this is not the case. Because CPM needs multiple databases to forward their candidates to a central place for collaborative mining (by pruning unqualified candidates), we can only apply Apriori on each single database. So the system performance of CPM is crucially bounded by the poor performance of Apriori based algorithms. When the support value α is large, say 1%, the performance of Apriori and the FP-tree is almost identical (since not many items can be frequent). However, for small α values, the situation can be totally different. For example, when α=0.1%, about 680 items in D1 are frequent, which produces more than 230 thousand length-2 candidate patterns from D1 (although collaborative pattern pruning can remove some candidates, it still leaves a large number of candidates for D2 to evaluate). This huge burden significantly slows down the performance of CPM, and makes it almost unbearable for answering many queries.

However, being worse than SPV does not necessarily mean CPM is useless. As we have analyzed in Section 2, some queries like Q2 in Table 3 cannot be answered by SPV, because no mining from a single database can produce answers for Q2. For such situations, CPM becomes useful. To answer a query like Q2, we need a mining process which is able to unify multiple databases into one view. Both DRAMA and CPM can attain this by using their collaborative mining and pattern pruning processes, where only patterns with their supports satisfying (D1 + D2) ≥ α or (D3 + D4) ≥ α are kept for further actions. For DRAMA, instead of pre-filtering single infrequent items before the HFP-tree construction, we build an HFP-tree using all items in the transactions, and then let HFP-growth prune out the candidates on the fly. This mechanism turns out to be very efficient in reality, as the HFP-tree construction in this case takes only 105 seconds (which is about 7 seconds more than at α=0.01%). As shown in Table 4 (where the value of α is fixed to 0.5%), the runtime performance of DRAMA is much better than that of CPM in answering Q2.
Table 4 further lists a runtime comparison between DRAMA and CPM in answering the other queries in Table 3, with the performance of DRAMA consistently and significantly better than that of CPM for all the queries.

[Figure 7: Query runtime comparison on Q1 in Table 3 (system runtime in seconds vs. support threshold %, for DRAMA, SPV, and CPM)]

[Table 4: Query runtime comparison between DRAMA and CPM on Q2, Q3, Q4, and Q5 in Table 3 (α=0.5%, β=0.01%)]

7.3 Case Study on a Real-world Dataset

To further assess the system performance of our proposed effort on real-world datasets, we download the US 2000 census microdata from the IPUMS [25], which provides census information about US residents (individuals and households). We use a 1% sample of the year 2000 census data with forty-seven attributes. Those attributes cover age, household/personal income, education, race, citizenship, poverty, and family relationship, etc. Because many attributes contain multiple attribute values, and some attributes are numerical, we further discretize each continuous attribute and extend the total set of attribute values to 587 distinct items. We intentionally collect the data from four states (California, New York, Florida, and Vermont), corresponding to datasets CA, NY, FL, and VT. Depending on the population of each state, the size of the dataset varies from 6000 records (Vermont) to over 330,000 records (California).

Table 5 reports a runtime performance comparison among DRAMA, CPM, and SPV, with dataset settings D1=CA, D2=NY, D3=VT, and D4=FL, and two sets of support threshold values. Because SPV is not able to answer Q2 and Q5, its results in the corresponding cells are set to N/A. Because census data are not randomly generated (like the synthetic data), item frequencies are not random, with many items' frequencies significantly higher than others. So the support threshold values (α and β) we choose are relatively high. But even so, we can see that DRAMA consistently outperforms both SPV and CPM with a significant runtime improvement. The results in Table 5 indicate that, different from the synthetic data, CPM actually performs much better than SPV in answering some queries. In fact, when α=40% and β=5%, although it costs FP-tree mining about 5 seconds to mine patterns from D1, there are over ten thousand patterns generated from D1, with the longest pattern containing 3 items. All these patterns need to be verified by D2, which increases the runtime significantly. On the other hand, at the