LASH: Large-Scale Sequence Mining with Hierarchies Kaustubh Beedkar and Rainer Gemulla Data and Web Science Group University of Mannheim June 2 nd, 2015 SIGMOD 2015 Kaustubh Beedkar and Rainer Gemulla LASH SIGMOD 2015 June 02, 2015 1
Syntactic Explorer (Verb to Verb Noun) Kaustubh Beedkar and Rainer Gemulla LASH SIGMOD 2015 June 02, 2015 2
Sequence Mining Goal: Discover subsequences as patterns in sequence data Input: Collection of sequences of items, e.g., Text collection (sequence of words) Customer transactions (sequence of products) Kaustubh Beedkar and Rainer Gemulla LASH SIGMOD 2015 June 02, 2015 3
Sequence Mining Goal: Discover subsequences as patterns in sequence data Input: Collection of sequences of items, e.g., Text collection (sequence of words) Customer transactions (sequence of products) Output: subsequences that occur in σ input sequences (frequency threshold) have length at most λ (length threshold) have gap γ (contiguous subsequences or non-contiguous subsequences) Kaustubh Beedkar and Rainer Gemulla LASH SIGMOD 2015 June 02, 2015 3
Sequence Mining Goal: Discover subsequences as patterns in sequence data Input: Collection of sequences of items, e.g., Text collection (sequence of words) Customer transactions (sequence of products) Output: subsequences that occur in σ input sequences (frequency threshold) have length at most λ (length threshold) have gap γ (contiguous subsequences or non-contiguous subsequences) Example: S 1 : Anna lives in Melbourne S 2 : Bob lives in the city of Berlin S 3 : Charlie likes London Kaustubh Beedkar and Rainer Gemulla LASH SIGMOD 2015 June 02, 2015 3
Sequence Mining Goal: Discover subsequences as patterns in sequence data Input: Collection of sequences of items, e.g., Text collection (sequence of words) Customer transactions (sequence of products) Output: subsequences that occur in σ input sequences (frequency threshold) have length at most λ (length threshold) have gap γ (contiguous subsequences or non-contiguous subsequences) Example: S 1 : Anna lives in Melbourne S 2 : Bob lives in the city of Berlin S 3 : Charlie likes London Subsequence: lives in σ = 2, λ = 2, γ = 0 Kaustubh Beedkar and Rainer Gemulla LASH SIGMOD 2015 June 02, 2015 3
Hierarchies Items can be naturally arranged in a hierarchy, e.g., Kaustubh Beedkar and Rainer Gemulla LASH SIGMOD 2015 June 02, 2015 4
Hierarchies Items can be naturally arranged in a hierarchy, e.g., DET a an the Syntactic hierarchy Kaustubh Beedkar and Rainer Gemulla LASH SIGMOD 2015 June 02, 2015 4
Hierarchies Items can be naturally arranged in a hierarchy, e.g., DET a an the PERSON CITY Syntactic hierarchy Scientist Politician Melbourne... Albert Einstein...... Barack Obama Semantic hierarchy Kaustubh Beedkar and Rainer Gemulla LASH SIGMOD 2015 June 02, 2015 4
Hierarchies Items can be naturally arranged in a hierarchy, e.g., DET a an the PERSON CITY Syntactic hierarchy Scientist Politician Melbourne... Albert Einstein...... Barack Obama Photography Semantic hierarchy DSLR Camera Tripod Cannon5D Nikon5100 Product hierarchy... Kaustubh Beedkar and Rainer Gemulla LASH SIGMOD 2015 June 02, 2015 4
Sequence Mining with Hierarchies Item hierarchies are specifically taken into account Discover non-trivial patterns Kaustubh Beedkar and Rainer Gemulla LASH SIGMOD 2015 June 02, 2015 5
Sequence Mining with Hierarchies Item hierarchies are specifically taken into account Discover non-trivial patterns Example S 1 : Anna lives in Melbourne S 2 : Bob lives in the city of Berlin S 3 : Charlie likes London Kaustubh Beedkar and Rainer Gemulla LASH SIGMOD 2015 June 02, 2015 5
Sequence Mining with Hierarchies Item hierarchies are specifically taken into account Discover non-trivial patterns Example S 1 : Anna lives in Melbourne S 2 : Bob lives in the city of Berlin S 3 : Charlie likes London PERSON CITY Anna Bob Charlie Melbourne Berlin Semantic hierarchy London Kaustubh Beedkar and Rainer Gemulla LASH SIGMOD 2015 June 02, 2015 5
Sequence Mining with Hierarchies Item hierarchies are specifically taken into account Discover non-trivial patterns Example S 1 : Anna lives in Melbourne S 2 : Bob lives in the city of Berlin S 3 : Charlie likes London Generalized subsequence: PERSON lives in CITY σ = 2, λ = 4, γ = 3 PERSON CITY Anna Bob Charlie Melbourne Berlin Semantic hierarchy London Kaustubh Beedkar and Rainer Gemulla LASH SIGMOD 2015 June 02, 2015 5
Sequence Mining with Hierarchies Applications Linguistic patterns, e.g., read DET book NNP lives in NNP Information extraction, e.g., PERSON lives in CITY Market-basket analysis, e.g, buy DSLR camera photography book flash Web-usage mining... Kaustubh Beedkar and Rainer Gemulla LASH SIGMOD 2015 June 02, 2015 6
LASH Distributed framework for sequence mining with hierarchies D H Built over MapReduce for large-scale data processing Hierarchy-aware item-based partitioning Map (Partitioning) Divide data into potentially overlapping partitions D 1 H 1 D 2 H 2... D n H n Local mining Local mining Local mining F 1 F 2... F n Reduce (mining) Partitions are mined independently No global post-processing F Kaustubh Beedkar and Rainer Gemulla LASH SIGMOD 2015 June 02, 2015 7
Outline 1 Introduction 2 Partitioning 3 Local Mining 4 Evaluation 5 Conclusion Kaustubh Beedkar and Rainer Gemulla LASH SIGMOD 2015 June 02, 2015 8
Item-based Partitioning D H Hierarchy-aware item-based partitioning a b k D 1 H 1 D 2 H 2... D n H n Local mining Local mining Local mining F 1 F 2... F n F a: Filter a but not b,...,k F b : Filter b but not c,...,k F F k : Filter k Kaustubh Beedkar and Rainer Gemulla LASH SIGMOD 2015 June 02, 2015 9
Item-based Partitioning Items are ordered by decreasing frequency, e.g., a < b < c < < k D H Hierarchy-aware item-based partitioning a b k D 1 H 1 D 2 H 2... D n H n Local mining Local mining Local mining F 1 F 2... F n F a: Filter a but not b,...,k F b : Filter b but not c,...,k F F k : Filter k Kaustubh Beedkar and Rainer Gemulla LASH SIGMOD 2015 June 02, 2015 9
Item-based Partitioning Items are ordered by decreasing frequency, e.g., a < b < c < < k Create a partition for each frequent item called pivot item D H Hierarchy-aware item-based partitioning a b k D 1 H 1 D 2 H 2... D n H n Local mining Local mining Local mining F 1 F 2... F n F a: Filter a but not b,...,k F b : Filter b but not c,...,k F F k : Filter k Kaustubh Beedkar and Rainer Gemulla LASH SIGMOD 2015 June 02, 2015 9
Item-based Partitioning Items are ordered by decreasing frequency, e.g., a < b < c < < k Create a partition for each frequent item called pivot item Key idea: partition the output space a }{{} F a }{{} F b < b < c < < k }{{} F c } {{ } F k D H Hierarchy-aware item-based partitioning a b k D 1 H 1 D 2 H 2... D n H n Local mining Local mining Local mining F 1 F 2... F n F a: Filter a but not b,...,k F b : Filter b but not c,...,k F F k : Filter k Kaustubh Beedkar and Rainer Gemulla LASH SIGMOD 2015 June 02, 2015 9
Item-based Partitioning Items are ordered by decreasing frequency, e.g., a < b < c < < k Create a partition for each frequent item called pivot item Key idea: partition the output space a }{{} F a }{{} F b < b < c < < k }{{} F c } {{ } F k Rewrite D for each pivot item Reduces communication Reduces computation Reduces skew D H Hierarchy-aware item-based partitioning a b k D 1 H 1 D 2 H 2... D n H n Local mining Local mining Local mining F 1 F 2... F n F a: Filter a but not b,...,k F b : Filter b but not c,...,k F F k : Filter k Kaustubh Beedkar and Rainer Gemulla LASH SIGMOD 2015 June 02, 2015 9
Item-based Partitioning Example (σ = 2, γ = 3, λ = 4) S 1 : Anna lives in Melbourne S 2 : Bob lives in the city of Berlin S 3 : Charlie likes London PERSON CITY Anna Bob Charlie Melbourne Berlin London Semantic hierarchy Kaustubh Beedkar and Rainer Gemulla LASH SIGMOD 2015 June 02, 2015 10
Item-based Partitioning Example (σ = 2, γ = 3, λ = 4) S 1 : Anna lives in Melbourne S 2 : Bob lives in the city of Berlin S 3 : Charlie likes London PERSON CITY Anna Bob Charlie Melbourne Berlin London Semantic hierarchy PERSON < CITY < in < lives Kaustubh Beedkar and Rainer Gemulla LASH SIGMOD 2015 June 02, 2015 10
Item-based Partitioning Example (σ = 2, γ = 3, λ = 4) S 1 : Anna lives in Melbourne S 2 : Bob lives in the city of Berlin S 3 : Charlie likes London PERSON : 3 PERSON PERSON CITY Anna Bob Charlie Melbourne Berlin London Semantic hierarchy PERSON < CITY < in < lives Kaustubh Beedkar and Rainer Gemulla LASH SIGMOD 2015 June 02, 2015 10
Item-based Partitioning Example (σ = 2, γ = 3, λ = 4) S 1 : Anna lives in Melbourne S 2 : Bob lives in the city of Berlin S 3 : Charlie likes London PERSON CITY PERSON : 3 PERSON PERSON _ 2 CITY : 1 PERSON _ CITY : 1 CITY Anna Bob Charlie Melbourne Berlin London Semantic hierarchy PERSON < CITY < in < lives Kaustubh Beedkar and Rainer Gemulla LASH SIGMOD 2015 June 02, 2015 10
Item-based Partitioning Example (σ = 2, γ = 3, λ = 4) S 1 : Anna lives in Melbourne S 2 : Bob lives in the city of Berlin S 3 : Charlie likes London PERSON CITY PERSON : 3 PERSON PERSON _ 2 CITY : 1 PERSON _ CITY : 1 CITY Anna Bob Charlie Melbourne Berlin Semantic hierarchy London PERSON _ in CITY : 1 PERSON _ in _ 3 CITY : 1 in PERSON < CITY < in < lives Kaustubh Beedkar and Rainer Gemulla LASH SIGMOD 2015 June 02, 2015 10
Item-based Partitioning Example (σ = 2, γ = 3, λ = 4) S 1 : Anna lives in Melbourne S 2 : Bob lives in the city of Berlin S 3 : Charlie likes London PERSON CITY PERSON : 3 PERSON PERSON _ 2 CITY : 1 PERSON _ CITY : 1 CITY Anna Bob Charlie Melbourne Berlin Semantic hierarchy London PERSON _ in CITY : 1 PERSON _ in _ 3 CITY : 1 in PERSON < CITY < in < lives PERSON lives in CITY : 1 PERSON lives in _ 3 CITY : 1 lives Kaustubh Beedkar and Rainer Gemulla LASH SIGMOD 2015 June 02, 2015 10
Outline 1 Introduction 2 Partitioning 3 Local Mining 4 Evaluation 5 Conclusion Kaustubh Beedkar and Rainer Gemulla LASH SIGMOD 2015 June 02, 2015 11
Local Mining Goal: Compute pivot sequences a }{{} F a }{{} F b < b < c < < k }{{} F c } {{ } F k D H Hierarchy-aware item-based partitioning a b k D 1 H 1 D 2 H 2... D n H n Local mining Local mining Local mining F 1 F 2... F n F a: Filter a but not b,...,k F b : Filter b but not c,...,k F F k : Filter k Kaustubh Beedkar and Rainer Gemulla LASH SIGMOD 2015 June 02, 2015 12
Local Mining Traditional approach Use any mining algorithm (based on depth-first or breadth-first search) Filter out non-pivot sequences Example: depth-first search Pivot item: e a b c d e aa ab ac ae bd ba be cd ce da db dc ea eb ec ed ee abd abe acd ace aee aea aeb aec aed dab dac dae ebd aecd daec Kaustubh Beedkar and Rainer Gemulla LASH SIGMOD 2015 June 02, 2015 13
Local Mining Pivot sequence miner (PSM) Mines only pivot sequences Start with the pivot item Right expansions Left expansions Optimized search space exploration Example: PSM search space Pivot item: e e be ce ae ee ea ec eb ed bae dae aeb ebd Kaustubh Beedkar and Rainer Gemulla LASH SIGMOD 2015 June 02, 2015 14
Outline 1 Introduction 2 Partitioning 3 Local Mining 4 Evaluation 5 Conclusion Kaustubh Beedkar and Rainer Gemulla LASH SIGMOD 2015 June 02, 2015 15
Overall Runtime The New York Times Corpus 50M sequences, 1B items of which 2.7M distinct Syntactic hierarchy (word lowercase lemma POS tag) 10 node hadoop cluster Total time (in seconds) 100 1000 10000 Naive Semi naive LASH P(1000,0,3) P(100,0,3) P(100,0,5) CLP(100,0,5) NYT Hierarchy (σ,γ,λ) > 12 hrs LASH is multiple orders of magnitude faster Kaustubh Beedkar and Rainer Gemulla LASH SIGMOD 2015 June 02, 2015 16
Local Mining Mining time (in seconds) 100 1000 10000 BFS DFS PSM PSM + Index LP(1000,0,5) LP(100,0,5) CLP(100,0,5) CLP(100,0,7) NYT Hierarchy (σ,γ,λ) Insufficient memory PSM is effective, more than 3 faster Kaustubh Beedkar and Rainer Gemulla LASH SIGMOD 2015 June 02, 2015 17
Scalability Total time (seconds) 0 500 1000 1500 2000 2500 Map Shuffle Reduce Total time (seconds) 0 100 300 500 700 Map Shuffle Reduce 2 4 8 2(25%) 4(50%) 8(100%) Number of machines Number of machines (% of data) (a) Strong Scalability (b) Weak Scalability Good strong and weak scalability Kaustubh Beedkar and Rainer Gemulla LASH SIGMOD 2015 June 02, 2015 18
Outline 1 Introduction 2 Partitioning 3 Local Mining 4 Evaluation 5 Conclusion Kaustubh Beedkar and Rainer Gemulla LASH SIGMOD 2015 June 02, 2015 19
Summary and Contributions Sequence mining with hierarchies is an important problem Enables mining non-trivial patterns Kaustubh Beedkar and Rainer Gemulla LASH SIGMOD 2015 June 02, 2015 20
Summary and Contributions Sequence mining with hierarchies is an important problem Enables mining non-trivial patterns LASH: LArge-scale Sequence mining with Hierarchies Novel hierarchy-aware form of item-based partitioning Efficient special-purpose algorithm for mining each partition Kaustubh Beedkar and Rainer Gemulla LASH SIGMOD 2015 June 02, 2015 20
Summary and Contributions Sequence mining with hierarchies is an important problem Enables mining non-trivial patterns LASH: LArge-scale Sequence mining with Hierarchies Novel hierarchy-aware form of item-based partitioning Efficient special-purpose algorithm for mining each partition First distributed, scalable algorithm to mine such sequences Kaustubh Beedkar and Rainer Gemulla LASH SIGMOD 2015 June 02, 2015 20
Summary and Contributions Sequence mining with hierarchies is an important problem Enables mining non-trivial patterns LASH: LArge-scale Sequence mining with Hierarchies Novel hierarchy-aware form of item-based partitioning Efficient special-purpose algorithm for mining each partition First distributed, scalable algorithm to mine such sequences Thank you! Questions? / Comments Kaustubh Beedkar and Rainer Gemulla LASH SIGMOD 2015 June 02, 2015 20