Adaptive Parallel Compressed Event Matching

1 Adaptive Parallel Compressed Event Matching
Mohammad Sadoghi (1,2), Hans-Arno Jacobsen (2)
(1) IBM T.J. Watson Research Center; (2) Middleware Systems Research Group, University of Toronto
April 2014
Mohammad Sadoghi (IBM T.J. Watson) Parallel Matching April / 27

2 Outline
1. Event Matching
2. BE-Tree (Boolean Expression-Tree) Background
3. Parallel BE-Tree
4. Experimental Analysis
5. Conclusions & Future Work

3 Computational Advertising (A Billion-dollar Industry)
[Figure: advertisers (Sears, Sony, Amazon) run advertising campaigns, which are registered in the indexing kernel as advertiser subscriptions (modeled as Boolean expressions). Online users generate events from user profiles and clickstreams. The indexing kernel matches events against subscriptions and returns the relevant ads.]
Advertisement example (subscription): (age < 32) wt=0.2, (credit-score > 630) wt=0.6, (num-visits > 4) wt=0.1, (price = 150) wt=0.1
User profile example (event): (num-visits = 13) wt=0.5, (age = 25) wt=0.1, (price < 235) wt=0.5, (credit-score = 647) wt=0.2
Clickstream example (event, labeled "BMW X" on the slide): car=bmw, model=x3, year=2008
Scaling to millions of subscriptions (queries) over hundreds of dimensions
Supporting up to millions of events per second
Rocket Fuel processes 19 billion bid requests a day and each ad is served in 100ms

7 Application Scenarios
1. Push-based query processing (data analytics)
2. Computational advertising (targeted advertising)
3. Computational finance (algorithmic trading)
4. Approximate string matching (data quality and data cleaning)
5. Intrusion detection (deep packet inspection)
6. Declarative data-centric workflows (business process management)
Problem Statement: To continuously evaluate a set of patterns/specifications (subscriptions) over an incoming event stream.

9 Matching Problem Challenges
1. Handle subscriptions with a high degree of overlap
2. Scale to millions of subscriptions over thousands of dimensions
3. Sustain high matching rates in the presence of frequent subscription changes
4. Adapt to skewed workload distributions (self-adjusting mechanism)
5. Retrieve only the most relevant subscriptions for a given event
6. Exploit parallelism and minimize iterations over the matching structure
7. Enable matching over re-ordered and compressed event streams

11 Language and Data Model
Subscriptions and events are defined as Boolean expressions (conjunctions of predicates).
A predicate $P^{attr,opt,val,wt}(x)$ is a quadruple consisting of an attribute (in an n-dimensional attribute space), an operator, a range of values, and a weight.
A predicate $P(x)$ either accepts or rejects an input $x$, such that $P : x \rightarrow \{\text{True}, \text{False}\}$, where $x \in \text{Dom}(P^{attr})$.
Each predicate supports relational operators ($<, \le, =, \ne, \ge, >$), set operators ($\in, \notin$), or the SQL BETWEEN operator.
Boolean expression: $P_1^{attr,opt,val,wt}(x) \wedge \dots \wedge P_k^{attr,opt,val,wt}(x)$, where $k \le n$; $\forall i, j \le k$, $P_i^{attr} = P_j^{attr}$ iff $i = j$.
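The data model can be illustrated with a small Python sketch. The `Predicate` class, the `OPERATORS` table, and the helper names are hypothetical and chosen only for illustration; the slides do not prescribe an implementation.

```python
import operator
from dataclasses import dataclass

# Relational, set, and BETWEEN operators of the predicate language.
# The string keys are an assumption of this sketch.
OPERATORS = {
    "<": operator.lt, "<=": operator.le, "=": operator.eq,
    "!=": operator.ne, ">=": operator.ge, ">": operator.gt,
    "in": lambda x, values: x in values,
    "not in": lambda x, values: x not in values,
    "between": lambda x, bounds: bounds[0] <= x <= bounds[1],
}

@dataclass(frozen=True)
class Predicate:
    attr: str        # attribute (one dimension of the space)
    opt: str         # operator, a key of OPERATORS
    val: object      # a value, a set of values, or a (low, high) pair
    wt: float = 0.0  # weight

    def accepts(self, x) -> bool:
        """P(x): True if x satisfies the predicate, False otherwise."""
        return OPERATORS[self.opt](x, self.val)

def make_expression(*predicates):
    """Build a conjunction with at most one predicate per attribute."""
    attrs = [p.attr for p in predicates]
    if len(attrs) != len(set(attrs)):
        raise ValueError("each attribute may appear in only one predicate")
    return tuple(predicates)

def satisfies(expression, values):
    """True if the attribute values satisfy every predicate of the conjunction."""
    return all(p.attr in values and p.accepts(values[p.attr]) for p in expression)

# The advertisement example from the computational advertising slides.
ad = make_expression(
    Predicate("age", "<", 32, 0.2),
    Predicate("credit-score", ">", 630, 0.6),
    Predicate("num-visits", ">", 4, 0.1),
    Predicate("price", "=", 150, 0.1),
)
```

For instance, `satisfies(ad, {"age": 25, "credit-score": 647, "num-visits": 13, "price": 150})` evaluates every predicate of the conjunction against the supplied values; the values here are illustrative and not taken from a benchmark.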

16 Matching Semantics
Stabbing Subscription: Given an event $\epsilon$ and a set of subscriptions $\Sigma$, find all subscriptions $\sigma_i \in \Sigma$ that are satisfied by $\epsilon$.
Definition: $SQ(\epsilon) = \{\sigma_i \mid \forall P_q^{attr,opt,val,wt}(x) \in \sigma_i,\ \exists P_o^{attr,opt,val,wt}(x) \in \epsilon,\ P_q^{attr} = P_o^{attr},\ \exists x \in \text{Dom}(P_q^{attr}),\ P_q(x) \wedge P_o(x)\}$.
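A brute-force reading of the stabbing semantics is sketched below. It assumes that every predicate can be written as a closed integer interval on its attribute, and that a subscription predicate is satisfied when the event carries a predicate on the same attribute whose interval shares at least one value with it. The function names and the sample data are hypothetical.

```python
def overlaps(a, b):
    """True if the closed intervals a = (lo, hi) and b = (lo, hi) share a value."""
    return a[0] <= b[1] and b[0] <= a[1]

def stabbing_subscriptions(event, subscriptions):
    """Return the IDs of all subscriptions satisfied by the event.

    event:         dict attr -> (lo, hi)
    subscriptions: dict sub_id -> dict attr -> (lo, hi)
    """
    matched = []
    for sub_id, preds in subscriptions.items():
        # Every subscription predicate needs an overlapping event predicate
        # on the same attribute.
        if all(attr in event and overlaps(rng, event[attr])
               for attr, rng in preds.items()):
            matched.append(sub_id)
    return matched

subscriptions = {
    "s1": {"age": (0, 31), "credit-score": (631, 850)},
    "s2": {"age": (40, 60)},
    "s3": {"price": (150, 150)},
}
event = {"age": (25, 25), "credit-score": (647, 647), "price": (0, 234)}
print(stabbing_subscriptions(event, subscriptions))  # ['s1', 's3']
```

This scan touches every subscription for every event, which is the baseline that an index is meant to avoid.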

17 Design Principles
Most Important Design Feature: Systematically explore the space in two iterative phases of space partitioning and space clustering.
The two-phase space-cutting technique consists of
1. space partitioning: global structuring to determine the best splitting dimension(s)
2. space clustering: local structuring for each partition to determine the best grouping of expressions w.r.t. the chosen dimension(s)

18 Intuition Behind the Two-phase Space-cutting Technique
[Figure sequence: (1) the subscription space populated with subscriptions; (2) space partitioning on the x-axis; (3) space clustering on the x-axis; (4) space partitioning on the y-axis; (5) space partitioning followed by space clustering on the y-axis.]

23 BE-Tree Core Design (Two-phase Space-cutting)
[Figure: BE-Tree structure. A p-directory (partitioning) leads to partition-nodes (p); each partition-node has a c-directory (clustering) that leads to cluster-nodes (c); each cluster-node holds a leaf-node (l) and may contain a further p-directory. Cost annotations on the figure: O(1), O(k log N), O(log N), where k = number of predicates per subscription and N = domain cardinality.]
To systematically explore the space using the two-phase space-cutting technique
1. to cope with the curse of dimensionality
2. to support dynamic changes of subscriptions
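One round of the two phases can be sketched as follows. This is a deliberately simplified illustration, not the BE-Tree algorithm itself: the partitioning phase simply picks the most frequently used attribute, and the clustering phase places each subscription in the smallest bucket, obtained by repeatedly halving the domain, that encloses its range. All function names, the domain bounds, and the sample subscriptions are assumptions.

```python
from collections import Counter

def pick_partition_attr(subs):
    """Space partitioning: pick the attribute used by the most subscriptions."""
    counts = Counter(attr for preds in subs.values() for attr in preds)
    return counts.most_common(1)[0][0]

def cluster(subs, attr, lo, hi):
    """Space clustering: group subscriptions by the smallest bucket, obtained
    by repeatedly halving [lo, hi], that encloses their range on attr."""
    buckets = {}
    for sub_id, preds in subs.items():
        s_lo, s_hi = preds[attr]
        b_lo, b_hi = lo, hi
        while b_hi - b_lo > 1:
            mid = (b_lo + b_hi) // 2
            if s_hi <= mid:
                b_hi = mid            # range fits in the left half
            elif s_lo > mid:
                b_lo = mid + 1        # range fits in the right half
            else:
                break                 # range straddles the midpoint
        buckets.setdefault((b_lo, b_hi), []).append(sub_id)
    return buckets

def two_phase_cut(subs, lo=0, hi=127):
    """Return (chosen attribute, clusters on it, subscriptions without it)."""
    attr = pick_partition_attr(subs)
    inside = {s: p for s, p in subs.items() if attr in p}
    outside = {s: p for s, p in subs.items() if attr not in p}
    return attr, cluster(inside, attr, lo, hi), outside

subs = {
    "s1": {"age": (0, 31), "price": (100, 120)},
    "s2": {"age": (20, 25)},
    "s3": {"age": (64, 127)},
    "s4": {"price": (10, 20)},
}
attr, clusters, rest = two_phase_cut(subs)
```

In this example the chosen attribute is `age`, and the subscriptions left in `rest` would be handled by a further round of partitioning on another attribute.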

25 BE-Tree Bitmap-based Encoded Matching
[Figure: matching pipeline. Predicate-based event encoding -> bitmap-based event encoding -> BE-Tree -> match results (subscription IDs).]
Concise and cache-conscious encoding of events (data)

29 Predicate Evaluation Through Bitmap-based Encoding
[Figure: compressed events (bitmap-based event encoding) are evaluated against a BE-Tree made of a p-directory, p-nodes, c-directories, c-nodes, and l-nodes. Each l-node holds a 2-dimensional representation of its subscriptions S 1 ... S m, and evaluation produces a result bit-array.]
Concise and cache-conscious encoding of events (data)
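The idea of evaluating predicates through a bitmap can be sketched in a few lines. The sketch assumes that each distinct subscription predicate is assigned one bit, that an event is encoded by setting the bits of the predicates it satisfies, and that a subscription matches when all of its bits are set. The flat dictionaries stand in for the tree structure, and every name is hypothetical.

```python
def build_predicate_index(subscriptions):
    """Assign one bit position to every distinct (attr, lo, hi) predicate."""
    index = {}
    for preds in subscriptions.values():
        for attr, (lo, hi) in preds.items():
            index.setdefault((attr, lo, hi), len(index))
    return index

def encode_event(event, index):
    """Bitmap with bit i set iff the event's value satisfies predicate i."""
    bitmap = 0
    for (attr, lo, hi), bit in index.items():
        if attr in event and lo <= event[attr] <= hi:
            bitmap |= 1 << bit
    return bitmap

def subscription_masks(subscriptions, index):
    """Bitmask of the predicates that each subscription requires."""
    masks = {}
    for sub_id, preds in subscriptions.items():
        mask = 0
        for attr, (lo, hi) in preds.items():
            mask |= 1 << index[(attr, lo, hi)]
        masks[sub_id] = mask
    return masks

def match(event_bitmap, masks):
    """A subscription matches when all of its predicate bits are set."""
    return [s for s, m in masks.items() if (event_bitmap & m) == m]

subscriptions = {
    "s1": {"age": (0, 31), "credit-score": (631, 850)},
    "s2": {"age": (0, 31), "price": (150, 150)},
    "s3": {"age": (40, 60)},
}
index = build_predicate_index(subscriptions)
masks = subscription_masks(subscriptions, index)
bitmap = encode_event({"age": 25, "credit-score": 647, "price": 200}, index)
print(match(bitmap, masks))  # ['s1']
```

Once the event is encoded, each predicate shared by several subscriptions is evaluated only once, and the per-subscription check reduces to a bitwise AND and a comparison.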

30 Parallel Compressed Matching over BE-Tree
Stage 1: Event Stream (Event 1, ..., Event i, ..., Event m)
Stage 2: Bitmap-based Event Encoding (Thread 1, ..., Thread i, ..., Thread m)
Stage 3: Bitwise-OR of Bitmap-based Encodings (Compressing Event Stream)
Stage 4: Event Matching (Parallel Tree Traversal) over the BE-Tree
Stage 5: Event Matching (Parallel Leaf Scanning); Thread i produces the match results (subscription IDs) for the i-th event
Traversing BE-Tree and scanning leaf pages in parallel exactly once for m events
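The five stages can be mimicked with a thread pool, as in the sketch below. It is only a structural illustration under several assumptions: a flat dictionary of subscription bitmasks replaces the BE-Tree, predicates are closed integer intervals, and the function names are hypothetical. Python threads also do not run CPU-bound code in parallel in the standard interpreter, so the sketch shows the data flow rather than a speedup.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

def encode(event, predicates):
    """Stage 2: bit i is set iff the event satisfies predicates[i] = (attr, lo, hi)."""
    bitmap = 0
    for bit, (attr, lo, hi) in enumerate(predicates):
        if attr in event and lo <= event[attr] <= hi:
            bitmap |= 1 << bit
    return bitmap

def parallel_compressed_match(events, predicates, masks, workers=4):
    """Return one list of matching subscription IDs per event (Stage 1 input)."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Stage 2: encode every event of the batch in parallel.
        bitmaps = list(pool.map(lambda e: encode(e, predicates), events))
        # Stage 3: compress the batch into a single bitmap with a bitwise OR.
        compressed = reduce(lambda a, b: a | b, bitmaps, 0)
        # Stage 4: a single pass over the index using the compressed bitmap.
        # Any subscription matched by some event also passes this filter.
        candidates = {s: m for s, m in masks.items() if (compressed & m) == m}

        # Stage 5: each event checks only the surviving candidates.
        def scan(bitmap):
            return [s for s, m in candidates.items() if (bitmap & m) == m]

        return list(pool.map(scan, bitmaps))

predicates = [("age", 0, 31), ("credit-score", 631, 850),
              ("price", 150, 150), ("age", 40, 60)]
masks = {"s1": 0b0011, "s2": 0b0101, "s3": 0b1000}
events = [{"age": 25, "credit-score": 647}, {"age": 30, "price": 150}]
print(parallel_compressed_match(events, predicates, masks))  # [['s1'], ['s2']]
```

The compressed bitmap is a superset of every event bitmap in the batch, so the shared pass can discard subscriptions that no event in the batch can match, and the per-event scan then separates the results.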

35 Reordering Events (Adaptive Parallel Matching)
[Figure: pipeline over a batch of incoming events (Event 1, ..., Event b). Incoming Events -> Event Re-ordering (inserting events into a BE-Tree of events) -> Reordered Events -> Adaptive Processing (each group of reordered events is marked Compressed or Uncompressed) -> Event Matching against the BE-Tree of subscriptions.]
Efficient online re-ordering of event stream
Controlling the bucket size, reasoning about the bucket heterogeneity, hybrid matching approach
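The adaptive decision can be sketched as follows, with loudly stated assumptions: events are already encoded as integer bitmaps, numeric sorting stands in for re-ordering through a BE-Tree of events, heterogeneity is measured as one minus the ratio of shared bits to all set bits in a bucket, and the bucket size and threshold values are arbitrary. None of these choices is taken from the slides.

```python
def heterogeneity(bitmaps):
    """1 - |AND| / |OR| over a bucket; 0.0 means all events are identical."""
    union, common = 0, bitmaps[0]
    for b in bitmaps:
        union |= b
        common &= b
    if union == 0:
        return 0.0
    return 1.0 - bin(common).count("1") / bin(union).count("1")

def adaptive_buckets(bitmaps, bucket_size=4, threshold=0.5):
    """Reorder the events, cut them into buckets, and choose a mode per bucket."""
    ordered = sorted(bitmaps)  # stand-in for inserting events into a BE-Tree
    plan = []
    for i in range(0, len(ordered), bucket_size):
        bucket = ordered[i:i + bucket_size]
        # Similar events are compressed together; dissimilar ones are not.
        similar = heterogeneity(bucket) <= threshold
        plan.append(("compressed" if similar else "uncompressed", bucket))
    return plan

plan = adaptive_buckets([0b1100, 0b0011, 0b1101, 0b0111], bucket_size=2)
```

Buckets marked `compressed` would be matched through a single OR-ed bitmap, while buckets marked `uncompressed` would be matched event by event, which corresponds to the hybrid approach named on the slide.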

41 Experimental Evaluation
Algorithms
1. BE: BE-Tree (Sadoghi, Jacobsen. SIGMOD '11)
2. Bitmap: BE-Tree (with bitmap) (Sadoghi, Jacobsen. TODS '13)
3. A-PCM: Adaptive Parallel Compressed Matching
4. GR: IBM Gryphon (Aguilera et al. PODC '99)
5. P: Propagation Algorithm (Fabret et al. SIGMOD '01)
6. k-ind: k-index (Whang et al. VLDB '09)
7. SIFT: Counting Algorithm (Yan et al. TODS '94)
8. SCAN: Sequential Scan
Hardware
1. 2 Intel Quad-core Xeon CPU 3.00GHz, 16GB main memory

42 Workload Configurations
[Table: workload parameters for six experiments. Columns: Workload Size, Number of Dimensions, Match Prob, Stream Similarity, Distinct Predicates, Match Prob (DBLP Data). Rows: Size (1M-5M, 5M, 5M, 5M, 5M, 5M), Number of Dim, Cardinality, Number of Sub Pred, Number of Event Pred, Pred Avg. Range Size %, % Equality Pred, Match Prob %, Stream Similarity %.]
BEGen: our comprehensive Boolean expression workload generator

43 Effect of Workload Size on Matching (Log Scale)
[Figure: Varying Workload Size (Match Prob = 1%). Panels: (a) Uniform: Workload Size; (b) Zipf: Workload Size. x-axis: Varying Number of Subscriptions; y-axis: Matching Time/Event (ms). Series: BE-B, BE, GR, P, k-ind, SIFT, SCAN.]
Improving matching latency by orders of magnitude through our two-phase space-cutting technique

44 Effect of Parallel Matching (Log Scale)
[Figure: Varying % of Stream Similarity. Panels: (a) Matching Probability m = 1%; (b) Matching Probability 0%. x-axis: Varying Overlap Probability; Sub=5M; y-axis: Avg. Throughput/Second. Series: BE-Tree, Bitmap, Parallel A-PCM.]
Significantly improving matching throughput through our parallel matching over compressed streams

45 Parallel Compressed Matching Time Breakdown
Datasets   Stream Re-ordering   Bitmap Encoding & Compression   Tree Traversal   Leaf Scanning
Unif       2.57%                0.91%                           32.08%           63.62%
Zipf       0.36%                0.18%                           28.61%           70.64%
Author     2.18%                0.76%                           29.53%           66.97%
Title      1.04%                0.22%                           23.20%           75.41%
Table: Matching time breakdown (%)
Overhead of stream re-ordering, parallel bitmap encoding, and compression is negligible

46 Effect of Parallel Matching (Log Scale)
[Figure: Percentage of Cache-misses & Latency Comparison. Panel (a) Matching Probability m = 1%: x-axis Varying Overlap Probability; Sub=5M; y-axis Cache-misses (Percentage); series BE-Tree, Bitmap, Parallel A-PCM, Seq A-PCM. Panel (b) Matching Probability 0%: x-axis Varying Event Delay (ms); Sub=5M; y-axis Avg. Matching Time (ms); series Bitmap.]
Substantially reducing cache misses and improving latency at high event rates

48 Conclusions
Event matching is at the heart of event processing engines (e.g., computational advertising).
Key contributions are
1. A novel parallel compressed event matching algorithm over a bitmap-based encoding
2. An efficient online stream re-ordering technique
3. An adaptive algorithm that selectively compresses similar events, depending on stream similarity
Future work: Moving towards a heterogeneous computational model (e.g., FPGAs, GPUs, and co-processors)

49 Questions? Thank you!
