Hammer Slide: Work- and CPU-efficient Streaming Window Aggregation

Size: px

Start display at page:

Download "Hammer Slide: Work- and CPU-efficient Streaming Window Aggregation"

Godfrey Brown
5 years ago
Views:

Large-Scale Data & Systems Group Hammer Slide:

Data & Systems (LSDS) Group Imperial College London

1 Large-Scale Data & Systems Group Hammer Slide: Work- and CPU-efficient Streaming Window Aggregation Georgios Theodorakis, Alexandros Koliousis, Peter Pietzuch, Holger Pirk Large-Scale Data & Systems (LSDS) Group Imperial College London [ADMS - Monday, August, ]

2 It s a Streaming Problem! Data-intensive system High-throughput Processing Low-latency results Data-Intensive Applications Real-time analytics applications Personalised web services Network Monitoring Fraud Detection High-throughput & Result Freshness Matter! Stream Processing

3 Processing unbounded data Windows: slice infinite data streams to finite subsets... Tumbling Window Every element contributes to one Window Result Sliding Window Overlapping Window Results 3

4 Window Aggregation: a key operator in Streaming Window Aggregation: accumulate the most recent data Naive Window Computation: VS. Recompute from scratch Easy to parallelise Incremental Window Computation: Intermediate result sharing Data and control dependencies SUM +++ = 6 SUM +++ = +++ = = +- = Subtract-on-Evict (SoE): O() for invertible functions (sum, avg, cnt) What happens for non-invertible functions (min/max)? 4

5 Non-invertible functions? Two-Stacks select timestamp, MIN(cpu) from TaskEvents [row 4 slide ] Input Values Queue maintained as Two-Stacks:. Push values in Front Stack: O(). When Front Stack is full: Flip Phase in O(n) Aggregate Values Push Front Stack Back Stack

6 Two-Stacks: Flip Phase select timestamp, MIN(cpu) from TaskEvents [row 4 slide ] MIN =... FLIP PHASE Queue maintained as Two-Stacks:. Push values in Front Stack: O(). When Front Stack is full: Flip Phase in O(n) Front Stack Back Stack 6

7 Two-Stacks: high throughput for slide! select timestamp, MIN(cpu) from TaskEvents [row 4 slide ] MIN = Queue maintained as Two-Stacks:. Push values in Front Stack: O(). When Front Stack is full: Flip Phase in O(n) 3. Pop values from the Back Stack: O() Flip happens occasionally: O() amortised time for all operations High throughput for slide! Front Stack Back Stack

8 Problem Statement What happens when slide >?

9 Problem Statement What happens when slide >? Do we maintain & use a different version each time? Can we have a single unified operator: that supports both functions? has good performance for varying slides? 9

10 Microarchitecture analysis

11 Window Aggregation: Where does time go? Different patterns than DBs Data dependencies Control-hazards ¾ slots retire: many operations commit per cycle Can we reduce the retiring ratio & compute more data at the same time?

12 How do we improve CPU-efficiency? Vectorise the computation within a slide (>) SIMD instructions Reduce L memory stalls

13 How to represent state? Not suitable for SIMD! Front Stack Back Stack Push Front_Stack Isolate data model from physical representation! Enable optimisations: SIMD instructions Process state that grows arbitrarily with pointer manipulation CPU-cache-friendly data layout Remove unnecessary writes from flip phase Flip Back_Stack Comparison with STL Stacks: 6% improvement Pop 3

14 Reduce footprint and operations in L cache! Input Values Front_Stack Back_Stack ⑴ select timestamp, MIN(cpu) from TaskEvents [row 4 slide ] Aggregate Values ⑴ B_Stack_Aggr Store window state in a columnar format a) easy to compress b) enhance SIMD instructions F_Stack_Aggr ⑶ F_Stack_Aggr ⑵ ⑵ Compress Aggregates of Front Stack: a) single value kept in register B_Stack_Aggr ⑶ Reduce the size of Back Stack: a) single value per slide Reduce cache footprint & operations: 33% improvement 4

15 Bulk Insertion & SIMD: vectorising computation _mm6_min_epi3(vec_input[i], vec_input[i+]); SIMD-parallel versions of aggregate functions Vectorise the computational intensive parts: Insertion: pre-aggregate values upon bulk insert Eviction: parallel scan for SoE Flip Phase: vectorised-version for Two-Stacks Parallel Processing within the slide (>): 4. speedup

Experimental Setup Hardware Single-threaded execution on a Intel Xeon CPU E-64 v3.6 GHz MB LLC cache 64 GB RAM gcc version.3. Invertible and non-invertible functions (min/max, sum, average,.

16 Experimental Setup Hardware Single-threaded execution on a Intel Xeon CPU E-64 v3.6 GHz MB LLC cache 64 GB RAM gcc version.3. Invertible and non-invertible functions (min/max, sum, average,...) Throughput comparison Benchmarks & Real-world Dataset Google Cluster Data 44M jobs events from Google infrastructure 6

17 The Effect of Optimisations on Sliding Windows Non-Incr Two-Stacks

18 The Effect of Optimisations on Sliding Windows Non-Incr. Two-Stacks(simd) SoE(simd). Two-Stacks For intermediate slides: up to throughput gains When slide = ½ size: cheaper to recompute

19 Real-world Scenario: Variable Windows spiky trace: -K events/sec variable windows stress implementations SoE Non-Incr Two-Stacks 9

20 Real-world Scenario: Variable Windows spiky trace: -K events/sec variable windows stress implementations Two-Stacks(simd) SoE(simd) SoE Non-Incr 4.3 Two-Stacks Robust performance: cope with variations in the workload!

21 CPU-efficient profile for tumbling windows % 6% Performance dominated: strongly by Back-End (memory-bandwidth) less by Retiring

22 CPU-efficient profile for tumbling windows % 6% 6% CPU-efficient implementation! Performance dominated: strongly by Back-End (memory-bandwidth) less by Retiring

23 Future Work Provide an efficient operator for holistic functions Highly efficient streaming operators (CPUs, FPGAs, GPUs) Just-in-time generation of platform-specific code Hardware-oblivious primitives that optimise computation based on windows Compilation-based techniques to deal with memory stalls data in CPU registers 3

24 Summary Problem Statement: How to compute both invertible and non-invertible functions with high performance for variable slides? Solution: Bridge the gap between sliding and tumbling windows with a single unified operator Highly CPU- & Work- efficient streaming aggregation Extend Two-Stacks algorithm Introduce parallel processing within a slide (>) Thank You! Questions? Georgios Theodorakis <grt@imperial.ac.uk> 4

25 Backup Slides Front_Stack Agg_Front_Stack Agg_Back_Stack Back_Stack

26 Integration with SABER 3 6

27 Integration with SABER 3..

28 Stack RLE Compression Example: Report the minimum requested CPU utilisation of submitted tasks select timestamp, MIN(cpu) from TaskEvents [row 4 slide ] Agg_B_Stack Agg_F_Stack B_Stack F_Stack Exploitable pattern on the Back Stack: High probability that the aggregate value is equal or smaller to the value beneath it Run-length encoding: Extra data dependencies (9-% worse performance)

Designing Hybrid Data Processing Systems for Heterogeneous Servers

Designing Hybrid Data Processing Systems for Heterogeneous Servers Peter Pietzuch Large-Scale Distributed Systems (LSDS) Group Imperial College London http://lsds.doc.ic.ac.uk University