Designing Hybrid Data Processing Systems for Heterogeneous Servers

Size: px

Start display at page:

Download "Designing Hybrid Data Processing Systems for Heterogeneous Servers"

Archibald Spencer
5 years ago
Views:

Group Imperial College London http://lsds.doc.ic.ac.

1 Designing Hybrid Data Processing Systems for Heterogeneous Servers Peter Pietzuch Large-Scale Distributed Systems (LSDS) Group Imperial College London University of Cambridge Cambridge, United Kingdom November

2 Data is the New Oil Many new sources of data become available Most data is produced continuously Internet services, web sites Social feeds IoT devices Cameras RFID tags Mobile devices Scientific instruments Data repositories Data powers plethora of new and personalised services Peter Pietzuch - Imperial College London 2

3 Data-Intensive Systems Data analytics over web click streams How to maximise user experience with relevant content? How to analyse click paths to trace most common user routes? Uniquely Identified Visitors Unique Visitors Visits Page Views Hits Volume of Available Data Machine learning models for online prediction E.g. serving adverts on search engines Solution: AdPredictor Bayesian learning algorithm ranks adverts according to click probabilities f n update f 1 predict y E { 1,1} Peter Pietzuch - Imperial College London 3

Feedzai: 40K credit card transactions/s < 25

queries/s < 1 ms latency NovaSparks: 150M

4 Throughout and Result Freshness Matter High-throughput processing Data-intensive system Low-latency results Facebook Insights: Aggregates 9 GB/s < 10 sec latency Feedzai: 40K credit card transactions/s < 25 ms latency Google Zeitgeist: 40K user queries/s < 1 ms latency NovaSparks: 150M trade options/s < 1 ms latency Peter Pietzuch - Imperial College London 4

5 Design Space for Data-Intensive Systems Data amount TBs GBs MBs Hard for machine learning algorithms Easy for most algorithms Hard for all algorithms Tension between performance and algorithmic complexity 10s 1s 100ms 10ms 1ms Result latency Peter Pietzuch - Imperial College London 5

6 Algorithmic Complexity Increases T1 T2 T3 T1(a, b, c) T2(c, d, e) T3(g, i, h) highway segment direction speed highway segment direction speed highway segment direction speed highway segment direction speed highway segment direction speed Pre-process Share state Parallelize Aggregate Iterate Topicbased filtering Contentbased filtering Complex pattern matching Stream queries Online machine learning, data mining Publish/Subscribe Complex Event Processing (CEP) Stream processing Peter Pietzuch - Imperial College London 6

7 Scale Out Model in Data Centres Peter Pietzuch - Imperial College London 7

8 Task Parallelism vs. Data Parallelism Input data... Servers in data centre Results select distinct W.cid From select distinct Payments W.cid [range 300 seconds] as W, From select distinct Payments Payments W.cid [partition-by [range seconds] row] as Las W, From select highway, where W.cid Payments Payments segment, = L.cid[partition-by [range direction, 300 and W.region 1 seconds] AVG(speed)!= row] L.region as Las W, from Vehicles[range 5 seconds slide 1 second] where W.cid Payments = L.cid[partition-by and W.region 1!= row] L.region as L group by highway, segment, direction where W.cid = L.cid and W.region!= L.region having avg < 40 select highway, segment, direction, AVG(speed) from Vehicles[range 5 seconds slide 1 second] group by highway, segment, direction having avg < 40 Task parallelism: Multiple data processing jobs Data parallelism: Single data processing job Peter Pietzuch - Imperial College London 8

9 Distributed Dataflow Systems Idea: Execute data-parallel tasks on cluster nodes parallelism degree 2 Tasks organised as dataflow graph parallelism degree 3 Almost all big data systems do this: Apache Hadoop, Apache Spark, Apache Storm, Apache Flink, Google TensorFlow,... Peter Pietzuch - Imperial College London 9

10 Nobody Ever Got Fired For Using Hadoop/Spark 2012 study of MapReduce workloads Microsoft: median job size < 14 GB Yahoo: median job size < 12.5 GB Facebook: 90% of jobs < 100 GB (A. Rowstron, D. Narayanan, A. Donnely, G. O Shea, A. Douglas, HotCDP 12) Many data-intensive jobs easily fit into memory One server cheaper/more efficient than compute cluster Peter Pietzuch - Imperial College London 10

Parallelism of Heterogeneous Servers Servers have

GPUs common PCIe Bus Command Queue SMX 1.

C 5 C 6 C 7 C 8 Socket 2 C 1 C 2 C 3 C 4 C 5 C 6 C

DRAM New types of compute accelerators: Xeon Phi,

11 Parallelism of Heterogeneous Servers Servers have many parallel CPU cores Heterogeneous servers with GPUs common PCIe Bus Command Queue SMX 1... SMX N 10s of CPU cores Socket 1 C 1 C 2 C 3 C 4 C 5 C 6 C 7 C 8 Socket 2 C 1 C 2 C 3 C 4 C 5 C 6 C 7 C s of GPU cores L3 L3 L2 Cache DRAM DMA DRAM New types of compute accelerators: Xeon Phi, Google's TPUs, FPGAs,... Peter Pietzuch - Imperial College London 11

Zürich) E How can Data-Intensive Systems Exploit

12 Servers Are Becoming Increasingly Heterogeneous Slide courtesy of Torsten Hoefler (Systems Group, ETH Zürich) E How can Data-Intensive Systems Exploit Heterogeneous Hardware? Peter Pietzuch - Imperial College London 12

13 Roadmap SABER: Hybrid stream processing engine for heterogeneous servers [SIGMOD 16] (1) How to parallelise computation on modern hardware? (2) How to utilise heterogeneous servers? (3) Experimental performance results Peter Pietzuch - Imperial College London 13

14 Analytics with Window-based Stream Queries Real-time analytics over data streams Windows define finite data amount for processing highway segment direction speed highway segment direction speed highway segment direction speed highway segment direction speed highway segment direction speed highway segment direction speed highway segment direction speed highway segment direction speed highway segment direction speed highway segment direction speed window now Time-based window with size τ at current time t [t - τ : t] Vehicles[Range τ seconds] Count-based window with size n: last n tuples Vehicles[Rows n] Peter Pietzuch - Imperial College London 14

15 Defining Stream Query Semantics Windows convert data streams to dynamic relations (database table) Window specification Streams Relations Any relational query (select, project, join, group by, etc) Stream operators: Istream, Dstream, Rstream Peter Pietzuch - Imperial College London 15

16 SQL Stream Queries SQL provides well-defined declarative semantics for queries Based on relational algebra (select, project, join, ) Example: Identify slow moving traffic on highway Input stream: Vehicles(highway, segment, direction, speed) Find highway segments with average speed below 40 km/h Input data select highway, segment, direction, AVG(speed) as avg from Vehicles[range 5 sec slide 1 sec] group by highway, segment, direction having avg < 40 Output Operators Peter Pietzuch - Imperial College London 16

17 (1) How to Parallelise Computation? Perform query evaluation across sliding windows in parallel Exploit data parallelism across stream size: slide: 4 sec 1 sec w 1 w 2 w 3 w 4 Peter Pietzuch - Imperial College London 17

18 How to use GPUs with Stream Queries? Naive strategy parallelises computation along window boundaries size: 4 sec slide: 1 sec Task T 1 Combine partial results Task T 2 E Window-based parallelism results in redundant computation Peter Pietzuch - Imperial College London 18

19 How to use GPUs with Stream Queries? Parallel processing of non-overlapping window data? size: 4 sec slide: 1 sec wt 1 wt 2 Combine partial results T w 3 3 T 5 T 4 w 4 E Slide-based parallelism limits degree of parallelism Peter Pietzuch - Imperial College London 19

20 Apache Spark: Small Slides à Low Throughput Throughput (10 6 tuples/s) select AVG(S.1) from S [rows 1024 slide x] Window slide (10 6 tuples) Spark relates window slide to micro-batch size used for parallelisation E Avoid coupling system parameters with query definition Peter Pietzuch - Imperial College London 20

21 SABER: Parallel Window Processing Idea: Parallelise using task size that is best for hardware T 3 T 2 T 1 5 tuples/task w 5 w 4 w 3 w 2 w 1 size: 7 tuples slide: 2 tuples Task contains one or more window fragments Peter Pietzuch - Imperial College London 21

22 SABER: Window Fragment Processing Process window fragments in parallel Reassemble partial results to obtain overall result Worker A: T 1 w 1 w 2 w 3 w 1 w 2 result Empty Empty result w 5 w 4 w 1 w 2 w 3 Slot 2 Slot 1 Result stage Output result circular buffer Worker B: T 2 Partial result reassembly must also be done in parallel Peter Pietzuch - Imperial College London 22

23 API for Operator Implementation T 2 T 1 5 tuples/task Fragment function f f Processes window fragments w 2 w 1 size: 7 tuples slide: 2 tuples f f f f f f f f f b Assembly function f a Merges partial window results w 2 results w 1 results f a f a output Batch function f b Composes fragment functions within task Allows incremental processing Peter Pietzuch - Imperial College London 23

24 SABER: Performance of Window-based Queries 8 select AVG(S.1) from S [rows 1024 slide x] 0.2 Throughput (GB/s) Latency (sec) Window slide (tuples) 0 E Performance of window-based queries remains predictable Peter Pietzuch - Imperial College London 24

25 How to Pick the Task Size? Problem: Small data transfers over PCIe bus costly Example: select * from S where p1 [rows 1 slide 1] 8 Throughput (GB/s) CPU-only processing GPU-only processing Limited by dispatcher and thread contention Limited by data movement Task Size (KB) Peter Pietzuch - Imperial College London 25

26 Roadmap SABER: Hybrid stream processing engine for heterogeneous servers [SIGMOD 16] (1) How to parallelise computation on modern hardware? E Avoid coupling system parameters with processing semantics (2) How to utilise heterogeneous servers? Peter Pietzuch - Imperial College London 26

27 (2) How to Utilise Heterogeneous Servers? Hard to decide acceleration potential of heterogeneous processors Depends on operator semantics, window definition, data distribution, GPU is faster CPU is faster Throughput (GB/s) CPU execution GPU execution Aggregation Group-by θ-join E Don t leave decision about heterogeneous processors to users Peter Pietzuch - Imperial College London 27

28 SABER: Hybrid Execution Model Idea: Execute tasks on all heterogeneous processors (CPUs, GPUs,...) Task queue CPU worker T 10 T 9 T 8 T 7 T 6 T 5 T 4 T 3 T 2 T 1 GPU worker Q A Q A Q A Q A CPU T 3 T 2 T 7 T 10 T 8 T 9 GPU T 1 T 2 T 4 T 5 T 6 Fully utilise all hardware parallelism available in dedicated servers Peter Pietzuch - Imperial College London 28

29 Static Task Scheduling using Cost Model? Profile tasks to obtain cost model Assign tasks to processor with shortest execution time CPU GPU Task queue CPU workers Q A 3 ms 2 ms 3 ms 1 ms T 10 Q A T 9 Q A T 8 T 7 Q A T 6 T 5 T 4 T 3 T 2 T 1 Q A GPU worker Static: Q A on CPU and on GPU CPU GPU T 1 T 2 T 3 T 4 T 5 T 6 T 8 T 7 T 9 T 10 E Static scheduling under-utilises processors Peter Pietzuch - Imperial College London 29

30 First-Come First-Serve Task Scheduling? Assign tasks to processors first-come, first-serve CPU/GPU execute both Q A and tasks CPU GPU Q A 3 ms 2 ms T 10 T 9 T 8 T 7 T 6 T 5 T 4 T 3 Task queue T 2 T 1 CPU workers GPU worker 3 ms 1 ms Q A Q A Q A Q A First-Come First-Served CPU GPU T 1 T 4 T 8 T 2 T 3 T 5 T 6 T 7 T 9 T 10 E FCFS ignores effectiveness of processors for given task Peter Pietzuch - Imperial College London 30

31 Heterogeneous Lookahead Scheduling (HLS) Idea: Scheduler assigns tasks to idle processors dynamically Skips tasks that could be executed faster by another processor CPU GPU Q A 3 ms 2 ms T 10 T 9 T 8 T 7 T 6 T 5 T 4 T 3 Task queue T 2 T 1 CPU workers GPU worker 3 ms 1 ms Q A Q A Q A Q A HLS CPU T 7 T 10 GPU T 1 T 3 T 2 T 4 T 5 T 6 T 8 T 9 TFinally Skip 12 3 is slower T 4 skip, T 5 and T on 8 and CPU: T 6 and T 9 but Skip and select already select T 7 have T 10 3 ms of work for GPU E HLS achieves aggregate throughput of all heterogeneous processors Peter Pietzuch - Imperial College London 31

32 SABER Hybrid Stream Processing Engine Java 15K LOC C & OpenCL 4K LOC T 1 T 2 op CPU T 1 T 2 T 2 T 1 GPU op α Dispatching stage Dispatch fixed-size tasks Scheduling & execution stage Dequeue tasks based on HLS Result stage Merge & forward partial window results Peter Pietzuch - Imperial College London 32

33 Roadmap SABER: Hybrid stream processing engine for heterogeneous servers [SIGMOD 16] (1) How to parallelise computation on modern hardware? E Avoid coupling system parameters with processing semantics (2) How to utilise heterogeneous servers? E Hybrid execution utilises all heterogeneous processors (3) Experimental performance results Peter Pietzuch - Imperial College London 33

Experimental Evaluation PCIe 3.0 x16 10 Gbps NIC Intel Xeon 2.

infrastructure SmartGrid Measurements 974M plug measurements from

34 Experimental Evaluation PCIe 3.0 x16 10 Gbps NIC Intel Xeon 2.6 GHz 16 cores 64 GB RAM NVIDIA Quadro K5200 2,304 cores 8 GB RAM Ubuntu Linux NVIDIA driver Google Cluster Data 144M jobs events from Google infrastructure SmartGrid Measurements 974M plug measurements from houses Linear Road Benchmark 11M car positions and speed on highway Peter Pietzuch - Imperial College London 34

35 What is SABER s Performance? select group-by avg group-by cnt group-by avg aggr avg group-by avg select group-by cnt Throughput (10 6 tuples/s) CM2 SG1 SG2 LRB3 LRB4 SABER (CPU contrib.) Intel Xeon 2.6 GHz 16 cores SABER (GPU contrib.) NVIDIA Quadro K5200 2,304 cores Cluster Mgmt. Smart Grid LRB E SABER exploits both CPUs and GPUs effectively for different queries Peter Pietzuch - Imperial College London 35

36 Is Hybrid Throughput Additive? GPU is faster CPU is faster Not additive due to queue contention Throughput (GB/s) SABER (CPU only) SABER (GPU only) 0.2 SABER 0.1 Aggregation Group-by θ-join 0 E Aggregate throughput of CPU + GPU always highest Peter Pietzuch - Imperial College London 36

37 What is the Trade-Off between CPUs and GPUs? Hybrid processing model benefits from GPU's ability to process complex predicates fast SABER (CPU only) SABER (GPU only) SABER Throughput (GB/s) Dispatch bound # selection predicates # join predicates [rows 1024, slide 1024] [rows 1024, slide 1024] Peter Pietzuch - Imperial College London 37

38 Does SABER Adapt to Workload Changes? HLS periodically uses idle, non-preferred processor to run tasks to update query task throughput matrix Throughput (GB/s) Selectivity Series1 SABER SABER Series2 (GPU contribution) Time (seconds) E Higher selectivity à more predicates evaluated à GPU preferred Peter Pietzuch - Imperial College London 38

39 Summary Heterogeneous servers have huge impact on data-intensive systems Shift from scale out to scale up model Need new general-purpose system designs for heterogeneous servers SABER: Hybrid Stream Processing Engine for CPUs & GPUs (1) Parallelise computation to fit hardware capabilities E Decouple hardware/system parameters from processing semantics (2) Fully utilise all heterogeneous processors independently of workload E Hybrid processing model to achieve aggregate CPU/GPU throughput Peter Pietzuch - Imperial College London 39

40 Acknowledgement: LSDS Group at Imperial College London We're Hiring! Post-docs, PhDs Thank you! Any Questions? Peter Pietzuch - Imperial College London Peter Pietzuch <prp@imperial.ac.uk> 40

Hammer Slide: Work- and CPU-efficient Streaming Window Aggregation

Large-Scale Data & Systems Group Hammer Slide: Work- and CPU-efficient Streaming Window Aggregation Georgios Theodorakis, Alexandros Koliousis, Peter Pietzuch, Holger Pirk Large-Scale Data & Systems (LSDS)