Data Processing on Emerging Hardware


1 Data Processing on Emerging Hardware. Gustavo Alonso, Systems Group, Department of Computer Science, ETH Zurich, Switzerland. 3rd International Summer School on Big Data, Munich, Germany, 2017

2 Systems Group 6 faculty ~40 PhD ~12 postdocs Researching all aspects of system architecture, sw and hw

3 Take home message

4 Slide courtesy of Torsten Hoefler (Systems Group, ETH Zürich)

5 In a nutshell: Hardware going crazy. More transistors no longer means faster machines, but more specialized ones. Big data is the killer app. Specialized hardware to support data processing. Great opportunity.


6 "Future programming systems must allow the programmer to express their code in a high-level, target-independent manner and optimize the target-dependent decisions of mapping available parallelism in time and space." Bill Dally, NVIDIA Chief Scientist (Keynote at HiPEAC 15). [Slide overlay: DATABASES]

7 Hardware as a problem

8 Joins in main memory, multicore
Kim et al., PVLDB 09: Hash joins faster than sort-merge joins. Will change when SIMD wide enough. Showed tuning to multicore, SIMD.
Blanas et al., SIGMOD 11: No need for tuning a hash join. No need for careful partitioning. Hardware hides complexity.
Albutiu et al., PVLDB 12: Sort-merge join better already. No need to use SIMD.
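For readers less familiar with the two algorithm families these papers compare, here is a minimal, untuned Python sketch of both. It is only an illustration: the function names and the (key, payload) tuple layout are assumptions, and none of the multicore, partitioning, or SIMD techniques studied in the cited papers are shown.

```python
from collections import defaultdict


def hash_join(build, probe):
    """Hash join: build a hash table on `build`, then probe it with `probe`.

    Both inputs are lists of (key, payload) tuples (an assumed layout).
    """
    table = defaultdict(list)
    for key, payload in build:
        table[key].append(payload)
    # Emit one output tuple per matching (build, probe) pair.
    return [(key, b, p) for key, p in probe for b in table.get(key, ())]


def sort_merge_join(left, right):
    """Sort-merge join: sort both inputs by key, then merge equal-key runs."""
    left = sorted(left, key=lambda t: t[0])
    right = sorted(right, key=lambda t: t[0])
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of tuples with this key on each side.
            i_end = i
            while i_end < len(left) and left[i_end][0] == lk:
                i_end += 1
            j_end = j
            while j_end < len(right) and right[j_end][0] == lk:
                j_end += 1
            # Cross product of the two runs.
            for _, lp in left[i:i_end]:
                for _, rp in right[j:j_end]:
                    out.append((lk, lp, rp))
            i, j = i_end, j_end
    return out
```

Both functions produce the same set of (key, left payload, right payload) tuples; only the order of the output differs.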

9 Join with small build table. Workload A: 16M ⋈ 256M, 16-byte tuples (256 MiB ⋈ 4096 MiB). 1. Effective on-chip threading 2. Efficient sync. primitives (ldstub) 3. Larger page size

10 Join with large build table. Workload B: equal-sized tables (977 MiB ⋈ 977 MiB, 8-byte tuples). [Figure labels: 50 cy/tpl; 3.5X; 14 cy/tpl]


11 "This study demonstrates that in main memory, where no time-consuming I/O can mask variations in implementation, implementation details are very important; the implementations of the data structures and algorithms are more important for the performance than the data structures and algorithms themselves." (Sidlauskas & Jensen, PVLDB 14, commenting on Sowell et al., PVLDB 13)

12 Scalability. Workload B: equal-sized tables, 977 MiB ⋈ 977 MiB, 8-byte tuples. [Figure labels: 196 M/sec; 14 cy/tpl; 3.5X; 55 M/sec.] Intel Sandy Bridge 2.7GHz, 8 cores/16 threads. Fastest reported join performance to date! Balkesen et al., ICDE 2013; Balkesen et al., PVLDB 2014; Balkesen et al., IEEE TKDE 2015
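The cycles-per-tuple and tuples-per-second figures on these slides can be related with a back-of-the-envelope conversion. The sketch below assumes that cy/tpl means wall-clock cycles per output tuple and that the machine runs at its nominal 2.7 GHz; both are assumptions, not statements from the slides, and the helper name is hypothetical.

```python
def tuples_per_second(clock_hz, cycles_per_tuple):
    """Convert a cycles-per-tuple figure into throughput at a given clock rate."""
    return clock_hz / cycles_per_tuple


CLOCK_HZ = 2.7e9  # nominal clock of the Sandy Bridge machine named on the slide

fast = tuples_per_second(CLOCK_HZ, 14)  # the 14 cy/tpl figure
slow = tuples_per_second(CLOCK_HZ, 50)  # the 50 cy/tpl figure
speedup = 50 / 14

print(f"{fast / 1e6:.0f} M/sec vs {slow / 1e6:.0f} M/sec, {speedup:.1f}X")
```

Under these assumptions the conversion gives roughly 193 M/sec and 54 M/sec, close to the 196 M/sec and 55 M/sec labels, and a ratio of about 3.6, in line with the 3.5X label.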

13 Specialization as a solution

14 Airline reservations (Amadeus) Load SLAs Features High peak workloads High update rate spikes Stringent response time requirements Extensibility over time Predictability Accurate provisioning

15 CRESCANDO [Architecture diagram: external clients; Crescando aggregation layers; replication groups. Within a node, an input queue (operations) is split across several scan threads, and their results are merged into an output queue (result tuples).] Unterbrunner et al., PVLDB 09
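The split / scan / merge flow named in the diagram can be sketched in a few lines of Python. This is only an illustration of the dataflow under stated assumptions: the function names, the round-robin split, and the read-only predicates are hypothetical, and the sketch does not reproduce the actual Crescando scan algorithm or its handling of updates.

```python
from concurrent.futures import ThreadPoolExecutor


def scan_partition(partition, operations):
    """One scan thread: a single pass over its partition serves every queued operation."""
    results = []
    for record in partition:
        for op_id, predicate in operations:
            if predicate(record):
                results.append((op_id, record))
    return results


def run_batch(table, operations, n_threads=4):
    """Split the table, scan the partitions in parallel, merge the result tuples."""
    # Split: round-robin assignment of records to scan threads (an assumption).
    partitions = [table[i::n_threads] for i in range(n_threads)]
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        partial = pool.map(scan_partition, partitions, [operations] * n_threads)
        # Merge: concatenate the per-thread results into one output queue.
        merged = [row for part in partial for row in part]
    return sorted(merged)  # sorted only to make the output deterministic
```

A batch is a list of (operation id, predicate) pairs, for example `run_batch(table, [("cheap", lambda r: r[1] < 20)])` over a table of (id, price) tuples.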

16 Specialized hardware as the way forward

17 If the data moves, do it efficiently Bumps in the wire(s)

18 IBEX (Woods, PVLDB 14; Woods, PVLDB 11) On programming FPGAs: we had to develop our own SATA 2 driver from the SATA specs!

19 A wide range of algorithms: sorting networks; selection, projection; group by, join; frequent item; skyline; complex event detection; hashing. Helped us to learn what works, what does not work, and, most importantly, that new algorithms and data structures are needed to exploit an FPGA.
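As an illustration of the first item, a sorting network is a fixed sequence of compare-exchange steps that does not depend on the data, which is what makes it a natural fit for hardware. The sketch below simulates the standard 5-comparator network for 4 inputs in software; the names are hypothetical and it says nothing about how the FPGA implementations in the talk were built.

```python
# Comparator schedule for 4 inputs: 5 compare-exchange elements.
# (0,1) and (2,3) are independent, as are (0,2) and (1,3), so hardware
# could evaluate each of those pairs in the same layer.
NETWORK_4 = [(0, 1), (2, 3), (0, 2), (1, 3), (1, 2)]


def sorting_network(values, network=NETWORK_4):
    """Apply a fixed sequence of compare-exchange steps to a list of values."""
    v = list(values)
    for i, j in network:
        # Compare-exchange: the smaller value ends up on wire i.
        if v[i] > v[j]:
            v[i], v[j] = v[j], v[i]
    return v
```

Because the comparator sequence is fixed, the same five steps run for every input, with no data-dependent branching on which elements to compare.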

20 Near-Data processing. The goal is to be able to do this at all levels: smart storage; on the network switch (SDN like); on the network card (smart NIC); on the PCI express bus; on the memory bus (active memory); on the memory (near data processing). Every element in the system (a node, a computer rack, a cluster) should be a processing component.

21 Integrated accelerators From Oracle M7 documentation

22 Do not replace, enhance Help the CPU to do what it does not do well

23 Text search in databases. Istvan et al., FCCM 16. INTEL HARP: This is an experimental system provided by Intel; any results presented are generated using preproduction hardware and software, and may not reflect the performance of production or future systems.

24 100% processing on FPGA

25 Hybrid Processing CPU/FPGA

26 Inside a real database: DoppioDB Sidler et al., SIGMOD 17 Owaida et al., FCCM 17

27 Accelerating real engines

28 Huge potential for new functionality (Kara et al. FCCM 17 => demo at SIGMOD 17) Stochastic Gradient Descent => machine learning on the FPGA
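For reference, stochastic gradient descent itself is short to state. The following is a plain software sketch for fitting a one-dimensional linear model; the function name, learning rate, and epoch count are illustrative assumptions and it does not reflect the FPGA design of Kara et al.

```python
import random


def sgd_linear(samples, lr=0.1, epochs=500, seed=0):
    """Fit y ~ w*x + b by stochastic gradient descent on squared error."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    data = list(samples)
    w, b = 0.0, 0.0
    for _ in range(epochs):
        rng.shuffle(data)  # visit the samples in a new random order each epoch
        for x, y in data:
            err = (w * x + b) - y  # prediction error on one sample
            w -= lr * err * x      # gradient of 0.5*err^2 with respect to w
            b -= lr * err          # gradient of 0.5*err^2 with respect to b
    return w, b
```

On noise-free samples of y = 2x + 1 with x in [0, 1], this recovers w close to 2 and b close to 1.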

29 Requires rethinking the original system. Database suboperators (started from work with Oracle Labs on a project called RAPID): accelerate the important parts of an operator; do not try to accelerate whole operators or entire query plans. Database partitioning: Kara & Giceva, SIGMOD 17

30 Integration of Partitioned Hash Joins [Architecture diagram. Target architecture: Intel Xeon+FPGA (Altera Stratix V). The Intel Xeon CPU (cores, caches, memory controller, 96GB main memory) and the FPGA accelerator each have a QPI endpoint; bandwidth labels ~30 GB/s and 6.5 GB/s. The partitioner on the FPGA takes pointers to R and S, works on 64B cache lines, and produces counts for R and S plus partitioned R and partitioned S (with padding) in memory, which the CPU cores then process.] Kara et al., SIGMOD 2017
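The division of labor in the diagram, partition first and then join partition by partition, can be sketched in software as follows. This is a simplified illustration under stated assumptions: integer keys, a modulo partitioning function, and no cache-line padding. The function names are hypothetical and the code is not the FPGA partitioner from the cited paper.

```python
def partition(relation, n_partitions):
    """Histogram-based partitioning of (key, payload) tuples.

    Returns the reordered tuples plus per-partition counts and start offsets.
    """
    def part_of(key):
        return key % n_partitions  # assumed partitioning function

    # Pass 1: histogram, i.e. the "Counts" in the diagram.
    counts = [0] * n_partitions
    for key, _ in relation:
        counts[part_of(key)] += 1
    # Prefix sum turns counts into the start offset of each partition.
    offsets, running = [], 0
    for c in counts:
        offsets.append(running)
        running += c
    # Pass 2: scatter each tuple to the next free slot of its partition.
    out = [None] * len(relation)
    cursor = list(offsets)
    for tup in relation:
        p = part_of(tup[0])
        out[cursor[p]] = tup
        cursor[p] += 1
    return out, counts, offsets


def partitioned_hash_join(r, s, n_partitions=4):
    """Partition both inputs, then run a small hash join per partition pair."""
    r_out, r_counts, r_off = partition(r, n_partitions)
    s_out, s_counts, s_off = partition(s, n_partitions)
    results = []
    for p in range(n_partitions):
        table = {}
        for key, payload in r_out[r_off[p]:r_off[p] + r_counts[p]]:
            table.setdefault(key, []).append(payload)
        for key, payload in s_out[s_off[p]:s_off[p] + s_counts[p]]:
            for r_payload in table.get(key, ()):
                results.append((key, r_payload, payload))
    return results
```

Matching keys always land in the same partition, so each per-partition hash table stays small; in the architecture on the slide the partitioning step is what runs on the FPGA while the CPU cores handle the rest.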

31 Plenty of opportunities to extend databases. Many existing operators that today are not really integrated or available: spatial, time series, statistical operations, temporal,... Ability to deal with complex data types and formats. Many new operators that significantly expand the scope of a DB: stochastic gradient descent; support vector machines; data mining and data cubes; model-based machine learning (clustering, classification). Through the FPGA: gateway to other machines (Catapult).

32 Disaggregated data center Efficient microservers [FPGAs as standalone nodes]

33 Exploiting the network 40 Mio tuples/relation/core 2x1024M Barthels et al., SIGMOD 15 Barthels et al., PVLDB 17

34 Consensus in a Box (Istvan et al., NSDI 16; Sidler, FPL 16) [Diagram: Xilinx VC709 Evaluation Board. SW clients / other nodes connect over SFP+ using TCP; other nodes connect over SFP+ using direct links. On the FPGA: networking, atomic broadcast, and a replicated key-value store serving reads and writes, backed by DRAM (8GB).]

35 The system: 3 FPGA cluster; 10Gbps switch; comm. over TCP/IP; clients x 12; comm. over direct connections; + leader election; + recovery. Drop-in replacement for memcached with Zookeeper's replication. Standard tools for benchmarking (libmemcached). Simulating 100s of clients.
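The core idea behind a replicated key-value store with a leader can be illustrated with a toy, in-process model: the leader sequences each write, sends it to all replicas, and commits once a majority has acknowledged it. The class and method names below are hypothetical, and leader election, recovery, and networking are deliberately left out, so this is a sketch of the general pattern rather than the protocol implemented on the FPGAs.

```python
class Replica:
    """One node: a log of proposed writes and the committed key-value state."""

    def __init__(self):
        self.log = {}    # sequence number -> (key, value)
        self.store = {}  # committed state
        self.alive = True

    def append(self, seq, key, value):
        """Record a proposed write; returns True as the acknowledgement."""
        if not self.alive:
            return False
        self.log[seq] = (key, value)
        return True

    def commit(self, seq):
        """Apply a previously logged write to the store."""
        key, value = self.log[seq]
        self.store[key] = value


class Leader:
    """Sequences writes and commits them once a majority has acknowledged."""

    def __init__(self, followers):
        self.local = Replica()
        self.followers = followers
        self.next_seq = 0

    def put(self, key, value):
        seq = self.next_seq
        self.next_seq += 1
        nodes = [self.local] + self.followers
        acks = sum(node.append(seq, key, value) for node in nodes)
        if acks * 2 <= len(nodes):
            return False  # no majority: the write is not committed
        for node in nodes:
            if node.alive:
                node.commit(seq)
        return True
```

With three nodes, as in the cluster on the slide, a write still commits when one follower is down and is rejected when two are.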

36 Latency of puts in a KVS [Figure labels: Direct connections; ~3μs; Consensus; Memaslap (ixgbe); 15-35μs; ~10μs; TCP / 10Gbps Ethernet]

37 The benefit of specialization [Plot: throughput (consensus rounds/s) against consensus latency (us), specialized solutions vs. general purpose solutions. Systems shown: FPGA (Direct), FPGA (TCP), DARE* (Infiniband), Libpaxos (TCP), Etcd (TCP), Zookeeper (TCP).] [1] Dragojevic et al. FaRM: Fast Remote Memory. In NSDI 14. [2] Poke et al. DARE: High-Performance State Machine Replication on RDMA Networks. In HPDC 15. *=We extrapolated from the 5 node setup for a 3 node setup.

38 Today: Caribou (Istvan et al., PVLDB 2017). Everything mentioned in the talk can be done on top of this key-value store without affecting performance: selection, projection; regular expression matching; string search; compression/decompression. Next steps: exploring Catapult (Microsoft); implementing RoCE, active RDMA; in-network data processing.
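To make the idea of filtering inside the store concrete, here is a small software sketch of a key-value store whose scan applies a regular-expression selection and a projection before returning anything to the client. The class, its methods, and the dictionary record layout are hypothetical illustrations; they are not Caribou's interface.

```python
import re


class FilteringStore:
    """Toy key-value store that filters and projects records on the storage side."""

    def __init__(self):
        self._data = {}

    def put(self, key, record):
        self._data[key] = dict(record)

    def get(self, key):
        return self._data.get(key)

    def scan(self, field=None, pattern=None, columns=None):
        """Return records whose `field` matches `pattern`, keeping only `columns`."""
        regex = re.compile(pattern) if pattern is not None else None
        out = {}
        for key, record in self._data.items():
            # Selection: skip records that do not match the regular expression.
            if regex is not None and not regex.search(str(record.get(field, ""))):
                continue
            # Projection: keep only the requested columns, if any were given.
            if columns is not None:
                out[key] = {c: record[c] for c in columns if c in record}
            else:
                out[key] = dict(record)
        return out
```

The point of the design is that only matching, already-projected records leave the store, so less data has to move to the client.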

39 Our research agenda: Near Data processing. The goal is to be able to process data at all levels and extend database functionality: smart storage; on the network switch (SDN like); on the network card (smart NIC); on the PCI express bus; on the memory bus (active memory); on the memory (near data processing). Every element in the system (memory, bus, disk, cache, network card, network switch,...) should be a processing component.
