The Case for Heterogeneous HTAP

Size: px

Start display at page:

Download "The Case for Heterogeneous HTAP"

Esmond Lee Barton
6 years ago
Views:

1 The Case for Heterogeneous HTAP Raja Appuswamy, Manos Karpathiotakis, Danica Porobic, and Anastasia Ailamaki Data-Intensive Applications and Systems Lab EPFL 1

2 HTAP the contract with the hardware Hybrid OLTP & OLAP Processing High-throughput OLTP Low-latency OLAP HTAP on multicores Massive parallelism => high concurrency Global shared memory => data sharing System-wide coherence => synchronization HTAP DBMS LLC LLC Database Fresh data Necessary for current systems 2

3 Shifting hardware landscape (1): Specialization of CPUs Multisocket multicores Intel SCC, ARM v8, Cell SPE LLC LLC LLC LLC PCIe PCIe PCIe PCIe 1 coherence domain Multiple coherence domains CPUs: general-purpose à customizable features 3

4 Shifting hardware landscape (2): Generalization of GPUs Normalized SGEMM/Watt Tesla Programmability Interface Fermi UVA Maxwell UM Kepler Dynamic Parallelism PCIe 3.0 (16 GB/s) Pascal Paging UM NVLink (80-200GB/s) GPUs: Niche accelerators à general-purpose processors 4

5 Emerging hardware: Revisiting the contract CurrentEmerging hardware Homogeneous Heterogeneous parallelism Task-parallel CPUs Data-parallel GPUs System-wide Relaxed cache coherence OS (FOS), FS (Hare) runtimes (Cosh) Global shared memory Unified address space HTAP software Cannot exploit heterogeneity HTAP across processors Shared-everything OLTP: N/A No synch. sans coherence Server as distributed system Fails to exploit shared memory Clean slate redesign in order 5

6 Heterogeneous HTAP (H 2 TAP): Caldera Store data in shared memory Run OLTP workloads on task-parallel archipelago Run OLAP workloads on data-parallel archipelago Task-parallel archipelago (OLTP) Data-parallel archipelago (OLAP) GPU GPU In-memory data store Loose job-to-core assignment exploits heterogeneity 6

7 H 2 TAP Challenges Store data in shared memory Choose optimal data layout OLTP on task-parallel archipelago Make up for (lack of) cache coherence OLAP on data-parallel archipelago Share transactionally-consistent snapshots across processors Task-parallel archipelago (OLTP) Data-parallel archipelago (OLAP) GPU GPU In-memory data store 7

8 Data layout Need to minimize PCIe data transfer to GPU Data access on GPU should be sequential to enable coalescing Caldera implements NSM, DSM, and PAX C1 C2 C3 C4 C1 C2 C3 C4 C1 C2 C3 C4 NSM page C1 C1 C2 C1 C2 C3 C2 C3 C3 DSM page C1 C1 C1 PAX minipage C2 C2 C2 PAX minipage C3 C3 C3 PAX minipage PAX fits GPUs best (PCIe & coalesced accesses)paxpage

9 OLTP without cache coherence Use Data-Oriented Transaction Execution principles Thread-to-data assignment leads to partitioned data, metadata (2PL, index) Thd A Thd B 9

10 OLTP without cache coherence Use explicit messaging instead of implicit latching Exploit shared memory by exchanging pointers instead of data 1. Msg (lookup, k) Thd A 2. Reply(&k) 4. Release(k) Thd B 3. Access *k Enforce coherence in software 10

11 Transactionally-consistent data sharing Data sharing across workloads Use Unified Virtual Addressing (UVA) for CPU GPU sharing Consistent data sharing via hardware snapshotting (ex: Hyper) CUDA runtime restricts use in H 2 TAP context Caldera supports lightweight software snapshotting OLAP queries run on immutable snapshot Copy-on-write performed by update transactions Snapshots across GPU-CPU archipelagos 11

12 Caldera blueprint OLTP without cache coherence Task-parallel archipelago Query parser & optimizer Query compiler Query runtime Data-parallel archipelago Scheduler Determine ideal processor for query Compile query to X86 or PTX code OLAP on database snapshot In-memory data store GPU Elastic core to workload assignment 12

13 Experiments Setup Two 12-core Intel Xeon E5-2650L v3 CPUs, 256GB RAM GeForce GTX 980 GPU (PCIe 3.0) with 4GB memory TPC-C, TPC-H, YCSB in various scale factors Silo, MonetDB, DBMS-C Goals Message passing and Software snapshotting overhead PAX performance compared to NSM and DSM on GPUs Caldera performance compared to state-of-the-art 13

14 OLTP throughput Throughput (MTps) Caldera Silo # cores running TPC-C NewOrder (1WH/core) Message passing-based design scales well Better code & data locality (partitioning), no synchronization overhead 14

15 OLAP response time (incl. data movement) 10 Execution Time (sec) Exploits GPU parallelism Saturates PCIe b/w Caldera DBMS-C MonetDB TPCH SF Query 6 Bounded by PCIe bandwidth (12GB/s) Emerging interconnects (NVLink): GB/s 15

16 Impact of snapshotting OLTP Throughput (KTps) Ideal q1 q1-10 9x 3.5x % records touched by OLTP OLAP Response Time (secs) x Ideal % records touched by OLTP Limitation: Software shadow copying imposes a high overhead Possible fix: data classification, snapshot sharing, h/w acceleration 16

17 Impact of data layout Execution Time (sec.) table (i1 integer, i2 integer,. i16 integer) SELECT SUM(colA + colb) FROM table Data (16GB) in host memory PAX, DSM saturate PCIe NSM 14x worse (non-coalesced accesses) DSM PAX NSM Execution Time (msec.) Data (1GB) in GPU memory PAX exploits GPU memory BW NSM only 2x worse (GPUs have reduced the access tax ) DSM PAX NSM Hybrid layouts like PAX a good fit for H 2 TAP 17

18 Conclusion Hardware architecture is changing New opportunities: massive parallelism, fast interconnects New challenges: heterogeneity, relaxed coherence Databases can and should exploit hardware trends Exploit hardware heterogeneity in their core architecture design Decouple system-wide coherence from shared memory Time to move from HTAP to H 2 TAP H 2 TAP architecture: revisit age-old h/w s/w contract Caldera: Preliminary prototype to prove that H 2 TAP is possible 18

The Case For Heterogeneous HTAP

The Case For Heterogeneous HTAP Raja Appuswamy Manos Karpathiotakis Danica Porobic Anastasia Ailamaki École Polytechnique Fédérale de Lausanne RAW Labs SA firstname.lastname@epfl.ch ABSTRACT Modern database