Crazy little thing called hardware GUSTAVO ALONSO SYSTEMS GROUP DEPT. OF COMPUTER SCIENCE ETH ZURICH

Size: px

Start display at page:

Download "Crazy little thing called hardware GUSTAVO ALONSO SYSTEMS GROUP DEPT. OF COMPUTER SCIENCE ETH ZURICH"

Valentine Waters
5 years ago
Views:

1 Crazy little thing called hardware GUSTAVO ALONSO SYSTEMS GROUP DEPT. OF COMPUTER SCIENCE ETH ZURICH HTDC 2014

2 Systems Group = Enterprise Computing Center =

3 Hardware rules Multicore, Many core Transactional Memory SIMD, AVX, vectorization SSDs, persistent memory Infiniband, RDMA GPUs, FPGAs (hardware acceleration) Intelligent storage engines, main memory Database appliances Reacting to changes we do not control

4 Do we care? Software producer of performance instead of consumer of performance Mark D. Hill (U. of Wisconsin)

5 What does it mean? Many good ideas from decades ago are no longer good ideas (as conceived): General purpose solutions Threading, thread based parallelism Locking, shared state Centralized synchronization Concurrency control STM (sorry )

6 The take away message Ignoring hardware trends is irresponsible Software evolves more slowly than hardware Focus on the end (solve a problem), not on the means (the technique used) How to succeed in research: Make hardware your friend Influence hardware evolution

7 Outline 1. Hardware as a problem for system design a) A jungle of improvements b) An example (database joins) 2. The case for custom systems a) Economies of scale b) An example (Crescando) 3. Ideas and directions

8 1. Hardware as a problem for system design a) A jungle of improvements

9 Why is this happening? Single thread performance The Future of Computing Performance: Game Over or Next Level? Fuller and Millett (Eds) Presentation idea from Bob Colwell

10 Why is this happening? Clock frequency

11 Why is this happening? Power dissipation

12 Why is this happening? MULTICORE

13 Multicore is great: avoid distribution Nobody ever got fired for using Hadoop on a Cluster A. Rowstron, D. Narayanan, A. Donnely, G. O Shea, A. Douglas HotCDP 2012, Bern, Switzerland Analysis of MapReduce workloads: Microsoft: median job size < 14 GB Yahoo: median job size < 12.5 GB Facebook: 90% of jobs less than 100 GB Fit in main memory One server more efficient than a cluster Adding memory to a big server better than using a cluster

14 but multicore is dead MULTICORE DARK SILICON

15 Dark Silicon = heterogeneity Symmetric Asymmetric Dynamic (power) Specialized/dynamic Probabilistic

16 Heterogeneity is a mess Example: deployment on multicores Experiment setup 8GB datastore size SLA latency requirement 8s 4 different machines Min Cores Partition Size [GB] RT [s] Intel Nehalem AMD Barcelona AMD Shanghai AMD MagnyCours Jana Giceva, Tudor-Ioan Salomie, Adrian Schüpbach, Gustavo Alonso, Timothy Roscoe: COD: Database / Operating System Co-Design. CIDR 2013

17 Beyond multicore: manycore MULTICORE DARK SILICON MANY CORE

18 PCIe Manycore (example) normal CPU

19 Extreme manycore: Intelligent storage Louis Woods, Zsolt István, Gustavo Alonso: Hybrid FPGA-accelerated SQL query processing. FPL 2013

20 Don t need to trust the software stack Offload all sensitive operations to secure coprocessor Testing industry-standard benchmarks with modified SQL.exe, 0-60% perf overhead Keys, Plaintext Memory CPU Keys, Plaintext Secure Co-processor SLIDE COURTESY OF KEN EGURO, MICROSOFT RESEARCH 20

21 The take away message Ignoring hardware trends at your own peril You will be solving the wrong problem Your solution will irrelevant very quickly Your solution will be making the wrong assumptions Hardware might solve the problem that you are trying to solve (may have solved it already)

22 1. Hardware as a problem for system design b) An example (database joins)

23 The joy of joins "Main-Memory Hash Joins on Multi-Core CPUs: Tuning to the Underlying Hardware" by Cagri Balkesen, Jens Teubner, Gustavo Alonso, and Tamer Ozsu, ICDE 2013 Cagri Balkesen, Gustavo Alonso, Jens Teubner, M. Tamer Özsu: Multi-Core, Main-Memory Joins: Sort vs. Hash Revisited. PVLDB 7(1): (2013)

24 Hardware conscious Joins are a complex and demanding operation Lots of work on implementing all kinds of joins In 2009 Kim, Sedlar et al. paper (PVLDB 2009) Radix join on multicore Sort Merge join on multicore Claim fastest implementation to date Key message: when SIMD wide enough, sort merge will be faster than radix join Performance in the multicore era 24

25 Hardware oblivious In 2011 Blanas, Li et al. (SIGMOD 2011) No partitioning join (vs. radix join version of Kim paper) Claim: Hardware is good enough, no need for careful tailoring to the underlying hardware Performance in the multicore era 25

26 Hardware confusion In 2012 Albutiu, Kemper et al. (PVLDB 2012) Sort merge joins (vs join version of Blanas) Claim: Sort merge already better and without using SIMD Performance in the multicore era 26

27 The basic hash join

28 Canonical Hash Join k R 1. Build phase hash table bucket 0 bucket 1 2. Probe phase k S hash(key) match hash(key) bucket n-1 Complexity: O( R + S ), i.e., O(N) Easy to parallelize Performance in the multicore era 28

29 Need for Speed Hardware-Conscious Hash Joins

30 Partitioned Hash Join (Shatdal et al. 1994) Idea: Partition input into disjoint chunks of cache size k R 1 h 2 (k) 1 k S h 1 (key)... p p h 1 (key) 1 Partition 2 Build 3 Probe No more cache misses during the join 1 Partition p > #TLB-entries TLB misses Problem: p can be too large! p > #Cache-entries Cache thrashing "Cache conscious algorithms for relational query processing", Shatdal et al, VLDB 94 Performance in the multicore era 30

31 Multi-Pass Radix Partitioning Problem: Hardware limits fan-out, i.e. T = #TLB-entries (typically ) Solution: Do the partitioning in multiple passes! input relation 1 st pass h 1 (key) nd Pass h 2 (key) 1 T partition i th pass... 1 st log 2 T bits of hash(key) T 2 nd Pass h 2 (key) 2 nd log 2 T bits of hash(key) 1 T i = log T p partition - T i TLB & Cache efficiency compensates multiple read/write passes «Database Architecture Optimized for the new Bottleneck: Memory Access», Manegold et al, VLDB 99 Performance in the multicore era 31

32 Partitioned Relation Parallel Radix Join Parallelizing the Partitioning: Pass - 1 Local Histograms Thread-1 Thread-2 Thread-3 Thread-N 1 st Scan Global Histogram & Prefix Sum T0 T1 T N T0 T1 T N T0 T1 T N T0 T1 T N Each thread scatters out its tuples based on the prefix sum 2 nd Scan Sort vs. Hash Revisited: Fast Join Implementation on Modern Multi-Core CPUs, VLDB 09 Performance in the multicore era 32

33 Relation Partitioned Parallel Radix Join Parallelizing the Partitioning: Pass - (2.. i) P 1 P 2 P 3 P 4 Histogram Thread-2 Thread-4 Thread-N... 1 st Scan Prefix Sum Each thread individually partitions sub-relations from pass-1 2 nd Scan P 11 P 12 P 13 P 14 P 21 P 22 P 23 P 24 P 31 P 32 P 33 P 41 P 42 P 43 Performance in the multicore era 33

34 Trust the force Hardware-Oblivious Hash Joins

35 Parallel Hash Join ( no partitioning join of Blanas et al.) R 1. Build Phase Thread-1 Thread-2 Thread-3 Thread-N latches L0 L1 shared hashtable bucket 0 bucket 1 2. Probe Phase 1. Acquire latch L(n-1) 2. Store in bucket bucket n-1 Compare & match Thread-1 Thread-2 Thread-3 Thread-N S Performance in the multicore era 35

36 And the winner is

37 Effect of Workloads Case 1 Workload A: 16M 256M, 16-byte tuples, i.e., 256MiB 4096MiB 16.5 cy/tpl 12.9 cy/tpl 1. Effective on-chip threading 2. Efficient sync. primitives (ldstub) 3. Larger page size Hardware oblivious vs. conscious with our optimized code -30% may seem in favor of hardware-oblivious, but Performance in the multicore era 37

38 Effect of Workloads Case 2 Workload B: Equal-sized tables, 977MiB 977MiB, 8-byte tuples 50 cy/tpl 3.5X 14 cy/tpl Picture radically changes: Hardware-conscious is better by 3.5X on Intel and 2.5X on others With larger build table, overhead of not being hardware-conscious is clearly visible Performance in the multicore era 38

39 Sort or Hash? 375M/sec 6.41 cy/tpl 299M/sec 8.03 cy/tpl 619M/sec 3.87 cy/tpl 305M/sec 7.99 cy/tpl Workload: 12.8GB 12.8GB Input size is an important parameter! Workload: 1GB 1GB Optimizations on radix: NUMA-aware data placement, partitioning with software-managed buffers Machine: Intel Sandy Bridge E4640, 2.4GHz, 32-cores, 64-threads Performance in the multicore era 39

40 Game on for academia The next several hundred papers 1. Pick an algorithm 2. Pick an architecture 3. Optimize 4. Publish 5. Go to 1 Algorithm can be used in parallel for increased throughput

41 2. The case for custom systems a) Economies of scale

42 Game changers Hardware evolution: One size does not fit all Cloud computing IT industry as service industry Big X (X=data volume, loads, system costs, requirements, scale) Demand for customized solutions

43 ORACLE EXADATA Intelligent storage manager Massive caching RAC based architecture Fast network interconnect

44 NETEZZA (IBM) TWINFIN No storage manager Distributed disks (per node) FPGA processing No indexing

45 Has been tried before The idea that database workloads are relevant enough to justify customization is not new Decades ago it was difficult to fight the progress of general purpose systems. Not any more Gamma Database Machine (Dewitt)

46 2. The case for custom systems b) An example (Crescando)

47 Amadeus Workload Passenger-Booking Database ~ 600 GB of raw data (two years of bookings) single table, denormalized ~ 50 attributes: flight-no, name, date,..., many flags Query Workload up to 4000 queries / second latency guarantees: 2 seconds today: only pre-canned queries allowed Update Workload avg. 600 updates per second (1 update per GB per sec) peak of updates per second data freshness guarantee: 2 seconds Problems with State-of-the Art Simple queries work only because of mat. views multi-month project to implement new query / process Complex queries do not work at all

48 Crescando: the Amadeus use case Remove load interaction Remove unpredictability Simplify design for scalability and modeling Treat a multicore machine as a collection of individual nodes (not as a parallel machine) Run only on main memory One thread per core Highly tune the code at each core

49 Scan on a core QUERIES UPDATES BUILD QUERY INDEX FOR NEXT SCAN READ CURSOR WRITE CURSOR DATA IN CIRCULAR BUFFER (WIDE TABLE)

50 Crescando on 1 Machine (N Cores) Scan Thread Scan Thread Input Queue (Operations) Split Scan Thread Scan Thread Merge Output Queue (Result Tuples)... Input Queue (Operations) Scan Thread Output Queue (Result Tuples)

51 Crescando in a Data Center (N Machines) External Clients... Crescando Aggregation Layers Replication Groups...

52 Why is this interesting? Fully predictable performance Response time determined by design regardless of load Only two parameters: Size of the scan Number of queries per scan Scalable to arbitrary numbers of nodes Philipp Unterbrunner, Georgios Giannikis, Gustavo Alonso, Dietmar Fauser, Donald Kossmann: Predictable Performance for Unpredictable Workloads. PVLDB 2(1): (2009)

53 Even more interesting On such a system (or a key value store, No- SQL database, etc.): Why using a general purpose CPU? Many other options available Smaller CPUs FPGAs ASIC Zsolt István, Gustavo Alonso, Michaela Blott, Kees A. Vissers: A flexible hash table design for 10GBPS key-value stores on FPGAS. FPL 2013

54 3. Ideas and directions

55 Not everything is parallel P. Roy, J. Teubner, G. Alonso Efficient Frequent Item Counting in Multi-Core Hardware, KDD 2012 Performance in the multicore era 55

56 Not everything is a CPU Louis Woods, Zsolt István, Gustavo Alonso: Hybrid FPGA-accelerated SQL query processing. FPL 2013

57 Hardware might solve your problem Louis Woods, Gustavo Alonso, Jens Teubner: Parallel Computation of Skyline Queries. FCCM 2013

58 Pipeline parallelism Georgios Giannikis, Gustavo Alonso, Donald Kossmann: SharedDB: Killing One Thousand Queries With One Stone. PVLDB 5(6): (2012) SharedDB does not run queries individually (each one in one thread). Instead, it runs operators that process queries in batches thousands of queries at a time

59 Shared DB can run TPC-W!

60 For the non-db people TPC-W has updates!!! Full consistency without conventional transaction manager Transactions are no longer what you read in textbooks Sequential execution Memory CoW (Hyder, TU Munich) Snapshot isolation

61 Raw performance

62 Predictability, robustness

63 Parallelism or distribution?

64 COD : Overview What is the knowledge we have? Who knows what? DBMS Application requirements and characteristics Insert interface here Hardware & architecture + System state and utilization of resources OS CIDR 13 COD: Database/Operating System co-design 64

65 Cod s Interface supports DB storage engine OS Policy Engine Push application-specific System-level facts: #Requests (in System-level a batch) facts Datastore size (#Tuples, and TupleSize) SLA response time DB requirement system-level Needed for: properties cost / utility functions: Application-specific DB-specific Facts & properties Cost functions CIDR 13 COD: Database/Operating System co-design 65

66 COD s key features Declarative interface Resource allocation for imperative requests Resource allocation based on cost functions Proactive interface Inform of system state Request releasing of resources Recommend reallocation of resources CIDR 13 COD: Database/Operating System co-design 66

67 Throughput [req/second] Experimental results Deployment in a noisy system Experiment setup 1200 System Performance -Throughput AMD MagnyCours 4 x 2.2GHz AMD Opteron 6174 processors total Datastore size 53GB Noise: another CPU-intensive task running on core what we get using COD, knowing the system state 48 cores - isolated cores - noisy Increasing load [#requests] CIDR 13 COD: Database/Operating System co-design 67

68 Throughput [req/second] Experimental results Deployment in a noisy system Experiment setup 1200 System Performance -Throughput AMD MagnyCours 4 x 2.2GHz AMD Opteron 6174 processors total Datastore size 53GB Noise: another CPU-intensive task running on core what we get using COD, knowing the system state 48 cores - isolated cores COD noisy cores - noisy Increasing load [#requests] CIDR 13 COD: Database/Operating System co-design 68

69 Latency [sec] Experimental results Adaptability to dynamic system state 9 Adaptability Latency Experiment setup AMD MagnyCours 4 x 2.2GHz AMD Opteron 6174 processors total Datastore size 53GB Noise: other CPU-intensive threads spawned every 4-5min on core SLA Elapsed time [min] CIDR 13 COD: Database/Operating System co-design 69

70 Latency [sec] Experimental results Adaptability to dynamic system state 9 Adaptability Latency Experiment setup AMD MagnyCours 4 x 2.2GHz AMD Opteron 6174 processors total Datastore size 53GB Noise: other CPU-intensive threads spawned every 4-5min on core Naïve datastore engine SLA Elapsed time [min] CIDR 13 COD: Database/Operating System co-design 70

71 Latency [sec] Experimental results Adaptability to dynamic system state 9 Adaptability Latency Experiment setup AMD MagnyCours 4 x 2.2GHz AMD Opteron 6174 processors total Datastore size 53GB Noise: other CPU-intensive threads spawned every 4-5min on core Naïve datastore engine SLA COD Elapsed time [min] CIDR 13 COD: Database/Operating System co-design 71

72 Conclusions

73 The opportunity is now Consensus on major crisis in hardware (from the sw perspective) Hardware not really improving, responsibility passed on to software Business models and IT systems moving towards specialization

Performance in the Multicore Era

Performance in the Multicore Era Gustavo Alonso Systems Group -- ETH Zurich, Switzerland Systems Group Enterprise Computing Center Performance in the multicore era 2 BACKGROUND - SWISSBOX SwissBox: An