Architecture-Conscious Database Systems

1 Architecture-Conscious Database Systems 2009 VLDB Summer School Shanghai Peter Boncz (CWI)

2 Sources
Thank You!
- Database Architectures for New Hardware, VLDB 2004 tutorial, Anastassia Ailamaki
- Query co-processing on Commodity Processors, VLDB 2006 tutorial (first half), Anastassia Ailamaki, Stavros Harizopoulos
VLDB 2009 Summer School Shanghai Architecture-Conscious Database Techniques 2

3 Focus
How can we explore new hardware to run database workloads efficiently?

4 Overview
- Introduction
- Interaction between computer architecture and DBMS
- Query processing
- Performance breakdowns
- Bottlenecks
- Future directions
- Architecture-Conscious Data Management
- Limitations and opportunities

5 Processor/Memory Speed Gap
- Processor speed: 2x every 18 months
- Memories: larger, but not correspondingly faster

6 CPU Architecture
Elements:
- Storage
  - CPU caches L1/L2/L3
  - Registers
- Execution Unit(s)
  - Pipelined
  - SIMD

7 CPU Metrics

8 DRAM Metrics

9 The Memory Wall
Trip to memory = 1000s of instructions!

10 Memory Hierarchies
+ Translation Lookaside Buffer (TLB)
- Cache for VM address translation => only 64 entries!

11 Cache Implementation
Important parameters: cache size, cache line size, cache associativity

12 Cache Associativity
lower associativity => faster lookup
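
The three parameters on slide 11 determine where an address may be placed in the cache. The following is a minimal sketch, assuming power-of-two sizes; the helper names `cache_geometry` and `split_address` and the example figures (32 KB, 64-byte lines, 8-way) are illustrative assumptions, not values from the slides.

```python
def cache_geometry(cache_size, line_size, associativity):
    """Return (number of sets, offset bits, index bits) for a set-associative cache."""
    num_lines = cache_size // line_size
    num_sets = num_lines // associativity
    offset_bits = line_size.bit_length() - 1   # log2(line_size), power of two assumed
    index_bits = num_sets.bit_length() - 1     # log2(num_sets), power of two assumed
    return num_sets, offset_bits, index_bits


def split_address(addr, cache_size, line_size, associativity):
    """Split an address into (tag, set index, byte offset within the line)."""
    num_sets, offset_bits, index_bits = cache_geometry(cache_size, line_size, associativity)
    offset = addr & (line_size - 1)
    set_index = (addr >> offset_bits) & (num_sets - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, set_index, offset


# Hypothetical 32 KB cache with 64-byte lines, 8-way set-associative.
print(cache_geometry(32 * 1024, 64, 8))
print(split_address(0x12345, 32 * 1024, 64, 8))
```

With lower associativity, fewer tags have to be compared per lookup, but more addresses compete for each set's slots; in the direct-mapped case, two addresses exactly one cache size apart always map to the same line.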

13 Miss Classification
- Cold misses are unavoidable
- Capacity, conflict, and coherence misses can be reduced

14 Super-Scalar Execution (pipelining)

15 >ILP: Superscalar Out-Of-Order
- DB: only 1.5x faster than in-order [KPH98,RGA98]
- Limited ILP opportunity

16 >>ILP: Branch Prediction
DB programs: long code paths => mispredictions

17 Database Workloads on UniProcs
DB apps heavily under-utilize hardware

18 Parallelism: Multiple Threads
- Simultaneous Multithreading (SMT)
  - Pursue multiple threads in parallel within a processor pipeline
  - Store multiple contexts in different register sets
  - Share functional units between threads
  - Fast (hardware) context switching among threads
- Chip Multiprocessors (CMPs)
  - >1 complete processors on a chip
  - Every functional unit of a processor is replicated

19 MultiThreading Compared
[Figure: RS64-IV (IBM), UltraSparc T1 (Sun), POWER5 (IBM); cache misses marked]
Speedup: OLTP 3x, DSS 1.5x [LBE98]

20 The Case for CMPs
- Getting diminishing returns
  - From a single core, though powerful (OoO, superscalar, multithreaded)
  - n-core CMP outperforms n-thread SMT
- CMPs offer productivity advantages
- Moore's law: 2x transistors every 18 months
  - More, not faster
  - Exponentially more cores
"Will you still need me, will you still feed me, when I'm 64?" (Beatles)

21 A Chip Multiprocessor
[Figure label: Main Memory]

22 Current CMP Technology
IBM Power 6, Sun Niagara T2, AMD Barcelona, Intel Nehalem

23 Summary: Trends & DB workloads
- Hardware: continuously evolving
  - Superscalar, OoO, SMT, CMP
  - Processor/memory speed gap: growing
- Software: one processor does not fit all
  - At most 50% CPU utilization
  - Heavy reuse vs. sequential scan vs. random access loops
- Opportunities for Architectural Study
  - On real conventional processors
  - On simulators (hard to find/build, slow)
  - On co-processors: NPUs and GPUs

24 DB execution time breakdown
- At least 50% of cycles spent on stalls
- Memory is the major bottleneck

25 DSS/OLTP Basics: Memory
- Bottlenecks: data in L2, instructions in L1
- Random access (OLTP): L1I-bound

26 Why not increase L1-I sizes?
- L1I size is stable
- L2 size increase: effect on performance?

27 Increasing L2 Cache Size
Larger L2: trade-off for OLTP

28 Summary: workload analysis
- Database workloads: more than 50% stalls
  - Mostly due to memory delays
  - Cannot always reduce stalls by increasing cache size
- Crucial bottlenecks
  - Data accesses to the L2 cache (especially DSS)
  - Instruction accesses to the L1 cache (especially OLTP)
Goal 1: Eliminate unnecessary misses
Goal 2: Hide latency of cold misses

29 Classic Data Layout on Disk Pages

30 NSM In Memory Hierarchy

31-32 Decomposition Storage Model (DSM)

33 DSM In Memory Hierarchy

34 Partition Attributes Across (PAX)

35 PAX In Memory Hierarchy
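
The slides on NSM, DSM, and PAX survive here only as titles, so the sketch below tries to show what distinguishes the three layouts for one page of records. It is a simplified illustration: page headers, offsets, and variable-length fields are left out, and the function names are hypothetical.

```python
def nsm_page(records):
    """N-ary Storage Model: whole records stored one after another."""
    return [value for record in records for value in record]


def dsm_columns(records):
    """Decomposition Storage Model: one (surrogate id, value) table per attribute."""
    num_attrs = len(records[0])
    return [[(rid, record[a]) for rid, record in enumerate(records)]
            for a in range(num_attrs)]


def pax_page(records):
    """PAX: records stay on one page, but each attribute gets its own minipage."""
    num_attrs = len(records[0])
    return [[record[a] for record in records] for a in range(num_attrs)]


records = [(1, "a", 10), (2, "b", 20)]
print(nsm_page(records))     # attributes of one record are adjacent
print(dsm_columns(records))  # each attribute lives in its own table
print(pax_page(records))     # values of one attribute are adjacent within the page
```

A scan that reads a single attribute touches every record in the NSM layout, whereas in the DSM and PAX layouts the values it needs sit next to each other, which is why cache lines are better used there.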

36 PAX Performance Results

37 Summary (no replication)

38 Clotho
- Memory stores PAX minipages
- New buffer pool manager handles sharing

39 Fractured Mirrors

40 What about the rest of misses?
- Prefetching hides cache miss latency
- Efficiently used in pointer-chasing lookups!

41 > Prefetching B+ Trees
- >2x better search AND update performance
- Approach complementary to CSB+-trees!

42 >> Prefetching B+ Trees
- pB+-trees: 8x speedup over B+-trees
- Fractal pB+-trees: 80% faster in-memory search

43 Bulk Lookups
2x speedup with enough concurrency

44 Hiding Latencies: Summary
Lots more to be done in this area: consider interference and scarce resources

45-46 Cache-Conscious Joins: Radix-Cluster
(Database Architecture Optimized for the New Bottleneck: Memory Access, VLDB 99; Generic Database Cost Models for Hierarchical Memory Systems, VLDB 02; all Manegold, Boncz, Kersten)
- Radix-Partitioned Hash Join
  - create partitions << CPU cache
  - small partitions => many partitions
  - many partitions => multiple passes needed
- Radix-Cluster
  - Radix-Sort with early stopping
  - Each pass looks at the B_i highest-order radix bits
  - Splitting each input cluster into 2^(B_i) output clusters
  - leaves relation partially ordered

47 Cache-Conscious Joins: Radix-Cluster
- Multiple clustering passes
  - Limit number of clusters per pass
  - Avoid cache/TLB thrashing
  - Trade memory cost for CPU cost
- Any data type (hashing)
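
The multi-pass clustering described above can be sketched as follows. This is an illustrative reading of the bullets rather than the authors' implementation: the function name `radix_cluster` and the list-of-lists representation are assumptions, and Python's built-in `hash` stands in for the hash function.

```python
def radix_cluster(keys, bits_per_pass):
    """Cluster keys on sum(bits_per_pass) hash bits, one pass per list entry.

    Each pass splits every existing cluster into 2**b sub-clusters, so no single
    pass writes to more than 2**b output regions at a time.
    """
    total_bits = sum(bits_per_pass)
    clusters = [list(keys)]
    shift = total_bits
    for b in bits_per_pass:
        shift -= b                      # this pass looks at the next-highest b bits
        mask = (1 << b) - 1
        next_clusters = []
        for cluster in clusters:
            buckets = [[] for _ in range(1 << b)]
            for key in cluster:
                buckets[(hash(key) >> shift) & mask].append(key)
            next_clusters.extend(buckets)
        clusters = next_clusters
    return clusters


keys = [5, 0, 7, 2, 13, 8]
# Two passes (1 bit, then 2 bits) produce the same 8 clusters as one 3-bit pass.
print(radix_cluster(keys, [1, 2]))
print(radix_cluster(keys, [3]))
```

Splitting the bits over several passes keeps the number of simultaneously written clusters per pass small, at the price of reading and writing the relation more than once.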

48 Cache-Conscious Joins: Accurate Cache Miss Cost Modeling

49 Cache-Conscious Joins: Partitioned Hash Join
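
A partitioned hash join, as outlined on the Radix-Cluster slides, first splits both inputs on the same hash bits and then joins matching partitions with a small hash table each. The sketch below is a simplified single-pass version under that assumption; the names `partition` and `partitioned_hash_join` are hypothetical, and tuples are modelled as `(key, payload)` pairs.

```python
from collections import defaultdict


def partition(tuples, num_bits):
    """Split (key, payload) tuples into 2**num_bits partitions on the low hash bits."""
    parts = [[] for _ in range(1 << num_bits)]
    mask = (1 << num_bits) - 1
    for key, payload in tuples:
        parts[hash(key) & mask].append((key, payload))
    return parts


def partitioned_hash_join(left, right, num_bits):
    """Join two lists of (key, payload) tuples partition by partition."""
    result = []
    for lpart, rpart in zip(partition(left, num_bits), partition(right, num_bits)):
        table = defaultdict(list)       # per-partition hash table, meant to fit in cache
        for key, payload in lpart:
            table[key].append(payload)
        for key, payload in rpart:
            for lpayload in table.get(key, ()):
                result.append((key, lpayload, payload))
    return result


left = [(1, "a"), (2, "b"), (3, "c"), (5, "e")]
right = [(3, "x"), (1, "y"), (1, "z"), (4, "w")]
print(sorted(partitioned_hash_join(left, right, num_bits=1)))
```

The number of partition bits only changes the order in which matches are produced, not the set of matches; in a real system it would be chosen so that each partition's hash table fits in the CPU cache.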

50 Cache-Conscious Joins: Radix-Clustered Hash-Join, overall performance (64,000,000 tuples)

51-52 Cache-Conscious Joins: Pre-Projection vs. Post-Projection
- Radix-Decluster = cache-conscious post-projection

53-58 Cache-Conscious Joins: Post-Projection
(Cache-Conscious Radix-Decluster Projections, Manegold, Boncz, Nes, VLDB 04)
- Partitioned Hash-Join: Join Index! (Join Indices, Valduriez, TODS 87)
- Cluster on Left
- Project Left
- Cluster on Right
- Project Right => Radix-Decluster
[Figure annotations: destination in final result; good locality for fetch]

59-62 Cache-Conscious Joins: Radix-Decluster
(Cache-Conscious Radix-Decluster Projections, Manegold, Boncz, Nes, VLDB 04)
- Red column forms a dense domain {0,1,2,...,N-1}
- All subsequences are ordered, e.g. [figure]
- Task: merge subsequences into dense sequence
- Approach 1: merge H = 2^B lists
  - Drawback: cost O(N log(H))
  - many (H) cursors needed => cache thrashing
- Approach 2: insert by position
  - Drawback: many (H) sparse passes over the result => no cache reuse
- Radix-Decluster: insert by position with sliding window
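
Insert-by-position with a sliding window can be sketched as below. This is a simplified reading of the slide, not the published algorithm in full: clusters are modelled as lists of `(position, value)` pairs, and the name `radix_decluster` and its parameters are assumptions.

```python
def radix_decluster(clusters, n, window):
    """Merge clusters of (position, value) pairs into a dense result of length n.

    Each cluster must be sorted on position, and all positions together must
    form the dense domain 0..n-1.
    """
    result = [None] * n
    cursors = [0] * len(clusters)       # one read cursor per cluster
    for lo in range(0, n, window):
        hi = lo + window                # only result[lo:hi] is written in this round
        for c, cluster in enumerate(clusters):
            i = cursors[c]
            while i < len(cluster) and cluster[i][0] < hi:
                pos, value = cluster[i]
                result[pos] = value
                i += 1
            cursors[c] = i
    return result


# Three clusters and an insertion window of size 2, as in the slide animation.
clusters = [
    [(0, "a"), (3, "d"), (6, "g")],
    [(1, "b"), (4, "e")],
    [(2, "c"), (5, "f"), (7, "h")],
]
print(radix_decluster(clusters, n=8, window=2))
```

Each cluster is read sequentially and exactly once, while random writes are confined to the current window; choosing the window smaller than the cache keeps those writes cache-resident.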

63-76 Cache-Conscious Joins: Radix-Decluster In Action
(Cache-Conscious Radix-Decluster Projections, Manegold, Boncz, Nes, VLDB 04)
[Figure labels: 3 clusters; destination positions; projection column (still) in wrong order; result column to fill; insertion window of size 2]

77 Cache-Conscious Joins: Radix-Decluster Memory Access Pattern
(Cache-Conscious Radix-Decluster Projections, Manegold, Boncz, Nes, VLDB 04)
- Random access only to the sliding window (<< cache size)
  - Only compulsory misses => cache-conscious

78 Cache-Conscious Joins: Radix-Decluster Performance Tradeoff
- Radix-Decluster prefers small W (i.e. vertical fragmentation)
Read also: Jive Join and Slam Join (Fast Joins Using Join Indices, Ross, Lei, VLDBJ 98)

79-81 Hash Aggregation
(Adaptive Aggregation on Chip Multiprocessors, Cieslewicz, Ross, VLDB 07)

82 Multithreaded Hash Aggregation

83 Option 1: Independent Tables
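
The slide names Option 1 without further detail. A plausible reading, and the assumption behind the sketch below, is that every thread aggregates its share of the input into a private hash table and the private tables are merged at the end. The function names are hypothetical, and because of Python's global interpreter lock the threads here illustrate only the structure, not real parallel speedup.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def aggregate_chunk(chunk):
    """Build a private hash table of per-group sums for one thread's input chunk."""
    local = Counter()
    for group, value in chunk:
        local[group] += value
    return local


def independent_tables_sum(rows, num_threads):
    """Sum values per group using one private table per thread, then merge."""
    chunks = [rows[i::num_threads] for i in range(num_threads)]
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        partials = list(pool.map(aggregate_chunk, chunks))
    total = Counter()
    for partial in partials:            # merge step: cost grows with number of groups
        total.update(partial)
    return dict(total)


rows = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5)]
print(independent_tables_sum(rows, num_threads=2))
```

Private tables need no synchronization while aggregating, but every group may appear in every table, so memory use and merge cost grow with the number of threads and groups; the shared-table options on the following slides trade this for contention.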

84 Option 2: Global Tables With Mutex(es)

85 Option 3: Global Table + Atomic Instructions

86 Option 4: Hybrid

87 Performance Results

88 Dynamic Choice

89 Query Processing Algorithms
Adapt query processing algorithms to caches. Related work includes:
- Improving data cache performance
  - Sorting and Hash-Join
- Improving instruction cache performance
  - DSS and OLTP applications

90 DB Operators Using SIMD

91 DB Operators Using SIMD
Branch predication: result[pos] = n; pos += (x[n] < 10)
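
The one-line fragment on slide 91 is the core of a branch-free selection loop. The sketch below places it next to the branching version for comparison; the function names and the `threshold` parameter are illustrative. Python is used only to show the control-flow shape: the interpreter itself branches, so the benefit appears only in compiled code.

```python
def select_branching(x, threshold):
    """Selection with a data-dependent branch per tuple."""
    result = []
    for n, value in enumerate(x):
        if value < threshold:
            result.append(n)
    return result


def select_predicated(x, threshold):
    """Branch-free selection: always write, advance the cursor only on a match."""
    result = [0] * len(x)               # worst case: every tuple qualifies
    pos = 0
    for n in range(len(x)):
        result[pos] = n
        pos += x[n] < threshold         # bool counts as 0 or 1
    return result[:pos]


x = [3, 12, 7, 10, 1]
print(select_branching(x, 10))
print(select_predicated(x, 10))
```

The predicated loop does more writes, but its instruction stream no longer depends on the data, so there is nothing for the branch predictor to mispredict.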

92 Optimizing NSM Hash Join
Idea: exploit inter-tuple parallelism

93 Group Prefetching

94 Software Pipelining

95 Prefetched Hash Join: Results
- Warning: prefetch distance is hard to tune, your mileage may vary
- Does not work for DSM

96 Instruction Related Stalls
Impossible to overlap I-cache delays

97 Call Graph Prefetching
Beneficial for predictable DSS streams

98 DSS: Reducing I-misses with buffering
12% speedup for simple TPC-H queries

99 STEPS: Cache-Resident OLTP

100 Parallelizing Transactions
- Intra-transaction parallelism
  - Used for long-running queries (DSS)
  - Does not work for short queries
  - Short queries dominate in OLTP workloads

101 Parallelizing Transactions
- Intra-transaction parallelism
  - Each thread spans multiple queries
  - Hard to add to existing systems!
Thread-Level Speculation (TLS) makes parallelization easier

102 Thread-Level Speculation (TLS)

103 Thread-Level Speculation (TLS)
Data dependencies limit performance

104 Removing Bottleneck Dependencies
- 2x lower latency with 4 CPUs
- Useful for non-TLS parallelism as well

105 Summary: Memory Affinity
- Cache-aware data placement
  - Eliminates unnecessary trips to memory
  - Minimizes conflict/capacity misses
- What about compulsory (cold) misses?
  - Can't avoid, but can hide latency with prefetching and grouping
  - Techniques for B-trees, Hash-Joins
- Query Processing Algorithms
  - For sorting and hash-joins, addressing D- and I-stalls
- Low-level instruction cache optimizations
  - For DSS: call graph prefetching, branch predication
  - OLTP: STEPS (explicit transaction scheduling)

106 References: Workload studies (Simulation)

107 References: Workload studies (Real Machines)

108 References: Architecture-Conscious Data Placement

109 References: Architecture-Conscious Access Methods

110 References: Architecture-Conscious Query Processing

111 References: Instruction Stream Optimizations and DBMS Architectures


Memory Hierarchy Computing Systems & Performance MSc Informatics Eng. Memory Hierarchy (most slides are borrowed) Computing Systems & Performance Memory Hierarchy MSc Informatics Eng. 2011/12 A.J.Proença Memory Hierarchy (most slides are borrowed) AJProença, Computer Systems & Performance, MEI, UMinho, 2011/12 1 2

More information

Motivation for Parallelism. Motivation for Parallelism. ILP Example: Loop Unrolling. Types of Parallelism

Motivation for Parallelism. Motivation for Parallelism. ILP Example: Loop Unrolling. Types of Parallelism Motivation for Parallelism Motivation for Parallelism The speed of an application is determined by more than just processor speed. speed Disk speed Network speed... Multiprocessors typically improve the

More information

Exploring different level of parallelism Instruction-level parallelism (ILP): how many of the operations/instructions in a computer program can be performed simultaneously 1. e = a + b 2. f = c + d 3.

More information

Kaisen Lin and Michael Conley

Kaisen Lin and Michael Conley Kaisen Lin and Michael Conley Simultaneous Multithreading Instructions from multiple threads run simultaneously on superscalar processor More instruction fetching and register state Commercialized! DEC

More information

Computer Architecture: Multi-Core Processors: Why? Onur Mutlu & Seth Copen Goldstein Carnegie Mellon University 9/11/13

Computer Architecture: Multi-Core Processors: Why? Onur Mutlu & Seth Copen Goldstein Carnegie Mellon University 9/11/13 Computer Architecture: Multi-Core Processors: Why? Onur Mutlu & Seth Copen Goldstein Carnegie Mellon University 9/11/13 Moore s Law Moore, Cramming more components onto integrated circuits, Electronics,

More information

Memory Hierarchies && The New Bottleneck == Cache Conscious Data Access. Martin Grund

Memory Hierarchies && The New Bottleneck == Cache Conscious Data Access. Martin Grund Memory Hierarchies && The New Bottleneck == Cache Conscious Data Access Martin Grund Agenda Key Question: What is the memory hierarchy and how to exploit it? What to take home How is computer memory organized.

More information

Improving Database Performance on Simultaneous Multithreading Processors

Improving Database Performance on Simultaneous Multithreading Processors Tech Report CUCS-7-5 Improving Database Performance on Simultaneous Multithreading Processors Jingren Zhou Microsoft Research jrzhou@microsoft.com John Cieslewicz Columbia University johnc@cs.columbia.edu

More information

Computer Architecture A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved.

Computer Architecture A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology is more

More information

Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2014/15

Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2014/15 Systems Infrastructure for Data Science Web Science Group Uni Freiburg WS 2014/15 Lecture X: Parallel Databases Topics Motivation and Goals Architectures Data placement Query processing Load balancing

More information

Outline. Parallel Database Systems. Information explosion. Parallelism in DBMSs. Relational DBMS parallelism. Relational DBMSs.

Outline. Parallel Database Systems. Information explosion. Parallelism in DBMSs. Relational DBMS parallelism. Relational DBMSs. Parallel Database Systems STAVROS HARIZOPOULOS stavros@cs.cmu.edu Outline Background Hardware architectures and performance metrics Parallel database techniques Gamma Bonus: NCR / Teradata Conclusions

More information

Lecture 13: March 25

Lecture 13: March 25 CISC 879 Software Support for Multicore Architectures Spring 2007 Lecture 13: March 25 Lecturer: John Cavazos Scribe: Ying Yu 13.1. Bryan Youse-Optimization of Sparse Matrix-Vector Multiplication on Emerging

More information

Copyright 2012, Elsevier Inc. All rights reserved.

Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology is more

More information

complex plans and hybrid layouts

complex plans and hybrid layouts class 7 complex plans and hybrid layouts prof. Stratos Idreos HTTP://DASLAB.SEAS.HARVARD.EDU/CLASSES/CS165/ essential column-stores features virtual ids late tuple reconstruction (if ever) vectorized execution

More information

Computer Architecture

Computer Architecture Computer Architecture Slide Sets WS 2013/2014 Prof. Dr. Uwe Brinkschulte M.Sc. Benjamin Betting Part 10 Thread and Task Level Parallelism Computer Architecture Part 10 page 1 of 36 Prof. Dr. Uwe Brinkschulte,

More information

Computer Architecture Spring 2016

Computer Architecture Spring 2016 Computer Architecture Spring 2016 Lecture 19: Multiprocessing Shuai Wang Department of Computer Science and Technology Nanjing University [Slides adapted from CSE 502 Stony Brook University] Getting More

More information

Outline EEL 5764 Graduate Computer Architecture. Chapter 3 Limits to ILP and Simultaneous Multithreading. Overcoming Limits - What do we need??

Outline EEL 5764 Graduate Computer Architecture. Chapter 3 Limits to ILP and Simultaneous Multithreading. Overcoming Limits - What do we need?? Outline EEL 7 Graduate Computer Architecture Chapter 3 Limits to ILP and Simultaneous Multithreading! Limits to ILP! Thread Level Parallelism! Multithreading! Simultaneous Multithreading Ann Gordon-Ross

More information

Hardware Acceleration for Database Systems using Content Addressable Memories

Hardware Acceleration for Database Systems using Content Addressable Memories Hardware Acceleration for Database Systems using Content Addressable Memories Nagender Bandi, Sam Schneider, Divyakant Agrawal, Amr El Abbadi University of California, Santa Barbara Overview The Memory

More information

Copyright 2012, Elsevier Inc. All rights reserved.

Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology

More information

Computer Architecture. A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved.

Computer Architecture. A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Programmers want unlimited amounts of memory with low latency Fast memory technology is more expensive per

More information

Parallel DBMS. Parallel Database Systems. PDBS vs Distributed DBS. Types of Parallelism. Goals and Metrics Speedup. Types of Parallelism

Parallel DBMS. Parallel Database Systems. PDBS vs Distributed DBS. Types of Parallelism. Goals and Metrics Speedup. Types of Parallelism Parallel DBMS Parallel Database Systems CS5225 Parallel DB 1 Uniprocessor technology has reached its limit Difficult to build machines powerful enough to meet the CPU and I/O demands of DBMS serving large

More information

Computer and Information Sciences College / Computer Science Department CS 207 D. Computer Architecture. Lecture 9: Multiprocessors

Computer and Information Sciences College / Computer Science Department CS 207 D. Computer Architecture. Lecture 9: Multiprocessors Computer and Information Sciences College / Computer Science Department CS 207 D Computer Architecture Lecture 9: Multiprocessors Challenges of Parallel Processing First challenge is % of program inherently

More information

Computer Architecture: Multi-Core Processors: Why? Prof. Onur Mutlu Carnegie Mellon University

Computer Architecture: Multi-Core Processors: Why? Prof. Onur Mutlu Carnegie Mellon University Computer Architecture: Multi-Core Processors: Why? Prof. Onur Mutlu Carnegie Mellon University Moore s Law Moore, Cramming more components onto integrated circuits, Electronics, 1965. 2 3 Multi-Core Idea:

More information

CSE502: Computer Architecture CSE 502: Computer Architecture

CSE502: Computer Architecture CSE 502: Computer Architecture CSE 502: Computer Architecture Multi-{Socket,,Thread} Getting More Performance Keep pushing IPC and/or frequenecy Design complexity (time to market) Cooling (cost) Power delivery (cost) Possible, but too

More information

10/16/2017. Miss Rate: ABC. Classifying Misses: 3C Model (Hill) Reducing Conflict Misses: Victim Buffer. Overlapping Misses: Lockup Free Cache

10/16/2017. Miss Rate: ABC. Classifying Misses: 3C Model (Hill) Reducing Conflict Misses: Victim Buffer. Overlapping Misses: Lockup Free Cache Classifying Misses: 3C Model (Hill) Divide cache misses into three categories Compulsory (cold): never seen this address before Would miss even in infinite cache Capacity: miss caused because cache is

More information

Parallel Architecture. Hwansoo Han

Parallel Architecture. Hwansoo Han Parallel Architecture Hwansoo Han Performance Curve 2 Unicore Limitations Performance scaling stopped due to: Power Wire delay DRAM latency Limitation in ILP 3 Power Consumption (watts) 4 Wire Delay Range

More information

Course II Parallel Computer Architecture. Week 2-3 by Dr. Putu Harry Gunawan

Course II Parallel Computer Architecture. Week 2-3 by Dr. Putu Harry Gunawan Course II Parallel Computer Architecture Week 2-3 by Dr. Putu Harry Gunawan www.phg-simulation-laboratory.com Review Review Review Review Review Review Review Review Review Review Review Review Processor

More information

EI338: Computer Systems and Engineering (Computer Architecture & Operating Systems)

EI338: Computer Systems and Engineering (Computer Architecture & Operating Systems) EI338: Computer Systems and Engineering (Computer Architecture & Operating Systems) Chentao Wu 吴晨涛 Associate Professor Dept. of Computer Science and Engineering Shanghai Jiao Tong University SEIEE Building

More information

MULTIPROCESSORS AND THREAD-LEVEL. B649 Parallel Architectures and Programming

MULTIPROCESSORS AND THREAD-LEVEL. B649 Parallel Architectures and Programming MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM B649 Parallel Architectures and Programming Motivation behind Multiprocessors Limitations of ILP (as already discussed) Growing interest in servers and server-performance

More information

MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM. B649 Parallel Architectures and Programming

MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM. B649 Parallel Architectures and Programming MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM B649 Parallel Architectures and Programming Motivation behind Multiprocessors Limitations of ILP (as already discussed) Growing interest in servers and server-performance

More information

CHAPTER 4 MEMORY HIERARCHIES TYPICAL MEMORY HIERARCHY TYPICAL MEMORY HIERARCHY: THE PYRAMID CACHE PERFORMANCE MEMORY HIERARCHIES CACHE DESIGN

CHAPTER 4 MEMORY HIERARCHIES TYPICAL MEMORY HIERARCHY TYPICAL MEMORY HIERARCHY: THE PYRAMID CACHE PERFORMANCE MEMORY HIERARCHIES CACHE DESIGN CHAPTER 4 TYPICAL MEMORY HIERARCHY MEMORY HIERARCHIES MEMORY HIERARCHIES CACHE DESIGN TECHNIQUES TO IMPROVE CACHE PERFORMANCE VIRTUAL MEMORY SUPPORT PRINCIPLE OF LOCALITY: A PROGRAM ACCESSES A RELATIVELY

More information

Chapter 18: Parallel Databases

Chapter 18: Parallel Databases Chapter 18: Parallel Databases Database System Concepts, 6 th Ed. See www.db-book.com for conditions on re-use Chapter 18: Parallel Databases Introduction I/O Parallelism Interquery Parallelism Intraquery

More information

Chapter 18: Parallel Databases. Chapter 18: Parallel Databases. Parallelism in Databases. Introduction

Chapter 18: Parallel Databases. Chapter 18: Parallel Databases. Parallelism in Databases. Introduction Chapter 18: Parallel Databases Chapter 18: Parallel Databases Introduction I/O Parallelism Interquery Parallelism Intraquery Parallelism Intraoperation Parallelism Interoperation Parallelism Design of

More information

Computer Architecture Crash course

Computer Architecture Crash course Computer Architecture Crash course Frédéric Haziza Department of Computer Systems Uppsala University Summer 2008 Conclusions The multicore era is already here cost of parallelism is dropping

More information

Multi-core processors are here, but how do you resolve data bottlenecks in native code?

Multi-core processors are here, but how do you resolve data bottlenecks in native code? Multi-core processors are here, but how do you resolve data bottlenecks in native code? hint: it s all about locality Michael Wall October, 2008 part I of II: System memory 2 PDC 2008 October 2008 Session

More information

Multithreaded Architectures and The Sort Benchmark. Phil Garcia Hank Korth Dept. of Computer Science and Engineering Lehigh University

Multithreaded Architectures and The Sort Benchmark. Phil Garcia Hank Korth Dept. of Computer Science and Engineering Lehigh University Multithreaded Architectures and The Sort Benchmark Phil Garcia Hank Korth Dept. of Computer Science and Engineering Lehigh University About our Sort Benchmark Based on the benchmark proposed in A measure

More information

Contents. Preface xvii Acknowledgments. CHAPTER 1 Introduction to Parallel Computing 1. CHAPTER 2 Parallel Programming Platforms 11

Contents. Preface xvii Acknowledgments. CHAPTER 1 Introduction to Parallel Computing 1. CHAPTER 2 Parallel Programming Platforms 11 Preface xvii Acknowledgments xix CHAPTER 1 Introduction to Parallel Computing 1 1.1 Motivating Parallelism 2 1.1.1 The Computational Power Argument from Transistors to FLOPS 2 1.1.2 The Memory/Disk Speed

More information

Database Workload. from additional misses in this already memory-intensive databases? interference could be a problem) Key question:

Database Workload. from additional misses in this already memory-intensive databases? interference could be a problem) Key question: Database Workload + Low throughput (0.8 IPC on an 8-wide superscalar. 1/4 of SPEC) + Naturally threaded (and widely used) application - Already high cache miss rates on a single-threaded machine (destructive

More information

! Parallel machines are becoming quite common and affordable. ! Databases are growing increasingly large

! Parallel machines are becoming quite common and affordable. ! Databases are growing increasingly large Chapter 20: Parallel Databases Introduction! Introduction! I/O Parallelism! Interquery Parallelism! Intraquery Parallelism! Intraoperation Parallelism! Interoperation Parallelism! Design of Parallel Systems!

More information

Chapter 20: Parallel Databases

Chapter 20: Parallel Databases Chapter 20: Parallel Databases! Introduction! I/O Parallelism! Interquery Parallelism! Intraquery Parallelism! Intraoperation Parallelism! Interoperation Parallelism! Design of Parallel Systems 20.1 Introduction!

More information

Chapter 20: Parallel Databases. Introduction

Chapter 20: Parallel Databases. Introduction Chapter 20: Parallel Databases! Introduction! I/O Parallelism! Interquery Parallelism! Intraquery Parallelism! Intraoperation Parallelism! Interoperation Parallelism! Design of Parallel Systems 20.1 Introduction!

More information

Chapter 17: Parallel Databases

Chapter 17: Parallel Databases Chapter 17: Parallel Databases Introduction I/O Parallelism Interquery Parallelism Intraquery Parallelism Intraoperation Parallelism Interoperation Parallelism Design of Parallel Systems Database Systems

More information

ECE 588/688 Advanced Computer Architecture II

ECE 588/688 Advanced Computer Architecture II ECE 588/688 Advanced Computer Architecture II Instructor: Alaa Alameldeen alaa@ece.pdx.edu Fall 2009 Portland State University Copyright by Alaa Alameldeen and Haitham Akkary 2009 1 When and Where? When:

More information

Lecture 14: Multithreading

Lecture 14: Multithreading CS 152 Computer Architecture and Engineering Lecture 14: Multithreading John Wawrzynek Electrical Engineering and Computer Sciences University of California, Berkeley http://www.eecs.berkeley.edu/~johnw

More information

Multi-threaded Queries. Intra-Query Parallelism in LLVM

Multi-threaded Queries. Intra-Query Parallelism in LLVM Multi-threaded Queries Intra-Query Parallelism in LLVM Multithreaded Queries Intra-Query Parallelism in LLVM Yang Liu Tianqi Wu Hao Li Interpreted vs Compiled (LLVM) Interpreted vs Compiled (LLVM) Interpreted

More information

An Introduction to Parallel Programming

An Introduction to Parallel Programming An Introduction to Parallel Programming Ing. Andrea Marongiu (a.marongiu@unibo.it) Includes slides from Multicore Programming Primer course at Massachusetts Institute of Technology (MIT) by Prof. SamanAmarasinghe

More information

Multicore Hardware and Parallelism

Multicore Hardware and Parallelism Multicore Hardware and Parallelism Minsoo Ryu Department of Computer Science and Engineering 2 1 Advent of Multicore Hardware 2 Multicore Processors 3 Amdahl s Law 4 Parallelism in Hardware 5 Q & A 2 3

More information

Parallel Programming Principle and Practice. Lecture 9 Introduction to GPGPUs and CUDA Programming Model

Parallel Programming Principle and Practice. Lecture 9 Introduction to GPGPUs and CUDA Programming Model Parallel Programming Principle and Practice Lecture 9 Introduction to GPGPUs and CUDA Programming Model Outline Introduction to GPGPUs and Cuda Programming Model The Cuda Thread Hierarchy / Memory Hierarchy

More information

Adapted from David Patterson s slides on graduate computer architecture

Adapted from David Patterson s slides on graduate computer architecture Mei Yang Adapted from David Patterson s slides on graduate computer architecture Introduction Ten Advanced Optimizations of Cache Performance Memory Technology and Optimizations Virtual Memory and Virtual

More information

CMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading)

CMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading) CMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading) Limits to ILP Conflicting studies of amount of ILP Benchmarks» vectorized Fortran FP vs. integer

More information

Parallel Processing SIMD, Vector and GPU s cont.

Parallel Processing SIMD, Vector and GPU s cont. Parallel Processing SIMD, Vector and GPU s cont. EECS4201 Fall 2016 York University 1 Multithreading First, we start with multithreading Multithreading is used in GPU s 2 1 Thread Level Parallelism ILP

More information

CSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI.

CSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI. CSCI 402: Computer Architectures Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI 6.6 - End Today s Contents GPU Cluster and its network topology The Roofline performance

More information

COURSE 12. Parallel DBMS

COURSE 12. Parallel DBMS COURSE 12 Parallel DBMS 1 Parallel DBMS Most DB research focused on specialized hardware CCD Memory: Non-volatile memory like, but slower than flash memory Bubble Memory: Non-volatile memory like, but

More information

A high performance database kernel for query-intensive applications. Peter Boncz

A high performance database kernel for query-intensive applications. Peter Boncz MonetDB: A high performance database kernel for query-intensive applications Peter Boncz CWI Amsterdam The Netherlands boncz@cwi.nl Contents The Architecture of MonetDB The MIL language with examples Where

More information

THREAD LEVEL PARALLELISM

THREAD LEVEL PARALLELISM THREAD LEVEL PARALLELISM Mahdi Nazm Bojnordi Assistant Professor School of Computing University of Utah CS/ECE 6810: Computer Architecture Overview Announcement Homework 4 is due on Dec. 11 th This lecture

More information

CPU Architecture Overview. Varun Sampath CIS 565 Spring 2012

CPU Architecture Overview. Varun Sampath CIS 565 Spring 2012 CPU Architecture Overview Varun Sampath CIS 565 Spring 2012 Objectives Performance tricks of a modern CPU Pipelining Branch Prediction Superscalar Out-of-Order (OoO) Execution Memory Hierarchy Vector Operations

More information

ECE 571 Advanced Microprocessor-Based Design Lecture 4

ECE 571 Advanced Microprocessor-Based Design Lecture 4 ECE 571 Advanced Microprocessor-Based Design Lecture 4 Vince Weaver http://www.eece.maine.edu/~vweaver vincent.weaver@maine.edu 28 January 2016 Homework #1 was due Announcements Homework #2 will be posted

More information

Lec 25: Parallel Processors. Announcements

Lec 25: Parallel Processors. Announcements Lec 25: Parallel Processors Kavita Bala CS 340, Fall 2008 Computer Science Cornell University PA 3 out Hack n Seek Announcements The goal is to have fun with it Recitations today will talk about it Pizza

More information

LECTURE 5: MEMORY HIERARCHY DESIGN

LECTURE 5: MEMORY HIERARCHY DESIGN LECTURE 5: MEMORY HIERARCHY DESIGN Abridged version of Hennessy & Patterson (2012):Ch.2 Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology is more expensive

More information

Control Hazards. Prediction

Control Hazards. Prediction Control Hazards The nub of the problem: In what pipeline stage does the processor fetch the next instruction? If that instruction is a conditional branch, when does the processor know whether the conditional

More information

Cache-Aware Database Systems Internals Chapter 7

Cache-Aware Database Systems Internals Chapter 7 Cache-Aware Database Systems Internals Chapter 7 1 Data Placement in RDBMSs A careful analysis of query processing operators and data placement schemes in RDBMS reveals a paradox: Workloads perform sequential

More information

Anastasia Ailamaki. Performance and energy analysis using transactional workloads

Anastasia Ailamaki. Performance and energy analysis using transactional workloads Performance and energy analysis using transactional workloads Anastasia Ailamaki EPFL and RAW Labs SA students: Danica Porobic, Utku Sirin, and Pinar Tozun Online Transaction Processing $2B+ industry Characteristics:

More information

Hash Joins for Multi-core CPUs. Benjamin Wagner

Hash Joins for Multi-core CPUs. Benjamin Wagner Hash Joins for Multi-core CPUs Benjamin Wagner Joins fundamental operator in query processing variety of different algorithms many papers publishing different results main question: is tuning to modern

More information

EN164: Design of Computing Systems Topic 08: Parallel Processor Design (introduction)

EN164: Design of Computing Systems Topic 08: Parallel Processor Design (introduction) EN164: Design of Computing Systems Topic 08: Parallel Processor Design (introduction) Professor Sherief Reda http://scale.engin.brown.edu Electrical Sciences and Computer Engineering School of Engineering

More information

45-year CPU Evolution: 1 Law -2 Equations

45-year CPU Evolution: 1 Law -2 Equations 4004 8086 PowerPC 601 Pentium 4 Prescott 1971 1978 1992 45-year CPU Evolution: 1 Law -2 Equations Daniel Etiemble LRI Université Paris Sud 2004 Xeon X7560 Power9 Nvidia Pascal 2010 2017 2016 Are there

More information

In-Memory Data Management

In-Memory Data Management In-Memory Data Management Martin Faust Research Assistant Research Group of Prof. Hasso Plattner Hasso Plattner Institute for Software Engineering University of Potsdam Agenda 2 1. Changed Hardware 2.

More information

Lecture 1: Introduction

Lecture 1: Introduction Contemporary Computer Architecture Instruction set architecture Lecture 1: Introduction CprE 581 Computer Systems Architecture, Fall 2016 Reading: Textbook, Ch. 1.1-1.7 Microarchitecture; examples: Pipeline

More information

Data Processing on Modern Hardware

Data Processing on Modern Hardware Data Processing on Modern Hardware Jens Teubner, TU Dortmund, DBIS Group jens.teubner@cs.tu-dortmund.de Summer 2016 c Jens Teubner Data Processing on Modern Hardware Summer 2016 1 Part III Instruction

More information

Improving Instruction Cache Performance in OLTP

Improving Instruction Cache Performance in OLTP Improving Instruction Cache Performance in OLTP STAVROS HARIZOPOULOS MIT CSAIL and ANASTASSIA AILAMAKI Carnegie Mellon University Instruction-cache misses account for up to 40% of execution time in Online

More information

Cache-Aware Database Systems Internals. Chapter 7

Cache-Aware Database Systems Internals. Chapter 7 Cache-Aware Database Systems Internals Chapter 7 Data Placement in RDBMSs A careful analysis of query processing operators and data placement schemes in RDBMS reveals a paradox: Workloads perform sequential

More information

ECE 588/688 Advanced Computer Architecture II

ECE 588/688 Advanced Computer Architecture II ECE 588/688 Advanced Computer Architecture II Instructor: Alaa Alameldeen alaa@ece.pdx.edu Winter 2018 Portland State University Copyright by Alaa Alameldeen and Haitham Akkary 2018 1 When and Where? When:

More information

Multi-core Architectures. Dr. Yingwu Zhu

Multi-core Architectures. Dr. Yingwu Zhu Multi-core Architectures Dr. Yingwu Zhu What is parallel computing? Using multiple processors in parallel to solve problems more quickly than with a single processor Examples of parallel computing A cluster

More information