Architecture-Conscious Database Systems
1 Architecture-Conscious Database Systems Anastassia Ailamaki Ph.D. Examination November 30, 2000
2 A DBMS on a 1980 Computer [Figure: DBMS execution on the PROCESSOR (10 cycles/instruction); DBMS data and instructions in MAIN MEMORY (1 Megabyte buffer pool, 6-cycle access); DATABASE on disk (< 1 MBps), with hot data cached in the buffer pool] The main performance bottleneck was I/O latency 2000 Anastassia Ailamaki Ph.D. Defense 2
3 Present and Future Platforms [Figure: DBMS execution on the PROCESSOR (0.33 cycles/instruction); data and instructions in the CACHE, with a 70-cycle miss to MAIN MEMORY; main memory grows from 1 Gigabyte toward a Terabyte; databases on disk] Hot data migrates to larger and slower main memory
4 Processor & Memory Speed Gap [Figure: cycles per instruction and memory latency over time, from the VAX 11/780 to the Pentium II Xeon: CPI falls while memory latency, measured in cycles, grows] One access to memory is 100s of instruction opportunities
5 On Today's Computers "When you think about what today's machines do - they look at the instruction stream dynamically, find parallelism on the fly, execute instructions out of order, and speculate on branch outcomes - it's amazing that they work." John Hennessy, IEEE Computer, August 1999 New architectures are more sophisticated
6 Why Study Database Performance? [Figure: cycles per instruction for Desktop/Engineering (SPECInt), Decision Support, and Online Transaction Processing workloads, compared with the theoretical minimum] High average time per instruction for DB workloads
7 Contributions (I): Analysis Problem: Where does query execution time go? Proposed evaluation framework [VLDB 99] Identified bottlenecks in hardware: memory access, hardware implementation details Discovered two memory-related bottlenecks: second-level cache data access, first-level instruction cache access Methodological discovery: micro-benchmarks A systematic evaluation framework
8 Contributions (II): Software Problem: Current data placement hurts caches Proposed novel data placement [subm. SIGMOD 01]: rearranges data records on the disk page, optimizes data cache performance Evaluated it against the popular scheme: 70% less data-related memory access delays, does not affect I/O behavior, especially beneficial for decision support workloads A cache-conscious data placement
9 Contributions (III): Hardware Problem: Hardware design affects DB behavior Compared Shore on four different systems: different processor architectures/µ-architectures, different memory subsystems Found evidence that DBMSs would benefit from: 2-4 way associative, larger L2 with no inclusion; large blocks, no sub-blocking; high-accuracy branch prediction; memory-aggressive execution engine A step towards a DSS-centric machine
10 Outline Introduction PART I: Analysis Background Query execution time breakdown Experimental results Bottleneck assessment PART II: Partition Attributes Across (PAX) PART III: Towards a DSS-centric h/w design Conclusions
11 Previous Work Workload characterization studies, e.g., [Barroso 98], [Keeton 98] Various platforms, mostly multiprocessor One DBMS per platform Results: commercial apps differ from scientific apps; OLTP differs from DSS workloads; memory is the major bottleneck No coherent study across DBMSs and workloads
12 An Execution Pipeline [Figure: FETCH/DECODE UNIT, DISPATCH/EXECUTE UNIT, and RETIRE UNIT arranged around an INSTRUCTION POOL, backed by the L1 I-CACHE, L1 D-CACHE, L2 CACHE, and MAIN MEMORY] Branch prediction, non-blocking cache, out-of-order
13 Where Does Time Go? Delays (stalls) come from hardware resources, branch mispredictions, and memory Overlap opportunity: Load A; D=B+C; Load E Execution Time = Computation + Stalls - Overlap
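The breakdown on this slide can be sketched as a toy model; the cycle counts below are hypothetical, not measurements from the talk:

```python
# Toy model of the slide's time breakdown (hypothetical numbers): stall
# cycles that the processor overlaps with useful computation are not paid.

def execution_time(computation, stalls, overlap):
    """All quantities in cycles; returns total execution time."""
    return computation + stalls - overlap

# e.g. 1.0e9 computation cycles and 0.8e9 stall cycles, of which the
# out-of-order engine hides 0.3e9 behind computation:
print(execution_time(1.0e9, 0.8e9, 0.3e9))  # 1500000000.0
```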
14 Setup and Methodology Range Selection (sequential, indexed) select avg (a3) from R where a2 > Lo and a2 < Hi Equijoin (sequential) select avg (a3) from R, S where R.a2 = S.a1 Four commercial DBMSs: A, B, C, D 6400 PII Xeon/MT running Windows NT 4 Used processor counters to measure/estimate Crafted microbenchmarks to isolate execution loops
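The range-selection microbenchmark can be mimicked in memory; a hypothetical Python analogue (the table contents and bounds are made up, and the real microbenchmarks ran inside each DBMS):

```python
# In-memory analogue of: select avg(a3) from R where a2 > Lo and a2 < Hi
import random

random.seed(0)  # deterministic toy table R(a1, a2, a3)
R = [(i, random.randrange(100), random.random()) for i in range(10_000)]

def range_avg(rows, lo, hi):
    """Average of a3 over rows whose a2 falls strictly between lo and hi."""
    hits = [a3 for (_a1, a2, a3) in rows if lo < a2 < hi]
    return sum(hits) / len(hits) if hits else None

avg = range_avg(R, 10, 50)
assert avg is not None and 0.0 < avg < 1.0  # a3 values lie in [0, 1)
```

Varying `Lo` and `Hi` varies the selectivity, which is how the later sensitivity experiments are parameterized.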
15 Time Calculations Measured: Resource stalls, L1I stalls Estimated: L1 data stalls: # misses * penalty L2 stalls: # misses * measured memory latency Branch misprediction stalls: # mispr. * penalty Overlap: measured CPI / expected CPI
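These estimates are simple products of counter readings and fixed penalties; a sketch with hypothetical counts and latencies:

```python
# Stall estimation as on the slide (all inputs hypothetical):
#   L1D stalls    = L1D misses * L1D miss penalty
#   L2 stalls     = L2 misses * measured memory latency
#   branch stalls = mispredictions * misprediction penalty

def estimated_stalls(l1d_misses, l1d_penalty, l2_misses, mem_latency,
                     mispredictions, branch_penalty):
    return (l1d_misses * l1d_penalty
            + l2_misses * mem_latency
            + mispredictions * branch_penalty)

# e.g. 2M L1D misses at 4 cycles, 500K L2 misses at 70 cycles,
# 100K mispredictions at 15 cycles:
print(estimated_stalls(2_000_000, 4, 500_000, 70, 100_000, 15))  # 44500000
```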
16 Microbenchmarks vs. TPC [Figure: clock-cycle breakdowns (computation, memory, branch misprediction, resource) for the sequential and indexed microbenchmarks next to the TPC-D and TPC-C benchmarks on Systems B and D] High CPI compared to integer workloads Sequential scan resembles TPC-D, 2ary index resembles TPC-C
17 Execution Time Breakdown (%) [Figure: percent execution time for sequential scan (10% selectivity), 2ary index selection (10%), and join (no index) on DBMSs A-D, split into computation, memory, branch mispredictions, and resource stalls] Stalls at least 50% of time Memory stalls are major bottleneck
18 Memory Stalls Breakdown (%) [Figure: percent memory stall time for sequential scan (10% selectivity), 2ary index selection (10%), and join (no index) on DBMSs A-D, split into L1 Data, L2 Data, L1 Instruction, and L2 Instruction stalls] L1 instruction and L2 data stalls dominate Different memory bottlenecks across DBMSs and queries
19 Summary of Analysis We can use microbenchmarks instead of TPC Execution time breakdown shows trends Memory access is a major bottleneck Increasing memory-processor performance gap Deeper memory hierarchies expected L2 cache data misses L2 grows (8MB), but will be slower Stalls due to L1 I-cache misses L1 I-cache not likely to grow as much as L2 We need to address every reason for stalls
20 Addressing Bottlenecks [Figure: who addresses each stall component: D-cache: DBMS (improve locality); I-cache: DBMS + compiler; branch mispredictions: compiler + hardware; hardware resources: hardware] Data cache: a clear responsibility of the DBMS
21 Outline Introduction PART I: Where Does Time Go? PART II: Partition Attributes Across The current scheme: Slotted pages Partition Attributes Across (PAX) Performance Results PART III: Towards a DSS-centric h/w design Conclusions
22 The Data Placement Tradeoff Slotted Pages: Used by all commercial DBMSs Store table records sequentially Intra-record locality (attributes of record r together) but pollutes cache Inspiration: Vertical partitioning [Copeland 85] Store n-attribute table as n single-attribute tables Problem: High record reconstruction cost Partition Attributes Across (PAX) Have the cake and eat it, too! PAX: Inter-record locality, low reconstruction cost
23 Current Scheme: Slotted Pages Formal name: NSM (N-ary Storage Model) [Figure: an NSM page for relation R (e.g., RID 1: SSN 1237, Name Jane, Age 30): PAGE HEADER, then record headers and records (Jane 30, John 45, Jim 20, Susan 52, ...) stored one after another] Records are stored sequentially Offsets to start of each record at end of page
24 Current Scheme: Slotted Pages [Figure: record layout: HEADER (null bitmap, record length, etc., plus offsets to variable-length fields), then FIXED-LENGTH VALUES, then VARIABLE-LENGTH VALUES] All attributes of a record are stored together
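A slotted page can be sketched as follows; a toy simplification (the page size, record format, and slot encoding are assumptions for illustration, not Shore's actual layout):

```python
PAGE_SIZE = 128  # toy page size in bytes (hypothetical)

class SlottedPage:
    """Records grow from the front of the page; the slot array of
    (offset, length) entries conceptually grows from the back."""
    def __init__(self):
        self.data = bytearray(PAGE_SIZE)
        self.free_start = 0   # next free byte for record storage
        self.slots = []       # (offset, length) per record

    def insert(self, record: bytes):
        # Each slot conceptually costs 4 bytes at the end of the page.
        needed = self.free_start + len(record) + 4 * (len(self.slots) + 1)
        if needed > PAGE_SIZE:
            return None       # page full
        off = self.free_start
        self.data[off:off + len(record)] = record
        self.free_start += len(record)
        self.slots.append((off, len(record)))
        return len(self.slots) - 1  # slot id

    def get(self, slot_id):
        off, length = self.slots[slot_id]
        return bytes(self.data[off:off + length])

p = SlottedPage()
rid = p.insert(b"1237|Jane|30")
assert p.get(rid) == b"1237|Jane|30"
```

Because offsets are indirected through the slot array, records can move within the page without invalidating record ids.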
25 NSM Cache Behavior [Figure: an NSM page in MAIN MEMORY; executing select name from R where age > 40 pulls whole cache blocks, containing record headers, names, and ages together, into the CACHE] NSM pollutes the cache and wastes bandwidth
26 Partition Attributes Across (PAX) [Figure: the same records in an NSM page (PAGE HEADER, then full records: Jane 30, John 45, Jim 20, Susan 52) and in a PAX page (PAGE HEADER, then a minipage of names Jane, John, Jim, Susan followed by a minipage of ages)] Partition data within the page for spatial locality
27 PAX: Mapping to Cache [Figure: with PAX, executing select name from R where age > 40 brings cache blocks from MAIN MEMORY that each hold values of a single attribute (block 1: Jane, John, Jim, Susan)] Fewer cache misses, low reconstruction cost
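The contrast between the two layouts can be sketched with the slide's toy records; the flat-list encoding below is an assumption for illustration (real pages also carry headers, slots, and presence bits):

```python
records = [("Jane", 30), ("John", 45), ("Jim", 20), ("Susan", 52)]

# NSM: whole records stored one after another within the page.
nsm_page = [field for record in records for field in record]

# PAX: one minipage per attribute within the same page.
pax_page = [name for name, _ in records] + [age for _, age in records]

print(nsm_page)  # ['Jane', 30, 'John', 45, 'Jim', 20, 'Susan', 52]
print(pax_page)  # ['Jane', 'John', 'Jim', 'Susan', 30, 45, 20, 52]

# A scan of age touches every other slot under NSM but one contiguous run
# under PAX, so each cache line fetched for the scan holds only ages.
assert nsm_page[1::2] == pax_page[4:] == [30, 45, 20, 52]
```

Reconstruction stays cheap because both minipages live on the same page: record i is simply (pax_page[i], pax_page[4 + i]) here.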
28 PAX: Detailed Design [Figure: a PAX page: page header with pid, # attributes, attribute sizes (e.g. v 4 f), free space, and # records; an F-minipage per fixed-length attribute (e.g. Jane, John) with presence bits at its end; a V-minipage per variable-length attribute with v-offsets at its end]
29 Basic Evaluation: Methodology Main-memory resident R Query: select avg(a_i) from R where a_j >= Lo and a_j <= Hi PII Xeon running Windows NT 4 16KB L1-I, 16KB L1-D, 512 KB L2, 512 MB RAM Used processor counters Implemented schemes on Shore Storage Manager Similar behavior to commercial Database Systems
30 Why Use Shore? Range selection query on 4 commercial DBMSs + Shore Breakdown of execution & memory delays [Figure: for a range selection (no index), percent execution time (computation, memory, branch mispredictions, resource) and percent memory stall time (L1 Data, L2 Data, L1 Instruction, L2 Instruction) on DBMSs A-D and Shore] We can use Shore to evaluate DSS workload behavior
31 Effect on Accessing Cache Data [Figure: L1 and L2 data stall cycles per record for NSM vs. PAX, and L2 data stall cycles per record as selectivity varies from 1% to 100%] PAX incurs 70% less data cache penalty than NSM PAX reduces cache misses at both L1 and L2 Selectivity doesn't matter for PAX data stalls
32 Time and Sensitivity Analysis [Figure: clock cycles per record (resource, branch misprediction, memory, computation) for NSM vs. PAX, and elapsed time (sec) as the number of attributes in the record varies] PAX: 75% less memory penalty than NSM (10% of time) Execution times converge as number of attrs increases
33 Sensitivity Analysis (2) Elapsed time sensitivity to projectivity / # predicates Range selection queries, 1% selectivity [Figure: elapsed seconds for NSM vs. PAX as projectivity and the number of attributes in the predicate grow] PAX and NSM times converge as query covers entire tuple
34 Evaluation Using a DSS Benchmark Loaded 100M, 200M, and 500M TPC-H DBs Ran Queries: Range Selections w/ variable parameters (RS) TPC-H Q1 and Q6 sequential scans lots of aggregates (sum, avg, count) grouping/ordering of results TPC-H Q12 and Q14 (Adaptive Hybrid) Hash Join complex where clause, conditional aggregates PII Xeon running Windows NT 4 Used processor counters
35 Insertions with PAX Estimate average field sizes Start inserting records If a record doesn't fit: reorganize page (move minipage boundaries), adjust average field sizes [Figure: elapsed bulk-load times (seconds) for NSM, PAX, and DSM on the 100 MB, 200 MB, and 500 MB databases] % of reorganizations accommodate a single record PAX loads a TPC-H database 2-26% slower than NSM
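The insertion procedure above can be sketched as follows; a toy simplification (the byte budget, proportional split, and reorganization policy below are assumptions for illustration, not PAX's exact algorithm):

```python
PAGE_BYTES = 64  # toy per-page budget for minipage data (hypothetical)

class PaxPage:
    """Minipage capacities start from estimated field sizes; boundaries
    move when a record does not fit."""
    def __init__(self, est_field_sizes):
        total = sum(est_field_sizes)
        self.capacity = [PAGE_BYTES * s // total for s in est_field_sizes]
        self.minipages = [[] for _ in est_field_sizes]
        self.reorganizations = 0

    def _used(self, i):
        return sum(len(v) for v in self.minipages[i])

    def _fits(self, record):
        return all(self._used(i) + len(f) <= self.capacity[i]
                   for i, f in enumerate(record))

    def _reorganize(self, record):
        # Move minipage boundaries using the bytes actually needed so far,
        # which also refreshes the running size estimates.
        self.reorganizations += 1
        need = [self._used(i) + len(f) for i, f in enumerate(record)]
        total = sum(need)
        self.capacity = [PAGE_BYTES * n // total for n in need]

    def insert(self, record):
        if not self._fits(record):
            self._reorganize(record)
        if not self._fits(record):
            return False  # page genuinely full: caller starts a new page
        for i, f in enumerate(record):
            self.minipages[i].append(f)
        return True

page = PaxPage(est_field_sizes=[4, 2])
count = 0
while page.insert(("Jane", "30")):
    count += 1
assert count == 10 and page.reorganizations == 1
```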
36 Elapsed Execution Time [Figure: elapsed time (sec) for TPC-H queries Q1, Q6, Q12, and Q14 under NSM and PAX on the 100M, 200M, and 500M TPC-H databases] PAX improves performance up to 42% even with I/O
37 Speedup [Figure: PAX/NSM speedup on PII/NT for RS, Q1, Q6, Q12, and Q14 on the 100 MB, 200 MB, and 500 MB databases, ranging up to 45%] PAX improves performance up to 42% even with I/O Speedup differs across DB sizes
38 PAX: Summary Advantages High data cache performance Faster than NSM for DSS queries Orthogonal to other storage decisions Does not affect I/O performance Current Disadvantages Complex free space mgmt with variable length attributes Complicates update algorithm PAX beneficial for read-mostly workloads (e.g., DSS) (update-intensive workloads in future work)
39 Outline Introduction PART I: Where Does Time Go? PART II: Partition Attributes Across PART III: Towards DSS-Centric H/W Memory subsystem Branch prediction mechanism Processor pipeline Conclusions
40 Architecture Platform Differences RISC or CISC Instruction set Microarchitecture Pipeline Speculation (out-of-order, multiple issue) Branch prediction Memory subsystem Cache size, associativity Block size, subblocking Inclusion Which design looks beneficial for DSS workloads?
41 Experimental Setup Used four machines Sun UltraSparc: US-II and US-IIi, Solaris 2.6/2.7 Intel P6: PII Xeon, Linux v2.2 DEC Alpha: 21164A, OSF1 v.4.0 Architecture and microarchitecture characteristics (US-II / US-IIi / PII Xeon / Alpha): speed 296 MHz / 300 MHz / 400 MHz / 532 MHz; out of order? no / no / yes / no; instruction set RISC / RISC / CISC / RISC
42 Cache Hierarchies (US-II / US-IIi / PII Xeon / Alpha) L1 D: size, assoc 16KB DM / 16KB DM / 16KB 2-way / 8KB DM; block/subblock 32/16, 32/16, 32/32, 32/32; inclusion by L2 yes, yes, no, yes L1 I: size, assoc 16KB 2-way / 16KB 2-way / 16KB 4-way / 8KB DM; block/subblock 32/32, 32/32, 32/32, 32/16; inclusion by L2 yes, yes, no, no L2: size, assoc 2 MB DM / 512KB DM / 512KB 4-way / 96KB 3-way; block/subblock 64/64, 64/64, 32/32, 64/32; inclusion by L3 N/A, N/A, N/A, yes L3 (Alpha only): size, assoc 4 MB DM; block/subblock 64/
43 Methodology Compiled Shore with gcc Alpha version not optimized Ran DSS workload Range Selections w/ variable parameters (RS) TPC-H 1, 6, 12, 14 Used processor counters Sun: run-pic (by Glenn Ammons, modified) PII: PAPI (public-domain counter library) Alpha: DCPI (sampling software by Compaq)
44 Superscalar Processor Capability Alpha issues at most 2 instructions / cycle (max=4) >60% of the time the Xeon retires 0/1 instruction (max=3) [Figure: Alpha issue breakdown (pipeline dry, 0-, 1-, 2-issue) per query, and PII Xeon (NT) retire breakdown (0- to 3-retired) per DBMS, as % of total clock cycles] The current issue/retire width remains unexploited
45 Clock-per-Record Breakdown [Figure: percent clock cycles per record (D-stalls, I-stalls, branch mispredictions, other) for RS, Q1, Q6, Q12, and Q14 on the UltraSparc-II, PII Xeon, and A21164] Memory + branch misprediction stalls = 35-60% of time Data accesses: major memory bottleneck (esp. Q12, Q14)
46 Branch Prediction Branch penalty = frequency * misprediction rate * penalty Frequency is typically 20-25% In-order processors => lower penalty Low misprediction accuracy may break it (e.g., UltraSparc) [Table: branch frequency and misprediction rate for RS/Q1/Q6 vs. Q12/Q14 on the PII Xeon and Alpha (frequencies up to 22%, misprediction rates from 1% to 15%), with branch penalties of 15 cycles (PII Xeon) and 5 cycles (Alpha)] High-accuracy predictors
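The penalty formula can be evaluated directly; the example numbers below are hypothetical, chosen inside the ranges the slide quotes:

```python
# Branch stall contribution per instruction, per the slide's formula:
#   frequency * misprediction rate * misprediction penalty

def branch_stall_cpi(frequency, mispredict_rate, penalty_cycles):
    """Extra cycles per instruction lost to mispredicted branches."""
    return frequency * mispredict_rate * penalty_cycles

# e.g. 20% branch instructions, 5% mispredicted, 15-cycle penalty:
cpi_cost = branch_stall_cpi(0.20, 0.05, 15)
assert abs(cpi_cost - 0.15) < 1e-12  # about 0.15 extra CPI
```

Cutting the misprediction rate or the pipeline penalty shrinks the product directly, which is why both high-accuracy predictors and in-order (short-penalty) designs help.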
47 Cache Inclusion [Figure: UltraSparc II/IIi comparison (RS): elapsed time and L1D, L1I, L2D, and L2I misses on the UltraSparc-IIi normalized to the UltraSparc-II] Small caches should not maintain inclusion
48 Cache Block Size [Figure: PAX/NSM improvement in lowest-level data cache miss rates (PII Xeon L2, US-II L2, A21164 L3) for RS, Q1, Q6, Q12, and Q14, up to 120%] Larger cache line = lower miss rates
49 Sub-Blocking / Associativity UltraSparc: direct-mapped, subblocking (32/16) Xeon: 2-way, no subblocking (32/32) [Figure: L1D misses per record for RS on the US-II, US-IIi, and PII Xeon as projectivity and selectivity (2%-100%) vary] High associativity, no sub-blocking
50 PAX vs. NSM across platforms [Figure: PAX/NSM speedup on Unix (100MB database) for RS, Q1, Q6, Q12, and Q14 on the PII Xeon, UltraSparc-II, and A21164, up to 45%] PAX improves all queries
51 Summary Memory hierarchy: non-blocking caches; >64-byte block, no sub-blocking; generous-sized L1-I (128K) and L2 (> 2MB); a tiny, fast L1/L2 with a large, slow L3 won't add much; high associativity (2-4); no inclusion (at least for instructions) Processor pipeline: issue width is fine, out-of-order overlaps stall time; execution engine to sustain >1 load/store instr.; high-accuracy branch prediction, provided that implementation cost remains stable
52 Conclusions Found trends in behavior of commercial DBMSs using an analytic framework to model execution time Identified bottlenecks among HW components Main memory access is the new DB bottleneck Major showstoppers: L1 Instruction + L2 Data Proposed new design for cache performance Increase spatial locality using novel data placement 70% less data-related memory access delays Significant improvement on sequential scans Evaluated several hardware parameters Suggested DSS-centric processor and memory design
More informationAdvanced Processor Architecture
Advanced Processor Architecture Jinkyu Jeong (jinkyu@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu SSE2030: Introduction to Computer Systems, Spring 2018, Jinkyu Jeong
More informationHardware-Based Speculation
Hardware-Based Speculation Execute instructions along predicted execution paths but only commit the results if prediction was correct Instruction commit: allowing an instruction to update the register
More informationStoring Data: Disks and Files. Storing and Retrieving Data. Why Not Store Everything in Main Memory? Database Management Systems need to:
Storing : Disks and Files base Management System, R. Ramakrishnan and J. Gehrke 1 Storing and Retrieving base Management Systems need to: Store large volumes of data Store data reliably (so that data is
More informationMemory. Lecture 22 CS301
Memory Lecture 22 CS301 Administrative Daily Review of today s lecture w Due tomorrow (11/13) at 8am HW #8 due today at 5pm Program #2 due Friday, 11/16 at 11:59pm Test #2 Wednesday Pipelined Machine Fetch
More informationStoring Data: Disks and Files. Storing and Retrieving Data. Why Not Store Everything in Main Memory? Chapter 7
Storing : Disks and Files Chapter 7 base Management Systems, R. Ramakrishnan and J. Gehrke 1 Storing and Retrieving base Management Systems need to: Store large volumes of data Store data reliably (so
More informationComputer Architecture Spring 2016
Computer Architecture Spring 2016 Lecture 02: Introduction II Shuai Wang Department of Computer Science and Technology Nanjing University Pipeline Hazards Major hurdle to pipelining: hazards prevent the
More informationMemory Hierarchy Computing Systems & Performance MSc Informatics Eng. Memory Hierarchy (most slides are borrowed)
Computing Systems & Performance Memory Hierarchy MSc Informatics Eng. 2012/13 A.J.Proença Memory Hierarchy (most slides are borrowed) AJProença, Computer Systems & Performance, MEI, UMinho, 2012/13 1 2
More informationCMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading)
CMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading) Limits to ILP Conflicting studies of amount of ILP Benchmarks» vectorized Fortran FP vs. integer
More informationProfileMe: Hardware-Support for Instruction-Level Profiling on Out-of-Order Processors
ProfileMe: Hardware-Support for Instruction-Level Profiling on Out-of-Order Processors Jeffrey Dean Jamey Hicks Carl Waldspurger William Weihl George Chrysos Digital Equipment Corporation 1 Motivation
More informationStoring and Retrieving Data. Storing Data: Disks and Files. Solution 1: Techniques for making disks faster. Disks. Why Not Store Everything in Tapes?
Storing and Retrieving Storing : Disks and Files base Management Systems need to: Store large volumes of data Store data reliably (so that data is not lost!) Retrieve data efficiently Alternatives for
More informationStoring and Retrieving Data. Storing Data: Disks and Files. Solution 1: Techniques for making disks faster. Disks. Why Not Store Everything in Tapes?
Storing and Retrieving Storing : Disks and Files Chapter 9 base Management Systems need to: Store large volumes of data Store data reliably (so that data is not lost!) Retrieve data efficiently Alternatives
More informationCSF Improving Cache Performance. [Adapted from Computer Organization and Design, Patterson & Hennessy, 2005]
CSF Improving Cache Performance [Adapted from Computer Organization and Design, Patterson & Hennessy, 2005] Review: The Memory Hierarchy Take advantage of the principle of locality to present the user
More informationEITF20: Computer Architecture Part3.2.1: Pipeline - 3
EITF20: Computer Architecture Part3.2.1: Pipeline - 3 Liang Liu liang.liu@eit.lth.se 1 Outline Reiteration Dynamic scheduling - Tomasulo Superscalar, VLIW Speculation ILP limitations What we have done
More informationUse-Based Register Caching with Decoupled Indexing
Use-Based Register Caching with Decoupled Indexing J. Adam Butts and Guri Sohi University of Wisconsin Madison {butts,sohi}@cs.wisc.edu ISCA-31 München, Germany June 23, 2004 Motivation Need large register
More informationCache Memory COE 403. Computer Architecture Prof. Muhamed Mudawar. Computer Engineering Department King Fahd University of Petroleum and Minerals
Cache Memory COE 403 Computer Architecture Prof. Muhamed Mudawar Computer Engineering Department King Fahd University of Petroleum and Minerals Presentation Outline The Need for Cache Memory The Basics
More informationDatabase Systems and Modern CPU Architecture
Database Systems and Modern CPU Architecture Prof. Dr. Torsten Grust Winter Term 2006/07 Hard Disk 2 RAM Administrativa Lecture hours (@ MI HS 2): Monday, 09:15 10:00 Tuesday, 14:15 15:45 No lectures on
More informationInstruction-Level Parallelism and Its Exploitation (Part III) ECE 154B Dmitri Strukov
Instruction-Level Parallelism and Its Exploitation (Part III) ECE 154B Dmitri Strukov Dealing With Control Hazards Simplest solution to stall pipeline until branch is resolved and target address is calculated
More informationAs the amount of ILP to exploit grows, control dependences rapidly become the limiting factor.
Hiroaki Kobayashi // As the amount of ILP to exploit grows, control dependences rapidly become the limiting factor. Branches will arrive up to n times faster in an n-issue processor, and providing an instruction
More informationNear Memory Key/Value Lookup Acceleration MemSys 2017
Near Key/Value Lookup Acceleration MemSys 2017 October 3, 2017 Scott Lloyd, Maya Gokhale Center for Applied Scientific Computing This work was performed under the auspices of the U.S. Department of Energy
More informationLecture 8 Dynamic Branch Prediction, Superscalar and VLIW. Computer Architectures S
Lecture 8 Dynamic Branch Prediction, Superscalar and VLIW Computer Architectures 521480S Dynamic Branch Prediction Performance = ƒ(accuracy, cost of misprediction) Branch History Table (BHT) is simplest
More informationOutline. 1 Reiteration. 2 Cache performance optimization. 3 Bandwidth increase. 4 Reduce hit time. 5 Reduce miss penalty. 6 Reduce miss rate
Outline Lecture 7: EITF20 Computer Architecture Anders Ardö EIT Electrical and Information Technology, Lund University November 21, 2012 A. Ardö, EIT Lecture 7: EITF20 Computer Architecture November 21,
More informationCISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP
CISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP Michela Taufer http://www.cis.udel.edu/~taufer/teaching/cis662f07 Powerpoint Lecture Notes from John Hennessy and David Patterson s: Computer
More informationExecution-based Prediction Using Speculative Slices
Execution-based Prediction Using Speculative Slices Craig Zilles and Guri Sohi University of Wisconsin - Madison International Symposium on Computer Architecture July, 2001 The Problem Two major barriers
More informationTopic 14: Dealing with Branches
Topic 14: Dealing with Branches COS / ELE 375 Computer Architecture and Organization Princeton University Fall 2015 Prof. David August 1 FLASHBACK: Pipeline Hazards Control Hazards What is the next instruction?
More informationTopic 14: Dealing with Branches
Topic 14: Dealing with Branches COS / ELE 375 FLASHBACK: Pipeline Hazards Control Hazards What is the next instruction? Branch instructions take time to compute this. Stall, Predict, or Delay: Computer
More informationMicroarchitecture Overview. Performance
Microarchitecture Overview Prof. Scott Rixner Duncan Hall 3028 rixner@rice.edu January 18, 2005 Performance 4 Make operations faster Process improvements Circuit improvements Use more transistors to make
More informationPowerPC 620 Case Study
Chapter 6: The PowerPC 60 Modern Processor Design: Fundamentals of Superscalar Processors PowerPC 60 Case Study First-generation out-of-order processor Developed as part of Apple-IBM-Motorola alliance
More informationSandor Heman, Niels Nes, Peter Boncz. Dynamic Bandwidth Sharing. Cooperative Scans: Marcin Zukowski. CWI, Amsterdam VLDB 2007.
Cooperative Scans: Dynamic Bandwidth Sharing in a DBMS Marcin Zukowski Sandor Heman, Niels Nes, Peter Boncz CWI, Amsterdam VLDB 2007 Outline Scans in a DBMS Cooperative Scans Benchmarks DSM version VLDB,
More informationNew Advances in Micro-Processors and computer architectures
New Advances in Micro-Processors and computer architectures Prof. (Dr.) K.R. Chowdhary, Director SETG Email: kr.chowdhary@jietjodhpur.com Jodhpur Institute of Engineering and Technology, SETG August 27,
More informationEE 4683/5683: COMPUTER ARCHITECTURE
EE 4683/5683: COMPUTER ARCHITECTURE Lecture 6A: Cache Design Avinash Kodi, kodi@ohioedu Agenda 2 Review: Memory Hierarchy Review: Cache Organization Direct-mapped Set- Associative Fully-Associative 1 Major
More informationParallel Computer Architecture
Parallel Computer Architecture What is Parallel Architecture? A parallel computer is a collection of processing elements that cooperate to solve large problems fast Some broad issues: Resource Allocation:»
More informationCSE 820 Graduate Computer Architecture. week 6 Instruction Level Parallelism. Review from Last Time #1
CSE 820 Graduate Computer Architecture week 6 Instruction Level Parallelism Based on slides by David Patterson Review from Last Time #1 Leverage Implicit Parallelism for Performance: Instruction Level
More informationMultiple Issue ILP Processors. Summary of discussions
Summary of discussions Multiple Issue ILP Processors ILP processors - VLIW/EPIC, Superscalar Superscalar has hardware logic for extracting parallelism - Solutions for stalls etc. must be provided in hardware
More information2
1 2 3 4 5 6 For more information, see http://www.intel.com/content/www/us/en/processors/core/core-processorfamily.html 7 8 The logic for identifying issues on Intel Microarchitecture Codename Ivy Bridge
More informationPipelining, Branch Prediction, Trends
Pipelining, Branch Prediction, Trends 10.1-10.4 Topics 10.1 Quantitative Analyses of Program Execution 10.2 From CISC to RISC 10.3 Pipelining the Datapath Branch Prediction, Delay Slots 10.4 Overlapping
More informationDatabase Systems. November 2, 2011 Lecture #7. topobo (mit)
Database Systems November 2, 2011 Lecture #7 1 topobo (mit) 1 Announcement Assignment #2 due today Assignment #3 out today & due on 11/16. Midterm exam in class next week. Cover Chapters 1, 2,
More informationMemory Hierarchy Computing Systems & Performance MSc Informatics Eng. Memory Hierarchy (most slides are borrowed)
Computing Systems & Performance Memory Hierarchy MSc Informatics Eng. 2011/12 A.J.Proença Memory Hierarchy (most slides are borrowed) AJProença, Computer Systems & Performance, MEI, UMinho, 2011/12 1 2
More informationIn-order vs. Out-of-order Execution. In-order vs. Out-of-order Execution
In-order vs. Out-of-order Execution In-order instruction execution instructions are fetched, executed & committed in compilergenerated order if one instruction stalls, all instructions behind it stall
More informationColumn-Stores vs. Row-Stores: How Different Are They Really?
Column-Stores vs. Row-Stores: How Different Are They Really? Daniel J. Abadi, Samuel Madden and Nabil Hachem SIGMOD 2008 Presented by: Souvik Pal Subhro Bhattacharyya Department of Computer Science Indian
More informationBeyond ILP. Hemanth M Bharathan Balaji. Hemanth M & Bharathan Balaji
Beyond ILP Hemanth M Bharathan Balaji Multiscalar Processors Gurindar S Sohi Scott E Breach T N Vijaykumar Control Flow Graph (CFG) Each node is a basic block in graph CFG divided into a collection of
More informationComputer Architecture Lecture 14: Out-of-Order Execution. Prof. Onur Mutlu Carnegie Mellon University Spring 2013, 2/18/2013
18-447 Computer Architecture Lecture 14: Out-of-Order Execution Prof. Onur Mutlu Carnegie Mellon University Spring 2013, 2/18/2013 Reminder: Homework 3 Homework 3 Due Feb 25 REP MOVS in Microprogrammed
More informationEECS 322 Computer Architecture Superpipline and the Cache
EECS 322 Computer Architecture Superpipline and the Cache Instructor: Francis G. Wolff wolff@eecs.cwru.edu Case Western Reserve University This presentation uses powerpoint animation: please viewshow Summary:
More informationBranch statistics. 66% forward (i.e., slightly over 50% of total branches). Most often Not Taken 33% backward. Almost all Taken
Branch statistics Branches occur every 4-7 instructions on average in integer programs, commercial and desktop applications; somewhat less frequently in scientific ones Unconditional branches : 20% (of
More information