Architecture-Conscious Database Systems


1 Architecture-Conscious Database Systems Anastassia Ailamaki Ph.D. Examination November 30, 2000

2 A DBMS on a 1980 Computer  [Diagram: the DBMS executes on a PROCESSOR at 10 cycles/instruction; data and instructions come from a 1-Megabyte MAIN MEMORY (the buffer pool) at 6 cycles per access; hot data is fetched from the DATABASE on disk at under 1 MBps.] The main performance bottleneck was I/O latency 2000 Anastassia Ailamaki Ph.D. Defense 2

3 Present and Future Platforms  [Diagram: the DBMS executes on a PROCESSOR at 0.33 cycles/instruction with CACHEs for data and instructions; DBMS data and instructions live in MAIN MEMORY at 70 cycles per access, growing from 1 Gigabyte toward a Terabyte; the databases (DB) sit below at MBps bandwidth.] Hot data migrates to larger and slower main memory 2000 Anastassia Ailamaki Ph.D. Defense 3

4 Processor & Memory Speed Gap  [Chart: cycles per instruction (CPI) and memory latency in cycles, from the VAX 11/780 to the Pentium II Xeon; CPI keeps falling while memory latency in cycles keeps growing.] One access to memory costs 100s of instruction opportunities 2000 Anastassia Ailamaki Ph.D. Defense 4

5 On Today's Computers  "When you think about what today's machines do - they look at the instruction stream dynamically, find parallelism on the fly, execute instructions out of order, and speculate on branch outcomes - it's amazing that they work." John Hennessy, IEEE Computer, August 1999. New architectures are more sophisticated 2000 Anastassia Ailamaki Ph.D. Defense 5

6 Why Study Database Performance?  [Chart: cycles per instruction for the theoretical minimum, Desktop/Engineering (SPECInt), Decision Support, and Online Transaction Processing workloads.] High average time per instruction for DB workloads 2000 Anastassia Ailamaki Ph.D. Defense 6

7 Contributions (I): Analysis Problem: Where does query execution time go? Proposed evaluation framework [VLDB 99] Identified bottlenecks in hardware memory access hardware implementation details Discovered two memory-related bottlenecks second-level cache data access first-level instruction cache access Methodological discovery: micro-benchmarks A systematic evaluation framework 2000 Anastassia Ailamaki Ph.D. Defense 7

8 Contributions (II): Software Problem: Current data placement hurts caches Proposed novel data placement [subm. SIGMOD 01] rearranges data records on disk page optimizes data cache performance Evaluated it against the popular scheme 70% less data-related memory access delays does not affect I/O behavior especially beneficial for decision support workloads A cache-conscious data placement 2000 Anastassia Ailamaki Ph.D. Defense 8

9 Contributions (III): Hardware Problem: Hardware design affects DB behavior Compared Shore on four different systems different processor architectures/µ-architectures different memory subsystems Found evidence that DBMSs would benefit from 2-4 way associative, larger L2, no inclusion large blocks, no sub-blocking high-accuracy branch prediction memory-aggressive execution engine Step towards a DSS-centric machine 2000 Anastassia Ailamaki Ph.D. Defense 9

10 Outline Introduction PART I: Analysis Background Query execution time breakdown Experimental results Bottleneck assessment PART II: Partition Attributes Across (PAX) PART III: Towards a DSS-centric h/w design Conclusions 2000 Anastassia Ailamaki Ph.D. Defense 10

11 Previous Work Workload characterization studies, e.g., [Barroso 98], [Keeton 98] Various platforms, mostly multiprocessor One DBMS per platform Results: Commercial different than scientific apps OLTP different than DSS workloads Memory is major bottleneck No coherent study across DBMSs and workloads 2000 Anastassia Ailamaki Ph.D. Defense 11

12 An Execution Pipeline  [Diagram: an INSTRUCTION POOL connects the FETCH/DECODE UNIT, the DISPATCH/EXECUTE UNIT, and the RETIRE UNIT, backed by the L1 I-CACHE, L1 D-CACHE, L2 CACHE, and MAIN MEMORY.] Branch prediction, non-blocking cache, out-of-order 2000 Anastassia Ailamaki Ph.D. Defense 12

13 Where Does Time Go?  Delays (stalls) on top of Computation come from Memory, Branch Mispredictions, and Hardware Resources. Overlap opportunity: Load A; D = B + C; Load E. Execution Time = Computation + Stalls - Overlap 2000 Anastassia Ailamaki Ph.D. Defense 13

14 Setup and Methodology Range Selection (sequential, indexed) select avg (a3) from R where a2 > Lo and a2 < Hi Equijoin (sequential) select avg (a3) from R, S where R.a2 = S.a1 Four commercial DBMSs: A, B, C, D 6400 PII Xeon/MT running Windows NT 4 Used processor counters to measure/estimate Crafted microbenchmarks to isolate execution loops 2000 Anastassia Ailamaki Ph.D. Defense 14

15 Time Calculations Measured: Resource stalls, L1I stalls Estimated: L1 data stalls: # misses * penalty L2 stalls: # misses * measured memory latency Branch misprediction stalls: # mispr. * penalty Overlap: measured CPI / expected CPI 2000 Anastassia Ailamaki Ph.D. Defense 15
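To make slide 15's arithmetic concrete, here is a minimal sketch of that accounting in Python. It is illustrative only: the counter names, penalty values, and example numbers are assumptions rather than the thesis tooling, and the overlap is derived from the identity on slide 13 (Execution Time = Computation + Stalls - Overlap).

```python
# Illustrative sketch: estimating the stall-time breakdown from hardware counter
# values, following the formulas on slide 15. Counter names, penalties, and the
# example numbers are hypothetical.

def breakdown(measured, penalties):
    """Return estimated stall cycles per component, computation cycles, and overlap."""
    stalls = {
        "resource": measured["resource_stall_cycles"],            # measured directly
        "l1i":      measured["l1i_stall_cycles"],                  # measured directly
        "l1d":      measured["l1d_misses"] * penalties["l1d"],     # misses * penalty
        "l2":       measured["l2_misses"]  * penalties["memory"],  # misses * memory latency
        "branch":   measured["branch_mispredictions"] * penalties["mispredict"],
    }
    computation = measured["instructions"] * penalties["expected_cpi"]
    # Overlap: out-of-order execution hides part of the stalls, so the measured
    # cycles are less than computation + stalls (identity from slide 13).
    overlap = computation + sum(stalls.values()) - measured["total_cycles"]
    return stalls, computation, overlap

if __name__ == "__main__":
    measured = {
        "total_cycles": 2_500_000, "instructions": 3_000_000,
        "resource_stall_cycles": 400_000, "l1i_stall_cycles": 600_000,
        "l1d_misses": 20_000, "l2_misses": 10_000,
        "branch_mispredictions": 15_000,
    }
    penalties = {"l1d": 4, "memory": 70, "mispredict": 15, "expected_cpi": 0.33}
    print(breakdown(measured, penalties))
```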

16 Microbenchmarks vs. TPC  [Charts for System B and System D: clock cycles for the sequential-scan microbenchmark (mbench seq) vs. TPC-D and the index microbenchmark (mbench idx) vs. TPC-C, broken into Computation, Memory, Branch misprediction, and Resource stalls.] High CPI compared to integer workloads. Sequential scan tracks TPC-D; 2ary index selection tracks TPC-C 2000 Anastassia Ailamaki Ph.D. Defense 16

17 Execution Time Breakdown (%)  [Charts: % execution time (0-100%) for DBMSs A-D on a 10% sequential scan, a 10% 2ary index selection (B-D only), and a join (no index), broken into Computation, Memory, Branch mispredictions, and Resource stalls.] Stalls take at least 50% of time. Memory stalls are the major bottleneck 2000 Anastassia Ailamaki Ph.D. Defense 17

18 Memory Stalls Breakdown (%)  [Charts: memory stall time (0-100%) for DBMSs A-D on a 10% sequential scan, a 10% 2ary index selection (B-D only), and a join (no index), broken into L1 Data, L2 Data, L1 Instruction, and L2 Instruction stalls.] L1 instruction and L2 data stalls dominate. Different memory bottlenecks across DBMSs and queries 2000 Anastassia Ailamaki Ph.D. Defense 18

19 Summary of Analysis We can use microbenchmarks instead of TPC Execution time breakdown shows trends Memory access is a major bottleneck Increasing memory-processor performance gap Deeper memory hierarchies expected L2 cache data misses L2 grows (8MB), but will be slower Stalls due to L1 I-cache misses L1 I-cache not likely to grow as much as L2 We need to address every reason for stalls 2000 Anastassia Ailamaki Ph.D. Defense 19

20 Addressing Bottlenecks  Memory: the D-cache (D) is the DBMS's job (improve locality); the I-cache (I) falls to DBMS + Compiler. Branch Mispredictions (B): Compiler + Hardware. Hardware Resources (R): Hardware. Data cache: a clear responsibility of the DBMS 2000 Anastassia Ailamaki Ph.D. Defense 20

21 Outline Introduction PART I: Where Does Time Go? PART II: Partition Attributes Across The current scheme: Slotted pages Partition Attributes Across (PAX) Performance Results PART III: Towards a DSS-centric h/w design Conclusions 2000 Anastassia Ailamaki Ph.D. Defense 21

22 The Data Placement Tradeoff Slotted Pages: Used by all commercial DBMSs Store table records sequentially Intra-record locality (attributes of record r together) but pollutes cache Inspiration: Vertical partitioning [Copeland 85] Store n-attribute table as n single-attribute tables Problem: High record reconstruction cost Partition Attributes Across (PAX) Have the cake and eat it, too! PAX: Inter-record locality, low reconstruction cost 2000 Anastassia Ailamaki Ph.D. Defense 22

23 Current Scheme: Slotted Pages  Formal name: NSM (N-ary Storage Model).  [Diagram: relation R with attributes RID, SSN, Name, Age (e.g., RID 1, SSN 1237, Name Jane, Age 30; other rows include John 45, Jim 20, Susan, Leon, Dan 37) laid out on a page: PAGE HEADER, then a record header (RH) followed by the record's values, repeated for each record.] Records are stored sequentially. Offsets to the start of each record are kept at the end of the page 2000 Anastassia Ailamaki Ph.D. Defense 23

24 Current Scheme: Slotted Pages  [Record layout: HEADER (null bitmap, record length, etc., plus offsets to variable-length fields), then FIXED-LENGTH VALUES, then VARIABLE-LENGTH VALUES.] All attributes of a record are stored together 2000 Anastassia Ailamaki Ph.D. Defense 24
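As a concrete picture of the NSM layout just described, the following toy Python class stores records front-to-back and keeps the slot array of record offsets at the end of the page. The 8 KB page size, 16-byte header, and 2-byte slot entries are assumptions made for illustration; this is not Shore's or any commercial engine's actual page code.

```python
# Minimal sketch of an NSM (slotted) page, as described on slides 23-24.
import struct

class SlottedPage:
    def __init__(self, size=8192):
        self.data = bytearray(size)
        self.size = size
        self.slots = []          # offsets to the start of each record (slot array at page end)
        self.free_start = 16     # bytes reserved for the page header (assumed)

    def free_space(self):
        # the slot array grows backwards from the end, 2 bytes per slot
        return self.size - 2 * (len(self.slots) + 1) - self.free_start

    def insert(self, record_bytes):
        """Append a record sequentially; return its slot number, or None if the page is full."""
        if len(record_bytes) > self.free_space():
            return None
        offset = self.free_start
        self.data[offset:offset + len(record_bytes)] = record_bytes
        self.free_start += len(record_bytes)
        self.slots.append(offset)
        # write the record's offset into the slot array at the end of the page
        struct.pack_into("<H", self.data, self.size - 2 * len(self.slots), offset)
        return len(self.slots) - 1

# A record packs all attributes of one tuple together, e.g. (SSN, Name, Age):
page = SlottedPage()
page.insert(struct.pack("<I8sI", 1237, b"Jane".ljust(8), 30))
```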

25 NSM Cache Behavior  select name from R where age > 40  [Diagram: the NSM page in MAIN MEMORY (PAGE HEADER, RH Jane 30, RH John 45, RH Jim 20, RH Susan, Leon, ...) is pulled into the CACHE as blocks 1-4, each block mixing names, ages, and record headers.] NSM pollutes the cache and wastes bandwidth 2000 Anastassia Ailamaki Ph.D. Defense 25

26 Partition Attributes Across (PAX)  [Diagram: an NSM PAGE (PAGE HEADER, then interleaved records RH Jane 30, RH John 45, RH Jim 20, RH Susan 52) beside a PAX PAGE (PAGE HEADER, then a minipage holding the names Jane, John, Jim, Susan and a minipage holding the ages).] Partition data within the page for spatial locality 2000 Anastassia Ailamaki Ph.D. Defense 26

27 PAX: Mapping to Cache  select name from R where age > 40  [Diagram: from the PAX page in MAIN MEMORY, only the needed minipage values (Jane, John, Jim, Suzan) are brought into the CACHE as block 1.] Fewer cache misses, low reconstruction cost 2000 Anastassia Ailamaki Ph.D. Defense 27

28 PAX: Detailed Design  [Diagram: the Page Header stores the pid, # of attributes, attribute sizes, free space, and # of records, plus a fixed/variable flag (f/v) per attribute. Fixed-length attributes live in F-minipages (values such as Jane, John, followed by presence bits); variable-length attributes live in V-minipages (values followed by v-offsets).] 2000 Anastassia Ailamaki Ph.D. Defense 28
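A minimal sketch of the PAX idea, under the same caveat as before: the class below keeps one minipage per attribute with presence bits, so a predicate or projection on one attribute touches only that minipage. The header fields, sizes, and class layout are illustrative assumptions, not the thesis implementation.

```python
# Toy PAX page following slide 28: one minipage per attribute, presence bits
# for fixed-length (F) minipages. Not Shore's actual data structures.

class PAXPage:
    def __init__(self, schema, page_size=8192):
        # schema: list of (attribute_name, fixed_size_in_bytes)
        self.schema = schema
        self.page_size = page_size
        self.n_records = 0
        self.minipages = {name: {"values": [], "presence": []} for name, _ in schema}

    def used_bytes(self):
        header = 32  # assumed: pid, # attributes, attribute sizes, free space, # records
        per_record = sum(size for _, size in self.schema) + len(self.schema) / 8.0
        return header + self.n_records * per_record

    def insert(self, record):
        """Place each attribute value into its own minipage; fail if the page is full."""
        if self.used_bytes() + sum(size for _, size in self.schema) > self.page_size:
            return False
        for (name, _), value in zip(self.schema, record):
            mp = self.minipages[name]
            mp["values"].append(value)
            mp["presence"].append(value is not None)
        self.n_records += 1
        return True

    def scan_column(self, name):
        """A predicate over one attribute reads only that minipage (spatial locality)."""
        return self.minipages[name]["values"]

page = PAXPage([("SSN", 4), ("Name", 8), ("Age", 4)])
page.insert((1237, "Jane", 30))
ages = page.scan_column("Age")   # only the Age minipage is touched
```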

29 Basic Evaluation: Methodology  Main-memory resident R. Query: select avg(ai) from R where aj >= Lo and aj <= Hi. PII Xeon running Windows NT 4; 16KB L1-I, 16KB L1-D, 512 KB L2, 512 MB RAM. Used processor counters. Implemented schemes on Shore Storage Manager; similar behavior to commercial Database Systems 2000 Anastassia Ailamaki Ph.D. Defense 29

30 Why Use Shore?  Range selection query (no index) on 4 commercial DBMSs + Shore; breakdown of execution & memory delays.  [Charts: % execution time (Computation, Memory, Branch mispr., Resource) and memory stall time % (L1 Data, L2 Data, L1 Instruction, L2 Instruction) for A, B, C, D, and Shore.] We can use Shore to evaluate DSS workload behavior 2000 Anastassia Ailamaki Ph.D. Defense 30

31 Effect on Accessing Cache Data  [Charts: cache data stalls (stall cycles/record, split into L1 Data and L2 Data) for the NSM and PAX page layouts, and sensitivity to selectivity (1%, 5%, 10%, 20%, 50%, 100%) of NSM vs. PAX L2 stall cycles/record.] PAX incurs 70% less data cache penalty than NSM. PAX reduces cache misses at both L1 and L2. Selectivity doesn't matter for PAX data stalls 2000 Anastassia Ailamaki Ph.D. Defense 31

32 Time and Sensitivity Analysis  [Charts: execution time breakdown (clock cycles per record: Computation, Memory, Branch Mispred., Resource) for the NSM and PAX page layouts, and elapsed time (sec) vs. # of attributes in the record for NSM and PAX.] PAX: 75% less memory penalty than NSM (10% of total time). Execution times converge as the number of attributes increases 2000 Anastassia Ailamaki Ph.D. Defense 32

33 Sensitivity Analysis (2)  Elapsed time sensitivity to projectivity / # of predicates; range selection queries, 1% selectivity.  [Charts: elapsed time (seconds) vs. projectivity and vs. # of attributes in the predicate, for NSM and PAX.] PAX and NSM times converge as the query covers the entire tuple 2000 Anastassia Ailamaki Ph.D. Defense 33

34 Evaluation Using a DSS Benchmark Loaded 100M, 200M, and 500M TPC-H DBs Ran Queries: Range Selections w/ variable parameters (RS) TPC-H Q1 and Q6 sequential scans lots of aggregates (sum, avg, count) grouping/ordering of results TPC-H Q12 and Q14 (Adaptive Hybrid) Hash Join complex where clause, conditional aggregates PII Xeon running Windows NT 4 Used processor counters 2000 Anastassia Ailamaki Ph.D. Defense 34

35 Insertions with PAX  Bulk-load algorithm: estimate average field sizes; start inserting records; if a record doesn't fit, reorganize the page (move minipage boundaries) and adjust the average field sizes. ...% of reorganizations accommodate a single record.  [Chart: elapsed bulk load time (seconds) for NSM, PAX, and DSM at 100 MB, 200 MB, and 500 MB database sizes.] PAX loads a TPC-H database 2-26% slower than NSM 2000 Anastassia Ailamaki Ph.D. Defense 35
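The bulk-load policy on this slide can be sketched on a toy page model: minipage boundaries are sized from estimated average field sizes, and when a record does not fit, the boundaries are recomputed from the sizes observed so far. The page size, the proportional-split policy, and all helper names below are assumptions made for illustration, not the thesis code.

```python
# Toy model of the PAX bulk-load / insertion policy from slide 35.
PAGE_SIZE, HEADER = 8192, 32

class ToyPAXPage:
    def __init__(self, est_field_sizes):
        self.minipages = [[] for _ in est_field_sizes]
        self.set_boundaries(est_field_sizes)

    def set_boundaries(self, field_sizes):
        # split the page's payload bytes across minipages proportionally to the
        # (estimated or observed) average field sizes
        total = sum(field_sizes)
        self.capacity = [(PAGE_SIZE - HEADER) * s / total for s in field_sizes]

    def fits(self, record):
        return all(sum(map(len, mp)) + len(v) <= cap
                   for mp, v, cap in zip(self.minipages, record, self.capacity))

    def observed_field_sizes(self):
        n = max(len(self.minipages[0]), 1)
        return [max(sum(map(len, mp)) / n, 1.0) for mp in self.minipages]

    def insert(self, record):
        if not self.fits(record):
            # reorganize: move minipage boundaries using observed averages, then retry
            self.set_boundaries(self.observed_field_sizes())
            if not self.fits(record):
                return False
        for mp, v in zip(self.minipages, record):
            mp.append(v)
        return True

def bulk_load(records, est_field_sizes):
    """Assumes each individual record fits in an empty page."""
    pages = [ToyPAXPage(est_field_sizes)]
    for rec in records:
        if not pages[-1].insert(rec):
            pages.append(ToyPAXPage(est_field_sizes))
            pages[-1].insert(rec)
    return pages

pages = bulk_load([(b"1237", b"Jane, longer than estimated", b"30")] * 500,
                  est_field_sizes=[4, 8, 4])
print(len(pages), "pages")
```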

36 Elapsed Execution Time  [Charts: elapsed time (sec) for TPC-H queries Q1, Q6, Q12, Q14 under NSM and PAX on the TPC-H 100M, 200M, and 500M databases.] PAX improves performance up to 42% even with I/O 2000 Anastassia Ailamaki Ph.D. Defense 36

37 Speedup  [Chart: PAX/NSM speedup on PII/NT (0-45%) for queries RS, Q1, Q6, Q12, Q14 at 100 MB, 200 MB, and 500 MB.] PAX improves performance up to 42% even with I/O. Speedup differs across DB sizes 2000 Anastassia Ailamaki Ph.D. Defense 37

38 PAX: Summary Advantages High data cache performance Faster than NSM for DSS queries Orthogonal to other storage decisions Does not affect I/O performance Current Disadvantages Complex free space mgmt with variable length attributes Complicates update algorithm PAX beneficial for read-mostly workloads (e.g., DSS) (update-intensive workloads in future work) 2000 Anastassia Ailamaki Ph.D. Defense 38

39 Outline Introduction PART I: Where Does Time Go? PART II: Partition Attributes Across PART III: Towards DSS-Centric H/W Memory subsystem Branch prediction mechanism Processor pipeline Conclusions 2000 Anastassia Ailamaki Ph.D. Defense 39

40 Architecture Platform Differences RISC or CISC Instruction set Microarchitecture Pipeline Speculation (out-of-order, multiple issue) Branch prediction Memory subsystem Cache size, associativity Block size, subblocking Inclusion Which design looks beneficial for DSS workloads? 2000 Anastassia Ailamaki Ph.D. Defense 40

41 Experimental Setup  Used four machines: Sun UltraSparc US-II and US-IIi (Solaris 2.6/2.7), Intel P6 PII Xeon (Linux v2.2), DEC Alpha 21164A (OSF1 v4.0). Architecture and processor microarchitecture (US-II | US-IIi | PII Xeon | Alpha): speed 296 MHz | 300 MHz | 400 MHz | 532 MHz; out of order? no | no | yes | no; instruction set RISC | RISC | CISC | RISC 2000 Anastassia Ailamaki Ph.D. Defense 41

42 Cache Hierarchies  (columns: UltraSparc US-II | UltraSparc US-IIi | PII Xeon | Alpha)
L1 D: size, assoc ...KB, DM | 16KB, DM | 16KB, 2-way | 8KB, DM; block/subblock 32/16 | 32/16 | 32/32 | 32/32; inclusion by L2: yes | yes | no | yes
L1 I: size, assoc 16KB, 2-way | 16KB, 2-way | 16KB, 4-way | 8KB, DM; block/subblock 32/32 | 32/32 | 32/32 | 32/16; inclusion by L2: yes | yes | no | no
L2: size, assoc 2 MB, DM | 512KB, DM | 512KB, 4-way | 96KB, 3-way; block/subblock 64/64 | 64/64 | 32/32 | 64/32; inclusion by L3: N/A | N/A | N/A | yes
L3 (Alpha only): size, assoc 4 MB, DM; block/subblock 64/...
2000 Anastassia Ailamaki Ph.D. Defense 42

43 Methodology  Compiled Shore with gcc (Alpha version not optimized). Ran DSS workload: Range Selections w/ variable parameters (RS) and TPC-H Q1, Q6, Q12, Q14. Used processor counters: Sun: run-pic (by Glenn Ammons, modified); PII: PAPI (public-domain counter library); Alpha: DCPI (sampling software by Compaq) 2000 Anastassia Ailamaki Ph.D. Defense 43

44 Superscalar Processor Capability  Alpha issues at most 2 instructions/cycle (max=4); >60% of the time the Xeon retires 0 or 1 instruction (max=3).  [Charts: Alpha issue breakdown (% of total clock cycles: pipeline dry, 0-issue, 1-issue, 2-issue) for RS, Q1, Q6, Q12, Q14; PII Xeon (NT) retire breakdown (0-, 1-, 2-, 3-retired) for DBMSs A-D and Shore.] The current issue/retire width remains unexploited 2000 Anastassia Ailamaki Ph.D. Defense 44

45 Clock-per-Record Breakdown  [Charts: clock cycles/record (%) for RS, Q1, Q6, Q12, Q14 on UltraSparc-II, PII Xeon, and A21164, broken into D-stalls, I-stalls, Br. Mispr., and Other.] Memory + branch misprediction stalls = 35-60% of time. Data accesses: major memory bottleneck (esp. Q12, Q14) 2000 Anastassia Ailamaki Ph.D. Defense 45

46 Branch Prediction  Branch penalty = frequency * misprediction rate * penalty. Frequency is typically 20-25%. In-order processors => lower penalty; low misprediction accuracy may break it (e.g., UltraSparc).  [Table: branch frequency and branch misprediction rate, each for RS/Q1/Q6 and for Q12/Q14, plus branch penalty (cycles), for PII Xeon and Alpha; legible values include frequencies of 22%, 3.5%, 9%, 15%, and 18%, misprediction rates of 1% and 6%, and penalties of 15 cycles (PII Xeon) and 5 cycles (Alpha).] High-accuracy predictors 2000 Anastassia Ailamaki Ph.D. Defense 46
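Slide 46's formula in executable form; the example inputs combine numbers of the same magnitude as those on the slide, but the particular combination is hypothetical.

```python
# Branch stall cycles per instruction and the share of total execution time they
# account for, per the formula: penalty = frequency * misprediction rate * penalty.

def branch_penalty_share(branch_freq, mispredict_rate, penalty_cycles, cpi):
    stall_per_instr = branch_freq * mispredict_rate * penalty_cycles
    return stall_per_instr, stall_per_instr / cpi

# e.g. 22% branches, 6% mispredicted, 15-cycle penalty, measured CPI of 2.0:
stall, share = branch_penalty_share(0.22, 0.06, 15, 2.0)
print(f"{stall:.2f} stall cycles/instruction = {share:.0%} of execution time")
```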

47 Cache Inclusion  [Chart: UltraSparc II/IIi cache comparison (RS), normalized units for elapsed time, L1D misses, L1I misses, L2D misses, and L2I misses on UltraSparc-IIi vs. UltraSparc-II.] Small caches should not maintain inclusion 2000 Anastassia Ailamaki Ph.D. Defense 47

48 Cache Block Size  [Chart: PAX/NSM miss rate improvement (0-120%), i.e., PAX savings on L2/L3 data miss rates, for PII Xeon (L2), US-II (L2), and A21164 (L3), on queries RS, Q1, Q6, Q12, Q14.] Larger cache line = lower miss rates 2000 Anastassia Ailamaki Ph.D. Defense 48
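A back-of-the-envelope model (my assumption, not taken from the slide) of why larger blocks help a PAX column scan: values of one attribute are stored contiguously, so a sequential scan misses roughly once per cache block, and misses per record shrink as the block grows. Real miss rates also depend on prefetching and associativity, which this ignores.

```python
# Simplified miss-rate model for a sequential scan over one fixed-width attribute
# stored contiguously (PAX-style): one miss fetches block_size / attr_size values.

def misses_per_record(attr_size_bytes, block_size_bytes):
    return attr_size_bytes / block_size_bytes

for block in (32, 64, 128):
    print(block, misses_per_record(4, block))   # 4-byte attribute: 0.125, 0.0625, 0.03125
```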

49 Sub-Blocking / Associativity  UltraSparc: direct-mapped, subblocking (32/16); Xeon: 2-way, no subblocking (32/32).  [Charts: L1 Data Cache misses per record (RS) for US-II, US-IIi, and PII Xeon vs. projectivity and vs. selectivity (2%, 10%, 20%, 50%, 100%).] High associativity, no sub-blocking 2000 Anastassia Ailamaki Ph.D. Defense 49

50 PAX vs. NSM across platforms  [Chart: PAX/NSM speedup on Unix (100MB database), 0-45%, for PII Xeon, UltraSparc-II, and A21164 on queries RS, Q1, Q6, Q12, Q14.] PAX improves all queries 2000 Anastassia Ailamaki Ph.D. Defense 50

51 Summary  Memory hierarchy: non-blocking caches; >64-byte block, no sub-blocking; generous-sized L1-I (128K) and L2 (>2MB); a tiny, fast L1/L2 with a large, slow L3 won't add much; high associativity (2-4); no inclusion (at least for instructions). Processor pipeline: issue width is fine, out-of-order overlaps stall time; execution engine to sustain >1 load/store instr.; high-accuracy branch prediction, provided that implementation cost remains stable 2000 Anastassia Ailamaki Ph.D. Defense 51

52 Conclusions Found trends in behavior of commercial DBMSs using an analytic framework to model execution time Identified bottlenecks among HW components Main memory access is the new DB bottleneck Major showstoppers: L1 Instruction + L2 Data Proposed new design for cache performance Increase spatial locality using novel data placement 70% less data-related memory access delays Significant improvement on sequential scans Evaluated several hardware parameters Suggested DSS-centric processor and memory design 2000 Anastassia Ailamaki Ph.D. Defense 52
