Anastasia Ailamaki. Performance and energy analysis using transactional workloads

Size: px
Start display at page:

Download "Anastasia Ailamaki. Performance and energy analysis using transactional workloads"

Transcription

1 Performance and energy analysis using transactional workloads Anastasia Ailamaki EPFL and RAW Labs SA students: Danica Porobic, Utku Sirin, and Pinar Tozun

2 Online Transaction Processing $2B+ industry Characteristics: Has many concurrent requests Touch small part of whole data Need high & predictable performance Primary application for databases

3 Hardware keeps providing new forms of parallelism How s the utilization? Hardware OLTP runs on Core ILP SMT multisocket multicores heterogeneous many-cores

4 Utilizing modern processors Execution Cycles Breakdown 1% 9% 8% 7% 6% 5% 4% 3% 2% 1% % STALLED BUSY Shore-MT DBMS D VoltDB HyPer DBMS M TPC-C 1GB data 1 worker thread Intel Ivy Bridge Processor stalled most of the time

5 Scaling up OLTP on multisockets Throughput Number of sockets Multisocket servers severely under-utilized

6 Why care about power? Monthly datacenter costs 18% 13% 8% 4% 57% 3% power-related Dynamic fraction increasing [J. R. Hamilton] Servers Networking Equipment Power Distribution & Cooling Power Other Power Energy proportionality Today Ideal Utilization Energy efficiency as important as performance

7 Why is my system under-utilizing hardware? Why isn t my system faster on new hardware? Are new processors more energy-efficient?

8

9 Analyzing performance and energy Macrobenchmarks or Microbenchmarks? Execution time breakdowns Measuring energy efficiency

10 Analyzing performance and energy Macrobenchmarks or Microbenchmarks? Execution time breakdowns Measuring energy efficiency

11 Instructions per Cycle Intel Xeon X566 maximum Instructions per Cycle Utilization (microarchitecture level) Sun Niagara T2 maximum [EDBT13] OLTP utilizes lean cores better TPC-B TPC-C TPC-E TPC-B TPC-C TPC-E TPC-E has higher IPC

12 Macrobenchmark: Execution Cycles & Stalls Execution Cycles Breakdown Intel Xeon X566 1% 8% 6% 4% 2% % STALLED BUSY TPC-B TPC-C TPC-E Core Stalls 1% 8% 6% 4% 2% % INSTRUCTION REST TPC-B TPC-C TPC-E Over 7% of time goes to stalls Instruction stalls are the main problem [EDBT13]

13 Execution cycles breakdown Utilization across systems 1% 8% 6% 4% 2% % STALLED BUSY Shore-MT DBMS D VoltDB HyPer DBMS M on-disk in-memory [SIGMOD16] TPC-C, 1GB 1 worker thread Intel Ivy Bridge Even in-memory systems stall > 6% time

14 [SIGMOD16] Microbenchmarks for what-if analysis 4 3 1GB 1 worker thread Intel Ivy Bridge IPC Shore-MT DBMS D VoltDB HyPer DBMS M Number of rows read Lower data locality à low IPC for some systems

15 OLTP on hardware islands 21 cycles 72 cycles 237 cycles [PVLDB12] scan Core Core Core Core Core Core Core Core L1 L1 L1 L1 L1 L1 L1 L1 L2 L2 L2 L2 L2 L2 L2 L2 Column L3 Memory controller Inter-socket links L3 Memory controller Inter-socket links Column Socket Socket 1 How measure impact? 2x12-core Intel Xeon E5-265L v3, 24x16GiB DDR3

16 Partition sensitive microbenchmark Single site version probe/update N rows from the local site [PVLDB12] Multisite version probe/update 1 row from the local site probe/update N-1 rows uniformly from any site sites may reside on the same instance

17 OLTP deployment configurations [PVLDB12] Shared-everything Island shared-nothing Shared-nothing

18 Multisite transactions: read only [PVLDB12] Throughput (KTps) Shared-everything Shared-nothing Island shared-nothing 4socket x 6cores 1 rows accessed % multisite transactions More instances -> faster performance degradation

19 Multisite transactions: updates [PVLDB12] Throughput (KTps) Shared-everything Shared-nothing Island shared-nothing 4socket x 6cores 1 rows accessed % multisite transactions Update distributed transactions are more expensive

20 Analyzing performance and energy Macrobenchmarks or Microbenchmarks? Execution time breakdowns Measuring energy efficiency

21 My first toy: PII Xeon [VLDB99] INSTRUCTION POOL FETCH/ DECODE UNIT DISPATCH EXECUTE UNIT RETIRE UNIT L1 I-CACHE L1 D-CACHE L2 CACHE MAIN MEMORY ÊBranch prediction, non-blocking caches, out-of-order

22 Where Does Time Go? [VLDB99] Computation Stalls Cache misses Branch mispredictions Other execution pipeline stalls Stall time and computation overlap Time = T Computation +T Memory +T Branch +T Resource -T Overlap

23 Setup and Methodology [VLDB99] Range Selection (sequential, indexed) select avg (a3) from R where a2 > Lo and a2 < Hi Equijoin (sequential) select avg (a3) from R, S where R.a2 = S.a1 Four commercial DBMSs: A, B, C, D 64 PII Xeon/MT running Windows NT 4 Used PII counters Correctness: Measured & computed CPI

24 Two very useful breakdowns [VLDB99] processor stalled >5% of time most stalls: L1I and L2D

25 Adapted formula [SIGMOD16] # branch-misp x penalty T " + T $%& = T ( + T ) + T * + T + T ) = T &,- + T &,. + T &/- + T &/. +T &- + T &. T + = T 12 + T.34 + T )-5( + T.6&* + T -6&*

26 Intel Xeon X566 32KB L1-I & 32 KB L1-D Misses per k-instructions Cache Misses today LLC L2D L1D L2I L1I 1 TPC-B TPC-C TPC-E Misses per k-instructions L1-I misses dominate TPC-E has lower data miss ratio [EDBT13] Sun Niagara T2 16KB L1-I & 8KB L1-D TPC-B TPC-C TPC-E

27 More scans à Increased page reuse TPC-E s lower miss ratio 15 4 Average per transaction #Records Accessed #Pages Accessed Heap Index TPC-B TPC-C TPC-E TPC-B TPC-C TPC-E

28 Breaking down clock cycles [SIGMOD16] Stall cycles breakdown in 1 instructions LLC D L2D L1D LLC I L2I L1I L1I LLC D TPC-C, 1GB 1 worker thread Intel Ivy Bridge Shore-MT DBMS D VoltDB HyPer DBMS M L1I or LLC D stalls dominate

29 Where do L1I stalls come from? Stall cycles breakdown in 1 instructions L1I L2I L2I L1I Number of rows read DBMS D Number of rows read DBMS M Read N rows, 1GB 1 worker thread Intel Ivy Bridge Code outside storage mgr à high L1I misses

30 Analyzing performance and energy Macrobenchmarks or Microbenchmarks? Execution time breakdowns Measuring energy efficiency

31 ARM server-grade processor [DAMON16] lean ARM Fat 4-way fetch, decode, rename, dispatch Issue Branch Integer Integer 1 Store FP/SIMD FP/SIMD 1 Load OLTP on ARM: performance & power?

32 Xeon vs. ARM [DAMON16] Processor Intel Xeon ARM Cortex-A57 # Sockets 2 (one is active) 1 # Cores/socket 8 8 Issue width 4 4 Clock speed 2.GHz 2.Ghz L1I / L1D 32KB / 32KB 32KB / 32KB L2 256KB 256KB L3 (shared) 2MB 8MB RAM 256GB 16GB

33 ARM is a promising alternative ARM 15 Xeon 15 15x 15 Silo, TPC-C, 5GB Single thread [DAMON16] Normalized throughput x Normalized power 1 5 Normalized energy efficiency 1 5 6x

34 Normalized throughput ARM achieves energy proportionality Number of threads ARM Xeon Normalized energy efficiency 1.5 Normalized power Number of threads Silo, TPC-C, 5GB Number of threads [DAMON16]

35 Normalized latency ARM is less suitable for low latency Shore-MT 2x ARM Normalized latency Xeon 3x Silo 5x TPC-C, 5GB All cores active 11x [DAMON16] 5th 99th 99.9th 5th 99th 99.9th

36 Lessons learned Macrobenchmarks show big picture Microbenchmarksreveal details Breakdowns correlate numbers Sensitivity analysis highlights trends Right methodology is essential for understanding behavior

37 References [DAMON16] U. Sirin, R. Appuswamy, and A. Ailamaki: OLTP on a server-grade ARM. [EDBT13] P. Tözün, I. Pandis, C. Kaynak, D. Jevdjic, and A. Ailamaki: From A to E: Analyzing TPC s OLTP Benchmarks The obsolete, the ubiquitous, the unexplored. [PVLDB12] D. Porobic, I. Pandis, M. Branco, P. Tözün, and A. Ailamaki: OLTP on Hardware Islands. [SIGMOD16] U. Sirin, P. Tözün, D. Porobic, and A. Ailamaki: Micro-architectural Analysis of In-memory OLTP [VLDB99] A. Ailamaki, D. DeWitt, M. Hill, and D. Wood: DBMSs On A Modern Processor: Where Does Time Go?

Walking Four Machines by the Shore

Walking Four Machines by the Shore Walking Four Machines by the Shore Anastassia Ailamaki www.cs.cmu.edu/~natassa with Mark Hill and David DeWitt University of Wisconsin - Madison Workloads on Modern Platforms Cycles per instruction 3.0

More information

PLP: Page Latch free

PLP: Page Latch free PLP: Page Latch free Shared everything OLTP Ippokratis Pandis Pınar Tözün Ryan Johnson Anastasia Ailamaki IBM Almaden Research Center École Polytechnique Fédérale de Lausanne University of Toronto OLTP

More information

Architecture-Conscious Database Systems

Architecture-Conscious Database Systems Architecture-Conscious Database Systems Anastassia Ailamaki Ph.D. Examination November 30, 2000 A DBMS on a 1980 Computer DBMS Execution PROCESSOR 10 cycles/instruction DBMS Data and Instructions 6 cycles

More information

Bridging the Processor/Memory Performance Gap in Database Applications

Bridging the Processor/Memory Performance Gap in Database Applications Bridging the Processor/Memory Performance Gap in Database Applications Anastassia Ailamaki Carnegie Mellon http://www.cs.cmu.edu/~natassa Memory Hierarchies PROCESSOR EXECUTION PIPELINE L1 I-CACHE L1 D-CACHE

More information

Weaving Relations for Cache Performance

Weaving Relations for Cache Performance Weaving Relations for Cache Performance Anastassia Ailamaki Carnegie Mellon Computer Platforms in 198 Execution PROCESSOR 1 cycles/instruction Data and Instructions cycles

More information

Scalable Transaction Processing on Multicores

Scalable Transaction Processing on Multicores Scalable Transaction Processing on Multicores [Shore-MT & DORA] Ippokratis Pandis Ryan Johnson, Nikos Hardavellas, Anastasia Ailamaki CMU & EPFL HPTS -26 Oct 2009 Multicore cluster on a chip Then Now CPU

More information

The Case for Heterogeneous HTAP

The Case for Heterogeneous HTAP The Case for Heterogeneous HTAP Raja Appuswamy, Manos Karpathiotakis, Danica Porobic, and Anastasia Ailamaki Data-Intensive Applications and Systems Lab EPFL 1 HTAP the contract with the hardware Hybrid

More information

OLTP on Hardware Islands

OLTP on Hardware Islands OLTP on Hardware Islands Danica Porobic Ippokratis Pandis Miguel Branco Pınar Tözün Anastasia Ailamaki École Polytechnique Fédérale de Lausanne Lausanne, VD, Switzerland {danica.porobic, miguel.branco,

More information

Weaving Relations for Cache Performance

Weaving Relations for Cache Performance Weaving Relations for Cache Performance Anastassia Ailamaki Carnegie Mellon David DeWitt, Mark Hill, and Marios Skounakis University of Wisconsin-Madison Memory Hierarchies PROCESSOR EXECUTION PIPELINE

More information

DBMSs on a Modern Processor: Where Does Time Go? Revisited

DBMSs on a Modern Processor: Where Does Time Go? Revisited DBMSs on a Modern Processor: Where Does Time Go? Revisited Matthew Becker Information Systems Carnegie Mellon University mbecker+@cmu.edu Naju Mancheril School of Computer Science Carnegie Mellon University

More information

Weaving Relations for Cache Performance

Weaving Relations for Cache Performance VLDB 2001, Rome, Italy Best Paper Award Weaving Relations for Cache Performance Anastassia Ailamaki David J. DeWitt Mark D. Hill Marios Skounakis Presented by: Ippokratis Pandis Bottleneck in DBMSs Processor

More information

Lecture 9: More ILP. Today: limits of ILP, case studies, boosting ILP (Sections )

Lecture 9: More ILP. Today: limits of ILP, case studies, boosting ILP (Sections ) Lecture 9: More ILP Today: limits of ILP, case studies, boosting ILP (Sections 3.8-3.14) 1 ILP Limits The perfect processor: Infinite registers (no WAW or WAR hazards) Perfect branch direction and target

More information

In embedded systems there is a trade off between performance and power consumption. Using ILP saves power and leads to DECREASING clock frequency.

In embedded systems there is a trade off between performance and power consumption. Using ILP saves power and leads to DECREASING clock frequency. Lesson 1 Course Notes Review of Computer Architecture Embedded Systems ideal: low power, low cost, high performance Overview of VLIW and ILP What is ILP? It can be seen in: Superscalar In Order Processors

More information

Chapter 03. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1

Chapter 03. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1 Chapter 03 Authors: John Hennessy & David Patterson Copyright 2011, Elsevier Inc. All rights Reserved. 1 Figure 3.3 Comparison of 2-bit predictors. A noncorrelating predictor for 4096 bits is first, followed

More information

OLTP On A Server-grade ARM: Power, Throughput and Latency Comparison

OLTP On A Server-grade ARM: Power, Throughput and Latency Comparison OLTP On A Server-grade ARM: Power, Throughput and Latency Comparison Utku Sirin utku.sirin@epfl.ch Raja Appuswamy raja.appuswamy@epfl.ch École Polytechnique Fédérale de Lausanne RAW Labs SA Anastasia Ailamaki

More information

Multithreaded Processors. Department of Electrical Engineering Stanford University

Multithreaded Processors. Department of Electrical Engineering Stanford University Lecture 12: Multithreaded Processors Department of Electrical Engineering Stanford University http://eeclass.stanford.edu/ee382a Lecture 12-1 The Big Picture Previous lectures: Core design for single-thread

More information

2

2 1 2 3 4 5 6 For more information, see http://www.intel.com/content/www/us/en/processors/core/core-processorfamily.html 7 8 The logic for identifying issues on Intel Microarchitecture Codename Ivy Bridge

More information

Multithreading: Exploiting Thread-Level Parallelism within a Processor

Multithreading: Exploiting Thread-Level Parallelism within a Processor Multithreading: Exploiting Thread-Level Parallelism within a Processor Instruction-Level Parallelism (ILP): What we ve seen so far Wrap-up on multiple issue machines Beyond ILP Multithreading Advanced

More information

Control Hazards. Branch Prediction

Control Hazards. Branch Prediction Control Hazards The nub of the problem: In what pipeline stage does the processor fetch the next instruction? If that instruction is a conditional branch, when does the processor know whether the conditional

More information

Lecture 8: Instruction Fetch, ILP Limits. Today: advanced branch prediction, limits of ILP (Sections , )

Lecture 8: Instruction Fetch, ILP Limits. Today: advanced branch prediction, limits of ILP (Sections , ) Lecture 8: Instruction Fetch, ILP Limits Today: advanced branch prediction, limits of ILP (Sections 3.4-3.5, 3.8-3.14) 1 1-Bit Prediction For each branch, keep track of what happened last time and use

More information

CPU Architecture Overview. Varun Sampath CIS 565 Spring 2012

CPU Architecture Overview. Varun Sampath CIS 565 Spring 2012 CPU Architecture Overview Varun Sampath CIS 565 Spring 2012 Objectives Performance tricks of a modern CPU Pipelining Branch Prediction Superscalar Out-of-Order (OoO) Execution Memory Hierarchy Vector Operations

More information

Advanced Processor Architecture. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University

Advanced Processor Architecture. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University Advanced Processor Architecture Jin-Soo Kim (jinsookim@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu Modern Microprocessors More than just GHz CPU Clock Speed SPECint2000

More information

From A to E: Analyzing TPC s OLTP Benchmarks

From A to E: Analyzing TPC s OLTP Benchmarks From A to E: Analyzing TPC s OLTP Benchmarks The obsolete, the ubiquitous, the unexplored Pınar Tözün Ippokratis Pandis Cansu Kaynak Djordje Jevdjic Anastasia Ailamaki École Polytechnique Fédérale de Lausanne

More information

Smooth Scan: Statistics-Oblivious Access Paths. Renata Borovica-Gajic Stratos Idreos Anastasia Ailamaki Marcin Zukowski Campbell Fraser

Smooth Scan: Statistics-Oblivious Access Paths. Renata Borovica-Gajic Stratos Idreos Anastasia Ailamaki Marcin Zukowski Campbell Fraser Smooth Scan: Statistics-Oblivious Access Paths Renata Borovica-Gajic Stratos Idreos Anastasia Ailamaki Marcin Zukowski Campbell Fraser Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10 Q11 Q12 Q13 Q14 Q16 Q18 Q19 Q21 Q22

More information

Performance of Multithreaded Chip Multiprocessors and Implications for Operating System Design

Performance of Multithreaded Chip Multiprocessors and Implications for Operating System Design Performance of Multithreaded Chip Multiprocessors and Implications for Operating System Design Based on papers by: A.Fedorova, M.Seltzer, C.Small, and D.Nussbaum Pisa November 6, 2006 Multithreaded Chip

More information

HW1 Solutions. Type Old Mix New Mix Cost CPI

HW1 Solutions. Type Old Mix New Mix Cost CPI HW1 Solutions Problem 1 TABLE 1 1. Given the parameters of Problem 6 (note that int =35% and shift=5% to fix typo in book problem), consider a strength-reducing optimization that converts multiplies by

More information

Advanced Processor Architecture

Advanced Processor Architecture Advanced Processor Architecture Jinkyu Jeong (jinkyu@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu SSE2030: Introduction to Computer Systems, Spring 2018, Jinkyu Jeong

More information

STEPS Towards Cache-Resident Transaction Processing

STEPS Towards Cache-Resident Transaction Processing STEPS Towards Cache-Resident Transaction Processing Stavros Harizopoulos joint work with Anastassia Ailamaki VLDB 2004 Carnegie ellon CPI OLTP workloads on modern CPUs 6 4 2 L2-I stalls L2-D stalls L1-I

More information

DBMSs On A Modern Processor: Where Does Time Go?

DBMSs On A Modern Processor: Where Does Time Go? DBMSs On A Modern Processor: Where Does Time Go? Anastassia Ailamaki David J. DeWitt Mark D. Hill David A. Wood University of Wisconsin-Madison Computer Science Dept. 1210 W. Dayton St. Madison, WI 53706

More information

Lecture 1: Introduction

Lecture 1: Introduction Contemporary Computer Architecture Instruction set architecture Lecture 1: Introduction CprE 581 Computer Systems Architecture, Fall 2016 Reading: Textbook, Ch. 1.1-1.7 Microarchitecture; examples: Pipeline

More information

Reference. T1 Architecture. T1 ( Niagara ) Case Study of a Multi-core, Multithreaded

Reference. T1 Architecture. T1 ( Niagara ) Case Study of a Multi-core, Multithreaded Reference Case Study of a Multi-core, Multithreaded Processor The Sun T ( Niagara ) Computer Architecture, A Quantitative Approach, Fourth Edition, by John Hennessy and David Patterson, chapter. :/C:8

More information

A Comparison of Memory Usage and CPU Utilization in Column-Based Database Architecture vs. Row-Based Database Architecture

A Comparison of Memory Usage and CPU Utilization in Column-Based Database Architecture vs. Row-Based Database Architecture A Comparison of Memory Usage and CPU Utilization in Column-Based Database Architecture vs. Row-Based Database Architecture By Gaurav Sheoran 9-Dec-08 Abstract Most of the current enterprise data-warehouses

More information

Architecture-Conscious Database Systems

Architecture-Conscious Database Systems Architecture-Conscious Database Systems 2009 VLDB Summer School Shanghai Peter Boncz (CWI) Sources Thank You! l l l l Database Architectures for New Hardware VLDB 2004 tutorial, Anastassia Ailamaki Query

More information

A Comparative Performance Evaluation of Different Application Domains on Server Processor Architectures

A Comparative Performance Evaluation of Different Application Domains on Server Processor Architectures A Comparative Performance Evaluation of Different Application Domains on Server Processor Architectures W.M. Roshan Weerasuriya and D.N. Ranasinghe University of Colombo School of Computing A Comparative

More information

Outline EEL 5764 Graduate Computer Architecture. Chapter 3 Limits to ILP and Simultaneous Multithreading. Overcoming Limits - What do we need??

Outline EEL 5764 Graduate Computer Architecture. Chapter 3 Limits to ILP and Simultaneous Multithreading. Overcoming Limits - What do we need?? Outline EEL 7 Graduate Computer Architecture Chapter 3 Limits to ILP and Simultaneous Multithreading! Limits to ILP! Thread Level Parallelism! Multithreading! Simultaneous Multithreading Ann Gordon-Ross

More information

ENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design

ENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design ENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design Professor Sherief Reda http://scale.engin.brown.edu Electrical Sciences and Computer Engineering School of Engineering Brown University

More information

Exploitation of instruction level parallelism

Exploitation of instruction level parallelism Exploitation of instruction level parallelism Computer Architecture J. Daniel García Sánchez (coordinator) David Expósito Singh Francisco Javier García Blas ARCOS Group Computer Science and Engineering

More information

COSC 6385 Computer Architecture - Thread Level Parallelism (I)

COSC 6385 Computer Architecture - Thread Level Parallelism (I) COSC 6385 Computer Architecture - Thread Level Parallelism (I) Edgar Gabriel Spring 2014 Long-term trend on the number of transistor per integrated circuit Number of transistors double every ~18 month

More information

CMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading)

CMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading) CMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading) Limits to ILP Conflicting studies of amount of ILP Benchmarks» vectorized Fortran FP vs. integer

More information

TDT 4260 lecture 7 spring semester 2015

TDT 4260 lecture 7 spring semester 2015 1 TDT 4260 lecture 7 spring semester 2015 Lasse Natvig, The CARD group Dept. of computer & information science NTNU 2 Lecture overview Repetition Superscalar processor (out-of-order) Dependencies/forwarding

More information

Profiling: Understand Your Application

Profiling: Understand Your Application Profiling: Understand Your Application Michal Merta michal.merta@vsb.cz 1st of March 2018 Agenda Hardware events based sampling Some fundamental bottlenecks Overview of profiling tools perf tools Intel

More information

Limitations of Scalar Pipelines

Limitations of Scalar Pipelines Limitations of Scalar Pipelines Superscalar Organization Modern Processor Design: Fundamentals of Superscalar Processors Scalar upper bound on throughput IPC = 1 Inefficient unified pipeline

More information

Multiple Instruction Issue. Superscalars

Multiple Instruction Issue. Superscalars Multiple Instruction Issue Multiple instructions issued each cycle better performance increase instruction throughput decrease in CPI (below 1) greater hardware complexity, potentially longer wire lengths

More information

June 2003 CMU-CS School of Computer Science. Carnegie Mellon University. Pittsburgh, PA Abstract

June 2003 CMU-CS School of Computer Science. Carnegie Mellon University. Pittsburgh, PA Abstract DBmbench: Fast and Accurate Database Workload Representation on Modern Microarchitecture Minglong Shao Anastassia Ailamaki Babak Falsafi Database Group Carnegie Mellon University Computer Architecture

More information

A Database System Performance Study with Micro Benchmarks on a Many-core System

A Database System Performance Study with Micro Benchmarks on a Many-core System DEIM Forum 2012 D6-3 A Database System Performance Study with Micro Benchmarks on a Many-core System Fang XI Takeshi MISHIMA and Haruo YOKOTA Department of Computer Science, Graduate School of Information

More information

Motivation. Banked Register File for SMT Processors. Distributed Architecture. Centralized Architecture

Motivation. Banked Register File for SMT Processors. Distributed Architecture. Centralized Architecture Motivation Banked Register File for SMT Processors Jessica H. Tseng and Krste Asanović MIT Computer Science and Artificial Intelligence Laboratory, Cambridge, MA 02139, USA BARC2004 Increasing demand on

More information

How much energy can you save with a multicore computer for web applications?

How much energy can you save with a multicore computer for web applications? How much energy can you save with a multicore computer for web applications? Peter Strazdins Computer Systems Group, Department of Computer Science, The Australian National University seminar at Green

More information

Lecture 1 An Overview of High-Performance Computer Architecture. Automobile Factory (note: non-animated version)

Lecture 1 An Overview of High-Performance Computer Architecture. Automobile Factory (note: non-animated version) Lecture 1 An Overview of High-Performance Computer Architecture ECE 463/521 Fall 2002 Edward F. Gehringer Automobile Factory (note: non-animated version) Automobile Factory (note: non-animated version)

More information

Exploring the Effects of Hyperthreading on Scientific Applications

Exploring the Effects of Hyperthreading on Scientific Applications Exploring the Effects of Hyperthreading on Scientific Applications by Kent Milfeld milfeld@tacc.utexas.edu edu Kent Milfeld, Chona Guiang, Avijit Purkayastha, Jay Boisseau TEXAS ADVANCED COMPUTING CENTER

More information

EECS 470. Lecture 18. Simultaneous Multithreading. Fall 2018 Jon Beaumont

EECS 470. Lecture 18. Simultaneous Multithreading. Fall 2018 Jon Beaumont Lecture 18 Simultaneous Multithreading Fall 2018 Jon Beaumont http://www.eecs.umich.edu/courses/eecs470 Slides developed in part by Profs. Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi,

More information

Energy Efficiency in Main-Memory Databases

Energy Efficiency in Main-Memory Databases B. Mitschang et al. (Hrsg.): BTW 2017 Workshopband, Lecture Notes in Informatics (LNI), Gesellschaft für Informatik, Bonn 2017 335 Energy Efficiency in Main-Memory Databases Stefan Noll 1 Abstract: As

More information

class 9 fast scans 1.0 prof. Stratos Idreos

class 9 fast scans 1.0 prof. Stratos Idreos class 9 fast scans 1.0 prof. Stratos Idreos HTTP://DASLAB.SEAS.HARVARD.EDU/CLASSES/CS165/ 1 pass to merge into 8 sorted pages (2N pages) 1 pass to merge into 4 sorted pages (2N pages) 1 pass to merge into

More information

Memory Ordering Mechanisms for ARM? Tao C. Lee, Marc-Alexandre Boéchat CS, EPFL

Memory Ordering Mechanisms for ARM? Tao C. Lee, Marc-Alexandre Boéchat CS, EPFL Memory Ordering Mechanisms for ARM? Tao C. Lee, Marc-Alexandre Boéchat CS, EPFL Forecast This research studies the performance of memory ordering mechanisms on Chip Multi- Processors (CMPs) for modern

More information

Architectures for Instruction-Level Parallelism

Architectures for Instruction-Level Parallelism Low Power VLSI System Design Lecture : Low Power Microprocessor Design Prof. R. Iris Bahar October 0, 07 The HW/SW Interface Seminar Series Jointly sponsored by Engineering and Computer Science Hardware-Software

More information

Advanced d Processor Architecture. Computer Systems Laboratory Sungkyunkwan University

Advanced d Processor Architecture. Computer Systems Laboratory Sungkyunkwan University Advanced d Processor Architecture Jin-Soo Kim (jinsookim@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu Modern Microprocessors More than just GHz CPU Clock Speed SPECint2000

More information

Simultaneous Multithreading Processor

Simultaneous Multithreading Processor Simultaneous Multithreading Processor Paper presented: Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor James Lue Some slides are modified from http://hassan.shojania.com/pdf/smt_presentation.pdf

More information

Hardware-Based Speculation

Hardware-Based Speculation Hardware-Based Speculation Execute instructions along predicted execution paths but only commit the results if prediction was correct Instruction commit: allowing an instruction to update the register

More information

CSE502: Computer Architecture CSE 502: Computer Architecture

CSE502: Computer Architecture CSE 502: Computer Architecture CSE 502: Computer Architecture Multi-{Socket,,Thread} Getting More Performance Keep pushing IPC and/or frequenecy Design complexity (time to market) Cooling (cost) Power delivery (cost) Possible, but too

More information

Jignesh M. Patel. Blog:

Jignesh M. Patel. Blog: Jignesh M. Patel Blog: http://bigfastdata.blogspot.com Go back to the design Query Cache from Processing for Conscious 98s Modern (at Algorithms Hardware least for Hash Joins) 995 24 2 Processor Processor

More information

Join Processing for Flash SSDs: Remembering Past Lessons

Join Processing for Flash SSDs: Remembering Past Lessons Join Processing for Flash SSDs: Remembering Past Lessons Jaeyoung Do, Jignesh M. Patel Department of Computer Sciences University of Wisconsin-Madison $/MB GB Flash Solid State Drives (SSDs) Benefits of

More information

Instruction-Level Parallelism and Its Exploitation (Part III) ECE 154B Dmitri Strukov

Instruction-Level Parallelism and Its Exploitation (Part III) ECE 154B Dmitri Strukov Instruction-Level Parallelism and Its Exploitation (Part III) ECE 154B Dmitri Strukov Dealing With Control Hazards Simplest solution to stall pipeline until branch is resolved and target address is calculated

More information

Memory Hierarchy Computing Systems & Performance MSc Informatics Eng. Memory Hierarchy (most slides are borrowed)

Memory Hierarchy Computing Systems & Performance MSc Informatics Eng. Memory Hierarchy (most slides are borrowed) Computing Systems & Performance Memory Hierarchy MSc Informatics Eng. 2011/12 A.J.Proença Memory Hierarchy (most slides are borrowed) AJProença, Computer Systems & Performance, MEI, UMinho, 2011/12 1 2

More information

Beyond ILP. Hemanth M Bharathan Balaji. Hemanth M & Bharathan Balaji

Beyond ILP. Hemanth M Bharathan Balaji. Hemanth M & Bharathan Balaji Beyond ILP Hemanth M Bharathan Balaji Multiscalar Processors Gurindar S Sohi Scott E Breach T N Vijaykumar Control Flow Graph (CFG) Each node is a basic block in graph CFG divided into a collection of

More information

Memory Hierarchy Computing Systems & Performance MSc Informatics Eng. Memory Hierarchy (most slides are borrowed)

Memory Hierarchy Computing Systems & Performance MSc Informatics Eng. Memory Hierarchy (most slides are borrowed) Computing Systems & Performance Memory Hierarchy MSc Informatics Eng. 2012/13 A.J.Proença Memory Hierarchy (most slides are borrowed) AJProença, Computer Systems & Performance, MEI, UMinho, 2012/13 1 2

More information

PowerPC 620 Case Study

PowerPC 620 Case Study Chapter 6: The PowerPC 60 Modern Processor Design: Fundamentals of Superscalar Processors PowerPC 60 Case Study First-generation out-of-order processor Developed as part of Apple-IBM-Motorola alliance

More information

CPI < 1? How? What if dynamic branch prediction is wrong? Multiple issue processors: Speculative Tomasulo Processor

CPI < 1? How? What if dynamic branch prediction is wrong? Multiple issue processors: Speculative Tomasulo Processor 1 CPI < 1? How? From Single-Issue to: AKS Scalar Processors Multiple issue processors: VLIW (Very Long Instruction Word) Superscalar processors No ISA Support Needed ISA Support Needed 2 What if dynamic

More information

Lecture 21: Parallelism ILP to Multicores. Parallel Processing 101

Lecture 21: Parallelism ILP to Multicores. Parallel Processing 101 18 447 Lecture 21: Parallelism ILP to Multicores S 10 L21 1 James C. Hoe Dept of ECE, CMU April 7, 2010 Announcements: Handouts: Lab 4 due this week Optional reading assignments below. The Microarchitecture

More information

Computer Architecture A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved.

Computer Architecture A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology is more

More information

High-Performance Microarchitecture Techniques John Paul Shen Director of Microarchitecture Research Intel Labs

High-Performance Microarchitecture Techniques John Paul Shen Director of Microarchitecture Research Intel Labs High-Performance Microarchitecture Techniques John Paul Shen Director of Microarchitecture Research Intel Labs October 29, 2002 Microprocessor Research Forum Intel s Microarchitecture Research Labs! USA:

More information

Prefetch Threads for Database Operations on a Simultaneous Multi-threaded Processor

Prefetch Threads for Database Operations on a Simultaneous Multi-threaded Processor Prefetch Threads for Database Operations on a Simultaneous Multi-threaded Processor Kostas Papadopoulos December 11, 2005 Abstract Simultaneous Multi-threading (SMT) has been developed to increase instruction

More information

Hyperthreading Technology

Hyperthreading Technology Hyperthreading Technology Aleksandar Milenkovic Electrical and Computer Engineering Department University of Alabama in Huntsville milenka@ece.uah.edu www.ece.uah.edu/~milenka/ Outline What is hyperthreading?

More information

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading Review on ILP TDT 4260 Chap 5 TLP & Hierarchy What is ILP? Let the compiler find the ILP Advantages? Disadvantages? Let the HW find the ILP Advantages? Disadvantages? Contents Multi-threading Chap 3.5

More information

ECE 341. Lecture # 15

ECE 341. Lecture # 15 ECE 341 Lecture # 15 Instructor: Zeshan Chishti zeshan@ece.pdx.edu November 19, 2014 Portland State University Pipelining Structural Hazards Pipeline Performance Lecture Topics Effects of Stalls and Penalties

More information

Complex Pipelines and Branch Prediction

Complex Pipelines and Branch Prediction Complex Pipelines and Branch Prediction Daniel Sanchez Computer Science & Artificial Intelligence Lab M.I.T. L22-1 Processor Performance Time Program Instructions Program Cycles Instruction CPI Time Cycle

More information

Optimize Data Structures and Memory Access Patterns to Improve Data Locality

Optimize Data Structures and Memory Access Patterns to Improve Data Locality Optimize Data Structures and Memory Access Patterns to Improve Data Locality Abstract Cache is one of the most important resources

More information

32 Hyper-Threading on SMP Systems

32 Hyper-Threading on SMP Systems 32 Hyper-Threading on SMP Systems If you have not read the book (Performance Assurance for IT Systems) check the introduction to More Tasters on the web site http://www.b.king.dsl.pipex.com/ to understand

More information

Control Hazards. Prediction

Control Hazards. Prediction Control Hazards The nub of the problem: In what pipeline stage does the processor fetch the next instruction? If that instruction is a conditional branch, when does the processor know whether the conditional

More information

Computer Architecture Lecture 12: Out-of-Order Execution (Dynamic Instruction Scheduling)

Computer Architecture Lecture 12: Out-of-Order Execution (Dynamic Instruction Scheduling) 18-447 Computer Architecture Lecture 12: Out-of-Order Execution (Dynamic Instruction Scheduling) Prof. Onur Mutlu Carnegie Mellon University Spring 2015, 2/13/2015 Agenda for Today & Next Few Lectures

More information

Data Page Layouts for Relational Databases on Deep Memory Hierarchies

Data Page Layouts for Relational Databases on Deep Memory Hierarchies Data Page Layouts for Relational Databases on Deep Memory Hierarchies Anastassia Ailamaki David J. DeWitt Mark D. Hill Carnegie Mellon University natassa@cmu.edu University of Wisconsin - Madison {dewitt,

More information

Computer Architecture Spring 2016

Computer Architecture Spring 2016 Computer Architecture Spring 2016 Lecture 19: Multiprocessing Shuai Wang Department of Computer Science and Technology Nanjing University [Slides adapted from CSE 502 Stony Brook University] Getting More

More information

EE282 Computer Architecture. Lecture 1: What is Computer Architecture?

EE282 Computer Architecture. Lecture 1: What is Computer Architecture? EE282 Computer Architecture Lecture : What is Computer Architecture? September 27, 200 Marc Tremblay Computer Systems Laboratory Stanford University marctrem@csl.stanford.edu Goals Understand how computer

More information

c. What are the machine cycle times (in nanoseconds) of the non-pipelined and the pipelined implementations?

c. What are the machine cycle times (in nanoseconds) of the non-pipelined and the pipelined implementations? Brown University School of Engineering ENGN 164 Design of Computing Systems Professor Sherief Reda Homework 07. 140 points. Due Date: Monday May 12th in B&H 349 1. [30 points] Consider the non-pipelined

More information

The Pentium II/III Processor Compiler on a Chip

The Pentium II/III Processor Compiler on a Chip The Pentium II/III Processor Compiler on a Chip Ronny Ronen Senior Principal Engineer Director of Architecture Research Intel Labs - Haifa Intel Corporation Tel Aviv University January 20, 2004 1 Agenda

More information

Lecture 11: SMT and Caching Basics. Today: SMT, cache access basics (Sections 3.5, 5.1)

Lecture 11: SMT and Caching Basics. Today: SMT, cache access basics (Sections 3.5, 5.1) Lecture 11: SMT and Caching Basics Today: SMT, cache access basics (Sections 3.5, 5.1) 1 Thread-Level Parallelism Motivation: a single thread leaves a processor under-utilized for most of the time by doubling

More information

CEC 450 Real-Time Systems

CEC 450 Real-Time Systems CEC 450 Real-Time Systems Lecture 6 Accounting for I/O Latency September 28, 2015 Sam Siewert A Service Release and Response C i WCET Input/Output Latency Interference Time Response Time = Time Actuation

More information

Mainstream Computer System Components

Mainstream Computer System Components Mainstream Computer System Components Double Date Rate (DDR) SDRAM One channel = 8 bytes = 64 bits wide Current DDR3 SDRAM Example: PC3-12800 (DDR3-1600) 200 MHz (internal base chip clock) 8-way interleaved

More information

Mainstream Computer System Components CPU Core 2 GHz GHz 4-way Superscaler (RISC or RISC-core (x86): Dynamic scheduling, Hardware speculation

Mainstream Computer System Components CPU Core 2 GHz GHz 4-way Superscaler (RISC or RISC-core (x86): Dynamic scheduling, Hardware speculation Mainstream Computer System Components CPU Core 2 GHz - 3.0 GHz 4-way Superscaler (RISC or RISC-core (x86): Dynamic scheduling, Hardware speculation One core or multi-core (2-4) per chip Multiple FP, integer

More information

Chapter 05. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1

Chapter 05. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1 Chapter 05 Authors: John Hennessy & David Patterson Copyright 2011, Elsevier Inc. All rights Reserved. 1 Figure 5.1 Basic structure of a centralized shared-memory multiprocessor based on a multicore chip.

More information

Processors, Performance, and Profiling

Processors, Performance, and Profiling Processors, Performance, and Profiling Architecture 101: 5-Stage Pipeline Fetch Decode Execute Memory Write-Back Registers PC FP ALU Memory Architecture 101 1. Fetch instruction from memory. 2. Decode

More information

ibench: Quantifying Interference in Datacenter Applications

ibench: Quantifying Interference in Datacenter Applications ibench: Quantifying Interference in Datacenter Applications Christina Delimitrou and Christos Kozyrakis Stanford University IISWC September 23 th 2013 Executive Summary Problem: Increasing utilization

More information

Lecture 8 Dynamic Branch Prediction, Superscalar and VLIW. Computer Architectures S

Lecture 8 Dynamic Branch Prediction, Superscalar and VLIW. Computer Architectures S Lecture 8 Dynamic Branch Prediction, Superscalar and VLIW Computer Architectures 521480S Dynamic Branch Prediction Performance = ƒ(accuracy, cost of misprediction) Branch History Table (BHT) is simplest

More information

Two hours - online. The exam will be taken on line. This paper version is made available as a backup

Two hours - online. The exam will be taken on line. This paper version is made available as a backup COMP 25212 Two hours - online The exam will be taken on line. This paper version is made available as a backup UNIVERSITY OF MANCHESTER SCHOOL OF COMPUTER SCIENCE System Architecture Date: Monday 21st

More information

Microarchitecture Overview. Performance

Microarchitecture Overview. Performance Microarchitecture Overview Prof. Scott Rixner Duncan Hall 3028 rixner@rice.edu January 15, 2007 Performance 4 Make operations faster Process improvements Circuit improvements Use more transistors to make

More information

Memory Technology. Caches 1. Static RAM (SRAM) Dynamic RAM (DRAM) Magnetic disk. Ideal memory. 0.5ns 2.5ns, $2000 $5000 per GB

Memory Technology. Caches 1. Static RAM (SRAM) Dynamic RAM (DRAM) Magnetic disk. Ideal memory. 0.5ns 2.5ns, $2000 $5000 per GB Memory Technology Caches 1 Static RAM (SRAM) 0.5ns 2.5ns, $2000 $5000 per GB Dynamic RAM (DRAM) 50ns 70ns, $20 $75 per GB Magnetic disk 5ms 20ms, $0.20 $2 per GB Ideal memory Average access time similar

More information

CMSC411 Fall 2013 Midterm 2 Solutions

CMSC411 Fall 2013 Midterm 2 Solutions CMSC411 Fall 2013 Midterm 2 Solutions 1. (12 pts) Memory hierarchy a. (6 pts) Suppose we have a virtual memory of size 64 GB, or 2 36 bytes, where pages are 16 KB (2 14 bytes) each, and the machine has

More information

CISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP

CISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP CISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP Michela Taufer http://www.cis.udel.edu/~taufer/teaching/cis662f07 Powerpoint Lecture Notes from John Hennessy and David Patterson s: Computer

More information

Data-flow prescheduling for large instruction windows in out-of-order processors. Pierre Michaud, André Seznec IRISA / INRIA January 2001

Data-flow prescheduling for large instruction windows in out-of-order processors. Pierre Michaud, André Seznec IRISA / INRIA January 2001 Data-flow prescheduling for large instruction windows in out-of-order processors Pierre Michaud, André Seznec IRISA / INRIA January 2001 2 Introduction Context: dynamic instruction scheduling in out-oforder

More information

Homework 5. Start date: March 24 Due date: 11:59PM on April 10, Monday night. CSCI 402: Computer Architectures

Homework 5. Start date: March 24 Due date: 11:59PM on April 10, Monday night. CSCI 402: Computer Architectures Homework 5 Start date: March 24 Due date: 11:59PM on April 10, Monday night 4.1.1, 4.1.2 4.3 4.8.1, 4.8.2 4.9.1-4.9.4 4.13.1 4.16.1, 4.16.2 1 CSCI 402: Computer Architectures The Processor (4) Fengguang

More information

Page 1. Review: Dynamic Branch Prediction. Lecture 18: ILP and Dynamic Execution #3: Examples (Pentium III, Pentium 4, IBM AS/400)

Page 1. Review: Dynamic Branch Prediction. Lecture 18: ILP and Dynamic Execution #3: Examples (Pentium III, Pentium 4, IBM AS/400) CS252 Graduate Computer Architecture Lecture 18: ILP and Dynamic Execution #3: Examples (Pentium III, Pentium 4, IBM AS/400) April 4, 2001 Prof. David A. Patterson Computer Science 252 Spring 2001 Lec

More information