Rectangles All The Way Down. Martin Thompson

Size: px
Start display at page:

Download "Rectangles All The Way Down. Martin Thompson"

Transcription

1 Rectangles All The Way Down Martin Thompson

2 The most amazing achievement of the computer software industry is its continuing cancellation of the steady and staggering gains made by the computer hardware industry. - Henry Peteroski

3 Fundamental Laws

4 CPU Performance Memory Lane Transistor density doubles every year - Gordon Moore

5 CPU Performance Memory Lane Transistor density doubles every 2 years - Gordon Moore Transistor density doubles every year - Gordon Moore

6 CPU Performance Memory Lane CPUs double in speed every 18 months - David House Transistor density doubles every 2 years - Gordon Moore Transistor density doubles every year - Gordon Moore

7 CPU Performance Memory Lane The free lunch is over: - Herb Sutter CPUs double in speed every 18 months - David House Transistor density doubles every 2 years - Gordon Moore Transistor density doubles every year - Gordon Moore

8 CPU Performance Memory Lane Retirement of Tick Tock - Intel The free lunch is over: - Herb Sutter CPUs double in speed every 18 months - David House Transistor density doubles every 2 years - Gordon Moore Transistor density doubles every year - Gordon Moore

9 CPU Performance Memory Lane Spectre & Meltdown - Google Retirement of Tick Tock - Intel The free lunch is over: - Herb Sutter CPUs double in speed every 18 months - David House Transistor density doubles every 2 years - Gordon Moore Transistor density doubles every year - Gordon Moore

10 Concurrency & Parallelism

11

12 Universal Scalability Law (USL) C(N) = N / (1 + α(n 1) + ((β* N) * (N 1))) C = capacity or throughput N = number of processors α = contention penalty β = coherence penalty

13 Speedup Universal Scalability Law (USL) Processors Amdahl USL

14 If concurrency is so difficult then what else can we do?

15

16 Response Time Queueing Theory Utilisation

17 Queueing Theory r = s(2 ρ) / 2(1 ρ) r = mean response time s = service time ρ = utilisation Note: ρ = λ * s

18

19 Little s Law L = λw WIP = Throughput * Cycle Time

20 Little s Law L = λw WIP = Throughput * Cycle Time Bandwidth Delay Product: Bytes in flight = Bandwidth * Latency

21 Little s Law L = λw WIP = Throughput * Cycle Time Bandwidth Delay Product: Bytes in flight = Bandwidth * Latency 80 bytes / 100ns = 800 MB/s :10 LFBs

22 Memory

23 Are all memory operations equal?

24 Sequential Access - Average time in ns/op to sum all longs in a 1GB array?

25 Access Pattern Benchmark Benchmark Score Error Units ============================================ sequential ± ns/op ~1 ns/op

26 Really??? Less than 1ns per operation?

27 Instruction Level Parallelism

28

29 Access Pattern Benchmark Benchmark Score Error Units ============================================ sequential ± ns/op randompage ± ns/op

30 Access Pattern Benchmark Benchmark Score Error Units ============================================ sequential ± ns/op randompage ± ns/op dependentrandompage ± ns/op

31 Access Pattern Benchmark Benchmark Score Error Units ============================================ sequential ± ns/op randompage ± ns/op dependentrandompage ± ns/op randomheap ± ns/op

32 Access Pattern Benchmark Benchmark Score Error Units ============================================ sequential ± ns/op randompage ± ns/op dependentrandompage ± ns/op randomheap ± ns/op dependentrandomheap ± ns/op

33 Access Pattern Benchmark Benchmark Score Error Units ============================================ sequential ± ns/op randompage ± ns/op dependentrandompage ± ns/op randomheap ± ns/op dependentrandomheap ± ns/op ~90 ns/op

34

35 A 100ns cache-miss is a lost opportunity to execute ~1000 instructions on CPU

36 Algorithms & Data Structures

37 Little s Law L = λw Bandwidth Delay Product: Bytes in flight = Bandwidth * Latency 80 bytes / 100ns = 800 MB/s :10 LFBs

38 Little s Law L = λw Bandwidth Delay Product: Bytes in flight = Bandwidth * Latency 80 bytes / 100ns = 800 MB/s :10 LFBs 80 bytes / 15ns = 5.3 GB/s :prefectch

39 Little s Law L = λw Bandwidth Delay Product: Bytes in flight = Bandwidth * Latency 80 bytes / 100ns = 800 MB/s :10 LFBs 80 bytes / 15ns = 5.3 GB/s :prefectch 640 bytes / 15ns = 42.6 GB/s :cachelines

40 Arrays are the most efficient data structure to traverse

41

42 Functional data structures are like sausages, the more you see them being made, the less well you will sleep

43 Branches

44

45 Branch Benchmark Benchmark Score Error Units ============================================= baseline ± us/op

46 Branch Benchmark Benchmark Score Error Units ============================================= baseline ± us/op predictable ± us/op

47 Branch Benchmark Benchmark Score Error Units ============================================= baseline ± us/op predictable ± us/op unpredictable ± us/op

48 What can we do?

49 Count bits as Booleans

50 Wide Registers

51 Math, Data Dependencies, and Instruction Level Parallelism

52

53 Consider Sorting Arrays

54

55 It s a neat hack, and it s more useful now than it was then for two reasons. - Leslie Lamport (2011)

56 The obvious reason is that word size is larger now, with many computers having 64-bit words. - Leslie Lamport (2011)

57 The less obvious reason is that conditional operations are implemented with masking rather than branching. - Leslie Lamport (2011)

58 Branching is more costly on modern multi-issue computers than it was on the computers of the 70s. - Leslie Lamport (2011)

59

60 Work with your CPU caches

61 Memory Access Considerations 1. Temporal: group accesses in time

62 Memory Access Considerations 1. Temporal: group accesses in time 2. Spatial: group access in space

63 Memory Access Considerations 1. Temporal: group accesses in time 2. Spatial: group access in space 3. Pattern: create predictable patterns

64 Batching

65 Batching Amortising Costs 100% 90% 80% 70% 60% Average overhead per item, or operation, in a batch 50% 40% 30% 20% 10% 0%

66 Batching Amortising Costs 100% 90% 80% 70% 60% 50% 40% Words, Cachelines, Pages, Blocks, Frames, etc. 30% 20% 10% 0%

67 In closing

68 Profile, profile, profile...

69 Eliminate Waste Batch to Amortise Access Memory in Patterns Favour Math over Branches Favour Predictable Branches

70 Consider Parallelism - ILP & Task

71 Is it really Turtles all the way down?

72 Rectangles all the way down Is it really Turtles all the way down? Networks: Frames Operating Systems: Pages File systems and storage: Blocks DRAM memory: Banks and Row Buffers CPU cache subsystems: Cache Lines Applications use Arrays plus and interesting data structures are made up of small Arrays

73 I don t care what data structure you use, nothing beats an array - a HFT Programmer

74 Questions? Travel is fatal to prejudice, bigotry, and narrow-mindedness, and many of our people need it sorely on these accounts. Broad, wholesome, charitable views of men and things cannot be acquired by vegetating in one little corner of the earth all one's lifetime. - Mark Twain

Designing for Performance. Martin Thompson

Designing for Performance. Martin Thompson Designing for Performance Martin Thompson - @mjpt777 Feynman is becoming a real pain. He has the greatest scientific honesty of anyone I ve ever meet - William P Rogers The impact of QED cannot be overestimated.

More information

Designing for Performance. Martin Thompson

Designing for Performance. Martin Thompson Designing for Performance Martin Thompson - @mjpt777 Is it difficult writing software that has good performance? RDD (Resume Driven Development) http://www.semiconductors.org/main/2015_international_technology_roadmap_for_semiconductors_itrs/

More information

High Performance Managed Languages. Martin Thompson

High Performance Managed Languages. Martin Thompson High Performance Managed Languages Martin Thompson - @mjpt777 Really, what s your preferred platform for building HFT applications? Why would you build low-latency applications on a GC ed platform? Some

More information

High Performance Managed Languages. Martin Thompson

High Performance Managed Languages. Martin Thompson High Performance Managed Languages Martin Thompson - @mjpt777 Really, what is your preferred platform for building HFT applications? Why do you build low-latency applications on a GC ed platform? Agenda

More information

CPU Architecture Overview. Varun Sampath CIS 565 Spring 2012

CPU Architecture Overview. Varun Sampath CIS 565 Spring 2012 CPU Architecture Overview Varun Sampath CIS 565 Spring 2012 Objectives Performance tricks of a modern CPU Pipelining Branch Prediction Superscalar Out-of-Order (OoO) Execution Memory Hierarchy Vector Operations

More information

Computer Architecture s Changing Definition

Computer Architecture s Changing Definition Computer Architecture s Changing Definition 1950s Computer Architecture Computer Arithmetic 1960s Operating system support, especially memory management 1970s to mid 1980s Computer Architecture Instruction

More information

A Quest for Predictable Latency Adventures in Java Concurrency. Martin Thompson

A Quest for Predictable Latency Adventures in Java Concurrency. Martin Thompson A Quest for Predictable Latency Adventures in Java Concurrency Martin Thompson - @mjpt777 If a system does not respond in a timely manner then it is effectively unavailable 1. It s all about the Blocking

More information

27. Parallel Programming I

27. Parallel Programming I 771 27. Parallel Programming I Moore s Law and the Free Lunch, Hardware Architectures, Parallel Execution, Flynn s Taxonomy, Scalability: Amdahl and Gustafson, Data-parallelism, Task-parallelism, Scheduling

More information

27. Parallel Programming I

27. Parallel Programming I 760 27. Parallel Programming I Moore s Law and the Free Lunch, Hardware Architectures, Parallel Execution, Flynn s Taxonomy, Scalability: Amdahl and Gustafson, Data-parallelism, Task-parallelism, Scheduling

More information

27. Parallel Programming I

27. Parallel Programming I The Free Lunch 27. Parallel Programming I Moore s Law and the Free Lunch, Hardware Architectures, Parallel Execution, Flynn s Taxonomy, Scalability: Amdahl and Gustafson, Data-parallelism, Task-parallelism,

More information

Multilevel Memories. Joel Emer Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology

Multilevel Memories. Joel Emer Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology 1 Multilevel Memories Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology Based on the material prepared by Krste Asanovic and Arvind CPU-Memory Bottleneck 6.823

More information

CMSC 611: Advanced Computer Architecture

CMSC 611: Advanced Computer Architecture CMSC 611: Advanced Computer Architecture, I/O and Disk Most slides adapted from David Patterson. Some from Mohomed Younis Main Background Performance of Main : Latency: affects cache miss penalty Access

More information

Reactive design patterns for microservices on multicore

Reactive design patterns for microservices on multicore Reactive Software with elegance Reactive design patterns for microservices on multicore Reactive summit - 22/10/18 charly.bechara@tredzone.com Outline Microservices on multicore Reactive Multicore Patterns

More information

Written Exam / Tentamen

Written Exam / Tentamen Written Exam / Tentamen Computer Organization and Components / Datorteknik och komponenter (IS1500), 9 hp Computer Hardware Engineering / Datorteknik, grundkurs (IS1200), 7.5 hp KTH Royal Institute of

More information

Computer Architecture

Computer Architecture Computer Architecture Architecture The art and science of designing and constructing buildings A style and method of design and construction Design, the way components fit together Computer Architecture

More information

L7: Performance. Frans Kaashoek Spring 2013

L7: Performance. Frans Kaashoek Spring 2013 L7: Performance Frans Kaashoek kaashoek@mit.edu 6.033 Spring 2013 Overview Technology fixes some performance problems Ride the technology curves if you can Some performance requirements require thinking

More information

Kartik Lakhotia, Rajgopal Kannan, Viktor Prasanna USENIX ATC 18

Kartik Lakhotia, Rajgopal Kannan, Viktor Prasanna USENIX ATC 18 Accelerating PageRank using Partition-Centric Processing Kartik Lakhotia, Rajgopal Kannan, Viktor Prasanna USENIX ATC 18 Outline Introduction Partition-centric Processing Methodology Analytical Evaluation

More information

SRAMs to Memory. Memory Hierarchy. Locality. Low Power VLSI System Design Lecture 10: Low Power Memory Design

SRAMs to Memory. Memory Hierarchy. Locality. Low Power VLSI System Design Lecture 10: Low Power Memory Design SRAMs to Memory Low Power VLSI System Design Lecture 0: Low Power Memory Design Prof. R. Iris Bahar October, 07 Last lecture focused on the SRAM cell and the D or D memory architecture built from these

More information

Parallelism. CS6787 Lecture 8 Fall 2017

Parallelism. CS6787 Lecture 8 Fall 2017 Parallelism CS6787 Lecture 8 Fall 2017 So far We ve been talking about algorithms We ve been talking about ways to optimize their parameters But we haven t talked about the underlying hardware How does

More information

CMSC Computer Architecture Lecture 12: Multi-Core. Prof. Yanjing Li University of Chicago

CMSC Computer Architecture Lecture 12: Multi-Core. Prof. Yanjing Li University of Chicago CMSC 22200 Computer Architecture Lecture 12: Multi-Core Prof. Yanjing Li University of Chicago Administrative Stuff! Lab 4 " Due: 11:49pm, Saturday " Two late days with penalty! Exam I " Grades out on

More information

Maximizing Face Detection Performance

Maximizing Face Detection Performance Maximizing Face Detection Performance Paulius Micikevicius Developer Technology Engineer, NVIDIA GTC 2015 1 Outline Very brief review of cascaded-classifiers Parallelization choices Reducing the amount

More information

Performance COE 403. Computer Architecture Prof. Muhamed Mudawar. Computer Engineering Department King Fahd University of Petroleum and Minerals

Performance COE 403. Computer Architecture Prof. Muhamed Mudawar. Computer Engineering Department King Fahd University of Petroleum and Minerals Performance COE 403 Computer Architecture Prof. Muhamed Mudawar Computer Engineering Department King Fahd University of Petroleum and Minerals What is Performance? How do we measure the performance of

More information

Course web site: teaching/courses/car. Piazza discussion forum:

Course web site:   teaching/courses/car. Piazza discussion forum: Announcements Course web site: http://www.inf.ed.ac.uk/ teaching/courses/car Lecture slides Tutorial problems Courseworks Piazza discussion forum: http://piazza.com/ed.ac.uk/spring2018/car Tutorials start

More information

Sarah L. Harris and David Money Harris. Digital Design and Computer Architecture: ARM Edition Chapter 8 <1>

Sarah L. Harris and David Money Harris. Digital Design and Computer Architecture: ARM Edition Chapter 8 <1> Chapter 8 Digital Design and Computer Architecture: ARM Edition Sarah L. Harris and David Money Harris Digital Design and Computer Architecture: ARM Edition 215 Chapter 8 Chapter 8 :: Topics Introduction

More information

Fundamentals of Computer Design

Fundamentals of Computer Design Fundamentals of Computer Design Computer Architecture J. Daniel García Sánchez (coordinator) David Expósito Singh Francisco Javier García Blas ARCOS Group Computer Science and Engineering Department University

More information

Algorithm Performance Factors. Memory Performance of Algorithms. Processor-Memory Performance Gap. Moore s Law. Program Model of Memory I

Algorithm Performance Factors. Memory Performance of Algorithms. Processor-Memory Performance Gap. Moore s Law. Program Model of Memory I Memory Performance of Algorithms CSE 32 Data Structures Lecture Algorithm Performance Factors Algorithm choices (asymptotic running time) O(n 2 ) or O(n log n) Data structure choices List or Arrays Language

More information

CSCI-GA Multicore Processors: Architecture & Programming Lecture 3: The Memory System You Can t Ignore it!

CSCI-GA Multicore Processors: Architecture & Programming Lecture 3: The Memory System You Can t Ignore it! CSCI-GA.3033-012 Multicore Processors: Architecture & Programming Lecture 3: The Memory System You Can t Ignore it! Mohamed Zahran (aka Z) mzahran@cs.nyu.edu http://www.mzahran.com Memory Computer Technology

More information

Cache Memory COE 403. Computer Architecture Prof. Muhamed Mudawar. Computer Engineering Department King Fahd University of Petroleum and Minerals

Cache Memory COE 403. Computer Architecture Prof. Muhamed Mudawar. Computer Engineering Department King Fahd University of Petroleum and Minerals Cache Memory COE 403 Computer Architecture Prof. Muhamed Mudawar Computer Engineering Department King Fahd University of Petroleum and Minerals Presentation Outline The Need for Cache Memory The Basics

More information

TDT4260/DT8803 COMPUTER ARCHITECTURE EXAM

TDT4260/DT8803 COMPUTER ARCHITECTURE EXAM Norwegian University of Science and Technology Department of Computer and Information Science Page 1 of 13 Contact: Magnus Jahre (952 22 309) TDT4260/DT8803 COMPUTER ARCHITECTURE EXAM Monday 4. June Time:

More information

UNIVERSITY OF MASSACHUSETTS Dept. of Electrical & Computer Engineering. Computer Architecture ECE 568

UNIVERSITY OF MASSACHUSETTS Dept. of Electrical & Computer Engineering. Computer Architecture ECE 568 UNIVERSITY OF MASSACHUSETTS Dept. of Electrical & Computer Engineering Computer Architecture ECE 568 Part 6 Input/Output Israel Koren ECE568/Koren Part.6. CPU performance keeps increasing 26 72-core Xeon

More information

Module 18: "TLP on Chip: HT/SMT and CMP" Lecture 39: "Simultaneous Multithreading and Chip-multiprocessing" TLP on Chip: HT/SMT and CMP SMT

Module 18: TLP on Chip: HT/SMT and CMP Lecture 39: Simultaneous Multithreading and Chip-multiprocessing TLP on Chip: HT/SMT and CMP SMT TLP on Chip: HT/SMT and CMP SMT Multi-threading Problems of SMT CMP Why CMP? Moore s law Power consumption? Clustered arch. ABCs of CMP Shared cache design Hierarchical MP file:///e /parallel_com_arch/lecture39/39_1.htm[6/13/2012

More information

Cache Optimisation. sometime he thought that there must be a better way

Cache Optimisation. sometime he thought that there must be a better way Cache sometime he thought that there must be a better way 2 Cache 1. Reduce miss rate a) Increase block size b) Increase cache size c) Higher associativity d) compiler optimisation e) Parallelism f) prefetching

More information

PERFORMANCE MEASUREMENT

PERFORMANCE MEASUREMENT Administrivia CMSC 411 Computer Systems Architecture Lecture 3 Performance Measurement and Reliability Homework problems for Unit 1 posted today due next Thursday, 2/12 Start reading Appendix C Basic Pipelining

More information

Multicore Programming

Multicore Programming Multi Programming Parallel Hardware and Performance 8 Nov 00 (Part ) Peter Sewell Jaroslav Ševčík Tim Harris Merge sort 6MB input (-bit integers) Recurse(left) ~98% execution time Recurse(right) Merge

More information

Keywords and Review Questions

Keywords and Review Questions Keywords and Review Questions lec1: Keywords: ISA, Moore s Law Q1. Who are the people credited for inventing transistor? Q2. In which year IC was invented and who was the inventor? Q3. What is ISA? Explain

More information

Memory Hierarchy Computing Systems & Performance MSc Informatics Eng. Memory Hierarchy (most slides are borrowed)

Memory Hierarchy Computing Systems & Performance MSc Informatics Eng. Memory Hierarchy (most slides are borrowed) Computing Systems & Performance Memory Hierarchy MSc Informatics Eng. 2012/13 A.J.Proença Memory Hierarchy (most slides are borrowed) AJProença, Computer Systems & Performance, MEI, UMinho, 2012/13 1 2

More information

Lecture 1: Introduction

Lecture 1: Introduction Contemporary Computer Architecture Instruction set architecture Lecture 1: Introduction CprE 581 Computer Systems Architecture, Fall 2016 Reading: Textbook, Ch. 1.1-1.7 Microarchitecture; examples: Pipeline

More information

Memory Hierarchy Technology. The Big Picture: Where are We Now? The Five Classic Components of a Computer

Memory Hierarchy Technology. The Big Picture: Where are We Now? The Five Classic Components of a Computer The Big Picture: Where are We Now? The Five Classic Components of a Computer Processor Control Datapath Today s Topics: technologies Technology trends Impact on performance Hierarchy The principle of locality

More information

Copyright 2012, Elsevier Inc. All rights reserved.

Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 1 Fundamentals of Quantitative Design and Analysis 1 Computer Technology Performance improvements: Improvements in semiconductor technology

More information

Memory. From Chapter 3 of High Performance Computing. c R. Leduc

Memory. From Chapter 3 of High Performance Computing. c R. Leduc Memory From Chapter 3 of High Performance Computing c 2002-2004 R. Leduc Memory Even if CPU is infinitely fast, still need to read/write data to memory. Speed of memory increasing much slower than processor

More information

CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS

CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS UNIT-I OVERVIEW & INSTRUCTIONS 1. What are the eight great ideas in computer architecture? The eight

More information

CSE502 Graduate Computer Architecture. Lec 22 Goodbye to Computer Architecture and Review

CSE502 Graduate Computer Architecture. Lec 22 Goodbye to Computer Architecture and Review CSE502 Graduate Computer Architecture Lec 22 Goodbye to Computer Architecture and Review Larry Wittie Computer Science, StonyBrook University http://www.cs.sunysb.edu/~cse502 and ~lw Slides adapted from

More information

Algorithm Performance Factors. Memory Performance of Algorithms. Processor-Memory Performance Gap. Moore s Law. Program Model of Memory II

Algorithm Performance Factors. Memory Performance of Algorithms. Processor-Memory Performance Gap. Moore s Law. Program Model of Memory II Memory Performance of Algorithms CSE 32 Data Structures Lecture Algorithm Performance Factors Algorithm choices (asymptotic running time) O(n 2 ) or O(n log n) Data structure choices List or Arrays Language

More information

Improving DRAM Performance by Parallelizing Refreshes with Accesses

Improving DRAM Performance by Parallelizing Refreshes with Accesses Improving DRAM Performance by Parallelizing Refreshes with Accesses Kevin Chang Donghyuk Lee, Zeshan Chishti, Alaa Alameldeen, Chris Wilkerson, Yoongu Kim, Onur Mutlu Executive Summary DRAM refresh interferes

More information

ECE/CS 757: Homework 1

ECE/CS 757: Homework 1 ECE/CS 757: Homework 1 Cores and Multithreading 1. A CPU designer has to decide whether or not to add a new micoarchitecture enhancement to improve performance (ignoring power costs) of a block (coarse-grain)

More information

Administration. Coursework. Prerequisites. CS 378: Programming for Performance. 4 or 5 programming projects

Administration. Coursework. Prerequisites. CS 378: Programming for Performance. 4 or 5 programming projects CS 378: Programming for Performance Administration Instructors: Keshav Pingali (Professor, CS department & ICES) 4.126 ACES Email: pingali@cs.utexas.edu TA: Hao Wu (Grad student, CS department) Email:

More information

Accelerating Pointer Chasing in 3D-Stacked Memory: Challenges, Mechanisms, Evaluation Kevin Hsieh

Accelerating Pointer Chasing in 3D-Stacked Memory: Challenges, Mechanisms, Evaluation Kevin Hsieh Accelerating Pointer Chasing in 3D-Stacked : Challenges, Mechanisms, Evaluation Kevin Hsieh Samira Khan, Nandita Vijaykumar, Kevin K. Chang, Amirali Boroumand, Saugata Ghose, Onur Mutlu Executive Summary

More information

Lecture 21: Parallelism ILP to Multicores. Parallel Processing 101

Lecture 21: Parallelism ILP to Multicores. Parallel Processing 101 18 447 Lecture 21: Parallelism ILP to Multicores S 10 L21 1 James C. Hoe Dept of ECE, CMU April 7, 2010 Announcements: Handouts: Lab 4 due this week Optional reading assignments below. The Microarchitecture

More information

Computer Architecture Review. ICS332 - Spring 2016 Operating Systems

Computer Architecture Review. ICS332 - Spring 2016 Operating Systems Computer Architecture Review ICS332 - Spring 2016 Operating Systems ENIAC (1946) Electronic Numerical Integrator and Calculator Stored-Program Computer (instead of Fixed-Program) Vacuum tubes, punch cards

More information

Lec 25: Parallel Processors. Announcements

Lec 25: Parallel Processors. Announcements Lec 25: Parallel Processors Kavita Bala CS 340, Fall 2008 Computer Science Cornell University PA 3 out Hack n Seek Announcements The goal is to have fun with it Recitations today will talk about it Pizza

More information

Page 1. Multilevel Memories (Improving performance using a little cash )

Page 1. Multilevel Memories (Improving performance using a little cash ) Page 1 Multilevel Memories (Improving performance using a little cash ) 1 Page 2 CPU-Memory Bottleneck CPU Memory Performance of high-speed computers is usually limited by memory bandwidth & latency Latency

More information

Some material adapted from Mohamed Younis, UMBC CMSC 611 Spr 2003 course slides Some material adapted from Hennessy & Patterson / 2003 Elsevier

Some material adapted from Mohamed Younis, UMBC CMSC 611 Spr 2003 course slides Some material adapted from Hennessy & Patterson / 2003 Elsevier Some material adapted from Mohamed Younis, UMBC CMSC 611 Spr 2003 course slides Some material adapted from Hennessy & Patterson / 2003 Elsevier Science Performance of Main : Latency: affects cache miss

More information

COSC 6385 Computer Architecture - Memory Hierarchies (II)

COSC 6385 Computer Architecture - Memory Hierarchies (II) COSC 6385 Computer Architecture - Memory Hierarchies (II) Edgar Gabriel Spring 2018 Types of cache misses Compulsory Misses: first access to a block cannot be in the cache (cold start misses) Capacity

More information

Memory Hierarchies. Instructor: Dmitri A. Gusev. Fall Lecture 10, October 8, CS 502: Computers and Communications Technology

Memory Hierarchies. Instructor: Dmitri A. Gusev. Fall Lecture 10, October 8, CS 502: Computers and Communications Technology Memory Hierarchies Instructor: Dmitri A. Gusev Fall 2007 CS 502: Computers and Communications Technology Lecture 10, October 8, 2007 Memories SRAM: value is stored on a pair of inverting gates very fast

More information

Chapter Seven. Memories: Review. Exploiting Memory Hierarchy CACHE MEMORY AND VIRTUAL MEMORY

Chapter Seven. Memories: Review. Exploiting Memory Hierarchy CACHE MEMORY AND VIRTUAL MEMORY Chapter Seven CACHE MEMORY AND VIRTUAL MEMORY 1 Memories: Review SRAM: value is stored on a pair of inverting gates very fast but takes up more space than DRAM (4 to 6 transistors) DRAM: value is stored

More information

Database Systems II. Secondary Storage

Database Systems II. Secondary Storage Database Systems II Secondary Storage CMPT 454, Simon Fraser University, Fall 2009, Martin Ester 29 The Memory Hierarchy Swapping, Main-memory DBMS s Tertiary Storage: Tape, Network Backup 3,200 MB/s (DDR-SDRAM

More information

CS 61C: Great Ideas in Computer Architecture. The Memory Hierarchy, Fully Associative Caches

CS 61C: Great Ideas in Computer Architecture. The Memory Hierarchy, Fully Associative Caches CS 61C: Great Ideas in Computer Architecture The Memory Hierarchy, Fully Associative Caches Instructor: Alan Christopher 7/09/2014 Summer 2014 -- Lecture #10 1 Review of Last Lecture Floating point (single

More information

1/19/2009. Data Locality. Exploiting Locality: Caches

1/19/2009. Data Locality. Exploiting Locality: Caches Spring 2009 Prof. Hyesoon Kim Thanks to Prof. Loh & Prof. Prvulovic Data Locality Temporal: if data item needed now, it is likely to be needed again in near future Spatial: if data item needed now, nearby

More information

Computer Systems Architecture I. CSE 560M Lecture 18 Guest Lecturer: Shakir James

Computer Systems Architecture I. CSE 560M Lecture 18 Guest Lecturer: Shakir James Computer Systems Architecture I CSE 560M Lecture 18 Guest Lecturer: Shakir James Plan for Today Announcements No class meeting on Monday, meet in project groups Project demos < 2 weeks, Nov 23 rd Questions

More information

Module 1: Basics and Background Lecture 4: Memory and Disk Accesses. The Lecture Contains: Memory organisation. Memory hierarchy. Disks.

Module 1: Basics and Background Lecture 4: Memory and Disk Accesses. The Lecture Contains: Memory organisation. Memory hierarchy. Disks. The Lecture Contains: Memory organisation Example of memory hierarchy Memory hierarchy Disks Disk access Disk capacity Disk access time Typical disk parameters Access times file:///c /Documents%20and%20Settings/iitkrana1/My%20Documents/Google%20Talk%20Received%20Files/ist_data/lecture4/4_1.htm[6/14/2012

More information

Cycle Time for Non-pipelined & Pipelined processors

Cycle Time for Non-pipelined & Pipelined processors Cycle Time for Non-pipelined & Pipelined processors Fetch Decode Execute Memory Writeback 250ps 350ps 150ps 300ps 200ps For a non-pipelined processor, the clock cycle is the sum of the latencies of all

More information

Multicore Hardware and Parallelism

Multicore Hardware and Parallelism Multicore Hardware and Parallelism Minsoo Ryu Department of Computer Science and Engineering 2 1 Advent of Multicore Hardware 2 Multicore Processors 3 Amdahl s Law 4 Parallelism in Hardware 5 Q & A 2 3

More information

Parallel Algorithm Engineering

Parallel Algorithm Engineering Parallel Algorithm Engineering Kenneth S. Bøgh PhD Fellow Based on slides by Darius Sidlauskas Outline Background Current multicore architectures UMA vs NUMA The openmp framework and numa control Examples

More information

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading Review on ILP TDT 4260 Chap 5 TLP & Hierarchy What is ILP? Let the compiler find the ILP Advantages? Disadvantages? Let the HW find the ILP Advantages? Disadvantages? Contents Multi-threading Chap 3.5

More information

Mainstream Computer System Components CPU Core 2 GHz GHz 4-way Superscaler (RISC or RISC-core (x86): Dynamic scheduling, Hardware speculation

Mainstream Computer System Components CPU Core 2 GHz GHz 4-way Superscaler (RISC or RISC-core (x86): Dynamic scheduling, Hardware speculation Mainstream Computer System Components CPU Core 2 GHz - 3.0 GHz 4-way Superscaler (RISC or RISC-core (x86): Dynamic scheduling, Hardware speculation One core or multi-core (2-4) per chip Multiple FP, integer

More information

Mainstream Computer System Components

Mainstream Computer System Components Mainstream Computer System Components Double Date Rate (DDR) SDRAM One channel = 8 bytes = 64 bits wide Current DDR3 SDRAM Example: PC3-12800 (DDR3-1600) 200 MHz (internal base chip clock) 8-way interleaved

More information

Memory Technology. Caches 1. Static RAM (SRAM) Dynamic RAM (DRAM) Magnetic disk. Ideal memory. 0.5ns 2.5ns, $2000 $5000 per GB

Memory Technology. Caches 1. Static RAM (SRAM) Dynamic RAM (DRAM) Magnetic disk. Ideal memory. 0.5ns 2.5ns, $2000 $5000 per GB Memory Technology Caches 1 Static RAM (SRAM) 0.5ns 2.5ns, $2000 $5000 per GB Dynamic RAM (DRAM) 50ns 70ns, $20 $75 per GB Magnetic disk 5ms 20ms, $0.20 $2 per GB Ideal memory Average access time similar

More information

Adapted from David Patterson s slides on graduate computer architecture

Adapted from David Patterson s slides on graduate computer architecture Mei Yang Adapted from David Patterson s slides on graduate computer architecture Introduction Ten Advanced Optimizations of Cache Performance Memory Technology and Optimizations Virtual Memory and Virtual

More information

EE 4980 Modern Electronic Systems. Processor Advanced

EE 4980 Modern Electronic Systems. Processor Advanced EE 4980 Modern Electronic Systems Processor Advanced Architecture General Purpose Processor User Programmable Intended to run end user selected programs Application Independent PowerPoint, Chrome, Twitter,

More information

The University of Adelaide, School of Computer Science 13 September 2018

The University of Adelaide, School of Computer Science 13 September 2018 Computer Architecture A Quantitative Approach, Sixth Edition Chapter 2 Memory Hierarchy Design 1 Programmers want unlimited amounts of memory with low latency Fast memory technology is more expensive per

More information

Microarchitecture Overview. Performance

Microarchitecture Overview. Performance Microarchitecture Overview Prof. Scott Rixner Duncan Hall 3028 rixner@rice.edu January 18, 2005 Performance 4 Make operations faster Process improvements Circuit improvements Use more transistors to make

More information

Review: Creating a Parallel Program. Programming for Performance

Review: Creating a Parallel Program. Programming for Performance Review: Creating a Parallel Program Can be done by programmer, compiler, run-time system or OS Steps for creating parallel program Decomposition Assignment of tasks to processes Orchestration Mapping (C)

More information

Administrivia. Talks and other opportunities: Expect HW on functions in ASM (printing binary trees) soon

Administrivia. Talks and other opportunities: Expect HW on functions in ASM (printing binary trees) soon Threads 2/9/18 Administrivia Talks and other opportunities: Game designer and developer talk: Wed noon, Alumni Hall Room 302 (extra credit!) Networking, resume, interview: Wed 4pm, Alumni Hall Room 219

More information

COSC 6385 Computer Architecture - Memory Hierarchy Design (III)

COSC 6385 Computer Architecture - Memory Hierarchy Design (III) COSC 6385 Computer Architecture - Memory Hierarchy Design (III) Fall 2006 Reducing cache miss penalty Five techniques Multilevel caches Critical word first and early restart Giving priority to read misses

More information

ECE 152 Introduction to Computer Architecture

ECE 152 Introduction to Computer Architecture Introduction to Computer Architecture Main Memory and Virtual Memory Copyright 2009 Daniel J. Sorin Duke University Slides are derived from work by Amir Roth (Penn) Spring 2009 1 Where We Are in This Course

More information

CS 61C: Great Ideas in Computer Architecture. Direct Mapped Caches

CS 61C: Great Ideas in Computer Architecture. Direct Mapped Caches CS 61C: Great Ideas in Computer Architecture Direct Mapped Caches Instructor: Justin Hsia 7/05/2012 Summer 2012 Lecture #11 1 Review of Last Lecture Floating point (single and double precision) approximates

More information

Memory Hierarchy Computing Systems & Performance MSc Informatics Eng. Memory Hierarchy (most slides are borrowed)

Memory Hierarchy Computing Systems & Performance MSc Informatics Eng. Memory Hierarchy (most slides are borrowed) Computing Systems & Performance Memory Hierarchy MSc Informatics Eng. 2011/12 A.J.Proença Memory Hierarchy (most slides are borrowed) AJProença, Computer Systems & Performance, MEI, UMinho, 2011/12 1 2

More information

EE282 Computer Architecture. Lecture 1: What is Computer Architecture?

EE282 Computer Architecture. Lecture 1: What is Computer Architecture? EE282 Computer Architecture Lecture : What is Computer Architecture? September 27, 200 Marc Tremblay Computer Systems Laboratory Stanford University marctrem@csl.stanford.edu Goals Understand how computer

More information

Computer Architecture. R. Poss

Computer Architecture. R. Poss Computer Architecture R. Poss 1 ca01-10 september 2015 Course & organization 2 ca01-10 september 2015 Aims of this course The aims of this course are: to highlight current trends to introduce the notion

More information

Advanced cache optimizations. ECE 154B Dmitri Strukov

Advanced cache optimizations. ECE 154B Dmitri Strukov Advanced cache optimizations ECE 154B Dmitri Strukov Advanced Cache Optimization 1) Way prediction 2) Victim cache 3) Critical word first and early restart 4) Merging write buffer 5) Nonblocking cache

More information

ECE 588/688 Advanced Computer Architecture II

ECE 588/688 Advanced Computer Architecture II ECE 588/688 Advanced Computer Architecture II Instructor: Alaa Alameldeen alaa@ece.pdx.edu Fall 2009 Portland State University Copyright by Alaa Alameldeen and Haitham Akkary 2009 1 When and Where? When:

More information

Transistor: Digital Building Blocks

Transistor: Digital Building Blocks Final Exam Review Transistor: Digital Building Blocks Logically, each transistor acts as a switch Combined to implement logic functions (gates) AND, OR, NOT Combined to build higher-level structures Multiplexer,

More information

UNIVERSITY OF MASSACHUSETTS Dept. of Electrical & Computer Engineering. Computer Architecture ECE 568

UNIVERSITY OF MASSACHUSETTS Dept. of Electrical & Computer Engineering. Computer Architecture ECE 568 UNIVERSITY OF MASSACHUSETTS Dept. of Electrical & Computer Engineering Computer Architecture ECE 568 Part 6 Input/Output Israel Koren ECE568/Koren Part.6. Motivation: Why Care About I/O? CPU Performance:

More information

ECE468 Computer Organization and Architecture. Memory Hierarchy

ECE468 Computer Organization and Architecture. Memory Hierarchy ECE468 Computer Organization and Architecture Hierarchy ECE468 memory.1 The Big Picture: Where are We Now? The Five Classic Components of a Computer Processor Control Input Datapath Output Today s Topic:

More information

CENG3420 Lecture 08: Memory Organization

CENG3420 Lecture 08: Memory Organization CENG3420 Lecture 08: Memory Organization Bei Yu byu@cse.cuhk.edu.hk (Latest update: February 22, 2018) Spring 2018 1 / 48 Overview Introduction Random Access Memory (RAM) Interleaving Secondary Memory

More information

Chapter 02. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1

Chapter 02. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1 Chapter 02 Authors: John Hennessy & David Patterson Copyright 2011, Elsevier Inc. All rights Reserved. 1 Figure 2.1 The levels in a typical memory hierarchy in a server computer shown on top (a) and in

More information

Lecture 29 Review" CPU time: the best metric" Be sure you understand CC, clock period" Common (and good) performance metrics"

Lecture 29 Review CPU time: the best metric Be sure you understand CC, clock period Common (and good) performance metrics Be sure you understand CC, clock period Lecture 29 Review Suggested reading: Everything Q1: D[8] = D[8] + RF[1] + RF[4] I[15]: Add R2, R1, R4 RF[1] = 4 I[16]: MOV R3, 8 RF[4] = 5 I[17]: Add R2, R2, R3

More information

CSEE W4824 Computer Architecture Fall 2012

CSEE W4824 Computer Architecture Fall 2012 CSEE W4824 Computer Architecture Fall 2012 Lecture 8 Memory Hierarchy Design: Memory Technologies and the Basics of Caches Luca Carloni Department of Computer Science Columbia University in the City of

More information

Advanced Parallel Programming I

Advanced Parallel Programming I Advanced Parallel Programming I Alexander Leutgeb, RISC Software GmbH RISC Software GmbH Johannes Kepler University Linz 2016 22.09.2016 1 Levels of Parallelism RISC Software GmbH Johannes Kepler University

More information

Tracking Acceleration with FPGAs. Future Tracking, CMS Week 4/12/17 Sioni Summers

Tracking Acceleration with FPGAs. Future Tracking, CMS Week 4/12/17 Sioni Summers Tracking Acceleration with FPGAs Future Tracking, CMS Week 4/12/17 Sioni Summers Contents Introduction FPGAs & 'DataFlow Engines' for computing Device architecture Maxeler HLT Tracking Acceleration 2 Introduction

More information

Modern Processor Architectures. L25: Modern Compiler Design

Modern Processor Architectures. L25: Modern Compiler Design Modern Processor Architectures L25: Modern Compiler Design The 1960s - 1970s Instructions took multiple cycles Only one instruction in flight at once Optimisation meant minimising the number of instructions

More information

Fundamentals of Quantitative Design and Analysis

Fundamentals of Quantitative Design and Analysis Fundamentals of Quantitative Design and Analysis Dr. Jiang Li Adapted from the slides provided by the authors Computer Technology Performance improvements: Improvements in semiconductor technology Feature

More information

Parallelism in Hardware

Parallelism in Hardware Parallelism in Hardware Minsoo Ryu Department of Computer Science and Engineering 2 1 Advent of Multicore Hardware 2 Multicore Processors 3 Amdahl s Law 4 Parallelism in Hardware 5 Q & A 2 3 Moore s Law

More information

Advanced Computer Architecture

Advanced Computer Architecture Advanced Computer Architecture 1 L E C T U R E 2: M E M O R Y M E M O R Y H I E R A R C H I E S M E M O R Y D A T A F L O W M E M O R Y A C C E S S O P T I M I Z A T I O N J A N L E M E I R E Memory Just

More information

Big and Fast. Anti-Caching in OLTP Systems. Justin DeBrabant

Big and Fast. Anti-Caching in OLTP Systems. Justin DeBrabant Big and Fast Anti-Caching in OLTP Systems Justin DeBrabant Online Transaction Processing transaction-oriented small footprint write-intensive 2 A bit of history 3 OLTP Through the Years relational model

More information

The Memory Hierarchy. Cache, Main Memory, and Virtual Memory (Part 2)

The Memory Hierarchy. Cache, Main Memory, and Virtual Memory (Part 2) The Memory Hierarchy Cache, Main Memory, and Virtual Memory (Part 2) Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University Cache Line Replacement The cache

More information

Advanced Computer Architecture (CS620)

Advanced Computer Architecture (CS620) Advanced Computer Architecture (CS620) Background: Good understanding of computer organization (eg.cs220), basic computer architecture (eg.cs221) and knowledge of probability, statistics and modeling (eg.cs433).

More information

Speculative Lock Elision: Enabling Highly Concurrent Multithreaded Execution

Speculative Lock Elision: Enabling Highly Concurrent Multithreaded Execution Speculative Lock Elision: Enabling Highly Concurrent Multithreaded Execution Ravi Rajwar and Jim Goodman University of Wisconsin-Madison International Symposium on Microarchitecture, Dec. 2001 Funding

More information

Lecture: DRAM Main Memory. Topics: virtual memory wrap-up, DRAM intro and basics (Section 2.3)

Lecture: DRAM Main Memory. Topics: virtual memory wrap-up, DRAM intro and basics (Section 2.3) Lecture: DRAM Main Memory Topics: virtual memory wrap-up, DRAM intro and basics (Section 2.3) 1 TLB and Cache Is the cache indexed with virtual or physical address? To index with a physical address, we

More information

CPS 303 High Performance Computing. Wensheng Shen Department of Computational Science SUNY Brockport

CPS 303 High Performance Computing. Wensheng Shen Department of Computational Science SUNY Brockport CPS 303 High Performance Computing Wensheng Shen Department of Computational Science SUNY Brockport Chapter 1: Introduction to High Performance Computing van Neumann Architecture CPU and Memory Speed Motivation

More information