Chapter 2. Parallel Hardware and Parallel Software. An Introduction to Parallel Programming. The Von Neumann Architecture


An Introduction to Parallel Programming, Peter Pacheco
Chapter 2: Parallel Hardware and Parallel Software

The Von Neumann Architecture
- Control unit: responsible for deciding which instruction in a program should be executed (the boss).
- ALU (arithmetic and logic unit): responsible for executing the actual instructions (the worker).
- Memory: a collection of locations, each of which is capable of storing instructions or data. Every location has an address, which is used to access it.

A Process (a Task)
- An instance of a computer program that is being executed.
- Components of a process: memory space and program data, security information, the state of the process (values of registers, including the program counter and stack pointer), and allocated resources.
- Multitasking (multiprogramming) gives the illusion that multiple processes are running simultaneously: each process takes turns running for a time slice, and the OS switches to another process when the running one blocks or its time is up.

The Memory Wall
- The memory system is often the bottleneck for many applications.
- Memory performance metrics: latency (the time between issuing a request and receiving the reply) and bandwidth (the rate at which memory delivers data; 1/bandwidth is the gap between successive words). Memory may consist of interleaved modules.

Memory Latency: An Example
- A 1 GHz processor (1 ns clock) with a multiply-add unit capable of executing one multiply/add in each cycle. The peak processor rating is 2 GFLOPS (2 x 10^9 floating point operations per second).
- We want to compute the dot product of two vectors x[0], ..., x[n-1] and y[0], ..., y[n-1] (a C sketch of this loop is given after this slide):

      DP = Σ_{i=0}^{n-1} x[i] * y[i]

- Peak performance requires two operands every ns.
- If DRAM has a latency of 100 ns, then it can supply only one operand every 100 cycles. Hence the processor can do one multiply/add (2 FLOPs) every 200 ns, that is, one FLOP every 100 ns. Actual performance = 10 MFLOPS.

Overcoming the Memory Wall
- Use caches: take advantage of spatial and temporal locality. On a cache hit the CPU gets the requested data word directly from the cache; on a cache miss a whole cache line is brought in from memory.
- Use prefetching: performance can be improved by pre-fetching data, in which case performance depends on memory bandwidth rather than latency. In the example above, if the bandwidth is such that one operand can be fetched every 5 ns, then the processor can do a multiply/add every 10 ns, that is, one FLOP every 5 ns (200 MFLOPS), still not enough to match the 2 GFLOPS processor.
- Use multithreading (discussed below).
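For reference, a minimal C sketch of the dot product loop analyzed above (the names x, y and n come from the slide; the function wrapper itself is an assumption):

    /* DP = x[0]*y[0] + ... + x[n-1]*y[n-1].  Each iteration needs two operands
       from memory (x[i] and y[i]) for one multiply and one add, which is why
       memory latency and bandwidth limit the achievable FLOP rate. */
    double dot_product(const double *x, const double *y, int n) {
        double dp = 0.0;
        for (int i = 0; i < n; i++)
            dp += x[i] * y[i];
        return dp;
    }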

Review of Cache Operation
- Issues to deal with: How do we know if a data item is in the cache? If it is, how do we find it? If it is not, what do we do?
- It boils down to two questions: where do we put an item in the cache, and how do we identify the items that are in the cache?
- Two solutions: put an item anywhere in the cache (associative cache), or associate specific cache locations with specific items (direct mapped cache).

Cache Mappings
- Fully associative: a new line can be placed at any location in the cache.
- Direct mapped: each cache line has a unique location in the cache to which it will be assigned (see the sketch below).
- n-way set associative: each cache line can be placed in one of n different locations in the cache.
- When a line in memory can be mapped to several locations in the cache, we need to decide which line should be replaced or evicted.
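As a concrete illustration of the direct mapped case, the sketch below computes the unique cache slot and the tag for a given word address. The 4-word lines and 4-line cache match the example that follows, but the helper functions themselves are assumptions rather than code from the slides.

    #include <stdint.h>

    #define WORDS_PER_LINE 4   /* line size used in the following example */
    #define CACHE_LINES    4   /* a 4-line direct mapped cache */

    /* A memory line can live in exactly one slot: (line number) mod (number of
       cache lines).  A fully associative cache would allow any slot instead. */
    unsigned cache_index(uint32_t word_addr) {
        uint32_t line = word_addr / WORDS_PER_LINE;   /* which memory line */
        return line % CACHE_LINES;                    /* its unique cache slot */
    }

    /* The tag stored with the slot identifies which memory line occupies it. */
    uint32_t cache_tag(uint32_t word_addr) {
        return (word_addr / WORDS_PER_LINE) / CACHE_LINES;
    }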

Example
- Table 2.1: assignments of a 16-line main memory to a 4-line cache (table not reproduced in this transcription).

Effect of Cache Line Size
- Direct mapped cache with line size = 4 words.
- The performance of the second pair of loops is bad if the cache can hold only 2 lines (a sketch of such a loop pair follows).
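The pairs of loops referred to above are not reproduced in the transcription. Below is a sketch of the kind of loop pair usually contrasted here, under the assumption that it is a matrix-vector product (the names A, x, y and the size MAX are therefore assumptions):

    #define MAX 1000
    double A[MAX][MAX], x[MAX], y[MAX];

    /* First ordering: A is traversed row by row.  With row-major storage and
       4-word cache lines, each line brought in for A is reused 4 times. */
    void row_order(void) {
        for (int i = 0; i < MAX; i++)
            for (int j = 0; j < MAX; j++)
                y[i] += A[i][j] * x[j];
    }

    /* Second ordering: A is traversed column by column.  Successive accesses to
       A are MAX elements apart, so with a cache that holds only 2 lines nearly
       every access to A misses. */
    void column_order(void) {
        for (int j = 0; j < MAX; j++)
            for (int i = 0; i < MAX; i++)
                y[i] += A[i][j] * x[j];
    }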

Effect of Data Layout: An Example
Consider the following code fragment:

    for (j = 0; j < 1000; j++) {
        column_sum[j] = 0.0;
        for (i = 0; i < 1000; i++)
            column_sum[j] += b[i][j];
    }

- The code fragment sums the columns of the matrix b into the vector column_sum.
- The vector column_sum is small and easily fits into the cache.
- The matrix b is accessed in column order.
- With row-major storage, this strided access results in very poor performance.
- What is row-major storage? The matrix is stored row by row; for example, a 10 x 10 matrix b is laid out in memory as b(1,1), b(1,2), ..., b(1,10), b(2,1), b(2,2), ..., b(2,10), ..., b(10,1), ..., b(10,10).

Change the Access Pattern
We can fix the above code as follows:

    for (j = 0; j < 1000; j++)
        column_sum[j] = 0.0;
    for (i = 0; i < 1000; i++)
        for (j = 0; j < 1000; j++)
            column_sum[j] += b[i][j];

- In this case the matrix is traversed in row order, and performance can be expected to be significantly better. Why? Because consecutive accesses to b now fall within the same cache line, so each line fetched from memory is fully used.
- Memory layout and the order of computation can have a significant impact on spatial and temporal locality.

Issues with Caches
- When a CPU writes data to the cache, the value in the cache may become inconsistent with the value in main memory.
- Write-through caches handle this by updating the data in main memory at the time it is written to the cache.
- Write-back caches mark the data in the cache as dirty. When the cache line is replaced by a new cache line from memory, the dirty line is written back to memory (a toy sketch of both policies appears below).

Multiple Levels of Cache in Multi-core Processors
- Each core has its own L1 and L2 caches (the smallest and fastest levels); all cores share an L3 cache (the largest and slowest level), which in turn is backed by off-chip memory.
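A toy sketch contrasting the two write policies on a single cache line; the struct and functions are purely illustrative assumptions, since real caches implement this logic in hardware.

    #include <stdint.h>
    #include <string.h>

    #define LINE_WORDS 4

    typedef struct {
        int      dirty;                   /* used only by the write-back policy */
        uint32_t data[LINE_WORDS];
    } cache_line;

    /* Write-through: every store updates both the cache line and main memory,
       so memory never becomes stale. */
    void store_write_through(cache_line *line, uint32_t *memory,
                             uint32_t addr, uint32_t value) {
        line->data[addr % LINE_WORDS] = value;
        memory[addr] = value;
    }

    /* Write-back: the store only touches the cache and marks the line dirty. */
    void store_write_back(cache_line *line, uint32_t addr, uint32_t value) {
        line->data[addr % LINE_WORDS] = value;
        line->dirty = 1;
    }

    /* On eviction, a dirty line is written back to memory in one piece. */
    void evict_write_back(cache_line *line, uint32_t *memory, uint32_t base_addr) {
        if (line->dirty) {
            memcpy(&memory[base_addr], line->data, sizeof line->data);
            line->dirty = 0;
        }
    }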

Multithreading: Originally Designed as an Alternative Approach for Hiding Memory Latency
We illustrate threads with the simple example of the dot product of two vectors, x and y.

Sequential version:

    dp = 0;
    for (i = 0; i < n; i++)
        dp += x[i] * y[i];

Version partitioned into four partial products (the arrays x, y and pdp are shared):

    dp = 0;
    for (k = 0; k < 4; k++)
        partial_product(k*n/4, n/4);
    for (k = 0; k < 4; k++)
        dp += pdp[k];

    void partial_product(a, b)
    {
        /* k identifies the block; in real code it would also be passed in
           (see the Pthreads sketch below) */
        pdp[k] = 0;
        for (i = a; i < a+b; i++)
            pdp[k] += x[i] * y[i];
        return;
    }

Forking and Joining Threads
Instead of calling the function partial_product(), the master thread creates new threads that execute that function, then waits for them all to return:

    dp = 0;
    for (k = 0; k < 4; k++)                          /* fork threads */
        create_thread(partial_product, k*n/4, n/4);
    wait until all threads return;                   /* join threads */
    for (k = 0; k < 4; k++)
        dp += pdp[k];
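The slides use a generic create_thread(). A runnable POSIX Threads version of the same fork/join pattern might look like the sketch below; the THREADS constant, passing the block index k as the thread argument, and the initialization in main are all assumptions added to make the example self-contained.

    #include <pthread.h>
    #include <stdio.h>

    #define N       1000
    #define THREADS 4

    double x[N], y[N], pdp[THREADS];      /* shared by all threads */

    void *partial_product(void *arg) {
        long k = (long) arg;              /* which block this thread handles */
        long a = k * N / THREADS;         /* start of the block */
        long b = N / THREADS;             /* block length */
        pdp[k] = 0.0;
        for (long i = a; i < a + b; i++)
            pdp[k] += x[i] * y[i];
        return NULL;
    }

    int main(void) {
        pthread_t tid[THREADS];
        double dp = 0.0;

        for (long i = 0; i < N; i++) { x[i] = 1.0; y[i] = 2.0; }

        for (long k = 0; k < THREADS; k++)               /* fork threads */
            pthread_create(&tid[k], NULL, partial_product, (void *) k);
        for (long k = 0; k < THREADS; k++)               /* join threads */
            pthread_join(tid[k], NULL);
        for (long k = 0; k < THREADS; k++)               /* combine partial results */
            dp += pdp[k];

        printf("dp = %f\n", dp);          /* expect 2000.0 for this test data */
        return 0;
    }

Compile with something like cc -pthread dotprod.c.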

A Thread Is a Lightweight Process
- Threads within the same process share the memory address space.
- Each thread has its own registers, program counter and stack pointer.
- The OS schedules processes, but a thread library function schedules the threads within a process.

Multithreading for Latency Hiding: Example
- In the example of slide 16, the first thread accesses a pair of vector elements and waits for them. In the meantime, the second thread can access two other vector elements in the next cycle, and so on.
- After L cycles, where L is the latency of the memory system, the first thread gets the requested data from memory and can perform B computations, where B is the number of words in a cache block. When it is done, the data items for the next thread arrive, and so on.
- If there are L/B threads, we can perform a computation in every clock cycle (for example, with L = 100 cycles and B = 4 words per block, 25 threads are enough to keep the processor busy).
- Assumptions made: the memory system is capable of servicing multiple outstanding requests and its bandwidth can keep up with the processing rate, and there is no time penalty for switching threads.

Virtual Memory
- If we run a very large program, or a program that accesses very large data sets, all of the instructions and data may not fit into main memory.
- Virtual memory is a cache for secondary storage (swap space); it exploits the principles of spatial and temporal locality.
- It keeps only the active parts of running programs in main memory (the unit of storage is a page), so several programs can share main memory at once.

Virtual Page Numbers
- When a program is compiled, its pages are assigned virtual page numbers.
- When the program is run, a table is created that maps the virtual page numbers to physical addresses.
- A page table is used to translate each virtual address into a physical address: the virtual address is divided into a virtual page number (the high-order bits) and a byte offset (the low-order bits), as sketched below.
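A small sketch of the virtual page number / byte offset split described above, assuming 4 KB pages (the page size and the sample address are assumptions; the slide's figure only shows the split itself):

    #include <stdint.h>
    #include <stdio.h>

    #define PAGE_BITS 12   /* 4 KB pages => 12 offset bits */

    int main(void) {
        uint64_t vaddr  = 0x12345678;                        /* sample address */
        uint64_t vpn    = vaddr >> PAGE_BITS;                /* high-order bits */
        uint64_t offset = vaddr & ((1u << PAGE_BITS) - 1);   /* low 12 bits */
        printf("vpn = 0x%llx, offset = 0x%llx\n",
               (unsigned long long) vpn, (unsigned long long) offset);
        return 0;
    }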

Translation-Lookaside Buffer (TLB)
- Using a page table for every memory access has the potential to significantly increase each program's overall run-time.
- The TLB is a special address translation cache in the processor. It caches a small number of entries (typically 16-512) from the page table in very fast memory (a toy lookup sketch appears below).
- Page fault: an attempt to access a page whose page-table entry is valid but whose contents are currently stored only on disk rather than in main memory.

Instruction-Level Parallelism (ILP)
- Attempts to improve processor performance by having multiple processor components or functional units simultaneously executing instructions.
- Pipelining: functional units are arranged in stages.
- Multiple issue: multiple instructions can be initiated simultaneously.
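Returning to the TLB slide above, here is a toy sketch of how a TLB sits in front of the page table. All sizes (16 fully associative entries, 4 KB pages, a small array standing in for the page table) and the replacement rule are assumptions; a real TLB is a hardware cache, and this only illustrates the lookup logic.

    #include <stdint.h>

    #define TLB_ENTRIES 16
    #define PAGE_BITS   12
    #define NUM_PAGES   1024

    typedef struct { uint64_t vpn, ppn; int valid; } tlb_entry;

    static tlb_entry tlb[TLB_ENTRIES];
    static uint64_t  page_table[NUM_PAGES];     /* vpn -> ppn, filled elsewhere */

    uint64_t translate(uint64_t vaddr) {
        uint64_t vpn    = vaddr >> PAGE_BITS;
        uint64_t offset = vaddr & ((1u << PAGE_BITS) - 1);

        for (int i = 0; i < TLB_ENTRIES; i++)             /* fast path: TLB hit */
            if (tlb[i].valid && tlb[i].vpn == vpn)
                return (tlb[i].ppn << PAGE_BITS) | offset;

        uint64_t ppn = page_table[vpn % NUM_PAGES];       /* slow path: page table */
        tlb[vpn % TLB_ENTRIES] = (tlb_entry){ vpn, ppn, 1 };  /* simple replacement */
        return (ppn << PAGE_BITS) | offset;
    }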

Pipelining
- Example: add the floating point numbers 9.87 x 10^4 and 6.54 x 10^3. The addition is carried out as a sequence of smaller operations (e.g., fetch the operands, compare exponents, shift one operand, add, normalize, round, store).

Pipelining Example
- Assume each of these operations takes one nanosecond. A loop performing 1000 such additions (sketched below) takes about 7000 nanoseconds.
- Divide the floating point adder into 7 separate pieces of hardware, or functional units, with the output of one functional unit feeding the input of the next.
- The loop then takes 1006 nanoseconds: the first sum is finished after 7 ns and one new sum completes every nanosecond thereafter (7 + 999 = 1006).
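The loop being timed is not shown in the transcription; assuming it is the same element-wise addition loop that reappears on the multiple issue slide, the arithmetic works out as in the comments below.

    float x[1000], y[1000], z[1000];

    /* Each float addition takes about 7 one-nanosecond sub-operations, so an
       unpipelined adder needs 1000 * 7 ns = 7000 ns for the whole loop.  A
       7-stage pipelined adder finishes the first sum after 7 ns and completes
       one more sum every ns after that: 7 + 999 = 1006 ns. */
    void vector_add(void) {
        for (int i = 0; i < 1000; i++)
            z[i] = x[i] + y[i];
    }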

Pipelining (continued)
- Pipelined addition table (not reproduced in this transcription); the numbers in the table are the subscripts of the operands/results as they move through the stages.

Multiple Issue
- Multiple issue processors replicate functional units and try to simultaneously execute different instructions in a program. For example, with two adders running the loop

      for (i = 0; i < 1000; i++)
          z[i] = x[i] + y[i];

  adder #1 can compute z[1] while adder #2 computes z[2]; in the next step they compute z[3] and z[4], and so on.
- VLIW: functional units are scheduled at compile time (static scheduling).
- Superscalars: functional units are scheduled at run time (dynamic scheduling).

Speculation
- In order to make use of multiple issue, the system must find instructions that can be executed simultaneously.
- In speculation, the compiler or the processor makes a guess about an instruction and then executes instructions on the basis of that guess.

      z = x + y;
      if (z > 0)
          w = x - y;
      else
          w = x + y;

  Here the system might speculate that z will be positive and begin executing w = x - y before z is known. If the system speculates incorrectly, it must go back and recalculate w.

Multi-threading
- Provides a means to continue doing useful work when the currently executing task has stalled (e.g., waiting out a long memory latency).
- Software-based multithreading (e.g., POSIX Threads): the hardware traps on a long-latency operation, and software then switches the running process to another (context switching). The main issue is the overhead of the switch.
- Hardware-based multithreading: for user-defined threads or compiler-generated threads, the hardware supports context switching by replicating registers (including the PC and stack pointer). Examples: the IBM POWER5 and the Pentium 4 support SMT.

Hardware Multi-threading
- Fine-grain multithreading: switch threads after each cycle, interleaving instruction execution; if one thread stalls, the others are executed.
- Coarse-grain multithreading: only switch on a long stall (e.g., an L2-cache miss). This simplifies the hardware but doesn't hide short stalls (such as the stalls resulting from data hazards).
- SMT (simultaneous multithreading) in a multiple-issue, dynamically scheduled processor: instructions from multiple threads are scheduled together, and instructions from independent threads execute whenever functional units are available.

Hardware-Based Multi-threading
- (Illustration comparing these approaches; figure not reproduced in this transcription.)