CMSC 611: Advanced Computer Architecture

Some material adapted from Mohamed Younis, UMBC CMSC 611 Spr 2003 course slides. Some material adapted from Hennessy & Patterson / 2003 Elsevier.

CMSC 611: Advanced Computer Architecture, I/O and Disk. Most slides adapted from David Patterson; some from Mohamed Younis.

Main Memory Background
Performance of main memory:
- Latency: affects cache miss penalty. Measured by:
  - Access time: time between a request and the word arriving
  - Cycle time: minimum time between requests
- Bandwidth: the primary concern for I/O and large block transfers
Main memory is DRAM (dynamic RAM):
- "Dynamic" because it must be refreshed periodically
- Addresses are divided into 2 halves (row/column)
Caches use SRAM (static RAM):
- No refresh; 6 transistors/bit vs. 1 transistor/bit, about 10X the area
- Address not divided: the full address is presented at once

DRAM Logical Organization
- 4 Mbit DRAM: square array, with the square root of the bits per RAS/CAS line
- Refreshing prevents access to the DRAM (typically 1-5% of the time)
- Reading one byte refreshes the entire row
- Reads are destructive, so data must be rewritten after reading
- Cycle time is therefore significantly larger than access time

Processor-Memory Performance Gap
[Figure: processor vs. DRAM performance, 1980-2000, log scale. CPU performance ("Moore's Law") improves 60%/yr (2X per 1.5 years); DRAM improves 9%/yr (2X per 10 years); the processor-memory performance gap grows about 50% per year.]
Problem: improvements in DRAM access time are not enough to catch up.
Solution: increase the bandwidth of main memory (improve throughput).

Memory Organization
[Figure: three organizations — a) one-word-wide memory organization, b) wide memory organization with a multiplexor between CPU and cache, c) interleaved memory organization with four banks.]
- Simple: CPU, cache, bus, and memory all the same width (e.g., 32 bits)
- Wide: CPU/mux 1 word; mux/cache, bus, and memory N words
- Interleaved: CPU, cache, and bus 1 word; memory is N modules (4 banks in the example, word-interleaved)
Memory organization has a significant effect on bandwidth.

Interleaving
[Figure: access timing with and without interleaving.] Without interleaving, after D1 becomes available the CPU must wait out bank 0's full cycle time before starting the access for D2. With 4-way interleaving, accesses to banks 0, 1, 2, and 3 are started in successive cycles and overlap; by the time bank 3 has been started, bank 0 can be accessed again.

Independent Banks
- Original motivation for interleaving: faster sequential access
- Multiple independent accesses also benefit:
  - Multiprocessor systems / concurrent execution
  - I/O: limiting memory-access contention
  - A CPU with "hit under n misses" (non-blocking cache)
- Multiple independent accesses require a controller, an address bus, and possibly a data bus per bank
- Superbank: all memory active on one block transfer
- Bank (or subbank): portion within a superbank that is word-interleaved
- Superbanks act as separate memories mapped to the same address space

Avoiding Bank Conflicts
The effectiveness of interleaving depends on how frequently independent requests go to different banks.
- Bank number = address mod number of banks
- Address within bank = address / number of banks
Example, assuming 128 banks:

int x[256][512];
for (j = 0; j < 512; j = j+1)
    for (i = 0; i < 256; i = i+1)
        x[i][j] = 2 * x[i][j];

Since 512 is a multiple of 128, every element of a column maps to the same bank, so the inner loop's accesses all conflict.

Avoiding Bank Conflicts
Solutions:
- SW: loop interchange, or declaring the array so its dimension is not a power of 2 ("array padding")
- HW: a prime number of banks with modulo interleaving
  - Doesn't a modulo and divide per memory access with a prime number of banks sound too complex?
  - Address calculation stays simple by applying the Chinese Remainder Theorem

Chinese Remainder Theorem
As long as two sets of integers a_i and b_i follow these rules:
- b_i = x mod a_i
- 0 <= b_i < a_i
- 0 <= x < a_0 * a_1 * a_2 * ...
- a_i and a_j are co-prime for i != j
then the integer x has only one solution: the residues b_i determine x uniquely.

Apply to Bank Addressing
Modulo interleaving:
- Bank number = b_0, number of banks = a_0
- Address within bank = b_1, words in bank = a_1
- Bank number < number of banks (0 <= b_0 < a_0)
- Address within bank < words in a bank (0 <= b_1 < a_1)
- Address < number of banks * words in bank (0 <= x < a_0 * a_1)
- The number of banks (a_0) and the words in a bank (a_1) are co-prime: a_1 a power of 2, a_0 a prime > 2

Example
- Bank number = address mod number of banks
- Address within bank = address mod words in bank
- Bank number = b_0, number of banks = a_0 (= 3 in the example)
- Address within bank = b_1, words in bank = a_1 (= 8 in the example)

                        Seq. Interleaved    Modulo Interleaved
Bank number:             0   1   2           0   1   2
Address within bank:
  0                      0   1   2           0  16   8
  1                      3   4   5           9   1  17
  2                      6   7   8          18  10   2
  3                      9  10  11           3  19  11
  4                     12  13  14          12   4  20
  5                     15  16  17          21  13   5
  6                     18  19  20           6  22  14
  7                     21  22  23          15   7  23

Unambiguous mapping with simple bank addressing.

DRAM-Specific Optimization: Access Interleaving
DRAM must buffer a row of bits internally for the column access. Performance can be improved by allowing repeated accesses to this buffer without another row access time, which requires minimal additional cost:
- Page mode: the buffer acts like an SRAM; by changing the column address, bits can be accessed from the buffer until the next row change or refresh
- Static column: similar to page mode, but does not require toggling CAS to access another bit from the buffer
Such DRAM optimizations have been shown to give up to a 4x speedup.

DRAM-Specific Optimization: Bus-Based DRAM (RAMBUS)
- Each chip acts as a module with an internal bus replacing CAS and RAS
- Allows other accesses to take place between sending the address and returning the data
- Each module performs its own refresh
- Performance can reach 1 byte per 2 ns (500 MB/s per chip)
- Expensive compared to traditional DRAM

Input/Output
[Figure: two computers, each with a processor (control + datapath), memory, and input/output devices, connected by a network.]
- I/O interface: device drivers, device controllers, service queues, interrupt handling
- Design issues: performance, expandability, standardization, resilience to failure
- Impact on tasks: blocking conditions, priority inversion, access ordering

Impact of I/O on System Performance
Suppose we have a benchmark that executes in 100 seconds of elapsed time, where 90 seconds is CPU time and the rest is I/O time. If the CPU time improves by 50% per year for the next five years but I/O time does not improve, how much faster will our program run at the end of the five years?
Answer: Elapsed time = CPU time + I/O time

After n years   CPU time           I/O time     Elapsed time   % I/O time
0               90 seconds         10 seconds   100 seconds    10%
1               90/1.5 = 60 sec    10 seconds    70 seconds    14%
2               60/1.5 = 40 sec    10 seconds    50 seconds    20%
3               40/1.5 = 27 sec    10 seconds    37 seconds    27%
4               27/1.5 = 18 sec    10 seconds    28 seconds    36%
5               18/1.5 = 12 sec    10 seconds    22 seconds    45%

Over five years: CPU improvement = 90/12 = 7.5x, BUT system improvement = 100/22 = 4.5x.

Typical I/O System
[Figure: processor with cache and interrupt lines on a memory-I/O bus, with main memory and I/O controllers attached to disks, graphics, and a network.]
The connection between the I/O devices, processor, and memory is usually called the (local or internal) bus. Communication among the devices and the processor uses both protocols on the bus and interrupts.

I/O Device Examples

Device            Behavior         Partner   Data Rate (KB/sec)
Keyboard          Input            Human               0.01
Mouse             Input            Human               0.02
Line printer      Output           Human               1.00
Floppy disk       Storage          Machine            50.00
Laser printer     Output           Human             100.00
Optical disk      Storage          Machine           500.00
Magnetic disk     Storage          Machine         5,000.00
Network (LAN)     Input or output  Machine     20 - 1,000.00
Graphics display  Output           Human          30,000.00

Storage Technology Drivers Driven by the prevailing computing paradigm: 1950s: migration from batch to on-line processing 1990s: migration to ubiquitous computing Computers in phones, books, cars, video cameras, Nationwide fiber optical network with wireless tails Effects on storage industry: Embedded storage: smaller, cheaper, more reliable, lower power Data utilities: high capacity, hierarchically managed storage

Disk History
[Figure: disk drives over time, showing data density in Mbit per square inch and the capacity in megabytes of each unit shown; source: New York Times, 2/23/98, page C3]

Magnetic Disk
Purpose: long-term, nonvolatile storage. Disks are large, inexpensive, and slow, at the lowest level of the memory hierarchy (registers, cache, main memory, then disk).
Characteristics:
- Rely on rotating platters coated with a magnetic surface
- Use a moveable read/write head to access the disk
- Platters are rigid (metal or glass)

Organization of a Hard Magnetic Disk
[Figure: platters divided into tracks and sectors.]
Typical numbers (depending on the disk size):
- 500 to 2,000 tracks per surface
- 32 to 128 sectors per track
A sector is the smallest unit that can be read or written. Traditionally, all tracks have the same number of sectors; constant bit density would instead record more sectors on the outer tracks. Recently this constraint has been relaxed: bits are kept at a constant size, and the speed varies with the track location.