CSE 490/590 Computer Architecture: Cache I


CSE 490/590 Computer Architecture
Cache I
Steve Ko
Computer Sciences and Engineering, University at Buffalo

Last Time
- Pipelining hazards: structural hazards, data hazards, control hazards
- Data hazards: stall, bypass
- Control hazards: jump, conditional branch

Branch Delay Slots (expose control hazard to software)
- Change the ISA semantics so that the instruction that follows a jump or branch is always executed. This gives the compiler the flexibility to put in a useful instruction where normally a pipeline bubble would have resulted.

I1  096  ADD
I2  100  BEQZ r1 +200
I3  104  ADD     (delay slot instruction, executed regardless of branch outcome)
I4  304  ADD

- Other techniques include more advanced branch prediction, which can dramatically reduce the branch penalty... to come later

Branch Pipeline Diagrams (branch delay slot)

time                  t0   t1   t2   t3   t4   t5   t6   t7
(I1) 096: ADD         IF1  ID1  EX1  MA1  WB1
(I2) 100: BEQZ +200        IF2  ID2  EX2  MA2  WB2
(I3) 104: ADD                   IF3  ID3  EX3  MA3  WB3
(I4) 304: ADD                        IF4  ID4  EX4  MA4  WB4

Resource usage

time  t0   t1   t2   t3   t4   t5   t6   t7
IF    I1   I2   I3   I4
ID         I1   I2   I3   I4
EX              I1   I2   I3   I4
MA                   I1   I2   I3   I4
WB                        I1   I2   I3   I4

Why an Instruction may not be dispatched every cycle (CPI>1)
- Full bypassing may be too expensive to implement
  - typically all frequently used paths are provided
  - some infrequently used bypass paths may increase cycle time and counteract the benefit of reducing CPI
- Loads have two-cycle latency
  - Instruction after load cannot use load result
  - MIPS-I ISA defined load delay slots, a software-visible pipeline hazard (compiler schedules independent instruction or inserts NOP to avoid hazard). Removed in MIPS-II (pipeline interlocks added in hardware)
  - MIPS: "Microprocessor without Interlocked Pipeline Stages"
- Conditional branches may cause bubbles
  - kill following instruction(s) if no delay slots

Early Read-Only Memory Technologies
- Punched cards, from early 1700s through Jacquard Loom, Babbage, and then IBM
- Diode Matrix, EDSAC-2 µcode store
- Punched paper tape, instruction stream in Harvard Mk 1
- IBM Card Capacitor ROS
- IBM Balanced Capacitor ROS
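As an illustration of the branch pipeline diagram earlier in this transcript, the following is a minimal sketch that regenerates the stage-per-cycle rows for the four-instruction example. It assumes a five-stage pipeline (IF, ID, EX, MA, WB), one instruction issued per cycle, no stalls, and a taken branch; the function name `pipeline_rows` is made up for this example.

```python
STAGES = ["IF", "ID", "EX", "MA", "WB"]  # classic five-stage pipeline

def pipeline_rows(instructions):
    """Return (name, row) pairs; row[t] is the stage occupied in cycle t.

    Assumes one instruction enters the pipeline per cycle with no stalls,
    which is the situation the delay slot creates: the slot instruction
    occupies the cycle that would otherwise be a bubble.
    """
    total_cycles = len(instructions) + len(STAGES) - 1
    rows = []
    for i, name in enumerate(instructions):
        row = []
        for t in range(total_cycles):
            k = t - i  # how many cycles instruction i has been in flight
            row.append(STAGES[k] if 0 <= k < len(STAGES) else "--")
        rows.append((name, row))
    return rows

# The example from the slides, assuming the branch is taken (target 304).
program = [
    "096: ADD",
    "100: BEQZ r1 +200",
    "104: ADD (delay slot)",
    "304: ADD (branch target)",
]

for name, row in pipeline_rows(program):
    print(f"{name:26s}" + " ".join(row))
```

Each instruction starts one cycle after its predecessor, so four instructions finish in eight cycles (t0 through t7) with no bubble after the branch.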

Early Read/Write Main Memory Technologies
- Babbage, 1800s: Digits stored on mechanical wheels
- Williams Tube, Manchester Mark 1, 1947
- Mercury Delay Line, Univac 1, 1951
- Also, regenerative capacitor memory on Atanasoff-Berry computer, and rotating magnetic drum memory on IBM 650

Semiconductor Memory
- Semiconductor memory began to be competitive in early 1970s
  - Intel formed to exploit market for semiconductor memory
  - Early semiconductor memory was Static RAM (SRAM). SRAM cell internals similar to a latch (cross-coupled inverters).
- First commercial Dynamic RAM (DRAM) was Intel 1103
  - 1Kbit of storage on single chip
  - charge on a capacitor used to hold value
- Semiconductor memory quickly replaced core in '70s

Modern DRAM Structure
[Samsung, sub-70nm DRAM, 2004]

DRAM Architecture
Figure: an N+M bit address is split into N row-address bits, which feed the row address decoder driving word lines (Row 1 ... Row 2^N), and M column-address bits, which feed the column decoder and sense amplifiers on the bit lines (Col. 1 ... Col. 2^M). Each memory cell holds one bit; data D is read out through the column decoder.
- Bits stored in 2-dimensional arrays on chip
- Modern chips have around 4 logical banks on each chip
  - each logical bank physically implemented as many smaller arrays

DRAM Packaging
Figure: DRAM chip with
- clock and control signals: ~7 lines
- address lines, multiplexed row/column address: ~12
- data bus (4b, 8b, 16b, 32b)

- DIMM (Dual Inline Memory Module) contains multiple chips with clock/control/address signals connected in parallel (sometimes need buffers to drive signals to all chips)
- Data pins work together to return wide word (e.g., 64-bit data bus using 16x4-bit parts)

DRAM Operation
Three steps in read/write access to a given bank
- Row access (RAS)
  - decode row address, enable addressed row (often multiple Kb in row)
  - bitlines share charge with storage cell
  - small change in voltage detected by sense amplifiers, which latch whole row of bits
  - sense amplifiers drive bitlines full rail to recharge storage cells
- Column access (CAS)
  - decode column address to select small number of sense amplifier latches (4, 8, 16, or 32 bits depending on DRAM package)
  - on read, send latched bits out to chip pins
  - on write, change sense amplifier latches, which then charge storage cells to required value
  - can perform multiple column accesses on same row without another row access (burst mode)
- Precharge
  - charges bit lines to known value, required before next row access

Each step has a latency of around 15-20ns in modern DRAMs.
Various DRAM standards (DDR, RDRAM) have different ways of encoding the signals for transmission to the DRAM, but all share same core architecture.
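To make the three-step DRAM access concrete, here is a rough sketch of how the per-step latency adds up for a sequence of accesses to one bank. It assumes 15 ns per step (the low end of the 15-20 ns range quoted in the slides), a bank that starts precharged, and a controller that leaves the last row latched in the sense amplifiers; the function names and the open-row assumption are illustrative, not taken from the slides.

```python
STEP_NS = 15  # assumed latency per step; the slides quote roughly 15-20 ns

def access_latency_ns(open_row, requested_row, step_ns=STEP_NS):
    """Latency of one access to a bank.

    open_row is the row currently latched in the sense amplifiers,
    or None if the bank is precharged and no row is open.
    """
    if open_row == requested_row:
        return step_ns          # column access only (same row still latched)
    if open_row is None:
        return 2 * step_ns      # row access + column access
    return 3 * step_ns          # precharge + row access + column access

def trace_latency_ns(rows, step_ns=STEP_NS):
    """Total latency for a sequence of row numbers accessed in one bank."""
    open_row = None  # assume the bank starts precharged
    total = 0
    for row in rows:
        total += access_latency_ns(open_row, row, step_ns)
        open_row = row  # assumed policy: keep the row open after the access
    return total

print(trace_latency_ns([5, 5, 5, 5]))  # repeated accesses to one row
print(trace_latency_ns([5, 6, 5, 6]))  # alternating rows force precharge each time
```

Under these assumptions, accesses that stay within one row pay the row access once, which is the benefit the slides attribute to performing multiple column accesses on the same row.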

CPU-Memory Bottleneck
Performance of high-speed computers is usually limited by memory bandwidth & latency
- Latency (time for a single access)
  - Memory access time >> processor cycle time
  - Problematic
- Bandwidth (number of accesses per unit time)
  - Increase the bus size, etc.
  - Usually OK

Processor-DRAM Gap (latency)
Figure: performance (log scale, 1 to 1000) versus time (1980 to 2000). µProc improves 60%/year; DRAM improves 7%/year; the processor-memory performance gap grows 50%/year.
Four-issue 2GHz superscalar accessing 100ns DRAM could execute 800 instructions during time for one memory access!

Physical Size Affects Latency
Figure: CPU with small memory versus CPU with big memory. In the big memory:
- Signals have further to travel
- Fan out to more locations

CSE 490/590 Administrivia
- Recitations next week & the week after: very important to attend
- Guest lectures: there will be a couple guest lectures late Feb/early Mar.
- Quiz 1
  - Rescheduled: Fri, 2/11
  - Closed book, in-class
  - Includes lectures until last Monday (1/31)
  - Review: next Wed (2/9)

Memory Hierarchy
Figure: CPU (A) connected to a small, fast memory (RF, SRAM) (B), which holds frequently used data and is backed by a big, slow memory (DRAM).
- capacity: Register << SRAM << DRAM (why?)
- latency: Register << SRAM << DRAM (why?)
- bandwidth: on-chip >> off-chip (why?)
On a data access:
- if data is in fast memory: low latency access (SRAM)
- if data is not in fast memory: long latency access (DRAM)

Relative Memory Cell Sizes
Figure: on-chip SRAM in logic chip versus DRAM on memory chip.
[Foss, "Implementing Application-Specific Memory", ISSCC 1996]
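The "800 instructions" figure on the processor-DRAM gap slide follows from simple arithmetic: a 2 GHz clock has a 0.5 ns cycle, so 100 ns of DRAM latency spans 200 cycles, and a four-issue machine could issue up to four instructions in each. A small sketch of that calculation, using integer arithmetic to avoid rounding (the function name is made up for this example):

```python
def instructions_per_memory_access(issue_width, clock_hz, dram_latency_ns):
    """Peak number of instructions that could issue during one DRAM access."""
    # cycles elapsed = clock frequency (cycles/s) * latency (s)
    cycles = clock_hz * dram_latency_ns // 1_000_000_000
    return issue_width * cycles

# Four-issue, 2 GHz superscalar, 100 ns DRAM (the numbers from the slide).
print(instructions_per_memory_access(4, 2_000_000_000, 100))
```

This is a peak figure: it assumes the processor could otherwise sustain four instructions every cycle.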

Levels of the Memory Hierarchy

Level          Capacity     Access time             Cost
CPU registers  100s bytes   <10s ns
Cache          K bytes      10-100 ns               1-0.1 cents/bit
Main memory    M bytes      200ns-500ns             $.0001-.00001 cents/bit
Disk           G bytes      10 ms (10,000,000 ns)   10^-5 to 10^-6 cents/bit
Tape           infinite     sec-min                 10^-8

Staging transfer unit between adjacent levels:
- Registers / Cache: instruction operands, managed by program/compiler, 1-8 bytes
- Cache / Memory: blocks, managed by cache controller, 8-128 bytes
- Memory / Disk: pages, managed by OS, 512-4K bytes
- Disk / Tape: files, managed by user/operator, Mbytes
Upper levels are faster; lower levels are larger.

Memory Hierarchy: Apple iMac G5 (1.6 GHz)

                        Reg        L1 Inst    L1 Data    L2          DRAM       Disk
Size                    1K         64K        32K        512K        256M       80G
Latency (cycles, time)  1, 0.6 ns  3, 1.9 ns  3, 1.9 ns  11, 6.9 ns  88, 55 ns  10^7, 12 ms

- Registers: managed by compiler
- L1 and L2 caches: managed by hardware
- DRAM and disk: managed by OS, hardware, application
Goal: Illusion of large, fast, cheap memory. Let programs address a memory space that scales to the disk size, at a speed that is usually as fast as register access.

Management of Memory Hierarchy
- Small/fast storage, e.g., registers
  - Address usually specified in instruction
  - Generally implemented directly as a register file
  - but hardware might do things behind software's back, e.g., stack management, register renaming
- Larger/slower storage, e.g., main memory
  - Address usually computed from values in register
  - Generally implemented as a hardware-managed cache hierarchy
  - hardware decides what is kept in fast memory
  - but software may provide "hints", e.g., don't cache or prefetch

Real Memory Reference Patterns
Figure: memory address (one dot per access) versus time.
Donald J. Hatfield, Jeanette Gerald: Program Restructuring for Virtual Memory. IBM Systems Journal 10(3): 168-192 (1971)

Typical Memory Reference Patterns
Figure: address versus time.
- Instruction fetches: n loop iterations
- Stack accesses: subroutine call, argument access, subroutine return
- Data accesses: vector access, scalar accesses

Common Predictable Patterns
Two predictable properties of memory references:
- Temporal Locality: If a location is referenced, it is likely to be referenced again in the near future.
- Spatial Locality: If a location is referenced, it is likely that locations near it will be referenced in the near future.
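The two locality properties can be illustrated on a synthetic address trace. The sketch below is a crude proxy, not a standard metric: it counts an access as "temporal" if the exact address was referenced earlier in the trace, and as "spatial" if the address is new but falls in a block that was touched earlier. The 64-byte block size, the function name, and the array-walk trace are all assumptions chosen for illustration.

```python
def locality_stats(trace, block_size=64):
    """Return (temporal_fraction, spatial_fraction) for a list of byte addresses.

    temporal: the same address was referenced earlier in the trace.
    spatial:  a different address in the same block was referenced earlier.
    """
    seen_addrs = set()
    seen_blocks = set()
    temporal = spatial = 0
    for addr in trace:
        block = addr // block_size
        if addr in seen_addrs:
            temporal += 1
        elif block in seen_blocks:
            spatial += 1
        seen_addrs.add(addr)
        seen_blocks.add(block)
    n = len(trace)
    return temporal / n, spatial / n

# Hypothetical trace: walk a 256-element array of 4-byte words, twice.
walk = [4 * i for i in range(256)] * 2
temporal, spatial = locality_stats(walk)
print(f"temporal reuse: {temporal:.3f}, spatial reuse: {spatial:.3f}")
```

In this trace the first sweep shows mostly spatial locality (neighboring words share a block) and the second sweep is entirely temporal reuse, which mirrors the vector-access and loop-iteration patterns in the figures.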

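Memory Reference Patterns
Figure: memory address (one dot per access) versus time, annotated with regions showing spatial locality and temporal locality.
Donald J. Hatfield, Jeanette Gerald: Program Restructuring for Virtual Memory. IBM Systems Journal 10(3): 168-192 (1971)

Caches
Caches exploit both types of predictability:
- Exploit temporal locality by remembering the contents of recently accessed locations.
- Exploit spatial locality by fetching blocks of data around recently accessed locations.

Inside a Cache
Figure: the processor exchanges addresses and data with the cache, which exchanges addresses and data with main memory. Each cache line holds an address tag (e.g., 100, 304, 6848, 416) and a data block of several data bytes; for example, one line holds a copy of main memory location 100 followed by a copy of main memory location 101.

Cache Algorithm (Read)
Look at processor address, search cache tags to find match. Then either:
- Found in cache, a.k.a. HIT
  - Return copy of data from cache
- Not in cache, a.k.a. MISS
  - Read block of data from main memory
  - Wait...
  - Return data to processor and update cache
  - Q: Which line do we replace?

Placement Policy
Figure: memory block numbers 0 to 31; a cache with 8 blocks (0 to 7), grouped into set numbers 0 to 3 for the 2-way case.
Memory block 12 can be placed:
- Fully associative: anywhere
- (2-way) set associative: anywhere in set 0 (12 mod 4)
- Direct mapped: only into block 4 (12 mod 8)

Direct-Mapped Cache
Figure: the address is split into tag, index (k bits), and block offset. The index selects one of 2^k lines, each holding a valid bit (V), a tag, and a data block. The stored tag is compared with the address tag to signal HIT, and the block offset selects the data word or byte.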
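The read algorithm and the direct-mapped placement rule above can be sketched together in a few lines. This is a minimal illustrative model, not the design from the slides: the class name, the 8-line and 4-byte-block defaults, and the use of a Python list as main memory are all assumptions. On a miss it simply overwrites the one line the index selects, since a direct-mapped cache has no choice of victim.

```python
class DirectMappedCache:
    """Toy direct-mapped cache: each memory block maps to exactly one line."""

    def __init__(self, num_lines=8, block_size=4):
        self.num_lines = num_lines
        self.block_size = block_size
        self.valid = [False] * num_lines
        self.tags = [0] * num_lines
        self.data = [None] * num_lines

    def split(self, addr):
        """Split a byte address into (tag, index, offset)."""
        offset = addr % self.block_size
        block_number = addr // self.block_size
        index = block_number % self.num_lines   # e.g. block 12 -> line 12 mod 8 = 4
        tag = block_number // self.num_lines
        return tag, index, offset

    def read(self, addr, memory):
        """Return (value, hit). On a miss, fetch the whole block and update the line."""
        tag, index, offset = self.split(addr)
        if self.valid[index] and self.tags[index] == tag:
            return self.data[index][offset], True        # HIT: serve from cache
        base = addr - offset                              # MISS: read block from memory
        self.data[index] = memory[base:base + self.block_size]
        self.tags[index] = tag
        self.valid[index] = True
        return self.data[index][offset], False

# Hypothetical main memory where each byte holds its own address.
memory = list(range(256))
cache = DirectMappedCache()
print(cache.read(48, memory))  # first touch of block 12: miss
print(cache.read(49, memory))  # same block: hit (spatial locality)
print(cache.read(80, memory))  # block 20 also maps to line 4: miss, evicts block 12
print(cache.read(48, memory))  # block 12 was evicted: miss again
```

With 4-byte blocks, address 48 lies in memory block 12, which lands in line 4, matching the "12 mod 8" example on the placement slide.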

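Direct Map Address Selection
higher-order vs. lower-order address bits
Figure: the direct-mapped organization (tag, k-bit index, block offset; 2^k lines with valid bit, tag, and data block; tag compare signals HIT; offset selects data word or byte), showing the choice of which address bits form the index.

2-Way Set-Associative Cache
Figure: the k-bit index selects one set; the address tag is compared against the valid bit and tag of both ways in parallel; a match in either way signals HIT, and the block offset selects the data word or byte.

Fully Associative Cache
Figure: there is no index; the address tag is compared against the valid bit and tag of every line in parallel; a match signals HIT, and the block offset selects the data word or byte.

Replacement Policy
In an associative cache, which block from a set should be evicted when the set becomes full?
- Random
- Least Recently Used (LRU)
  - LRU cache state must be updated on every access
  - true implementation only feasible for small sets (2-way)
  - pseudo-LRU binary tree often used for 4-8 way
- First In, First Out (FIFO), a.k.a. Round-Robin
  - used in highly associative caches
- Not Least Recently Used (NLRU)
  - FIFO with exception for most recently used block or blocks
This is a second-order effect. Why? Replacement only happens on misses.

Acknowledgements
These slides heavily contain material developed and copyright by:
- Krste Asanovic (MIT/UCB)
- David Patterson (UCB)
And also by:
- Arvind (MIT)
- Joel Emer (Intel/MIT)
- James Hoe (CMU)
- John Kubiatowicz (UCB)
MIT material derived from course 6.823. UCB material derived from course CS252.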
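As a companion to the replacement policy slide above, here is a small sketch of true LRU replacement in a set-associative cache. It tracks only tags (no data) and relies on Python's `OrderedDict` to keep each set ordered from least to most recently used. The class name and the 4-set, 2-way defaults are assumptions for illustration; real hardware would use a few state bits per set rather than an ordered map.

```python
from collections import OrderedDict

class SetAssociativeLRU:
    """Tag-only model of a set-associative cache with true LRU replacement."""

    def __init__(self, num_sets=4, ways=2):
        self.num_sets = num_sets
        self.ways = ways
        # One ordered map per set: oldest (least recently used) entry first.
        self.sets = [OrderedDict() for _ in range(num_sets)]

    def access(self, block_number):
        """Access a memory block; return True on a hit, False on a miss."""
        cache_set = self.sets[block_number % self.num_sets]
        tag = block_number // self.num_sets
        if tag in cache_set:
            cache_set.move_to_end(tag)     # LRU state is updated on every access
            return True
        if len(cache_set) >= self.ways:
            cache_set.popitem(last=False)  # set is full: evict least recently used
        cache_set[tag] = True
        return False

cache = SetAssociativeLRU(num_sets=4, ways=2)
# Blocks 0, 4 and 8 all map to set 0 (block number mod 4).
for block in [0, 4, 0, 8, 4]:
    print(block, "hit" if cache.access(block) else "miss")
```

In this trace the re-reference to block 0 makes block 4 the least recently used entry, so block 4 is the one evicted when block 8 arrives. The eviction step runs only on a miss, which is the point of the slide's closing remark.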