CMSC 313 Lecture 27: System Performance, CPU Performance, Disk Performance
Announcement: Don't use the oscillator in DigSim3
UMBC, CMSC 313, Richard Chang <chang@umbc.edu>



Bottlenecks
The performance of a process might be limited by the performance of the CPU, the memory system, or I/O. Such a process is said to be CPU bound, memory bound, or I/O bound.

Amdahl's Law
Attributed to Gene Amdahl in 1967: a formula that relates the speedup of the overall system to the speedup of one component:

    S = 1 / [ (1 - f) + f / k ]

where S = overall speedup, f = fraction of the work performed by the faster component, and k = speedup of the new component.

Amdahl's Law Example
Suppose typical jobs on a system spend 70% of their time running in the CPU and 30% of their time waiting for the disk.
A processor that is 1.5x faster would give S = 1 / [ (1 - 0.7) + 0.7 / 1.5 ] = 1.3043.
A disk drive that is 2.5x faster would give S = 1 / [ (1 - 0.3) + 0.3 / 2.5 ] = 1.2195.
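The arithmetic above is easy to check mechanically. The following is a minimal C sketch (not from the original slides); the function name amdahl_speedup and the sample values are illustrative only.

    #include <stdio.h>

    /* Amdahl's Law: S = 1 / ((1 - f) + f / k)
     * f = fraction of the work done by the component being sped up
     * k = speedup factor of that component
     */
    static double amdahl_speedup(double f, double k)
    {
        return 1.0 / ((1.0 - f) + f / k);
    }

    int main(void)
    {
        /* CPU accounts for 70% of the time; make it 1.5x faster */
        printf("CPU 1.5x faster:  S = %.4f\n", amdahl_speedup(0.7, 1.5));  /* 1.3043 */

        /* Disk accounts for 30% of the time; make it 2.5x faster */
        printf("Disk 2.5x faster: S = %.4f\n", amdahl_speedup(0.3, 2.5));  /* 1.2195 */
        return 0;
    }

Note how modest the overall speedup is in both cases: the component that was not improved limits the total gain.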

System Performance
The overall performance of a computer system depends on the performance of a mixture of processes: some processes might be CPU bound, others memory or I/O bound. Performance cannot be represented as a single number (the megahertz myth, MIPS, MFLOPS), and benchmarks can be and have been manipulated (Whetstone, Linpack, Dhrystone, SPEC). On the other hand, speeding up the CPU and I/O does help.

CPU Performance
CPU time/program = time/cycle x cycles/instruction x instructions/program
A higher clock rate reduces time/cycle, RISC machines reduce cycles/instruction, and CISC machines reduce instructions/program.
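To make the three factors concrete, here is a small C sketch (not from the slides); the clock rate, CPI, and instruction count are made-up values for illustration.

    #include <stdio.h>

    int main(void)
    {
        /* Hypothetical values, for illustration only */
        double clock_rate_hz = 2.0e9;   /* 2 GHz -> time/cycle = 0.5 ns */
        double cpi           = 1.5;     /* average cycles per instruction */
        double instructions  = 4.0e9;   /* dynamic instruction count of the program */

        double time_per_cycle = 1.0 / clock_rate_hz;
        double cpu_time       = time_per_cycle * cpi * instructions;

        printf("CPU time = %.3f seconds\n", cpu_time);   /* 0.5 ns * 1.5 * 4e9 = 3.000 s */
        return 0;
    }

Any of the three factors can be attacked: a faster clock, a lower CPI, or fewer executed instructions all reduce CPU time.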

Other CPU Optimizations
Integrated floating point unit (FPU), parallel execution units, instruction pipelining, branch prediction, speculative execution, and code optimization.

The Case for RISC
A Reduced Instruction Set Computer (RISC) architecture features a minimal instruction set, a load/store architecture, simple addressing modes, fixed-length instructions, and lots of registers. As a result, instructions can be prefetched, simple instructions execute quickly, memory is cheap, and pipelining is easier.

Instruction Frequency (Murdocca & Heuring, Principles of Computer Architecture, 1999, slide 10-3)
Frequency of occurrence of instruction types for a variety of languages. The percentages do not sum to 100 due to roundoff. (Adapted from Knuth, D. E., "An Empirical Study of FORTRAN Programs," Software Practice and Experience, 1, 105-133, 1971.)

    Statement     Average percent of time
    Assignment    47
    If            23
    Call          15
    Loop           6
    Goto           3
    Other          7

Complexity of Assignments (Murdocca & Heuring, Principles of Computer Architecture, 1999, slide 10-4)
Percentages showing the complexity of assignments and procedure calls. (Adapted from Tanenbaum, A., Structured Computer Organization, 4/e, Prentice Hall, Upper Saddle River, New Jersey, 1999.)

    Number   % of terms        % of locals       % of parameters
             in assignments    in procedures     in procedure calls
    0            --                 22                41
    1            80                 17                19
    2            15                 20                15
    3             3                 14                 9
    4             2                  8                 7
    5             0                 20                 8

Pipelining
The fetch-execute cycle is broken down into stages, each performed by a different unit, and the next instruction begins before the current one finishes. Ideally, one instruction finishes per cycle. Branch instructions mess up the pipeline, and keeping the pipeline filled may be difficult.
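A quick back-of-the-envelope calculation (not from the slides) shows why "ideally one instruction per cycle": on a k-stage pipeline with no stalls, n instructions take about k + (n - 1) cycles instead of k * n, so the cycles per instruction approach 1 for large n. The stage count and instruction count below are arbitrary illustrative values.

    #include <stdio.h>

    int main(void)
    {
        int  k = 4;          /* pipeline stages (e.g., fetch, decode, operand fetch, execute) */
        long n = 1000000;    /* number of instructions, chosen for illustration */

        long unpipelined = (long)k * n;   /* each instruction occupies all k stages serially */
        long pipelined   = k + (n - 1);   /* first instruction fills the pipe, then one per cycle */

        printf("unpipelined: %ld cycles\n", unpipelined);
        printf("pipelined:   %ld cycles (%.3f cycles/instruction)\n",
               pipelined, (double)pipelined / n);
        return 0;
    }

Branches and memory references insert bubbles (nops) into this ideal picture, which is what the next few slides illustrate.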

Four-Stage Instruction Pipeline (Murdocca & Heuring, Principles of Computer Architecture, 1999, slide 10-7)
Stages: Instruction Fetch -> Decode -> Operand Fetch -> Execute (ALU op and writeback).

Pipeline Behavior (Murdocca & Heuring, Principles of Computer Architecture, 1999, slide 10-8)
[Figure: pipeline behavior during a memory reference and during a branch. The instruction sequence addcc, ld, srl, subcc, be (followed by nops) flows through the Instruction Fetch, Decode, Operand Fetch, and Execute stages over cycles 1-8.]

Filling the Load Delay Slot (Murdocca & Heuring, Principles of Computer Architecture, 1999, slide 10-9)
SPARC code, (a) with a nop inserted, and (b) with srl migrated to the nop position:

    (a) with a nop inserted:
        srl   %r3, %r5
        addcc %r1, 10, %r1
        ld    %r1, %r2
        nop
        subcc %r2, %r4, %r4
        be    label

    (b) with srl migrated into the delay slot:
        addcc %r1, 10, %r1
        ld    %r1, %r2
        srl   %r3, %r5
        subcc %r2, %r4, %r4
        be    label

Register Windows
Used in SPARC machines; not popular elsewhere. The idea is to use registers for parameter passing: the CPU has more registers than the program sees, parameters are passed in registers, and function call depth does not vary that much.

Call-Return Behavior (Murdocca & Heuring, Principles of Computer Architecture, 1999, slide 10-10)
[Figure: call-return behavior as a function of nesting depth and time, with time measured in units of calls/returns and a register window depth of 5 marked. (Adapted from Stallings, W., Computer Organization and Architecture: Designing for Performance, 4/e, Prentice Hall, Upper Saddle River, 1996.)]

SPARC Registers (Murdocca & Heuring, Principles of Computer Architecture, 1999, slide 10-11)
User view of the RISC I / SPARC registers; each register is 32 bits wide:

    %g0 - %g7   (R0  - R7)    global variables
    %i0 - %i7   (R8  - R15)   incoming parameters
    %l0 - %l7   (R16 - R23)   local variables
    %o0 - %o7   (R24 - R31)   outgoing parameters

Overlapping Register Windows (Murdocca & Heuring, Principles of Computer Architecture, 1999, slide 10-12)
[Figure: two overlapping register windows. Procedure A (CWP = 8) and procedure B (CWP = 24) each see the globals in R0-R7 plus their own ins (R8-R15), locals (R16-R23), and outs (R24-R31); A's outs overlap B's ins.]

Superscalar Architecture
Parallel execution units (e.g., duplicate ALUs) exploit instruction-level parallelism (versus program-level parallelism), which is limited by dependencies. Techniques include out-of-order execution, dynamic scheduling, and static scheduling. VLIW = Very Long Instruction Word; EPIC = Explicitly Parallel Instruction Computing = VLIW with variable-length instructions.

128-Bit IA-64 Instruction Word (Murdocca & Heuring, Principles of Computer Architecture, 1999, slide 10-18)
[Figure: a 128-bit bundle contains an 8-bit template and three 40-bit instructions; each 40-bit instruction holds a 13-bit op code, a 6-bit predicate, and three 7-bit GPR fields.]

Parallel Processing
Many processors connected together; program-level parallelism. Shared memory versus message passing. The process must be parallelizable, and it is hard to keep the processors busy. How do we program these things?

Flynn Taxonomy (Murdocca & Heuring, Principles of Computer Architecture, 1999, slide 10-21)
[Figure: classification of architectures according to the Flynn taxonomy: (a) SISD, a single uniprocessor; (b) SIMD, one instruction stream from a controller driving processing elements PE0-PEn over an interconnection network; (c) MIMD, multiple uniprocessors, each with its own instruction stream, connected by an interconnection network; (d) MISD, multiple instruction streams driving a series of vector units over a single data stream.]

Network Topologies (Murdocca & Heuring, Principles of Computer Architecture, 1999, slide 10-22)
[Figure: network topologies: (a) crossbar; (b) bus; (c) ring; (d) mesh; (e) star; (f) tree; (g) perfect shuffle; (h) hypercube.]

Disk Performance
Seek time = time to move the disk arm to the track. Rotational delay = time for the sector to be positioned under the head. Access time = seek time + rotational delay. Transfer time = time to read/write a sector. The capacity of hard disks has increased dramatically, but the average access time (~8 ms) hasn't improved much. When several processes use the same disk, scheduling becomes important.

A Magnetic Disk with Three Platters (Murdocca & Heuring, Principles of Computer Architecture, 1999, slide 8-18)
[Figure: three platters on a spindle with read/write heads (one per surface) mounted on a comb that moves across the platters; surfaces 0-3 are used while the top and bottom surfaces are not, and the heads ride on an air cushion (about 5 µm) above each surface.]

Organization of a Disk Platter with a 1:2 Interleave Factor (Murdocca & Heuring, Principles of Computer Architecture, 1999, slide 8-20)
[Figure: Track 0 of a platter with 16 sectors laid out with a 1:2 interleave factor, so consecutive logical sectors are separated by one physical sector (physical order 0, 8, 1, 9, 2, 10, ...); inter-sector and inter-track gaps are marked.]

Seek Time
The seek time is the time necessary to move the arm from its previous location to the correct track. The industry standard is the average seek time, computed by averaging over all possible seeks. Average seek time tends to overstate actual seek times because of locality of disk references.

Rotational Delay
The time for the requested sector to rotate under the head. Typical rotation speeds are 3,600-7,200 rpm (disks rotated at 3,600 rpm in 1975). On average the disk rotates halfway to the desired sector:
6,000 rpm => 0.5 / (6,000 rpm / 60,000 ms per minute) = 5 ms
Some disks can read data out of order into a buffer, so a full track can be transferred in one rotation.

Transfer Time
The time to read or write a sector, approximately equal to the time for the sector to pass fully under the head. It is a function of block size, rotation speed, recording density, and the speed of the electronics. Example: 6,000 rpm, 64 sectors per track, 512 bytes per sector:
6,000 rpm = 100 revolutions per second => 10 ms per revolution
64 sectors per track => 10 ms / 64 = 0.156 ms for a sector to pass under the head
512 bytes per sector => 1 KB takes 0.312 ms => 3.2 KB/ms = 3.2 MB/s
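The rotational-delay and transfer-time numbers above can be reproduced with a short C sketch (not from the slides); the rpm, sectors per track, and sector size are simply the example parameters from the two preceding slides.

    #include <stdio.h>

    int main(void)
    {
        /* Example parameters from the rotational-delay and transfer-time slides */
        double rpm               = 6000.0;
        int    sectors_per_track = 64;
        int    bytes_per_sector  = 512;

        double ms_per_rev   = 60000.0 / rpm;                   /* 10 ms per revolution */
        double rot_delay_ms = 0.5 * ms_per_rev;                /* average rotational delay: 5 ms */
        double sector_ms    = ms_per_rev / sectors_per_track;  /* 0.156 ms per sector */
        double kb_per_ms    = (bytes_per_sector / sector_ms) / 1024.0;  /* 3.2 KB/ms, i.e. ~3.2 MB/s */

        printf("average rotational delay: %.1f ms\n", rot_delay_ms);
        printf("time per sector:          %.3f ms\n", sector_ms);
        printf("transfer rate:            %.1f KB/ms (~%.1f MB/s)\n", kb_per_ms, kb_per_ms);
        return 0;
    }

Note that seek time and rotational delay (several milliseconds) dominate the per-sector transfer time (a fraction of a millisecond), which is why disk scheduling matters when several processes share a disk.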

Next Time
Semester overview; sample final exam.
UMBC, CMSC 313, Richard Chang <chang@umbc.edu>