Lecture 1: Introduction and Fundamental Concepts

Transcription:

Lecture 1: Fundamental Concepts and Performance Analysis
CENG 332
[Lecture slides are adapted from the reference book: Computer Organization and Design, Patterson & Hennessy, 20, MKP]

Understanding Performance
- Algorithm
  - Determines number of operations executed
- Programming language, compiler, architecture
  - Determine number of machine instructions executed per operation
- Processor and memory system
  - Determine how fast instructions are executed
- I/O system (including OS)
  - Determines how fast I/O operations are executed

Below Your Program
- Application software
  - Written in high-level language (HLL)
- System software
  - Compiler: translates HLL code to machine code
  - Operating system: service code
    - Handling input/output
    - Managing memory and storage
    - Scheduling tasks and sharing resources
- Hardware
  - Processor, memory, I/O controllers

Levels of Program Code
- High-level language
  - Level of abstraction closer to problem domain
  - Provides for productivity and portability
- Assembly language
  - Textual representation of instructions
- Hardware representation
  - Binary digits (bits)
  - Encoded instructions and data

Components of a Computer (The BIG Picture)
- Same components for all kinds of computer
  - Desktop, server, embedded
- Input/output includes
  - User-interface devices: display, keyboard, mouse
  - Storage devices: hard disk, CD/DVD, flash
  - Network adapters: for communicating with other computers

Inside the Processor (CPU)
- Datapath: performs operations on data
- Control: sequences datapath, memory, ...
- Cache memory
  - Small, fast SRAM memory for immediate access to data

Inside the Processor
- AMD Barcelona: 4 processor cores

Relative Performance
- Performance is the reciprocal of execution time:
  $\text{Performance} = \dfrac{1}{\text{Execution Time}}$
- "X is n times faster than Y" means:
  $\dfrac{\text{Performance}_X}{\text{Performance}_Y} = \dfrac{\text{Execution Time}_Y}{\text{Execution Time}_X} = n$
- Example: a program takes 10 s on machine A and 15 s on machine B
  $\dfrac{ET_B}{ET_A} = \dfrac{15\,\text{s}}{10\,\text{s}} = 1.5$
  So, A is 1.5 times faster than B.
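The relative performance definition above can be checked with a few lines of Python. This is a minimal sketch; the function name `relative_performance` and its parameter names are illustrative choices, not part of the lecture.

```python
def relative_performance(exec_time_x: float, exec_time_y: float) -> float:
    """Return n such that X is n times faster than Y.

    Performance is 1 / execution time, so the ratio of performances
    is the inverse ratio of execution times.
    """
    if exec_time_x <= 0 or exec_time_y <= 0:
        raise ValueError("execution times must be positive")
    return exec_time_y / exec_time_x


# Slide example: 10 s on machine A, 15 s on machine B.
n = relative_performance(exec_time_x=10.0, exec_time_y=15.0)
print(f"A is {n:.1f} times faster than B")
```

Note that the faster machine's time goes in the denominator, which is an easy place to slip when comparing two systems.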

Measuring Time
- Elapsed time
  - Total response time, including all aspects: processing, I/O, OS overhead, idle time
  - Determines system performance
- CPU time
  - Time spent processing a given job
  - Discounts I/O time and other jobs' shares
- Different programs are affected differently by CPU and system performance
- Throughput
  - Total work done per unit time, e.g., instructions executed per second, tasks per hour

CPU Time
$\text{CPU Time} = \text{Number of Clock Cycles} \times \text{Clock Cycle Time} = \dfrac{\text{\# of CCs}}{\text{Clock Rate}}$
$\text{\# of CCs} = \text{Instruction Count} \times \text{Cycles per Instruction (CPI)}$
$\text{CC Time} = \dfrac{1}{\text{Clock Rate}}$
$\text{CPU Time} = \text{Instruction Count} \times \text{CPI} \times \text{Clock Cycle Time} = \dfrac{\text{Instruction Count} \times \text{CPI}}{\text{Clock Rate}}$

Example
A program runs on two different computers, A and B. If both computers have the same ISA, which one is faster and by how much?
- Computer A: Cycle Time = 250 ps, CPI = 2.0
- Computer B: Cycle Time = 500 ps, CPI = 1.2

Because the ISA is the same, both computers execute the same instruction count $I$:
$\text{CPU Time}_A = IC \times CPI_A \times CCT_A = I \times 2.0 \times 250\,\text{ps} = I \times 500\,\text{ps}$
$\text{CPU Time}_B = IC \times CPI_B \times CCT_B = I \times 1.2 \times 500\,\text{ps} = I \times 600\,\text{ps}$
$\dfrac{\text{CPU Time}_B}{\text{CPU Time}_A} = \dfrac{I \times 600\,\text{ps}}{I \times 500\,\text{ps}} = 1.2$
A is faster, by a factor of 1.2.

Instruction Count and CPI
- Performance is improved by
  - Reducing the number of clock cycles
  - Increasing the clock rate
- Instruction count for a program
  - Determined by program, ISA and compiler
- Average cycles per instruction (CPI)
  - Determined by CPU hardware
  - If different instructions have different CPI, the average CPI is affected by the instruction mix
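The two-computer example above (A: 250 ps cycle, CPI 2.0; B: 500 ps cycle, CPI 1.2) can be reproduced as a small sketch. The function name `cpu_time` and the instruction count of one million are assumptions for illustration; the instruction count cancels in the ratio, so any positive value gives the same answer.

```python
def cpu_time(instruction_count: float, cpi: float, cycle_time_s: float) -> float:
    """CPU Time = Instruction Count x CPI x Clock Cycle Time (in seconds)."""
    return instruction_count * cpi * cycle_time_s


# Hypothetical instruction count; same ISA means it is identical on A and B.
IC = 1_000_000

time_a = cpu_time(IC, cpi=2.0, cycle_time_s=250e-12)  # I x 500 ps
time_b = cpu_time(IC, cpi=1.2, cycle_time_s=500e-12)  # I x 600 ps

ratio = time_b / time_a
print(f"A is faster than B by a factor of {ratio:.2f}")
```

B has the lower CPI, yet A wins because its cycle time is half as long; only the product CPI x cycle time matters here.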

CPI in More Detail
- If different instruction classes take different numbers of cycles:
  $\text{Number of Clock Cycles} = \sum_{i=1}^{n} (CPI_i \times IC_i)$
- Weighted average CPI:
  $CPI = \dfrac{\text{\# of CCs}}{IC} = \sum_{i=1}^{n} \left( CPI_i \times \dfrac{IC_i}{IC} \right)$
  where $IC_i / IC$ is the relative frequency of class $i$.

CPI Example
A high-level program is compiled by two different compilers. Each code sequence has instructions from three instruction classes: A, B, C. Given the CPI value for each instruction class, find the average CPI.

Class               A   B   C
CPI for class       1   2   3
IC in sequence 1    2   1   2
IC in sequence 2    4   1   1

- Sequence 1: IC = 5
  - Clock Cycles = 2×1 + 1×2 + 2×3 = 10
  - Avg. CPI = 10/5 = 2.0
- Sequence 2: IC = 6
  - Clock Cycles = 4×1 + 1×2 + 1×3 = 9
  - Avg. CPI = 9/6 = 1.5

MIPS
- MIPS: Million Instructions Per Second
- Doesn't account for differences in ISAs between computers or differences in complexity between instructions
- So, it is not a good performance metric.
  $\text{MIPS} = \dfrac{\text{Instruction count}}{\text{Execution time} \times 10^6} = \dfrac{\text{Instruction count}}{\dfrac{\text{Instruction count} \times CPI}{\text{Clock rate}} \times 10^6} = \dfrac{\text{Clock rate}}{CPI \times 10^6}$
- CPI varies between programs on a given CPU

Uniprocessor Performance
- Constrained by power, instruction-level parallelism, memory latency
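The weighted-average CPI calculation from the CPI example above can be sketched as follows. The function name `average_cpi` and the dictionary layout are illustrative assumptions; the class CPIs and instruction counts are the ones from the example table.

```python
def average_cpi(cpi_per_class: dict, count_per_class: dict) -> float:
    """Average CPI = sum(CPI_i * IC_i) / sum(IC_i)."""
    total_cycles = sum(cpi_per_class[c] * n for c, n in count_per_class.items())
    total_instructions = sum(count_per_class.values())
    return total_cycles / total_instructions


CLASS_CPI = {"A": 1, "B": 2, "C": 3}
sequence_1 = {"A": 2, "B": 1, "C": 2}  # IC = 5, 10 clock cycles
sequence_2 = {"A": 4, "B": 1, "C": 1}  # IC = 6, 9 clock cycles

print(average_cpi(CLASS_CPI, sequence_1))  # 2.0
print(average_cpi(CLASS_CPI, sequence_2))  # 1.5
```

Sequence 2 executes more instructions but fewer clock cycles, so at the same clock rate it is the faster of the two.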

Multiprocessors
- Multicore microprocessors
  - More than one processor per chip
- Requires explicitly parallel programming
  - Compare with instruction-level parallelism
    - Hardware executes multiple instructions at once
    - Hidden from the programmer
  - Hard to do
    - Programming for performance
    - Load balancing
    - Optimizing communication and synchronization

SPEC CPU Benchmark
- Programs used to measure performance
  - Supposedly typical of actual workload
- Standard Performance Evaluation Corp (SPEC)
  - Develops benchmarks for CPU, I/O, Web, ...
- SPEC CPU2006
  - Elapsed time to execute a selection of programs
    - Negligible I/O, so focuses on CPU performance
  - Normalize relative to reference machine
  - Summarize as geometric mean of performance ratios
    - CINT2006 (integer) and CFP2006 (floating-point)
  $\sqrt[n]{\prod_{i=1}^{n} \text{Execution time ratio}_i}$

CINT2006 for Opteron X4 2356

Name        Description                    IC x10^9    CPI  Tc (ns)  Exec time  Ref time  SPECratio
perl        Interpreted string processing     2,118   0.75     0.40        637     9,777       15.3
bzip2       Block-sorting compression         2,389   0.85     0.40        817     9,650       11.8
gcc         GNU C Compiler                    1,050   1.72     0.40        724     8,050       11.1
mcf         Combinatorial optimization          336  10.00     0.40      1,345     9,120        6.8
go          Go game (AI)                      1,658   1.09     0.40        721    10,490       14.6
hmmer       Search gene sequence              2,783   0.80     0.40        890     9,330       10.5
sjeng       Chess game (AI)                   2,176   0.96     0.40        837    12,100       14.5
libquantum  Quantum computer simulation       1,623   1.61     0.40      1,047    20,720       19.8
h264avc     Video compression                 3,102   0.80     0.40        993    22,130       22.3
omnetpp     Discrete event simulation           587   2.94     0.40        690     6,250        9.1
astar       Games/path finding                1,082   1.79     0.40        773     7,020        9.1
xalancbmk   XML parsing                       1,058   2.70     0.40      1,143     6,900        6.0

Geometric mean of SPECratios: 11.7
Slide annotation: high cache miss rates

Fundamental Concepts
1) Taking advantage of parallelism
2) The principle of locality
3) Focus on the common case
4) Amdahl's Law
5) Processor performance equation
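Returning to the SPEC summary above, the geometric mean of the performance ratios can be computed as in the sketch below. The list `spec_ratios` holds the twelve SPECratio values as read from the CINT2006 table for the Opteron X4 2356 (treat them as transcribed values, not fresh measurements), and the function name `geometric_mean` is an illustrative choice.

```python
import math


def geometric_mean(values):
    """n-th root of the product, computed via logarithms to avoid overflow."""
    if not values or any(v <= 0 for v in values):
        raise ValueError("values must be a non-empty list of positive numbers")
    return math.exp(sum(math.log(v) for v in values) / len(values))


# SPECratio column: perl, bzip2, gcc, mcf, go, hmmer, sjeng,
# libquantum, h264avc, omnetpp, astar, xalancbmk.
spec_ratios = [15.3, 11.8, 11.1, 6.8, 14.6, 10.5, 14.5, 19.8, 22.3, 9.1, 9.1, 6.0]

gm = geometric_mean(spec_ratios)
print(f"Geometric mean of SPECratios: {gm:.1f}")
```

The geometric mean is used because the inputs are ratios: the comparison between two machines then does not depend on which reference machine was chosen for normalization.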

1) Taking Advantage of Parallelism
- Increasing throughput of a server computer via multiple processors or multiple disks
- Detailed HW design
  - Carry-lookahead adders use parallelism to speed up addition
  - Multiple memory banks searched in parallel
- Pipelining: overlap instruction execution to reduce the total time to complete an instruction sequence
  - Classic 5-stage pipeline: 1) Instruction Fetch (Ifetch), 2) Register Read (Reg), 3) Execute (ALU), 4) Data Memory Access (Dmem), 5) Register Write (Reg)

Pipelined Instruction Execution
[Figure: pipeline diagram. Vertical axis: instruction order; horizontal axis: time in clock cycles (Cycle 1 to Cycle 7). Each instruction passes through the stages Ifetch, Reg, ALU, DMem, Reg.]

Limits to Pipelining
- Hazards prevent the next instruction from executing during its designated clock cycle
  - Structural hazards: attempt to use the same hardware to do two different things at once
  - Data hazards: instruction depends on the result of a prior instruction still in the pipeline
  - Control hazards: caused by the delay between the fetching of instructions and decisions about changes in control flow (branches and jumps)

2) The Principle of Locality
- The principle of locality: programs access a relatively small portion of the address space at any instant of time.
- Two different types of locality:
  - Temporal locality (locality in time): if an item is referenced, it will tend to be referenced again soon (e.g., loops, reuse)
  - Spatial locality (locality in space): if an item is referenced, items whose addresses are close by tend to be referenced soon (e.g., straight-line code, array access)
- For the last 30 years, hardware has relied on locality for memory performance.
[Figure: processor (P), cache ($), memory (MEM)]
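As a rough illustration of why the pipelining described above reduces the total time for an instruction sequence, the sketch below counts clock cycles with and without overlap. It assumes an idealized pipeline (one cycle per stage, no structural, data or control hazards, so no stalls); the function names are hypothetical and the cycle-count formulas are the standard ideal-case ones, not something stated on the slides.

```python
def unpipelined_cycles(num_instructions: int, num_stages: int = 5) -> int:
    """Each instruction runs all stages before the next one starts."""
    return num_instructions * num_stages


def pipelined_cycles(num_instructions: int, num_stages: int = 5) -> int:
    """Ideal pipeline: the first instruction takes num_stages cycles,
    then one more instruction completes every cycle (no hazards assumed)."""
    if num_instructions == 0:
        return 0
    return num_stages + (num_instructions - 1)


for n in (1, 4, 1000):
    print(n, unpipelined_cycles(n), pipelined_cycles(n))
```

Each instruction still takes five cycles from fetch to write-back; the gain is in throughput, and hazards reduce it below this ideal figure.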

Levels of the Memory Hierarchy

Level             Capacity          Access time               Cost
CPU Registers     100s Bytes        300-500 ps (0.3-0.5 ns)
L1 and L2 Cache   10s-100s KBytes   ~1 ns - ~10 ns            $1000s/GByte
Main Memory       GBytes            80 ns - 200 ns            ~$100/GByte
Disk              10s TBytes        10 ms (10,000,000 ns)     ~$1/GByte
Tape              infinite          sec-min                   ~$1/GByte

Staging and transfer units between adjacent levels:

Between               Unit moved             Managed by       Transfer size
Registers, L1 Cache   Instruction operands   prog./compiler   1-8 bytes
L1 Cache, L2 Cache    Blocks                 cache control    32-64 bytes
L2 Cache, Memory      Blocks                 cache control    64-128 bytes
Memory, Disk          Pages                  OS               4K-8K bytes
Disk, Tape            Files                  user/operator    Mbytes

Upper levels are faster; lower levels are larger.

3) Focus on the Common Case
- In making a design trade-off, favor the frequent case over the infrequent case
  - E.g., the instruction fetch and decode unit is used more frequently than the multiplier, so optimize it first.
- The frequent case is often simpler and can be done faster than the infrequent case
  - E.g., overflow is rare when adding 2 numbers, so improve performance by optimizing the common case of no overflow
- What the frequent case is, and how much performance improves by making it faster, is quantified by Amdahl's Law

4) Amdahl's Law
Speedup gained from some faster mode of execution is limited by the fraction of the time during which the faster mode is used.

$\text{Speedup}_{\text{overall}} = \dfrac{ET_{\text{old}}}{ET_{\text{new}}}$
$ET_{\text{new}} = \dfrac{ET_{\text{affected}}}{\text{improvement factor}} + ET_{\text{unaffected}}$

With $\text{Fraction} = \dfrac{ET_{\text{affected}}}{ET_{\text{old}}}$:

$\text{ExTime}_{\text{new}} = \text{ExTime}_{\text{old}} \times \left[ (1 - \text{Fraction}) + \dfrac{\text{Fraction}}{\text{improvement factor}} \right]$
$\text{Speedup}_{\text{overall}} = \dfrac{\text{ExTime}_{\text{old}}}{\text{ExTime}_{\text{new}}} = \dfrac{1}{(1 - \text{Fraction}) + \dfrac{\text{Fraction}}{\text{improvement factor}}}$

Theoretical maximum:
$\text{Speedup}_{\text{maximum}} = \dfrac{1}{1 - \text{Fraction}}$
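Amdahl's Law as stated above translates directly into code. The sketch below is illustrative only: the function names are hypothetical, and the fraction of 0.5 used in the loop is an assumed value chosen to show how the overall speedup saturates as the improvement factor grows.

```python
def amdahl_speedup(fraction: float, improvement_factor: float) -> float:
    """Overall speedup = 1 / ((1 - Fraction) + Fraction / improvement factor)."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    if improvement_factor <= 0:
        raise ValueError("improvement factor must be positive")
    return 1.0 / ((1.0 - fraction) + fraction / improvement_factor)


def amdahl_max_speedup(fraction: float) -> float:
    """Limit of the overall speedup as the improvement factor goes to infinity."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    return float("inf") if fraction == 1.0 else 1.0 / (1.0 - fraction)


# Assumed example: half of the execution time benefits from the enhancement.
for factor in (2, 10, 1000):
    print(factor, round(amdahl_speedup(0.5, factor), 3))
print("maximum:", amdahl_max_speedup(0.5))
```

However large the improvement factor, the overall speedup never exceeds the theoretical maximum, because the unaffected part of the execution time is untouched.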

Amdahl's Law Example
Example: a new 10X faster CPU is placed in a computing system where 40% of the time is for CPU and 60% of the time is for I/O. What is the overall speedup?

$\text{Speedup}_{\text{overall}} = \dfrac{1}{(1 - \text{Fraction}) + \dfrac{\text{Fraction}}{\text{improvement factor}}} = \dfrac{1}{(1 - 0.4) + \dfrac{0.4}{10}} = \dfrac{1}{0.64} \approx 1.56$

A 10X faster CPU makes the whole system only about 1.6X faster.

5) Processor Performance Equation

$\text{CPU time} = \dfrac{\text{Seconds}}{\text{Program}} = \dfrac{\text{Instructions}}{\text{Program}} \times \dfrac{\text{Cycles}}{\text{Instruction}} \times \dfrac{\text{Seconds}}{\text{Cycle}}$

The three factors are the instruction count, the CPI and the cycle time. What affects each factor:

                Inst. Count   CPI   Clock Rate
Program              X
Compiler             X        (X)
Inst. Set            X         X
Organization                   X        X
Technology                              X

Pitfall: Amdahl's Law
- Improving an aspect of a computer and expecting a proportional improvement in overall performance
  $T_{\text{improved}} = \dfrac{T_{\text{affected}}}{\text{improvement factor}} + T_{\text{unaffected}}$
- Example: multiply accounts for 80 s out of 100 s
  - How much improvement in multiply performance is needed to get 5× overall?
    $20 = \dfrac{80}{n} + 20$
  - Can't be done!
- Corollary: make the common case fast

Fallacy: Low Power at Idle
- Look back at the X4 power benchmark
  - At 100% load: 295 W
  - At 50% load: 246 W (83%)
  - At 10% load: 180 W (61%)
- Google data center
  - Mostly operates at 10%-50% load
  - At 100% load less than 1% of the time
- Consider designing processors to make power proportional to load
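The Amdahl's Law pitfall above (multiply takes 80 s of a 100 s run) can be explored by solving the improved-time equation for the improvement factor. This is a sketch under the slide's assumptions; the function name `required_improvement` is hypothetical, and returning `None` for an unreachable target is a design choice made here for illustration.

```python
from typing import Optional


def required_improvement(t_affected: float, t_unaffected: float,
                         target_speedup: float) -> Optional[float]:
    """Solve t_old / target = t_affected / n + t_unaffected for n.

    Returns None when no finite improvement factor can reach the target,
    i.e. when the unaffected time alone already uses up the time budget.
    """
    t_old = t_affected + t_unaffected
    t_target = t_old / target_speedup
    remaining = t_target - t_unaffected  # time budget left for the affected part
    if remaining <= 0:
        return None
    return t_affected / remaining


print(required_improvement(80.0, 20.0, target_speedup=4))  # 16.0
print(required_improvement(80.0, 20.0, target_speedup=5))  # None: can't be done
```

A 4× overall speedup already needs multiply to be 16× faster; 5× would need the multiply time to drop to zero, which is why the slide concludes it cannot be done.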

Pitfall: MIPS as a Performance Metric
- MIPS: Millions of Instructions Per Second
  - Doesn't account for
    - Differences in ISAs between computers
    - Differences in complexity between instructions
  $\text{MIPS} = \dfrac{\text{Instruction count}}{\text{Execution time} \times 10^6} = \dfrac{\text{Instruction count}}{\dfrac{\text{Instruction count} \times CPI}{\text{Clock rate}} \times 10^6} = \dfrac{\text{Clock rate}}{CPI \times 10^6}$
- CPI varies between programs on a given CPU

Concluding Remarks
- Cost/performance is improving
  - Due to underlying technology development
- Hierarchical layers of abstraction
  - In both hardware and software
- Instruction set architecture
  - The hardware/software interface
- Execution time: the best performance measure
- Power is a limiting factor
  - Use parallelism to improve performance
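To make the MIPS pitfall above concrete, the sketch below compares two hypothetical compiled versions of the same program on the same CPU. All numbers (a 2 GHz clock, the instruction counts and the CPIs) are assumptions invented for illustration; only the formulas come from the slides.

```python
def mips(clock_rate_hz: float, cpi: float) -> float:
    """MIPS = Clock rate / (CPI x 10^6)."""
    return clock_rate_hz / (cpi * 1e6)


def execution_time(instruction_count: float, cpi: float, clock_rate_hz: float) -> float:
    """Execution time = Instruction count x CPI / Clock rate (seconds)."""
    return instruction_count * cpi / clock_rate_hz


CLOCK_HZ = 2e9  # hypothetical 2 GHz CPU

# Version 1: many simple instructions. Version 2: fewer, more complex ones.
v1_time = execution_time(12e9, cpi=1.0, clock_rate_hz=CLOCK_HZ)
v2_time = execution_time(5e9, cpi=2.0, clock_rate_hz=CLOCK_HZ)
v1_mips = mips(CLOCK_HZ, cpi=1.0)
v2_mips = mips(CLOCK_HZ, cpi=2.0)

print(f"version 1: {v1_mips:.0f} MIPS, {v1_time:.1f} s")
print(f"version 2: {v2_mips:.0f} MIPS, {v2_time:.1f} s")
```

In this made-up case version 1 has twice the MIPS rating yet takes longer to finish, which is why execution time, not MIPS, is the measure to compare.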