CSCI 402: Computer Architectures. Computer Abstractions and Technology (4) Fengguang Song Department of Computer & Information Science IUPUI.


Contents (Section 1.7 to the end of Chapter 1)
- Power wall
- The multicore era
- SPEC benchmark
- Several fallacies and pitfalls
- First homework is assigned

What is the Power Wall?
For the past 30 years, life was good:
- Moore's Law: the number of transistors doubles every 18 months
- More transistors/mm^2 -> more activity per unit area
- Vendors also simply increased the CPU clock rate to speed up the CPU. Automatic!
However, the free lunch is over! Why is there a problem now?
- More transistors/mm^2 -> higher power density (watts/cm^2) -> higher temperature
Fact: total power consumption of the world's PCs:
- 1992: 10 Mwatts (87M CPUs)
- 2001: 9000 Mwatts (500M CPUs). That's 4 Hoover Dams!
Q: Guess how much money 1 Mwatt will cost for one year?
A: 365 * 24 * 1000 * ($0.12 per kWh) -> $1,051,200

[Figure: Power density (watts/cm^2, log scale from 1 to 1000) versus process generation (1.5µ down to 0.07µ) for Intel processors from the i386 and i486 through the Pentium, Pentium Pro, Pentium II, Pentium III, and Pentium 4, with reference levels for a hot plate, a nuclear reactor, and a rocket nozzle. Power was doubling every 4 years. Hot-plate power density was surpassed at 0.5µ, and it will not take long to reach nuclear-reactor density.]
Courtesy: "New Microarchitecture Challenges in the Coming Generations of CMOS Process Technologies", Fred Pollack, Intel Corp., Micro32 conference keynote, 1999.
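The electricity-cost estimate above can be reproduced in a few lines of Python. This is only a sketch: the function name `annual_cost_usd` and its parameters are illustrative, and the $0.12 per kWh rate is simply the figure used on the slide.

```python
HOURS_PER_YEAR = 365 * 24  # ignores leap years, as the slide does


def annual_cost_usd(power_mw: float, usd_per_kwh: float = 0.12) -> float:
    """Cost of drawing `power_mw` megawatts continuously for one year."""
    power_kw = power_mw * 1000  # 1 MW = 1000 kW
    return HOURS_PER_YEAR * power_kw * usd_per_kwh


print(f"1 MW for one year:    ${annual_cost_usd(1):,.0f}")
print(f"9000 MW for one year: ${annual_cost_usd(9000):,.0f}")
```

The second line applies the same rate to the 2001 figure of 9000 Mwatts quoted above, assuming that draw is sustained all year.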

The Power Wall
[Figure: Clock rate and power for Intel x86 microprocessors over 10 generations and 30 years.] The Pentium 4 made a dramatic jump in clock rate and power but less so in performance. The Prescott thermal problems led to the abandonment of the Pentium 4 line. The Core 2 line reverts to a simpler pipeline with lower clock rates. The Core i5 pipelines follow in its footsteps.

Tips to Understand Electricity, Voltage, and Leakage
- Electricity runs through circuits in the same way as water runs through plumbing
  - Voltage is like water pressure
  - Electric current is the amount of water
  - A capacitor is like a tank or container
- If you increase water pressure, you can fill a tank more quickly. If you increase the voltage on a CPU chip, it can perform its operations more quickly
- However, at a higher voltage the CPU uses more power and generates more heat
- If all you need is the difference between on and off, then it doesn't matter whether you are talking about a 100-watt light bulb or an LED that uses a fraction of a watt
- As chips become smaller and smaller, the distance between the wires gets small, and some current leaks across to nearby wires
Source: http://pclt.sites.yale.edu/circuit-size-voltage-and-heat

The Power Wall (Cont.)
- Metrics: power vs. energy (watt vs. joule)
- For CMOS, the primary source of energy consumption is dynamic energy, i.e., the energy needed to switch a transistor from 0 to 1 or from 1 to 0
- The other source is leakage (current that flows even when a transistor is off), which usually accounts for 40% of energy consumption on servers!
- The energy required per transistor to go 1 -> 0 -> 1:
  $\text{Energy} \propto \text{Capacitive load} \times \text{Voltage}^2$
- The power required per transistor:
  $\text{Power} \propto \text{Capacitive load} \times \text{Voltage}^2 \times \text{Frequency}$

Power Example
Suppose we build a new CPU that has a 15% reduction in capacitive load, a 15% reduction in voltage, and a 15% reduction in frequency:
$\frac{P_{new}}{P_{old}} = \frac{(C_{old} \times 0.85) \times (V_{old} \times 0.85)^2 \times (F_{old} \times 0.85)}{C_{old} \times V_{old}^2 \times F_{old}} = 0.85^4 = 0.52$

Today's situation of the power wall:
- We can't reduce voltage further: it would make transistors more leaky
- We can't remove more heat with cooling fans: 100 watts is already too much!
- How can we improve performance?

Entering the Multicore Era
- Multicore processors: more than one core per chip
- In the past, programmers relied on hardware innovation to double program performance, with no code change at all
- Today, programmers need to rewrite programs to take advantage of multicore processors
- Moreover, with more cores, programmers must keep improving program performance constantly ("scalability")
- This requires explicitly parallel programming
  - Different from instruction-level parallelism, where the hardware executes multiple instructions at once, hidden from the programmer
- It is hard to do:
  - Programming for performance! (not only for correctness)
  - Load balancing
  - Optimizing communication and synchronization
- More and more cores per chip are coming -> thus, even harder!!
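Returning to the power example earlier in this part, the 0.52 ratio follows directly from the proportionality Power ∝ Capacitive load × Voltage^2 × Frequency. The sketch below is a minimal check; the helper name `relative_dynamic_power` and its scale-factor parameters are assumptions made for illustration, not part of the slides.

```python
def relative_dynamic_power(c_scale: float, v_scale: float, f_scale: float) -> float:
    """Return P_new / P_old, assuming dynamic power is proportional to C * V^2 * F.

    Each argument is new/old for that quantity (0.85 means a 15% reduction).
    """
    return c_scale * v_scale ** 2 * f_scale


ratio = relative_dynamic_power(0.85, 0.85, 0.85)
print(f"P_new / P_old = {ratio:.2f}")

# Voltage enters squared, so it is the strongest lever of the three.
print(f"voltage only:   {relative_dynamic_power(1.0, 0.85, 1.0):.2f}")
print(f"frequency only: {relative_dynamic_power(1.0, 1.0, 0.85):.2f}")
```

The last two lines show why the inability to lower voltage further matters so much: the same 15% cut saves more power when applied to voltage than to frequency or capacitive load.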

History of Processor Performance
[Figure: History of processor performance. Growth is now constrained by power, instruction-level parallelism, and memory latency.]

Benchmarking
SPEC CPU Benchmark
- The programs used to measure performance
- Intended to be typical of the actual workload
- The latest version is SPEC CPU2017
- CINT2006 (integer) and CFP2006 (floating-point)
- Measures the elapsed time to execute a number of programs
  - No I/O involved; the focus is only on CPU performance
- Then normalizes it relative to a reference machine
- Summarizes the result as the geometric mean of the performance ratios:
  $\sqrt[n]{\prod_{i=1}^{n} \text{Execution time ratio}_i}$

Performance Ratio
$\text{SPECratio} = \frac{\text{time on SPARCstation 10/40}}{\text{time on target machine}}$
A larger SPECratio -> better performance

CINT2006 for Intel Core i7 920
[Table: results for the 12 integer compute programs.]

Why not use the Arithmetic Mean (or Average)?
In all three scenarios the two machines have the same measured times; only the reference machine changes. Each sum adds the ratios (reference time / machine time) for T1 and T2.

Scenario I
              Test T1    Test T2
Machine A     10 sec     100 sec
Machine B     1 sec      1000 sec
Reference     1 sec      100 sec
A: 0.1 + 1    B: 1 + 0.1    -> A and B are the same

Scenario II
              Test T1    Test T2
Machine A     10 sec     100 sec
Machine B     1 sec      1000 sec
Reference     1 sec      10 sec
A: 0.1 + 0.1    B: 1 + 0.01    -> B is faster

Scenario III
              Test T1    Test T2
Machine A     10 sec     100 sec
Machine B     1 sec      1000 sec
Reference     1 sec      1000 sec
A: 0.1 + 10    B: 1 + 1    -> A is faster

There is a New SPEC Power Benchmark (#ops/watt)
FIGURE 1.19 SPECpower_ssj2008 running on a dual socket 2.66 GHz Intel Xeon X5650 with 16 GB of DRAM and one 100 GB SSD disk.
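The three scenarios above can be replayed in code to compare the two ways of summarizing. This is a sketch under stated assumptions: the helper names (`ratios`, `verdict`, and so on) are invented for illustration, and the arithmetic mean is used instead of the plain sum from the slide, which does not change the ranking.

```python
import math

# Execution times in seconds for tests (T1, T2), taken from the scenarios above.
MACHINE_A = (10, 100)
MACHINE_B = (1, 1000)
REFERENCES = {"I": (1, 100), "II": (1, 10), "III": (1, 1000)}


def ratios(reference, target):
    """Performance ratios: reference time / target time (larger is better)."""
    return [r / t for r, t in zip(reference, target)]


def arithmetic_mean(xs):
    return sum(xs) / len(xs)


def geometric_mean(xs):
    return math.prod(xs) ** (1 / len(xs))


def verdict(score_a, score_b):
    """Compare two summary scores, treating floating-point near-equality as a tie."""
    if math.isclose(score_a, score_b):
        return "tie"
    return "A faster" if score_a > score_b else "B faster"


for name, ref in REFERENCES.items():
    ra, rb = ratios(ref, MACHINE_A), ratios(ref, MACHINE_B)
    print(f"Scenario {name}: "
          f"arithmetic -> {verdict(arithmetic_mean(ra), arithmetic_mean(rb))}, "
          f"geometric -> {verdict(geometric_mean(ra), geometric_mean(rb))}")
```

With the arithmetic mean, the winner depends on which reference machine is chosen. The geometric mean gives the same comparison for every reference, because the reference times cancel when the two machines' geometric means are divided.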

Fallacies and Pitfalls
- Appear in every chapter
- Fallacies: commonly held misconceptions that you will encounter
- Pitfalls: easily made mistakes, often generalizations that are only true in a very limited context
- They help you avoid making the same mistakes

1st Fallacy: Low Power at Idle
- The fallacy: computers at low utilization use only a little power
- Look back at the previous Intel power benchmark (SPECpower_ssj2008):
  - At 100% load: 258W
  - At 50% load: 170W (66%)
  - At 10% load: 121W (47%!)
  -> Even the best result ever measured still uses 33% of peak power at 10% load
- E.g., Google data centers
  - Mostly operate at 10% to 50% load
  - Operate at 100% load less than 1% of the time
- Current research: design processors to use power in proportion to the workload
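The percentages in the low-power-at-idle fallacy are each measurement divided by the power at 100% load. A short sketch, with illustrative names (`PEAK_WATTS`, `fraction_of_peak`) that are not from the slides:

```python
PEAK_WATTS = 258  # power at 100% load, from the slide

# Load (%) -> measured power (W), from the slide.
MEASUREMENTS = {50: 170, 10: 121}


def fraction_of_peak(watts: float, peak: float = PEAK_WATTS) -> float:
    """Measured power as a fraction of the power drawn at full load."""
    return watts / peak


for load, watts in MEASUREMENTS.items():
    actual = fraction_of_peak(watts)
    # A perfectly power-proportional machine would use load/100 of peak power.
    ideal = load / 100
    print(f"{load:3d}% load: {watts} W = {actual:.0%} of peak "
          f"(power-proportional ideal: {ideal:.0%})")
```

The gap between the measured fraction and the ideal fraction is what power-proportional processor design aims to close.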

1st Pitfall: Improving one aspect of a computer and expecting a proportional improvement in overall performance
Amdahl's Law:
$T_{improved} = \frac{T_{affected}}{\text{improvement factor}} + T_{unaffected}$
- E.g., multiply accounts for 80 s of a total time of 100 s
- How much must multiply improve to get a 5x overall speedup?
  $20 = \frac{80}{n} + 20$
- Is it possible? No: it would require 80/n = 0, which no finite n satisfies.

2nd Fallacy: Designing for performance and designing for energy efficiency are unrelated

2nd Pitfall: Using a subset of the performance equation as a performance metric. Even using 2 of the 3 components is not correct.
$\text{CPU Time} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Clock cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Clock cycle}}$
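The multiply example under Amdahl's Law can be explored numerically. The sketch below is illustrative; the function names `improved_time` and `overall_speedup` are assumptions, and the 80 s and 20 s figures are those given in the example.

```python
def improved_time(t_affected: float, t_unaffected: float, factor: float) -> float:
    """Amdahl's Law: execution time after the affected part runs `factor` times faster."""
    return t_affected / factor + t_unaffected


def overall_speedup(t_affected: float, t_unaffected: float, factor: float) -> float:
    """Old total time divided by the improved total time."""
    t_old = t_affected + t_unaffected
    return t_old / improved_time(t_affected, t_unaffected, factor)


# Multiply takes 80 s of a 100 s run; the other 20 s are unaffected.
for n in (2, 4, 10, 100, 1_000_000):
    print(f"multiply {n:>9,}x faster -> overall speedup "
          f"{overall_speedup(80, 20, n):.4f}x")

# Even if multiply took no time at all, the unaffected 20 s would remain.
print(f"upper bound: {(80 + 20) / 20:.1f}x")
```

The speedup approaches 5x as the improvement factor grows but never reaches it, which is why the 5x target in the example is unattainable.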

3rd Pitfall: Using MIPS as a Performance Metric
- MIPS: Millions of Instructions Per Second
- Why is it a pitfall? It does not account for differences in complexity between instructions
$\text{MIPS} = \frac{\text{Instruction count}}{\text{Execution time} \times 10^6} = \frac{\text{Instruction count}}{\frac{\text{Instruction count} \times \text{CPI}}{\text{Clock rate}} \times 10^6} = \frac{\text{Clock rate}}{\text{CPI} \times 10^6}$

An Example of MIPS
Two different compilers are used on a 100 MHz machine with three classes of instructions: Class A, B, and C, which require 1, 2, and 3 cycles, respectively. Both compilers are used to compile the same program.
- The 1st compiler's code: 5 million Class A instructions, 1 million Class B instructions, and 1 million Class C instructions.
- The 2nd compiler's code: 10 million Class A instructions, 1 million Class B instructions, and 1 million Class C instructions.
Q1: Which code will be faster according to execution time?
Q2: Which code will be faster according to MIPS?

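Answer to Q1
Q1: Which code will be faster according to execution time?
$\text{Exec time} = \frac{\text{clock cycles}}{\text{clock rate}} = \frac{\text{IC} \times \text{CPI}}{\text{clock rate}}$
$\text{time}_1 = \frac{(5\text{ mil} \times 1) + (1\text{ mil} \times 2) + (1\text{ mil} \times 3) \text{ cycles}}{100 \times 10^6 \text{ cycles/sec}} = \frac{(5 + 2 + 3) \times 10^6}{100 \times 10^6} \text{ sec} = \frac{10}{100} \text{ sec} = 0.1 \text{ sec}$
$\text{time}_2 = \frac{(10\text{ mil} \times 1) + (1\text{ mil} \times 2) + (1\text{ mil} \times 3) \text{ cycles}}{100 \times 10^6 \text{ cycles/sec}} = \frac{(10 + 2 + 3) \times 10^6}{100 \times 10^6} \text{ sec} = \frac{15}{100} \text{ sec} = 0.15 \text{ sec}$
Code 1 is faster than Code 2 with respect to execution time.

Answer to Q2
Q2: Which code will be faster according to MIPS?
$\text{MIPS} = \frac{\text{number of instructions}}{\text{execution time} \times 10^6}$
$\text{MIPS}_1 = \frac{7 \times 10^6 \text{ instr}}{0.1 \text{ sec} \times 10^6} = 70 \text{ MIPS}$
$\text{MIPS}_2 = \frac{12 \times 10^6 \text{ instr}}{0.15 \text{ sec} \times 10^6} = 80 \text{ MIPS}$
Code 2 is faster with respect to MIPS!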
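Both answers can be checked with a short script. This is a sketch using the numbers from the example; the names `CPI`, `CODE_1`, `CODE_2`, `execution_time`, and `mips` are chosen here for illustration.

```python
CLOCK_RATE_HZ = 100e6           # 100 MHz machine
CPI = {"A": 1, "B": 2, "C": 3}  # cycles per instruction for each class

# Instruction counts produced by each compiler.
CODE_1 = {"A": 5e6, "B": 1e6, "C": 1e6}
CODE_2 = {"A": 10e6, "B": 1e6, "C": 1e6}


def execution_time(counts):
    """Seconds = total clock cycles / clock rate."""
    cycles = sum(n * CPI[cls] for cls, n in counts.items())
    return cycles / CLOCK_RATE_HZ


def mips(counts):
    """Millions of instructions per second = instruction count / (time * 10^6)."""
    return sum(counts.values()) / (execution_time(counts) * 1e6)


for name, code in (("Code 1", CODE_1), ("Code 2", CODE_2)):
    print(f"{name}: {execution_time(code):.2f} s, {mips(code):.0f} MIPS")
```

Code 2 earns the higher MIPS rating only because it executes more of the cheap Class A instructions; it still takes longer to run, which is why execution time is the metric to trust.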

Remarks from Chapter 1
- The performance/cost ratio keeps improving
  - Due to underlying technology development
- Hierarchical layers of abstraction exist in both hardware and software
- Instruction set architecture (ISA)
  - The hardware/software interface
- Execution time: the best performance measure!
- Power is a limiting factor
  - Hence we follow a different path now: use parallelism to improve performance

Your First Homework
1) Book exercises: 1.5, 1.6, 1.7, 1.9, 1.14.1 (typo: 10 -> 10 )
2) An additional question: Look at the most recent Top One supercomputer. Suppose each person can calculate one floating-point operation per second. How many days would it take for all the people on Earth to calculate what the Top One supercomputer can compute in only 1 second? The population can be rounded to the nearest billion.
Due: 11:59 PM on Wednesday, January 31.
No late homework will be accepted. Submit it through Canvas. The TA has already posted it to Canvas.