TDT 4260 lecture 2 spring semester 2015

1 TDT 4260 lecture 2 spring semester 2015 Lasse Natvig, The CARD group Dept. of computer & information science NTNU

2 Lecture overview
Chapter 1: Fundamentals of Quantitative Design and Analysis, continued
- Technology trends
- Power & energy, costs
- Performance, metrics
- Speedup, Amdahl's law

3 Trends in Technology
- Integrated circuit technology (Moore's Law)
  - Transistor density: 35%/year
  - Die size: 10-20%/year
  - Combined effect: 40-55%/year
- DRAM capacity: 25-40%/year (slowing)
- Flash capacity: 50-60%/year
  - 15-20X cheaper/bit than DRAM, but slower
- Magnetic disk capacity: 40%/year
  - 15-25X cheaper/bit than Flash
  - 300-500X cheaper/bit than DRAM
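To get a feel for what these annual rates mean, the sketch below converts a compound growth rate into a doubling time. The function names are illustrative and not from the lecture; the rates are the ones quoted on the slide.

```python
import math


def doubling_time_years(annual_growth: float) -> float:
    """Years needed to double at a compound annual growth rate (0.35 means 35%/year)."""
    return math.log(2.0) / math.log(1.0 + annual_growth)


def growth_factor(annual_growth: float, years: float) -> float:
    """Total improvement factor after compounding for the given number of years."""
    return (1.0 + annual_growth) ** years


# Rates taken from the slide; where the slide gives a range, the low end is used.
for name, rate in [
    ("Transistor density", 0.35),
    ("DRAM capacity (low end)", 0.25),
    ("Flash capacity (low end)", 0.50),
    ("Magnetic disk capacity", 0.40),
]:
    print(f"{name}: doubles in about {doubling_time_years(rate):.1f} years")
```

A rate of 35%/year corresponds to a doubling time of a little over two years, which is why small differences in annual rate add up to large gaps over a decade.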

4 Bandwidth and Latency
- Bandwidth or throughput
  - Total work done in a given time
  - 10,000-25,000X improvement for processors
  - 300-1200X improvement for memory and disks
- Latency or response time
  - Time between start and completion of an event
  - 30-80X improvement for processors
  - 6-8X improvement for memory and disks
- Numbers compare current systems with systems from the early 1980s
- Details in Fig 1.10 (1982-2010)

5 Latency Lags Bandwidth
[Figure: log-log plot of bandwidth and latency milestones]

6 Current Trends in Architecture
- Cannot continue to exploit more instruction-level parallelism (ILP)
  - Limited single-processor performance improvement since 2003
- New models for performance:
  - Data-level parallelism (DLP)
  - Thread-level parallelism (TLP)
  - Request-level parallelism (RLP)
- These forms of parallelism require explicit restructuring of the application
  - Increasing performance across technology generations is now, to a larger extent, up to the programmer

7 Transistors and Wires
- Feature size is the minimum size of a transistor or wire in the x- or y-dimension
  - Miniaturization: from 10 microns in 1971 to 0.032 microns (32 nm) in 2011
- Integration density scales quadratically
- Transistor performance scales linearly
- Wire delay does not improve with feature size, since it is proportional to the length of the wire
  - On-chip communication becomes an increasing problem
  - On-chip data locality becomes increasingly important

8 The Memory Wall
[Figure: The Processor-Memory Gap. Relative performance (log scale, 1 to 100,000) of processor performance and main memory latency by year, 1978-2010]
- Consequence: deeper memory hierarchies
  - P - Registers - L1 cache - L2 cache - L3 cache - Memory
- Complicates understanding of performance
  - Cache usage has an increasing influence on performance

9 I/O Pin Problem
- Number of I/O signaling pins
  - The number drives chip cost
  - High-frequency operation on the circuit board is a challenge
- Projections from ITRS (International Technology Roadmap for Semiconductors)
- From PACT paper by Huh, Burger and Keckler, 2001

10 Power and Energy
- Remember:
  - Energy (Joule) = Power (Watt) * time (second)
  - 1 Watt = 1 Joule/second
- Problem: Get power in, get power out
- Thermal Design Power (TDP)
  - Characterizes sustained power consumption
  - Used as target for power supply and cooling system
  - Lower than peak power, higher than average power consumption
- Clock rate can be reduced dynamically to limit power consumption
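The relation Energy = Power * time can be sketched in a few lines. The two machines and their numbers below are hypothetical, chosen only to show that the lower-power option is not automatically the lower-energy one.

```python
def energy_joules(power_watts: float, seconds: float) -> float:
    """Energy (J) = average power (W) * time (s)."""
    return power_watts * seconds


# Hypothetical example: the same job on two machines.
fast_machine = energy_joules(power_watts=100.0, seconds=10.0)  # higher power, shorter run
slow_machine = energy_joules(power_watts=60.0, seconds=20.0)   # lower power, longer run

print(f"fast machine: {fast_machine:.0f} J, slow machine: {slow_machine:.0f} J")
```

With these assumed numbers the machine that draws more power finishes the job with less energy, because it runs for half the time.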

11 TDP, a recent example
(Parallelization of a PARSEC application [in Workshop at SC 12])
- Red dashed line is the TDP of each processor
- Low-power Sandy Bridge Core i5 (laptop) runs close to TDP
- Application is not «challenging enough» for the server node Sandy Bridge EP
- Vectorization with SSE and AVX is very energy efficient! (much better performance, but almost free!)

12 More details in (Learn more --- not part of the course)
- "Performance and energy impact of parallelization and vectorization techniques in modern microprocessors", Juan M. Cebrian, Lasse Natvig, and Jan Christian Meyer, Journal of Computing, November 2013, pages 1-15.
- NTNU video from our PP4EE seminar: http://video.adm.ntnu.no/openvideo/pres/52526d7a44e9c

13 Dynamic Energy and Power
- Dynamic energy
  - Transistor switches from 0 → 1 or 1 → 0
  - The energy of a single transition is ½ x Capacitive load x Voltage²
- Dynamic power
  - ½ x Capacitive load x Voltage² x Frequency switched
- Reducing clock rate reduces power, not energy
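The claim that a lower clock rate reduces power but not energy can be checked numerically. This is a minimal sketch under simplifying assumptions: a task is modeled as a fixed number of transitions, and the capacitance, voltage and frequency values are made up for illustration.

```python
def transition_energy(cap_load_f: float, voltage_v: float) -> float:
    """Energy of one 0->1 or 1->0 transition: 1/2 * C * V^2 (joules)."""
    return 0.5 * cap_load_f * voltage_v ** 2


def dynamic_power(cap_load_f: float, voltage_v: float, freq_hz: float) -> float:
    """Dynamic power: 1/2 * C * V^2 * frequency switched (watts)."""
    return transition_energy(cap_load_f, voltage_v) * freq_hz


def task_energy(cap_load_f: float, voltage_v: float, freq_hz: float, transitions: float) -> float:
    """Energy for a task needing a fixed number of transitions (simplified model)."""
    run_time_s = transitions / freq_hz
    return dynamic_power(cap_load_f, voltage_v, freq_hz) * run_time_s


# Hypothetical values, for illustration only.
C, V, F, N = 1e-9, 1.0, 2e9, 1e9

print("power at F:      ", dynamic_power(C, V, F))
print("power at F/2:    ", dynamic_power(C, V, F / 2))       # half the power
print("energy at F:     ", task_energy(C, V, F, N))
print("energy at F/2:   ", task_energy(C, V, F / 2, N))      # same energy, twice the time
print("energy at 0.85 V:", task_energy(C, 0.85 * V, F, N))   # voltage enters quadratically
```

In this model only the voltage term changes the energy of the task, which is why lowering voltage and frequency together saves more than lowering frequency alone.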

14 Clock rates stopped increasing
- Intel 80386 consumed ~2 W
- 3.3 GHz Intel Core i7 consumes 130 W
- Heat must be dissipated from a 1.5 x 1.5 cm chip
- This is the limit of what can be cooled by air

15 Reducing Power
- Do nothing well
  - Alternative formulation: Only power on the units that are currently needed
- Dynamic Voltage-Frequency Scaling (DVFS)
  - Common in modern microprocessors
- Low power state for DRAM, disks
- Overclocking
  - Intel Turbo Boost Technology (TBT)
  - AMD PowerNow

16 Static Power
- Static power consumption
  - Power_static = Current_static x Voltage
  - Scales with number of transistors
  - Even if they are idle (but powered)
- Power gating
  - Turn off the power supply to units that are not in use
- Race to halt
  - Typical embedded systems (e.g. Nordic Semiconductor)
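One way to see why race to halt can pay off is to add static power to a simple energy model. The sketch below is built on assumptions that are not from the lecture: all wattages and times are hypothetical, and halving the clock is assumed to halve dynamic power without lowering the voltage.

```python
def static_power(static_current_a: float, voltage_v: float) -> float:
    """Static power = static current * voltage (watts)."""
    return static_current_a * voltage_v


def energy_over_period(active_power_w: float, active_time_s: float,
                       gated_power_w: float, period_s: float) -> float:
    """Energy over a fixed period: active phase, then power-gated for the rest."""
    gated_time_s = period_s - active_time_s
    return active_power_w * active_time_s + gated_power_w * gated_time_s


# Hypothetical numbers, for illustration only.
p_static = static_power(static_current_a=0.4, voltage_v=1.0)  # 0.4 W while powered
p_dynamic_full = 1.6   # W at full clock rate (assumed)
p_gated = 0.01         # W when power gated (assumed)
period = 10.0          # s between two jobs (assumed)

# Race to halt: full speed for 1 s, then gate the power.
race = energy_over_period(p_dynamic_full + p_static, 1.0, p_gated, period)
# Half clock rate, same voltage: dynamic power halves, the job takes 2 s.
slow = energy_over_period(p_dynamic_full / 2 + p_static, 2.0, p_gated, period)

print(f"race to halt: {race:.2f} J, half clock rate: {slow:.2f} J")
```

Under these assumptions the slow run pays static power for twice as long, so finishing early and gating the power uses less energy. If the voltage could also be lowered at the slower clock, the comparison could go the other way.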

17 Trends in Cost
- Cost driven down by learning curve
  - Yield
- DRAM: price closely tracks cost
- Microprocessors: price depends on volume
  - 10% less for each doubling of volume
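The rule of thumb "10% less for each doubling of volume" can be written as a small formula. The sketch below assumes the rule applies smoothly between doublings; the base price and volumes are hypothetical.

```python
import math


def price_at_volume(base_price: float, base_volume: float, volume: float,
                    drop_per_doubling: float = 0.10) -> float:
    """Price after scaling volume, losing a fixed fraction per doubling of volume."""
    doublings = math.log2(volume / base_volume)
    return base_price * (1.0 - drop_per_doubling) ** doublings


# Hypothetical example: a part priced at 100 when 1,000 units are produced.
for volume in (1_000, 2_000, 4_000, 16_000):
    print(f"{volume:>6} units: {price_at_volume(100.0, 1_000, volume):.2f}")
```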

18 Cost and COTS
- Cost to produce one unit includes (development cost / # sold units)
  - Benefit of large volume
- COTS: commodity off the shelf
  - Much better performance/price per component
  - Strong influence on the selection of components for building supercomputers for more than 20 years
- Recent example: Mont Blanc project and Exynos 5 (next two slides)

19 Mont Blanc project
- Philippo Mantovani
- Mont Blanc 1, Mont Blanc 2 (Mont Blanc 3?)
- Put Europe on the map of supercomputer vendors
- ARM is a partner
- Coordinated by UPC/BSC
- Alex Ramirez presented the Mont Blanc project at NTNU, November 2012: http://video.adm.ntnu.no/openvideo/pres/50c5c63af0f02

20 They consider using ARM Mali GPUs. Competition between Nvidia GPUs and Mali GPUs is still going on.

21 Back to Manufacturing ICs
- Yield: proportion of working dies per wafer

22 NEW TOPIC PERFORMANCE

23 Defining Performance
Which airplane has the best performance?
[Figure: four bar charts comparing Boeing 777, Boeing 747, BAC/Sud Concorde and Douglas DC-8-50 by passenger capacity, cruising range (miles), cruising speed (mph), and passengers x mph]

24 Response Time
- Book definition: Time from issuing a command to its completion
  - This is often referred to as the turn-around time
- Alternative response time definition: Time from issue to first response
- Execution time is the time the processor is busy executing the program
  - Turn-around time includes the time the process waits to be executed; execution time does not

25 Response Time and Throughput
- Throughput
  - Total work done per unit time
- How are response time and throughput affected by
  - Replacing the processor with a faster version?
  - Adding more processors?

26 Measuring Execution Time
- Elapsed time/wall clock time
  - Total turn-around time, including all aspects
  - Processing, I/O, OS overhead, idle time
  - Determines system performance
- CPU time
  - Time spent processing a given job
  - Discounts I/O time
  - Comprises user CPU time and system CPU time
  - Different programs are affected differently by CPU and system performance
- Time is the only unambiguous performance measure
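The difference between elapsed time and CPU time is easy to observe from a program. The sketch below uses Python's standard `time.perf_counter()` (wall clock) and `time.process_time()` (user plus system CPU time of the current process, excluding sleep). The helper `measure` and the workload `busy_then_idle` are illustrative names, not part of the lecture.

```python
import time


def measure(func, *args):
    """Run func and return (result, elapsed wall-clock seconds, CPU seconds)."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    result = func(*args)
    cpu_s = time.process_time() - cpu_start
    wall_s = time.perf_counter() - wall_start
    return result, wall_s, cpu_s


def busy_then_idle():
    """Hypothetical workload: some computation followed by waiting."""
    total = sum(i * i for i in range(200_000))  # counts as CPU time
    time.sleep(0.05)                            # counts only as elapsed time
    return total


result, wall_s, cpu_s = measure(busy_then_idle)
print(f"elapsed: {wall_s:.3f} s, CPU: {cpu_s:.3f} s")
```

The sleep stands in for I/O or idle time: it shows up in the elapsed time but not in the CPU time.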

27 Speedup
- Speedup = Performance of system / Performance of baseline
- Remember that Performance = 1 / Execution time
  - Speedup = Execution time of baseline / Execution time of system
- Parallel systems: Speedup = Parallel performance / Sequential performance
  - Note: Use the best sequential algorithm, not the parallel algorithm with p = 1
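The two speedup formulations are equivalent, as the following sketch shows. The execution times are hypothetical; for parallel systems, the baseline is meant to be the time of the best sequential algorithm.

```python
def performance(execution_time_s: float) -> float:
    """Performance = 1 / execution time."""
    return 1.0 / execution_time_s


def speedup(baseline_time_s: float, system_time_s: float) -> float:
    """Speedup = execution time of baseline / execution time of system."""
    return baseline_time_s / system_time_s


# Hypothetical timings: best sequential version vs. a parallel version.
best_sequential_s = 10.0
parallel_s = 2.5

print("speedup from times:      ", speedup(best_sequential_s, parallel_s))
print("speedup from performance:", performance(parallel_s) / performance(best_sequential_s))
```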

28 Superlinear speedup

29 Benchmarks
- Benchmark types
  - Kernels, toy programs and synthetic benchmarks
    - Disadvantage: Too easy for compiler writers and computer architects to cheat
    - BUT, can be very useful for understanding the interplay between architecture and software
  - Embedded benchmarks: EEMBC, MiBench, etc.
  - Desktop benchmarks: SPEC 2006, PARSEC, etc. NEW: ParVec from CARD-NTNU
  - Server benchmarks: TPC-C, TPC-H, etc.
- Computers should work well for a collection of programs
  - Average performance is the key metric
  - Benchmarks are often assembled into suites

30 Summarizing and Reporting Results
- Averages can be computed in different ways
  - Arithmetic mean
  - Harmonic mean
  - Geometric mean
- Complete and precise description of what you have measured and reported is mandatory!
- Reproducibility of experiments is very important
  - You should include enough information for an independent researcher to repeat your experiment
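The three means can give quite different answers on the same data. The sketch below uses Python's standard `statistics` module (the `geometric_mean` function needs Python 3.8 or later) on two hypothetical rates. It also computes the true overall rate when both programs do the same amount of work, which matches the harmonic mean.

```python
from statistics import geometric_mean, harmonic_mean, mean


def overall_rate(rates, work_per_program: float = 1.0) -> float:
    """True aggregate rate when every program performs the same amount of work."""
    total_work = work_per_program * len(rates)
    total_time = sum(work_per_program / r for r in rates)
    return total_work / total_time


# Hypothetical rates (e.g. operations per second) for two programs.
rates = [10.0, 40.0]

print("arithmetic mean:", mean(rates))
print("harmonic mean:  ", harmonic_mean(rates))
print("geometric mean: ", geometric_mean(rates))
print("overall rate:   ", overall_rate(rates))
```

As a rule of thumb, the arithmetic mean suits execution times, the harmonic mean suits rates, and the geometric mean is commonly used for normalized ratios. Whichever is used, the choice should be stated alongside the results.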

31 Smith, CACM, Oct. 1988

32 Principles of Computer Design
- Take Advantage of Parallelism
  - e.g. multiple processors, overlap computation with communication/data retrieval, memory banks, pipelining, multiple functional units
- Principle of Locality
  - Reuse of data and instructions
- Focus on the Common Case
  - Amdahl's Law (demonstrates how much a serial part of an application limits its parallelization)
  - Sometimes called pessimistic with respect to parallelization

33 Amdahl's Law (1967) (fixed problem size)
- If a fraction s of a (uniprocessor) computation is inherently serial, the speedup is at most 1/s
- Total work in computation: serial fraction s, parallel fraction p, with s + p = 1 (100%)
- $S(n) = \frac{\mathrm{Time}(1)}{\mathrm{Time}(n)} = \frac{s + p}{s + p/n} = \frac{1}{s + (1 - s)/n} = \frac{n}{1 + (n - 1)s}$
- Pessimistic and famous
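Amdahl's law is easy to explore numerically. The sketch below evaluates both forms of the formula and shows the speedup approaching the 1/s bound. The serial fraction of 10% is a hypothetical example value.

```python
def amdahl_speedup(s: float, n: int) -> float:
    """Speedup on n processors with serial fraction s: 1 / (s + (1 - s) / n)."""
    return 1.0 / (s + (1.0 - s) / n)


def amdahl_speedup_alt(s: float, n: int) -> float:
    """Equivalent form: n / (1 + (n - 1) * s)."""
    return n / (1.0 + (n - 1) * s)


# Hypothetical example: 10% of the computation is inherently serial.
s = 0.10
for n in (1, 2, 10, 100, 1000):
    print(f"n = {n:>4}: speedup = {amdahl_speedup(s, n):.2f} (bound 1/s = {1.0 / s:.0f})")
```

With s = 0.10 the speedup never exceeds 10, no matter how many processors are added.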

34 Gustafson's law (1987) (scaled problem size, fixed execution time)
- Total execution time on a parallel computer with n processors is fixed
- Serial fraction s', parallel fraction p', with s' + p' = 1 (100%)
- $S'(n) = \frac{\mathrm{Time}'(1)}{\mathrm{Time}'(n)} = \frac{s' + p'n}{s' + p'} = s' + p'n = s' + (1 - s')n = n + (1 - n)s'$
- "Reevaluating Amdahl's law", John L. Gustafson, CACM, May 1988, pp. 532-533
- Not a new law, but Amdahl's law with changed assumptions
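The scaled speedup can be evaluated the same way. In this sketch the serial fraction is measured on the parallel system, here named `s_prime`, and the problem size is assumed to grow with n so that the parallel execution time stays fixed. The 10% value is hypothetical.

```python
def gustafson_speedup(s_prime: float, n: int) -> float:
    """Scaled speedup on n processors: n + (1 - n) * s_prime."""
    return n + (1 - n) * s_prime


def gustafson_speedup_alt(s_prime: float, n: int) -> float:
    """Equivalent form: s_prime + (1 - s_prime) * n."""
    return s_prime + (1.0 - s_prime) * n


# Hypothetical example: 10% of the time on the parallel machine is serial.
s_prime = 0.10
for n in (1, 2, 10, 100, 1000):
    print(f"n = {n:>4}: scaled speedup = {gustafson_speedup(s_prime, n):.1f}")
```

Unlike the fixed-size case, the scaled speedup keeps growing linearly with n, because the parallel part of the work grows with the machine.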