Lecture 1: Introduction

CprE 581 Computer Systems Architecture, Fall 2016
Reading: Textbook, Ch. 1.1-1.7
Text: Computer Architecture: A Quantitative Approach, Fifth Edition

Contemporary Computer Architecture
- Instruction set architecture
- Microarchitecture; examples: pipeline structures, cache memories
- Implementations; example: logic design and synthesis

Fundamentals of Quantitative Design and Analysis
- Technology trends
- Performance evaluation methodologies
- Instruction set architecture

Traditional Computer Architecture
- "The term architecture is used here to describe the attributes of a system as seen by the programmer, i.e., the conceptual structure and functional behavior, as distinct from the organization of the data flow and controls, the logic design, and the physical implementation." - Gene Amdahl, IBM Journal of R&D, April 1964

Growth in Processor Performance
- Two major driving forces: technology and architecture
- Technology: transistor speed improves with each generation of VLSI process technology
- Architecture innovations: RISC, instruction-level parallelism, cache memories
- 1986-2002: >50% annual performance improvement!

Technology Drivers for High Performance
- VLSI technology: faster transistors and larger transistor budgets
- Process nodes: 250nm, 130nm, 90nm, 65nm, 45nm, 32nm, 22nm, 14nm, 10nm (2016), then 5-7nm
- Transistor counts: 2,300 in the Intel 4004; ~400M in the Intel Core 2 Extreme; 2.6B in the Intel Xeon Westmere (2011 release, 32nm); roughly 4B today
- How to make effective use of those transistors?

How to Use One Billion Transistors
- Bit-level parallelism: move from 32-bit to 64-bit
- Instruction-level parallelism (ILP): deep pipelines, execute multiple instructions per cycle
- Program locality: large caches, more branch-prediction resources
- Thread-level parallelism (TLP): single-processor performance has flattened, so move to multiprocessors

New Models for Performance
- Data-level parallelism (DLP)
- Thread-level parallelism (TLP)
- Request-level parallelism (RLP)
- These require explicit restructuring of the application

CPU Performance
- For a sequential program: CPU time = #instructions x CPI x clock cycle time
- To improve performance: faster clock, fewer instructions, lower CPI (i.e., higher IPC)

Classes of Computers
- Personal mobile devices (PMDs), e.g. smartphones and tablet computers: emphasis on energy efficiency and real-time response
- Desktop computing: emphasis on price-performance
- Servers: emphasis on availability, scalability, throughput
- Clusters / warehouse-scale computers, used for Software as a Service (SaaS): emphasis on availability and price-performance; sub-class: supercomputers, with emphasis on floating-point performance and fast internal networks
- Embedded computers: emphasis on price

Parallelism
- Classes of parallelism in applications: data-level parallelism (DLP), task-level parallelism (TLP)
- Classes of architectural parallelism: instruction-level parallelism (ILP), vector architectures / graphics processing units (GPUs), thread-level parallelism, request-level parallelism

Trends in Technology
- Integrated circuit technology: transistor density +35%/year; die size +10-20%/year; overall integration +40-55%/year
- DRAM capacity: +25-40%/year (slowing)
- Flash capacity: +50-60%/year; 15-20X cheaper per bit than DRAM
- Magnetic disk technology: +40%/year; 15-25X cheaper per bit than Flash; 300-500X cheaper per bit than DRAM

Flynn's Taxonomy
- Single instruction stream, single data stream (SISD)
- Single instruction stream, multiple data streams (SIMD): vector architectures, multimedia extensions, graphics processing units
- Multiple instruction streams, single data stream (MISD): no commercial implementation
- Multiple instruction streams, multiple data streams (MIMD): tightly coupled and loosely coupled

Bandwidth and Latency
- Bandwidth (throughput): total work done in a given time; 10,000-25,000X improvement for processors, 300-1200X for memory and disks
- Latency (response time): time between start and completion of an event; 30-80X improvement for processors, 6-8X for memory and disks
- [Figure: log-log plot of bandwidth and latency milestones, 1982-2010]

Defining Computer Architecture
- Old view: instruction set architecture (ISA) design, i.e. decisions regarding registers, memory addressing, addressing modes, instruction operands, available operations, control-flow instructions, and instruction encoding
- Real computer architecture: meet the specific requirements of the target machine; design to maximize performance within the constraints of cost, power, and availability; includes the ISA, the microarchitecture, and the hardware

Transistors and Wires
- Feature size: minimum size of a transistor or wire in the x or y dimension
- From 10 microns in 1971 to 0.032 microns (32nm) in 2011, to 22nm (0.022 microns) in 2012; 14nm in 2014-15; 10nm in 2016
- Transistor performance scales roughly linearly with feature size
- Integration density scales quadratically
- Wire delay does not improve with feature size!

Power
- The Intel 80386 consumed ~2 W; a 3.3 GHz Intel Core i7 consumes 130 W
- That heat must be dissipated from a 1.5 cm x 1.5 cm chip, which is about the limit of what can be cooled by air

Power and Energy
- Problem: get power in, get power out
- Thermal Design Power (TDP): characterizes sustained power consumption; used as the target for the power supply and cooling system; lower than peak power, higher than average power consumption
- Clock rate can be reduced dynamically to limit power consumption
- Energy per task is often a better measurement than power

Dynamic Energy and Power
- Dynamic energy (per transistor switch, 0 -> 1 or 1 -> 0): 1/2 x capacitive load x voltage^2
- Dynamic power: 1/2 x capacitive load x voltage^2 x frequency switched
- Reducing the clock rate reduces power, but not energy

Static Power
- Static power consumption: static current x voltage
- Scales with the number of transistors
- To reduce it: power gating

Reducing Power
- Do nothing well
- Dynamic voltage-frequency scaling (DVFS)
- Low-power states for DRAM and disks
- Overclocking some cores while turning off others

Trends in Cost
- Cost is driven down by the learning curve (improving yield)
- DRAM: price closely tracks cost
- Microprocessors: price depends on volume; roughly 10% less for each doubling of volume

Integrated Circuit Cost
- Bose-Einstein formula: die yield = wafer yield x 1 / (1 + defects per unit area x die area)^N
- Defects per unit area = 0.016-0.057 defects per square cm (2010)
- N = process-complexity factor = 11.5-15.5 (40 nm, 2010)

Measuring Performance
- Typical performance metrics: response time, throughput
- Speedup of X relative to Y = execution time of Y / execution time of X
- Execution time: wall-clock time (includes all system overheads) vs. CPU time (computation time only)
- Benchmarks: kernels (e.g. matrix multiply), toy programs (e.g. sorting), synthetic benchmarks (e.g. Dhrystone), benchmark suites (e.g. SPEC06fp, TPC-C)

Principles of Computer Design
- Take advantage of parallelism: e.g. multiple processors, disks, memory banks, pipelining, multiple functional units
- Principle of locality: reuse of data and instructions
- Focus on the common case
- Amdahl's Law
- The processor performance equation

Dependability
- Module reliability: mean time to failure (MTTF), mean time to repair (MTTR)
- Mean time between failures (MTBF) = MTTF + MTTR
- Availability = MTTF / MTBF

Instruction-Level Parallelism
- Different instruction types have different CPIs
- Example loop:

    for (i = 0; i < n; i++)
        x[i] = a * x[i];

    // let R3 = &x[0], R4 = &x[n], and F0 = a
    LOOP: L.D   F2, 0(R3)
          MUL.D F2, F2, F0
          S.D   F2, 0(R3)
          DADDI R3, R3, #8
          BNE   R3, R4, LOOP

- Which instructions are parallel? How should those instructions be scheduled?

Exploiting Instruction-Level Parallelism
- Pipelining: the first step toward instruction-level parallelism
- Find independent instructions through dependence analysis
- Hardware approaches => dynamically scheduled superscalar; the most common today: the Intel Pentium, AMD, IBM Power, Sun UltraSPARC, and MIPS families
- Software approaches => (1) statically scheduled superscalar, or (2) VLIW

Modern Superscalar Processors
- Pipeline + multi-issue; examples: Intel Core and Core 2, IBM Power/PowerPC, Sun UltraSPARC, SGI MIPS
- Multi-issue and deep pipelining
- Dynamic scheduling and speculative execution
- High-bandwidth L1 caches and large L2/L3 caches

Modern Superscalar Processors: Learning Challenges
- Complexity! Many design ideas are no longer intuitive
- How does a given idea improve performance? Does the processor still work correctly?
- Must keep the big picture in mind

What a Modern Superscalar Processor Maintains
- Register data flow: register renaming, instruction scheduling
- Control flow: branch prediction, speculative execution and recovery
- Memory data flow: load and store queues, memory dependence speculation

Memory Subsystem
- Many applications are memory-bound
- CPU speed increases quickly; memory speed cannot keep up
- The cost of one cache miss can be close to executing 1,000 instructions
- For a memory-intensive program: if the CPU spends 60% of its time waiting for memory, what's the point of speeding the CPU up by 10x?

Memory System Performance
- Memory stall CPI = #misses per instruction x miss penalty in cycles = % memory instructions x miss rate x miss penalty
- Example: assume 30% memory instructions, a 2% miss rate, and a 300-cycle miss penalty. What is the memory stall CPI?

Cache Design
- A typical memory hierarchy today (faster toward the top, bigger toward the bottom): processor/registers, L1 cache, L2 cache, L3 cache (optional), main memory, disk/tape, etc.
- The cache hierarchy exploits program locality
- We will study: basic principles of cache design, hardware cache optimizations, application cache optimizations, prefetching techniques, and virtual memory
- The focus here is on the L1/L2/L3 caches, virtual memory, and main memory

Obstacles to Performance Improvement
- Annual performance improvement has dropped to about 20% since 2002
- Limited ILP left in programs to exploit
- Fast increase of memory latency relative to processor speed
- Fast increase of processor power consumption
- Solution: build multiprocessors

Multiprocessor Systems
- Must exploit thread-level parallelism for further performance improvement
- Shared-memory multiprocessors: cooperating programs see the same memory address space
- How to build them? Cache coherence and memory consistency

Multi-Core Processors
- Just like a multiprocessor, but on a single chip
- The natural way to utilize billions of transistors
- Better performance/power numbers
- Require multiprogramming or parallel workloads
- Cannot ignore the demand for high memory bandwidth

Emerging Issues in Computer Design
- Power consumption must be controlled: a return to simpler processor designs for power efficiency; low-power designs in caches and main memory
- Overheating is becoming a serious issue: thermal sensors in many places (processor, memory, hard drives); new techniques for thermal control and thermal management
- Reliability is decreasing with finer feature sizes: soft errors, long-term durability
- Process variation in fabrication complicates processor design: reliability, performance, power, thermal
- Novel support for parallel processing is in demand: new cache coherence protocols, transactional memory, heterogeneous multicore designs
- And many other issues

Why Study Computer Architecture?
- As a hardware designer/researcher: know how to design processors, caches, storage, graphics, interconnects, and so on
- As a system designer: know how to build a computer system using the best components available
- As a software designer: know how to get the best performance from the hardware

Blackboard Learn and Class Web Site
- All class materials will be on the class web site; some, including grades, on Blackboard Learn
- Syllabus
- Lecture schedule
- Homework assignments
- Readings
- Grades
- Discussions