Lecture 1: Introduction

CprE 581 Computer Systems Architecture, Fall 2016
Reading: Textbook, Ch. 1.1-1.7
Text: Computer Architecture: A Quantitative Approach, Fifth Edition

Contemporary Computer Architecture
- Instruction set architecture
- Microarchitecture; examples: pipeline structures, cache memories
- Implementations; example: logic design and synthesis

Fundamentals of Quantitative Design and Analysis
- Technology trends
- Performance evaluation methodologies
- Instruction set architecture

Traditional Computer Architecture
- "The term architecture is used here to describe the attributes of a system as seen by the programmer, i.e., the conceptual structure and functional behavior, as distinct from the organization of the data flow and controls, the logic design, and the physical implementation." - Gene Amdahl, IBM Journal of R&D, April 1964

Growth in Processor Performance
- Two major driving forces: technology and architecture
- Technology: transistor speed improves with each generation of VLSI process technology
- Architecture innovations: RISC, instruction-level parallelism, cache memories
- 1986-2002: >50% annual performance improvement!

Technology Drivers for High Performance
- VLSI technology: faster transistors and larger transistor budgets
- Process nodes: 250nm, 130nm, 90nm, 65nm, 45nm, 32nm, 22nm, 14nm, 10nm (2016), then 5-7nm
- Transistor counts: 2,300 in the Intel 4004; ~400M in the Intel Core 2 Extreme; 2.6B in the Intel Xeon Westmere (2011 release, 32nm); roughly 4B today
- How to make effective use of those transistors?

How to Use One Billion Transistors
- Bit-level parallelism: move from 32-bit to 64-bit
- Instruction-level parallelism (ILP): deep pipelines, execute multiple instructions per cycle
- Program locality: large caches, more branch-prediction resources
- Thread-level parallelism (TLP): single-processor performance has flattened, so move to multiprocessors

New Models for Performance
- Data-level parallelism (DLP)
- Thread-level parallelism (TLP)
- Request-level parallelism (RLP)
- These require explicit restructuring of the application

CPU Performance
- For a sequential program: CPU time = #instructions x CPI x clock cycle time
- To improve performance: faster clock, fewer instructions, lower CPI (i.e., higher IPC)

Classes of Computers
- Personal mobile devices (PMDs), e.g. smartphones and tablet computers: emphasis on energy efficiency and real-time response
- Desktop computing: emphasis on price-performance
- Servers: emphasis on availability, scalability, throughput
- Clusters / warehouse-scale computers, used for Software as a Service (SaaS): emphasis on availability and price-performance; sub-class: supercomputers, with emphasis on floating-point performance and fast internal networks
- Embedded computers: emphasis on price

Parallelism
- Classes of parallelism in applications: data-level parallelism (DLP), task-level parallelism (TLP)
- Classes of architectural parallelism: instruction-level parallelism (ILP), vector architectures / graphics processing units (GPUs), thread-level parallelism, request-level parallelism

Trends in Technology
- Integrated circuit technology: transistor density +35%/year; die size +10-20%/year; overall integration +40-55%/year
- DRAM capacity: +25-40%/year (slowing)
- Flash capacity: +50-60%/year; 15-20X cheaper per bit than DRAM
- Magnetic disk technology: +40%/year; 15-25X cheaper per bit than Flash; 300-500X cheaper per bit than DRAM

Flynn's Taxonomy
- Single instruction stream, single data stream (SISD)
- Single instruction stream, multiple data streams (SIMD): vector architectures, multimedia extensions, graphics processing units
- Multiple instruction streams, single data stream (MISD): no commercial implementation
- Multiple instruction streams, multiple data streams (MIMD): tightly coupled and loosely coupled

Bandwidth and Latency
- Bandwidth (throughput): total work done in a given time; 10,000-25,000X improvement for processors, 300-1200X for memory and disks
- Latency (response time): time between start and completion of an event; 30-80X improvement for processors, 6-8X for memory and disks
- [Figure: log-log plot of bandwidth and latency milestones, 1982-2010]

Defining Computer Architecture
- Old view: instruction set architecture (ISA) design, i.e. decisions regarding registers, memory addressing, addressing modes, instruction operands, available operations, control-flow instructions, and instruction encoding
- Real computer architecture: meet the specific requirements of the target machine; design to maximize performance within the constraints of cost, power, and availability; includes the ISA, the microarchitecture, and the hardware

Transistors and Wires
- Feature size: minimum size of a transistor or wire in the x or y dimension
- From 10 microns in 1971 to 0.032 microns (32nm) in 2011, to 22nm (0.022 microns) in 2012; 14nm in 2014-15; 10nm in 2016
- Transistor performance scales roughly linearly with feature size
- Integration density scales quadratically
- Wire delay does not improve with feature size!

Power
- The Intel 80386 consumed ~2 W; a 3.3 GHz Intel Core i7 consumes 130 W
- That heat must be dissipated from a 1.5 cm x 1.5 cm chip, which is about the limit of what can be cooled by air

Power and Energy
- Problem: get power in, get power out
- Thermal Design Power (TDP): characterizes sustained power consumption; used as the target for the power supply and cooling system; lower than peak power, higher than average power consumption
- Clock rate can be reduced dynamically to limit power consumption
- Energy per task is often a better measurement than power

Dynamic Energy and Power
- Dynamic energy (per transistor switch, 0 -> 1 or 1 -> 0): 1/2 x capacitive load x voltage^2
- Dynamic power: 1/2 x capacitive load x voltage^2 x frequency switched
- Reducing the clock rate reduces power, but not energy

Static Power
- Static power consumption: static current x voltage
- Scales with the number of transistors
- To reduce it: power gating

Reducing Power
- Do nothing well
- Dynamic voltage-frequency scaling (DVFS)
- Low-power states for DRAM and disks
- Overclocking some cores while turning off others

Trends in Cost
- Cost is driven down by the learning curve (improving yield)
- DRAM: price closely tracks cost
- Microprocessors: price depends on volume; roughly 10% less for each doubling of volume

Integrated Circuit Cost
- Bose-Einstein formula: die yield = wafer yield x 1 / (1 + defects per unit area x die area)^N
- Defects per unit area = 0.016-0.057 defects per square cm (2010)
- N = process-complexity factor = 11.5-15.5 (40 nm, 2010)

Measuring Performance
- Typical performance metrics: response time, throughput
- Speedup of X relative to Y = execution time of Y / execution time of X
- Execution time: wall-clock time (includes all system overheads) vs. CPU time (computation time only)
- Benchmarks: kernels (e.g. matrix multiply), toy programs (e.g. sorting), synthetic benchmarks (e.g. Dhrystone), benchmark suites (e.g. SPEC06fp, TPC-C)

Principles of Computer Design
- Take advantage of parallelism: e.g. multiple processors, disks, memory banks, pipelining, multiple functional units
- Principle of locality: reuse of data and instructions
- Focus on the common case
- Amdahl's Law
- The processor performance equation

Dependability
- Module reliability: mean time to failure (MTTF), mean time to repair (MTTR)
- Mean time between failures (MTBF) = MTTF + MTTR
- Availability = MTTF / MTBF

Instruction-Level Parallelism
- Different instruction types have different CPIs
- Example loop:

    for (i = 0; i < n; i++)
        x[i] = a * x[i];

    // let R3 = &x[0], R4 = &x[n], and F0 = a
    LOOP: L.D   F2, 0(R3)
          MUL.D F2, F2, F0
          S.D   F2, 0(R3)
          DADDI R3, R3, #8
          BNE   R3, R4, LOOP

- Which instructions are parallel? How should those instructions be scheduled?

Exploiting Instruction-Level Parallelism
- Pipelining: the first step toward instruction-level parallelism
- Find independent instructions through dependence analysis
- Hardware approaches => dynamically scheduled superscalar; the most common today: the Intel Pentium, AMD, IBM Power, Sun UltraSPARC, and MIPS families
- Software approaches => (1) statically scheduled superscalar, or (2) VLIW

Modern Superscalar Processors
- Pipeline + multi-issue; examples: Intel Core and Core 2, IBM Power/PowerPC, Sun UltraSPARC, SGI MIPS
- Multi-issue and deep pipelining
- Dynamic scheduling and speculative execution
- High-bandwidth L1 caches and large L2/L3 caches

Modern Superscalar Processors: Learning Challenges
- Complexity! Many design ideas are no longer intuitive
- How does a given idea improve performance? Does the processor still work correctly?
- Must keep the big picture in mind

What a Modern Superscalar Processor Maintains
- Register data flow: register renaming, instruction scheduling
- Control flow: branch prediction, speculative execution and recovery
- Memory data flow: load and store queues, memory dependence speculation

Memory Subsystem
- Many applications are memory-bound
- CPU speed increases quickly; memory speed cannot keep up
- The cost of one cache miss can be close to executing 1,000 instructions
- For a memory-intensive program: if the CPU spends 60% of its time waiting for memory, what's the point of speeding the CPU up by 10x?

Memory System Performance
- Memory stall CPI = #misses per instruction x miss penalty in cycles = % memory instructions x miss rate x miss penalty
- Example: assume 30% memory instructions, a 2% miss rate, and a 300-cycle miss penalty. What is the memory stall CPI?

Cache Design
- A typical memory hierarchy today (faster toward the top, bigger toward the bottom): processor/registers, L1 cache, L2 cache, L3 cache (optional), main memory, disk/tape, etc.
- The cache hierarchy exploits program locality
- We will study: basic principles of cache design, hardware cache optimizations, application cache optimizations, prefetching techniques, and virtual memory
- The focus here is on the L1/L2/L3 caches, virtual memory, and main memory

Obstacles to Performance Improvement
- Annual performance improvement has dropped to about 20% since 2002
- Limited ILP left in programs to exploit
- Fast increase of memory latency relative to processor speed
- Fast increase of processor power consumption
- Solution: build multiprocessors

Multiprocessor Systems
- Must exploit thread-level parallelism for further performance improvement
- Shared-memory multiprocessors: cooperating programs see the same memory address space
- How to build them? Cache coherence and memory consistency

Multi-Core Processors
- Just like a multiprocessor, but on a single chip
- The natural way to utilize billions of transistors
- Better performance/power numbers
- Require multiprogramming or parallel workloads
- Cannot ignore the demand for high memory bandwidth

Emerging Issues in Computer Design
- Power consumption must be controlled: a return to simpler processor designs for power efficiency; low-power designs in caches and main memory
- Overheating is becoming a serious issue: thermal sensors in many places (processor, memory, hard drives); new techniques for thermal control and thermal management
- Reliability is decreasing with finer feature sizes: soft errors, long-term durability
- Process variation in fabrication complicates processor design: reliability, performance, power, thermal
- Novel support for parallel processing is in demand: new cache coherence protocols, transactional memory, heterogeneous multicore designs
- And many other issues

Why Study Computer Architecture?
- As a hardware designer/researcher: know how to design processors, caches, storage, graphics, interconnects, and so on
- As a system designer: know how to build a computer system using the best components available
- As a software designer: know how to get the best performance from the hardware

Blackboard Learn and Class Web Site
- All class materials will be on the class web site; some, including grades, on Blackboard Learn
- Syllabus
- Lecture schedule
- Homework assignments
- Readings
- Grades
- Discussions