POWER3: Next Generation 64-bit PowerPC Processor Design
Authors: Mark Papermaster, Robert Dinkjian, Michael Mayfield, Peter Lenk, Bill Ciarfella, Frank O'Connell, Raymond DuPont
High End Processor Design, IBM Server Group Development, Austin, Texas

Abstract

IBM's new POWER3 microprocessor integrates the high-bandwidth and floating point capabilities of its POWER2 architecture predecessor into a fully scalable 64-bit PowerPC* symmetric multiprocessor (SMP) implementation. Based on PowerPC Architecture*, this microprocessor contains the fundamental design features that are planned to be used in the CPUs for the next three generations of RISC System/6000* targeted at the numeric intensive computing (NIC), high-end analysis, graphics, commercial workstation and server markets. This paper provides an overview of how processor microarchitecture, silicon technology, packaging technology, and systems architecture can be leveraged to produce outstanding high-performance computational capabilities. What follows is a description of the processor design point, the execution core, and key features, such as hardware prefetch, that reduce latency to memory.

Design

The POWER3 microprocessor objectives were to continue the POWER2 architecture tradition of bringing real solutions to the high compute needs of IBM RISC System/6000 customers, while adding 64-bit addressability, double-word integer operations, and symmetric multiprocessor support in the PowerPC Architecture. To satisfy compute intensive requirements, the POWER3 design contains a highly superscalar core which comprises eight execution units, fed by a high bandwidth memory interface supporting four floating point operations per cycle.

The technology strategy of the POWER3 design was to produce a highly sophisticated processor core and memory subsystem in an advanced, but well-established technology. The POWER3-II design is the next step, planned to result in an increase of frequency by up to 50% by tuning the design and moving into IBM's cutting-edge copper technology, CMOS7S. The POWER3-III design is step three, with plans for increased frequency by as much as 40% over the POWER3-II architecture, with more design tuning combined with a move to IBM's newest breakthrough technology, Silicon on Insulator (SOI).

Processor Overview

Figure 1 shows the block diagram of the POWER3 processor, which comprises eight execution units, a 32 KB instruction cache, a 64 KB data cache, and an on-board bus interface unit (BIU) that controls both the L2 bus interface and the memory bus interface. Two of the three fixed point units (FXUs) execute the bulk of the integer arithmetic instructions in a single cycle. The third unit executes the multi-cycle integer instructions such as multiply and divide. The two floating point units (FPUs) are fully independent, each containing dedicated hardware for square root and divide routines as well as fused multiply-add instruction execution. The FPUs are fully pipelined with three cycle latency and single cycle throughput. Two load/store units provide the data to sustain four floating point operations per cycle. A 16-entry store queue buffer prevents stores from stalling the machine while loads are being performed. Loads are also executed speculatively, improving data throughput. The branch execution unit employs dynamic branch prediction, with four pending predicted branches supported. The branch target address cache contains 256 entries (128 by 2-way associative), and the branch history table has 2048 entries. The instructions are speculatively executed with a unique register renaming scheme that involves a total of 64 virtual rename registers (32 fixed point and 32 floating point), and a total of 40 physical rename registers actually implemented (16 fixed point and 24 floating point).

Figure 1. POWER3 Block Diagram

The on-board BIU contains the interface logic supporting up to 16 Mbytes of L2, 6XX system bus protocols, and dedicated hardware to reduce latency to memory. Containing 15 million transistors, the POWER3 processor die is shown in Figure 2. It is manufactured in IBM's 0.25 micron hybrid CMOS 6S2 technology, with five levels of interconnect metallurgy.

Figure 2. POWER3 processor die photo

System Level Bandwidth

A key challenge of the POWER3 processor was to design a high bandwidth system interface to feed a wide superscalar processor core. Using the high I/O count of IBM packaging technology, the POWER3 processor was implemented with a separate, independent 16-byte memory bus and 32-byte L2 bus, each with separate address, data, and control lines, achieving 6.4 GBps to the L2 at 200 MHz. As an example, Figure 3 shows the POWER3 processor capability in comparison to other high-end processors shipping today with the STREAM memory benchmark. This benchmark defines execution to be out of main memory and not L2 cache. When applications are executed out of the L2 cache, the POWER3 processor will perform even faster.

Figure 3. High Bandwidth Performance (STREAM memory bandwidth in MB/s for systems from DEC, HP, SGI, and Sun compared with the 43P Model 260; uniprocessor STREAM data as of 9/98)

High Bandwidth: Data Cache

Figure 4 is a block diagram of the data memory subsystem. The 64 KB data cache is implemented as a Content Addressable Memory (CAM) based cache with a long line size (128 bytes).
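The 6.4 GBps L2 figure quoted under System Level Bandwidth can be checked with simple arithmetic. The sketch below is illustrative only: the function name is hypothetical, and it assumes one full-width transfer per bus clock with the L2 bus clocked at the 200 MHz quoted in the text. The memory bus clock is not stated in this paper, so only the L2 bus is evaluated.

```python
def peak_bandwidth_gbps(bus_width_bytes: int, clock_mhz: float) -> float:
    """Peak bus bandwidth in GB/s (10**9 bytes per second).

    Assumes one transfer of the full bus width on every bus clock,
    which is an idealization and not a statement about the real protocol.
    """
    bytes_per_second = bus_width_bytes * clock_mhz * 1e6
    return bytes_per_second / 1e9


if __name__ == "__main__":
    # 32-byte L2 bus at 200 MHz, both values taken from the text.
    l2_peak = peak_bandwidth_gbps(bus_width_bytes=32, clock_mhz=200)
    print(f"L2 bus peak: {l2_peak:.1f} GB/s")
```

Under these assumptions the 32-byte bus at 200 MHz works out to the 6.4 GBps stated in the text.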
The array is way set associative and eight-way interleaved (four-way by line and two-way by doubleword). The interleaving of the data cache effectively provides a multiported array function provided there is no access conflict between the subarray banks. The bandwidth and concurrency of operations in this data cache are impressive and achieve the goal of maintaining the high throughput of the predecessor POWER1 and POWER2 architecture processors,* while adding SMP and 64-bit addressability.

Figure 4. High Bandwidth Interface

The data cache has wide internal busing to perform the following highly parallel operations:
A) Eight-byte read for Load/Store #1
B) Eight-byte read for Load/Store #2
C) Eight-byte write for the Store Queue
D) 128-byte cache line write from the Cache Reload Buffer (CRB)
E) 64-byte half line read to the Cache Storeback Buffer (CSB)

The porting and controls of the data cache are such that (assuming no interleave collisions) any four of operations A through E can occur in the same cycle, with operations C and E being the only exclusive ones.

The byte CRB and the byte CSB create a pipelined interface with the BIU. This consists of a 32-byte bus that sends data from the BIU to the data cache CRB and a 16-byte bus that sends data from the data cache CSB to the BIU. The data cache was carefully designed to not be a bottleneck to system performance under any conditions.

High Bandwidth: Instruction Cache

Figure 5 shows the instruction cache block diagram. The 32 KB instruction cache is also way set associative, 2-way interleaved (on a line basis), with byte lines. The interleaving permits a byte cache write from the CRB to one interleave, while an eight instruction (32-byte) fetch is done to the Instruction Buffers from the other interleave. The instruction cache read has the additional feature of being able to access eight sequential instructions at a time from anywhere within a given line. This allows the instruction cache to send eight sequential instructions to the Instruction Buffer in a single cycle.

Figure 5. Instruction Processing

Decode-to-Completion Bandwidth

The high instruction bandwidth from the instruction cache is maintained throughout the instruction processing path of the POWER3 processor from instruction decode and dispatch to instruction completion. The Instruction Buffer can contain up to 12 instructions while the Dispatch Buffer can hold up to four instructions. If the Instruction Buffer is empty, the Dispatch Buffer can be loaded directly from the instruction cache. Up to four instructions can be dispatched per cycle. Dispatch is in order to the execution unit queues. Eight instructions can be issued from the execution unit queues to the eight execution units in one cycle. Issue and execution are out of order, with a total of 32 outstanding instructions tracked by the Completion Buffer. Up to four instructions can be completed per cycle from the Completion Buffer.

This instruction processing bandwidth gives the POWER3 processor a very high utilization efficiency, which is reflected in the outstanding performance on the Linpack 1000x1000 benchmark (TPP). (See performance section below.)

Reduced Latency Memory Subsystem

To ensure that potentially needed data and instructions are available to keep the core from stalling, the POWER3 processor designers invested in two key latency reduction techniques.
First, all caches are non-blocking. The instruction cache supports two outstanding misses, and the data cache supports up to four. Second, the POWER3 processor implements sequential instruction and data access detection algorithms in hardware, which permit the prefetch of cache lines to closer levels of the memory hierarchy. This reduces the negative performance impact of increasing memory latencies, particularly on technical workloads. These programs often access memory in regular, sequential patterns. The POWER3 processor prefetches up to four separate data streams with a depth of two to four lines for each stream. Compared with the base design without hardware prefetch, the prefetching engine improves sustained performance by greater than 2.5X on loops such as those found in double precision A times X plus Y (DAXPY). Programs with these regular, sequential patterns contained within the L2 cache will execute nearly as fast as if the data were contained in the L1 cache. Instructions are prefetched into the L1 cache up to one sequential line ahead of the line currently being accessed on the predicted path. These architectural features not only enhance performance for the current 200 MHz POWER3 processor, but they also enable higher frequency versions to scale well in performance.

System Implementation

The system interface is designed to allow flexibility in system implementation from low cost, bus-based systems to more complex switch-based configurations providing greater address and data bandwidth. The POWER3 processor design supports Modified Exclusive Shared Invalid (MESI) snoop-oriented SMP cache coherence along with remote processor bus protocols for increased throughput and large system topologies. The split transaction bus allows it to achieve up to 90% of available data bandwidth running a DAXPY type workload. This flexibility is possible because of IBM's advanced packaging technology, which allows for the POWER3 processor's 1088 I/O, including 748 signal I/O, to maintain the high bandwidth needed to support high frequency processors.

Figure 6. POWER3 Roadmap

Performance

The POWER3 processor excels in real application performance precisely because its many facilities combine to cover the wide spectrum of demands that characterize technical and commercial computing. Applications may be limited by the rate of computational speed or by the rate of data delivery to the computational units. They may be primarily fixed point intensive, primarily floating point intensive, or some combination of these characteristics. The POWER3 processor's well balanced design handles these challenges with its eight execution units, wide data paths, non-blocking cache and prefetch engine, and many other features.

Two standard benchmarks show the remarkable performance of the POWER3 processor. On the Linpack 1000x1000 (TPP) benchmark, the POWER3 processor (200 MHz) runs at 632 MFLOPS per CPU, and on the STREAM benchmark, the POWER3 processor sustains over 1.1 GBps memory bandwidth. The outstanding TPP performance illustrates the ability of the POWER3 processor to sustain close to peak floating point performance, while the STREAM benchmark proves the POWER3 processor's ability to sustain close to peak memory performance. Its SPECfp95 performance of 30.1 shows a combination of these attributes in running an entire application suite.

Due to its robust floating-point performance and high memory bandwidth, the POWER3 processor will also provide outstanding graphics performance. The RS/6000* 43P Model 260 with its POWER GXT3000P* Graphics Accelerator and 200 MHz POWER3 processor will yield an industry leading CDRS (OpenGL) benchmark rating of greater than 215, providing leadership performance in many CAD industry applications (1).

POWER3 processor-based RS/6000 systems will set new standards for application performance in the forthcoming years.
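To put the Linpack TPP result in context, the following sketch compares the quoted 632 MFLOPS with the theoretical peak implied by the figures in this paper (four floating point operations per cycle at 200 MHz). The function names are hypothetical and the calculation is a back-of-the-envelope estimate, not an IBM-published efficiency figure.

```python
def peak_mflops(clock_mhz: float, flops_per_cycle: int) -> float:
    """Theoretical peak in MFLOPS: clock rate (MHz) times flops issued per cycle."""
    return clock_mhz * flops_per_cycle


def fraction_of_peak(sustained_mflops: float, peak: float) -> float:
    """Sustained rate expressed as a fraction of the theoretical peak."""
    return sustained_mflops / peak


if __name__ == "__main__":
    # Values from the text: 200 MHz clock, four floating point operations
    # per cycle (two FPUs, each executing a fused multiply-add),
    # and 632 MFLOPS on Linpack TPP.
    peak = peak_mflops(clock_mhz=200, flops_per_cycle=4)
    ratio = fraction_of_peak(632, peak)
    print(f"peak = {peak:.0f} MFLOPS, Linpack TPP = {ratio:.0%} of peak")
```

With the numbers above, the peak comes to 800 MFLOPS, so the Linpack TPP result corresponds to roughly 79% of peak under these assumptions.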
Rev the Engine

Figure 6 shows the future roadmap of the POWER3 processor family. The second design point is well along in its implementation in IBM's industry leading CMOS7S process, which provides technology performance gains associated with shrinking channel lengths to 0.18 micron drawn and a reduction in RC delay with the copper interconnect. In addition to mapping technology, the POWER3-II processor is planned to improve commercial performance by adding set associative L2 support and fractional bus modes to support the higher frequencies. The technology map and tuning are planned to rapidly scale the POWER3 processor frequency to 300 to 500 MHz implementations, which are planned to achieve 30+ SPECint95 and 70+ SPECfp95.

Work is already underway to apply IBM's recently announced SOI technology to the POWER roadmap of products. SOI technology is projected to give higher frequencies while at the same time reducing the power requirements.

Summary

In summary, the POWER3 processor is very robust, delivering real performance on real applications for the next generations of RISC System/6000 solutions. It utilizes IBM's superior silicon technology, packaging technology, and microarchitecture and systems expertise to produce systems with outstanding performance in both commercial and technical computing.

Any performance data contained in this document was determined in a controlled environment. Therefore, the results obtained in other operating environments may vary significantly. Some measurements quoted in this paper may have been made on development-level systems. Actual results may vary. Users of this paper should verify the applicable data for their specific environment. All benchmark values are provided AS IS and no warranties or guarantees are expressed or implied by IBM.

Linpack TPP (Toward Peak Performance): n=1000 is the array size. The results are measured in MFLOPS. Linpack Benchmarks from:

STREAM is a program developed by J. McCalpin of the University of Virginia that measures sustainable memory bandwidth (in MB/s) and the corresponding computation rate for simple vector kernels. The results reported in this paper are for the fastest TRIAD program on a uniprocessor machine. STREAM Benchmark from:

GPC/OPC results: CDRS-03, DX-03, DRV-04, Light-01 and AW advs-01 are weighted geometric means of individual viewset metrics. The viewsets were developed by ISVs (Independent Software Vendors) with the assistance of OPC (OpenGL Performance Characterization) member companies. Larger values indicate better performance. CDRS Benchmark from:

Biographies

Mark Papermaster is the Manager of High End Processor Development, Robert Dinkjian is a Senior Technical Staff Member, Michael Mayfield is a Senior Technical Staff Member, Peter Lenk is a Senior Engineer, and Raymond DuPont is a Senior Engineer, all in the High End Processor Development Group. Bill Ciarfella is a Senior Engineer and Frank O'Connell is a Senior Engineer in the Processor Performance Group. All authors are members of the IBM Server Group, Austin, Texas.

References

1. The GXT3000P Graphics Accelerator

Notes

*PowerPC, PowerPC Architecture, IBM RISC System/6000, RS/6000, POWER GXT3000P, POWER Architecture, and POWER2 Architecture are trademarks of the IBM Corporation.

IBM may have patents or pending patent applications covering subject matter in this paper. The furnishing of this presentation does not give you any license to these patents. Send license inquiries, in writing, to IBM Director of Licensing, IBM Corporation, 500 Columbus Avenue, Thornwood, NY, USA.

All statements regarding IBM's future direction and intent are subject to change or withdrawal without notice, and represent goals and objectives only. Contact your IBM local Branch Office or IBM Authorized Reseller for the full text of a specific Statement of General Direction.
More informationThis Unit: Putting It All Together. CIS 371 Computer Organization and Design. Sources. What is Computer Architecture?
This Unit: Putting It All Together CIS 371 Computer Organization and Design Unit 15: Putting It All Together: Anatomy of the XBox 360 Game Console Application OS Compiler Firmware CPU I/O Memory Digital
More informationComputer Organization. 8 th Edition. Chapter 2 p Computer Evolution and Performance
William Stallings Computer Organization and Architecture 8 th Edition Chapter 2 p Computer Evolution and Performance ENIAC - background Electronic Numerical Integrator And Computer Eckert and Mauchly University
More informationPOWER7: IBM's Next Generation Server Processor
Hot Chips 21 POWER7: IBM's Next Generation Server Processor Ronald Kalla Balaram Sinharoy POWER7 Chief Engineer POWER7 Chief Core Architect Acknowledgment: This material is based upon work supported by
More informationUnit 11: Putting it All Together: Anatomy of the XBox 360 Game Console
Computer Architecture Unit 11: Putting it All Together: Anatomy of the XBox 360 Game Console Slides originally developed by Milo Martin & Amir Roth at University of Pennsylvania! Computer Architecture
More informationChapter 18 Parallel Processing
Chapter 18 Parallel Processing Multiple Processor Organization Single instruction, single data stream - SISD Single instruction, multiple data stream - SIMD Multiple instruction, single data stream - MISD
More informationThe T0 Vector Microprocessor. Talk Outline
Slides from presentation at the Hot Chips VII conference, 15 August 1995.. The T0 Vector Microprocessor Krste Asanovic James Beck Bertrand Irissou Brian E. D. Kingsbury Nelson Morgan John Wawrzynek University
More informationUniprocessors. HPC Fall 2012 Prof. Robert van Engelen
Uniprocessors HPC Fall 2012 Prof. Robert van Engelen Overview PART I: Uniprocessors and Compiler Optimizations PART II: Multiprocessors and Parallel Programming Models Uniprocessors Processor architectures
More informationComputer Systems Architecture
Computer Systems Architecture Lecture 24 Mahadevan Gomathisankaran April 29, 2010 04/29/2010 Lecture 24 CSCE 4610/5610 1 Reminder ABET Feedback: http://www.cse.unt.edu/exitsurvey.cgi?csce+4610+001 Student
More informationThe UltraSPARC -IIi Processor. Technology White Paper
The UltraSPARC -IIi Processor Technology White Paper 1997, 1998 Sun Microsystems, Inc. All rights reserved. Printed in the United States of America. 901 San Antonio Road, Palo Alto, California 94303 U.S.A
More informationMIMD Overview. Intel Paragon XP/S Overview. XP/S Usage. XP/S Nodes and Interconnection. ! Distributed-memory MIMD multicomputer
MIMD Overview Intel Paragon XP/S Overview! MIMDs in the 1980s and 1990s! Distributed-memory multicomputers! Intel Paragon XP/S! Thinking Machines CM-5! IBM SP2! Distributed-memory multicomputers with hardware
More informationThe PowerPC RISC Family Microprocessor
The PowerPC RISC Family Microprocessors In Brief... The PowerPC architecture is derived from the IBM Performance Optimized with Enhanced RISC (POWER) architecture. The PowerPC architecture shares all of
More informationSGI Challenge Overview
CS/ECE 757: Advanced Computer Architecture II (Parallel Computer Architecture) Symmetric Multiprocessors Part 2 (Case Studies) Copyright 2001 Mark D. Hill University of Wisconsin-Madison Slides are derived
More informationIBM Single Chip RISC Processor (RSC)
IBM Single Chip RISC Processor (RSC) C. R. Moore, D. M. Baker, J.S. Muhich, and R.E. East Advanced Workstation Division International Business Machines Corporation Austin, Texas Abstract A highly in.d
More informationDigital Semiconductor Alpha Microprocessor Product Brief
Digital Semiconductor Alpha 21164 Microprocessor Product Brief March 1995 Description The Alpha 21164 microprocessor is a high-performance implementation of Digital s Alpha architecture designed for application
More informationParallel Processing. Computer Architecture. Computer Architecture. Outline. Multiple Processor Organization
Computer Architecture Computer Architecture Prof. Dr. Nizamettin AYDIN naydin@yildiz.edu.tr nizamettinaydin@gmail.com Parallel Processing http://www.yildiz.edu.tr/~naydin 1 2 Outline Multiple Processor
More informationA Multiprocessor system generally means that more than one instruction stream is being executed in parallel.
Multiprocessor Systems A Multiprocessor system generally means that more than one instruction stream is being executed in parallel. However, Flynn s SIMD machine classification, also called an array processor,
More informationSuperscalar Processors
Superscalar Processors Increasing pipeline length eventually leads to diminishing returns longer pipelines take longer to re-fill data and control hazards lead to increased overheads, removing any a performance
More informationLecture 8: RISC & Parallel Computers. Parallel computers
Lecture 8: RISC & Parallel Computers RISC vs CISC computers Parallel computers Final remarks Zebo Peng, IDA, LiTH 1 Introduction Reduced Instruction Set Computer (RISC) is an important innovation in computer
More informationby M. T. Vaden L. J. Merkel C. R. Moore J. Reese Potter
Design considerations T. R. M. for the PowerPC 601 microprocessor by M. T. Vaden L. J. Merkel C. R. Moore J. Reese Potter The PowerPC 601 microprocessor (601) is the first member of a family of processors
More informationMultiprocessors. Flynn Taxonomy. Classifying Multiprocessors. why would you want a multiprocessor? more is better? Cache Cache Cache.
Multiprocessors why would you want a multiprocessor? Multiprocessors and Multithreading more is better? Cache Cache Cache Classifying Multiprocessors Flynn Taxonomy Flynn Taxonomy Interconnection Network
More informationPOWER3: The next generation of PowerPC processors
POWER3: The next generation of PowerPC processors by F. P. O Connell S. W. White The POWER3 processor is a high-performance microprocessor which excels at technical computing. Designed by IBM and deployed
More informationTUNING CUDA APPLICATIONS FOR MAXWELL
TUNING CUDA APPLICATIONS FOR MAXWELL DA-07173-001_v6.5 August 2014 Application Note TABLE OF CONTENTS Chapter 1. Maxwell Tuning Guide... 1 1.1. NVIDIA Maxwell Compute Architecture... 1 1.2. CUDA Best Practices...2
More information4.1 Introduction 4.3 Datapath 4.4 Control 4.5 Pipeline overview 4.6 Pipeline control * 4.7 Data hazard & forwarding * 4.
Chapter 4: CPU 4.1 Introduction 4.3 Datapath 4.4 Control 4.5 Pipeline overview 4.6 Pipeline control * 4.7 Data hazard & forwarding * 4.8 Control hazard 4.14 Concluding Rem marks Hazards Situations that
More informationLecture 7: Implementing Cache Coherence. Topics: implementation details
Lecture 7: Implementing Cache Coherence Topics: implementation details 1 Implementing Coherence Protocols Correctness and performance are not the only metrics Deadlock: a cycle of resource dependencies,
More informationChapter 18. Parallel Processing. Yonsei University
Chapter 18 Parallel Processing Contents Multiple Processor Organizations Symmetric Multiprocessors Cache Coherence and the MESI Protocol Clusters Nonuniform Memory Access Vector Computation 18-2 Types
More informationCS 654 Computer Architecture Summary. Peter Kemper
CS 654 Computer Architecture Summary Peter Kemper Chapters in Hennessy & Patterson Ch 1: Fundamentals Ch 2: Instruction Level Parallelism Ch 3: Limits on ILP Ch 4: Multiprocessors & TLP Ap A: Pipelining
More informationCS425 Computer Systems Architecture
CS425 Computer Systems Architecture Fall 2017 Thread Level Parallelism (TLP) CS425 - Vassilis Papaefstathiou 1 Multiple Issue CPI = CPI IDEAL + Stalls STRUC + Stalls RAW + Stalls WAR + Stalls WAW + Stalls
More informationCase Study IBM PowerPC 620
Case Study IBM PowerPC 620 year shipped: 1995 allowing out-of-order execution (dynamic scheduling) and in-order commit (hardware speculation). using a reorder buffer to track when instruction can commit,
More informationWhite Paper. First the Tick, Now the Tock: Next Generation Intel Microarchitecture (Nehalem)
White Paper First the Tick, Now the Tock: Next Generation Intel Microarchitecture (Nehalem) Introducing a New Dynamically and Design- Scalable Microarchitecture that Rewrites the Book On Energy Efficiency
More informationChapter 03. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1
Chapter 03 Authors: John Hennessy & David Patterson Copyright 2011, Elsevier Inc. All rights Reserved. 1 Figure 3.3 Comparison of 2-bit predictors. A noncorrelating predictor for 4096 bits is first, followed
More informationMultiprocessors and Thread-Level Parallelism. Department of Electrical & Electronics Engineering, Amrita School of Engineering
Multiprocessors and Thread-Level Parallelism Multithreading Increasing performance by ILP has the great advantage that it is reasonable transparent to the programmer, ILP can be quite limited or hard to
More informationSuperscalar Machines. Characteristics of superscalar processors
Superscalar Machines Increasing pipeline length eventually leads to diminishing returns longer pipelines take longer to re-fill data and control hazards lead to increased overheads, removing any performance
More informationCPS104 Computer Organization and Programming Lecture 20: Superscalar processors, Multiprocessors. Robert Wagner
CS104 Computer Organization and rogramming Lecture 20: Superscalar processors, Multiprocessors Robert Wagner Faster and faster rocessors So much to do, so little time... How can we make computers that
More information06-2 EE Lecture Transparency. Formatted 14:50, 4 December 1998 from lsli
06-1 Vector Processors, Etc. 06-1 Some material from Appendix B of Hennessy and Patterson. Outline Memory Latency Hiding v. Reduction Program Characteristics Vector Processors Data Prefetch Processor /DRAM
More informationIntel Architecture for Software Developers
Intel Architecture for Software Developers 1 Agenda Introduction Processor Architecture Basics Intel Architecture Intel Core and Intel Xeon Intel Atom Intel Xeon Phi Coprocessor Use Cases for Software
More informationEach Milliwatt Matters
Each Milliwatt Matters Ultra High Efficiency Application Processors Govind Wathan Product Manager, CPG ARM Tech Symposia China 2015 November 2015 Ultra High Efficiency Processors Used in Diverse Markets
More informationPOWER9 Announcement. Martin Bušek IBM Server Solution Sales Specialist
POWER9 Announcement Martin Bušek IBM Server Solution Sales Specialist Announce Performance Launch GA 2/13 2/27 3/19 3/20 POWER9 is here!!! The new POWER9 processor ~1TB/s 1 st chip with PCIe4 4GHZ 2x Core
More informationTen Reasons to Optimize a Processor
By Neil Robinson SoC designs today require application-specific logic that meets exacting design requirements, yet is flexible enough to adjust to evolving industry standards. Optimizing your processor
More informationUnit 9 : Fundamentals of Parallel Processing
Unit 9 : Fundamentals of Parallel Processing Lesson 1 : Types of Parallel Processing 1.1. Learning Objectives On completion of this lesson you will be able to : classify different types of parallel processing
More informationThe ARM10 Family of Advanced Microprocessor Cores
The ARM10 Family of Advanced Microprocessor Cores Stephen Hill ARM Austin Design Center 1 Agenda Design overview Microarchitecture ARM10 o o Memory System Interrupt response 3. Power o o 4. VFP10 ETM10
More informationComputing architectures Part 2 TMA4280 Introduction to Supercomputing
Computing architectures Part 2 TMA4280 Introduction to Supercomputing NTNU, IMF January 16. 2017 1 Supercomputing What is the motivation for Supercomputing? Solve complex problems fast and accurately:
More information