POWER3: Next Generation 64-bit PowerPC Processor Design


Authors

Mark Papermaster, Robert Dinkjian, Michael Mayfield, Peter Lenk, Bill Ciarfella, Frank O'Connell, Raymond DuPont
High End Processor Design, IBM Server Group Development, Austin, Texas

Abstract

IBM's new POWER3 microprocessor integrates the high-bandwidth and floating point capabilities of its POWER2 architecture predecessor into a fully scalable 64-bit PowerPC* symmetric multiprocessor (SMP) implementation. Based on PowerPC Architecture*, this microprocessor contains the fundamental design features that are planned to be used in the CPUs for the next three generations of RISC System/6000* targeted at the numeric intensive computing (NIC), high-end analysis, graphics, commercial workstation and server markets. This paper provides an overview of how processor microarchitecture, silicon technology, packaging technology, and systems architecture can be leveraged to produce outstanding high-performance computational capabilities. What follows is a description of the processor design point, the execution core, and key features, such as hardware prefetch, that reduce latency to memory.

Design

The POWER3 microprocessor objectives were to continue the POWER2 architecture tradition of bringing real solutions to IBM RISC System/6000 customers' high compute needs, while adding 64-bit addressability, double-word integer operations, and symmetric multiprocessor support in the PowerPC Architecture. To satisfy compute intensive requirements, the POWER3 design contains a highly superscalar core which comprises eight execution units, fed by a high bandwidth memory interface supporting four floating point operations per cycle.

The technology strategy of the POWER3 design was to produce a highly sophisticated processor core and memory subsystem in an advanced, but well-established technology. The POWER3-II design is the next step, planned to result in an increase of frequency by up to 50% by tuning the design and moving into IBM's cutting-edge copper technology, CMOS7S. The POWER3-III design is step three, with plans for increased frequency by as much as 40% over the POWER3-II architecture, with more design tuning combined with a move to IBM's newest breakthrough technology, Silicon on Insulator (SOI).

Processor Overview

Figure 1 shows the block diagram of the POWER3 processor, which comprises eight execution units, a 32KB instruction cache, a 64KB data cache, and an on-board bus interface unit that controls both the L2 bus interface and the memory bus interface.

[Figure 1. POWER3 Block Diagram: two floating point units (FPU1, FPU2), three fixed point units (FXU1 to FXU3), two load/store units (LS1, LS2), branch/dispatch, instruction and data caches with memory management, and the bus interface connecting to the 1-16 MB L2 cache and the 6XX bus.]

Two of the three fixed point units (FXUs) execute the bulk of the integer arithmetic instructions in a single cycle. The third unit executes the multi-cycle integer instructions such as multiply and divide. The two floating point units (FPUs) are fully independent, each containing dedicated hardware for square root and divide routines as well as fused multiply-add instruction execution. The FPUs are fully pipelined with three cycle latency and single cycle throughput. Two load/store units provide the data to sustain four floating point operations per cycle. A 16-entry store queue buffer prevents stores from stalling the machine while loads are being performed. Loads are also executed speculatively, improving data throughput.

The branch execution unit employs dynamic branch prediction, with four pending predicted branches supported. The branch target address cache contains 256 entries (128 by 2-way associative), and the branch history table has 2048 entries. The instructions are speculatively executed with a unique register renaming scheme that involves a total of 64 virtual rename registers (32 fixed point and 32 floating point), and a total of 40 physical rename registers actually implemented (16 fixed point and 24 floating point).

The on-board bus interface unit contains the interface logic supporting up to 16 Mbytes of L2, 6XX system bus protocols, and dedicated hardware to reduce latency to memory. Containing 15 million transistors, the POWER3 processor die is shown in Figure 2. It is manufactured in IBM's 0.25 micron hybrid CMOS 6S2 technology, with five levels of interconnect metallurgy.

[Figure 2. POWER3 processor die photo]

System Level Bandwidth

A key challenge of the POWER3 processor was to design a high bandwidth system interface to feed a wide superscalar processor core. Using the high I/O count of IBM packaging technology, the POWER3 processor was implemented with a separate, independent 16-byte memory bus and 32-byte L2 bus, each with separate address, data, and control lines, achieving 6.4 GBps to the L2 at 200 MHz. As an example, Figure 3 shows the POWER3 processor capability in comparison to other high end processors shipping today with the STREAM memory benchmark. This benchmark defines execution to be out of main memory and not L2 cache. When applications are executed out of the L2 cache, the POWER3 processor will perform even faster.

[Figure 3. High Bandwidth Performance: STREAM memory bandwidth in MB/s for DEC, HP, SGI, and Sun systems and the POWER3-based 43P 260 at 200 MHz, using uniprocessor STREAM data as of 9/98.]

High Bandwidth: Data Cache

Figure 4 is a block diagram of the data memory subsystem. The 64 KB data cache is implemented as a Content Addressable Memory (CAM) based cache with a long line size (128 bytes).
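The 6.4 GBps L2 figure quoted under System Level Bandwidth can be reproduced from the bus width and clock alone. The following is a minimal sketch, not taken from the paper: it assumes one full-width 32-byte transfer per 200 MHz cycle, and it takes the 128-byte data cache line size as inferred from the 64-byte half-line read described for the data cache. The function names are illustrative.

```python
def bus_bandwidth_gb_per_s(width_bytes: int, clock_mhz: float) -> float:
    """Peak bus bandwidth in GB/s (10**9 bytes/s).

    Assumption: one full-width transfer completes every clock cycle.
    """
    return width_bytes * clock_mhz * 1e6 / 1e9


def cache_lines(capacity_bytes: int, line_bytes: int) -> int:
    """Number of lines held by a cache of the given capacity."""
    return capacity_bytes // line_bytes


# 32-byte L2 bus at 200 MHz, as quoted in the text.
l2_peak = bus_bandwidth_gb_per_s(32, 200)

# 64 KB data cache; the 128-byte line size is inferred, not quoted directly.
dcache_lines = cache_lines(64 * 1024, 128)

print(f"L2 peak bandwidth: {l2_peak:.1f} GB/s")
print(f"Data cache lines:  {dcache_lines}")
```

The same helper does not apply directly to the 16-byte memory bus, because the paper does not state that bus's clock rate.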
The array is [...]-way set associative and eight-way interleaved (four-way by line and two-way by doubleword). The interleaving of the data cache effectively provides a multiported array function provided there is no access conflict between the subarray banks. The bandwidth and concurrency of operations in this data cache are impressive and achieve the goal of maintaining the high throughput of the predecessor POWER1 and POWER2 architecture processors,* while adding SMP and 64-bit addressability.

[Figure 4. High Bandwidth Interface: 8-byte load data and store data paths into the D-cache, with the CRB and CSB connecting through the bus interface to the 6XX bus and the private L2 bus.]

The data cache has wide internal busing to perform the following highly parallel operations:

A) Eight-byte read for Load/Store #1
B) Eight-byte read for Load/Store #2
C) Eight-byte write for the Store Queue
D) 128-byte cache line write from the Cache Reload Buffer (CRB)
E) 64-byte half line read to the Cache Storeback Buffer (CSB)

The porting and controls of the data cache are such that (assuming no interleave collisions) any four of operations A through E can occur in the same cycle, with operations C and E being the only exclusive ones.

The [...]-byte CRB and the [...]-byte CSB create a pipelined interface with the bus interface unit. This consists of a 32-byte bus that sends data from the bus interface unit to the data cache CRB and a 16-byte bus that sends data from the data cache CSB to the bus interface unit. The data cache was carefully designed to not be a bottleneck to system performance under any conditions.

High Bandwidth: Instruction Cache

Figure 5 shows the instruction cache block diagram. The 32K byte instruction cache is also [...]-way set associative, two-way interleaved (on a line basis), with [...]-byte lines. The interleaving permits a [...]-byte cache write from the CRB to one interleave, while an eight-instruction (32-byte) fetch is done to the Instruction Buffers from the other interleave. The instruction cache read has the additional feature of being able to access eight sequential instructions at a time from anywhere within a given line. This allows the instruction cache to send eight sequential instructions to the Instruction Buffer in a single cycle.

[Figure 5. Instruction Processing: sequential instructions flow from the I-cache, which is reloaded through the Cache Reload Buffer over a 32-byte path from the bus interface to the 6XX bus and the private L2 bus.]

Decode-to-Completion Bandwidth

The high instruction bandwidth from the instruction cache is maintained throughout the instruction processing path of the POWER3 processor, from instruction decode and dispatch to instruction completion. The Instruction Buffer can contain up to 12 instructions while the Dispatch Buffer can hold up to four instructions. If the Instruction Buffer is empty, the Dispatch Buffer can be loaded directly from the instruction cache. Up to four instructions can be dispatched per cycle. Dispatch is in order to the execution unit queues. Eight instructions can be issued from the execution unit queues to the eight execution units in one cycle. Issue and execution are out of order, with a total of 32 outstanding instructions tracked by the Completion Buffer. Up to four instructions can be completed per cycle from the Completion Buffer.

This instruction processing bandwidth gives the POWER3 processor a very high utilization efficiency, which is reflected in the outstanding performance on the Linpack 1000x1000 benchmark (TPP). (See the Performance section below.)

Reduced Latency Memory Subsystem

To ensure that potentially needed data and instructions are available to keep the core from stalling, the POWER3 processor designers invested in two key latency reduction techniques.

First, all caches are non-blocking. The instruction cache supports two outstanding misses, and the data cache supports up to four. Second, the POWER3 processor implements sequential instruction and data access detection algorithms in hardware, which permit the prefetch of cache lines to closer levels of the memory hierarchy. This reduces the negative performance impact of increasing memory latencies, particularly on technical workloads. These programs often access memory in regular, sequential patterns. The POWER3 processor prefetches up to four separate data streams with a depth of two to four lines for each stream. Compared with the base design without hardware prefetch, the prefetching engine improves sustained performance by greater than 2.5X on loops such as those found in double precision A times X plus Y (DAXPY). Programs with these regular, sequential patterns contained within the L2 cache will execute nearly as fast as if the data were contained in the L1 cache. Instructions are prefetched into the L1 cache up to one sequential line ahead of the line currently being accessed on the predicted path. These architectural features not only enhance performance for the current 200 MHz POWER3 processor, but they also enable higher frequency versions to scale well in performance.

System Implementation

The system interface is designed to allow flexibility in system implementation, from low cost, bus-based systems to more complex switch-based configurations providing greater address and data bandwidth. The POWER3 processor design supports Modified Exclusive Shared Invalid (MESI) snoop-oriented SMP cache coherence along with remote processor bus protocols for increased throughput and large system topologies. The split transaction bus allows it to achieve up to 90% of available data bandwidth running a DAXPY type workload. This flexibility is possible because of IBM's advanced packaging technology, which allows the POWER3 processor's 1088 I/O, including 748 signal I/O, to maintain the high bandwidth needed to support high frequency processors.

[Figure 6. POWER3 Roadmap: the progression from POWER1 and POWER2 through the single-die P2SC and P2SC+ to the 64-bit, SMP scalable PowerPC Architecture designs POWER3 (200 MHz), POWER3-II, and POWER3-III.]

Performance

The POWER3 processor excels in real application performance precisely because its many facilities combine to cover the wide spectrum of demands that characterize technical and commercial computing. Applications may be limited by the rate of computation or by the rate of data delivery to the computational units. They may be primarily fixed point intensive, primarily floating point intensive, or some combination of these characteristics. The POWER3 processor's well balanced design handles these challenges with its eight execution units, wide data paths, non-blocking caches and prefetch engine, and many other features.

Two standard benchmarks show the remarkable performance of the POWER3 processor. On the Linpack 1000x1000 (TPP) benchmark, the POWER3 processor (200 MHz) runs at 632 MFLOPS per CPU, and on the STREAM benchmark, the POWER3 processor sustains over 1.1 GBps memory bandwidth. The outstanding TPP performance illustrates the ability of the POWER3 processor to sustain close to peak floating point performance, while the STREAM benchmark proves the POWER3 processor's ability to sustain close to peak memory performance. Its SPECfp95 performance of 30.1 shows a combination of these attributes in running an entire application suite.

Due to its robust floating-point performance and high memory bandwidth, the POWER3 processor will also provide outstanding graphics performance. The RS/6000* 43P Model 260 with its POWER GXT3000P* Graphics Accelerator and 200 MHz POWER3 processor will yield an industry leading CDRS (OpenGL) benchmark rating of greater than 215, providing leadership performance in many CAD industry applications (1).

POWER3 processor-based RS/6000 systems will set new standards for application performance in the forthcoming years.
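How close the Linpack result comes to peak can be estimated from figures quoted in the paper. The following is a minimal sketch under stated assumptions: peak is taken as the four floating point operations per cycle mentioned in the Design section (two FPUs, each fused multiply-add counted as two operations) at the 200 MHz clock. The function names are illustrative and do not come from the paper.

```python
def peak_mflops(flops_per_cycle: int, clock_mhz: float) -> float:
    """Theoretical peak in MFLOPS: operations per cycle times clock in MHz."""
    return flops_per_cycle * clock_mhz


def efficiency(sustained_mflops: float, peak: float) -> float:
    """Fraction of the theoretical peak that was actually sustained."""
    return sustained_mflops / peak


# Assumption: 2 FPUs x 2 flops per fused multiply-add = 4 flops per cycle.
peak = peak_mflops(4, 200)

# 632 MFLOPS is the Linpack TPP figure quoted in the text.
tpp_efficiency = efficiency(632, peak)

print(f"Peak:           {peak:.0f} MFLOPS")
print(f"TPP efficiency: {tpp_efficiency:.0%}")
```

Under these assumptions the 200 MHz part peaks at 800 MFLOPS, so the quoted Linpack TPP result corresponds to roughly 79% of peak.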

Rev the Engine

Figure 6 shows the future roadmap of the POWER3 processor family. The second design point is well along in its implementation in IBM's industry leading CMOS7S process, which provides technology performance gains associated with shrinking channel lengths to 0.18 micron drawn and a reduction in RC delay with the copper interconnect. In addition to mapping technology, the POWER3-II processor is planned to improve commercial performance by adding set associative L2 support and fractional bus modes to support the higher frequencies. The technology map and tuning are planned to rapidly scale the POWER3 processor frequency to 300 to 500 MHz implementations, which are planned to achieve 30+ SPECint95 and 70+ SPECfp95. Work is already underway to apply IBM's recently announced SOI technology to the POWER roadmap of products. SOI technology is projected to give higher frequencies while at the same time reducing the power requirements.

Summary

In summary, the POWER3 processor is very robust, delivering real performance on real applications for the next generations of RISC System/6000 solutions. It utilizes IBM's superior silicon technology, packaging technology, and microarchitecture and systems expertise to produce systems with outstanding performance in both commercial and technical computing.

Any performance data contained in this document was determined in a controlled environment. Therefore, the results obtained in other operating environments may vary significantly. Some measurements quoted in this paper may have been made on development-level systems. Actual results may vary. Users of this paper should verify the applicable data for their specific environment. All benchmark values are provided AS IS and no warranties or guarantees are expressed or implied by IBM.

Linpack TPP (Toward Peak Performance): n=1000 is the array size. The results are measured in MFLOPS. Linpack benchmarks from: [...]

STREAM is a program developed by J. McCalpin of the University of Virginia that measures sustainable memory bandwidth (in MB/s) and the corresponding computation rate for simple vector kernels. The results reported in this paper are for the fastest TRIAD program using a uniprocessor machine. STREAM benchmark from: [...]

GPC/OPC results: CDRS-03, DX-03, DRV-04, Light-01 and AWadvs-01 are weighted geometric means of individual viewset metrics. The viewsets were developed by ISVs (Independent Software Vendors) with the assistance of OPC (OpenGL Performance Characterization) member companies. Larger values indicate better performance. CDRS benchmark from: [...]

Biographies

Mark Papermaster is the Manager of High End Processor Development, Robert Dinkjian is a Senior Technical Staff Member, Michael Mayfield is a Senior Technical Staff Member, Peter Lenk is a Senior Engineer, and Raymond DuPont is a Senior Engineer, all in the High End Processor Development Group. Bill Ciarfella is a Senior Engineer and Frank O'Connell is a Senior Engineer in the Processor Performance Group. All authors are members of the IBM Server Group, Austin, Texas.

References

1. The GXT3000P Graphics Accelerator

Notes

*PowerPC, PowerPC Architecture, IBM RISC System/6000, RS/6000, POWER GXT3000P, POWER Architecture, and POWER2 Architecture are trademarks of the IBM Corporation.

IBM may have patents or pending patent applications covering subject matter in this paper. The furnishing of this presentation does not give you any license to these patents. Send license inquiries, in writing, to IBM Director of Licensing, IBM Corporation, 500 Columbus Avenue, Thornwood, NY, USA.

All statements regarding IBM's future direction and intent are subject to change or withdrawal without notice, and represent goals and objectives only. Contact your IBM local Branch Office or IBM Authorized Reseller for the full text of a specific Statement of General Direction.
