Niagara2: A Highly Threaded Server-on-a-Chip. Robert Golla Principal Architect Sun Microsystems
|
|
- Sybil Pope
- 5 years ago
- Views:
Transcription
1 Niagara2: A Highly Threaded Server-on-a-hip Robert Golla Principal Architect Sun icrosystems October 10, 2006
2 ontributors Jama Barreh Jeff Brooks William Bryg Bruce hang Robert Golla Greg Grohoski Rick Hetherington Paul Jordan ark Luttrell ark cpherson Shimon uller hris Olson Bikram Saha anish Shah ichael Wong Page 2
3 Agenda hip Overview Throughput omputing Sparc core rossbar cache Networking PI-Express Power Status Summary Page 3
4 FSR Niagara2 hip Overview 8 Sparc cores, 8 threads each Data Bank 0 B0 Data Bank 1 B1 U0 U1 B3 Data Bank 3 B2 Data Bank 2 DU PEU PSR NU SPAR ore 0 TAG0 SII TAG2 SPAR ore 2 ESR SPAR ore 1 TAG1 TAG3 SPAR ore 3 X FSR SPAR ore 5 TAG5 TAG7 SPAR ore 7 SIO SPAR ore 4 TAG4 U TAG6 SPAR ore 6 A EFU Data Bank 4 B4 Data Bank 5 B5 U2 U3 B7 Data Bank 7 B6 Data Bank 6 RDP RTX TDS FSR Shared 4B, 8-banks, 16-way associative Four dual-channel FBDI memory controllers Two 10/1 Gb Enet ports One PI-Express x8 1.0A port 342 mm^2 die size in 65 nm 711 signal I/O, 1831 total Page 4
5 Niagara2 hip Overview Sparc ore Sparc ore Sparc ore Sparc ore Sparc ore Sparc ore Sparc ore Sparc ore 2x10/1 GE 8x9 ache rossbar Bank0 Bank1 Bank2 Bank3 Bank4 Bank5 Bank6 Bank7 I/O NIU (Ethernet) System Interface Unit PI-EX emory ontrol 0 emory ontrol 1 emory ontrol 2 emory ontrol 3 FBDI FBDI FBDI FBDI Full 8x9 crossbar switch onnects every core to every bank and vice-versa Supports 8 byte writes from a core to a bank Supports 16 byte reads from a bank to core One port for core to read/write IO System interface unit connects networking and IO to memory Page 5
6 Throughput omputing emory Latency ompute Time Threads For a single thread Time Single Thread emory is THE bottleneck to improving performance ommercial server workloads exhibit poor memory locality Only a modest throughput speedup is possible by reducing compute time onventional single-thread processors optimized for ILP have low utilizations With many threads It s possible to find something to execute every cycle Significant throughput speedups are possible Page 6 Processor utilization is much higher
7 Engineering Solutions Design Problem Double UltraSparc T1's throughput and throughput/watt Improve UltraSparc T1's FP single-thread and throughput performance inimize required area for these improvements onsidered doubling number of UltraSparc T1 cores 16 cores of 4 threads each Takes too much die area No area left for improving FP performance Page 7
8 Engineering Solutions Probabilistic odelling Generate synthetic traces for each thread with an Total IP Relative Throughput Performance ultithread_performance_vs._threads (12/2002) UltraSparc T1 Niagara2 instruction/miss profile that matches TP- Schedule ready threads to run on some number of execution units End simulation once simulated distributions are close to actual distributions Works very well for simple scalar cores running lots of threads on transactional workloads Within 10 percent of a detailed cycle accurate simulator Detailed cycle accurate simulator not available at beginning of the project Page 8
9 Engineering Solutions Decided to increase the number of threads per core and increase execution bandwidth 8 threads per core x 8 cores = 64 threads total 2 EXUs per core ore than doubles UltraSparc T1 s throughput Doubling threads is more area efficient than doubling cores Integrate FGU into core pipeline 6 cycle FP latency Threads running FP are non-blocking Enhance Niagara2 s cryptography Added more ciphers Enhanced existing public key support Page 9
10 Throughput hanges Niagara2 throughput changes vs. UltraSparc T1 Add instruction buffers after L1 instruction cache for each thread Add new pipe stage pick hoose 2 threads out of 8 to execute each cycle Increase execution units from 1 to 2 Increase set associativity of L1 instruction cache to 8 Increase size of fully associative DTLB from 64 to 128 entries Increase banks from 4 to 8 15 percent performance loss with only 4 banks and 64 threads Increase threads from 4 to 8 Page 10
11 Sparc ore Block Diagram IFU Instruction Fetch Unit 16 KB I$, 32B lines, 8-way SA SPU TLU EXU0 FGU IFU EXU1 LSU U/ HWTW Gasket rossbar/ 64-entry fully-associative ITLB EXU0/1 Integer Execution Units 4 threads share each unit 8 register windows/thread 160 IRF entries/thread LSU Load/Store Unit 8 threads share LSU 8KB D$, 16B lines, 4-way SA 128-entry fully-associative DTLB FGU Floating-Point/Graphics Unit 8 threads share FGU 32 FRF entries/thread SPU Stream Processing Unit ryptographic coprocessor TLU Trap Logic Unit Updates machine state, handles exceptions and interrupts U emory anagement Unit Hardware tablewalk (HWTW) 8KB, 64KB, 4B, 256B pages Page 11
12 ore Pipeline 8 stage integer pipeline Fetch ache Pick Decode Execute em Bypass W 3-cycle load-use penalty emory (data translation, access tag/data array) Bypass (late way select, data formatting, data forwarding) 12 stage floating-point pipeline Fetch ache Pick Decode Execute Fx1 Fx2 Fx3 Fx4 Fx5 FB FW 6-cycle latency for dependent FP ops Longer pipeline for divide/sqrt Page 12
13 Integer/LSU Pipeline Thread Group 0 P IB0-3 D E B W F LSU B W IFU Thread Group 1 P D E B IB4-7 W Instruction cache is shared by all 8 threads Least-recently-fetched algorithm used to select next thread to fetch Each thread is written into thread-specific instruction buffer Decouples fetch from pick Each thread statically assigned to one of 2 thread groups Pick chooses 1 ready thread each cycle within each thread group Picking within each thread group is independent of the other Least-recently-picked algorithm used to select next thread to execute Decode resolves resource hazards not handled during pick Page 13
14 Integer/LSU Pipeline Threads are Thread Group 0 P0 IB0-3 D2 E0 3 B1 W2 F2 6 LSU 4 B1 W6 IFU Thread Group 1 P5 D7 E6 4 B7 W6 IB4-7 interleaved between pipeline stages with very few restrictions Any thread can be at fetch or cache stage Threads are split into 2 thread groups before pick stage Load/store and floating-point units are shared between all 8 threads Up to 1 thread from either thread group can be scheduled on a shared unit Page 14
15 Stream Processing Unit ryptographic coprocessor One per core A Sources To FGU ultiply Result From FGU A Scratchpad 160x64b, 2R/1W rs1 A Execution Hash Engine rs2 Store Data, Address DA Engine ipher Engines Address, Data to/from Runs in parallel w/core at same frequency Two independent sub-units odular Arithmetic Unit RSA, binary and integer polynomial elliptic curve (E) Shares FGU multiplier ipher/hash Unit R4, DES/3DES, AES- 128/192/256 D5, SHA-1, SHA-256 Designed to achieve wire-speed on both 10Gb Ethernet ports Facilitates wirespeed encryption and decryption DA engine shares core s crossbar port Page 15
16 rossbar onnects 8 cores to 8 Banks and I/O ~90 GB/s write Sparc ore0 Sparc ore1 Sparc ore2 Sparc ore3 Sparc ore4 Sparc ore5 Sparc ore6 Sparc ore7 Non-blocking, pipelined switch 8 load/store requests and 8 data returns can be done at the same time Divided into 2 parts PX ~180 GB/s read Bank0 Bank1 Bank2 B0 ux B7 ux Bank3 Bank4 Bank5 Bank6 Bank7 PX processor to cache PX cache to processor Arbitration for a target is required Priority given to oldest requestor to maintain fairness and order Three cycle arbitration protocol Request, arbitrate and then grant Page 16
17 ache 4 B cache PX Request PX Return 16 way set associative Replayed iss lookup Tag Array Input Queue Arbiter iss Buffer Fill Buffer iss Request to emory miss hit Valid Array iss Request I/O Request 64B emory Read Fill Request Data Array 16B 64B Line Fill Output Queue 16B 16B Write-back Buffer Directory Invalidation Packet Arbiter 64B emory Write 64B Eviction I/O data 64B I/O Write Buffer 8 banks 64 byte line size cache is write-back, write-allocate L1 data cache is writethru Support for partial stores oherency is managed by the cache Directories maintained for all 16 L1 caches Data transfers between the and a core are done in 16 byte packets Page 17
18 Integrated Networking 42 GB/s read 21 GB/s write Pipelined memory accesses tolerate relaxed ordering rossbar NIU (Ethernet) 10 GE Ethernet FBDIs System Interface Unit PI-Ex 2.5 GHz Integrate networking for better overall performance All network data is sourced from and destined to main memory Integration minimizes impact of memory Get networking closer to memory to reduce latency Able to take full advantage of higher memory bandwidth Eliminates inherent inefficiencies of I/O protocol translation Page 18
19 Networking Features Line Rate Packet lassification (~30 pkt/s) Based on Layer 1/2/3/4 of the protocol stack ultiple DA Engines atches DAs to threads Binding flexibility between DAs and ports 16 transmit + 16 receive DA channels Virtualization Support Supports up to 8 partitions Interrupts may be bound to different hardware threads Dual Ethernet ports 2 dual-speed As (10G/1G) with integrated serdes Page 19
20 PI-Express PI-Express operates at 2.5 Gb/s per lane per direction 350 Hz DA/PIO ache Lines TLP Packets Data anagement IOU Interrupt Point-to-point, dual-simplex chip interconnect Transfers are in packets with headers and max data payloads from 128B to 512B IOU supports I/O virtualization and process device isolation by using PIE s BDF# SI Support 250 Hz 2.5 Gb/s XIT SR RV SR 128b 32b 128b 32b PI Express ore Transaction Layer Data Link and Physical Layer X8 Serdes X8 Serdes 16b 16b Event queue accumulates SIs Allows many SIs to be serviced upon an interrupt Total I/O bandwidth is 3-4 GB/s with max payload sizes of 128B to 512B Page 20
21 Power anagement Limit speculation Sequential prefetch of instruction cache lines Predict conditional branches as not-taken Predict loads hit in the data cache Hardware tablewalk search control Extensive clock gating Datapath ontrol blocks Arrays Power throttling 3 external power throttle pins Inject stall cycles into the decode stage based on state of these pins If power_throttle_pins[2:0]==n then n stalls in window of 8, n is 0-7 Affects all threads Page 21
22 Niagara2 System Status First silicon arrived at the end of ay Booted Solaris in 5 days urrent systems are fully operational Expect systems to ship in 2H2007 Page 22
23 Summary Niagara2 combines all major server functions on one chip Integrated networking Integrated PI-Express Embedded wire-speed cryptography Niagara2 has improved performance vs. UltraSparc T1 Better integer throughput and throughput/watt (2x) Improved integer single-thread performance (1.4x) Better floating-point throughput (10x) Better floating-point single-thread performance (5x) Enables new generation of power-efficient, fullysecure datacenters Page 23
24 Thank you... Page 24
Niagara-2: A Highly Threaded Server-on-a-Chip. Greg Grohoski Distinguished Engineer Sun Microsystems
Niagara-2: A Highly Threaded Server-on-a-Chip Greg Grohoski Distinguished Engineer Sun Microsystems August 22, 2006 Authors Jama Barreh Jeff Brooks Robert Golla Greg Grohoski Rick Hetherington Paul Jordan
More informationKaisen Lin and Michael Conley
Kaisen Lin and Michael Conley Simultaneous Multithreading Instructions from multiple threads run simultaneously on superscalar processor More instruction fetching and register state Commercialized! DEC
More informationOPENSPARC T1 OVERVIEW
Chapter Four OPENSPARC T1 OVERVIEW Denis Sheahan Distinguished Engineer Niagara Architecture Group Sun Microsystems Creative Commons 3.0United United States License Creative CommonsAttribution-Share Attribution-Share
More informationReference. T1 Architecture. T1 ( Niagara ) Case Study of a Multi-core, Multithreaded
Reference Case Study of a Multi-core, Multithreaded Processor The Sun T ( Niagara ) Computer Architecture, A Quantitative Approach, Fourth Edition, by John Hennessy and David Patterson, chapter. :/C:8
More informationCS 152 Computer Architecture and Engineering
CS 152 Computer Architecture and Engineering Lecture 19 Advanced Processors III 2006-11-2 John Lazzaro (www.cs.berkeley.edu/~lazzaro) TAs: Udam Saini and Jue Sun www-inst.eecs.berkeley.edu/~cs152/ 1 Last
More informationM7: Next Generation SPARC. Hotchips 26 August 12, Stephen Phillips Senior Director, SPARC Architecture Oracle
M7: Next Generation SPARC Hotchips 26 August 12, 2014 Stephen Phillips Senior Director, SPARC Architecture Oracle Safe Harbor Statement The following is intended to outline our general product direction.
More informationThe Alpha Microprocessor: Out-of-Order Execution at 600 Mhz. R. E. Kessler COMPAQ Computer Corporation Shrewsbury, MA
The Alpha 21264 Microprocessor: Out-of-Order ution at 600 Mhz R. E. Kessler COMPAQ Computer Corporation Shrewsbury, MA 1 Some Highlights z Continued Alpha performance leadership y 600 Mhz operation in
More informationCS 152 Computer Architecture and Engineering
CS 152 Computer Architecture and Engineering Lecture 22 Advanced Processors III 2005-4-12 John Lazzaro (www.cs.berkeley.edu/~lazzaro) TAs: Ted Hong and David Marquardt www-inst.eecs.berkeley.edu/~cs152/
More informationThe Alpha Microprocessor: Out-of-Order Execution at 600 MHz. Some Highlights
The Alpha 21264 Microprocessor: Out-of-Order ution at 600 MHz R. E. Kessler Compaq Computer Corporation Shrewsbury, MA 1 Some Highlights Continued Alpha performance leadership 600 MHz operation in 0.35u
More informationTDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading
Review on ILP TDT 4260 Chap 5 TLP & Hierarchy What is ILP? Let the compiler find the ILP Advantages? Disadvantages? Let the HW find the ILP Advantages? Disadvantages? Contents Multi-threading Chap 3.5
More informationImprove performance by increasing instruction throughput
Improve performance by increasing instruction throughput Program execution order Time (in instructions) lw $1, 100($0) fetch 2 4 6 8 10 12 14 16 18 ALU Data access lw $2, 200($0) 8ns fetch ALU Data access
More informationItanium 2 Processor Microarchitecture Overview
Itanium 2 Processor Microarchitecture Overview Don Soltis, Mark Gibson Cameron McNairy, August 2002 Block Diagram F 16KB L1 I-cache Instr 2 Instr 1 Instr 0 M/A M/A M/A M/A I/A Template I/A B B 2 FMACs
More informationHandout 2 ILP: Part B
Handout 2 ILP: Part B Review from Last Time #1 Leverage Implicit Parallelism for Performance: Instruction Level Parallelism Loop unrolling by compiler to increase ILP Branch prediction to increase ILP
More informationAdvanced d Instruction Level Parallelism. Computer Systems Laboratory Sungkyunkwan University
Advanced d Instruction ti Level Parallelism Jin-Soo Kim (jinsookim@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu ILP Instruction-Level Parallelism (ILP) Pipelining:
More information1. PowerPC 970MP Overview
1. The IBM PowerPC 970MP reduced instruction set computer (RISC) microprocessor is an implementation of the PowerPC Architecture. This chapter provides an overview of the features of the 970MP microprocessor
More informationPipelining and Vector Processing
Pipelining and Vector Processing Chapter 8 S. Dandamudi Outline Basic concepts Handling resource conflicts Data hazards Handling branches Performance enhancements Example implementations Pentium PowerPC
More informationMulti-core Programming Evolution
Multi-core Programming Evolution Based on slides from Intel Software ollege and Multi-ore Programming increasing performance through software multi-threading by Shameem Akhter and Jason Roberts, Evolution
More informationMultithreaded Processors. Department of Electrical Engineering Stanford University
Lecture 12: Multithreaded Processors Department of Electrical Engineering Stanford University http://eeclass.stanford.edu/ee382a Lecture 12-1 The Big Picture Previous lectures: Core design for single-thread
More informationAn Oracle White Paper April Oracle's Sun SPARC Enterprise T5120/T5220 and Oracle's Sun SPARC Enterprise T5140/T5240 Server Architecture
An Oracle White Paper April 2010 Oracle's Sun SPARC Enterprise T5120/T5220 and Oracle's Sun SPARC Enterprise T5140/T5240 Server Architecture Introduction...1 The Evolution of Chip Multithreading...3 Rule-Changing
More informationThe Nios II Family of Configurable Soft-core Processors
The Nios II Family of Configurable Soft-core Processors James Ball August 16, 2005 2005 Altera Corporation Agenda Nios II Introduction Configuring your CPU FPGA vs. ASIC CPU Design Instruction Set Architecture
More informationLecture-13 (ROB and Multi-threading) CS422-Spring
Lecture-13 (ROB and Multi-threading) CS422-Spring 2018 Biswa@CSE-IITK Cycle 62 (Scoreboard) vs 57 in Tomasulo Instruction status: Read Exec Write Exec Write Instruction j k Issue Oper Comp Result Issue
More informationEITF20: Computer Architecture Part2.2.1: Pipeline-1
EITF20: Computer Architecture Part2.2.1: Pipeline-1 Liang Liu liang.liu@eit.lth.se 1 Outline Reiteration Pipelining Harzards Structural hazards Data hazards Control hazards Implementation issues Multi-cycle
More informationPentium IV-XEON. Computer architectures M
Pentium IV-XEON Computer architectures M 1 Pentium IV block scheme 4 32 bytes parallel Four access ports to the EU 2 Pentium IV block scheme Address Generation Unit BTB Branch Target Buffer I-TLB Instruction
More information6x86 PROCESSOR Superscalar, Superpipelined, Sixth-generation, x86 Compatible CPU
1-6x86 PROCESSOR Superscalar, Superpipelined, Sixth-generation, x86 Compatible CPU Product Overview Introduction 1. ARCHITECTURE OVERVIEW The Cyrix 6x86 CPU is a leader in the sixth generation of high
More informationENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design
ENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design Professor Sherief Reda http://scale.engin.brown.edu Electrical Sciences and Computer Engineering School of Engineering Brown University
More informationPortland State University ECE 588/688. Cray-1 and Cray T3E
Portland State University ECE 588/688 Cray-1 and Cray T3E Copyright by Alaa Alameldeen 2014 Cray-1 A successful Vector processor from the 1970s Vector instructions are examples of SIMD Contains vector
More informationSuperscalar Processors
Superscalar Processors Increasing pipeline length eventually leads to diminishing returns longer pipelines take longer to re-fill data and control hazards lead to increased overheads, removing any a performance
More informationAdvanced Caching Techniques (2) Department of Electrical Engineering Stanford University
Lecture 4: Advanced Caching Techniques (2) Department of Electrical Engineering Stanford University http://eeclass.stanford.edu/ee282 Lecture 4-1 Announcements HW1 is out (handout and online) Due on 10/15
More informationWilliam Stallings Computer Organization and Architecture 8 th Edition. Chapter 14 Instruction Level Parallelism and Superscalar Processors
William Stallings Computer Organization and Architecture 8 th Edition Chapter 14 Instruction Level Parallelism and Superscalar Processors What is Superscalar? Common instructions (arithmetic, load/store,
More informationThe Processor: Instruction-Level Parallelism
The Processor: Instruction-Level Parallelism Computer Organization Architectures for Embedded Computing Tuesday 21 October 14 Many slides adapted from: Computer Organization and Design, Patterson & Hennessy
More informationJim Keller. Digital Equipment Corp. Hudson MA
Jim Keller Digital Equipment Corp. Hudson MA ! Performance - SPECint95 100 50 21264 30 21164 10 1995 1996 1997 1998 1999 2000 2001 CMOS 5 0.5um CMOS 6 0.35um CMOS 7 0.25um "## Continued Performance Leadership
More informationThread-level Parallelism for the Masses. Kunle Olukotun Computer Systems Lab Stanford University 2007
Thread-level Parallelism for the Masses Kunle Olukotun Computer Systems Lab Stanford University 2007 The World has Changed Process Technology Stops Improving! Moore s law but! Transistors don t get faster
More informationSPARC ENTERPRISE T5120 AND T5220 SERVER ARCHITECTURE
SPARC ENTERPRISE T5120 AND T5220 SERVER ARCHITECTURE Unleashing the UltraSPARC T2 Processor with Innovative Multi-core Multi-thread Technology White Paper October 2007 TABLE OF CONTENTS EXECUTIVE SUMMARY
More informationEITF20: Computer Architecture Part2.2.1: Pipeline-1
EITF20: Computer Architecture Part2.2.1: Pipeline-1 Liang Liu liang.liu@eit.lth.se 1 Outline Reiteration Pipelining Harzards Structural hazards Data hazards Control hazards Implementation issues Multi-cycle
More informationPowerPC 740 and 750
368 floating-point registers. A reorder buffer with 16 elements is used as well to support speculative execution. The register file has 12 ports. Although instructions can be executed out-of-order, in-order
More informationEC 513 Computer Architecture
EC 513 Computer Architecture Complex Pipelining: Superscalar Prof. Michel A. Kinsy Summary Concepts Von Neumann architecture = stored-program computer architecture Self-Modifying Code Princeton architecture
More informationCS425 Computer Systems Architecture
CS425 Computer Systems Architecture Fall 2017 Thread Level Parallelism (TLP) CS425 - Vassilis Papaefstathiou 1 Multiple Issue CPI = CPI IDEAL + Stalls STRUC + Stalls RAW + Stalls WAR + Stalls WAW + Stalls
More informationCS 654 Computer Architecture Summary. Peter Kemper
CS 654 Computer Architecture Summary Peter Kemper Chapters in Hennessy & Patterson Ch 1: Fundamentals Ch 2: Instruction Level Parallelism Ch 3: Limits on ILP Ch 4: Multiprocessors & TLP Ap A: Pipelining
More informationPower 7. Dan Christiani Kyle Wieschowski
Power 7 Dan Christiani Kyle Wieschowski History 1980-2000 1980 RISC Prototype 1990 POWER1 (Performance Optimization With Enhanced RISC) (1 um) 1993 IBM launches 66MHz POWER2 (.35 um) 1997 POWER2 Super
More informationCS 152 Computer Architecture and Engineering. Lecture 12 - Advanced Out-of-Order Superscalars
CS 152 Computer Architecture and Engineering Lecture 12 - Advanced Out-of-Order Superscalars Dr. George Michelogiannakis EECS, University of California at Berkeley CRD, Lawrence Berkeley National Laboratory
More informationAdvanced Computer Architecture
Advanced Computer Architecture Chapter 1 Introduction into the Sequential and Pipeline Instruction Execution Martin Milata What is a Processors Architecture Instruction Set Architecture (ISA) Describes
More informationRon Kalla, Balaram Sinharoy, Joel Tendler IBM Systems Group
Simultaneous Multi-threading Implementation in POWER5 -- IBM's Next Generation POWER Microprocessor Ron Kalla, Balaram Sinharoy, Joel Tendler IBM Systems Group Outline Motivation Background Threading Fundamentals
More informationAdvanced issues in pipelining
Advanced issues in pipelining 1 Outline Handling exceptions Supporting multi-cycle operations Pipeline evolution Examples of real pipelines 2 Handling exceptions 3 Exceptions In pipelined execution, one
More informationEITF20: Computer Architecture Part2.2.1: Pipeline-1
EITF20: Computer Architecture Part2.2.1: Pipeline-1 Liang Liu liang.liu@eit.lth.se 1 Outline Reiteration Pipelining Harzards Structural hazards Data hazards Control hazards Implementation issues Multi-cycle
More informationGemini: Sanjiv Kapil. A Power-efficient Chip Multi-Threaded (CMT) UltraSPARC Processor. Gemini Architect Sun Microsystems, Inc.
Gemini: A Power-efficient Chip Multi-Threaded (CMT) UltraSPARC Processor Sanjiv Kapil Gemini Architect Sun Microsystems, Inc. Design Goals Designed for compute-dense, transaction oriented systems (webservers,
More informationHardware-based Speculation
Hardware-based Speculation Hardware-based Speculation To exploit instruction-level parallelism, maintaining control dependences becomes an increasing burden. For a processor executing multiple instructions
More informationSPARC ENTERPRISE T5440 SERVER ARCHITECTURE. Unleashing UltraSPARC T2 Plus Processors with Innovative Multi-core Multi-thread Technology
SPARC ENTERPRISE T5440 SERVER ARCHITECTURE Unleashing UltraSPARC T2 Plus Processors with Innovative Multi-core Multi-thread Technology White Paper July 2009 TABLE OF CONTENTS THE ULTRASPARC T2 PLUS PROCESSOR
More informationCS450/650 Notes Winter 2013 A Morton. Superscalar Pipelines
CS450/650 Notes Winter 2013 A Morton Superscalar Pipelines 1 Scalar Pipeline Limitations (Shen + Lipasti 4.1) 1. Bounded Performance P = 1 T = IC CPI 1 cycletime = IPC frequency IC IPC = instructions per
More informationLike scalar processor Processes individual data items Item may be single integer or floating point number. - 1 of 15 - Superscalar Architectures
Superscalar Architectures Have looked at examined basic architecture concepts Starting with simple machines Introduced concepts underlying RISC machines From characteristics of RISC instructions Found
More informationIBM Cell Processor. Gilbert Hendry Mark Kretschmann
IBM Cell Processor Gilbert Hendry Mark Kretschmann Architectural components Architectural security Programming Models Compiler Applications Performance Power and Cost Conclusion Outline Cell Architecture:
More informationChapter 03. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1
Chapter 03 Authors: John Hennessy & David Patterson Copyright 2011, Elsevier Inc. All rights Reserved. 1 Figure 3.3 Comparison of 2-bit predictors. A noncorrelating predictor for 4096 bits is first, followed
More informationChapter 3 Instruction-Level Parallelism and its Exploitation (Part 1)
Chapter 3 Instruction-Level Parallelism and its Exploitation (Part 1) ILP vs. Parallel Computers Dynamic Scheduling (Section 3.4, 3.5) Dynamic Branch Prediction (Section 3.3) Hardware Speculation and Precise
More informationBOBCAT: AMD S LOW-POWER X86 PROCESSOR
ARCHITECTURES FOR MULTIMEDIA SYSTEMS PROF. CRISTINA SILVANO LOW-POWER X86 20/06/2011 AMD Bobcat Small, Efficient, Low Power x86 core Excellent Performance Synthesizable with smaller number of custom arrays
More informationCS250 VLSI Systems Design Lecture 9: Patterns for Processing Units and Communication Links
CS250 VLSI Systems Design Lecture 9: Patterns for Processing Units and Communication Links John Wawrzynek, Krste Asanovic, with John Lazzaro and Yunsup Lee (TA) UC Berkeley Fall 2010 Unit-Transaction Level
More informationCOMPUTER ORGANIZATION AND DESI
COMPUTER ORGANIZATION AND DESIGN 5 Edition th The Hardware/Software Interface Chapter 4 The Processor 4.1 Introduction Introduction CPU performance factors Instruction count Determined by ISA and compiler
More informationHigh-Performance Cryptography in Software
High-Performance Cryptography in Software Peter Schwabe Research Center for Information Technology Innovation Academia Sinica September 3, 2012 ECRYPT Summer School: Challenges in Security Engineering
More informationAn Oracle White Paper June Oracle's SPARC T4-1, SPARC T4-2, SPARC T4-4, and SPARC T4-1B Server Architecture
An Oracle White Paper June 2012 Oracle's SPARC T4-1, SPARC T4-2, SPARC T4-4, and SPARC T4-1B Server Architecture Introduction... 1 SPARC T4 Processor... 4 The First Eight-Core Threaded SPARC System-on-a-Chip
More informationChapter 4. Advanced Pipelining and Instruction-Level Parallelism. In-Cheol Park Dept. of EE, KAIST
Chapter 4. Advanced Pipelining and Instruction-Level Parallelism In-Cheol Park Dept. of EE, KAIST Instruction-level parallelism Loop unrolling Dependence Data/ name / control dependence Loop level parallelism
More informationCS 152 Computer Architecture and Engineering
CS 152 Computer Architecture and Engineering Lecture 22 Advanced Processors III 2004-11-18 Dave Patterson (www.cs.berkeley.edu/~patterson) John Lazzaro (www.cs.berkeley.edu/~lazzaro) www-inst.eecs.berkeley.edu/~cs152/
More informationHardware-Based Speculation
Hardware-Based Speculation Execute instructions along predicted execution paths but only commit the results if prediction was correct Instruction commit: allowing an instruction to update the register
More informationLecture 14: Multithreading
CS 152 Computer Architecture and Engineering Lecture 14: Multithreading John Wawrzynek Electrical Engineering and Computer Sciences University of California, Berkeley http://www.eecs.berkeley.edu/~johnw
More informationCOSC 6385 Computer Architecture - Memory Hierarchy Design (III)
COSC 6385 Computer Architecture - Memory Hierarchy Design (III) Fall 2006 Reducing cache miss penalty Five techniques Multilevel caches Critical word first and early restart Giving priority to read misses
More informationPerformance of Computer Systems. CSE 586 Computer Architecture. Review. ISA s (RISC, CISC, EPIC) Basic Pipeline Model.
Performance of Computer Systems CSE 586 Computer Architecture Review Jean-Loup Baer http://www.cs.washington.edu/education/courses/586/00sp Performance metrics Use (weighted) arithmetic means for execution
More informationCase Study IBM PowerPC 620
Case Study IBM PowerPC 620 year shipped: 1995 allowing out-of-order execution (dynamic scheduling) and in-order commit (hardware speculation). using a reorder buffer to track when instruction can commit,
More informationReal Processors. Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University
Real Processors Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University Instruction-Level Parallelism (ILP) Pipelining: executing multiple instructions in parallel
More informationHardware and Software solutions for scaling highly threaded processors. Denis Sheahan Distinguished Engineer Sun Microsystems Inc.
Hardware and Software solutions for scaling highly threaded processors Denis Sheahan Distinguished Engineer Sun Microsystems Inc. Agenda Chip Multi-threaded concepts Lessons learned from 6 years of CMT
More informationTechniques for Mitigating Memory Latency Effects in the PA-8500 Processor. David Johnson Systems Technology Division Hewlett-Packard Company
Techniques for Mitigating Memory Latency Effects in the PA-8500 Processor David Johnson Systems Technology Division Hewlett-Packard Company Presentation Overview PA-8500 Overview uction Fetch Capabilities
More informationECE260: Fundamentals of Computer Engineering
Pipelining James Moscola Dept. of Engineering & Computer Science York College of Pennsylvania Based on Computer Organization and Design, 5th Edition by Patterson & Hennessy What is Pipelining? Pipelining
More information" # " $ % & ' ( ) * + $ " % '* + * ' "
! )! # & ) * + * + * & *,+,- Update Instruction Address IA Instruction Fetch IF Instruction Decode ID Execute EX Memory Access ME Writeback Results WB Program Counter Instruction Register Register File
More informationLeveraging OpenSPARC. ESA Round Table 2006 on Next Generation Microprocessors for Space Applications EDD
Leveraging OpenSPARC ESA Round Table 2006 on Next Generation Microprocessors for Space Applications G.Furano, L.Messina TEC- OpenSPARC T1 The T1 is a new-from-the-ground-up SPARC microprocessor implementation
More informationSuperscalar Processor Design
Superscalar Processor Design Superscalar Organization Virendra Singh Indian Institute of Science Bangalore virendra@computer.org Lecture 26 SE-273: Processor Design Super-scalar Organization Fetch Instruction
More informationPortland State University ECE 588/688. IBM Power4 System Microarchitecture
Portland State University ECE 588/688 IBM Power4 System Microarchitecture Copyright by Alaa Alameldeen 2018 IBM Power4 Design Principles SMP optimization Designed for high-throughput multi-tasking environments
More informationTopics in computer architecture
Topics in computer architecture Sun Microsystems SPARC P.J. Drongowski SandSoftwareSound.net Copyright 1990-2013 Paul J. Drongowski Sun Microsystems SPARC Scalable Processor Architecture Computer family
More informationILP concepts (2.1) Basic compiler techniques (2.2) Reducing branch costs with prediction (2.3) Dynamic scheduling (2.4 and 2.5)
Instruction-Level Parallelism and its Exploitation: PART 1 ILP concepts (2.1) Basic compiler techniques (2.2) Reducing branch costs with prediction (2.3) Dynamic scheduling (2.4 and 2.5) Project and Case
More informationCENG 3420 Computer Organization and Design. Lecture 08: Cache Review. Bei Yu
CENG 3420 Computer Organization and Design Lecture 08: Cache Review Bei Yu CEG3420 L08.1 Spring 2016 A Typical Memory Hierarchy q Take advantage of the principle of locality to present the user with as
More informationProcessor (IV) - advanced ILP. Hwansoo Han
Processor (IV) - advanced ILP Hwansoo Han Instruction-Level Parallelism (ILP) Pipelining: executing multiple instructions in parallel To increase ILP Deeper pipeline Less work per stage shorter clock cycle
More informationMultiple Instruction Issue. Superscalars
Multiple Instruction Issue Multiple instructions issued each cycle better performance increase instruction throughput decrease in CPI (below 1) greater hardware complexity, potentially longer wire lengths
More informationExploring different level of parallelism Instruction-level parallelism (ILP): how many of the operations/instructions in a computer program can be performed simultaneously 1. e = a + b 2. f = c + d 3.
More informationSeveral Common Compiler Strategies. Instruction scheduling Loop unrolling Static Branch Prediction Software Pipelining
Several Common Compiler Strategies Instruction scheduling Loop unrolling Static Branch Prediction Software Pipelining Basic Instruction Scheduling Reschedule the order of the instructions to reduce the
More informationUltraSPARC User s Manual
UltraSPARC User s Manual UltraSPARC-I UltraSPARC-II July 1997 901 San Antonio Road Palo Alto, CA 94303 Part No: 802-7220-02 This July 1997-02 Revision is only available online. The only changes made were
More informationCSE 820 Graduate Computer Architecture. week 6 Instruction Level Parallelism. Review from Last Time #1
CSE 820 Graduate Computer Architecture week 6 Instruction Level Parallelism Based on slides by David Patterson Review from Last Time #1 Leverage Implicit Parallelism for Performance: Instruction Level
More informationIntel Architecture for Software Developers
Intel Architecture for Software Developers 1 Agenda Introduction Processor Architecture Basics Intel Architecture Intel Core and Intel Xeon Intel Atom Intel Xeon Phi Coprocessor Use Cases for Software
More informationEECS 322 Computer Architecture Improving Memory Access: the Cache
EECS 322 Computer Architecture Improving emory Access: the Cache Instructor: Francis G. Wolff wolff@eecs.cwru.edu Case Western Reserve University This presentation uses powerpoint animation: please viewshow
More informationCaches. See P&H 5.1, 5.2 (except writes) Hakim Weatherspoon CS 3410, Spring 2013 Computer Science Cornell University
s See P&.,. (except writes) akim Weatherspoon CS, Spring Computer Science Cornell University What will you do over Spring Break? A) Relax B) ead home C) ead to a warm destination D) Stay in (frigid) Ithaca
More informationInstruction Pipelining Review
Instruction Pipelining Review Instruction pipelining is CPU implementation technique where multiple operations on a number of instructions are overlapped. An instruction execution pipeline involves a number
More informationHardware-Based Speculation
Hardware-Based Speculation Execute instructions along predicted execution paths but only commit the results if prediction was correct Instruction commit: allowing an instruction to update the register
More informationCPI < 1? How? What if dynamic branch prediction is wrong? Multiple issue processors: Speculative Tomasulo Processor
1 CPI < 1? How? From Single-Issue to: AKS Scalar Processors Multiple issue processors: VLIW (Very Long Instruction Word) Superscalar processors No ISA Support Needed ISA Support Needed 2 What if dynamic
More informationChapter-5 Memory Hierarchy Design
Chapter-5 Memory Hierarchy Design Unlimited amount of fast memory - Economical solution is memory hierarchy - Locality - Cost performance Principle of locality - most programs do not access all code or
More informationReorder Buffer Implementation (Pentium Pro) Reorder Buffer Implementation (Pentium Pro)
Reorder Buffer Implementation (Pentium Pro) Hardware data structures retirement register file (RRF) (~ IBM 360/91 physical registers) physical register file that is the same size as the architectural registers
More informationPowerPC TM 970: First in a new family of 64-bit high performance PowerPC processors
PowerPC TM 970: First in a new family of 64-bit high performance PowerPC processors Peter Sandon Senior PowerPC Processor Architect IBM Microelectronics All information in these materials is subject to
More informationShow Me the $... Performance And Caches
Show Me the $... Performance And Caches 1 CPU-Cache Interaction (5-stage pipeline) PCen 0x4 Add bubble PC addr inst hit? Primary Instruction Cache IR D To Memory Control Decode, Register Fetch E A B MD1
More informationComputer Architecture Spring 2016
Computer Architecture Spring 2016 Lecture 19: Multiprocessing Shuai Wang Department of Computer Science and Technology Nanjing University [Slides adapted from CSE 502 Stony Brook University] Getting More
More informationSuperscalar Machines. Characteristics of superscalar processors
Superscalar Machines Increasing pipeline length eventually leads to diminishing returns longer pipelines take longer to re-fill data and control hazards lead to increased overheads, removing any performance
More informationPipelining and Vector Processing
Chapter 8 Pipelining and Vector Processing 8 1 If the pipeline stages are heterogeneous, the slowest stage determines the flow rate of the entire pipeline. This leads to other stages idling. 8 2 Pipeline
More informationCPU Project in Western Digital: From Embedded Cores for Flash Controllers to Vision of Datacenter Processors with Open Interfaces
CPU Project in Western Digital: From Embedded Cores for Flash Controllers to Vision of Datacenter Processors with Open Interfaces Zvonimir Z. Bandic, Sr. Director Robert Golla, Sr. Fellow Dejan Vucinic,
More informationCaches. See P&H 5.1, 5.2 (except writes) Hakim Weatherspoon CS 3410, Spring 2013 Computer Science Cornell University
s See P&.,. (except writes) akim Weatherspoon CS, Spring Computer Science Cornell University What will you do over Spring Break? A) Relax B) ead home C) ead to a warm destination D) Stay in (frigid) Ithaca
More informationSam Naffziger. Gary Hammond. Next Generation Itanium Processor Overview. Lead Circuit Architect Microprocessor Technology Lab HP Corporation
Next Generation Itanium Processor Overview Gary Hammond Principal Architect Enterprise Platform Group Corporation August 27-30, 2001 Sam Naffziger Lead Circuit Architect Microprocessor Technology Lab HP
More informationCS425 Computer Systems Architecture
CS425 Computer Systems Architecture Fall 2017 Multiple Issue: Superscalar and VLIW CS425 - Vassilis Papaefstathiou 1 Example: Dynamic Scheduling in PowerPC 604 and Pentium Pro In-order Issue, Out-of-order
More informationCOMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 4. The Processor
COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 4 The Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle
More information