Poulson: An 8 Core 32 nm Next Generation I ntel I tanium. Processor. Steve Undy I ntel I NTEL CONFI DENTI AL
|
|
- Tiffany Walton
- 6 years ago
- Views:
Transcription
1 Poulson: An 8 Core 32 nm Next Generation I ntel I tanium Processor Steve Undy I ntel 1 I NTEL CONFI DENTI AL
2 This slide MUST be used with any slides rem oved from this presentation Legal Disclaim er I NFORMATI ON I N THI S DOCUMENT I S PROVI DED I N CONNECTI ON WI TH I NTEL PRODUCTS. NO LI CENSE, EXPRESS OR I MPLI ED, BY ESTOPPEL OR OTHERWI SE, TO ANY I NTELLECTUAL PROPERTY RI GHTS I S GRANTED BY THI S DOCUMENT. EXCEPT AS PROVI DED I N I NTEL S TERMS AND CONDI TI ONS OF SALE FOR SUCH PRODUCTS, I NTEL ASSUMES NO LI ABI LI TY WHATSOEVER, AND I NTEL DI SCLAI MS ANY EXPRESS OR I MPLI ED WARRANTY, RELATI NG TO SALE AND/ OR USE OF I NTEL PRODUCTS I NCLUDI NG LI ABI LI TY OR WARRANTI ES RELATI NG TO FI TNESS FOR A PARTI CULAR PURPOSE, MERCHANTABI LI TY, OR I NFRI NGEMENT OF ANY PATENT, COPYRI GHT OR OTHER I NTELLECTUAL PROPERTY RI GHT. I NTEL PRODUCTS ARE NOT I NTENDED FOR USE I N MEDI CAL, LI FE SAVI NG, OR LI FE SUSTAI NI NG APPLI CATI ONS. I ntel m ay m ake changes to specifications and product descriptions at any tim e, without notice. Software and workloads used in perform ance tests m ay have been optim ized for perform ance only on I ntel m icroprocessors. Perform ance tests, such as SYSm ark and MobileMark, are m easured using specific com puter system s, com ponents, software, operations and functions. Any change to any of those factors m ay cause the results to vary. You should consult other inform ation and perform ance tests to assist you in fully evaluating your contem plated purchases, including the perform ance of that product when com bined with other products. Configurations: [ describe config + what test used + who did testing]. For m ore inform ation go to http: / / / perform ance I ntel does not control or audit the design or im plem entation of third party benchm arks or Web sites referenced in this docum ent. I ntel encourages all of its custom ers to visit the referenced Web sites or others where sim ilar perform ance benchm arks are reported and confirm whether the referenced benchm arks are accurate and reflect perform ance of system s available for purchase. I ntel processor num bers are not a m easure of perform ance. Processor num bers differentiate features within each processor fam ily, not across different processor fam ilies. See / products/ processor_num ber for details. I ntel, processors, chipsets, and desktop boards m ay contain design defects or errors known as errata, which m ay cause the product to deviate from published specifications. Current characterized errata are available on request. The code nam es Poulson and Tukwila presented in this docum ent are only for use by I ntel to identify products, technologies, or services in developm ent, that have not been m ade com m ercially available to the public, i.e., announced, launched or shipped. They are not "com m ercial" nam es for products or services and are not intended to function as tradem arks. I nt el, I ntel I tanium, I tanium, I ntel Xeon, I ntel Core m icroarchitecture, and the I ntel logo are tradem arks or registered tradem arks of I ntel Corporation or its subsidiaries in the United States and other countries. Copyright 2011 I nt el Corporat ion. All right s reserved. 2
3 Agenda Design Space Overview Parallelism Data I ntegrity Power Conclusion 3
4 Design Space: Mission Critical Perform ance Both single-thread speed and total throughput are im portant Goal: > 2x generat ional perform ance im provem ent Scalabilit y Glueless up to 8 sockets Support larger topologies via node controllers from system vendors Reliabilit y Redundancy and self- correct ion Error detection and recovery via hardware and firm ware Com patibility Socket com patible with I tanium 9300 Processor Series (Tukwila) Com m on platform ingredients with Xeon E7 Processor Series 4
5 Poulson Overview Tukw ila Poulson Process 65nm 32nm Devices 2.05 Billion 3.1 Billion Area 700 m m m m 2 Power ( m ax TDP) 185W 170W I tanium Cores 4 8 Last Level Cache Size 24MB 32MB I ntel QuickPath I nterconnect Links 4 full + 2 half 4 full + 2 half I ntel QPI Link Speed 4.8GT/ s 6.4GT/ s I ntel Scalable Mem ory I nterface 4 4 Links I ntel SMI Link Speed 4.8GT/ s 6.4GT/ s 5
6 Poulson Overview QPI Core 0 Core 1 QPI Shared Last Level Cache QPI QPI Core 6 Core 7 8 I tanium Cores 32MB Last Level Cache Directory Cache Core 2 ½ QPI Syst em Logic Shared Last Level Cache Directory Cache Core 5 Core 3 Core 4 SMI SMI ½ QPI 4 full-width and 2 half-width I ntel QPI links 2 Direct ory caches 2 I ntegrated m em ory controllers with 2 I ntel SMI links each 6
7 Exam ple 8 - Socket Topology Full I ntel QPI Mem ory I / O I / O Hub Mem ory Mem ory Hub Mem ory Half I ntel QPI I ntel SMI Poulson Poulson Poulson Poulson Poulson Poulson Poulson Poulson Mem ory I / O Mem ory Mem ory I / O Mem ory Hub Hub 7
8 Poulson Core Tukw ila Poulson 8 Devices 108 m illion 89 m illion Area 69 m m 2 20 m m 2 Process scaled TDP Process scaled frequency 1.0 > 1.2 Threads I nstruction Q size 48 96x2 Max inst ruct ion issue/ cycle 6 12 Pipeline stages Pipeline hazard resolution I nterlock+ Stall Replay I nteger RF size I nteger RF ports 12R 8W 12R 12W RF protection Parity SECDED All- new I tanium core
9 I tanium New I nstructions I nteger operations m py4 m pyshl4 clz Thread Cont rol hint@priority Expanded Data Access Hints m ov dahr Expanded Software Prefetch lfetch.count 9
10 Exploiting Parallelism on All Levels I nst ruct ion Level Parallelism Mem ory Parallelism Thread Parallelism Multi- core Parallelism 10 Multi- level parallelism exposed to softw are control
11 I nstruction Parallelism Front- end Pipeline Flush Main Pipeline Replay Replay I PG FET FDC REN DEC REG EXE DET WRB WB2 I PG: Address Gen FET: Array Access FDC: Early Decode REN: Regist er Renam e Fetch width: 32B/ cycle Mult i- level branch Predict ion Renam ing: 128 logical to 185 physical registers I ndependent thread dom ain I BD/ Q I BD: I nstr Buffer/ Dispersal DEC: I nstr Decode REG: Register Read EXE: Execut e DET: Except ion Det ect WRB: Writ e- back WB2: Com m it Q holds 96 instructions I n-order issue I ndependent thread dom ain OZQ MLD Pipeline Concurrency via 3 decoupled pipelines MLD: Mid-level data cache OZQ: Architectural m em ory ordering queue OZQ holds 16 m em ory ops OOO issue I ndependent thread dom ain 11
12 I nstruction Parallelism Front- end Back- end MLD I PG FET FDC REN 32B 6 I nstructions MQ I Q Mem ory I nteger OZQ Com pilers precisely control instruction dispersal AQ FQ BQ NopQ ALU Float ing Point Branch NOP w ide issue under softw are control 12
13 Mem ory Parallelism Avoiding Hazards Expanded Soft ware Hint s Provides control over allocation, speculation and prefetch policies Multi-line software prefetch Reduced Speculat ion Cost s Spont aneous Deferrals on speculat ive loads Deferred load can transform into a prefetch to cache and/ or TLB New Hardware Prefet cher I nto FLD and/ or MLD Adaptive based on FLD/ MLD m iss patterns Reduced Dat a Hazards Expanded store to consum er bypassing in FLD and MLD Single-cycle MLD to FLD line transfers Pipeline bottlenecks rem oved 13
14 Mem ory Parallelism I ncreasing Throughput More pending m em ory operations in each core I ncreased from 48 to 64 operations in MLD I ncreased from 44 to 80 operations in Ring interface Ring Cache 8 I ndependent LLC pipelines and controllers per socket 2 caching agents per socket to generate I ntel QPI transactions I ncreased I ntel QPI link bandwidth for m ulti- socket throughput 14 Mem ory Subsystem I m provem ents Hom e agent in-flight transaction tracker increased by 50% MC scheduler algorithm changes focused on perform ance and power efficiency I ntel SMI link speed increased I m proved queuing and B/ W
15 Thread Parallelism I ntel Hyper-Threading Technology with Dual Dom ain Multithreading support Front-end and m ain pipelines are independently threaded Pipeline-specific thread switch m echanism s More per thread resources I nstruction queues Data TLBs Hardware page walker handles concurrent walks to both threads I nstructions provide software hints to thread switch hardware Attention to Thread Fairness Hardware t o ident ify unfair behavior Measures instruction retirem ent and data cache resource usage Firm ware configurable responses to unfairness By biasing towards victim thread I ncreased throughput via im proved threading 15
16 Multi- core and Multi- socket Parallelism 700GB/ s (aggregate) Core 3 LLC LLC Core 4 Core 2 LLC LLC Core 5 10 port Rout er 128GB/ s QPI x6 I ntel QPI HAx2 MCx2 45GB/ s I ntel SMI Core 1 LLC LLC Core 6 Core 0 LLC LLC Core 7 16
17 Reliability, Fault Tolerance and Recoverability I ntel I nstruction Replay Technology Main pipeline redesigned for error handling Enabled hardware and firm ware recovery of correct able errors I ncreased Error Detection and Correction Arrays MLI, MLD, LLC tag and Directory cache SECDED LLC data DECTED Register files (GR and FR) SECDED Logic path error detection in FP ALU via residues Key paths are protected end to end with invalidation and replay Reworked Error Det ect ion, Response and Report ing Fram ework Large increase in error detection and response capabilities I ncreased hardware responses and recovery I ntel Cache Safe Technology: lockout of failing cache locations 17
18 Pow er Aw are Design Replay- based Pipeline Enabled fully gated clocks Elim inated slow and power-wasting global pipeline stalls Aggressive Clock Gat ing Substantial reductions in both idle, typical and m ax power Rem oved Dynam ic Logic from FPU Low- Leakage FETs 18 Digit al Power Predict ion Measures, weights and averages key core state signals Beyond architectural state uses data patterns Estim ation error decreased from 35% to 2% 6 0 % TDP, 7 0 % idle C dyn reduction over process scaled Tukw ila
19 Poulson Status Well into post-si validation Booted and being tested on m ultiple operating system s Running in m any topologies On track for 2012 shipm ents 19
20 Conclusion All new core design 12- wide issue I ntel Hyper-Threading Technology with Dual Dom ain Multithreading I tanium New I nstructions and policies for fine-grain software control of hardware parallelism I ntel I nstruction Replay Technology for frequency, power and fault tolerance Core power efficiency: > 3x better than Tukwila 1 Ring- based design Large last level caches High bandwidth system interface Outstanding data reliability Enhanced error detection I m proved error recovery Better RAS integration 1 I nternal lab m easurem ent 20
21
22 Glossary DECTED DTB FI T FLB Double Error Correction, Triple Error Detection Second Level Data TLB Failure in Thousands First Level Branch Cache MLD MLI OZQ Mid Level Data Cache Mid Level I nstruction Cache Ordering czar Queue: architectural m em ory ordering QPI I ntel QuickPath I nterface FLD FLI HA I LP First Level Dat a Cache First Level I nstruction Cache Hom e Agent: I ntel QPI agent responsible for m anaging m em ory I nst ruct ion Level Parallelism I PF I tanium Processor Fam ily LLC MC Last Level Cache Mem ory Cont roller RAS RF SECDED Reliability, Availability and Serviceability Register File Single Error Correction, Double Error Detection SMI I ntel Scalable Mem ory I nt erconnect TD P TLB Therm al Design Power Translat ion Look- aside Buffer 22
Transitioning the I ntel Next. ( Nehalem and W estm ere) into the. Lily Looi, St ephan Jourdan I ntel Corporation Hot Chips 21 August 24, 2009
Transitioning the I ntel Next Generation Microarchitectures ( Nehalem and W estm ere) into the Mainstream Lily Looi, St ephan Jourdan I ntel Corporation Hot Chips 21 August 24, 2009 1 Agenda Next Generation
More informationAMD and OpenCL. Mike Houst on Senior Syst em Archit ect Advanced Micro Devices, I nc.
AMD and OpenCL Mike Houst on Senior Syst em Archit ect Advanced Micro Devices, I nc. Overview Brief overview of ATI Radeon HD 4870 architecture and operat ion Spreadsheet perform ance analysis Tips & tricks
More informationI ntel. QuickPath I nterconnect Overview. presented by Robert Safranek. Contributors: Gurbir Singh, Robert Maddox
Overview presented by Robert Safranek Contributors: Gurbir Singh, Robert Maddox Legal Disclaim er INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED,
More informationMaximize Performance and Scalability of RADIOSS* Structural Analysis Software on Intel Xeon Processor E7 v2 Family-Based Platforms
Maximize Performance and Scalability of RADIOSS* Structural Analysis Software on Family-Based Platforms Executive Summary Complex simulations of structural and systems performance, such as car crash simulations,
More informationForging a Future in Memory: New Technologies, New Markets, New Applications
Forging a Future in Memory: New Technologies, New Markets, New Applications Ed Doller V.P. Chief Memory Systems Architect Non-Volatile Memory Seminar Hot Chips Conference August 22, 2010 Memorial Auditorium
More informationQuadrics QsNet II : A network for Supercomputing Applications
Quadrics QsNet II : A network for Supercomputing Applications David Addison, Jon Beecroft, David Hewson, Moray McLaren (Quadrics Ltd.), Fabrizio Petrini (LANL) You ve bought your super computer You ve
More informationIntel Architecture for Software Developers
Intel Architecture for Software Developers 1 Agenda Introduction Processor Architecture Basics Intel Architecture Intel Core and Intel Xeon Intel Atom Intel Xeon Phi Coprocessor Use Cases for Software
More informationIntel Atom Processor Based Platform Technologies. Intelligent Systems Group Intel Corporation
Intel Atom Processor Based Platform Technologies Intelligent Systems Group Intel Corporation Legal Disclaimer INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS
More informationItanium 2 Processor Microarchitecture Overview
Itanium 2 Processor Microarchitecture Overview Don Soltis, Mark Gibson Cameron McNairy, August 2002 Block Diagram F 16KB L1 I-cache Instr 2 Instr 1 Instr 0 M/A M/A M/A M/A I/A Template I/A B B 2 FMACs
More informationHandout 2 ILP: Part B
Handout 2 ILP: Part B Review from Last Time #1 Leverage Implicit Parallelism for Performance: Instruction Level Parallelism Loop unrolling by compiler to increase ILP Branch prediction to increase ILP
More informationMultithreaded Processors. Department of Electrical Engineering Stanford University
Lecture 12: Multithreaded Processors Department of Electrical Engineering Stanford University http://eeclass.stanford.edu/ee382a Lecture 12-1 The Big Picture Previous lectures: Core design for single-thread
More informationSam Naffziger. Gary Hammond. Next Generation Itanium Processor Overview. Lead Circuit Architect Microprocessor Technology Lab HP Corporation
Next Generation Itanium Processor Overview Gary Hammond Principal Architect Enterprise Platform Group Corporation August 27-30, 2001 Sam Naffziger Lead Circuit Architect Microprocessor Technology Lab HP
More informationAdvanced d Instruction Level Parallelism. Computer Systems Laboratory Sungkyunkwan University
Advanced d Instruction ti Level Parallelism Jin-Soo Kim (jinsookim@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu ILP Instruction-Level Parallelism (ILP) Pipelining:
More informationChapter 3 Instruction-Level Parallelism and its Exploitation (Part 1)
Chapter 3 Instruction-Level Parallelism and its Exploitation (Part 1) ILP vs. Parallel Computers Dynamic Scheduling (Section 3.4, 3.5) Dynamic Branch Prediction (Section 3.3) Hardware Speculation and Precise
More informationThe Transition to PCI Express* for Client SSDs
The Transition to PCI Express* for Client SSDs Amber Huffman Senior Principal Engineer Intel Santa Clara, CA 1 *Other names and brands may be claimed as the property of others. Legal Notices and Disclaimers
More informationCS 152 Computer Architecture and Engineering
CS 152 Computer Architecture and Engineering Lecture 18 Advanced Processors II 2006-10-31 John Lazzaro (www.cs.berkeley.edu/~lazzaro) Thanks to Krste Asanovic... TAs: Udam Saini and Jue Sun www-inst.eecs.berkeley.edu/~cs152/
More informationENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design
ENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design Professor Sherief Reda http://scale.engin.brown.edu Electrical Sciences and Computer Engineering School of Engineering Brown University
More informationIBM's POWER5 Micro Processor Design and Methodology
IBM's POWER5 Micro Processor Design and Methodology Ron Kalla IBM Systems Group Outline POWER5 Overview Design Process Power POWER Server Roadmap 2001 POWER4 2002-3 POWER4+ 2004* POWER5 2005* POWER5+ 2006*
More informationHardware-Based Speculation
Hardware-Based Speculation Execute instructions along predicted execution paths but only commit the results if prediction was correct Instruction commit: allowing an instruction to update the register
More informationReliability & Flow Control
Read 7.E Reliability & Flow Control Prof. ina Kat abi Some slides are from lectures by Nick Mckeown, Ion Stoica, Frans Kaashoek, Hari Balakrishnan, and Sam Madden 1 Previous Lecture How the link layer
More informationThe Alpha and Microprocessors:
The Alpha 21364 and 21464 icroprocessors: Continuing the Performance Lead Beyond Y2K Shubu ukherjee, Ph.D. Principal Hardware Engineer VSSAD Labs, Alpha Development Group Compaq Computer Corporation Shrewsbury,
More informationSolid-State Drive System Optimizations In Data Center Applications
Solid-State Drive System Optimizations In Data Center Applications Tahmid Rahman Senior Technical Marketing Engineer Non Volatile Memory Solutions Group Intel Corporation Flash Memory Summit 2011 Santa
More informationCS 2410 Mid term (fall 2015) Indicate which of the following statements is true and which is false.
CS 2410 Mid term (fall 2015) Name: Question 1 (10 points) Indicate which of the following statements is true and which is false. (1) SMT architectures reduces the thread context switch time by saving in
More informationLecture 16: Checkpointed Processors. Department of Electrical Engineering Stanford University
Lecture 16: Checkpointed Processors Department of Electrical Engineering Stanford University http://eeclass.stanford.edu/ee382a Lecture 18-1 Announcements Reading for today: class notes Your main focus:
More information1. PowerPC 970MP Overview
1. The IBM PowerPC 970MP reduced instruction set computer (RISC) microprocessor is an implementation of the PowerPC Architecture. This chapter provides an overview of the features of the 970MP microprocessor
More informationAccelerating HPC. (Nash) Dr. Avinash Palaniswamy High Performance Computing Data Center Group Marketing
Accelerating HPC (Nash) Dr. Avinash Palaniswamy High Performance Computing Data Center Group Marketing SAAHPC, Knoxville, July 13, 2010 Legal Disclaimer Intel may make changes to specifications and product
More informationAdvanced Com puter Architecture: s1/ 2005
Advanced Com puter Architecture: s1/ 2005 Project Presen tation David Mirabito Handling branches through context fking Currently: b eq Hypothetical instruction stream (operands rem oved) Currently: b eq
More informationProcessor (IV) - advanced ILP. Hwansoo Han
Processor (IV) - advanced ILP Hwansoo Han Instruction-Level Parallelism (ILP) Pipelining: executing multiple instructions in parallel To increase ILP Deeper pipeline Less work per stage shorter clock cycle
More informationTDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading
Review on ILP TDT 4260 Chap 5 TLP & Hierarchy What is ILP? Let the compiler find the ILP Advantages? Disadvantages? Let the HW find the ILP Advantages? Disadvantages? Contents Multi-threading Chap 3.5
More informationc. What are the machine cycle times (in nanoseconds) of the non-pipelined and the pipelined implementations?
Brown University School of Engineering ENGN 164 Design of Computing Systems Professor Sherief Reda Homework 07. 140 points. Due Date: Monday May 12th in B&H 349 1. [30 points] Consider the non-pipelined
More informationSeveral Common Compiler Strategies. Instruction scheduling Loop unrolling Static Branch Prediction Software Pipelining
Several Common Compiler Strategies Instruction scheduling Loop unrolling Static Branch Prediction Software Pipelining Basic Instruction Scheduling Reschedule the order of the instructions to reduce the
More informationCS425 Computer Systems Architecture
CS425 Computer Systems Architecture Fall 2017 Multiple Issue: Superscalar and VLIW CS425 - Vassilis Papaefstathiou 1 Example: Dynamic Scheduling in PowerPC 604 and Pentium Pro In-order Issue, Out-of-order
More information6x86 PROCESSOR Superscalar, Superpipelined, Sixth-generation, x86 Compatible CPU
1-6x86 PROCESSOR Superscalar, Superpipelined, Sixth-generation, x86 Compatible CPU Product Overview Introduction 1. ARCHITECTURE OVERVIEW The Cyrix 6x86 CPU is a leader in the sixth generation of high
More informationLecture: Out-of-order Processors. Topics: out-of-order implementations with issue queue, register renaming, and reorder buffer, timing, LSQ
Lecture: Out-of-order Processors Topics: out-of-order implementations with issue queue, register renaming, and reorder buffer, timing, LSQ 1 An Out-of-Order Processor Implementation Reorder Buffer (ROB)
More informationCOSC4201 Instruction Level Parallelism Dynamic Scheduling
COSC4201 Instruction Level Parallelism Dynamic Scheduling Prof. Mokhtar Aboelaze Parts of these slides are taken from Notes by Prof. David Patterson (UCB) Outline Data dependence and hazards Exposing parallelism
More informationHardware-based speculation (2.6) Multiple-issue plus static scheduling = VLIW (2.7) Multiple-issue, dynamic scheduling, and speculation (2.
Instruction-Level Parallelism and its Exploitation: PART 2 Hardware-based speculation (2.6) Multiple-issue plus static scheduling = VLIW (2.7) Multiple-issue, dynamic scheduling, and speculation (2.8)
More informationCS 152 Computer Architecture and Engineering. Lecture 10 - Complex Pipelines, Out-of-Order Issue, Register Renaming
CS 152 Computer Architecture and Engineering Lecture 10 - Complex Pipelines, Out-of-Order Issue, Register Renaming John Wawrzynek Electrical Engineering and Computer Sciences University of California at
More informationComputer Science 146. Computer Architecture
Computer Architecture Spring 2004 Harvard University Instructor: Prof. dbrooks@eecs.harvard.edu Lecture 12: Hardware Assisted Software ILP and IA64/Itanium Case Study Lecture Outline Review of Global Scheduling,
More informationM7: Next Generation SPARC. Hotchips 26 August 12, Stephen Phillips Senior Director, SPARC Architecture Oracle
M7: Next Generation SPARC Hotchips 26 August 12, 2014 Stephen Phillips Senior Director, SPARC Architecture Oracle Safe Harbor Statement The following is intended to outline our general product direction.
More information" # " $ % & ' ( ) * + $ " % '* + * ' "
! )! # & ) * + * + * & *,+,- Update Instruction Address IA Instruction Fetch IF Instruction Decode ID Execute EX Memory Access ME Writeback Results WB Program Counter Instruction Register Register File
More informationJomar Silva Technical Evangelist
Jomar Silva Technical Evangelist Agenda Introduction Intel Graphics Performance Analyzers: what is it, where do I get it, and how do I use it? Intel GPA with VR What devices can I use Intel GPA with and
More informationPOWER7: IBM's Next Generation Server Processor
POWER7: IBM's Next Generation Server Processor Acknowledgment: This material is based upon work supported by the Defense Advanced Research Projects Agency under its Agreement No. HR0011-07-9-0002 Outline
More informationCSE 820 Graduate Computer Architecture. week 6 Instruction Level Parallelism. Review from Last Time #1
CSE 820 Graduate Computer Architecture week 6 Instruction Level Parallelism Based on slides by David Patterson Review from Last Time #1 Leverage Implicit Parallelism for Performance: Instruction Level
More informationThe Processor: Instruction-Level Parallelism
The Processor: Instruction-Level Parallelism Computer Organization Architectures for Embedded Computing Tuesday 21 October 14 Many slides adapted from: Computer Organization and Design, Patterson & Hennessy
More informationAdvanced Parallel Programming I
Advanced Parallel Programming I Alexander Leutgeb, RISC Software GmbH RISC Software GmbH Johannes Kepler University Linz 2016 22.09.2016 1 Levels of Parallelism RISC Software GmbH Johannes Kepler University
More informationOpenCL *, Heterogeneous Computing, and the CPU
OpenCL *, Heterogeneous Computing, and the CPU Dr. Tim Mattson Visual Applications Research lab Intel Corporation * 3 rd party nam es are the property of their ow ners I ntel Corporation, 2009 Agenda Heterogeneous
More informationInstruction Level Parallelism. Appendix C and Chapter 3, HP5e
Instruction Level Parallelism Appendix C and Chapter 3, HP5e Outline Pipelining, Hazards Branch prediction Static and Dynamic Scheduling Speculation Compiler techniques, VLIW Limits of ILP. Implementation
More informationVirtual Memory. Reading. Sections 5.4, 5.5, 5.6, 5.8, 5.10 (2) Lecture notes from MKP and S. Yalamanchili
Virtual Memory Lecture notes from MKP and S. Yalamanchili Sections 5.4, 5.5, 5.6, 5.8, 5.10 Reading (2) 1 The Memory Hierarchy ALU registers Cache Memory Memory Memory Managed by the compiler Memory Managed
More informationLecture-13 (ROB and Multi-threading) CS422-Spring
Lecture-13 (ROB and Multi-threading) CS422-Spring 2018 Biswa@CSE-IITK Cycle 62 (Scoreboard) vs 57 in Tomasulo Instruction status: Read Exec Write Exec Write Instruction j k Issue Oper Comp Result Issue
More informationILP concepts (2.1) Basic compiler techniques (2.2) Reducing branch costs with prediction (2.3) Dynamic scheduling (2.4 and 2.5)
Instruction-Level Parallelism and its Exploitation: PART 1 ILP concepts (2.1) Basic compiler techniques (2.2) Reducing branch costs with prediction (2.3) Dynamic scheduling (2.4 and 2.5) Project and Case
More informationThe Alpha Microprocessor: Out-of-Order Execution at 600 Mhz. R. E. Kessler COMPAQ Computer Corporation Shrewsbury, MA
The Alpha 21264 Microprocessor: Out-of-Order ution at 600 Mhz R. E. Kessler COMPAQ Computer Corporation Shrewsbury, MA 1 Some Highlights z Continued Alpha performance leadership y 600 Mhz operation in
More informationModern CPU Architectures
Modern CPU Architectures Alexander Leutgeb, RISC Software GmbH RISC Software GmbH Johannes Kepler University Linz 2014 16.04.2014 1 Motivation for Parallelism I CPU History RISC Software GmbH Johannes
More informationStatic Multiple-Issue Processors: VLIW Approach
Static Multiple-Issue Processors: VLIW Approach Instructor: Prof. Cristina Silvano, email: cristina.silvano@polimi.it Teaching Assistant: Dr. Giovanni Agosta, email: agosta@acm.org Dipartimento di Elettronica,
More informationChapter 3 (CONT II) Instructor: Josep Torrellas CS433. Copyright J. Torrellas 1999,2001,2002,2007,
Chapter 3 (CONT II) Instructor: Josep Torrellas CS433 Copyright J. Torrellas 1999,2001,2002,2007, 2013 1 Hardware-Based Speculation (Section 3.6) In multiple issue processors, stalls due to branches would
More informationPOWER7: IBM's Next Generation Server Processor
Hot Chips 21 POWER7: IBM's Next Generation Server Processor Ronald Kalla Balaram Sinharoy POWER7 Chief Engineer POWER7 Chief Core Architect Acknowledgment: This material is based upon work supported by
More informationIntel Xeon Phi Coprocessor Performance Analysis
Intel Xeon Phi Coprocessor Performance Analysis Legal Disclaimer INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO
More informationIntel Xeon Phi coprocessor (codename Knights Corner) George Chrysos Senior Principal Engineer Hot Chips, August 28, 2012
Intel Xeon Phi coprocessor (codename Knights Corner) George Chrysos Senior Principal Engineer Hot Chips, August 28, 2012 Legal Disclaimers INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL
More informationNOW Handout Page 1. Review from Last Time #1. CSE 820 Graduate Computer Architecture. Lec 8 Instruction Level Parallelism. Outline
CSE 820 Graduate Computer Architecture Lec 8 Instruction Level Parallelism Based on slides by David Patterson Review Last Time #1 Leverage Implicit Parallelism for Performance: Instruction Level Parallelism
More informationKaisen Lin and Michael Conley
Kaisen Lin and Michael Conley Simultaneous Multithreading Instructions from multiple threads run simultaneously on superscalar processor More instruction fetching and register state Commercialized! DEC
More informationAdapted from instructor s. Organization and Design, 4th Edition, Patterson & Hennessy, 2008, MK]
Review and Advanced d Concepts Adapted from instructor s supplementary material from Computer Organization and Design, 4th Edition, Patterson & Hennessy, 2008, MK] Pipelining Review PC IF/ID ID/EX EX/M
More informationPentium IV-XEON. Computer architectures M
Pentium IV-XEON Computer architectures M 1 Pentium IV block scheme 4 32 bytes parallel Four access ports to the EU 2 Pentium IV block scheme Address Generation Unit BTB Branch Target Buffer I-TLB Instruction
More informationECE 4750 Computer Architecture, Fall 2017 T05 Integrating Processors and Memories
ECE 4750 Computer Architecture, Fall 2017 T05 Integrating Processors and Memories School of Electrical and Computer Engineering Cornell University revision: 2017-10-17-12-06 1 Processor and L1 Cache Interface
More information/ : Computer Architecture and Design Fall Final Exam December 4, Name: ID #:
16.482 / 16.561: Computer Architecture and Design Fall 2014 Final Exam December 4, 2014 Name: ID #: For this exam, you may use a calculator and two 8.5 x 11 double-sided page of notes. All other electronic
More informationDual-Core Execution: Building A Highly Scalable Single-Thread Instruction Window
Dual-Core Execution: Building A Highly Scalable Single-Thread Instruction Window Huiyang Zhou School of Computer Science University of Central Florida New Challenges in Billion-Transistor Processor Era
More informationECE/CS 552: Pipelining to Superscalar Prof. Mikko Lipasti
ECE/CS 552: Pipelining to Superscalar Prof. Mikko Lipasti Lecture notes based in part on slides created by Mark Hill, David Wood, Guri Sohi, John Shen and Jim Smith Pipelining to Superscalar Forecast Real
More informationELE 655 Microprocessor System Design
ELE 655 Microprocessor System Design Section 2 Instruction Level Parallelism Class 1 Basic Pipeline Notes: Reg shows up two places but actually is the same register file Writes occur on the second half
More informationTransient Fault Detection and Reducing Transient Error Rate. Jose Lugo-Martinez CSE 240C: Advanced Microarchitecture Prof.
Transient Fault Detection and Reducing Transient Error Rate Jose Lugo-Martinez CSE 240C: Advanced Microarchitecture Prof. Steven Swanson Outline Motivation What are transient faults? Hardware Fault Detection
More informationAnnouncements. ECE4750/CS4420 Computer Architecture L11: Speculative Execution I. Edward Suh Computer Systems Laboratory
ECE4750/CS4420 Computer Architecture L11: Speculative Execution I Edward Suh Computer Systems Laboratory suh@csl.cornell.edu Announcements Lab3 due today 2 1 Overview Branch penalties limit performance
More informationCS 152 Computer Architecture and Engineering
CS 152 Computer Architecture and Engineering Lecture 19 Advanced Processors III 2006-11-2 John Lazzaro (www.cs.berkeley.edu/~lazzaro) TAs: Udam Saini and Jue Sun www-inst.eecs.berkeley.edu/~cs152/ 1 Last
More informationPortland State University ECE 588/688. Cray-1 and Cray T3E
Portland State University ECE 588/688 Cray-1 and Cray T3E Copyright by Alaa Alameldeen 2014 Cray-1 A successful Vector processor from the 1970s Vector instructions are examples of SIMD Contains vector
More informationIntel and the Future of Consumer Electronics. Shahrokh Shahidzadeh Sr. Principal Technologist
1 Intel and the Future of Consumer Electronics Shahrokh Shahidzadeh Sr. Principal Technologist Legal Notices and Disclaimers INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS.
More informationExploring the Effects of Hyperthreading on Scientific Applications
Exploring the Effects of Hyperthreading on Scientific Applications by Kent Milfeld milfeld@tacc.utexas.edu edu Kent Milfeld, Chona Guiang, Avijit Purkayastha, Jay Boisseau TEXAS ADVANCED COMPUTING CENTER
More informationIntel s Architecture for NFV
Intel s Architecture for NFV Evolution from specialized technology to mainstream programming Net Futures 2015 Network applications Legal Disclaimer INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION
More informationMIPS Pipelining. Computer Organization Architectures for Embedded Computing. Wednesday 8 October 14
MIPS Pipelining Computer Organization Architectures for Embedded Computing Wednesday 8 October 14 Many slides adapted from: Computer Organization and Design, Patterson & Hennessy 4th Edition, 2011, MK
More informationIntel Many Integrated Core (MIC) Architecture
Intel Many Integrated Core (MIC) Architecture Karl Solchenbach Director European Exascale Labs BMW2011, November 3, 2011 1 Notice and Disclaimers Notice: This document contains information on products
More informationAdvanced Instruction-Level Parallelism
Advanced Instruction-Level Parallelism Jinkyu Jeong (jinkyu@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu EEE3050: Theory on Computer Architectures, Spring 2017, Jinkyu
More informationOrange Coast College. Business Division. Computer Science Department. CS 116- Computer Architecture. Pipelining
Orange Coast College Business Division Computer Science Department CS 116- Computer Architecture Pipelining Recall Pipelining is parallelizing execution Key to speedups in processors Split instruction
More informationTDT 4260 lecture 7 spring semester 2015
1 TDT 4260 lecture 7 spring semester 2015 Lasse Natvig, The CARD group Dept. of computer & information science NTNU 2 Lecture overview Repetition Superscalar processor (out-of-order) Dependencies/forwarding
More informationPipelining and Vector Processing
Pipelining and Vector Processing Chapter 8 S. Dandamudi Outline Basic concepts Handling resource conflicts Data hazards Handling branches Performance enhancements Example implementations Pentium PowerPC
More informationCollecting OpenCL*-related Metrics with Intel Graphics Performance Analyzers
Collecting OpenCL*-related Metrics with Intel Graphics Performance Analyzers Collecting Important OpenCL*-related Metrics with Intel GPA System Analyzer Introduction Intel SDK for OpenCL* Applications
More informationENGN 2910A Homework 03 (140 points) Due Date: Oct 3rd 2013
ENGN 2910A Homework 03 (140 points) Due Date: Oct 3rd 2013 Professor: Sherief Reda School of Engineering, Brown University 1. [from Debois et al. 30 points] Consider the non-pipelined implementation of
More informationEE382A Lecture 7: Dynamic Scheduling. Department of Electrical Engineering Stanford University
EE382A Lecture 7: Dynamic Scheduling Department of Electrical Engineering Stanford University http://eeclass.stanford.edu/ee382a Lecture 7-1 Announcements Project proposal due on Wed 10/14 2-3 pages submitted
More informationAdapted from David Patterson s slides on graduate computer architecture
Mei Yang Adapted from David Patterson s slides on graduate computer architecture Introduction Basic Compiler Techniques for Exposing ILP Advanced Branch Prediction Dynamic Scheduling Hardware-Based Speculation
More informationLecture 29 Review" CPU time: the best metric" Be sure you understand CC, clock period" Common (and good) performance metrics"
Be sure you understand CC, clock period Lecture 29 Review Suggested reading: Everything Q1: D[8] = D[8] + RF[1] + RF[4] I[15]: Add R2, R1, R4 RF[1] = 4 I[16]: MOV R3, 8 RF[4] = 5 I[17]: Add R2, R2, R3
More informationE0-243: Computer Architecture
E0-243: Computer Architecture L1 ILP Processors RG:E0243:L1-ILP Processors 1 ILP Architectures Superscalar Architecture VLIW Architecture EPIC, Subword Parallelism, RG:E0243:L1-ILP Processors 2 Motivation
More informationProcessor Architecture
Processor Architecture Advanced Dynamic Scheduling Techniques M. Schölzel Content Tomasulo with speculative execution Introducing superscalarity into the instruction pipeline Multithreading Content Tomasulo
More informationNon-Volatile Memory Cache Enhancements: Turbo-Charging Client Platform Performance
Non-Volatile Memory Cache Enhancements: Turbo-Charging Client Platform Performance By Robert E Larsen NVM Cache Product Line Manager Intel Corporation August 2008 1 Legal Disclaimer INFORMATION IN THIS
More informationInstruction-Level Parallelism and Its Exploitation (Part III) ECE 154B Dmitri Strukov
Instruction-Level Parallelism and Its Exploitation (Part III) ECE 154B Dmitri Strukov Dealing With Control Hazards Simplest solution to stall pipeline until branch is resolved and target address is calculated
More informationCS425 Computer Systems Architecture
CS425 Computer Systems Architecture Fall 2017 Thread Level Parallelism (TLP) CS425 - Vassilis Papaefstathiou 1 Multiple Issue CPI = CPI IDEAL + Stalls STRUC + Stalls RAW + Stalls WAR + Stalls WAW + Stalls
More informationOpenCL* and Microsoft DirectX* Video Acceleration Surface Sharing
OpenCL* and Microsoft DirectX* Video Acceleration Surface Sharing Intel SDK for OpenCL* Applications Sample Documentation Copyright 2010 2012 Intel Corporation All Rights Reserved Document Number: 327281-001US
More informationPerformance of Computer Systems. CSE 586 Computer Architecture. Review. ISA s (RISC, CISC, EPIC) Basic Pipeline Model.
Performance of Computer Systems CSE 586 Computer Architecture Review Jean-Loup Baer http://www.cs.washington.edu/education/courses/586/00sp Performance metrics Use (weighted) arithmetic means for execution
More informationCS 152 Computer Architecture and Engineering
CS 152 Computer Architecture and Engineering Lecture 22 Advanced Processors III 2005-4-12 John Lazzaro (www.cs.berkeley.edu/~lazzaro) TAs: Ted Hong and David Marquardt www-inst.eecs.berkeley.edu/~cs152/
More informationComputer Architecture 计算机体系结构. Lecture 4. Instruction-Level Parallelism II 第四讲 指令级并行 II. Chao Li, PhD. 李超博士
Computer Architecture 计算机体系结构 Lecture 4. Instruction-Level Parallelism II 第四讲 指令级并行 II Chao Li, PhD. 李超博士 SJTU-SE346, Spring 2018 Review Hazards (data/name/control) RAW, WAR, WAW hazards Different types
More informationEC 513 Computer Architecture
EC 513 Computer Architecture Complex Pipelining: Superscalar Prof. Michel A. Kinsy Summary Concepts Von Neumann architecture = stored-program computer architecture Self-Modifying Code Princeton architecture
More informationLecture 9: Multiple Issue (Superscalar and VLIW)
Lecture 9: Multiple Issue (Superscalar and VLIW) Iakovos Mavroidis Computer Science Department University of Crete Example: Dynamic Scheduling in PowerPC 604 and Pentium Pro In-order Issue, Out-of-order
More information250P: Computer Systems Architecture. Lecture 9: Out-of-order execution (continued) Anton Burtsev February, 2019
250P: Computer Systems Architecture Lecture 9: Out-of-order execution (continued) Anton Burtsev February, 2019 The Alpha 21264 Out-of-Order Implementation Reorder Buffer (ROB) Branch prediction and instr
More informationC 1. Last time. CSE 490/590 Computer Architecture. Complex Pipelining I. Complex Pipelining: Motivation. Floating-Point Unit (FPU) Floating-Point ISA
CSE 490/590 Computer Architecture Complex Pipelining I Steve Ko Computer Sciences and Engineering University at Buffalo Last time Virtual address caches Virtually-indexed, physically-tagged cache design
More informationThe Alpha Microprocessor: Out-of-Order Execution at 600 MHz. Some Highlights
The Alpha 21264 Microprocessor: Out-of-Order ution at 600 MHz R. E. Kessler Compaq Computer Corporation Shrewsbury, MA 1 Some Highlights Continued Alpha performance leadership 600 MHz operation in 0.35u
More informationRon Kalla, Balaram Sinharoy, Joel Tendler IBM Systems Group
Simultaneous Multi-threading Implementation in POWER5 -- IBM's Next Generation POWER Microprocessor Ron Kalla, Balaram Sinharoy, Joel Tendler IBM Systems Group Outline Motivation Background Threading Fundamentals
More informationHyperthreading 3/25/2008. Hyperthreading. ftp://download.intel.com/technology/itj/2002/volume06issue01/art01_hyper/vol6iss1_art01.
Hyperthreading ftp://download.intel.com/technology/itj/2002/volume06issue01/art01_hyper/vol6iss1_art01.pdf Hyperthreading is a design that makes everybody concerned believe that they are actually using
More information