HOT CHIPS 2014 NVIDIA S DENVER PROCESSOR. Darrell Boggs, CPU Architecture Co-authors: Gary Brown, Bill Rozas, Nathan Tuck, K S Venkatraman

Similar documents
NVIDIA S FIRST CPU IS A WINNER

ENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design

ECE 571 Advanced Microprocessor-Based Design Lecture 22

The Pentium II/III Processor Compiler on a Chip

Multithreaded Processors. Department of Electrical Engineering Stanford University

Advanced d Instruction Level Parallelism. Computer Systems Laboratory Sungkyunkwan University

TR An Overview of NVIDIA Tegra K1 Architecture. Ang Li, Radu Serban, Dan Negrut

Itanium 2 Processor Microarchitecture Overview

High-Performance Microarchitecture Techniques John Paul Shen Director of Microarchitecture Research Intel Labs

Processor (IV) - advanced ILP. Hwansoo Han

Jim Keller. Digital Equipment Corp. Hudson MA

Arm s Latest CPU for Laptop-Class Performance

Each Milliwatt Matters

KeyStone II. CorePac Overview

The Alpha Microprocessor: Out-of-Order Execution at 600 Mhz. R. E. Kessler COMPAQ Computer Corporation Shrewsbury, MA

How to write powerful parallel Applications

CS 152 Computer Architecture and Engineering

Power 7. Dan Christiani Kyle Wieschowski

The Processor: Instruction-Level Parallelism

CPU Architecture Overview. Varun Sampath CIS 565 Spring 2012

Computer and Information Sciences College / Computer Science Department Enhancing Performance with Pipelining

TDT 4260 lecture 7 spring semester 2015

Silvermont. Introducing Next Generation Low Power Microarchitecture: Dadi Perlmutter

The Alpha Microprocessor: Out-of-Order Execution at 600 MHz. Some Highlights

Dynamic Scheduling. CSE471 Susan Eggers 1

EE382A Lecture 3: Superscalar and Out-of-order Processor Basics

HP PA-8000 RISC CPU. A High Performance Out-of-Order Processor

Advanced d Processor Architecture. Computer Systems Laboratory Sungkyunkwan University

CS425 Computer Systems Architecture

OOO Execution and 21264

Case Study IBM PowerPC 620

CS 152 Computer Architecture and Engineering

Advanced Processor Architecture. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University

Checker Processors. Smruti R. Sarangi. Department of Computer Science Indian Institute of Technology New Delhi, India

BOBCAT: AMD S LOW-POWER X86 PROCESSOR

Complex Pipelines and Branch Prediction

CS 152 Computer Architecture and Engineering

EN164: Design of Computing Systems Topic 08: Parallel Processor Design (introduction)

CS377P Programming for Performance Single Thread Performance Out-of-order Superscalar Pipelines

Advanced Processor Architecture

Agenda. What is the Itanium Architecture? Terminology What is the Itanium Architecture? Thomas Siebold Technology Consultant Alpha Systems Division

Handout 2 ILP: Part B

E0-243: Computer Architecture

Big.LITTLE Processing with ARM Cortex -A15 & Cortex-A7

ARM big.little Technology Unleashed An Improved User Experience Delivered

Administrivia. CMSC 411 Computer Systems Architecture Lecture 14 Instruction Level Parallelism (cont.) Control Dependencies

COMPUTER ORGANIZATION AND DESIGN. The Hardware/Software Interface. Chapter 4. The Processor: C Multiple Issue Based on P&H

Chapter 4 The Processor (Part 4)

UG4 Honours project selection: Talk to Vijay or Boris if interested in computer architecture projects

Real Processors. Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University

Hardware-Based Speculation

Elements of CPU performance

Computer Architecture Lecture 12: Out-of-Order Execution (Dynamic Instruction Scheduling)

Adapted from instructor s. Organization and Design, 4th Edition, Patterson & Hennessy, 2008, MK]

XT Node Architecture

CPI < 1? How? What if dynamic branch prediction is wrong? Multiple issue processors: Speculative Tomasulo Processor

CS425 Computer Systems Architecture

15-740/ Computer Architecture Lecture 23: Superscalar Processing (III) Prof. Onur Mutlu Carnegie Mellon University

CS152 Computer Architecture and Engineering March 13, 2008 Out of Order Execution and Branch Prediction Assigned March 13 Problem Set #4 Due March 25

ELE 818 * ADVANCED COMPUTER ARCHITECTURES * MIDTERM TEST *

EECS 322 Computer Architecture Superpipline and the Cache

Lecture 8 Dynamic Branch Prediction, Superscalar and VLIW. Computer Architectures S

4. The Processor Computer Architecture COMP SCI 2GA3 / SFWR ENG 2GA3. Emil Sekerinski, McMaster University, Fall Term 2015/16

Homework 5. Start date: March 24 Due date: 11:59PM on April 10, Monday night. CSCI 402: Computer Architectures

CPI IPC. 1 - One At Best 1 - One At best. Multiple issue processors: VLIW (Very Long Instruction Word) Speculative Tomasulo Processor

Lecture-13 (ROB and Multi-threading) CS422-Spring

Kevin Meehan Stephen Moskal Computer Architecture Winter 2012 Dr. Shaaban

ESE 545 Computer Architecture Instruction-Level Parallelism (ILP): Speculation, Reorder Buffer, Exceptions, Superscalar Processors, VLIW

Dynamic Memory Dependence Predication

The University of Texas at Austin

Keywords and Review Questions

ILP: COMPILER-BASED TECHNIQUES

Please state clearly any assumptions you make in solving the following problems.

Lecture 21: Parallelism ILP to Multicores. Parallel Processing 101

Introduction to the Tegra SoC Family and the ARM Architecture. Kristoffer Robin Stokke, PhD FLIR UAS

Vector Engine Processor of SX-Aurora TSUBASA

Cryptographic Engineering

IBM's POWER5 Micro Processor Design and Methodology

Microarchitecture Overview. Performance

Pentium Pro Case Study ECE/CS 752 Fall 2017

ARM Cortex-A* Series Processors

Master Informatics Eng.

Page 1. Review: Dynamic Branch Prediction. Lecture 18: ILP and Dynamic Execution #3: Examples (Pentium III, Pentium 4, IBM AS/400)

Putting it all Together: Modern Computer Architecture

CS 61C: Great Ideas in Computer Architecture Case Studies: Server and Cellphone microprocessors. Not yet in producvon, the next core awer Ivy Bridge!

COMPUTER ORGANIZATION AND DESI

15-740/ Computer Architecture Lecture 22: Superscalar Processing (II) Prof. Onur Mutlu Carnegie Mellon University

CS 152, Spring 2012 Section 8

Advanced Instruction-Level Parallelism

EN164: Design of Computing Systems Topic 06.b: Superscalar Processor Design

CS 152, Spring 2011 Section 10

Instruction scheduling

Growth outside Cell Phone Applications

Donn Morrison Department of Computer Science. TDT4255 ILP and speculation

The mobile computing evolution. The Griffin architecture. Memory enhancements. Power management. Thermal management

Pipelining and Exploiting Instruction-Level Parallelism (ILP)

Gemini: Sanjiv Kapil. A Power-efficient Chip Multi-Threaded (CMT) UltraSPARC Processor. Gemini Architect Sun Microsystems, Inc.

Static & Dynamic Instruction Scheduling

CPU Project in Western Digital: From Embedded Cores for Flash Controllers to Vision of Datacenter Processors with Open Interfaces

MIPS Pipelining. Computer Organization Architectures for Embedded Computing. Wednesday 8 October 14

Transcription:

HOT CHIPS 2014 NVIDIA S DENVER PROCESSOR Darrell Boggs, CPU Architecture Co-authors: Gary Brown, Bill Rozas, Nathan Tuck, K S Venkatraman

TEGRA K1 with Dual Denver CPUs The First 64-bit Android Kepler-Class Chip 2

One Chip Two Versions TEGRA K1 192-core Kepler-Class Chip Pin Compatible Quad A15 CPUs 32-bit 3-way Superscalar Up to 2.3GHz 32K+32K L1$ Dual Denver CPUs 64-bit 7-way Superscalar Up to 2.5GHz 128K+64K L1$ 3

DENVER VALUE PROPOSITION Highest performance and very energy-efficient ARMv8 processor Greater dynamic sharing with GPU Extended battery life Low latency power-state transition Best web browsing experience Designed to bring PC-class performance to the ARM ecosystem Content creation Gaming Enterprise applications 4

Denver Core BPU IFU L1 Inst Cache 128 K 4 Way DENVER CPU Highest Perf ARMv8 CPU 7-wide superscalar Aggressive HW prefetcher Dynamic Code Optimization Optimize once, use many times OOO execution without the power ARM HW Decoder Scheduler HW Prefetch L1 Data Cache 64 K 4 Way JSR IEU0 IEU1 FP0 FP1 LS0 LS1 5

TEGRA K1 SUPERSCALAR ARCHITECTURE Cortex-A15 Tegra K1-32 Denver Tegra K1-64 Decoder 3 wide Decoder (predecode) 8 wide Scheduler Scheduler Branch Integer Integer Mul FP FP NEON NEON Load Store Branch Mul Integer Integer FP NEON FP NEON Load Store Integer Load Store Integer Branch: 1 Integer: 2 Multiply: 1 Floating Point/Neon: 2 x 64-bit LD/ST: 1 LD and 1 ST Peak IPC 3 Branch: 1 Integer: 2 (+ Mul) + 2 Floating Point/Neon: 2 x 128-bit LD/ST: 2 LD and/or ST Peak IPC 7+ 6

Pipeline Microarchitecture Mispredict Penalty Denver 13 Cycle Penalty EL4 EE5 ES6 EW7 JCC Bypass CMP RF wr Correct Target IP1 IC2 IW3 IN4 IN5 SB1 SB2 EB0 EB1 EA2 ED3 EL4 EE5 ES6 EW7 Misprediction Signal ITLB I$ Rd Way Sel Dec PB Pick Sch RF Rd Bypass Bypass ALU RF wr Tegra K1-32 15 cycle mispredict Lower is better Tegra K1-64 13 cycle mispredict Higher ILP and efficiency 7

CORE CLUSTER RETENTION STATE New power management state: CC4 Allows cache and architectural state retention Allows voltage to be reduced below Vmin to a retention voltage Fast entry and exit latencies enable wider range of use 8

Power penalty if entered for short durations DENVER IDLE POWER IMPROVES WITH RETENTION Leakage Overhead from energy required to flush L2 Efficiency crossover 9

DYNAMIC CODE OPTIMIZATION OPTIMIZE ONCE, USE MANY TIMES Instructions Hardware Decoder Optimizer Execution Units Denver Hardware Optimization m-interrupt 10

DYNAMIC CODE OPTIMIZATION OPTIMIZE ONCE, USE MANY TIMES Instructions Hardware Decoder Execution Units Denver Hardware Dynamic Profile Information Optimizer Optimized mcode Unrolls Loops Renames registers Reorders Loads and Stores Improves control flow Optimization Cache Removes unused computation Hoists redundant computation Sinks uncommonly executed computation Improves scheduling 11

DYNAMIC CODE OPTIMIZATION OPTIMIZE ONCE, USE MANY TIMES Instructions Optimization Lookup Hardware Decoder Optimized code execution from optimization cache Optimized mcode Execution Units Denver Hardware Optimization Cache 12

DYNAMIC CODE OPTIMIZATION OPTIMIZE ONCE, USE MANY TIMES Instructions Chaining Optimization Lookup Hardware Decoder Execution Units Denver Hardware Dynamic Profile Information Optimizer Optimized mcode Optimized mcode Optimized mcode Optimized mcode Optimization Cache 13

Performance Relative to Tegra K1 32 DENVER PERFORMANCE 300% 250% 200% 150% 100% Baytrail (Celeron N2910) Krait-400 (8974-AA) iphone 5s (A7 Cyclone) Haswell (Celeron 2955U) Tegra K1 64 50% 0% DMIPS SPECInt 2K SPECFP 2K AnTuTu 4 Geekbench 3 Single-Core Google Octane v2.0 16MB Memcpy (GB/s) 16MB Memset (GB/s) 16MB Memread (GB/s) 14

DCO: AN EXAMPLE SPECINT CRAFTY EXECUTION Full benchmark run Dynamic Code Optimizer % Exec Types Profile of execution HW Decoder execution Optimized ucode execution 15

DCO: AN EXAMPLE SPECINT CRAFTY EXECUTION Full benchmark run % Exec Types New ARM Code Profile of new ARM instructions 16

DCO: AN EXAMPLE SPECINT CRAFTY EXECUTION Full benchmark run % Exec Types New ARM Code IPC Profile of Instructions Per Cycle 17

DCO: AN EXAMPLE SPECINT CRAFTY EXECUTION Full benchmark run % Exec Types New ARM Code IPC 18

DCO: AN EXAMPLE SPECINT CRAFTY EXECUTION Full benchmark run % Exec Types New ARM Code IPC 19

DCO: AN EXAMPLE SPECINT CRAFTY EXECUTION Full benchmark run % Exec Types New ARM Code IPC 20

DCO: AN EXAMPLE SPECINT CRAFTY EXECUTION Full benchmark run % Exec Types New ARM Code IPC 21

DCO: AN EXAMPLE SPECINT CRAFTY EXECUTION Full benchmark run % Exec Types New ARM Code IPC 22

DCO: AN EXAMPLE SPECINT CRAFTY EXECUTION 3% of benchmark run % Exec Types New ARM Code IPC 23

CONCLUSION Dynamic Code Optimization is the architecture of the future Breaks the out-of-order window physical limitation Opens synergy between HW and SW that current architectures lack Improves efficiency by optimizing once and using many times Delivering PC-class performance to mobile form factors Enables PC-class gaming experience Enables true enterprise applications Enables content creation 24

ACKNOWLEDGMENT We would like to thank the CPU team in NVIDIA for all the creativity, hard work, and dedication to bring this vision to a reality. 25