C66x CorePac: Achieving High Performance

Agenda

1. CorePac Architecture
2. Single Instruction Multiple Data (SIMD)
3. Memory Access
4. Pipeline Concept

CorePac Architecture

C66x CorePac

[Block diagram: Level 1 Program Memory (L1P, single-cycle cache/RAM) with a 256-bit instruction fetch; DSP core with .M, .S, .L, and .D units and register files A and B (64-bit); Level 1 Data Memory (L1D, single-cycle cache/RAM); Level 2 Memory (L2, program/data cache/RAM); memory controller.]

CorePac includes:
o DSP core
o Two register sets
o Four functional units per register side
o L1P memory (cache/RAM)
o L1D memory (cache/RAM)
o L2 memory (cache/RAM)

C66x DSP Core

[Block diagram: memory; register files A0-A31 and B0-B31; functional units .D1, .D2, .S1, .S2, .M1, .M2, .L1, .L2 (the .M and .L units are labeled MACs); controller/decoder.]

Four functional units per side:
o Multiplier (.M)
o ALU (.L)
o Data (.D)
o Control (.S)

These independent functional units enable efficient execution of parallel specialized instructions:
o Multiplier (.M1 and .M2) and ALU (.L1 and .L2) provide MAC (multiply-accumulate) operations.
o Data (.D) provides data input/output.
o Control (.S) provides control functions (loop, branch, call).

Each DSP core dispatches up to eight parallel instructions each cycle.
All instructions are conditional, which enables efficient pipelining.
The optimized C compiler generates efficient target code.

C66x DSP Core Cross-Path

[Diagram: register file A (A0-A31) and register file B (B0-B31), each feeding its own .D, .S, .M, and .L units, with cross-paths between the two sides.]

Any 64-bit pair of registers from A can be one of the inputs to a B functional unit, and vice versa.

Partial List of .D Instructions

Partial List of .L Instructions

Partial List of .M Instructions

Partial List of .S Instructions

Single Instruction Multiple Data (SIMD)

C66x SIMD Instructions: Examples

o ADDDP: Add two double-precision floating-point values.
o DADD2: 4-way SIMD addition, packed signed 16-bit. Performs four additions of two sets of four 16-bit numbers packed into 64-bit registers. The four results are rounded to four packed 16-bit values. Unit = .L1, .L2, .S1, .S2.
o FMPYDP: Fast double-precision floating-point multiply.
o QMPY32: 4-way SIMD multiply, packed signed 32-bit. Performs four multiplications of two sets of four 32-bit numbers packed into 128-bit registers. The four results are packed 32-bit values. Unit = .M1 or .M2.
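To make the idea of a 4-way packed 16-bit addition concrete, here is a minimal Python sketch that emulates four independent 16-bit lanes inside one 64-bit word. It is an illustration of the packed-lane concept only, not a model of the DADD2 instruction itself: the helper names are hypothetical, and the overflow behavior (each lane wraps modulo 2^16) is an assumption. The exact behavior of the real instruction is defined in the C66x CPU and Instruction Set Reference Guide.

```python
MASK16 = 0xFFFF  # one 16-bit lane


def pack4x16(values):
    """Pack four signed 16-bit integers into one 64-bit word (lane 0 in the low bits)."""
    if len(values) != 4:
        raise ValueError("expected exactly four values")
    word = 0
    for lane, value in enumerate(values):
        if not -0x8000 <= value <= 0x7FFF:
            raise ValueError("value does not fit in a signed 16-bit lane")
        word |= (value & MASK16) << (16 * lane)
    return word


def unpack4x16(word):
    """Split a 64-bit word back into four signed 16-bit integers."""
    values = []
    for lane in range(4):
        raw = (word >> (16 * lane)) & MASK16
        values.append(raw - 0x10000 if raw & 0x8000 else raw)
    return values


def packed_add16(a, b):
    """Add two packed words lane by lane.

    Assumption: each lane wraps on overflow and no carry crosses
    into the neighboring lane.
    """
    result = 0
    for lane in range(4):
        shift = 16 * lane
        lane_sum = ((a >> shift) & MASK16) + ((b >> shift) & MASK16)
        result |= (lane_sum & MASK16) << shift
    return result


if __name__ == "__main__":
    x = pack4x16([1, -2, 300, -400])
    y = pack4x16([10, 20, -30, 40])
    print(unpack4x16(packed_add16(x, y)))
```

The point of the sketch is that one operation on a 64-bit register pair does the work of four scalar additions, which is where the SIMD throughput gain comes from.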

C66x SIMD Instruction: CMATMPY

Many applications use complex matrix arithmetic.

CMATMPY: 2x1 complex vector multiplied by a 2x2 complex matrix.
o Results in a 2x1 signed complex vector.
o All values are 16-bit (16-bit real, 16-bit imaginary).
o Unit = .M1 or .M2.

How many multiplications per second does this provide?
o Each CMATMPY performs 4 complex multiplications (4 real multiplications each), that is, 16 real multiplications per .M unit.
o Two .M units (16 multiplications each) = 32 multiplications per cycle.
o Core cycles per second: 1.25 G.
o Total multiplications per second per core = 40 G multiplications.
o 8 cores = 320 G multiplications per second.

The issue here is: can we feed the functional units data fast enough?
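The peak-rate arithmetic above can be checked with a few lines of Python. This is a sketch using only the figures from the slide (4 complex multiplications per .M unit, 4 real multiplications each, two .M units, a 1.25 GHz clock, 8 cores); the function and key names are illustrative.

```python
def peak_multiply_rates(complex_per_m_unit=4, real_per_complex=4,
                        m_units=2, clock_hz=1_250_000_000, cores=8):
    """Return peak real-multiplication rates implied by the slide's figures."""
    per_m_unit = complex_per_m_unit * real_per_complex  # 16 per .M unit per cycle
    per_cycle = per_m_unit * m_units                     # both .M units together
    per_core = per_cycle * clock_hz                      # multiplications per second, one core
    per_device = per_core * cores                        # all cores together
    return {
        "per_m_unit": per_m_unit,
        "per_cycle": per_cycle,
        "per_core_per_second": per_core,
        "per_device_per_second": per_device,
    }


if __name__ == "__main__":
    for name, value in peak_multiply_rates().items():
        print(f"{name}: {value:,}")
```

These are peak numbers: they assume a CMATMPY issues on both .M units every cycle, which is only possible if operands arrive fast enough.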

Feeding the Functional Units

There are two challenges:

How to provide enough data from memory to the core
o Access to L1 memory is wide (2 x 64 bit) and fast (0 wait states).
o Multiple mechanisms are used to efficiently transfer new data to L1 from L2 and external memory.

How to get values in and out of the functional units
o The hardware pipeline enables execution of instructions every cycle.
o The software pipeline enables efficient instruction scheduling to maximize functional unit throughput.

Memory Access

Internal Buses

[Diagram: program counter (PC) and fetch logic, A and B register files, L1 memories, L2 and external memory, and peripherals, connected by the buses listed below.]

Bus                    Width (bits)
Program address        32
Program data           256
Data address (T1)      32
Data data (T1)         32/64
Data address (T2)      32
Data data (T2)         32/64

Load/store width by device family:
o C62x: dual 32-bit load/store
o C67x: dual 64-bit load / 32-bit store
o C64x, C674x, C66x: dual 64-bit load/store
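The load/store widths above translate directly into bytes moved per cycle. The sketch below tabulates them, reading "dual" as two data paths (T1 and T2) each of the stated width. The 1.25 GHz clock in the usage example is the C66x figure quoted earlier in this presentation; all names are illustrative.

```python
# (load width, store width) in bits per data path, as listed on the slide
LOAD_STORE_BITS = {
    "C62x": (32, 32),
    "C67x": (64, 32),
    "C64x/C674x/C66x": (64, 64),
}
DATA_PATHS = 2  # "dual": the T1 and T2 data paths


def bytes_per_cycle(family):
    """Return (load bytes, store bytes) moved per cycle across both data paths."""
    load_bits, store_bits = LOAD_STORE_BITS[family]
    return (DATA_PATHS * load_bits // 8, DATA_PATHS * store_bits // 8)


def peak_load_bandwidth(family, clock_hz):
    """Peak load bandwidth in bytes per second at the given core clock."""
    load_bytes, _ = bytes_per_cycle(family)
    return load_bytes * clock_hz


if __name__ == "__main__":
    for family in LOAD_STORE_BITS:
        print(family, bytes_per_cycle(family))
    print(peak_load_bandwidth("C64x/C674x/C66x", 1_250_000_000))
```

At 16 bytes per cycle, a C66x core can load eight 16-bit values per cycle at most, which sets an upper bound on how fast the functional units can be fed from L1.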

Pipeline Concept

Non-Pipelined vs. Pipelined CPU

Clock cycle      1   2   3   4   5   6   7   8   9
Non-pipelined    F1  D1  E1  F2  D2  E2  F3  D3  E3
Pipelined        F1  D1  E1
                     F2  D2  E2
                         F3  D3  E3

Stage          Pipeline function
F (Fetch)      Generate program fetch address; read opcode
D (Decode)     Route opcode to functional units; decode instructions
E (Execute)    Execute instructions

The pipeline is full once the fetch, decode, and execute stages are all busy in the same cycle (cycle 3 in the pipelined case).

Now look at the C66x pipeline.
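The cycle counts in this comparison follow a simple rule: without pipelining, each instruction occupies the CPU for every stage in turn; with pipelining, a new instruction starts every cycle once the first one is under way. A minimal sketch (function names are illustrative, and it assumes an ideal pipeline with no stalls):

```python
def non_pipelined_cycles(n_instructions, n_stages=3):
    """Each instruction runs all stages before the next one starts."""
    return n_instructions * n_stages


def pipelined_cycles(n_instructions, n_stages=3):
    """The first instruction takes n_stages cycles; each later one adds one cycle.

    Assumes an ideal pipeline with no stalls.
    """
    if n_instructions == 0:
        return 0
    return n_stages + n_instructions - 1


if __name__ == "__main__":
    for n in (3, 10, 1000):
        print(n, non_pipelined_cycles(n), pipelined_cycles(n))
```

For the three instructions in the diagram this gives 9 cycles versus 5. For long instruction streams the ratio approaches the number of stages.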

Program Fetch Phases

Phase   Description
PG      Generate fetch address
PS      Send address to memory
PW      Wait for data ready
PR      Read opcode

[Diagram: the PG, PS, PW, and PR phases shown between memory and the C66x core functional units.]

Pipeline Phases - Review

Phases: Program Fetch (PG, PS, PW, PR), Decode (D), Execute (E)

PG  PS  PW  PR  D   E
    PG  PS  PW  PR  D   E
        PG  PS  PW  PR  D   E
            PG  PS  PW  PR  D   E
                PG  PS  PW  PR  D   E
                    PG  PS  PW  PR  D   E

Single-cycle performance is not affected by adding three program fetch phases. That is, there is still an execute every cycle.

How about decode? Is it only one cycle?

Decode Phases

Phase   Description
DP      Intelligently routes instruction to functional unit (dispatch)
DC      Instruction decoded at functional unit (decode)

[Diagram: the PG, PS, PW, PR, DP, and DC phases shown between memory and the C66x core functional units.]

Pipeline Phases

Phases: Program Fetch (PG, PS, PW, PR), Decode (DP, DC), Execute (E1)

PG  PS  PW  PR  DP  DC  E1
    PG  PS  PW  PR  DP  DC  E1
        PG  PS  PW  PR  DP  DC  E1
            PG  PS  PW  PR  DP  DC  E1
                PG  PS  PW  PR  DP  DC  E1
                    PG  PS  PW  PR  DP  DC  E1
                        PG  PS  PW  PR  DP  DC  E1

The pipeline is full when the seventh instruction enters PG.

How many cycles does it take to execute an instruction?
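The staircase diagram can be generated programmatically, which makes it easy to see when the pipeline becomes full. The sketch below uses the seven phase names from the slides; everything else (function names, the list-of-rows layout) is illustrative, and it assumes one instruction enters the pipeline per cycle with no stalls.

```python
PHASES = ["PG", "PS", "PW", "PR", "DP", "DC", "E1"]


def pipeline_rows(n_instructions, phases=PHASES):
    """Return one row per instruction; row[c] is the phase occupied in cycle c (0-based).

    Assumes instruction i enters the first phase in cycle i and never stalls.
    """
    total_cycles = len(phases) + n_instructions - 1
    rows = []
    for i in range(n_instructions):
        row = [""] * total_cycles
        for offset, phase in enumerate(phases):
            row[i + offset] = phase
        rows.append(row)
    return rows


def phases_busy_in_cycle(rows, cycle):
    """List the phases that hold an instruction in the given 0-based cycle."""
    return [row[cycle] for row in rows if row[cycle]]


if __name__ == "__main__":
    rows = pipeline_rows(7)
    for row in rows:
        print(" ".join(f"{cell:<2}" for cell in row).rstrip())
    print("busy in cycle 7:", phases_busy_in_cycle(rows, 6))
```

With seven instructions in flight, the seventh cycle is the first in which all seven phases are busy; from then on one instruction reaches E1 every cycle.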

Instruction Delays

All C66x instructions require only one cycle to execute, but some results are delayed.

Description                                         Instruction example   Delay
Single cycle (all instructions except the following)                      0
Integer multiplication and new floating point       MPY, FMPYSP           1
Legacy floating-point multiplication                MPYSP                 2
Load                                                LDW                   4
Branch                                              B                     5
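One way to read the delay column: an instruction that executes in cycle c with a delay of d makes its result usable by an instruction that executes in cycle c + 1 + d. The sketch below applies that reading to a load, multiply, add dependency chain. The MPY, LDW, and B values are taken from the table; treating LDH like LDW and ADD as a zero-delay single-cycle instruction are assumptions made for illustration, and the function names are hypothetical.

```python
# Delay values in cycles. MPY, LDW and B come from the slide's table.
# LDH = 4 (same as LDW) and ADD = 0 are assumptions for this example.
DELAY = {"ADD": 0, "MPY": 1, "LDW": 4, "LDH": 4, "B": 5}


def earliest_use(execute_cycle, mnemonic):
    """Cycle in which a dependent instruction can first use the result."""
    return execute_cycle + 1 + DELAY[mnemonic]


def chain_latency(chain, start_cycle=0):
    """Cycles until the last result of a strictly dependent chain is usable."""
    cycle = start_cycle
    for mnemonic in chain:
        cycle = earliest_use(cycle, mnemonic)
    return cycle - start_cycle


if __name__ == "__main__":
    print(earliest_use(0, "LDW"))
    print(chain_latency(["LDH", "MPY", "ADD"]))
```

Under these assumptions a single load, multiply, add sequence needs 8 cycles from the load to a usable sum, even though each instruction executes in one cycle. Filling those waiting cycles with useful work from other loop iterations is the motivation for software pipelining.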

Software Pipeline Example

Dot product: a typical DSP MAC operation.

    LDH   LDH
    MPY
    ADD

How many cycles would it take to perform this loop five times? (Disregard delay slots.)

The two loads issue in the same cycle, so one iteration takes 3 cycles: 5 x 3 = 15 cycles.

Non-Pipelined Code

Cycle   .D1   .D2   .M1   .M2   .L1   .L2   .S1   .S2
1       ldh   ldh
2                   mpy
3                               add
4       ldh   ldh
5                   mpy
6                               add
7       ldh   ldh
8                   mpy
9                               add

Pipelined Code

Cycle   .D1   .D2   .M1   .M2   .L1   .L2   .S1   .S2
1       ldh   ldh
2       ldh   ldh   mpy
3       ldh   ldh   mpy         add
4       ldh   ldh   mpy         add
5       ldh   ldh   mpy         add
6                   mpy         add
7                               add

Pipelining these instructions took about 1/2 the cycles: 7 cycles for five iterations instead of 15.
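The pipelined schedule above can be reproduced with a short sketch: iteration i starts in cycle i, and its stages (load pair, multiply, add) run in consecutive cycles. As in the slides, delay slots are disregarded. The stage labels and function names are illustrative.

```python
STAGES = ["ldh ldh", "mpy", "add"]  # one loop iteration, one stage per cycle


def software_pipeline(n_iterations, stages=STAGES):
    """Return, for each cycle, the list of stage operations issued in that cycle.

    Iteration i runs stage s in cycle i + s (0-based). Delay slots are ignored.
    """
    total_cycles = n_iterations + len(stages) - 1 if n_iterations else 0
    schedule = []
    for cycle in range(total_cycles):
        ops = [stage for s, stage in enumerate(stages)
               if 0 <= cycle - s < n_iterations]
        schedule.append(ops)
    return schedule


def sequential_cycles(n_iterations, stages=STAGES):
    """Cycle count when each iteration finishes before the next one starts."""
    return n_iterations * len(stages)


if __name__ == "__main__":
    schedule = software_pipeline(5)
    for cycle, ops in enumerate(schedule, start=1):
        print(cycle, " | ".join(ops))
    print(len(schedule), "cycles pipelined vs", sequential_cycles(5), "sequential")
```

Cycles 1 and 2 form the prolog, cycles 3 to 5 the steady-state kernel in which the load, multiply, and add units are all busy, and cycles 6 and 7 the epilog.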

Software Pipeline Support

o The compiler is smart enough to schedule instructions efficiently.
o DSP algorithms are typically loop intensive.
o Generally speaking, servicing of interrupts is not allowed in the middle of the loop because fixed timing is essential.
o The C66x hardware SPLOOP enables servicing of interrupts in the middle of loops.

NOTE: For more information on SPLOOP, refer to Chapter 8 of the C66x CPU and Instruction Set Reference Guide.

For More Information

o For more information, refer to the C66x CPU and Instruction Set Reference Guide.
o For questions regarding topics covered in this training, visit the support forums at the TI E2E Community website.