DSP VLSI Design. Pipelining. Byungin Moon. Yonsei University


Byungin Moon Yonsei University

Outline
- What is pipelining?
- Performance advantage of pipelining
- Pipeline depth
- Interlocking
  - Due to resource contention
  - Due to data dependency
- Branching effects
- Interrupt effects
- Pipeline programming models
  - Time-stationary
  - Data-stationary

Definition
- A technique for increasing the performance of a processor (or other electronic system) by breaking a sequence of operations into smaller pieces and executing these pieces in parallel when possible
- Used in almost all current DSP processors
- Strength
  - Decreases the overall time required to complete the set of operations
- Weakness
  - Complicates programming
    - Execution time of a specific instruction sequence can vary from case to case
    - Certain instruction sequences must be avoided for correct program operation
- Represents a trade-off between efficiency and ease of use

Illustration of How Pipelining Increases Performance on a Hypothetical Processor
- A hypothetical processor, resembling the TI TMS320C3x, uses separate execution units to accomplish the following actions for a single instruction (each stage takes 20 ns to execute)
  - Fetch an instruction word from memory
  - Decode the instruction
  - Read or write a data operand from or to memory
  - Execute the ALU or MAC portion of the instruction
- Nonpipelined
  - The four stages are executed sequentially
  - Execution time of 80 ns per instruction
  - Each stage is idle 75% of the time
- Pipelined
  - The four stages of execution are overlapped
  - Completes a new instruction every 20 ns once the pipeline is full
  - An instruction appears to the programmer to execute in one instruction cycle
  - Instructions appear to execute sequentially
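The 80 ns and 20 ns figures above extend naturally to a whole instruction sequence. The following is a minimal Python sketch, assuming an ideal four-stage pipeline with no stalls; the function names are illustrative and do not come from the slides.

```python
STAGE_NS = 20   # time per stage, from the hypothetical processor
STAGES = 4      # fetch, decode, operand read/write, execute


def nonpipelined_ns(n_instructions):
    """Each instruction runs all four stages before the next one starts."""
    return n_instructions * STAGES * STAGE_NS


def pipelined_ns(n_instructions):
    """The first instruction needs STAGES cycles to fill the pipeline;
    every later instruction completes one cycle after its predecessor.
    Assumes no interlocks, branches, or interrupts."""
    if n_instructions == 0:
        return 0
    return (STAGES + n_instructions - 1) * STAGE_NS


if __name__ == "__main__":
    for n in (1, 4, 100):
        print(n, nonpipelined_ns(n), pipelined_ns(n))
```

For long sequences the ratio of the two times approaches the number of stages, which is why the pipelined processor appears to execute one instruction per 20 ns cycle.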

[Figure: Performance comparison (nonpipelined vs. pipelined)]

Pipeline Depth
- Number of pipeline stages
  - Varies from one processor to another
- A deeper pipeline
  - Allows the processor to execute faster
  - But makes the processor harder to program
- Most processors use three or four stages
  - Three-stage pipeline
    - Instruction fetch, decode, and execute
    - Operand fetch is typically done in the latter part of the decode stage
  - Four-stage pipeline
    - Instruction fetch, decode, operand fetch, and execute
  - Others
    - Analog Devices processors (two stages) and TI TMS320C54x (five stages)

What is Interlocking? Resource Contention
- Pipelined processors may not perform as well as the hypothetical example suggests
  - Mainly due to resource contention (conflict)
- Example
  - Suppose it takes two instruction cycles to write to memory (as on AT&T DSP16xx processors)
  - Instruction I2 attempts to write to memory and I3 needs to read from memory
  - The second cycle of I2's data write phase conflicts with I3's data read
- Solution to resource contention -> interlocking
- Interlocking
  - Delays the progression of the later of the conflicting instructions through the pipeline
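The I2/I3 conflict can be counted with a small model. This is a simplified Python sketch, not a cycle-accurate model of any real processor: it assumes a four-stage pipeline in which instruction number i would normally reach its data-access stage in cycle i + 2, and a single data memory that only one instruction can use at a time. The function name and the program encoding are hypothetical.

```python
def count_interlock_stalls(program):
    """program: list of (name, mem_cycles) in issue order.
    mem_cycles is how many cycles the instruction occupies data memory
    (0 if it does not touch data memory).
    Returns the total number of stall cycles inserted by interlocking."""
    stalls = 0
    mem_free_at = 0  # first cycle in which data memory is free again
    for index, (_name, mem_cycles) in enumerate(program):
        # Nominal data-access cycle, pushed back by all earlier stalls.
        data_cycle = index + 2 + stalls
        if mem_cycles:
            if data_cycle < mem_free_at:
                # Interlock: hold this instruction until memory is free.
                stalls += mem_free_at - data_cycle
                data_cycle = mem_free_at
            mem_free_at = data_cycle + mem_cycles
    return stalls


if __name__ == "__main__":
    # I2 writes for two cycles, I3 reads: one stall cycle.
    print(count_interlock_stalls([("I1", 0), ("I2", 2), ("I3", 1), ("I4", 0)]))
    # Same program, but I3 does not use data memory: no stall.
    print(count_interlock_stalls([("I1", 0), ("I2", 2), ("I3", 0), ("I4", 0)]))
```

The second call shows why interlocks are hard to spot in source code: whether I2 causes a stall depends entirely on what the following instruction does.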

[Figure: Example of pipeline resource contention, and interlocking to resolve the resource contention]

Complicated Programming on Interlocking Pipelined Processors
- There are a number of interlocking sources
  - For example, in processors supporting instructions with long immediate data
    - An instruction with long immediate data requires an additional program memory fetch to get the immediate data
    - This long immediate data fetch conflicts with the fetch of the next instruction, resulting in an interlock
- It is not easy to spot interlocks by reading the program code
  - Whether an instruction interlocks the pipeline depends on the instructions that surround it
  - For example, if instruction I3 in the previous example did not need to read from data memory, there would be no conflict, and no interlock would occur

Data Hazard: Another Interlocking Source
- Example from the Motorola DSP5600x
  - Makes little use of interlocking
  - Uses a three-stage pipeline
    - Fetch
    - Decode: addresses used in data accesses are formed
    - Execute: ALU operation, data accesses, register loads
- Example code (R0 contains the hexadecimal value 5678 before execution)

    MOVE #$1234,R0
    MOVE X:(R0),X0

- Seemingly
  - The above instructions move the value stored at X memory address 1234 into register X0
- Actually
  - The above instructions move the value stored at X memory address 5678 into register X0
  - This is a pipeline hazard resulting from data dependency: the second MOVE forms its address in the decode stage, before the first MOVE has loaded R0 in the execute stage
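The stale-address behaviour can be reproduced with a few lines of Python. This is a sketch of the two-instruction example only, assuming (as the slide states) that the data address is formed in the decode stage and the register load happens in the execute stage; the function name and memory contents are invented for illustration.

```python
def run_two_moves_without_interlock(x_memory, r0_initial):
    """Model MOVE #$1234,R0 followed by MOVE X:(R0),X0 on a
    three-stage pipeline with no interlock on R0."""
    regs = {"R0": r0_initial, "X0": None}

    # Cycle n: the first MOVE is in execute, the second MOVE is in decode.
    # Decode forms the data address from the value R0 holds right now.
    address_formed_in_decode = regs["R0"]
    # Execute stage of the first MOVE loads R0 (too late for the decode above).
    regs["R0"] = 0x1234

    # Cycle n+1: the second MOVE executes with the address formed earlier.
    regs["X0"] = x_memory[address_formed_in_decode]
    return regs


if __name__ == "__main__":
    x_memory = {0x1234: 111, 0x5678: 222}   # hypothetical memory contents
    regs = run_two_moves_without_interlock(x_memory, r0_initial=0x5678)
    print(hex(regs["R0"]), regs["X0"])
```

With these example contents X0 receives the word stored at address 5678, not the one at 1234, even though R0 holds 1234 once both instructions have finished.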

[Figure: A Motorola DSP5600x pipeline hazard]

Data Hazard: Another Interlocking Source
- Interlocking to protect the programmer from the hazard
  - TI TMS320C3x, TMS320C4x, and TMS320C5x processors
  - The TMS320C3x detects writes to any of its address registers and holds the progression through the pipeline of other instructions that use any address register until the write has completed
- Trade-off made by heavily interlocked processors
  - Saves the programmer from worrying about whether certain instruction sequences will produce correct output
  - Allows the programmer to write slower-than-optimal code, possibly without realizing it
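The cost side of this trade-off can be estimated with a one-line formula. The Python sketch below assumes a generic in-order pipeline with stages numbered from 0 (fetch); the stage numbers used in the example are hypothetical and are not taken from any TI data sheet.

```python
def interlock_stall_cycles(write_stage, read_stage, distance):
    """Stall cycles an interlock must insert so that a consumer issued
    `distance` instructions after a producer reads a register (in
    `read_stage`) only after the producer has written it (in `write_stage`).

    The producer occupies stage s in cycle s. Without stalls the consumer
    occupies read_stage in cycle distance + read_stage, and that cycle must
    come strictly after the producer's write cycle."""
    return max(0, write_stage - read_stage - distance + 1)


if __name__ == "__main__":
    # Hypothetical four-stage pipeline: address registers are read in
    # decode (stage 1) and written in execute (stage 3).
    print(interlock_stall_cycles(write_stage=3, read_stage=1, distance=1))
    # Two unrelated instructions placed in between hide the delay.
    print(interlock_stall_cycles(write_stage=3, read_stage=1, distance=3))
```

Both orderings produce correct results on an interlocked processor, but only the second runs at full speed, and nothing in the source code flags the difference.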

[Figure: Interlocking to solve the pipeline hazard (from TI TMS320C3x)]
- The LDI (load immediate) instruction loads a value into an address register
- MPYF (floating-point multiply) uses register-indirect addressing to fetch one of its operands

Branching Effects
- Control dependency from branches
  - When a branch instruction reaches the decode stage in the pipeline and the processor realizes that it must begin executing at a new address, the next sequential instruction word has already been fetched and is in the pipeline
  - Once the processor recognizes a branch instruction, it does not know where the next instruction is located until the branch is resolved
- One solution: the multicycle branch
  - Discard, or flush, the unwanted instruction
  - Cease fetching new instructions until the branch is resolved
  - Results in some wasted cycles
  - Some processors use tricks to execute the branch late in the decode phase, saving one instruction cycle
  - Almost all DSP processors use multicycle branches

Branching Effects
- Alternative to the multicycle branch: the delayed branch
  - Several instructions following the branch are executed normally

    BRD NEW_ADDR
    INST 2        ; INST 2 to INST 4
    INST 3        ; are executed before
    INST 4        ; the branch occurs

  - Instructions that are to execute before the branch takes effect must be located in memory after the branch instruction
  - The branch appears to be delayed in its effect by several instruction cycles
  - Used in the TMS320C3x, TMS320C4x, TMS320C5x, ADSP-2100x, DSP32C, DSP32xx, and ZR3800x
- Trade-off between multicycle and delayed branches
  - Ease of programming versus efficiency, as with interlocking
  - In the worst case, NOP instructions can always be placed after a delayed branch
- Branch effects occur whenever there is a change in program flow
  - Subroutine call instructions, subroutine return instructions, and return-from-interrupt instructions
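The order in which instructions execute around a delayed branch can be traced with a toy interpreter. This Python sketch is illustrative only: instructions are plain strings, only the hypothetical mnemonic `BRD <label>` is interpreted, a branch placed inside a delay slot is not handled, and cycle counts are not modeled.

```python
def execution_trace(program, labels, delay_slots=3, max_steps=100):
    """Return the instructions in the order they execute.
    program: list of instruction strings; labels: name -> index in program.
    delay_slots=0 gives the execution order of an ordinary branch."""
    executed = []
    pc = 0
    slots_left = None   # delay slots remaining before a pending branch is taken
    target = None
    while pc < len(program) and len(executed) < max_steps:
        instruction = program[pc]
        executed.append(instruction)
        pc += 1
        if slots_left is not None:
            slots_left -= 1                 # one delay-slot instruction done
        elif instruction.startswith("BRD "):
            slots_left = delay_slots
            target = labels[instruction.split()[1]]
        if slots_left == 0:
            pc = target                     # the branch finally takes effect
            slots_left = None
    return executed


if __name__ == "__main__":
    program = ["BRD NEW_ADDR", "INST2", "INST3", "INST4", "INST5", "NEW_INST"]
    labels = {"NEW_ADDR": 5}
    print(execution_trace(program, labels, delay_slots=3))
    print(execution_trace(program, labels, delay_slots=0))
```

With three delay slots INST2 to INST4 run before NEW_INST and INST5 is skipped; with zero delay slots control passes straight from the branch to NEW_INST, which is the order a multicycle branch produces at the cost of idle cycles.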

[Figure: Multicycle branch vs. delayed branch]

Interrupt Effects
- Interrupts have effects on the pipeline similar to those of branches
  - An interrupt typically involves a change in the program's flow of control, to branch to the interrupt service routine
  - The pipeline often increases the processor's interrupt response time, much as it slows down branch execution
- When an interrupt occurs
  - Almost all processors allow instructions at the decode stage or further in the pipeline to finish executing, because these instructions may be partially executed
  - What occurs past this point varies from processor to processor

Example from TI TMS320C5x
- One cycle after the interrupt is recognized, the processor inserts an INTR instruction into the pipeline
  - INTR is a special branch instruction that causes the processor to begin execution at the appropriate interrupt vector
  - Causes a four-instruction delay before the first word of the interrupt vector is executed

Normal Interrupts of Motorola DSP5600x
- The DSP5600x does not use an INTR instruction
  - It simply begins fetching from the vector location after the interrupt is recognized
  - At most two words are fetched starting at this address
- One of the two words is a subroutine call
  - The processor flushes the previously fetched instruction and then branches to the long interrupt vector

Fast Interrupts of Motorola DSP5600x
- The same as normal interrupts, except that
  - Neither of the two words starting at the interrupt vector is a subroutine call
  - The processor executes the two words and continues executing from the original program

Pipeline Programming Models
- Two major assembly code formats for pipelined processors
- Time-stationary
  - The processor's instructions specify the actions to be performed by the execution units during a single instruction cycle (example from AT&T DSP16xx)

    a0=a0+p  p=x*y  x=*r0++  y=*pt++

  - Each portion of the instruction operates on separate operands
  - Related to operand-unrelated parallel moves
  - More flexible
- Data-stationary
  - Specifies the operations that are to be performed, but not the exact times at which the actions are to be executed (example from AT&T DSP32xx)

    a1 = a1 + (*r5++ = *r4++) * *r3++

  - Related to operand-related parallel moves
  - Easier to read
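The difference between the two formats shows up when the two example instructions are used to compute the same dot product. The Python sketch below uses simplified semantics that are assumptions rather than vendor-documented behaviour: in the time-stationary instruction every right-hand side reads the value from before the cycle, and the data-stationary instruction is treated as one complete operation per data item. Memory layout and function names are invented for illustration.

```python
def dot_time_stationary(data, coeff):
    """Repeat  a0=a0+p  p=x*y  x=*r0++  y=*pt++  once per cycle.
    A value loaded in one cycle is multiplied in the next and accumulated
    in the one after, so two extra cycles drain the pipeline."""
    n = len(data)
    mem_x = list(data) + [0, 0]     # zero padding read during the drain cycles
    mem_y = list(coeff) + [0, 0]
    s = {"a0": 0, "p": 0, "x": 0, "y": 0, "r0": 0, "pt": 0}
    for _ in range(n + 2):
        old = dict(s)               # all fields read pre-cycle values
        s["a0"] = old["a0"] + old["p"]
        s["p"] = old["x"] * old["y"]
        s["x"] = mem_x[old["r0"]]
        s["y"] = mem_y[old["pt"]]
        s["r0"] = old["r0"] + 1
        s["pt"] = old["pt"] + 1
    return s["a0"]


def dot_data_stationary(data, coeff):
    """Repeat  a1 = a1 + (*r5++ = *r4++) * *r3++  once per data item.
    Each instruction names one item's whole journey: copy it from *r4
    to *r5, multiply it by *r3, and add the product to a1."""
    n = len(data)
    mem = list(data) + list(coeff) + [0] * n   # data | coefficients | copy area
    r4, r3, r5 = 0, n, 2 * n
    a1 = 0
    for _ in range(n):
        value = mem[r4]
        mem[r5] = value
        a1 += value * mem[r3]
        r3, r4, r5 = r3 + 1, r4 + 1, r5 + 1
    return a1, mem[2 * n:]


if __name__ == "__main__":
    print(dot_time_stationary([1, 2, 3], [4, 5, 6]))
    print(dot_data_stationary([1, 2, 3], [4, 5, 6]))
```

In the time-stationary version the programmer has to account for the two-cycle lag between a load and its accumulation; in the data-stationary version that timing is left to the hardware, which is why the slide calls it easier to read.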

Two Basic Control Schemes for Pipelined Data Paths
- Data-stationary
  - Passes the control function code along with the data
  - Allows simple and straightforward design of both the state sequencer and the data path control circuits for each stage
  - Requires more layout area
- Time-stationary
  - Provides the control signals for the entire pipeline from a single source external to the pipeline
  - The central controller governs the entire state of the machine at each time unit
  - More complex design
    - Must remember the current pipe state and provide appropriate control signals for each pipe stage
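The two control schemes can be contrasted in software. The following Python sketch models a made-up three-stage data path: the opcodes `dbl` and `neg` and the per-stage operations are invented purely for illustration. In the data-stationary version each stage register carries its own opcode; in the time-stationary version the stage registers hold data only and a central controller remembers which opcode is in which stage.

```python
def apply(stage, opcode, value):
    """Work done by one stage under a given control code (hypothetical)."""
    if stage == 0:
        return value                                    # operand read
    if stage == 1:
        return value * 2 if opcode == "dbl" else -value
    return value + 1 if opcode == "dbl" else value      # stage 2


def run_data_stationary(stream):
    """Each stage register holds (opcode, value); control travels with data."""
    regs = [None, None, None]
    out = []
    for item in list(stream) + [None] * 3:              # three cycles to drain
        if regs[2] is not None:
            out.append(regs[2][1])
        regs = [item, regs[0], regs[1]]                 # advance the pipeline
        regs = [None if r is None else (r[0], apply(s, r[0], r[1]))
                for s, r in enumerate(regs)]
    return out


def run_time_stationary(stream):
    """Stage registers hold values only; `pipe_state` is the central
    controller's record of which opcode occupies each stage."""
    values = [None, None, None]
    pipe_state = [None, None, None]
    out = []
    for item in list(stream) + [None] * 3:
        if pipe_state[2] is not None:
            out.append(values[2])
        opcode, value = item if item is not None else (None, None)
        values = [value, values[0], values[1]]
        pipe_state = [opcode, pipe_state[0], pipe_state[1]]
        # The controller issues one control signal per stage for this time unit.
        values = [None if c is None else apply(s, c, v)
                  for s, (c, v) in enumerate(zip(pipe_state, values))]
    return out


if __name__ == "__main__":
    stream = [("dbl", 3), ("neg", 5)]
    print(run_data_stationary(stream))
    print(run_time_stationary(stream))
```

Both functions produce the same results; the difference is where the control information lives. The extra opcode field in every stage register corresponds to the additional layout area of the data-stationary scheme, while `pipe_state` corresponds to the pipe state that the time-stationary controller must remember.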

[Figure: Two basic control schemes for pipelined data paths: data-stationary and time-stationary]