
Institutionen för systemteknik
Department of Electrical Engineering

Master's thesis (Examensarbete)

Design and Implementation of a DMA Controller for Digital Signal Processor

Master's thesis carried out in Computer Engineering at the Institute of Technology, Linköping University (Tekniska högskolan i Linköping)
by Guoyou Jiang

LiTH-ISY-EX--10/4244--SE
Linköping, 12 August 2010

Supervisor: Dake Liu, ISY, Linköpings universitet
Examiner: Dake Liu, ISY, Linköpings universitet

Department of Electrical Engineering, Linköpings universitet, SE-581 83 Linköping, Sweden

Division, Department: Division of Computer Engineering, Department of Electrical Engineering, Linköpings universitet, SE-581 83 Linköping, Sweden
Date: 2010-08-12
Language: English
Report category: Master's thesis (Examensarbete)
ISRN: LiTH-ISY-EX--10/4244--SE
URL for electronic version: http://www.da.isy.liu.se, http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-58868
Title: Design and Implementation of a DMA Controller for Digital Signal Processor
Author: Guoyou Jiang
Keywords: direct memory access, DMA, digital signal processing, DSP, linking table, processor, peripherals, scalability, testbench, verification

Abstract

The thesis work was conducted in the Division of Computer Engineering at the Department of Electrical Engineering, Linköping University. During the thesis work, a configurable Direct Memory Access (DMA) controller was designed and implemented. The DMA controller runs at 200 MHz in 65 nm digital CMOS technology. The estimated gate count is 26595. The DMA controller has two address generators and can provide two clock sources. It can thus handle data reads and writes simultaneously. The DMA controller has 16 built-in channels, and the data width can be 16, 32 or 64 bits. The DMA controller supports 2D data access through configuration of its intelligent linking table. The DMA is designed for advanced DSP applications; it is not dedicated to cache, which has a fixed priority.

Acknowledgments

This thesis is the result of master's thesis work carried out from spring 2009 to spring 2010 at Linköping University. First of all, I would like to thank my supervisor and examiner, Professor Dake Liu, who gave me the great opportunity to do this final-year project. The thesis could not have been completed without his experience and support.

Second, I would like to express my gratitude to the Ph.D. students in the Division of Computer Engineering. Their experience in digital signal processor design helped me a lot: Jian Wang, who helped me with some key issues in the design of the behavioral model; Di Wu, who introduced me to this topic; and Olof Kraigher, who helped me solve some programming problems in the C++ model. I also want to thank He Zhang, who discussed some example applications of the design with me; Thomas Österholm, who helped me integrate my design into the complete DSP system; and Andreas Ehliar and Johan Eilert, who gave me a lot of advice while I implemented my design as an ASIC.

Last but not least, I want to express my appreciation to my parents in my hometown, Shanghai. Their love and support have been unlimited throughout my entire academic career far away from home.

Contents

1 Introduction
  1.1 Scope
  1.2 Method
  1.3 Thesis Overview
  1.4 Notations
  1.5 Abbreviations
2 Background
  2.1 DMA Basics
  2.2 DMA Operations
    2.2.1 Normal DMA Operation
    2.2.2 Chain Operation
    2.2.3 Linking Table Operation
3 Application Requirements
  3.1 Application Analysis
  3.2 Requirement Specification
4 Interfaces
  4.1 Host Interface
    4.1.1 Main Status Register
    4.1.2 Main Control Register
    4.1.3 Special Memory Control Register
  4.2 Memory Interface
  4.3 Behavior model of I/O
  4.4 Task Packet Specification
5 DMA Hardware
  5.1 Host Interface
    5.1.1 Block Diagram
    5.1.2 Interface
  5.2 Source Address Generator
    5.2.1 Block Diagram
    5.2.2 Interface
  5.3 Destination Address Generator
    5.3.1 Block Diagram
    5.3.2 Interface
  5.4 Source Decoder
    5.4.1 Block Diagram
    5.4.2 Interface
  5.5 Destination Decoder
    5.5.1 Block Diagram
    5.5.2 Interface
  5.6 Transaction FSM
    5.6.1 Interface
6 Integration
  6.1 Hardware Integration
  6.2 Software Integration
  6.3 DMA Programming
    6.3.1 Initialize the DMA Controller
    6.3.2 Poll the DMA Controller
    6.3.3 Handle the DMA Interrupt
7 Verification
  7.1 Functional Verification
  7.2 Hardware Implementation
8 Conclusion
  8.1 Achieved Results
    8.1.1 DMA Benchmark
    8.1.2 Comparison
    8.1.3 Conclusion
  8.2 Future Work
Bibliography
A DMA Simulator C++ Header
B DMA Simulator C++ Code

List of Figures

1.1 DIT butterfly of Radix-2 FFT
2.1 System overview
2.2 Basic DMA operation to save processor run time.
2.3 DMA Chain operation example.
2.4 An example of DMA linking table operation.
3.1 Matrix Transposition
3.2 Transfer decomposition of Example 3.1
3.3 Transfer decomposition of Example 3.2
3.4 Neighbor Searching in Motion Estimation
3.5 Transfer decomposition of Example 3.3
4.1 DMA configuration
5.1 DMA Hardware architecture
5.2 DMA Controller Block Diagram
5.3 Block diagram of Host Interface Module
5.4 Block diagram of Source address generator
5.5 Block diagram of Destination address generator
5.6 Block diagram of Source decoder
5.7 Block diagram of Destination decoder
5.8 Finite State Machine of the control logic
7.1 DMA Functional Verification Flow
8.1 Timing diagram of basic DMA operation
8.2 Timing diagram of linking table operation

List of Tables

3.1 Preparing DMA for Motion Estimation
3.2 Requirement Specification
4.1 Host Interface
4.2 DMA Registers specification
4.3 Main status register specification
4.4 Main control register specification
4.5 Special memory control register specification
4.6 Memory Interface
4.7 Task packet specification
4.8 Control Vector 1
4.9 Control Vector 2
4.10 Control Vector 3
4.11 Control Vector 4 & 5
4.12 Control Vector 6 & 7 & 8
5.1 Interface of Host Interface Module
5.2 Interface of Source address generator
5.3 Interface of Destination address generator
5.4 Interface of Source decoder
5.5 Interface of Destination decoder
5.6 Interface of Transaction FSM
8.1 Synthesis Result of DMA controller
8.2 Results Comparison with and without DMA

Chapter 1
Introduction

As technology evolves, many new DSP applications are emerging. The demand for rich multimedia content such as HDTV or 3D display is huge. Behind these demands there are always technologies that drive the need for a better experience of electronic products. One of them is digital signal processing. DSP techniques have improved traditional signal processing applications such as audio, visual, radar, and communications [9, p.1].

The component that performs digital signal processing is called a digital signal processor. A specially designed peripheral can help the processor with accessing memories; that peripheral is called a DMA controller. With the help of DMA or a DMA controller, the processor can carry out more computation-related tasks while the data transfer is in progress. Since most memory accesses are hidden inside the DSP algorithms, it is important to reveal them [6]. A DMA controller is a great help in terms of both power consumption and performance.

For example, a DIT butterfly, which is the basis of the FFT algorithm, is shown in Figure 1.1 and can be divided into the following steps:

1. Load two complex operands;
2. Load one complex coefficient and perform one complex multiplication;
3. Perform two complex additions;
4. Store two complex results.

This is a simple example of the memory accesses hidden in basic DSP algorithms; a more detailed discussion is presented in Chapter 3.
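The four butterfly steps can be made concrete by counting memory accesses next to the arithmetic. The following Python sketch is only an illustration and is not taken from the thesis; the function name `dit_butterfly`, the `counter` dictionary and the flat `memory`/`coeffs` lists are assumptions made for this example.

```python
def dit_butterfly(memory, coeffs, i, j, k, counter):
    """One radix-2 DIT butterfly on memory[i] and memory[j].

    All names are hypothetical; `counter` only tallies memory accesses
    so that they can be compared with the arithmetic work.
    """
    # Step 1: load two complex operands.
    a = memory[i]
    b = memory[j]
    counter["loads"] += 2
    # Step 2: load one complex coefficient and multiply.
    w = coeffs[k]
    counter["loads"] += 1
    t = w * b
    # Step 3: two complex additions (one of them with a negated term).
    upper = a + t
    lower = a - t
    # Step 4: store two complex results in place.
    memory[i] = upper
    memory[j] = lower
    counter["stores"] += 2


memory = [1 + 0j, 1 + 0j]
counter = {"loads": 0, "stores": 0}
dit_butterfly(memory, [1 + 0j], 0, 1, 0, counter)
```

In this model one butterfly performs one multiplication and two additions but needs three loads and two stores, which illustrates why memory traffic deserves attention of its own.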

Figure 1.1. DIT butterfly of Radix-2 FFT

1.1 Scope

The scope of this thesis work is to design and implement a DMA peripheral for Senior, a DSP processor developed at the Division of Computer Engineering at Linköping University. The interface between the DMA controller and the DSP core had already been completed in another project [7, p.53]. The design work started from the definition of the DMA specification. For many DSP applications it is desirable to use a technique called a linking table to accelerate the processing of two-dimensional arrays [6, p.584]. The linking table is therefore supported in the current DMA design. To make sure the design is correct, a test bench was also developed to verify the functionality of the designed modules. Since the DMA has to work with the Senior DSP, the test bench was written on the basis of the Senior test bench.

1.2 Method

To design the DMA module, the specification should be defined according to the requirements of the applications. Since the DMA is designed to meet the needs of the Senior DSP, a behavioral model of the DMA module should also be added to the existing Senior instruction set simulator. Developing the behavioral model is important because it can be used not only to obtain performance benchmarks of the hardware, but also as a reference against which the actual hardware is verified. Once the behavioral model is done, the RTL implementation translates it into RTL code in a language such as Verilog. After the RTL implementation is complete, the behavioral model is used as a golden reference to verify the RTL module. If both produce the same result, the RTL implementation is believed to be correct.

1.3 Thesis Overview

Chapter 1 gives a brief introduction to what this thesis is about. Chapter 2 discusses basic background knowledge and the operations of DMA. Chapter 3 first analyzes some applications and then derives the requirement specification from that analysis. Since the designed DMA controller has to work together with the host Senior DSP, Chapter 4 describes the interfaces and registers of the DMA controller along with the DMA task, so that the user of Senior gets an idea of how the DMA works with the Senior DSP. Chapter 5 describes the detailed hardware architecture of the designed DMA controller, including the microarchitecture of each block. Once the DMA controller hardware is complete, it needs to be integrated into the Senior system; Chapter 6 discusses this integration from both the hardware and the software perspective, and also discusses the DMA controller behavioral model. Chapter 7 discusses the verification of the implemented hardware. Chapter 8 summarizes the results obtained, together with the conclusions and future work.

1.4 Notations

To make the thesis easier to follow, the reader should keep the following notations in mind. A $ or 0x before a number means that the number is hexadecimal. A number without any prefix is decimal. For example, "0x64" means decimal value 100, while "64" means decimal value 64. When specific bits of a word are discussed, Verilog syntax is used as far as possible. Three zeros followed by three ones is written as 6'b000111, where 6 denotes the total number of bits and b indicates a binary value. status[10:5] means bits 10 to 5 of the register status, and bit 3 alone is written as status[3].
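For readers more used to software than to Verilog, the bit-range notation can be mirrored with shifts and masks. The sketch below is illustrative only; the helper name `bits` and the example register value are assumptions, not part of the thesis.

```python
def bits(value, msb, lsb):
    """Return value[msb:lsb] in the Verilog sense (both bounds inclusive).

    Hypothetical helper used only to illustrate the notation.
    """
    width = msb - lsb + 1
    return (value >> lsb) & ((1 << width) - 1)


# The Verilog literal 6'b000111 is a 6-bit value equal to decimal 7.
six_bit_literal = 0b000111

# An example register value (an assumption) with bits 10 to 5 set.
status = 0x7E0

field = bits(status, 10, 5)   # status[10:5]
single = bits(status, 3, 3)   # status[3]
```

Here `status[10:5]` evaluates to 63 (all six bits set) and `status[3]` to 0, while `0x64` and `100` denote the same value.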

1.5 Abbreviations

3D    3 Dimensional
AGU   Address Generation Unit
ASIC  Application Specific Integrated Circuit
ASIP  Application Specific Instruction set Processor
DCT   Discrete Cosine Transform
DDR   Double Data Rate
DIT   Decimation In Time
DM    Data Memory
DMA   Direct Memory Access
DRAM  Dynamic Random Access Memory
DSP   Digital Signal Processor
FFT   Fast Fourier Transform
FIFO  First In First Out
FPGA  Field Programmable Gate Array
FSM   Finite State Machine
GIO   General I/O
HDTV  High Definition Television
I/O   Input/Output
ISR   Interrupt Service Routine
IP    Intellectual Property
JPEG  Joint Photographic Experts Group
LSB   Least Significant Bit
MB    Macro Block
MP3   MPEG-1 Layer 3
MSB   Most Significant Bit
MUX   Multiplexer
PC    Program Counter
PM    Program Memory
RTL   Register Transfer Level
SDR   Single Data Rate

Chapter 2
Background

With the help of pipelining, the processor core can execute one operation per cycle, including calculation, data load and data store. In reality, however, it is not possible to achieve optimal performance in the application if the processor core has to do the data transfer itself [4, p.75]. This is where the DMA controller can be used to relieve the core of data movement.

2.1 DMA Basics

DMA stands for Direct Memory Access. It is a technique for transferring data blocks between memories directly, without using the processor for data access [6, p.535] [5]. Since the DSP is designed for computationally intensive work, in most cases a separate peripheral should access the processor memories on behalf of the processor core. While the peripheral performs memory transactions, the processor can carry out other operations that do not depend on those transfers. A DMA module, or DMA controller, is by definition a peripheral module of a processor core for direct memory access.

The basic work flow of a DMA transaction is as follows. The core or another data unit prepares and sends a DMA request to the DMA controller when it wants to transfer a large amount of data. The DMA controller prepares and transfers the data while the core does other work. The core may poll the status of the DMA controller to see whether the transfer is complete, or the DMA controller sends an interrupt to the core or the other data unit when the transaction is finished. The processor core can then decide whether to continue processing the data.

A DMA subsystem can consist of a processor core, a DMA module, and several memory modules connected to both the processor core and the DMA module. The DMA module provides DMA transfers between two memory interfaces. Transfers can also be performed between memories and high-speed I/Os. Figure 2.1 shows a typical DSP subsystem containing the DMA module.

Figure 2.1. System overview

In this DSP subsystem, the DSP core acts as the system master, and the DMA module is the slave of the DSP core. On the other hand, the DMA module is the master of its connected memory modules, high-speed I/Os, and so on. Both the DSP core and the DMA module can access the memory modules, but care must be taken, since a memory cannot be accessed by both at the same time. From the DMA controller's point of view, the master DSP core configures the data format of the transaction and requests the DMA to perform the data transfer. The configuration is called a DMA channel; it consists of the task priority, the source port and destination port of the transfer, the start addresses of both ports, the data packet size, and so on.

2.2 DMA Operations

A DMA controller should usually support more than one operation, since different DSP algorithms lead to quite a lot of different access patterns. This section illustrates several transfer options and their operations.

2.2.1 Normal DMA Operation

This is a simple DMA operation that performs a block copy. The DMA copies a block from one location to another, either on the same interface or on different interfaces. The external software running on the processor core is responsible for limiting the access time. Figure 2.2 shows the basic DMA operation performed by the DMA controller.

Figure 2.2. Basic DMA operation to save processor run time.

As Figure 2.2 shows, the processor core is responsible for the DMA transaction. Once the processor needs data, it prepares a DMA request that specifies some basic parameters of the transfer. The processor then sends the request through the general I/O to the DMA controller. The DMA controller transfers the corresponding data from memory location 1 to memory location 2 based on the request sent by the processor. When the transfer is finished, either the processor checks the status register of the DMA controller or an interrupt is sent to the processor. When the processor learns that the transfer is done, it can use the data provided by the DMA. Thus, while the DMA performs the data transfer, the processor can do other things rather than transferring the data itself, and run time is saved.

2.2.2 Chain Operation

In this operation, a contiguous set of elements can be transferred when a synchronous event occurs [1] [8]. The DMA controller is used to transfer a chain of data elements that are spaced at an equal distance from each other. Once the DMA controller gets the task, it sets up the proper parameters and transfers each element in the chain. Figure 2.3 illustrates this operation. As Figure 2.3 shows, the data elements are separated by a fixed stride. After transferring the first data element, the DMA can transfer the next element as if the data elements were chained together. This operation saves the extra time that would otherwise be spent on channel configuration.
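The address sequence of a chain operation follows directly from a start address, a fixed stride and an element count. The sketch below is a minimal software model under that assumption; the name `chain_addresses` and the example values are hypothetical and do not describe the actual register layout of the DMA controller.

```python
def chain_addresses(start, stride, count):
    """Yield the address of each element of a chain transfer.

    One configuration (start, stride, count) covers the whole chain,
    so no per-element reconfiguration is needed.
    """
    for n in range(count):
        yield start + n * stride


# Four elements, eight words apart, starting at address 0x100.
addresses = list(chain_addresses(0x100, 8, 4))
```

With these example values the generated addresses are 0x100, 0x108, 0x110 and 0x118.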

Figure 2.3. DMA Chain operation example.

2.2.3 Linking Table Operation

In this operation, multiple data blocks are merged into one large data block of a single DMA transaction. Some DSP algorithms require data blocks at different locations in the main memory; with the help of a linking table, multiple data blocks can be loaded sequentially by one DMA transaction. For example, in a video CODEC application it is often desired to compare data from different reference frames [6]. A linking table concatenates several data blocks into one DMA transaction. Figure 2.4 gives an example of a linking table.

Figure 2.4. An example of DMA linking table operation.

The first data block starts at physical address 0x2000, and its length is 256 data words. Once the first data block has been loaded, loading of the second data block (block number 2) follows at once. As Figure 2.4 shows, its start address is 0x4000 and its length is 128. After data block 2 has been loaded, loading of data block 3 is activated immediately. The start address of data block 3 is 0x8000 and its length is 512. When link=0 is reached at the end of data block 3, the DMA transaction is finished. Using the linking table, the transfers of three non-contiguous data blocks are merged into one single DMA transaction.

Linking table operation is in fact a more flexible form of chain operation. Since the distance between data elements is not fixed, another parameter is needed to determine the length of each data element. Table 4.7 gives a detailed configuration of the linking table.
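The example of Figure 2.4 can be modeled in a few lines of software. The sketch below assumes that each table entry holds a start address, a length and the number of the next entry, with link 0 ending the transaction; this dictionary layout and the name `run_linking_table` are illustrative assumptions and not the task packet format of Table 4.7.

```python
def run_linking_table(memory, table, first):
    """Gather all linked blocks into one contiguous result.

    `table` maps a block number to (start, length, next_link);
    a next_link of 0 ends the transaction (hypothetical layout).
    """
    gathered = []
    entry = first
    while entry != 0:
        start, length, next_link = table[entry]
        gathered.extend(memory[start:start + length])
        entry = next_link
    return gathered


# The three blocks of Figure 2.4.
table = {
    1: (0x2000, 256, 2),
    2: (0x4000, 128, 3),
    3: (0x8000, 512, 0),
}

# Dummy main memory in which every word holds its own address.
memory = list(range(0x8200))
result = run_linking_table(memory, table, first=1)
```

The result holds 256 + 128 + 512 = 896 words: block 1 first, followed directly by block 2 and block 3, all obtained from a single transaction.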

Chapter 3
Application Requirements

In this chapter, several application examples are described and analyzed, and the requirement specification is then proposed based on the analysis of these examples.

3.1 Application Analysis

First of all, let us take several application examples into consideration.

Example 3.1: Matrix Transposition

Suppose we want to transpose a matrix.

Original matrix:
0x0 0x1 0x2 0x3
0x4 0x5 0x6 0x7
0x8 0x9 0xA 0xB
0xC 0xD 0xE 0xF

Memory layout:
Address  Data
0        0x0
1        0x1
2        0x2
3        0x3
4        0x4
5        0x5
...      ...
14       0xE
15       0xF

Transposed matrix:
0x0 0x4 0x8 0xC
0x1 0x5 0x9 0xD
0x2 0x6 0xA 0xE
0x3 0x7 0xB 0xF

Figure 3.1. Matrix Transposition

The matrix may be stored in memory consecutively, as shown in Figure 3.1. To transpose the matrix, we can simply move the data from the original address to the desired position. The transfer can thus be expressed with the chain operation discussed in Section 2.2.

Figure 3.2. Transfer decomposition of Example 3.1

The data transfer is represented in Figure 3.2; the whole transfer can be split into four chained transfers. In this example, the source address is discrete with a stride of 4 data words, while the destination address is continuous. The example is simple only because the matrix is small. In more complicated applications the matrix could be very large, but the basic principle still holds.

Example 3.2: Create a large Matrix

Suppose we want to create a large matrix with 4096 elements, where every element has the same value, 0 or 1. This case is quite common in matrix manipulation in both communication algorithms and video processing algorithms. Such a matrix can be created by writing zeros or ones to a series of consecutive addresses. Doing so on the core, however, wastes quite a lot of precious core cycles, which keeps the core from doing more useful tasks. In this case, we can simply use the DMA controller to create the matrix. Suppose the matrix should be created in DM1. First the core writes one element of the matrix into DM0, then the DMA controller transfers the same content to DM1. As Figure 3.3 shows, the transfer is quite simple. The source address is fixed, while the destination address is continuous. The number of data words to be transferred equals the size of the matrix.
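Examples 3.1 and 3.2 differ only in the strides applied on the source and destination sides. The following sketch models both with one function; `dma_transfer` and its parameters are hypothetical names for this illustration, and plain Python lists stand in for the memories DM0 and DM1.

```python
def dma_transfer(src, dst, src_start, src_stride, dst_start, dst_stride, count):
    """Copy `count` words using independent source and destination strides."""
    for n in range(count):
        dst[dst_start + n * dst_stride] = src[src_start + n * src_stride]


# Example 3.1: a 4 x 4 matrix stored row by row, transposed by four
# chained transfers (source stride 4, destination stride 1).
matrix = list(range(16))
transposed = [None] * 16
for column in range(4):
    dma_transfer(matrix, transposed, column, 4, 4 * column, 1, 4)

# Example 3.2: the core writes a single element, and the transfer
# replicates it 4096 times (source stride 0, destination stride 1).
dm0 = [1]
dm1 = [None] * 4096
dma_transfer(dm0, dm1, 0, 0, 0, 1, 4096)
```

A source stride of 0 keeps the source address fixed, which is what turns a copy into a fill.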

Figure 3.3. Transfer decomposition of Example 3.2

Let us look at a more complicated and realistic example based on motion estimation algorithms [6, p.585].

Example 3.3: Motion Estimation

In the motion estimation algorithm, each macro block (usually 16 × 16 = 256 pixels) in the current frame is compared by searching the neighboring area of the reference frame.

Figure 3.4. Neighbor Searching in Motion Estimation

Suppose we divide the picture into 8 × 8 = 64 macro blocks, each containing 256 pixels. We want to estimate the motion vector of macro block 27 in the current frame. Based on the algorithm, we need to search the neighboring macro blocks in the reference frame. Macro blocks 18, 19, 20, 26, 28, 34, 35 and 36 in the reference frame are going to be compared. Usually, the data memory of the processor core is not large enough to hold the whole picture, so the desired data has to be transferred from main memory to the data memory of the processor core. The processor can then run the algorithm on the data.
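The block numbers of Figure 3.4 turn into memory addresses once a frame layout is fixed. The sketch below assumes the layout used in Table 3.1: every macro block occupies 256 consecutive words, blocks are numbered from 1, the current frame starts at word address 32768, and the reference frame follows it directly. The names `block_address` and `build_links` are hypothetical.

```python
BLOCK_WORDS = 16 * 16            # words per macro block
BLOCKS_PER_FRAME = 8 * 8         # macro blocks per frame
CURRENT_BASE = 32768             # assumed start of the current frame
REFERENCE_BASE = CURRENT_BASE + BLOCKS_PER_FRAME * BLOCK_WORDS


def block_address(frame_base, block_number):
    """Start address of a macro block (blocks are numbered from 1)."""
    return frame_base + (block_number - 1) * BLOCK_WORDS


def build_links(frame_base, block_numbers):
    """Merge adjacent blocks into (start address, length) links."""
    links = []
    for number in sorted(block_numbers):
        start = block_address(frame_base, number)
        if links and links[-1][0] + links[-1][1] == start:
            # The block continues the previous link, so extend it.
            links[-1] = (links[-1][0], links[-1][1] + BLOCK_WORDS)
        else:
            links.append((start, BLOCK_WORDS))
    return links


links = build_links(CURRENT_BASE, [27])
links += build_links(REFERENCE_BASE, [18, 19, 20, 26, 28, 34, 35, 36])
```

Under these assumptions the nine macro blocks collapse into five links, because blocks 18 to 20 and blocks 34 to 36 are adjacent in memory.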

Let's say the segment address of the current frame in the main memory is 32768 and the address of the reference frame is 32768 + (8 × 8) × (16 × 16) = 49152. Thus, we can specify the data blocks to be transferred as in Table 3.1.

Specification              Value                      Comment
DMA task ID                1                          The identification of the transaction
Task priority              1                          The priority of the transaction
Number of links            5
Source port                Main memory
Destination port           DM0
Destination start address  0
Link 1 start address       32768 + 26 × 256 = 39424   Block 27 in current frame
Link 1 length              256
Link 2 start address       49152 + 17 × 256 = 53504   Block 18 in reference frame
Link 2 length              768                        3 blocks in row
Link 3 start address       49152 + 25 × 256 = 55552   Block 26 in reference frame
Link 3 length              256
Link 4 start address       49152 + 27 × 256 = 56064   Block 28 in reference frame
Link 4 length              256
Link 5 start address       49152 + 33 × 256 = 57600   Block 34 in reference frame
Link 5 length              768                        3 blocks in row

Table 3.1. Preparing DMA for Motion Estimation

Based on the data block specification in Table 3.1, we can draw the transfer decomposition shown in Figure 3.5.

Figure 3.5. Transfer decomposition of Example 3.3

3.2 Requirement Specification

As described in Chapter 2, the DSP core is responsible for configuring the DMA controller, so we need to specify the parameters of the memory transfer. As configured by the DSP core, the DMA controller connects a source port and a destination port. Here, a port is either a data source supplying data or a data sink consuming data. In most cases, a port is a memory location or a data buffer. A DMA data transaction moves data from the source port to the destination port as configured by the DMA task from the master DSP core. In order to design the DMA controller, we need to specify the following parameters of the memory transaction.

Number of ports supported by the DMA controller: This specifies the number of channels that can be connected by the DMA controller.

Address Generator Unit (AGU): The AGU provides the addresses required for memory access. At least two AGUs are needed, one to provide the source address and the other to provide the destination address of a data transfer.

Data Width: Since the DMA controller should support different memory modules, the width of the data path should be configurable. We need to specify the data widths supported by the DMA module.

Memory Organization: There are two different ways to store words in a byte-addressed memory. Storing the least significant byte at the lowest address is called little endian, while storing the least significant byte at the highest address is called big endian [3]. There is no specific reason to choose one way over the other, but we still need to specify the formats supported during the data transfer.

Linking Table support: As described in Chapter 2, the linking table saves the extra cost of configuring several separate data blocks by concatenating them into one transaction. On the other hand, it costs extra hardware to keep track of several different data blocks [2]. Thus we need to specify the length of the linking table.

Table 3.2 shows the requirement specification of the DMA controller to be designed for the Senior system.

No.  Description
1    16 source ports: 8 on-chip memory, 1 off-chip memory, 1 high-speed I/O, others reserved.
2    16 destination ports: 8 on-chip memory, 1 off-chip memory, 1 high-speed I/O, others reserved.
3    Address Generator Unit (AGU): 1 for the source port, 1 for the destination port; each has a 32-bit address space.
4    Clock generator: supplies the clock signal for memory (I/O). Source:Destination ratios: 1:1, 1:1/2, 1:1/4, 1:1/8, 1:1/16, 1:1/32, 1:1/64; 1:2, 1:4, 1:8, 1:16, 1:32, 1:64.
5    Data width. Source port: 8 bits, 16 bits, 32 bits (64 bits not implemented in Senior). Destination port: 8 bits, 16 bits, 32 bits (64 bits not implemented in Senior).
6    Memory organization: the DMA controller should support both big endian and little endian data.
7    Linking table supported; the maximum length of the linking table is 64.

Table 3.2. Requirement Specification
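Requirement 6 asks for both byte orders. As a minimal sketch of what that means for a single 32-bit word, the following C++ functions lay a word out as four bytes at increasing addresses and read it back. The names Endian, store_word and load_word are assumptions made for this illustration and do not come from the Senior design.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

enum class Endian { Little, Big };

// Lays out a 32-bit word as four bytes at increasing byte addresses.
// Little endian puts the least significant byte at the lowest address,
// big endian puts it at the highest address.
std::array<std::uint8_t, 4> store_word(std::uint32_t word, Endian order) {
    std::array<std::uint8_t, 4> bytes{};
    for (std::size_t i = 0; i < 4; ++i) {
        const std::size_t shift = (order == Endian::Little) ? 8 * i : 8 * (3 - i);
        bytes[i] = static_cast<std::uint8_t>((word >> shift) & 0xFFu);
    }
    return bytes;
}

// Inverse of store_word: reassembles the word from its byte layout.
std::uint32_t load_word(const std::array<std::uint8_t, 4>& bytes, Endian order) {
    std::uint32_t word = 0;
    for (std::size_t i = 0; i < 4; ++i) {
        const std::size_t shift = (order == Endian::Little) ? 8 * i : 8 * (3 - i);
        word |= static_cast<std::uint32_t>(bytes[i]) << shift;
    }
    return word;
}
```

Storing a word in one byte order and loading it in the other reverses its bytes, which is the conversion a transfer between ports of different endianness has to perform.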

Chapter 4 Interfaces

The DMA module is controlled by the Senior core. When configuring it, the Senior core uses its I/O instructions in and out to read and write the registers of the DMA module.

4.1 Host Interface

The host interface of the DMA module conforms to the standard Senior I/O and should be connected through the general I/O of the Senior processor. The interface between the DSP core and the DMA module is listed in Table 4.1. The data buses from and to the DMA module are 32 bits wide. Only the 16 least significant bits are used in the current DMA configuration.

Name         Width  Dir  Description
clk_i        1      In   System clock.
rst_i        1      In   System reset, active low.
addr_i       16     In   Address input (from DSP core).
data_i       32     In   Data input (from DSP core).
rd_strobe_i  1      In   Read strobe signal.
wr_strobe_i  1      In   Write strobe signal.
data_o       32     Out  Data output (to DSP core).

Table 4.1. Host Interface

Table 4.2 gives an overview of the DMA register specification. Reference [7, p.53] gives more detailed information about how to connect a peripheral to the Senior I/O.

4.1.1 Main Status Register

The status register shows the status of DMA transactions. Firmware developers can use this register to handle the DMA transactions.

Name         Addr  Width  Written by  Description
Status       00    16     DMA         Shows the status of the DMA. Further details can be found in Table 4.3.
Control      01    16     Senior      Used for configuring and controlling the DMA. Details can be found in Table 4.4.
Output Data  10    16                 Not used in the current implementation.
Input Data   11    16     Senior      The DSP core writes the task packet to this port to configure the DMA channel.

Table 4.2. DMA Registers specification

Bits      Specification
[0]       Idle or busy: Idle = 0, Busy = 1.
[1]       When 1, a channel can be configured. When 0, no channel is available.
[2]       When 1, the running task is finished.
[3]       When 1, an exception has occurred.
[4]       When 1, the task queue is full.
[15 : 5]  Reserved

Table 4.3. Main status register specification

4.1.2 Main Control Register

The control register, as the name suggests, is used to control a DMA transaction.

Bits       Specification
[0]=1      Reset DMA, flush the current task.
[1]=1      Shut down DMA.
[2]=1      Data rate: always use the DMA clock.
[9 : 3]    Reserved
[10]=1     Activate the task (channel) specified in the task ID.
[14 : 11]  DMA task ID
[15]       When [15] = 1, ask for a channel configuration.

Table 4.4. Main control register specification
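The bit assignments in Tables 4.3 and 4.4 translate directly into masks. The C++ sketch below is an illustration only: the constant and function names are invented here, and sending the task ID in the same write as the request or activate bit is an assumption about how firmware would use the register, not something the tables state.

```cpp
#include <cassert>
#include <cstdint>

// Status register bits (Table 4.3). Names are hypothetical.
constexpr std::uint16_t kStatusBusy             = 1u << 0;
constexpr std::uint16_t kStatusChannelAvailable = 1u << 1;
constexpr std::uint16_t kStatusTaskFinished     = 1u << 2;
constexpr std::uint16_t kStatusException        = 1u << 3;
constexpr std::uint16_t kStatusQueueFull        = 1u << 4;

// Control register bits (Table 4.4). Names are hypothetical.
constexpr std::uint16_t kCtrlReset         = 1u << 0;
constexpr std::uint16_t kCtrlShutdown      = 1u << 1;
constexpr std::uint16_t kCtrlActivate      = 1u << 10;
constexpr unsigned      kCtrlTaskIdShift   = 11;       // task ID occupies [14:11]
constexpr std::uint16_t kCtrlRequestConfig = 1u << 15;

// Builds a control word that carries a task ID together with the
// "ask for a channel configuration" and/or "activate" bits.
std::uint16_t control_word(bool request_config, bool activate, unsigned task_id) {
    assert(task_id < 16);  // the task ID field is 4 bits wide
    std::uint16_t word = static_cast<std::uint16_t>(task_id << kCtrlTaskIdShift);
    if (request_config) word |= kCtrlRequestConfig;
    if (activate)       word |= kCtrlActivate;
    return word;
}

// Helpers for decoding a status word read from the status register.
bool channel_available(std::uint16_t status) { return (status & kStatusChannelAvailable) != 0; }
bool task_finished(std::uint16_t status)     { return (status & kStatusTaskFinished) != 0; }
bool queue_full(std::uint16_t status)        { return (status & kStatusQueueFull) != 0; }
```

With these definitions, control_word(true, false, 1) evaluates to 0x8800 and control_word(false, true, 1) to 0x0C00.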

4.1.3 Special Memory Control Register

This register does not belong to the general I/O of the Senior core. It is a special-purpose register that is written by the DMA controller and read by the Senior core. By setting the corresponding bit in the register, the DMA controller notifies the Senior core which memory it is currently accessing.

Bits      Specification
[0]=1     The DMA controller is accessing DM 0.
[1]=1     The DMA controller is accessing DM 1.
[2]=1     The DMA controller is accessing PM.
[15 : 3]  Reserved

Table 4.5. Special memory control register specification

4.2 Memory Interface

The memory interface connects the DMA module to its slaves. Since the DMA module supports 16 input ports and 16 output ports, 32 ports are needed in all. Table 4.6 shows the details of the memory interface of the DMA module.

Name         Width  Dir  Description
src0_data_i  32     I    Data input for Source Port 0.
src0_addr_o  16     O    Address output for Source Port 0.
src0_csn_o   1      O    Memory chip select enable for Source Port 0, active low.
src0_oe_o    1      O    Memory output enable for Source Port 0, active low.
src1                     Interfaces for Source Port 1.
...
src15                    Interfaces for Source Port 15.
dst0_data_o  32     O    Data output for Destination Port 0.
dst0_addr_o  16     O    Address output for Destination Port 0.
dst0_csn_o   1      O    Memory chip select enable for Destination Port 0, active low.
dst0_we_o    1      O    Memory write enable for Destination Port 0, active low.
dst1                     Interfaces for Destination Port 1.
...
dst15                    Interfaces for Destination Port 15.

Table 4.6. Memory Interface

4.3 Behavior model of I/O

Since only one data I/O is used both for configuring the DMA module and for writing DMA tasks, we need a protocol to distinguish DMA configuration from task reception. Figure 4.1 illustrates the configuration flow of the DMA module.

Figure 4.1. DMA configuration

Here, PREAMBLE denotes the first control vector sent to the control register of the DMA module. Chapter 6 shows several examples of how to program the DMA controller.

4.4 Task Packet Specification

The task packet is used to set up the DMA transfer channel, both for normal DMA operation and for multiple transactions using a linking table. Since the DSP core has a general I/O with a 16-bit data width, each data word of the task packet is also 16 bits wide. A transaction is specified by configuring a channel. The configuration covers the source, the destination and the transaction. Generally, a basic channel configuration includes the following:

- Task priority.
- Data size: the length of the data block.
- Data from: the name of the source port.
- Data to: the name of the destination port.
- The physical start address of the source port.
- The physical start address of the destination port.
- The endian behavior of the source port: big or little endian.

Besides the software configuration of the DMA transaction, DMA designers and DMA users also need to know the hardware specifications of the transactions:

- The maximum source clock rate.
- The maximum destination clock rate.
- Data width of the source port: 8 bits, 16 bits, 32 bits or 64 bits.
- Data width of the destination port: 8 bits, 16 bits, 32 bits or 64 bits.
- Data protocol of the source port: error check or not.

Table 4.7 shows a task packet consisting of 2 links, and Tables 4.8 to 4.12 explain each control vector. The length of the task packet depends on the number of links in the linking table.

Word  Fields (most significant bits first)
1     Number of Links (8b) | Task Priority (4b) | Task ID (4b)
2     SRC width (2b) | DST width (2b) | SRC proc (1b) | DST proc (1b) | SRC endian (1b) | DST endian (1b) | SRC rate (4b) | DST rate (4b)
3     Reserved (6b) | Source Port (5b) | Destination Port (5b)
4     Destination Address low part (16b)
5     Destination Address high part (16b)
6     Source Address 1 low part (16b)
7     Source Address 1 high part (16b)
8     Length of Link 1 (16b)
9     Source Address 2 low part (16b)
10    Source Address 2 high part (16b)
11    Length of Link 2 (16b)
...

Table 4.7. Task packet specification

Name             Bits    Description
Number of Links  [15:8]  Specifies the total number of links, up to 64.
Task Priority    [7:4]   Specifies the priority of the task (not yet implemented).
Task ID          [3:0]   Specifies the task ID.

Table 4.8. Control Vector 1

Name        Bits     Description
SRC width   [15:14]  Data width of the source port: 2'b00: 8 bits; 2'b01: 16 bits; 2'b10: 32 bits; 2'b11: 64 bits.
DST width   [13:12]  Data width of the destination port: 2'b00: 8 bits; 2'b01: 16 bits; 2'b10: 32 bits; 2'b11: 64 bits.
SRC proc    [11]     Whether the source port uses parity check: 1'b0: don't use; 1'b1: use.
DST proc    [10]     Whether the destination port uses parity check: 1'b0: don't use; 1'b1: use.
SRC endian  [9]      Endian of the source port: 1'b0: little endian; 1'b1: big endian.
DST endian  [8]      Endian of the destination port: 1'b0: little endian; 1'b1: big endian.
SRC rate    [7:4]    Clock rate of the source port: 4'b0000: clk; 4'b0001: clk/2; 4'b0010: clk/4; 4'b0011: clk/8; 4'b0100: clk/16; 4'b0101: clk/32; 4'b0110: clk/64.
DST rate    [3:0]    Clock rate of the destination port: 4'b0000: clk; 4'b0001: clk/2; 4'b0010: clk/4; 4'b0011: clk/8; 4'b0100: clk/16; 4'b0101: clk/32; 4'b0110: clk/64.

Table 4.9. Control Vector 2
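The field layout of Control Vector 2 can be exercised with a small packing routine. This C++ sketch follows the bit positions and codes of Table 4.9; the struct PortFormat and the function names are hypothetical and chosen only for this illustration.

```cpp
#include <cassert>
#include <cstdint>

// Returns the 2-bit width code of Table 4.9, or -1 for an unsupported width.
int width_code(unsigned bits) {
    switch (bits) {
        case 8:  return 0;
        case 16: return 1;
        case 32: return 2;
        case 64: return 3;
        default: return -1;
    }
}

// Returns the 4-bit rate code for a clock divider of 1, 2, 4, ..., 64
// (clk, clk/2, ..., clk/64), or -1 if the divider has no code in Table 4.9.
int rate_code(unsigned divider) {
    for (int code = 0; code <= 6; ++code) {
        if (divider == (1u << code)) return code;
    }
    return -1;
}

// Per-port settings carried by Control Vector 2 (hypothetical grouping).
struct PortFormat {
    unsigned width_code;  // 2-bit code, see width_code()
    bool parity;          // "proc" bit: parity check on or off
    bool big_endian;      // false: little endian, true: big endian
    unsigned rate_code;   // 4-bit code, see rate_code()
};

// Packs source and destination settings into the 16-bit Control Vector 2.
std::uint16_t pack_control_vector2(const PortFormat& src, const PortFormat& dst) {
    assert(src.width_code < 4 && dst.width_code < 4);
    assert(src.rate_code < 16 && dst.rate_code < 16);
    return static_cast<std::uint16_t>(
        (src.width_code << 14) | (dst.width_code << 12) |
        (static_cast<unsigned>(src.parity) << 11) |
        (static_cast<unsigned>(dst.parity) << 10) |
        (static_cast<unsigned>(src.big_endian) << 9) |
        (static_cast<unsigned>(dst.big_endian) << 8) |
        (src.rate_code << 4) | dst.rate_code);
}
```

As an example, a 32-bit big endian source running at clk/2 and a 16-bit little endian destination running at clk, both without parity check, pack to 0x9210.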

Name              Bits     Description
Reserved          [15:10]  Reserved for future use.
Source Port       [9:5]    Specifies the source port number.
Destination Port  [4:0]    Specifies the destination port number.

Table 4.10. Control Vector 3

Name                           Bits    Description
Destination Address low part   [15:0]  Low 16 bits of the destination address.
Destination Address high part  [15:0]  High 16 bits of the destination address.

Table 4.11. Control Vector 4 & 5

Name                        Bits    Description
Source Address 1 low part   [15:0]  Low 16 bits of source address 1.
Source Address 1 high part  [15:0]  High 16 bits of source address 1.
Length of Link 1            [15:0]  Length of Link 1.

Table 4.12. Control Vector 6 & 7 & 8
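Putting the control vectors together, a complete task packet is five common words followed by three words per link. The C++ sketch below assembles such a packet according to Tables 4.7 to 4.12. The names Link and build_task_packet are invented for this illustration, and Control Vector 2 is passed in as an already packed word.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// One entry of the linking table: source start address and length.
struct Link {
    std::uint32_t src_addr;
    std::uint16_t length;
};

// Assembles the 16-bit words of a task packet in the order of Table 4.7.
std::vector<std::uint16_t> build_task_packet(unsigned task_id, unsigned priority,
                                             std::uint16_t control_vector2,
                                             unsigned src_port, unsigned dst_port,
                                             std::uint32_t dst_addr,
                                             const std::vector<Link>& links) {
    assert(task_id < 16 && priority < 16);   // 4-bit fields
    assert(src_port < 32 && dst_port < 32);  // 5-bit fields
    assert(!links.empty() && links.size() <= 64);

    std::vector<std::uint16_t> packet;
    // Control Vector 1: number of links, task priority, task ID.
    packet.push_back(static_cast<std::uint16_t>((links.size() << 8) | (priority << 4) | task_id));
    // Control Vector 2: widths, parity, endian and rates, packed by the caller.
    packet.push_back(control_vector2);
    // Control Vector 3: source and destination port numbers.
    packet.push_back(static_cast<std::uint16_t>((src_port << 5) | dst_port));
    // Control Vectors 4 and 5: destination address, low part first.
    packet.push_back(static_cast<std::uint16_t>(dst_addr & 0xFFFFu));
    packet.push_back(static_cast<std::uint16_t>(dst_addr >> 16));
    // Three words per link: source address low, source address high, length.
    for (const Link& link : links) {
        packet.push_back(static_cast<std::uint16_t>(link.src_addr & 0xFFFFu));
        packet.push_back(static_cast<std::uint16_t>(link.src_addr >> 16));
        packet.push_back(link.length);
    }
    return packet;
}
```

For the motion estimation task of Table 3.1 (task ID 1, priority 1, five links) this produces 5 + 3 × 5 = 20 words, and the first word is 0x0511. The port numbers for main memory and DM0 are not given in this chapter, so any values used when calling the function are placeholders.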

Chapter 5 DMA Hardware

Generally, the DMA controller hardware can be divided into a data path and a control path [6, p.572]. Figure 5.1 shows the basic architecture of the DMA module.

Figure 5.1. DMA Hardware architecture

The DMA data path reads data from the source port using the source address generator and stores data to the destination port using the destination address generator. In order to handle data with different data rates and formats, source decoding and destination decoding modules are also needed. The DMA control path consists of the channel configuration FSM (Finite State Machine) and the transaction FSM. The DSP core can request the configuration of a channel. When the DMA is idle, the channel configuration FSM issues the channel to the transaction FSM module. The transaction FSM is responsible for controlling the data path. When the block has been transmitted, the channel configuration FSM generates an interrupt to the DSP core. The following sections give more detailed information about the sub-blocks of the DMA controller. Figure 5.2 shows the block diagram of the DMA controller with its main inputs and outputs.

Figure 5.2. DMA Controller Block Diagram

5.1 Host Interface

This is the interface between the Senior DSP core and the DMA controller. It stores the control vectors sent by the DSP core in registers inside the DMA controller and updates the status register, which can be read by the Senior DSP core.

5.1.1 Block Diagram

Figure 5.3 shows the block diagram of the Host Interface. The input MUX selects the input I/O data based on the input I/O address. The task FIFO holds the task packet, which is used by the transaction FSM. The output MUX outputs the desired data based on the I/O address.

5.1.2 Interface

Table 5.1 gives the detailed interface description of the Host Interface.

Figure 5.3. Block diagram of Host Interface Module

Name                  Width  Dir  Description
clk_i                 1      I    Clock input.
rst_i                 1      I    Synchronous reset, active low.
io_data_i             16     I    Data input from Host interface.
io_addr_i             16     I    Address input from Host interface.
io_rd_strobe_i        1      I    Read strobe from Host interface.
io_wr_strobe_i        1      I    Write strobe from Host interface.
io_data_o             16     O    Data output to Host interface. (Reserved)
config_reg_addr_i     8      I    Read address for Task queue.
config_reg_addr_en_i  1      I    Read enable signal for Task queue.
config_reg_data_o     16     O    Task queue data output.
contrl_reg_o          16     O    DMA control register, output to transaction FSM.
status_reg_i          16     I    DMA status register, input from transaction FSM.

Table 5.1. Interface of Host Interface Module

5.2 Source Address Generator

This module generates the address for the source port. It is controlled by the transaction FSM.

5.2.1 Block Diagram

Figure 5.4 shows the block diagram of the source address generator.

Figure 5.4. Block diagram of Source address generator

Once the transaction FSM has decoded the task packet parameters into control signals, it sends these signals to the source address generator. As shown in Figure 5.4, an adder inside the source address generator produces the output source port address. Two counters count how many words and how many links have been transferred, so that the end link or end transfer signal is asserted once the link or the transfer is finished.

5.2.2 Interface

Table 5.2 gives the interface details of the source address generator.

Name               Width  Dir  Description
clk_i              1      I    Clock input.
step_i             2      I    Address increment step.
enable_i           1      I    Enable address increment.
set_addr_i         1      I    Set start address.
end_link_o         1      O    Indicate the end of one link.
end_transfer_o     1      O    Indicate the end of transfer.
src_addr_i         32     I    Start address of the transfer.
src_length_i       16     I    Transfer length.
src_link_number_i  8      I    Total number of links.
src_addr_o         32     O    Source address output.

Table 5.2. Interface of Source address generator
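The behaviour described above (an adder plus a word counter and a link counter) can be modelled in software as follows. This C++ class is a rough behavioural sketch under stated assumptions, not the thesis's behavioral model: the class and method names are invented, the step is treated as a plain increment because the encoding of the 2-bit step_i signal is not given here, and cycle-level timing of the end signals is not modelled.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical behavioural model of the source address generator.
class SourceAddressGenerator {
public:
    // Start of a task: remember the total number of links (src_link_number_i).
    void start_transfer(unsigned link_count) {
        links_left_ = link_count;
        end_transfer_ = false;
    }

    // Corresponds to set_addr_i: load the start address and length of a link.
    void set_link(std::uint32_t start_addr, std::uint16_t length) {
        addr_ = start_addr;
        words_left_ = length;
        end_link_ = false;
    }

    // Corresponds to one clock cycle with enable_i asserted: returns the
    // address of the current word, then adds `step` to the address.
    std::uint32_t tick(unsigned step) {
        assert(words_left_ > 0);
        const std::uint32_t current = addr_;
        addr_ += step;                       // the adder
        if (--words_left_ == 0) {            // word counter
            end_link_ = true;
            if (links_left_ > 0 && --links_left_ == 0) {  // link counter
                end_transfer_ = true;
            }
        }
        return current;
    }

    bool end_link() const { return end_link_; }
    bool end_transfer() const { return end_transfer_; }

private:
    std::uint32_t addr_ = 0;
    std::uint16_t words_left_ = 0;
    unsigned links_left_ = 0;
    bool end_link_ = false;
    bool end_transfer_ = false;
};
```

A task with two links would call start_transfer(2), then set_link and a series of tick calls for each link; end_link() becomes true after the last word of each link and end_transfer() after the last word of the second link.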

5.3 Destination Address Generator

This module generates the address for the destination port. Its control signals are provided by the transaction FSM.

5.3.1 Block Diagram

Figure 5.5 shows the block diagram of the destination address generator.

Figure 5.5. Block diagram of Destination address generator

This module has the same structure as the source address generator. The only difference is that it does not need the counters for transferred words and links.

5.3.2 Interface

Table 5.3 gives the detailed interface description of the destination address generator.

Name       Width  Dir  Description
clk_i      1      I    Clock input.
step_i     2      I    Address increment step.
enable_i   1      I    Enable address increment.
setaddr_i  1      I    Set start address.
addr_i     32     I    Start address of the transfer.
addr_o     32     O    Address output.

Table 5.3. Interface of Destination address generator

5.4 Source Decoder

This module decodes the incoming data based on the task packet provided by the transaction FSM. It adapts the data to the internal data format that can be transferred through the channel.

5.4.1 Block Diagram

Figure 5.6 shows the block diagram of the source decoder.

Figure 5.6. Block diagram of Source decoder

The source decoder consists of several MUXs that decode the incoming data based on control signals provided by the transaction FSM. First, the input data are segmented into 8 bytes; then the MUXs select the right combination of data bytes to form the internal data format.

5.4.2 Interface

Table 5.4 gives the interface details of the source decoder.

5.5 Destination Decoder

This module packages the internal data format into the data format specified by the task packet.

5.5.1 Block Diagram

Figure 5.7 shows the block diagram of the destination decoder.

Name          Width  Dir  Description
clk           1      I    Clock input.
rst           1      I    Synchronous reset, active low.
src_width     2      I    Source data width.
src_parity    1      I    Source parity check.
src_endian    1      I    Source endian.
channel_din   64     I    Data input from source port.
channel_dout  64     O    Data output to channel FIFO.

Table 5.4. Interface of Source decoder

Figure 5.7. Block diagram of Destination decoder

The destination decoder has a structure similar to that of the source decoder. The output MUX combines the internal data into the desired data format based on control signals provided by the transaction FSM.

5.5.2 Interface

Table 5.5 gives the detailed interface description of the destination decoder.

Name          Width  Dir  Description
clk           1      I    Clock input.
rst           1      I    Synchronous reset, active low.
dest_width    2      I    Destination data width.
dest_parity   1      I    Destination parity check.
dest_endian   1      I    Destination endian.
channel_din   64     I    Data input from channel FIFO.
channel_dout  64     O    Data output to destination port.

Table 5.5. Interface of Destination decoder

5.6 Transaction FSM

This FSM controls all transactions based on the task packet provided by the DSP core. It receives the incoming task packet and saves it in the DMA internal registers. The transaction FSM decodes the task packet based on the specification in Table 3.2 and then issues control signals to the sub-blocks of the DMA controller to complete the DMA transaction. Figure 5.8 shows the finite state machine of the control logic.

Figure 5.8. Finite State Machine of the control logic

The transaction FSM has eight states in the current design. IDLE is the default state after the DMA controller is reset. Once the Senior core requests to configure the DMA controller, state CONFIG1 is entered, and the transaction FSM decodes the incoming common control vectors until it has finished the first 5 common control vectors. States CONFIG2_1, CONFIG2_2 and CONFIG2_3 then configure the source address and link length of the linking table. Once the channel is configured, state TRANS is entered and the DMA controller starts the data transfer. When the FSM receives the end of link signal, state WAIT is entered to wait for the configuration of the next transfer in the linking table. The FSM then repeats states CONFIG2_1, CONFIG2_2 and CONFIG2_3 to configure the channel. Once the end of transfer signal is detected, state FINISH is entered, the interrupt signal is sent to the Senior core and the status register is updated. The DMA controller then waits for the Senior core to respond either to the status register or to the interrupt signal.

5.6.1 Interface

Table 5.6 gives the detailed interface description of the transaction FSM.

Name               Width  Dir  Description
clk_i              1      I    Clock input.
rst_i              1      I    Synchronous reset, active low.
src_port_o         5      O    Source port number.
dst_port_o         5      O    Destination port number.
config_reg_data_i  16     I    Task packet data input.
contrl_reg_i       16     I    Control register data input.
config_reg_addr_o  8      O    Task packet read address.
config_addr_en_o   1      O    Task packet read enable.
status_reg_o       16     O    Status register data output.
src_addr_o         32     O    Start address of source port.
src_addr_en_o      1      O    Enable source port start address.
src_addr_incr      2      O    Increment step of source port.
enable_src_gen_o   1      O    Source address generator enable signal.
link_length_o      16     O    Length of current transfer link.
link_num_o         8      O    Total link number.
end_link_i         1      I    End of current link.
end_transfer_i     1      I    End of current transfer.
dst_addr_o         32     O    Start address of destination port.
dst_addr_en_o      1      O    Enable destination port start address.
dst_addr_incr      2      O    Increment step of destination port.
enable_dst_gen_o   1      O    Destination address generator enable signal.
src_rate_o         4      O    Source port data rate.
src_parity_o       1      O    Source port parity check.
src_endian_o       1      O    Source port endian.
dst_rate_o         4      O    Destination port data rate.
dst_parity_o       1      O    Destination port parity check.
dst_endian_o       1      O    Destination port endian.
src_csn_o          1      O    Source port chip select enable, active low.
src_oe_o           1      O    Source port output enable, active low.
dst_csn_o          1      O    Destination port chip select enable, active low.
dst_we_o           1      O    Destination port write enable, active low.

Table 5.6. Interface of Transaction FSM
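The state sequence described in Section 5.6 can be summarised as a next-state function. The C++ sketch below is an interpretation of that description, not a transcription of Figure 5.8. The Inputs struct and its field names are invented, each CONFIG2 state is assumed to consume exactly one word (source address low part, high part, link length), and FINISH is assumed to return to IDLE once the core has responded.

```cpp
#include <cassert>

enum class State { IDLE, CONFIG1, CONFIG2_1, CONFIG2_2, CONFIG2_3, TRANS, WAIT, FINISH };

// Hypothetical condition signals seen by the FSM in one step.
struct Inputs {
    bool config_request = false;  // Senior core asks for a channel configuration
    bool common_done = false;     // the 5 common control vectors have been decoded
    bool end_link = false;        // end of link signal from the source address generator
    bool end_transfer = false;    // end of transfer signal from the source address generator
    bool core_ack = false;        // core responded to the status register or the interrupt
};

// Next-state function following the textual description of the transaction FSM.
State next_state(State s, const Inputs& in) {
    switch (s) {
        case State::IDLE:      return in.config_request ? State::CONFIG1 : State::IDLE;
        case State::CONFIG1:   return in.common_done ? State::CONFIG2_1 : State::CONFIG1;
        case State::CONFIG2_1: return State::CONFIG2_2;  // assumed: source address low part
        case State::CONFIG2_2: return State::CONFIG2_3;  // assumed: source address high part
        case State::CONFIG2_3: return State::TRANS;      // assumed: link length
        case State::TRANS:
            if (in.end_transfer) return State::FINISH;   // last link done
            if (in.end_link)     return State::WAIT;     // more links to configure
            return State::TRANS;
        case State::WAIT:      return State::CONFIG2_1;  // configure the next link
        case State::FINISH:    return in.core_ack ? State::IDLE : State::FINISH;
    }
    assert(false && "all states are handled above");
    return State::IDLE;
}
```

For a task with two links, the model passes through CONFIG1, the three CONFIG2 states, TRANS, WAIT, the three CONFIG2 states again, TRANS and finally FINISH.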

Chapter 6 Integration

Since the DMA controller should work together with the Senior DSP core, we need to integrate the DMA controller into the processor core. This chapter introduces the basic flow, which includes hardware integration and software integration.

6.1 Hardware Integration

The DMA controller works as a peripheral of the Senior DSP core. As introduced in Chapter 4 and Reference [7], a peripheral can be connected to any available GIO. In the following piece of code, the DMA controller is connected to I/O number 5. The Senior DSP system has other peripherals connected, such as a timer and an interrupt controller. The memory interface of the DMA controller should also be connected to the existing Senior memory sub-system. Since the processor needs to know which memory is being accessed by the DMA controller, to make sure the processor core does not access the same memory module, the Special Memory Control Register of the DMA controller should also be connected to the Senior core.

6.2 Software Integration

In order to make the verification of the DMA controller easier, a behavioral model of the DMA controller was also developed. Thus, it is necessary to integrate the behavioral model into the simulator. The behavioral model is written in C++. At first, the behavioral model was not exactly cycle accurate. After simulation of the hardware implementation, the behavioral model was further tuned to meet the timing specification of the actual hardware. The behavioral model should be compiled together with the Senior simulator. The DMA controller should be instantiated in the header file of the simulator, as in Example 6.1.