Design of Embedded DSP Processors Unit 5: Data access 9/11/2017 Unit 5 of TSEA26-2017 H1 1
Data memory in a processor
  Store
    Data FIFO supporting DSP execution
    Computing buffer
    Parameter storage
  Access
    Data access
    Memory addressing
    Cache
    Scratch pad memory
    Single/multi-port
    Address generator
    Critical path
    Silicon cost
Memory design: what to do
  Memory may not be a component in the IP
  Addressing is part of the core design
  The memory peripheral is core hardware
  Design for data access today includes:
  1. Memory peripheral design fundamentals
  2. General memory addressing
  3. Special memory addressing accelerations
Memory issues and challenges
  An application may need megabytes of memory
  Memory consumes the majority of the silicon area
  Moving memory off chip reduces silicon cost
  Keeping memory on chip preserves performance
  On-chip/off-chip partitioning is the main trade-off between performance, low silicon cost, and low power
  Memory is the central challenge of DSP design: memory hierarchy, partition, access latency, and usage efficiency (SW-HW co-design)
On-chip memory problems
  Background: when silicon scales, memory speeds up more slowly than logic
  1. Architecture: multiple memory blocks in parallel to support functional acceleration
     a) Small memory blocks: fast, but area inefficient (DFT?)
     b) Large memory blocks: slow, but area efficient
  2. Custom addressing vs. cache (which hides complexity)
     a) Acceleration reduces latency and cost
     b) A cache gives easy programming and SW portability
Physical memory circuits
Data access specification
  Single port or dual port
  Cache or SPM
  General address generation
  Special address generation, such as modulo
  Memory peripheral functions
  Cache (I and D) and SPM (D)
  Size and number of memory modules
  Unsigned computing for the AGU
  AGU pipeline: tricky
  Block shutting down
  Memory peripherals are tricky to design
  Pipeline balancing for critical path and LPD
Basic SRAM and its timing
Copyright of Linköping University, all rights reserved
For teachers using the book
[Figure: SRAM block diagram. The row decoder has m input lines and 2^m output lines. The column decoder and sense amplifiers have k input lines, 2^(k+1) lines to the memory, and a one-bit data in-out.]
Basic SRAM and its timing
[Figure: (a) a memory cell connected to a row line, a column line, and a column bar line; (b) a 128x4-bit single-port SRAM with the row decoder, the column decoder and R-W circuit, and 4 data in-out bits.]
Scratch pad memory
  A simple SRAM, not a cache
  In general, a scratch pad memory:
    Is a synchronous, single-port SRAM
    Leaves data access complexity, and the opportunities that come with it, to the SW designer
    Can also be DRAM or ROM, or multi-port
Cache
[Figure: the processor core and general address translation module issue an address split into Tag (21 b), Index (9 b), and Offset (4 b). Each cache entry holds a Valid bit, a Tag, and Data (256 b = 16 b x 16); a comparator (=) checks the stored tag against the address tag.]
  "Cache" comes from French and means "hide": the cache hides complexity to make access easier
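The address split in the cache figure can be sketched in a few lines of Python. This is only an illustration under assumptions that are not on the slide: the cache is taken to be direct mapped, the offset is taken to select one of the 16 words in a line, and the names `split_address` and `DirectMappedCache` are hypothetical. With the widths given on the slide, the three fields together span 34 address bits.

```python
TAG_BITS, INDEX_BITS, OFFSET_BITS = 21, 9, 4  # field widths from the slide


def split_address(addr):
    """Split an address into (tag, index, offset) using the slide's widths."""
    offset = addr & ((1 << OFFSET_BITS) - 1)
    index = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag = (addr >> (OFFSET_BITS + INDEX_BITS)) & ((1 << TAG_BITS) - 1)
    return tag, index, offset


class DirectMappedCache:
    """Hit/miss bookkeeping only (assumption: direct mapped, no data stored)."""

    def __init__(self):
        lines = 1 << INDEX_BITS            # 512 lines for a 9-bit index
        self.valid = [False] * lines
        self.tags = [0] * lines

    def access(self, addr):
        """Return True on a hit; on a miss, fill the line and return False."""
        tag, index, _ = split_address(addr)
        hit = self.valid[index] and self.tags[index] == tag
        if not hit:
            self.valid[index] = True
            self.tags[index] = tag
        return hit
```

Two addresses with the same index but different tags evict each other in this sketch, which is one source of the cache-miss uncertainty discussed for embedded systems.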
Scratch pad or cache?
  Scratch pad memory
    Simpler, cheaper, and uses less power
    More opportunities for access acceleration, such as multiple separate memories and custom addressing
    Deterministic, static only, for embedded systems!
  Cache memory
    Hides complexity, but costs much more power and silicon
    Cycles caused by cache misses make timing uncertain
    Easy to program, suitable for general computing
Design of memory peripherals
Basic SRAM and its timing
[Figure: SRAM timing diagram.]
Memory logic (case 1)
[Figure: the circuitry with problem. A D flip-flop and logic or a long wire sit on the SRAM signals: write-in data, address, memory enable, write enable, read-out data. The machine clock and the memory clock are shown separately.]
[Timing diagram: before clock duty modification, memory using machine clock. Signals: machine clock, address, memory enable, write enable, logic delay, read-out data, read-out register, data valid.]
[Timing diagram: after clock duty modification, memory using memory clock. Signals: memory clock, address, memory enable, write enable, logic delay, read-out data, read-out register, data valid.]
Memory logic (case 2)
[Figure: the circuitry with problem. An address register (AR) clocked by the machine clock drives the SRAM address through logic or a long wire; the SRAM also receives write-in data, memory enable, write enable, and the memory clock, and produces read-out data.]
[Timing diagram: before phase modification, memory using machine clock. The old address is still present during the decoding time, so a wrong word is read; the right address comes too late and the read-out register captures incorrect data.]
[Timing diagram: after phase modification, memory clock is delayed. The right address comes in time, the right word is read, and the read-out register captures correct data.]
Memory addressing
General memory addressing

  Addressing                 Algorithm specification
  Implied addressing         Implicitly specified in the OP code
  Memory direct              A <= immediate data of the instruction
  Segment plus offset        A <= SEG + OFFSET
  Register indirect          A <= Selected GR
  Register post increment    A <= AP and then INC (AP)   /* AP is an address pointer */
  Register pre decrement     DEC (AP) and then A <= AP
  Index addressing           A <= SEG + Index GR
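The addressing modes in the table can be written as one small function. This is a minimal sketch, not a description of any particular core: the 16-bit address width, the `state` dictionary layout, and the mode names are all assumptions made for illustration. Implied addressing is left out because the address comes from the OP code itself. Masking every result keeps the arithmetic unsigned, so a pointer that steps below zero wraps to the highest address.

```python
ADDR_BITS = 16                      # assumed address width
MASK = (1 << ADDR_BITS) - 1


def effective_address(mode, state, operand=0):
    """Return address A for one mode; pointer modes also update state['AP'].

    state is a hypothetical register model: {'SEG': int, 'AP': int, 'GR': list}.
    operand is the immediate/offset, or a GR number for the register modes.
    """
    if mode == "memory_direct":            # A <= immediate data
        return operand & MASK
    if mode == "segment_offset":           # A <= SEG + OFFSET
        return (state["SEG"] + operand) & MASK
    if mode == "register_indirect":        # A <= selected GR
        return state["GR"][operand] & MASK
    if mode == "post_increment":           # A <= AP, then INC(AP)
        a = state["AP"] & MASK
        state["AP"] = (a + 1) & MASK
        return a
    if mode == "pre_decrement":            # DEC(AP), then A <= AP
        state["AP"] = (state["AP"] - 1) & MASK
        return state["AP"]
    if mode == "index":                    # A <= SEG + index GR
        return (state["SEG"] + state["GR"][operand]) & MASK
    raise ValueError(f"unknown addressing mode: {mode}")
```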
Memory addressing circuit in general
[Figure: the inputs and the initial address feed the address calculation logic circuit. Its combinational output goes to the keeper (the address pointer), which provides the registered output and the addressing feedback to the calculation logic.]
An addressing circuit example
[Figure: addressing calculation logic I and II (logic I is the same as II in this figure) built from multiplexers M1 to M6, a full adder (FA), and a +1 incrementer. Inputs: register value, offset value, direct address. Outputs: address registers RG1, RG2, ..., RG6, RG7.]
Modulo addressing: an addressing acceleration example
  Register FIFO is needed by algorithms
    Easy to use, but high power/silicon consumption
  Modulo FIFO addressing on low-power SRAM
    Uses a top blocker, a bottom blocker, and an address pointer
  [Figure: a memory region bounded by Top and Bottom, with the address pointer between them.]
  Normal step: Address pointer <= Address pointer ± 1
  Wrap when stepping toward top: if address pointer (AP) = top then AP <= bottom
  Wrap when stepping toward bottom: if address pointer (AP) = bottom then AP <= top
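The pointer update rule of modulo addressing can be sketched as follows. The function name `modulo_step` is hypothetical, and the sketch assumes that bottom is the lowest address of the FIFO region and top is the highest, so that +1 steps toward top and -1 steps toward bottom.

```python
def modulo_step(ap, top, bottom, step):
    """Return the next address pointer inside [bottom, top].

    Assumption: bottom <= ap <= top, and step is +1 or -1.
    """
    if not bottom <= ap <= top:
        raise ValueError("address pointer outside the FIFO region")
    if step == 1:
        return bottom if ap == top else ap + 1    # wrap at the top blocker
    if step == -1:
        return top if ap == bottom else ap - 1    # wrap at the bottom blocker
    raise ValueError("step must be +1 or -1")
```

Only an equality compare and a multiplexer are needed per step, which is why this is cheaper in hardware than a general modulo operation.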
A convolution
02 ACR <= 0;
03 LCR <= m;  // LCR is the loop counter register
04 CAR <= coefficient_starting_address;
05 DAR <= data_starting_address;  // for DM
06 TAR <= top_address;  // of FIFO in DM
07 BAR <= bottom_address;  // of FIFO in DM
08 DM (DAR) <= input_new_data;
09 OPA <= DM (DAR);
10 OPB <= TM (CAR);
11 BFR <= OPA * OPB;
12 ACR <= ACR + BFR;
13 if DAR == BAR then DAR <= TAR
14 else DEC (DAR);
15 INC (CAR);
16 DEC (LCR);
17 if LCR <> 0 then jump to 09  // LCR counts down from m
18 else Y <= Saturate (round (ACR));
19 end.
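The convolution pseudocode can be mirrored in Python to see the modulo-addressed FIFO at work. This is a sketch under several assumptions that are not stated on the slide: the FIFO between BAR and TAR holds exactly as many samples as there are coefficients, samples and coefficients are plain integers (so the rounding step is skipped), saturation is to a signed 16-bit range, and after each output DAR is advanced by one with wrap so that the next sample overwrites the oldest one. The names `fir_step` and `saturate16` are hypothetical.

```python
def saturate16(value):
    """Clamp to the signed 16-bit range (assumed output width)."""
    return max(-32768, min(32767, value))


def fir_step(dm, tm, dar, bar, tar, new_sample):
    """Compute one output sample; return (y, next_dar).

    dm: data memory (list), with the FIFO occupying addresses bar..tar
    tm: coefficient memory (list), one coefficient per FIFO slot
    """
    dm[dar] = new_sample                       # line 08: store the new sample
    acr = 0                                    # line 02: clear the accumulator
    ptr = dar
    for car in range(len(tm)):                 # lines 09-17: m iterations
        acr += dm[ptr] * tm[car]               # lines 09-12: multiply-accumulate
        ptr = tar if ptr == bar else ptr - 1   # lines 13-14: modulo decrement
    # Assumption: point DAR at the oldest sample for the next call.
    next_dar = bar if dar == tar else dar + 1
    return saturate16(acr), next_dar           # line 18: saturate the result
```

A typical use keeps `dar` between calls: `y, dar = fir_step(dm, tm, dar, bar, tar, x)` for each incoming sample `x`.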
The FIFO buffer
  Design a FIFO based on SRAM
[Figure: the data memory space DM from MIN address to MAX address, with the FIFO occupying BAR + 0 to BAR + 4 between BAR and TAR.]
[Figure: FIFO contents X(n) to X(n-4) and the DAR position at Step 0 (before getting new data) and at Steps 1, 2, and 3 (after getting new data 1, 2, and 3).]
  Example: the procedure of a FIFO getting a new data sample
Convolution hardware
[Figure: modulo addressing circuit. The modulo address generator holds TAR, BAR, and DAR, with an equality comparator (=) raising a flag if EQ and multiplexers M1 to M4 selecting the next DAR or loading data to the registers. DAR addresses DM; DM and TM feed the multiplier (*), the adder (+), and ACR.]
Convolution hardware
[Figure: modulo++ addressing circuit. Same structure as the modulo addressing circuit, with a +1 incrementer and with BAR and TAR swapped: the modulo address generator holds BAR, TAR, and DAR, an equality comparator (=) raising a flag if EQ, and multiplexers M1 to M4. DAR addresses DM; DM and TM feed the multiplier (*), the adder (+), and ACR.]
Memory hierarchy
Memory in an embedded system
  An application may need >200 MB of SRAM
  E.g. a video encoder with a high-end monitor:
    Frames for camera pre-processing
    Frames for the video encoder and for the video decoder
    Frames for video post-processing
  A 4K frame is about 26.5 MB (4096 x 2160 x 3 B)
  More than 200 MB of SRAM would consume ~50 mm^2 on chip: expensive!
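The product 4096 x 2160 x 3 B is easy to check, and the same arithmetic shows how few uncompressed 4K frames fit in a 200 MB budget. The helper names below are hypothetical, and 1 MB is taken as 10^6 bytes.

```python
def frame_bytes(width, height, bytes_per_pixel=3):
    """Size of one uncompressed frame in bytes."""
    return width * height * bytes_per_pixel


def frames_that_fit(memory_bytes, width, height, bytes_per_pixel=3):
    """Number of whole frames that fit in the given memory."""
    return memory_bytes // frame_bytes(width, height, bytes_per_pixel)


size = frame_bytes(4096, 2160)                    # 26,542,080 bytes
count = frames_that_fit(200 * 10**6, 4096, 2160)  # 7 whole frames
```

Seven frames are quickly used up once pre-processing, encoding, decoding, and post-processing each need their own frame buffers.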
Memory in an embedded system
  An application may need X MB of ROM
  Besides data buffers, a video encoder with a high-end monitor needs further memories for:
    Up to x MB of camera and other driver settings
    Up to x MB of video CODEC code
    Up to x KB of code for post-processing
  CMOS silicon processing masks are usually not made for ROM processing. Extra cost is needed!
Memory in an embedded system
  1. We therefore need off-chip SDRAM (DDR) and off-chip ROM (or SSD) in a system
  2. We speed up on-chip SRAM by splitting DM into small memory blocks
  Memory subsystem design is essential for an embedded system
Memory hierarchy
[Figure: (a) a DSP core with an I-scratch pad memory and D-scratch pad memories, connected through an external memory I/F to the main memory and ROM; (b) a DSP core with an I-scratch pad memory, an I-cache memory, a D-scratch pad memory, and a D-cache memory, connected through a DMA controller & I/F to the main memory and ROM.]
Memory hierarchy on a chip
  Level 1: Register file (in the datapath)
  Level 2: Data memory: scratch pad memory or cache
  Level 3: Main memory (through the DMA controller)
[Figure: the chip, containing the datapath with the register file, the data memory, the DMA controller, and the main memory.]
Memory hierarchy of SoC = DSP+MCU
[Figure: a DSP, an MCU, and accelerators. The DSP and the MCU each have L1: RF, DP+CP, data memories DM1 to DMn, a PM, and a DMA with I/F; the accelerators have DP+CP, DMn, PM, and a DMA with I/F. All connect to the SoC bus and its arbitration / routing / control, which connects through I/Fs to the main on-chip memory, the nonvolatile memory, and the off-chip DRAM.]
Memory partition
  Requirements
    The limit of the on-chip memory size
    The number of data items needed simultaneously
    Supporting access of different data types
    Overhead costs from the memory peripheral
    Critical path from the memory peripheral
    Memory shutting down for low power
Off-chip memory extension
  DDR, Flash memory, or off-chip SRAM
  DDR (volatile): 1. double data rate, 2. SDRAM
    DDR controller: POR, refresh, access, buffering
    DDR PHY: circuit and I/O implementation
  Flash memory (non-volatile, SSD)
    The floating gate keeps the data, the control gate changes the data
    NAND: fast write, large size, low cost
    NOR: fast read, low power, and small
  Off-chip SRAM: fast and parallel, high cost
DMA basics
DMA definition and specification
  DMA: direct memory access
    An external device independent of the core
    Runs loads and stores in parallel with the core
    The DSP processor can do other things in parallel
  Requirements
    Large bandwidth and low latency
    Flexible: supports different access patterns
    Multiple access and a linking table are important
Direct Memory Access
  1. The processor configures a DMA setup
  2. The processor asks for the right to use the main memory and gets the right
  3. The processor releases a memory page (block) and connects it to a DMA port
  4. The main memory (or another memory) is connected to another DMA port
  5. The DMA transaction runs: the processor runs other programs while the DMA runs the transaction
  6. The processor takes back the memory page and releases the DMA as well as the main memory
DMA data and control
DMA behavior model
[Figure: an arbiter (DMA request, DMA ACK), a controller with a configuration vector (Config. REG address), a FIFO buffer between DMA data in and DMA data out, and two address generators. Address generator 1 uses Clock 1 and drives address, enable, and W-enable; address generator 2 uses Clock 2 and drives address, enable, and W-enable. The controller also reports DMA status.]
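The data path of the behavior model (two address generators with a FIFO buffer between them) can be sketched in Python. This is an untimed illustration only: the arbiter, the two clocks, and the status signals are not modeled, and the configuration keys `src_start`, `dst_start`, `count`, `src_stride`, and `dst_stride` are hypothetical names, not fields of any real DMA controller.

```python
from collections import deque


def address_generator(start, count, stride=1):
    """Yield count addresses beginning at start, stride apart."""
    for i in range(count):
        yield start + i * stride


def dma_transfer(src, dst, config, fifo_depth=4):
    """Copy config['count'] words from src to dst through a FIFO buffer.

    Address generator 1 reads the source, address generator 2 writes the
    destination. Returns the number of words moved.
    """
    if fifo_depth < 1:
        raise ValueError("fifo_depth must be at least 1")
    count = config["count"]
    reads = address_generator(config["src_start"], count,
                              config.get("src_stride", 1))
    writes = address_generator(config["dst_start"], count,
                               config.get("dst_stride", 1))
    fifo = deque()
    moved = 0
    while moved < count:
        # Fill phase: read from the source until the FIFO is full.
        while len(fifo) < fifo_depth:
            addr = next(reads, None)
            if addr is None:
                break
            fifo.append(src[addr])
        # Drain phase: write the buffered words to the destination.
        while fifo:
            dst[next(writes)] = fifo.popleft()
            moved += 1
    return moved
```

Separate strides on the two generators are one simple way to support different access patterns, for example gathering every second word of the source into a contiguous destination block.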
Questions to discuss
  Why is memory address computation based on unsigned arithmetic, and what are the benefits?
  Modulo addressing emulates a FIFO and is one kind of acceleration for address computing. What kinds of acceleration are needed for FFT address calculation? If you have time, try it (run a DIT butterfly in one cycle).
Review on Unit 5
[Table labels: Concepts, Skills; System understanding, Plan, HW schematic, HW coding]
  Micro architecture: memory & data access
  Memory circuit, memory hierarchy
  General and special addressing for memory access (we did not talk about addressing for the RF)
  Modulo addressing: a way to use part of an SRAM as a FIFO for filters and for Java garbage collection in a JVM
  You may need to think about how to use D-cache and SPM in parallel
  Plan the pipeline to balance the critical path and avoid long wire delay
  Coding skill is especially important for IP reuse, because SRAMs offered by different IP suppliers might differ
Self reading after the lecture
  Function of the memory subsystem
  Function of general/special memory addressing
  Read chapter 15
  Think about:
    FIFO behavior model mapping on SRAM
    Modulo addressing hardware implementation
Exciting time now!
  Let us discuss whatever you want to discuss that is related to HW
  You will have the chance after each lecture (Fö), so do take the chance!
  Prepare your questions for next time
Welcome to ask any questions you want
  I can answer, or we can discuss together
  I want to know what you want
Dake Liu, Room 556, corridor B, Hus-B, phone 281256, dake.liu@liu.se