Design of Embedded DSP Processors. Unit 5: Data access. 9/11/2017, Unit 5 of TSEA26-2017 H1



Data memory in a processor. Store: data FIFO supporting DSP execution, computing buffer, parameter storage. Access: data access, memory addressing, cache, scratch pad memory, single/multi-port memory, address generator. Design constraints: critical path, silicon cost.

Memory design: what to do. The memory itself may not be a component of the processor IP, but addressing is part of the core design, and the memory peripheral is core hardware. Design for data access, as covered today, includes: 1. memory peripheral design fundamentals; 2. general memory addressing; 3. special memory addressing accelerations.

Memory issues and challenges. An application may need megabytes of memory, and memory consumes the majority of the silicon area. Moving memory off chip reduces silicon cost, while keeping memory on chip preserves performance. The on-chip/off-chip partition trade-off is a main challenge for performance, silicon cost, and power. Memory is the central challenge of DSP design: memory hierarchy, partition, access latency, and usage efficiency (SW-HW co-design).

On-chip memory problems. Background: as silicon is scaled, memory speed improves more slowly than logic speed. 1. Architecture: multiple memory blocks in parallel support functional acceleration. a) Small memory blocks are fast but area-inefficient (and what about DFT, design for test?). b) Large memory blocks are slow but area-efficient. 2. Custom addressing vs. cache (which hides complexity): a) acceleration reduces latency and cost; b) a cache gives easy programming and software portability.

Physical memory circuits

Data access specification: single-port or dual-port memory; cache or SPM (scratch pad memory); cache (I and D) and SPM (D); size and number of memory modules; general address generator; special address generator, such as modulo addressing; unsigned computing for the AGU (address generation unit); memory peripheral functions; the AGU pipeline is tricky; block shutdown; memory peripherals are tricky to design; pipeline balancing for the critical path and LPD.

Basic SRAM and its timing (figure; copyright of Linköping University, all rights reserved). The row decoder has m input lines and 2^m output lines. The column decoder and sense amplifiers have k input lines and 2^(k+1) lines to the memory array (a column line and a column-bar line per column), plus one data in/out bit.

Basic SRAM and its timing (figure). (a) A memory cell, connected to a row line, a column line, and a column-bar line. (b) A 128x4-bit single-port SRAM, consisting of a row decoder, a column decoder and R/W circuit, and 4 data in/out bits.
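The decoding described above can be sketched in C. This is a minimal sketch under stated assumptions. It assumes the upper m address bits drive the row decoder and the lower k bits drive the column decoder. The names `sram_decode`, `row_decoder_onehot`, and `column_line_count` are hypothetical and are not taken from the slides.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical result of decoding one SRAM address. */
typedef struct {
    uint32_t row;  /* active row (word) line, 0 .. 2^m - 1 */
    uint32_t col;  /* selected column pair, 0 .. 2^k - 1 */
} sram_select;

/* Assumption: upper m bits select the row, lower k bits select the column. */
sram_select sram_decode(uint32_t addr, unsigned m, unsigned k)
{
    assert(m + k < 32);
    assert(addr < (1u << (m + k)));
    sram_select s;
    s.row = addr >> k;
    s.col = addr & ((1u << k) - 1u);
    return s;
}

/* Row decoder output as a one-hot vector: exactly one of 2^m lines is high.
   Limited to m <= 6 so the vector fits in 64 bits. */
uint64_t row_decoder_onehot(uint32_t row, unsigned m)
{
    assert(m <= 6 && row < (1u << m));
    return (uint64_t)1 << row;
}

/* Each column has a column line and a column-bar line: 2 * 2^k = 2^(k+1). */
uint32_t column_line_count(unsigned k)
{
    assert(k < 31);
    return 2u << k;
}
```

For example, a 7-bit address split as m = 5 and k = 2 selects one of 32 rows and one of 4 columns. The split between m and k is a layout choice. It trades the number of word lines against the number of bit-line pairs.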

Scratch pad memory: a simple SRAM, not a cache. In general, a scratch pad memory is a synchronous, single-port SRAM. The software designer handles the data access complexity and in return gets opportunities for optimization. It can also be a DRAM or a ROM, or have multiple ports.

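Cache (figure). The processor core issues an address through a general address translation module. The address is split into a tag (21 b), an index (9 b), and an offset (4 b). Each cache line holds a valid bit, a tag, and data (256 b = 16 × 16 b). The stored tag is compared (=) with the tag of the address to detect a hit. "Cache" comes from the French cacher, "to hide": the cache hides the complexity of memory access, making access easier for the programmer.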
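The address split in the figure can be illustrated with a small C model. This is a sketch under three assumptions, none stated on the slide: the cache is direct-mapped (the figure shows a single tag comparison), the 4-bit offset selects one of 16 16-bit words, and the address is 21 + 9 + 4 = 34 bits wide. All type and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define OFFSET_BITS 4   /* 16 words of 16 b = 256 b per line */
#define INDEX_BITS  9   /* 512 lines */
#define TAG_BITS    21

typedef struct {
    bool     valid;
    uint32_t tag;                        /* 21-bit stored tag */
    uint16_t data[1u << OFFSET_BITS];    /* one 256-bit line */
} cache_line;

typedef struct {
    cache_line line[1u << INDEX_BITS];   /* direct-mapped: one line per index */
} dm_cache;

typedef struct { uint32_t tag, index, offset; } cache_addr;

cache_addr split_address(uint64_t addr)
{
    assert(addr < ((uint64_t)1 << (TAG_BITS + INDEX_BITS + OFFSET_BITS)));
    cache_addr a;
    a.offset = (uint32_t)(addr & ((1u << OFFSET_BITS) - 1u));
    a.index  = (uint32_t)((addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1u));
    a.tag    = (uint32_t)(addr >> (OFFSET_BITS + INDEX_BITS));
    return a;
}

/* Returns true on a hit and writes the word to *out; false on a miss. */
bool cache_read(const dm_cache *c, uint64_t addr, uint16_t *out)
{
    cache_addr a = split_address(addr);
    const cache_line *l = &c->line[a.index];
    if (l->valid && l->tag == a.tag) {   /* the "=" comparator in the figure */
        *out = l->data[a.offset];
        return true;
    }
    return false;                        /* miss: refill from main memory */
}
```

A miss costs extra cycles whose number depends on the access history. This is the timing uncertainty that the scratch pad versus cache comparison refers to.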

Scratch pad or cache? Scratch pad memory is simpler, cheaper, and uses less power. It offers more opportunities for access acceleration, such as multiple separate memories and custom addressing. It is deterministic and static only, which suits embedded systems. Cache memory hides complexity but uses much more power and costs silicon. The cycles induced by cache misses make timing uncertain. It makes programming easy and suits general computing.

Design of memory peripherals

Basic SRAM and its timing (figure)

Memory logic (case 1) (figure). The circuitry with the problem: the address, memory enable, write enable, and write-in data reach the SRAM through logic or a long wire, and a D flip-flop (the read-out register) captures the read-out data. The SRAM can be clocked by the machine clock or by a separate memory clock. Before clock duty modification, the memory uses the machine clock. The logic delay on the address and control signals delays the read, so the read-out data is not valid when the read-out register samples it. After clock duty modification, the memory uses the memory clock, and the read-out data is valid when the read-out register samples it.

Memory logic (case 2) (figure). The circuitry with the problem: an address register (AR) drives the SRAM address through logic or a long wire, together with the memory enable, write enable, and write-in data. Before phase modification, the memory uses the machine clock. The right address comes too late, so the SRAM starts decoding the old address, reads a wrong word, and the read-out register captures incorrect data. After phase modification, the memory clock is delayed relative to the machine clock. The right address then arrives in time, the right word is read, and correct data is available when the read-out register samples it.
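The phase modification in case 2 amounts to choosing a memory-clock delay inside a timing window. A minimal static-timing sketch in C follows. It assumes that the read-out register is clocked by the machine clock and that the memory clock is the machine clock delayed by a fixed amount. The parameter names are hypothetical, and any numbers used with it are illustrative rather than data from the lecture.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical timing parameters, in ns. */
typedef struct {
    double t_cycle;      /* machine clock period */
    double t_ar_clk_q;   /* address register clock-to-output delay */
    double t_wire;       /* logic / long-wire delay from AR to the SRAM */
    double t_setup_mem;  /* SRAM address setup time before its clock edge */
    double t_access;     /* SRAM clock-to-read-data access time */
    double t_setup_reg;  /* read-out register setup time */
} mem_timing;

/* Computes the window of memory-clock delays (relative to the machine clock
   edge) in which the SRAM samples the NEW address and its read data still
   reaches the read-out register before the next machine clock edge.
   Returns false if no valid delay exists. */
bool memory_clock_window(const mem_timing *t, double *earliest, double *latest)
{
    assert(t->t_cycle > 0.0);
    /* Too early: the SRAM would latch the old address (the case 2 failure). */
    *earliest = t->t_ar_clk_q + t->t_wire + t->t_setup_mem;
    /* Too late: the read-out data would miss the read-out register. */
    *latest = t->t_cycle - t->t_access - t->t_setup_reg;
    return *earliest <= *latest;
}
```

If the window is empty, delaying the clock cannot help. The pipeline must then be rebalanced, for example by shortening the long wire or adding a register stage.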

Memory addressing

General memory addressing (addressing mode: algorithm specification)
Implied addressing: implicitly specified in the op code.
Memory direct: A <= immediate data of the instruction.
Segment plus offset: A <= SEG + OFFSET.
Register indirect: A <= selected GR.
Register post-increment: A <= AP, and then INC(AP) /* AP is an address pointer */.
Register pre-decrement: DEC(AP), and then A <= AP.
Index addressing: A <= SEG + index GR.
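The addressing modes above can be modeled as small C functions that each compute the address A. This is a sketch, not the course's AGU. The 16-bit unsigned address width, the register count, and all names (`agu_state`, `addr_post_inc`, and so on) are assumptions. Implied addressing is left out because its address is fixed by the op code.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical AGU state with 16-bit unsigned addresses. */
typedef struct {
    uint16_t gr[16];  /* general registers */
    uint16_t ap;      /* address pointer */
    uint16_t seg;     /* segment register */
} agu_state;

/* Memory direct: A <= immediate data of the instruction. */
uint16_t addr_direct(uint16_t imm)
{
    return imm;
}

/* Segment plus offset: A <= SEG + OFFSET (unsigned, wraps modulo 2^16). */
uint16_t addr_seg_offset(const agu_state *s, uint16_t offset)
{
    return (uint16_t)(s->seg + offset);
}

/* Register indirect: A <= selected GR. */
uint16_t addr_reg_indirect(const agu_state *s, unsigned r)
{
    assert(r < 16);
    return s->gr[r];
}

/* Register post-increment: A <= AP, and then INC(AP). */
uint16_t addr_post_inc(agu_state *s)
{
    uint16_t a = s->ap;
    s->ap = (uint16_t)(s->ap + 1u);
    return a;
}

/* Register pre-decrement: DEC(AP), and then A <= AP. */
uint16_t addr_pre_dec(agu_state *s)
{
    s->ap = (uint16_t)(s->ap - 1u);
    return s->ap;
}

/* Index addressing: A <= SEG + index GR. */
uint16_t addr_indexed(const agu_state *s, unsigned r)
{
    assert(r < 16);
    return (uint16_t)(s->seg + s->gr[r]);
}
```

Post-increment and pre-decrement are the only modes that update state (AP). In hardware, this update is the feedback path from the address calculation logic back into the address pointer.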

Memory addressing circuit in general (figure; copyright of Linköping University). Inputs and an initial address feed an address calculation logic circuit. Its result is available either directly as a combinational output or through the address pointer (the keeper) as a registered output. The address pointer is fed back to the calculation logic (addressing feedback).

An addressing circuit example (figure). Inputs: register value, offset value, direct address. Two address calculation logic blocks (I and II, identical in this figure) are built from multiplexers M1 to M6, a +1 incrementer, and a full adder (FA). Outputs: address registers RG1, RG2, ..., RG6, RG7.

Modulo addressing: an example of addressing acceleration. Algorithms need FIFOs. A register FIFO is easy to use but has high power and silicon consumption. Modulo addressing instead implements the FIFO in low-power SRAM, using a top blocker, a bottom blocker (boundary registers), and an address pointer. Address pointer <= Address pointer ± 1. When stepping up, if the address pointer (AP) = top, then AP <= bottom. When stepping down, if AP = bottom, then AP <= top.
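The wrap-around rule can be sketched as a small C model of a modulo address generator. It assumes that the bottom address (BAR) is the lower address and the top address (TAR) the higher one, as in the FIFO figure later in this unit. The type and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical modulo AGU: the FIFO occupies addresses [bar, tar]. */
typedef struct {
    uint16_t bar;  /* bottom (lowest) address of the FIFO */
    uint16_t tar;  /* top (highest) address of the FIFO */
    uint16_t ap;   /* address pointer, always within [bar, tar] */
} modulo_agu;

/* Step +1: if AP = top then AP <= bottom, else AP <= AP + 1. */
void modulo_inc(modulo_agu *g)
{
    assert(g->bar <= g->ap && g->ap <= g->tar);
    g->ap = (g->ap == g->tar) ? g->bar : (uint16_t)(g->ap + 1u);
}

/* Step -1: if AP = bottom then AP <= top, else AP <= AP - 1. */
void modulo_dec(modulo_agu *g)
{
    assert(g->bar <= g->ap && g->ap <= g->tar);
    g->ap = (g->ap == g->bar) ? g->tar : (uint16_t)(g->ap - 1u);
}
```

The wrap needs only an equality compare and a multiplexer, with no division or general modulo operation. This is why it is cheap enough to run every cycle in an address generator.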

A convolution:
02 ACR <= 0;
03 LCR <= m; // LCR is the loop counter register
04 CAR <= coefficient_starting_address;
05 DAR <= data_starting_address; // for DM
06 TAR <= top_address; // of the FIFO in DM
07 BAR <= bottom_address; // of the FIFO in DM
08 DM(DAR) <= input_new_data;
09 OPA <= DM(DAR);
10 OPB <= TM(CAR);
11 BFR <= OPA * OPB;
12 ACR <= ACR + BFR;
13 if DAR == BAR then DAR <= TAR
14 else DEC(DAR);
15 INC(CAR);
16 DEC(LCR);
17 if LCR <> Loop_size then jump to 09
18 else Y <= Saturate(round(ACR));
19 end.
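The convolution above can be sketched in C with the FIFO kept in a modulo-addressed region of data memory. This is a sketch under several assumptions:

- Data and coefficients are Q15.
- The accumulator is 64 bits wide.
- The loop runs m times, counting LCR down to zero.
- The FIFO length equals m.
- DAR persists between calls.

One step is not in the slide's pseudocode: after the loop, DAR is stepped up once, so that the next sample overwrites the oldest one. The names `fir_modulo` and `saturate_round_q15` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Round a Q30 sum of products to Q15 and saturate to 16 bits.
   Assumes arithmetic right shift of negative values (true on common compilers). */
static int16_t saturate_round_q15(int64_t acc)
{
    acc = (acc + (1 << 14)) >> 15;
    if (acc > INT16_MAX) return INT16_MAX;
    if (acc < INT16_MIN) return INT16_MIN;
    return (int16_t)acc;
}

/* One output sample. The FIFO is dm[bar..tar] (m = tar - bar + 1 entries),
   the coefficients are tm[car0 .. car0 + m - 1], and *dar persists between calls. */
int16_t fir_modulo(int16_t *dm, const int16_t *tm, uint16_t car0, unsigned m,
                   uint16_t bar, uint16_t tar, uint16_t *dar, int16_t x_new)
{
    assert(bar <= *dar && *dar <= tar && m == (unsigned)(tar - bar + 1u));
    int64_t acr = 0;                                   /* 02 ACR <= 0 */
    uint16_t car = car0;                               /* 04 CAR <= coeff start */
    dm[*dar] = x_new;                                  /* 08 DM(DAR) <= new data */
    for (unsigned lcr = m; lcr != 0; --lcr) {          /* 03, 16, 17 */
        int32_t opa = dm[*dar];                        /* 09 OPA <= DM(DAR) */
        int32_t opb = tm[car];                         /* 10 OPB <= TM(CAR) */
        acr += (int64_t)opa * opb;                     /* 11, 12 MAC */
        *dar = (*dar == bar) ? tar : (uint16_t)(*dar - 1u);  /* 13, 14 modulo DEC */
        ++car;                                         /* 15 INC(CAR) */
    }
    /* Assumption: DAR is back at the newest sample; step up once to the oldest
       sample, which the next call overwrites. */
    *dar = (*dar == tar) ? bar : (uint16_t)(*dar + 1u);
    return saturate_round_q15(acr);                    /* 18 Y <= Saturate(round(ACR)) */
}
```

With three taps of 0.5 (16384 in Q15), the inputs 1000, 2000, 3000, 4000 give the outputs 500, 1500, 3000, 4500. Each output is half the sum of the latest three samples.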


The FIFO buffer (figure): designing a FIFO based on SRAM. In the data memory (DM) space, between the MIN and MAX addresses, the FIFO occupies addresses BAR + 0 to BAR + 4, bounded by BAR and TAR. Example: the procedure of a FIFO getting a new data sample. Steps 0 to 3 show the samples X(n) to X(n-4) and the position of DAR before getting new data and after getting new data samples 1, 2, and 3. Each new sample is written at DAR over the oldest sample, and DAR wraps between BAR and TAR, so the stored samples never move in memory.

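Convolution hardware (figure): modulo addressing circuit. The modulo address generator holds DAR, TAR, and BAR. A comparator (=) raises a flag if DAR equals BAR. When the flag is set, TAR is loaded into DAR; otherwise, the adder steps DAR down by one. Multiplexers M1 to M4 also load data into the registers. The operands from DM and TM go to a multiplier (*), and an adder (+) accumulates the products into ACR.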

Convolution hardware (figure): Modulo++ addressing circuit, the same structure for an incrementing pointer. The comparator raises the flag if DAR equals TAR. When the flag is set, BAR is loaded into DAR; otherwise, DAR <= DAR + 1. DM and TM feed the multiplier, and the adder accumulates into ACR.

Memory hierarchy

Memory in an embedded system. An application may need more than 200 MB of SRAM. For example, a video encoder with a high-end monitor needs frames for camera pre-processing, frames for the video encoder and the video decoder, and frames for video post-processing. A 4K frame is about 26.5 MB (4096 × 2160 × 3 B = 26,542,080 B). More than 200 MB of SRAM is needed; on chip, it would consume ~50 mm² of silicon, which is expensive!
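The frame-size arithmetic can be checked with a few lines of C. The sketch assumes uncompressed frames at 3 bytes per pixel, as on the slide, and uses decimal megabytes (1 MB = 10^6 B). The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes needed to store one uncompressed frame. */
uint64_t frame_bytes(uint32_t width, uint32_t height, uint32_t bytes_per_pixel)
{
    return (uint64_t)width * height * bytes_per_pixel;
}

/* Number of whole frames that fit into a buffer of the given size. */
uint64_t frames_that_fit(uint64_t buffer_bytes, uint64_t bytes_per_frame)
{
    assert(bytes_per_frame > 0);
    return buffer_bytes / bytes_per_frame;
}
```

A 4096 × 2160 × 3 B frame is 26,542,080 B, and 200 MB holds 7 such frames. So "more than 200 MB" corresponds roughly to eight or more full 4K frames in flight across the pre-processing, encoding, decoding, and post-processing stages.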

Memory in an embedded system. An application may need X MB of ROM. Besides the data buffers, a video encoder with a high-end monitor needs other memories for: up to x MB of camera and other driver settings; up to x MB of video codec code; up to x kB of post-processing code. Standard CMOS process masks usually do not include ROM processing, so ROM adds extra cost!

Memory in an embedded system. 1. We therefore need off-chip SDRAM (DDR) and off-chip ROM (or SSD) in a system. 2. We speed up on-chip SRAM by splitting the DM into small memory blocks. Memory subsystem design is essential for embedded systems.

Memory hierarchy (figure): two configurations of a DSP core. (a) The core uses an I-scratch pad memory and D-scratch pad memories. (b) The core uses an I-scratch pad memory with an I-cache memory, and a D-scratch pad memory with a D-cache memory. Main memory and ROM are reached through an external memory interface or through a DMA controller and interface.

Memory hierarchy on a chip (figure). Level 1: the register file, next to the datapath. Level 2: the data memory, either scratch pad memory or cache. Level 3: main memory, connected through the DMA controller.

Memory hierarchy of an SoC = DSP + MCU (figure). The DSP, the MCU, and the accelerators each have a datapath and control path (DP+CP) with local memories. The DSP and the MCU have an L1 register file (RF) and local memories DM1 to DMn and PM. The accelerators have local DMn and PM. Each subsystem connects through an interface (I/F) and a DMA to the SoC bus with its arbitration, routing, and control. The SoC bus connects the main on-chip memory, a non-volatile memory interface, and an off-chip DRAM interface.

Requirements on memory partition: the limit of the on-chip memory size; the number of data needed simultaneously; support for accessing different data types; overhead costs from the memory peripheral; the critical path through the memory peripheral; memory shutdown for low power.

Off-chip memory extension: DDR, flash memory, or off-chip SRAM. DDR (volatile) is double data rate SDRAM. The DDR controller handles POR (power-on reset), refresh, access, and buffering; the DDR PHY is the circuit and I/O implementation. Flash memory (non-volatile, as in SSDs) uses a floating gate to keep the data and a control gate to change it. NAND flash: fast write, large size, low cost. NOR flash: fast read, low power, small size. Off-chip SRAM: fast and parallel, but high cost.
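"Double data rate" means data is transferred on both edges of the I/O clock. The peak bandwidth of a DDR channel therefore follows from a simple product, sketched here in C. The function name is hypothetical, and the result is a theoretical peak: refresh, bank conflicts, and read/write turnaround lower the sustained rate.

```c
#include <assert.h>
#include <stdint.h>

/* Peak DDR bandwidth in bytes/s: two transfers per I/O clock cycle,
   each transfer as wide as the data bus. */
uint64_t ddr_peak_bytes_per_s(uint64_t io_clock_hz, unsigned bus_width_bits)
{
    assert(bus_width_bits % 8 == 0);
    return io_clock_hz * 2u * (bus_width_bits / 8u);
}
```

For example, a 64-bit channel with an 800 MHz I/O clock (1600 MT/s) peaks at 12.8 GB/s.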

DMA basics

DMA definition and specification. DMA: direct memory access. A DMA controller is a device external to, and independent of, the core. It runs loads and stores in parallel with the core, so the DSP processor can do other things at the same time. Requirements: large bandwidth and low latency; flexibility to support different access patterns; multiple transfers, for which a linking table (a chain of transfer descriptors) is important.
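A linking table can be sketched as a chain of transfer descriptors that the DMA walks without involving the core. This is a minimal behavioral model, not a real controller's register layout. The descriptor fields (2-D transfers with row strides), the word-addressed memories, and the names `dma_desc` and `dma_run` are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical descriptor: one entry of the linking table. */
typedef struct dma_desc {
    uint32_t src;                 /* source start address (word index) */
    uint32_t dst;                 /* destination start address (word index) */
    uint32_t count;               /* words per row */
    uint32_t rows;                /* rows per transfer (2-D access pattern) */
    int32_t  src_stride;          /* source address step between rows */
    int32_t  dst_stride;          /* destination address step between rows */
    const struct dma_desc *next;  /* next entry; NULL ends the chain */
} dma_desc;

/* Walks the chain and copies words from src_mem to dst_mem.
   Returns the total number of words moved. */
uint32_t dma_run(const dma_desc *d, const uint16_t *src_mem, uint16_t *dst_mem)
{
    assert(src_mem != NULL && dst_mem != NULL);
    uint32_t moved = 0;
    for (; d != NULL; d = d->next) {
        for (uint32_t r = 0; r < d->rows; ++r) {
            /* Unsigned wrap-around makes negative strides work as well. */
            uint32_t s = d->src + (uint32_t)((int64_t)r * d->src_stride);
            uint32_t t = d->dst + (uint32_t)((int64_t)r * d->dst_stride);
            for (uint32_t i = 0; i < d->count; ++i)
                dst_mem[t + i] = src_mem[s + i];
            moved += d->count;
        }
    }
    return moved;
}
```

Because each descriptor points to the next, the core can set up a whole sequence of block moves once and then run other programs while the transfers proceed.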

Direct memory access, step by step:
1. The processor configures a DMA setup.
2. The processor asks for the right to use the main memory and gets it.
3. The processor releases a memory page (block) and connects it to a DMA port; the main memory (or another memory) is connected to another DMA port.
4. The DMA transaction runs: the DMA performs the transfer while the processor runs other programs.
5. The processor takes back the memory page and releases the DMA as well as the main memory.

DMA data and control

DMA behavior model (figure). A DMA request goes to an arbiter, which answers with a DMA ACK. A configuration vector is written into the configuration register (Config. REG) at its address. The controller reports the DMA status. DMA data in passes through a FIFO buffer to DMA data out. Two address generators, one on clock 1 and one on clock 2, each drive their own address, enable, and write-enable signals.

Questions to discuss. Why is computing for memory addresses based on unsigned arithmetic, and what are the benefits? Modulo addressing to emulate a FIFO is one kind of acceleration for address computation; what kinds of acceleration are needed for FFT address calculation? If you have time, try running a DIT butterfly in one cycle.
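One acceleration commonly used for FFT address calculation is bit-reversed addressing. A radix-2 FFT consumes or produces its data in bit-reversed index order. This is offered as one possible direction for the second question, not as the course's answer. The sketch shows two variants: a plain bit reversal, and a "reverse-carry" increment. The reverse-carry increment adds one at the most significant bit and propagates the carry downward, so an address register can step through bit-reversed order directly. Both function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Reverses the lowest `bits` bits of index i (for an N = 2^bits point FFT). */
uint32_t bit_reverse(uint32_t i, unsigned bits)
{
    assert(bits <= 32 && (bits == 32 || i < (1u << bits)));
    uint32_t r = 0;
    for (unsigned b = 0; b < bits; ++b) {
        r = (r << 1) | (i & 1u);  /* move the LSB of i into r */
        i >>= 1;
    }
    return r;
}

/* Reverse-carry increment: adds 1 at the MSB of a `bits`-wide address and
   propagates the carry toward the LSB. Starting from 0, successive calls
   visit addresses in bit-reversed order. */
uint32_t reverse_carry_inc(uint32_t a, unsigned bits)
{
    assert(bits >= 1 && bits <= 31);
    uint32_t bit = 1u << (bits - 1);
    while (bit != 0 && (a & bit)) {  /* carry: clear set bits, moving down */
        a &= ~bit;
        bit >>= 1;
    }
    return a | bit;  /* set the first clear bit; wraps to 0 after the last address */
}
```

For bits = 3, stepping reverse_carry_inc from 0 gives 0, 4, 2, 6, 1, 5, 3, 7. In hardware, the reverse carry needs only a mirrored adder carry chain, which is cheap compared with a table lookup.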

Review on Unit 5 (figure placing this unit among concepts and skills: system understanding, planning, HW schematic, HW coding, microarchitecture, memory and data access). Covered: memory circuits; memory hierarchy; general and special addressing for memory access (we did not talk about addressing for the RF). Modulo addressing is a way to use part of an SRAM as a FIFO, for filters and for Java garbage collection in a JVM. You may need to think about how to use the D-cache and the SPM in parallel. Plan the pipeline to balance the critical path and avoid long-wire delay. Coding skill is especially important for IP reuse, because the SRAMs offered by different IP suppliers might be different.

Self-reading after the lecture: the function of the memory subsystem; the function of general and special memory addressing. Read chapter 15. Think about: mapping a FIFO behavior model onto SRAM; the hardware implementation of modulo addressing.

Exciting time now! Let us discuss whatever you want, as long as it is related to HW. You will have the chance after each lecture (Fö); do take it! Prepare your questions for next time.

You are welcome to ask any questions: I can answer, or we can discuss together. I want to know what you want. Dake Liu, Room 556, corridor B, Hus-B, phone 281256, dake.liu@liu.se