Design and Implementation of High Performance Application Specific Memory


Design and Implementation of High Performance Application Specific Memory
(Korean title: 고성능 Application Specific Memory의 설계와 구현)
M.S. Thesis, Sungdae Choi, Dec. 20th, 2002

Outline
  Introduction
  Memory for Mobile 3D Graphics
  Content-Addressable-Memory (CAM) for Network Memory
  Conclusion

Outline
  Introduction
    Motivation
    Related Work
    This Work
  Memory for Mobile 3D Graphics
  Content-Addressable-Memory (CAM) for Network Memory
  Conclusion

Motivation (1/2)
Conventional Memory Architecture
  Designed for General Applications
    Not Optimized for a Specific Application
    Long t_RC (60~80ns)
  For Mass Production & Low Cost
  Bottleneck of the System Performance
  Only Read / Write Operation
Need for Application Specific Memory

Motivation (2/2)
Application Specific Memory
  Offers Unique Characteristics to Improve Performance in a Particular Application*
    Free Design
    Flexible Control
    Guaranteed Bandwidth
    Low Power Consumption
  Complex Design: Needs Both Memory and System Background
* IEEE Circuits & Devices, "High-Speed Memory Architectures for Multimedia Applications," 1997

Application Specific Memory
  Cache DRAM (CDRAM)*: DRAM Density + SRAM Speed
  Video RAM (VRAM)*: Requirement for High-quality Display
  Media-Chip**: Embedded Memory for PC 3D Graphics
  Content-Addressable-Memory (CAM)***: High-speed Search Function
* IEEE Circuits & Devices, "High-Speed Memory Architectures for Multimedia Applications," 1997
** Takao Watanabe, "A Modular Architecture for a 6.4Gbyte/s 8Mb DRAM-Integrated Media Chip," 1997
*** IEEE Potentials, "RAM versus CAM," 1997

Thesis Work
  Memories for Mobile 3D Graphics
    Embedded in RamP-IV Processor
    Implemented using 0.16um DRAM Process
  Content-Addressable-Memory (CAM) for Network Lookup Engine
    Ternary Cell Structure
    Simulation using 0.35um Logic Process
  Memory Architecture Contributes to the System Performance

Outline
  Introduction
  Memories for Mobile 3D Graphics
    Need for Special Memory
    Memory Specification
    Applied Architectures
    Simulation Results
    Chip Implementation
  Content-Addressable-Memory (CAM) for Network Memory
  Conclusion

RamP-IV Processor*
  For Mobile Multimedia Application
* R. Woo, et al., "A 210mW Graphics LSI Implementing Full 3D Pipeline with 264Mtexels/s Texturing for Mobile Multimedia Applications," ISSCC 2003

3D Graphics in RamP-IV
Target Performance for High-Quality 3D Graphics Mobile Application
  256 x 256 resolution @ 24bit color
  16bit depth-comparison
High Image Quality
  Double Frame Buffering
  Texture Mapping
High Performance
  Double Pixel Processor
  100Mpixel/sec Performance
Need for New Memories to Supply the Performance
[Figure: Memories in the Rendering Engine. Pipeline stages: Polygon Setup, Z Compare, Rendering, Color Processing, Texture Processing, Alpha Blending; memories: ZB, TM, FB]

Memories for Target Performance: Frame-Buffer
  24bit I/O bus (for 24bit color processing)
  3Mb capacity (256 x 256 x 24bit x double FB)
  Individual 768kb x 4 (for double pixel processing)
  Read-Modify-Write Operation
    Simplifies the Pipeline Stage of the Rendering Engine
  20ns t_RC (for 100Mpixel/s performance)
    Conventional Memory has 60~80ns t_RC
[Figure: RamP Rendering Engine surrounded by four Depth Buffer blocks (2Mb), four Frame Buffer blocks (3Mb), and four Texture Memory blocks (24Mb)]

Memories for Target Performance: Z-Buffer
  16bit I/O bus (for 16bit depth comparison)
  2Mb capacity (256 x 256 x 16bit depth x double FB)
  Individual 512kb x 4 (for double pixel processing)
  Read-Modify-Write Operation
    Simplifies the Pipeline Stage of the Rendering Engine
  20ns t_RC (for 100Mpixel/s performance)
[Figure: RamP Rendering Engine surrounded by four Depth Buffer blocks (2Mb), four Frame Buffer blocks (3Mb), and four Texture Memory blocks (24Mb)]

Memories for Target Performance: Texture Memory
  24bit I/O bus (for 24bit color processing)
  20ns t_RC (for 200Mtexel/s performance)
  Read Oriented Operation
  24Mb capacity = 6Mb x 4 (for bilinear interpolation)
[Figure: RamP Rendering Engine surrounded by four Depth Buffer blocks (2Mb), four Frame Buffer blocks (3Mb), and four Texture Memory blocks (24Mb)]
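The capacity and rate figures on the three memory slides follow from simple arithmetic, sketched below in Python. The helper names are hypothetical, and the per-cycle access counts (2 pixels for the double pixel processor, 4 texels for bilinear interpolation) are an assumed reading of the slides rather than something they state outright.

```python
# Sketch of the sizing arithmetic behind the frame-buffer, Z-buffer and
# texture-memory figures. Helper names are illustrative, not from the thesis.

KBIT = 1024          # bits per "kb" as used on the slides
MBIT = 1024 * 1024   # bits per "Mb"


def buffer_bits(width, height, bits_per_pixel, copies):
    """Total bits needed for `copies` full-screen buffers."""
    return width * height * bits_per_pixel * copies


def accesses_per_second(accesses_per_cycle, t_rc_ns):
    """Sustained access rate when every memory cycle takes t_rc_ns."""
    return accesses_per_cycle * 1_000_000_000 // t_rc_ns


frame_buffer = buffer_bits(256, 256, 24, copies=2)  # double buffering
z_buffer = buffer_bits(256, 256, 16, copies=2)

print(frame_buffer // MBIT, "Mb frame buffer,",
      frame_buffer // 4 // KBIT, "kb per block")
print(z_buffer // MBIT, "Mb Z-buffer,",
      z_buffer // 4 // KBIT, "kb per block")

# Assumption: 2 pixels per 20 ns cycle (double pixel processor) and
# 4 texels per 20 ns cycle (one texel from each of the 4 texture memories).
pixel_rate = accesses_per_second(2, 20)
texel_rate = accesses_per_second(4, 20)
print(pixel_rate // 10**6, "Mpixel/s")
print(texel_rate // 10**6, "Mtexel/s")
```

Under these assumptions the 20ns cycle time is exactly what the 100Mpixel/s and 200Mtexel/s targets require.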

Memory Specification

| | Frame-Buffer | Z-Buffer | Texture Memory |
|---|---|---|---|
| t_RC | 20ns | 20ns | 20ns |
| Memory Size | 768kb | 512kb | 6Mb |
| Function | Read-Modify-Write (RMW), Auto Refresh, NOP | Read-Modify-Write (RMW), Auto Refresh, NOP | Read, Write, Auto Refresh, NOP |
| Control Pin | CLK (Clock Input), RMW (RMW Command), MASK (Write Mask Signal), REF (Auto Refresh Command) | CLK, RMW, MASK, REF | CLK, READ, WRITE, REF |
| Address Pin | 15bit | 15bit | 24bit |
| Data Pin | 24bit Write Bus, 24bit Read Bus | 16bit Write Bus, 16bit Read Bus | 24bit R/W Bus |

Design Restriction
  Always Complete RMW within 20ns for Constant Performance of the Rendering Engine
    t_RC = 20ns: Precharge - Activation - CMD within 20ns
  Burst Access function is useless.
    Page Access Mode enhances burst access latency.
    Partial Activation reduces power consumption.
  Conventional SDRAM interface scheme requires high clock frequency for 20ns t_RC
    7 cycles for RMW = 350MHz Clock
[Figure: conventional SDRAM RMW timing diagram (CLOCK cycles 0 to 10; ADDR: Ra, Ca0; DQ with CL=2: R1, M1, W1; phases: Row Active, Read, Precharge)]
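The 350MHz figure is the clock needed to fit seven interface cycles into one 20ns t_RC. A minimal sketch of that calculation follows; the function name is hypothetical, and the one-cycle case is included only as a comparison point for an interface that completes the whole access in a single clock.

```python
def required_clock_mhz(cycles_per_access, t_rc_ns):
    """Clock frequency (MHz) needed to fit the given cycles into one t_RC."""
    return cycles_per_access * 1000 / t_rc_ns


sdram_like_mhz = required_clock_mhz(7, 20)    # 7-cycle RMW from the slide
single_cycle_mhz = required_clock_mhz(1, 20)  # whole RMW in one clock

print(sdram_like_mhz, "MHz for a 7-cycle RMW")
print(single_cycle_mhz, "MHz for a 1-cycle RMW")
```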

Architecture (1/5): Single Clock Operation
  Finish Precharge-Activation-CMD in one cycle
  SRAM-like Easy Control
    Simplifies the pipeline control of the rendering engine
  Low-Power Consumption due to Low Clock Frequency
[Figure: FB & ZB RMW Timing Diagram (time axis 0 to 40ns; signals: Clock, CMD & ADR (C1, C2), MASK, Read Out (R1, R2), Write In (W1, W2); first cycle: PCG, Active & Read, Modify, Write; second cycle: PCG, Active & Read, Modify, Canceled Write)]
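From the rendering engine's point of view, a single-clock RMW memory behaves like a memory with one "read, modify, maybe write" call per cycle. The sketch below is a behavioral model only, not the circuit: the class and function names are hypothetical, and returning `None` from the modify callback is a modeling choice standing in for the MASK signal that cancels the write.

```python
class SingleCycleRmwMemory:
    """Behavioral model: one rmw() call represents one clock cycle
    (precharge, activate, read, modify, write)."""

    def __init__(self, words, width_bits):
        self.word_mask = (1 << width_bits) - 1
        self.cells = [0] * words

    def rmw(self, address, modify):
        old = self.cells[address]           # Active & Read
        new = modify(old)                   # Modify (done by the engine)
        if new is not None:                 # None models MASK: write canceled
            self.cells[address] = new & self.word_mask
        return old


def depth_test(incoming_depth):
    """Keep the incoming depth only if it is closer than the stored one."""
    return lambda stored: incoming_depth if incoming_depth < stored else None


# Z-buffer block: 512kb of 16-bit words = 32K words (15-bit address).
zb = SingleCycleRmwMemory(words=32 * 1024, width_bits=16)
zb.cells[5] = 0xFFFF                        # "far" initial depth

first = zb.rmw(5, depth_test(0x1234))       # closer pixel: write happens
second = zb.rmw(5, depth_test(0x2000))      # farther pixel: write canceled
print(hex(first), hex(second), hex(zb.cells[5]))
```

Each pixel costs exactly one call regardless of whether the write is canceled, which is the constant-time behavior the slide asks for.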

Architecture (2/5): Partial Activation
  Consumes lower power than Page-Access-Mode
[Figure: partial activation block diagram (labels: PXA[n], sws[3:0], sws0 to sws3, row address decoded signal, column gate control, cell array blocks, I/O ctl signal, I/O ctl, DB & Write Driver, DQR[23:0], DQW[23:0])]

Architecture (3/5): Speed-Optimized Cell Architecture
  Minimize Wire-Delay
[Figure: cell arrays with SWD blocks. (a) Cell Array of the Frame-Buffer: 192cell/SWL, 256cell/BL; (b) Cell Array of the Z-Buffer: 128cell/SWL, 256cell/BL; (c) Cell Array of the Texture Memory: 384cell/SWL, 512cell/BL]

Architecture (4/5): Non-Multiplexed Addressing Scheme
  Decodes both row and column address at a time
  Reduces address decoding time
[Figure: Frame-Buffer & Z-Buffer operation within one clock (labels: PCG, ATV, Fetch, Address Dec., DB, Modify, Write)]

Architecture (5/5): Variable Clock Operation
  Support Long-Lasting Operation in the System Level
[Figure: rendering engine operations against memory timing (ATV, W) on a 0 to 100ns axis for FAST, NORMAL, and SLOW modes; NORMAL and SLOW modes show a Write Latch between the operation and the write]

Implementation (1/2): Overall Structure
[Figure: overall block diagram (labels: CMD, Clock, CMD Block, Block Driver, Ctl Pulse Gen., Ctl & WLD, Unit Blocks, Write Input / Write In Driver (24b or 16b), Read Output / Read Out Driver (24b or 16b))]

Implementation (2/2): Unit Block
[Figure: unit block diagram (labels: Control, Block Selector, Block_ctl, Block Power, ADDR, Row Decoder, WL_ATV, CMD, CLK, Control Pulse Generator, EN, CDen, IO_CTL, Ctl, BL, Column Decoder, PCG, I/O Ctl, I/O Sub Ctl, DB, WDRV_EN, READ, WRITE)]

Simulation Results (1/2): Read-Modify-Write Waveform
[Figure: simulated waveform with phases Precharge, Activation, Read, Modify, Write]

Simulation Results (2/2): Operation Waveform
[Figure: simulated waveform for the command sequence RMW, RMW, NOP, NOP, RMW, RMW, REF, REF, RMW, RMW, followed by five RMW commands with MASK]

Chip Micrograph (1/3): Frame-Buffer
[Figure: chip micrograph, 2mm x 0.56mm]
* 0.16um DRAM technology

Chip Micrograph (2/3): Z-Buffer
[Figure: chip micrograph, 1.6mm x 0.56mm]

Chip Micrograph (3/3): Texture Memory
[Figure: chip micrograph, 3mm x 1.2mm]

Measurement Using Probe-Station

Performance Comparison

| | Frame-Buffer | Texture Memory | Memory1* | Memory2** |
|---|---|---|---|---|
| Process | 0.16um DRAM | 0.16um DRAM | 0.40um DRAM | 0.25um DRAM |
| Memory Size | 768kb | 6Mb | 2Mb | 2.4M |
| Clock | 50MHz | 50MHz | 100MHz | 160MHz |
| Bandwidth | 2.4Gb/s (RMW) | 1.2Gb/s | 12.8Gb/s Max. | 3.8Gb/s |
| Latency | 1 | 1 | 2: Read, 1: Write, + Activation Period | 1 |
| Interface | SRAM-like | SRAM-like | SDRAM-like | SRAM-like |
| Power | RMW: 20mW, MASK: 9mW | Read: 18mW, Write: 27mW | 160mW | 155mW Max., 110mW Typ. |
| Chip Size | < 1.5mm^2/Mb (2.0mm x 0.56mm) | 0.6mm^2/Mb (3.0mm x 1.2mm) | 6.4mm^2/Mb (2.95mm x 4.37mm) | 2.4mm^2/Mb (2.9mm x 2.0mm) |
| Target | Mobile 3D Graphics | Mobile 3D Graphics | PC 3D Graphics | Embedded DRAM |

Low-Power & Optimized Operation for Mobile 3D Graphics
* Takao Watanabe, "A Modular Architecture for a 6.4-Gbyte/s, 8-Mb DRAM-Integrated Media Chip," 1997
** Paul DeMone, "A 6.25ns Random Access 0.25um Embedded DRAM," 2001
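The bandwidth and area-density entries for the two thesis memories can be cross-checked from the clock, bus width, and die dimensions given in the same comparison. The sketch below uses hypothetical helper names and assumes that the RMW bandwidth counts the separate read and write buses together.

```python
def bandwidth_gbps(clock_mhz, bus_bits, buses=1):
    """Peak bandwidth in Gb/s for `buses` buses of `bus_bits` each."""
    return clock_mhz * bus_bits * buses / 1000


def area_per_mb(width_mm, height_mm, size_mb):
    """Area density in mm^2 per Mb."""
    return width_mm * height_mm / size_mb


# Frame-buffer: 24-bit read bus + 24-bit write bus, both used in one RMW.
fb_bandwidth = bandwidth_gbps(50, 24, buses=2)
# Texture memory: single 24-bit R/W bus.
tm_bandwidth = bandwidth_gbps(50, 24)

fb_density = area_per_mb(2.0, 0.56, 768 / 1024)   # 768kb = 0.75Mb
tm_density = area_per_mb(3.0, 1.2, 6)

print(fb_bandwidth, "Gb/s frame-buffer (RMW)")
print(tm_bandwidth, "Gb/s texture memory")
print(round(fb_density, 2), "mm^2/Mb frame-buffer")
print(round(tm_density, 2), "mm^2/Mb texture memory")
```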

Outline
  Introduction
  Memories for Mobile 3D Graphics
  Content-Addressable-Memory (CAM) for Network Memory
    Lookup Engine in a Network Processor
    Problem of Conventional CAM
    Proposed CAM for Lookup Engine
    Simulation Results
  Conclusion

Lookup Engine in a Network Processor
Lookup Engine
  Searches the Tag for each incoming Packet
  Has Lookup Memory (Rule Memory)
  Search is the main job
[Figure: Packets 1 to 3 enter the Lookup Engine and reach the Packet Processor with tags attached (Tag B with Packet 1, Tag C with Packet 2, Tag A with Packet 3)]

Implementing Lookup Engine
  Conventional Memory Architecture*
    Requires Frequent Memory Access
    Consumes Many Clock Cycles
    Low Performance
  CAM Architecture
    Enables One-Cycle Search Operation
    Large Power & Area Consumption
  CAM is preferred.
* Miguel A. Ruiz-Sanchez, "Survey and Taxonomy of IP Address Lookup Algorithms," 2001

Conventional CAM
Structure of Conventional CAM*
  Conventional NAND type
    Low Power Consumption
    Slow Match Line Propagation
    Cannot handle Don't Care
  Conventional NOR type
    Fast Match Line Propagation
    Large Power Consumption
    Cannot handle Don't Care
[Figure: conventional NAND-type and NOR-type CAM cell schematics (BL, BL#, WL, ML)]
* Ilion Yi-Liang Hsiao, et al., "Power Modeling and Low-Power Design of Content Addressable Memories," ISCAS 2001

Conventional TCAM (1/3): Need for Don't Care

Routing Table without Don't Care

| Prefix | Next hop |
|---|---|
| 143.248.1.1 | L2 |
| 143.248.1.2 | L2 |
| ... | L2 |
| 143.248.255.255 | L2 |
| 203.238.128.56 | L5 |
| 203.238.128.1 | L6 |
| 203.238.128.2 | L6 |
| ... | L6 |
| 203.238.128.255 | L6 |

Routing Table using Don't Care

| Prefix | Next hop |
|---|---|
| * | L9 |
| 143.248.*.* | L2 |
| 203.238.128.56 | L5 |
| 203.238.128.* | L6 |
| 203.*.*.* | L1 |

Don't Care reduces the size of the routing table.
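The compact routing table can be exercised with a small software model of a ternary lookup. This is a sketch under stated assumptions: it works at octet granularity, writes the default entry `*` as `*.*.*.*`, and resolves multiple matches by preferring the entry with the fewest wildcards, which stands in for the priority encoding a real TCAM would need. All names are hypothetical.

```python
# Routing table using Don't Care, as on the slide (default "*" expanded).
ROUTES = [
    ("*.*.*.*", "L9"),
    ("143.248.*.*", "L2"),
    ("203.238.128.56", "L5"),
    ("203.238.128.*", "L6"),
    ("203.*.*.*", "L1"),
]


def entry_matches(pattern, address):
    """True when every non-wildcard octet of the pattern equals the address."""
    return all(p == "*" or p == a
               for p, a in zip(pattern.split("."), address.split(".")))


def specificity(pattern):
    """Number of octets that are not Don't Care."""
    return sum(p != "*" for p in pattern.split("."))


# Most specific entries first, so the first hit is the longest match.
TABLE = sorted(ROUTES, key=lambda entry: specificity(entry[0]), reverse=True)


def lookup(address):
    for pattern, next_hop in TABLE:
        if entry_matches(pattern, address):
            return next_hop
    return None


for addr in ["203.238.128.56", "203.238.128.2", "203.1.2.3",
             "143.248.255.255", "10.0.0.1"]:
    print(addr, "->", lookup(addr))
```

A hardware TCAM compares all entries in parallel in one cycle; the loop here only reproduces the result, not the timing.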

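Conventional TCAM (2/3): Handling Don't Care
  2 BCAMs build 1 TCAM
  Encode Don't Care using Two CAM Cells*
[Figure: two binary CAM cells sharing WL and ML (bit lines B1, B1#, B2, B2#; storage nodes q1, q1', q2, q2') with an encoding table for stored 0, 1, and *; extracted table values: 0 1 0 1 1 0 * 0 0]
* Sergio R. Ramirez-Chavez, "Encoding Don't Cares in Static and Dynamic CAMs," 1992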
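The idea of building one ternary cell from two binary cells can be sketched in software. The bit assignment below is an assumption chosen for illustration (the slide's own encoding table did not survive extraction): each stored symbol becomes a pair (q1, q2), where q1 = 1 means "stored 1", q2 = 1 means "stored 0", and (0, 0) means Don't Care. Function names are hypothetical.

```python
# Assumed two-bit encoding per ternary symbol; illustrative only.
ENCODE = {"0": (0, 1), "1": (1, 0), "*": (0, 0)}


def store(word):
    """Encode a ternary word such as '10*1' into a list of (q1, q2) pairs."""
    return [ENCODE[symbol] for symbol in word]


def cell_mismatch(cell, search_bit):
    """A cell objects only when it holds the opposite binary value."""
    q1, q2 = cell
    return (search_bit == "0" and q1 == 1) or (search_bit == "1" and q2 == 1)


def match(stored, key):
    """The match line stays asserted only if no cell reports a mismatch."""
    return not any(cell_mismatch(cell, bit) for cell, bit in zip(stored, key))


entry = store("10*1")
print(match(entry, "1001"))  # Don't Care position searched with 0
print(match(entry, "1011"))  # Don't Care position searched with 1
print(match(entry, "1111"))  # second bit differs
```

With this encoding a Don't Care cell can never pull the match line, so it matches both search values at the cost of twice the storage per bit.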

Conventional TCAM (3/3)
  Dynamic TCAM*: Use Gate Cap. as a storage
  DRAM CAM**: 2 DRAM cells for one TCAM
  Multiple-Valued CAM***: EEPROM technology
  Bottlenecks: Refresh Operation, Core speed, Technology
* Jon P. Wade, "A Ternary Content Addressable Search Engine," 1989
** Tadato Yamagata, "A 288kb Fully Parallel Content Addressable Memory Using a Stacked-Capacitor Cell Structure," 1992
*** Takahiro Hanyu, "Design of a One-Transistor-Cell Multiple-Valued CAM," 1996

Proposed CAM (1/3): Ternary CAM Cell Structure
  NAND type consumes less power.
  Patent Pending 2002/11
[Figure: proposed NAND-type and proposed NOR-type ternary CAM cells (lines: DCL, BL, BL#, DCL#, WL, ML; annotations: "Store 1 or 0", "Store Bin or X")]

Proposed CAM (2/3): Match Line Repeater
  Enhance Match Line speed of NAND type Cell Structure
[Figure: CAM word on word line WLn with bit columns 32 to 16 and 15 to 00 (DCL, BL, DCL#, BL# per column), ML prechargers driven by MPCG and MPCG#, and an ML repeater (RPT) between MLin and MLout]

Proposed CAM (3/3): 2-D Decoding Method
  Made possible by the Long Aspect Ratio of the Proposed Cell
[Figure: 32bit TCAM unit with SWD, WL, match lines ML0 to ML3, ML, and MLout]

Simulation Results: Match Line Propagation
  Final Match Out: 5ns
  More than x3 Faster than Conventional Scheme
[Figure: simulated match-line waveforms, Proposed Scheme vs. Conventional Scheme, for Match and Unmatch cases]

Cell Size
  25% smaller than conventional TCAM
[Figure: layouts of the Proposed Cell and the Conventional Cell (WL, ML, DCL, BL, BL#, DCL#)]

Expected Performance

| | Proposed | JSSC 1998* | JSSC 2001** |
|---|---|---|---|
| Process | 0.35um CMOS | 0.35um 5M1P | 0.35um Standard |
| Clock | 100MHz | 30MHz | 70MHz |
| Memory Capacity | 64k x 32bit Ternary | 64k x 40bit Binary | 256 x 54bit Binary |
| Search Power Dissipation | 100~120fJ/Bit/Search | 83fJ/Bit/Search | 131fJ/Bit/Search |
| Search Speed | 5ns for Search Evaluation | 26ns for Search Out | 7.3ns for Search Evaluation |
| Match Line Type | NAND | NAND | NOR |
| Etc | Store Don't Care | Separated Bit-Line and Search-Line | pMOS NOR type |

* Farhad Shafai, et al., "Fully Parallel 30-MHz, 2.5-Mb CAM," JSSC, November 1998
** Hisatada Miyatake, et al., "A Design for High-Speed Low-Power CMOS Fully Parallel Content-Addressable Memory Macros," JSSC, June 2001

Outline
  Introduction
  Memories for Mobile 3D Graphics
  Content-Addressable-Memory (CAM) for Network Memory
  Conclusion

Conclusion
Application Specific Memory Architectures are Proposed.
  Frame-Buffer, Z-Buffer and Texture Memory for Mobile 3D Rendering Engine
  Ternary-CAM Structure for Network Lookup Engine
Mobile 3D Graphic Memories are Implemented.
  0.16um DRAM Technology
  20ns t_RC with Read-Modify-Write Operation
  Improve the performance of the 3D Graphics System
New Ternary-CAM Structure is Proposed.
  5ns Match Evaluation Time
  25% Reduced Cell Size