Marching Memory マーチングメモリ. UCAS-6 6 > Stanford > Imperial > Verify 中村維男 Based on Patent Application by Tadao Nakamura and Michael J.

Similar documents
Chapter 6 (Lect 3) Counters Continued. Unused States Ring counter. Implementing with Registers Implementing with Counter and Decoder

Internal Memory. Computer Architecture. Outline. Memory Hierarchy. Semiconductor Memory Types. Copyright 2000 N. AYDIN. All rights reserved.

Silicon Memories. Why store things in silicon? It s fast!!! Compatible with logic devices (mostly)

15-740/ Computer Architecture Lecture 19: Main Memory. Prof. Onur Mutlu Carnegie Mellon University

William Stallings Computer Organization and Architecture 6th Edition. Chapter 5 Internal Memory

18-447: Computer Architecture Lecture 25: Main Memory. Prof. Onur Mutlu Carnegie Mellon University Spring 2013, 4/3/2013

Lecture-14 (Memory Hierarchy) CS422-Spring

Memories. Design of Digital Circuits 2017 Srdjan Capkun Onur Mutlu.

ADVANCED COMPUTER ARCHITECTURE TWO MARKS WITH ANSWERS

Concept of Memory. The memory of computer is broadly categories into two categories:

Overview. Memory Classification Read-Only Memory (ROM) Random Access Memory (RAM) Functional Behavior of RAM. Implementing Static RAM

Chapter 5 Internal Memory

Chapter 5B. Large and Fast: Exploiting Memory Hierarchy

EE282 Computer Architecture. Lecture 1: What is Computer Architecture?

Organization. 5.1 Semiconductor Main Memory. William Stallings Computer Organization and Architecture 6th Edition

Computer Systems Organization

The Memory Hierarchy 1

Semiconductor Memory Classification

Lecture 8: RISC & Parallel Computers. Parallel computers

COMPUTER ARCHITECTURES

EEM 486: Computer Architecture. Lecture 9. Memory

Computer Organization. 8th Edition. Chapter 5 Internal Memory

MEMORIES. Memories. EEC 116, B. Baas 3

Z-RAM Ultra-Dense Memory for 90nm and Below. Hot Chips David E. Fisch, Anant Singh, Greg Popov Innovative Silicon Inc.

Column decoder using PTL for memory

Multilevel Memories. Joel Emer Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology

Silicon Memories. Why store things in silicon? It s fast!!! Compatible with logic devices (mostly) The main goal is to be cheap

Hardware Design I Chap. 10 Design of microprocessor

Digital Design, Kyung Hee Univ. Chapter 7. Memory and Programmable Logic

Advanced Parallel Architecture Lesson 3. Annalisa Massini /2015

Chapter 4 Main Memory

Introduction to memory system :from device to system

CS152 Computer Architecture and Engineering Lecture 16: Memory System

Lecture Objectives. Introduction to Computing Chapter 0. Topics. Numbering Systems 04/09/2017

Unleashing the Power of Embedded DRAM

CMOS Logic Circuit Design Link( リンク ): センター教官講義ノートの下 CMOS 論理回路設計

Computer Systems. Binary Representation. Binary Representation. Logical Computation: Boolean Algebra

UNIT V (PROGRAMMABLE LOGIC DEVICES)

Latches. IT 3123 Hardware and Software Concepts. Registers. The Little Man has Registers. Data Registers. Program Counter

COSC 243. Computer Architecture 1. COSC 243 (Computer Architecture) Lecture 6 - Computer Architecture 1 1

ECEN 449 Microprocessor System Design. Memories. Texas A&M University

Lecture 13: SRAM. Slides courtesy of Deming Chen. Slides based on the initial set from David Harris. 4th Ed.

Memory Supplement for Section 3.6 of the textbook

Department of Computer Science and Engineering CS6303-COMPUTER ARCHITECTURE UNIT-I OVERVIEW AND INSTRUCTIONS PART A

Introduction to Semiconductor Memory Dr. Lynn Fuller Webpage:

COMPUTER STRUCTURE AND ORGANIZATION

CS250 VLSI Systems Design Lecture 9: Memory

EE577b. Register File. By Joong-Seok Moon

MULTIPROCESSOR system has been used to improve

Semiconductor Memory Classification. Today. ESE 570: Digital Integrated Circuits and VLSI Fundamentals. CPU Memory Hierarchy.

The CPU and Memory. How does a computer work? How does a computer interact with data? How are instructions performed? Recall schematic diagram:

Memory & Logic Array. Lecture # 23 & 24 By : Ali Mustafa

+1 (479)

ECE 485/585 Microprocessor System Design

CSEE 3827: Fundamentals of Computer Systems. Storage

CSE502: Computer Architecture CSE 502: Computer Architecture

Infineon HYB39S128160CT M SDRAM Circuit Analysis

Memory in Digital Systems

BlueGene/L (No. 4 in the Latest Top500 List)

CS 31: Intro to Systems Digital Logic. Kevin Webb Swarthmore College February 2, 2016

Advanced Parallel Architecture Lesson 3. Annalisa Massini /2015

CSE502: Computer Architecture CSE 502: Computer Architecture

COMPUTER ORGANIZATION AND DESIGN. 5 th Edition. The Hardware/Software Interface. Chapter 4. The Processor

COMPUTER ORGANIZATION AND DESIGN. 5 th Edition. The Hardware/Software Interface. Chapter 4. The Processor

Unit 9 : Fundamentals of Parallel Processing

Chapter 4. Instruction Execution. Introduction. CPU Overview. Multiplexers. Chapter 4 The Processor 1. The Processor.

Computer Organization and Assembly Language (CS-506)

Lecture 13: Memory and Programmable Logic

Memory. Outline. ECEN454 Digital Integrated Circuit Design. Memory Arrays. SRAM Architecture DRAM. Serial Access Memories ROM

ESE370: Circuit-Level Modeling, Design, and Optimization for Digital Systems

Instruction Register. Instruction Decoder. Control Unit (Combinational Circuit) Control Signals (These signals go to register) The bus and the ALU

Chapter 5: Computer Systems Organization

! Memory. " RAM Memory. " Serial Access Memories. ! Cell size accounts for most of memory array size. ! 6T SRAM Cell. " Used in most commercial chips

Memory and Programmable Logic

Chapter 5: Computer Systems Organization. Invitation to Computer Science, C++ Version, Third Edition

Chapter 4. The Processor

Computer Organization

Summary of Computer Architecture

CS 31: Intro to Systems Digital Logic. Kevin Webb Swarthmore College February 3, 2015

Lecture 11 SRAM Zhuo Feng. Z. Feng MTU EE4800 CMOS Digital IC Design & Analysis 2010

Computer System. Hiroaki Kobayashi 7/25/2011. Agenda. Von Neumann Model Stored-program instructions and data are stored on memory

! Memory Overview. ! ROM Memories. ! RAM Memory " SRAM " DRAM. ! This is done because we can build. " large, slow memories OR

CS24: INTRODUCTION TO COMPUTING SYSTEMS. Spring 2017 Lecture 13

Topic #6. Processor Design

Embedded Systems: Hardware Components (part I) Todor Stefanov

MicroProcessor. MicroProcessor. MicroProcessor. MicroProcessor

UNIT - V MEMORY P.VIDYA SAGAR ( ASSOCIATE PROFESSOR) Department of Electronics and Communication Engineering, VBIT

Chapter 4. The Processor

CPE300: Digital System Architecture and Design

CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS

The DRAM Cell. EEC 581 Computer Architecture. Memory Hierarchy Design (III) 1T1C DRAM cell

a) Memory management unit b) CPU c) PCI d) None of the mentioned

Memory in Digital Systems

COMPUTER ORGANIZATION AND ARCHITECTURE

Chapter 5. Internal Memory. Yonsei University

RISC (Reduced Instruction Set Computer)

Logic and Computer Design Fundamentals. Chapter 8 Memory Basics

Sense Amplifiers 6 T Cell. M PC is the precharge transistor whose purpose is to force the latch to operate at the unstable point.

Basic Organization Memory Cell Operation. CSCI 4717 Computer Architecture. ROM Uses. Random Access Memory. Semiconductor Memory Types

DP8420V 21V 22V-33 DP84T22-25 microcmos Programmable 256k 1M 4M Dynamic RAM Controller Drivers

Transcription:

UCAS-6 6 > Stanford > Imperial > Verify 2011 Marching Memory マーチングメモリ Tadao Nakamura 中村維男 Based on Patent Application by Tadao Nakamura and Michael J. Flynn 1 Copyright 2010 Tadao Nakamura

C-M-C Computer Communication Memory CPU 2 Copyright 2010 Tadao Nakamura

HPC Today and beyond! 1 EFlops Ideal This Jump is realized by the Marching Memory Processing Speed 100 PFlops 10 PFlops 1 PFlops 100 TFlops Maximum of HPC Today s 10 TFlops SX Mainframe Today s Applications 3 Copyright 2010 Tadao Nakamura

Computer System Tunable Energy Efficiency η 1, η 2,, η 9 at Levels 1-9 Level 9: Algorithms for Applications Level 8: High Level Languages Level 7: Programs Level Level 6: Operating Systems Level 5: Compilers Level 4: Instruction Set Architectures Level 3: Microarchitectures Level 2: Logic Level 1: Circuits Level 0: Devices COOL Software Software Domain Instruction Set Architecture (Old Definition of Architecture) Hardware Domain COOL Chips Architecture Given Energy Efficiency η 0 at Level 0 Energy Efficiency at All the Levels for COOL Systems 4 Copyright 2010 Tadao Nakamura

Energy Consumption Power Dynamic Static Device level energy consumption Time Power Net energy consumed Overhead energy consumed Power Real energy consumption Net energy consumed Time Ideal energy consumption Time 5 Copyright 2010 Tadao Nakamura

Difference Among Shift Registers, Delay Line Memory, CCD and Marching Memory Examples of streaming memory are shift registers, delay line memory and CCDs. However, these differ from marching memory in the content, and scale s of streaming memory. Shift registers: Bit manipulation in a least unit, word of memory, performing multiplication and division in the binary system and serial parallel and parallel- serial conversions Delay line Memory: Feedbacked delay lines are used to keep one bit in it. CCD: transmitting analog signalized electric charge to the output after photo-electric conversion on the image receptor. Marching memory: DRAM type memory with storage and marching functions of information / data. 6 Copyright 2010 Tadao Nakamura

Concept of Conventional Memory Data accessed by addressing with several procedures and complicated hardware and lots of wires Von Neumann bottleneck Conventional memory C P U Memory access time is 1000 times the clock cycle of the CPU The memory bottleneck as a memory wall 7 Copyright 2010 Tadao Nakamura

Concept of Marching Memory Data marches, column by column, to the DRAM pins / edge No long addressing / sense lines. Marching memory Information / Data marching C P U 1 memory unit marching time = CPU s clock cycle Any portion available to be off for energy saving, which means that t all the cells / memory units are not always active through the control. c 8 Copyright 2010 Tadao Nakamura

Computer Organization Alternatives Cache Memory CPU Marching Memory memory CPU CPU (a) Computer organizations Increasing bandwidth The memory bottleneck as a memory wall (b) Bandwidths 9 Copyright 2010 Tadao Nakamura

Organization Consisted of Marching Memory and ALUs / Arithmetic Pipelines. Conventional main memory Conventional cache Conventional Register file Marching memory appears as a huge DRAM vector register file ALU / Arithmetic pipeline 10 Copyright 2010 Tadao Nakamura

Kinds of Marching Memory 1. Simple Marching Memory for vector data and streaming data 1-1. 1. Extended Simple Marching Memory for random access with high locality of data 2. Complex Marching Memory for random access 11 Copyright 2010 Tadao Nakamura

Classification of Marching Memory Definition: The abbreviation of Marching Memory is MM 1) Simple MM Sequential Access Mode As a core of a complex MM <<<<< As an independent MM unit for streaming/vector data only As an independent MM unit for sequential programs only 2) Extended simple MM Random Access Mode with High Locality As a core of a complex MM <<<<< As an independent MM unit for data with high locality As an independent MM unit for programs without long distance branches 3) Complex MM Random Access Mode with Low Locality As an independent MM unit for huge programs with long distance branches As an independent MM unit for general types of data 12 Copyright 2010 Tadao Nakamura

Structure of a Simple Marching Memory From the previous chip To the arithmetic units Almost no long wires on a chip because of minimum addressing overhead, no precharge,, no activation Information / Data marching Any portion available to be off for energy saving, which means that all the cells / memory units are not always active through the control. 13 Copyright 2010 Tadao Nakamura

Circuit for a simple marching memory for vector and streaming data Space Clock AND gate with delay Memory Write Memory Read Capacitor Cell 1 Cell 2 Cell 3 Cell 4 time=t time = t+1 (a) Memory scheme Logical value Cell 5 Cell position & Time 1 0 Time t t+τ Memory shift Memory stay (a) Clock 14 Copyright 2010 Tadao Nakamura

One stage of Marching Memory consisting of an AND gate and one capacitor Vague part! Space Clock AND gate with delay Memory Write Capacitor Cell 1 Cell 2 time=t time = t+1 Cell position & Time 15 Copyright 2010 Tadao Nakamura

How to implement Marching Memories 1. Reliable implementation by 2-dimensional 2 arrays of flip-flops flops 2. A definite way of using memory cells that have one normal AND gate consisting of several CMOS FETs,, and one capacitor 3. An expected way of using memory cells that have one (special) AND gate function MOS FET and one capacitor 16 Copyright 2010 Tadao Nakamura

Extended simple marching memory Information shifting (a) Shifting (marching) right (b) Staying Information shifting (c) Shifting (marching) left 17 Copyright 2010 Tadao Nakamura

Circuit for extended simple marching memory Clock Selector Space Program shifted left Clock not issued then Program Stayed AND gate Program with delay shifted right AND gate with delay Memory Read/Write Memory Read/Write Capacitor Cell 1 Cell 2 Cell 3 Cell 4 Cell 5 time=t time = t+1 Cell position & Time 18 Copyright 2010 Tadao Nakamura

Accessing data in marching memory requires marching to the correct column Position indexes (tags) relative in marching information / data, that are created dynamically with variables by a counter Contents 19 Copyright 2010 Tadao Nakamura

Operation of marching memories in action Instruction position index I v I s (a) For instructions b a Scalar / vector instruction Position index for data of the scalar instruction Contents t s r (b) For scalar data q p o Starting point position index for vector data and streaming data (c) For vector data and streaming data 20 Copyright 2010 Tadao Nakamura

Pragmatics of marching memories in general uses Position used Contents Instruction from/to the next column Scalar data from/to the next column Position not used Vector data from the next column Instructions marching (a) For instructions Scalar data marching (b) For scalar data Vector data marching (c) For vector data Instruction to the CPU,, and from/to the next column Scalar data to ALU, and from/to the next column Position used at the start point of a vector data Vector data to a pipeline 21 Copyright 2010 Tadao Nakamura

Comparison of MM with conventional DRAM in a memory cycle in terms of hardware quantity Sense amplifiers Memory space Column decoder A wordline Conventional DRAM Row decoder Bitline 1 or Bitline 2 or Bitline 3 or Bitline 4 One stage of MM Marching memory MM 22 Copyright 2010 Tadao Nakamura

Pipelining in MM but not so serious in energy consumption due to shorter processing time and if any, lower clock frequency even still higher Addressing Data sense Column access Data restore Bitline precharge time Pipelining in MM storage process time 23 Copyright 2010 Tadao Nakamura

Comparison of MM with conventional memory in a typical read cycle Addressing Data sense Possible? And besides refresh! Column access Data restore Memory access of ordinal DRAM Bitline precharge time Data access / One stage march in Marching Memory Memory access of Marching memory time 24 Copyright 2010 Tadao Nakamura

(Extended) Simple marching memory speed 1 memory unit time in existing memory Memory unit in existing memory s s speed 100 memory unit time in marching memory Memory units in marching memory s s speed The other 99 memory units of marching memory could be available in 100 memory units. 25 Copyright 2010 Tadao Nakamura

Scalar data for threads in a process, needs compiler support Memory unit in existing memory Time multiplex thread data Memory units in marching memory 26 Copyright 2010 Tadao Nakamura

Complex marching memory 1. Sometimes it is required to access a remote column 2. Use multiple (extended) simple marching memory cores with interconnection to a single simple marching memory to interface to arithmetic 3. Provides random access facility among multiple (extended) simple marching memory cores 27 Copyright 2010 Tadao Nakamura

Comparing conventional DRAM Device to complex (multi-core) marching memory Row decoder Bit line Word line Sense amplifiers 256 Kbit marching memory core which is either a simple or extended marching memory 256 Mbit DRAM Device / 256 Mbit Marching memory consisting of 1000 marching memory cores Column decoder 28 Copyright 2010 Tadao Nakamura

Structure of conventional DRAM architecture 256 Kbit-core DRAM device Rank Rank Dual in-line memory module 29 Copyright 2010 Tadao Nakamura

Structure of conventional DRAM core 256 Kbit-core Bitline pitch is 90 nm 90 nm*512 bitlines = 46 um Wordline Wordline pitch 135 nm 135 nm*512 wordlines = 69 um DRAM device Bitline Core 30 Copyright 2010 Tadao Nakamura

Structure of complex (multi-core) marching memory 256 Kbit-simple simple/extended marching memory core pins Complex marching memory Rank Rank Dual in-line marching memory module 31 Copyright 2010 Tadao Nakamura

Complex Marching Memory Core -Its Basics- 256 Kbit simple / extended marching memory core that has 1000 marching columns, each of which consists of 256 bits (32 Bytes). Contents Position Index/Tag (For example, The token of the column that means the first address of the column bytes) 32 Copyright 2010 Tadao Nakamura

Interconnecting 1000 simple MM cores in complex marching memory One solution : Simple marching memory cores are interconnected to a single simple marching memory as the interface between the multiple cores and the arithmetic units Data is transferred column by column c between cores. Simple marching memory contents transfer interconnect Arithmetic units Simple marching memory Multi-core controller based upon compiler analysis 256 Kbit (extended) simple marching memory core 33 Copyright 2010 Tadao Nakamura

Features of Marching Memory Row decoder Wordlines Column decoder Bitlines Sense Amplifiers And then Disappeared! None of addressing, sensing, restore, precharge and refresh exist MM s Downsizing with much less hardware No long wires even among complex MMs, Simple structure of MM systems total As a result High speed Large capacity Low energy consumption Low cost 34 Copyright 2010 Tadao Nakamura

Conclusions We have presented two types of marching memory (MM) using DRAM type technology 1. Simple MM is fast and suitable for streaming applications 1-1. Extended Simple MM for random access with high locality of data 2. Complex MM for random access in a program uses multiple (extended) simple MM cores This is slower and requires multi-core interconnection networks. More research is needed. MM has potential for faster memory technology especially with good compiler support. 35 Copyright 2010 Tadao Nakamura