High-Performance Architectures for Embedded Memory Systems

Similar documents
High-Performance Architectures for Embedded Memory Systems

Vector IRAM: A Microprocessor Architecture for Media Processing

Memory Systems IRAM. Principle of IRAM

A New Direction for Computer Architecture Research

Vector Architectures Vs. Superscalar and VLIW for Embedded Media Benchmarks

Computer Architecture

Hakam Zaidan Stephen Moore

A New Direction for Computer Architecture Research

Data Parallel Architectures

Computer Architecture. Introduction. Lynn Choi Korea University

1. Microprocessor Architectures. 1.1 Intel 1.2 Motorola

The Processor: Instruction-Level Parallelism

Computer Architecture. Fall Dongkun Shin, SKKU

Computer Architecture: Multi-Core Processors: Why? Onur Mutlu & Seth Copen Goldstein Carnegie Mellon University 9/11/13

This Unit: Putting It All Together. CIS 501 Computer Architecture. What is Computer Architecture? Sources

The Future of Microprocessors Embedded in Memory

This Unit: Putting It All Together. CIS 371 Computer Organization and Design. Sources. What is Computer Architecture?

A Media-Enhanced Vector Architecture for Embedded Memory Systems

Unit 11: Putting it All Together: Anatomy of the XBox 360 Game Console

International Journal of Modern Trends in Engineering and Research e-issn No.: , Date: 2-4 July, 2015.

Understanding The Behavior of Simultaneous Multithreaded and Multiprocessor Architectures

CPI < 1? How? What if dynamic branch prediction is wrong? Multiple issue processors: Speculative Tomasulo Processor

CSE502 Graduate Computer Architecture. Lec 22 Goodbye to Computer Architecture and Review

CS425 Computer Systems Architecture

Evolution of Computers & Microprocessors. Dr. Cahit Karakuş

Multithreading: Exploiting Thread-Level Parallelism within a Processor

EECC551 - Shaaban. 1 GHz? to???? GHz CPI > (?)

Computer Architecture: Multi-Core Processors: Why? Prof. Onur Mutlu Carnegie Mellon University

Integrated circuit processing technology offers

Processor (IV) - advanced ILP. Hwansoo Han

Computer Organization and Design, 5th Edition: The Hardware/Software Interface

Storage I/O Summary. Lecture 16: Multimedia and DSP Architectures

Several Common Compiler Strategies. Instruction scheduling Loop unrolling Static Branch Prediction Software Pipelining

Outline Marquette University

Embedded Systems. 7. System Components

Lecture 1: Introduction

This Unit: Putting It All Together. CIS 371 Computer Organization and Design. What is Computer Architecture? Sources

Intel released new technology call P6P

One-Level Cache Memory Design for Scalable SMT Architectures

Advanced d Instruction Level Parallelism. Computer Systems Laboratory Sungkyunkwan University

CSE 141: Computer Architecture. Professor: Michael Taylor. UCSD Department of Computer Science & Engineering

Architectural Issues for the 1990s. David A. Patterson. Computer Science Division EECS Department University of California Berkeley, CA 94720

CMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading)

Simultaneous Multithreading on Pentium 4

Thread-level Parallelism for the Masses. Kunle Olukotun Computer Systems Lab Stanford University 2007

CPI IPC. 1 - One At Best 1 - One At best. Multiple issue processors: VLIW (Very Long Instruction Word) Speculative Tomasulo Processor

Motivation for Parallelism. Motivation for Parallelism. ILP Example: Loop Unrolling. Types of Parallelism

Computer Architecture!

Embedded Systems. 8. Hardware Components. Lothar Thiele. Computer Engineering and Networks Laboratory

EE282 Computer Architecture. Lecture 1: What is Computer Architecture?

CS 654 Computer Architecture Summary. Peter Kemper

Fundamentals of Computers Design

Advanced Processor Architecture. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University

Kaisen Lin and Michael Conley

Simultaneous Multithreading (SMT)

Computer Organization and Design THE HARDWARE/SOFTWARE INTERFACE

Introduction to Multicore architecture. Tao Zhang Oct. 21, 2010

Simultaneous Multithreading (SMT)

Fundamentals of Quantitative Design and Analysis

Instructor Information

Computer Architecture 计算机体系结构. Lecture 9. CMP and Multicore System 第九讲 片上多处理器与多核系统. Chao Li, PhD. 李超博士

PERFORMANCE ANALYSIS OF AN H.263 VIDEO ENCODER FOR VIRAM

15-740/ Computer Architecture Lecture 21: Superscalar Processing. Prof. Onur Mutlu Carnegie Mellon University

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 6. Parallel Processors from Client to Cloud

Computer Architecture!

Exploiting ILP, TLP, and DLP with the Polymorphous TRIPS Architecture

High-Performance Processors Design Choices

Parallel Computer Architecture

Intel Enterprise Processors Technology

Hardware/Compiler Codevelopment for an Embedded Media Processor

High-Performance Microarchitecture Techniques John Paul Shen Director of Microarchitecture Research Intel Labs

Multi-Core Microprocessor Chips: Motivation & Challenges

Dual-Core Execution: Building A Highly Scalable Single-Thread Instruction Window

Advanced d Processor Architecture. Computer Systems Laboratory Sungkyunkwan University

Simultaneous Multithreading and the Case for Chip Multiprocessing

Advanced Processor Architecture

Computer Architecture Lecture 12: Out-of-Order Execution (Dynamic Instruction Scheduling)

Lecture 14: Multithreading

Evaluation of Existing Architectures in IRAM Systems

Computer Architecture s Changing Definition

Keywords and Review Questions

Lec 25: Parallel Processors. Announcements

Multiple Issue ILP Processors. Summary of discussions

Computer Architecture

Getting CPI under 1: Outline

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING UNIT-1

Real Processors. Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University

Computer Architecture!

15-740/ Computer Architecture Lecture 8: Issues in Out-of-order Execution. Prof. Onur Mutlu Carnegie Mellon University

Introduction to the MMAGIX Multithreading Supercomputer

Computer Architecture Lecture 14: Out-of-Order Execution. Prof. Onur Mutlu Carnegie Mellon University Spring 2013, 2/18/2013

2 Improved Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers [1]

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 1. Computer Abstractions and Technology

Microarchitecture Overview. Performance

Lecture 1: Course Introduction and Overview Prof. Randy H. Katz Computer Science 252 Spring 1996

Simultaneous Multithreading (SMT)

Embedded DRAM Architectural Trade-Offs

Hyperthreading Technology

Reference. T1 Architecture. T1 ( Niagara ) Case Study of a Multi-core, Multithreaded

Jim Keller. Digital Equipment Corp. Hudson MA

Transcription:

High-Performance Architectures for Embedded Memory Systems Computer Science Division University of California, Berkeley kozyraki@cs.berkeley.edu http://iram.cs.berkeley.edu/~kozyraki

Embedded DRAM systems roadmap This presentation Previous talks Desktops? Portable Computers? Network Computers? Cameras/Phones Video Games Graphics/Disks 8 MB 2 MB 32 MB DAC 99, Embedded Memory Tutorial Page 2

Outline Overview of general-purpose processors today Future processor applications & requirements Advantages and challenges of processor-dram integration Future microprocessor architectures characteristics and features compatibility and interaction with embedded DRAM technology Comparisons and conclusions DAC 99, Embedded Memory Tutorial Page 3

Current state-of-the-art processors High-Performance Embedded Data width 64 bit 32 or 64 bit Issue width 3-6 1-2 Execution out-of-order in-order Cache large multi-level small single-level Additional features parallel systems support SIMD extensions integrated I/O controllers on-chip DRAM Metrics peak performance price/performance MIPS/Watt code density Frequency 300 to 600 MHz 40 to 400 MHz Die area 200 to 300 sq. mm 10 to 100 sq. mm Power 20 to 80 Watts 0.3 to 4 Watts Examples Alpha 21264, MIPS R12K, Pentium III, Sparc III ARM-9, MIPS R5K, M32R, SH-4 DAC 99, Embedded Memory Tutorial Page 4

Current microprocessor applications Major applications Benchmarks High-Performance Desktop technical workloads (CAD) office productivity tools Server file systems transaction processing decision support SPEC95 (Int/FP) TPC-C/D (OLTP, decision support) Embedded postscript printers digital cameras networking hardware PDAs disk drives DhryStone DAC 99, Embedded Memory Tutorial Page 5

High-performance microprocessors: power/performance 35 Integer Performance (Spec95) 30 25 20 15 21264 21164 PowerPC G4 PA-8500 R12000 PII-Xeon Ultra II 10 0 20 40 60 80 Power Consumption (Watts) DAC 99, Embedded Memory Tutorial Page 6

Embedded microprocessors: power/performance Performance (DhryStone MIPS) 500 400 300 200 100 M32R/D SA-110 ARM7 VR5464 SH-4 0 0 1 2 3 4 5 Power Consumption (Watts) DAC 99, Embedded Memory Tutorial Page 7

High-performance microprocessors: transistor use PII-Xeon 4.5 3 R12000 2.5 4.7 PA-8500 4 126 Logic Cache PowerPC 750 2.7 3.8 21264 6 9.2 0 5 10 15 20 Million Transistors DAC 99, Embedded Memory Tutorial Page 8

Future microprocessor applications Mobile multimedia computing A single device is: PDA, video game, cell phone, pager, GPS, tape recorder, radio, TV remote Basic interfaces: voice (speech recognition) and image (image/video processing) Small size, battery operated devices Media processing functions are the basic workload DAC 99, Embedded Memory Tutorial Page 9

Requirements on microprocessors High performance for multimedia: real-time performance guarantees support for continuous media data-types fine-grain parallelism coarse-grain parallelism high instruction reference locality, code density high memory bandwidth Low power/energy consumption Low design/verification complexity, scalable design Small size/chip count DAC 99, Embedded Memory Tutorial Page 10

Embedded DRAM advantages summary High memory bandwidth Low memory latency slower than on-chip SRAM though Energy/power efficiency System size benefits Custom configuration interface width, size, number of banks etc DAC 99, Embedded Memory Tutorial Page 11

Embedded DRAM challenges summary edram cost wafer cost cost per DRAM bit (density) yield cost of testing Performance, power and yield of logic components Complexity Optimal memory organization DAC 99, Embedded Memory Tutorial Page 12

Trends in high-performance architecture Advanced superscalar processors VLIW: Very long instruction word processors (IA-64/EPIC) Single chip multiprocessors Reconfigurable processors Vector microprocessors (Vector IRAM) DAC 99, Embedded Memory Tutorial Page 13

Advanced superscalar processors Scale up current designs to issue more instructions (16-32) Trace Cache Multi-Hybrid Branch Predictor I Cache Reservation Stations 2nd Level FU2 FU1... FUn Cache Multibank D Cache Major features: dynamic instruction scheduling in hardware, out-of order execution branch/dependence/stride/data/trace prediction buffers large multibank caches DAC 99, Embedded Memory Tutorial Page 14

Advanced superscalar processors (2) Advantages dynamic scheduling exploits run-time info software compatibility high-performance for current desktop applications Disadvantages relies on high-speed logic and fast, large caches unpredictable performance (high misprediction cost) limited media processing support (MMX-like units) high design/verification complexity high power consumption due to extensive speculation edram perspective cannot fully utilize available edram bandwidth edram unfriendly environment (power, complexity, size) edram for second-level cache? DAC 99, Embedded Memory Tutorial Page 15

VLIW processors (EPIC) Very long instruction word scheme 128-bit long instruction Instruction 0 Instruction 1 Instruction 2 Template Major features instruction scheduling by compiler (dependence analysis, register renaming etc) template carries dependence information large number of registers multiple functional units software speculation and predicated (conditional) execution cache based designs DAC 99, Embedded Memory Tutorial Page 16

VLIW processors (EPIC) (2) Advantages simpler hardware highly scalable Disadvantages ability of compiler to optimize for performance (lack of run-time information)? code size (loop unrolling, software pipelining) software compatibility limited media processing support (MMX-like units) edram perspective cannot fully utilize available edram bandwidth requires high-speed logic to make up for run-time information edram for second-level cache? DAC 99, Embedded Memory Tutorial Page 17

Single chip multiprocessors Place multiple processors on a single chip Dual Dual Dual Dual Issue Issue Issue Issue Processor Processor Processor Processor D Cache I Cache D Cache I Cache D Cache I Cache D Cache I Cache 2nd Level Cache Major features symmetric multiprocessor system (shared memory system) shared second-level cache 4 to 8 uniprocessors, similar to current out-of-order designs DAC 99, Embedded Memory Tutorial Page 18

Single chip multiprocessors (2) Advantages modular design coarse-grained parallelism Disadvantages difficulty of efficient parallel programming lack of parallelizing compilers high power consumption complexity of shared-memory protocols edram perspective can utilize bandwidth of multi-bank edram inherent redundancy multiprocessors require large amount of memory DAC 99, Embedded Memory Tutorial Page 19

Vector microprocessors Vector instructions for i=1 to N : v3[i]=v1[i]+v2[i] Major features vector coprocessor unit instructions define operations on vectors (arrays) of data vector register file strided and indexed memory accesses support for multiple data widths support for DSP/fixed-point conditional/speculative execution support through flag registers VECTOR (N operations) DAC 99, Embedded Memory Tutorial Page 20 v1 + v3 v2 vector length add.vv v3, v1, v2

Vector microprocessors (2) Advantages predictable performance: in-order model, no caches high performance for media processing simple design: no complex issue/speculation logic low power/energy consumption performance through parallel pipelines, not just clock frequency scalable small code size: single instruction loops Disadvantages cannot utilize random instruction-level or thread-level parallelism; just fine-grain parallelism poor performance for many current desktop applications requires high-bandwidth memory system DAC 99, Embedded Memory Tutorial Page 21

Vector processors and edram Vector processors require multi-bank, high-bandwidth memory system: multiple wide DRAM banks, crossbar interconnect Vector processors can tolerate DRAM latency delayed or decoupled vector pipelines edram friendly environment low power, low complexity, modest clock frequencies edram testing use vector processor as BIST engine; 10x faster than scalar processors Logic redundancy use a redundant vector pipeline DAC 99, Embedded Memory Tutorial Page 22

Architecture combinations Vector and SIMD can be combined with superscalar, VLIW and chip multiprocessor Examples: x86 with media extensions, Philips Tri-media Goal: get the best of both worlds superscalar or VLIW for irregular parallelism vector or SIMD for regular (data) parallelism Open questions what is the right processing power balance how to achieve efficient interaction between the two components DAC 99, Embedded Memory Tutorial Page 23

Vector IRAM-1 Scalar core 2-way superscalar MIPS 16KByte I/D caches Memory system 24 MBytes DRAM multi-bank memory system 256-bit synchronous interface crossbar interconnect for 25.6 GB/sec aggregate bandwidth Vector coprocessor 8-KByte vector register file, 32 vector registers supports 64b, 32b, 16b data 2 arithmetic, 2 load/store, 2 flag processing units 4 64-bit pipelines per functional unit separate multi-ported TLB fixed-point arithmetic support no caches; pipeline enhanced to tolerate DRAM latency DAC 99, Embedded Memory Tutorial Page 24

Vector IRAM-1 Block Diagram DAC 99, Embedded Memory Tutorial Page 25

VIRAM-1 Technology Summary Technology: 0.18 micron embedded DRAM-logic process Memory: 24 MBytes Die size: 400 mm 2 Vector pipelines: 4 64-bit (or 8 32-bit or 16 16-bit) Clock Frequency: 200MHz scalar, vector, DRAM Power: ~2 W Performance: 3.2 GFLOPS 32 6.4 GOPS 16 DAC 99, Embedded Memory Tutorial Page 26

VIRAM-1 Floorplan DAC 99, Embedded Memory Tutorial Page 27

Comparison: current desktop and servers SPEC Int SPEC FP TPC (DB) SW Effort Design Scalability Design Complexity SS VLIW CMP RC VIRAM Legend: strength, neutral, weakness DAC 99, Embedded Memory Tutorial Page 28

Comparison: mobile multimedia computing Real-time perf. Cont. data support Energy/Power Code Size Fine-grain parall. Coarse-grain parall Memory BW Design scalbility Design Complixity SS VLIW CMP RC VIRAM DAC 99, Embedded Memory Tutorial Page 29

Comparison: edram perspective BW utilization Latency tolerance Power consumption Need for fast logic DRAM testing Logic redundancy Design scalbility Design complixity SS VLIW CMP RC VIRAM DAC 99, Embedded Memory Tutorial Page 30

Summary (1/3) Future microprocessor applications designed for mobile multimedia devices media performance, energy efficiency and reduced complexity are major requirements Embedded DRAM advantages high memory bandwidth system-on-a-chip benefits Microprocessor architecture approaches advanced superscalar VLIW single chip multiprocessor vector microprocessor DAC 99, Embedded Memory Tutorial Page 31

Summary (2/3) Superscalar & VLIW best with desktop/server applications not good match for embedded DRAM technology Chip Multiprocessor & vector microprocessor modular/scalable designs can utilize edram benefits and address challenges Vector microprocessor high media performance simple, energy efficient hardware great match for edram DAC 99, Embedded Memory Tutorial Page 32

Summary (3/3) Unlikely that edram will make it in the desktop high-performance microprocessors (at least for a while...) Processor architectures developed for mobile media devices address many of the edram challenges Cost and commercial success of edram based processors remains to be seen... DAC 99, Embedded Memory Tutorial Page 33

References (1) C. Kozyrakis, D. Patterson, A New Direction in Computer Architecture Research, IEEE Computer, vol. 31, no. 11, November 1998 Computer Architecture D. Patterson, L. Hennessy, Computer Organization and Design: The Hardware/Software Interface, 2nd edition, 1997, Morgan Kaufmann L. Hennessy, D. Patterson, Computer Architecture: A Quantitative Approach, 2nd edition, 1995, Morgan Kaufmann High-performance Processors R. E. Kessler, The Alpha 21264 Microprocessor: Out-Of-Order Execution at 600 Mhz, Hot Chips Conference Record, August 1998 Gary Lauterbach, UltraSPARC-III: A 600 MHz 64-bit Superscalar Processor for 1000- way Scalable Systems, Hot Chips Conference Record, August 1998 M. Choudhury et.al, A 300MHz CMOS Microprocessor with Multi-Media Extensions, Digest of Technical Papers, ISSCC, February 1997 DAC 99, Embedded Memory Tutorial Page 35

References (2) Embedded Processors M. Schlett, Trends in Embedded Microprocessor Design, IEEE Computer, vol. 31, no. 8, August 1998 Toru Shimizu, The M32Rx/D - A Single Chip Microcontroller With a 4MB Internal DRAM, Hot Chips Conference Record, August 1998 J. Choquette, Genesis microprocessor, Hot Chips Conference Record, August 1998 F. Arakawa et.al, SH4 RISC Multimedia Microprocessor, IEEE Micro, vol. 18, no. 2, March 1998 T. Litch et.al, StrongARMing Portable Communications, IEEE Micro, vol. 18, no. 2, March 1998 L. Goudge, S. Segars, Thumb: reducing the cost of 32-bit RISC performance in portable and consumer applications, In the Digest of Papers, COMPCON '96, February 1996 Embedded DRAM D. Patterson et.al, Intelligent RAMs, Digest of Technical Papers, ISSCC, February 1997 R. Fromm et.al., The Energy Efficiency of IRAM Architectures, The 24th International Symposium on Computer Architecture, June 1997 DAC 99, Embedded Memory Tutorial Page 36

References (3) K. Murakami et.al, Parallel Processing RAM Chip with 256Mb DRAM and Quad Processors, Digest of Technical Papers, ISSCC, February 1997 Tadaaki Yamauchi et.al., The Hierarchical Multi-Bank DRAM: A High-Performance Architecture for Memory Integrated with Processors, Proceedings of the 19th Conference on Advanced Research in VLSI, September 1997 J. Dreibelbis et.al, An ASIC Library Granular DRAM Macro with Built-In Self Test, Digest of Technical Papers, ISSCC, February 1998 T. Yabe et.al, A Configurable DRAM Macro Design for 2112 Derivative Organizations, Digest of Technical Papers, ISSCC, February 1998 A. Saulsbury, A. Nowatzyk, Missing the memory wall: the case for processor/memory integration, in proceedings of the 23rd Annual International Conference on Computer Architecture, May 1996 T. Sunaga et.al, A parallel processing chip with embedded DRAM macros. IEEE Journal of Solid-State Circuits, vol. 31, no.10, October 1996 D. Elliott et.al, Computational RAM: a memory-simd hybrid and its application to DSP, Proceedings of the IEEE 1992 Custom Integrated Circuits Conference, May 1992 DAC 99, Embedded Memory Tutorial Page 37

References (4) N. Bowman et.al, "Evaluation of Existing Architectures in IRAM Systems", Workshop on Mixing Logic and DRAM: Chips that Compute and Remember at ISCA '97, June 1997 NEC Corp., Virtual Channel Memory Technology, http://www.nec.com Miyano S. et.al: "A 1.6Gbyte/s Data Transfer Rate 8 Mb Embedded DRAM", IEEE Journal of Solid-State Circuits, v. 30, no. 11, November 1995 Microprocessor Architecture Trends T. Mudge, Strategic Directions in Computer Architecture, ACM Computing Surveys, 28(4):671-678, December 1996 J. Crawford, J. Huck, Motivations and Design Approach for the IA-64 64-Bit Instruction Set Architecture, In the Proceedings of the Microprocessor Forum, October 1997 Y.N. Patt et.al., One Billion Transistors, One Uniprocessor, One Chip, IEEE Computer, 30(9):51-57, September 1997 M. Lipasti, L.P. Shen, Superspeculative Microarchitecture for Beyond AD 2000, IEEE Computer, 30(9):59-66, September 1997 DAC 99, Embedded Memory Tutorial Page 38

References (5) J. Smith, S. Vajapeyam, Trace Processors: Moving to Fourth Generation Microarchitectures, IEEE Computer, 30(9):68-74, September 1997 S.J. Eggers et.al., Simultaneous Multithreading: a Platform for Next-Generation Processors. IEEE MICRO, 17(5):12-19, October 1997 L. Hammond et.al., A Single-Chip Multiprocessor. IEEE Computer, 30(9):79-85, September 1997 E. Waingold et.al, Baring It All to Software: Raw Machines, IEEE Computer, 30(9):86-93, September 1997 S.J. Eggers et.al, Simultaneous Multithreading: a Platform for Next-Generation Processors, IEEE MICRO, 17(5):12-19, October 1997 C.E. Kozyrakis et.al., Scalable Processors in the Billion-Transistor Era: IRAM, IEEE Computer, 30(9):75-78, September 1997 K. Olukotun et.al., Improving the Performance of Speculatively Parallel Applications on the Hydra CMP, In the Proceedings of the 1999 ACM International Conference on Supercomputing, June 1999. R. Barua et.al., Maps: A Compiler-Managed Memory System for Raw Machines, In the Proceedings of the 26th International Symposium on Computer Architecture June1999. S. Goldstein et.al., "PipeRench: a Coprocessor for Streaming Multimedia Acceleration,In the Proceedings of the 26th International Symposium on Computer Architecture June1999. DAC 99, Embedded Memory Tutorial Page 39

References (6) General-purpose architectures and media-processing T. Conte et.al, Challenges to Combining General-Purpose and Multimedia Processors, IEEE Computer, vol. 30, no. 12, December 1997 K. Diefendorff, P. Dubey, How Multimedia Workloads Will Change Processor Design, IEEE Computer, 30(9):43-45, September 1997 DAC 99, Embedded Memory Tutorial Page 40