IMAGINE: Signal and Image Processing Using Streams

1 IMAGINE: Signal and Image Processing Using Streams
  Brucek Khailany, William J. Dally, Scott Rixner, Ujval J. Kapasi, Peter Mattson, Jinyung Namkoong, John D. Owens, Brian Towles
  Concurrent VLSI Architecture Group, Computer Systems Laboratory, Stanford University

2 Imagine: A Programmable Signal and Image Processor
  - Motivation
    - Applications poorly matched to conventional architectures
  - Key stream architecture features
    - High computational bandwidth (Imagine: 48 on-chip ALUs)
    - Stream register organization
    - Data bandwidth hierarchy
  - Performance density of a special-purpose processor
    - 0.59 cm² CMOS chip, 0.13 µm standard cell, 500 MHz
    - 20 GFLOPS peak performance (40 GOPS fixed point)
    - 10 GFLOPS sustained on several apps
    - > 2 GFLOPS/W, > 5 GOPS/W
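As a rough check on the headline figures, the peak rates can be converted to operations per clock cycle. This is a minimal sketch that uses only the numbers quoted on the slide; the variable names are illustrative and not part of any Imagine tool.

```python
# Figures quoted on the slide.
CLOCK_HZ = 500_000_000            # 500 MHz
PEAK_FLOPS = 20_000_000_000       # 20 GFLOPS peak
PEAK_FIXED_OPS = 40_000_000_000   # 40 GOPS fixed point
SUSTAINED_FLOPS = 10_000_000_000  # 10 GFLOPS sustained on several apps
NUM_ALUS = 48

# Operations that must complete every cycle to reach the peak rates.
flops_per_cycle = PEAK_FLOPS // CLOCK_HZ
fixed_ops_per_cycle = PEAK_FIXED_OPS // CLOCK_HZ

# Fraction of floating-point peak that the sustained figure represents.
sustained_fraction = SUSTAINED_FLOPS / PEAK_FLOPS

print(flops_per_cycle, fixed_ops_per_cycle, sustained_fraction)
```

The fixed-point figure works out to more operations per cycle than there are ALUs, which is presumably explained by the packed dual 16-bit and quad 8-bit instructions mentioned on the arithmetic cluster slide.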

3 Representative Applications
  - Stereo Depth Extraction
  - Polygon Rendering
  - MPEG Encoding/Decoding
  [Figure labels: Render; Encode/Decode; Encoded 2D Data; 2D Video Stream]

4 Stream Processing
  [Figure: legend Input Data / Kernel / Stream / Output Data. Image 0 -> convolve -> convolve and Image 1 -> convolve -> convolve, both feeding SAD -> Depth Map]
  - Little data reuse (pixels never revisited)
  - Highly data parallel (output pixels not dependent on other output pixels)
  - Compute intensive (60 arithmetic operations per memory reference)
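The dataflow in the figure (two convolve kernels per image, then a SAD kernel producing the depth map) can be sketched in plain Python. This is an illustrative model only: the 3-tap horizontal filter, the 3-pixel SAD window, the edge clamping, and all function names are assumptions chosen for brevity, not the kernels used on Imagine.

```python
def convolve_rows(image, taps):
    """Kernel: 1-D horizontal convolution of every row (edges clamped)."""
    half = len(taps) // 2
    out = []
    for row in image:
        w = len(row)
        out.append([
            sum(t * row[min(max(x + i - half, 0), w - 1)]
                for i, t in enumerate(taps))
            for x in range(w)
        ])
    return out


def sad_depth(left, right, max_disparity, window=1):
    """Kernel: per pixel, pick the disparity with the smallest
    sum of absolute differences (SAD) over a small window."""
    h, w = len(left), len(left[0])
    depth = []
    for y in range(h):
        row = []
        for x in range(w):
            best_d, best_cost = 0, None
            for d in range(max_disparity + 1):
                cost = 0
                for dx in range(-window, window + 1):
                    xl = min(max(x + dx, 0), w - 1)
                    xr = min(max(x + dx - d, 0), w - 1)
                    cost += abs(left[y][xl] - right[y][xr])
                if best_cost is None or cost < best_cost:
                    best_d, best_cost = d, cost
            row.append(best_d)
        depth.append(row)
    return depth


def stereo_depth(image0, image1, taps=(1, 2, 1), max_disparity=3):
    """Stream program: chain the kernels as in the figure."""
    a = convolve_rows(convolve_rows(image0, taps), taps)
    b = convolve_rows(convolve_rows(image1, taps), taps)
    return sad_depth(a, b, max_disparity)


# One-row example: the same feature appears 2 pixels further left in image 1.
left = [[0] * 12]
left[0][6] = 10
right = [[0] * 12]
right[0][4] = 10
print(stereo_depth(left, right)[0][6])
```

Each output pixel depends only on a small neighbourhood of the inputs and on no other output pixel, which is the data-parallel property the slide describes.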

5 Characteristics of Media Applications
  - Poorly matched to conventional architectures
    - Instruction-level parallelism
    - Caches
    - Few arithmetic units
  - Well matched to modern VLSI technology
    - Lots (100s) of ALUs fit on a single chip
    - Communication bandwidth is the scarce resource

6 Communication Bandwidth: Care and Feeding of ALUs
  - Special-purpose processors: ALUs fed by dedicated wires/memories
  - General-purpose processors: feeding structure dwarfs ALUs
  [Figure labels: IP; Instr. Cache; IR; Regs]

7 Stream Architecture Provides Data Bandwidth Hierarchy
  [Figure labels: SIMD/VLIW Control; Stream Register File]
  Peak BW: 2 GB/s, 32 GB/s, 544 GB/s

8 Application Data Bandwidth Usage
  [Figure label: Stream Register File]

                      Memory BW    Global RF BW   Local RF BW
  Peak                2 GB/s       32 GB/s        544 GB/s
  Depth Extractor     0.80 GB/s    (lost)         (lost)
  MPEG Encoder        0.47 GB/s    2.46 GB/s      (lost)
  Polygon Rendering   0.78 GB/s    4.06 GB/s      (lost)
  QR Decomposition    0.46 GB/s    3.67 GB/s      (lost)

  (lost) marks values whose digits did not survive transcription; only the unit remains in the source.
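To put the table in perspective, each surviving measurement can be divided by the peak bandwidth of its level. This is a small sketch using only the values that survive in the transcription; the dictionary layout and the helper name `utilization` are illustrative.

```python
# Peak bandwidth of each level of the hierarchy, in GB/s (from the slide).
PEAK_GBPS = {"memory": 2.0, "global_rf": 32.0, "local_rf": 544.0}

# Measured bandwidths in GB/s; values lost in transcription are omitted.
MEASURED_GBPS = {
    "Depth Extractor": {"memory": 0.80},
    "MPEG Encoder": {"memory": 0.47, "global_rf": 2.46},
    "Polygon Rendering": {"memory": 0.78, "global_rf": 4.06},
    "QR Decomposition": {"memory": 0.46, "global_rf": 3.67},
}


def utilization(app):
    """Fraction of peak bandwidth used at each level for one application."""
    return {level: bw / PEAK_GBPS[level]
            for level, bw in MEASURED_GBPS[app].items()}


# How much more bandwidth each level offers relative to memory.
peak_ratio = {level: bw / PEAK_GBPS["memory"] for level, bw in PEAK_GBPS.items()}

for app in MEASURED_GBPS:
    print(app, {k: round(v, 3) for k, v in utilization(app).items()})
print(peak_ratio)
```

The peak ratios come out to 1 : 16 : 272, which is the hierarchy the slide title refers to: each level closer to the ALUs supplies far more bandwidth than the one above it.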

9 Stream Register File: Details
  - SRF: single-ported 128 KB SRAM (1024 x 32W)
  - 32W/cycle
  [Figure labels: Arbiter; Stream buffers; To/From Arithmetic Clusters]
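The SRF dimensions on this slide can be checked against the quoted capacity. This sketch assumes that "W" means a 32-bit (4-byte) word, which matches the 32-bit instructions mentioned on the arithmetic cluster slide but is not stated here.

```python
ROWS = 1024
WORDS_PER_ROW = 32   # "1024 x 32W"
BYTES_PER_WORD = 4   # assumption: 32-bit words

total_words = ROWS * WORDS_PER_ROW
total_bytes = total_words * BYTES_PER_WORD

print(total_words // 1024, "K words")  # capacity in units of 1024 words
print(total_bytes // 1024, "KB")       # capacity in units of 1024 bytes
```

Under that assumption the array holds 32K words, or 128 KB, which agrees with both the "128KB" figure here and the "32kW SRAM" label on the later block diagram slide.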

10 Arithmetic Cluster: Details
  [Figure labels: Local Register File; To SRF; From SRF; *; *; /; CU; Intercluster Network; Cross Point]
  - Units support floating-point / 32-bit / dual 16-bit / quad 8-bit instructions
  - 4-cycle pipelined FMUL, FADD, FSUB, FTOI, ITOF, FFRAC
  - 17-cycle FDIV (pipelined for 1 FDIV every 7 cycles)
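The latency and issue-rate figures above determine how long a run of independent operations takes on one unit. This is a simple timing sketch; it assumes the 4-cycle units accept a new operation every cycle (the slide says only "pipelined"), and the function name is illustrative.

```python
def cycles_for(n_ops, latency, issue_interval):
    """Cycles until the last of n_ops independent operations completes
    on one pipelined unit: the first result appears after `latency`
    cycles, and each later operation starts `issue_interval` cycles
    after the previous one."""
    if n_ops <= 0:
        return 0
    return latency + issue_interval * (n_ops - 1)


# FDIV: 17-cycle latency, one new divide every 7 cycles (from the slide).
print(cycles_for(4, latency=17, issue_interval=7))

# FMUL/FADD: 4-cycle latency; issue interval of 1 is an assumption.
print(cycles_for(4, latency=4, issue_interval=1))
```

For long runs the cost per divide approaches the 7-cycle issue interval rather than the 17-cycle latency, which is why the pipelining of FDIV matters for kernels that divide often.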

11 Programming Environment
  [Figure labels: Compile-time; Run-time; Host; StreamC; C++ compiler; stream scheduler; KernelC; kernel scheduler; microcode]

  Stream-level program:

      StereoDepthExtraction( ) {
        // Load Input Images
        ...
        // Run Kernels
        convolve7x7 (RawImage,ConvImage);
        convolve3x3 (ConvImage,Conv2Image);
        ...
        // Store Output
      }

  Kernel:

      Convolve7x7( ) {
        ...
        while(!in.empty()) {
          ...
          p0 = k0 * in10;
          p12 = k21 * in32;
          p34 = k43 * in54;
          p56 = k65 * in76;
          sum = (p0 + p12) + (p34 + p56);
          ...
        }
      }
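The kernel loop on this slide consumes an input stream until it is empty and combines four partial products in a balanced tree. The sketch below imitates that shape in Python; it is not KernelC. Treating `p12`, `p34` and `p56` as two-element dot products, the zero-filled sliding window, and the name `convolve7_stream` are all assumptions made for illustration.

```python
from collections import deque


def convolve7_stream(in_stream, k):
    """Consume an input stream and emit one 7-tap weighted sum per element."""
    assert len(k) == 7
    in_q = deque(in_stream)
    window = deque([0] * 7, maxlen=7)  # oldest sample at index 0
    out = []
    while in_q:  # mirrors `while(!in.empty())` on the slide
        window.append(in_q.popleft())
        w = list(window)
        # Partial products, grouped like p0 / p12 / p34 / p56 on the slide.
        p0 = k[0] * w[0]
        p12 = k[1] * w[1] + k[2] * w[2]
        p34 = k[3] * w[3] + k[4] * w[4]
        p56 = k[5] * w[5] + k[6] * w[6]
        # Balanced add tree: the two inner sums are independent of each other.
        out.append((p0 + p12) + (p34 + p56))
    return out


print(convolve7_stream(range(1, 9), [1] * 7))
```

Writing the sum as `(p0 + p12) + (p34 + p56)` rather than as a left-to-right chain lets the two inner additions proceed in parallel on separate adders, which is presumably why the slide groups them that way.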

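12 Performance
  [Chart: performance in GOPS (axis to 20) for the applications depth, mpeg, qrd and the kernels dct, convolve, fft; the legend distinguishes fixed-point ("-bit", width lost) from floating-point applications and kernels; one bar label survives: 19.8]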

13 Sustained Application Performance
  - Stereo Depth Extraction: 320x240 8-bit grayscale, 200 frames/second
  - Polygon Rendering: 4.5 million vertices/sec, 5.1 million pixels/sec; SPECviewperf ADVS benchmark (unlit)
  - MPEG Encoding: 720x (height lost) (depth lost)-bit color, 120 frames/second
  [Figure labels: Render; Encode; 2D Video Stream; Encoded 2D Data]

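14 Power Estimates
  [Chart: estimated power in watts for depth, mpeg, qrd, dct, convolve, fft and the average, broken down into Clusters, SRF, Pins, Mem Sys, Clock and Other; the percentage labels 1%, 2%, 6% and 23% survive but cannot be matched to components; the GOPS/W values are lost]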

15 The Stream Processor
  [Block diagram labels: Host Processor; Stream Controller; Streaming Memory System; Stream Register File: 32 kW SRAM; Microcontroller: 2K VLIW instrs; Network Interface; Network; Stream Processor]

16 Floorplan
  - 22 million transistors
  - 500 MHz
  - TI GS30KA: 0.15 µm L drawn, 0.13 µm L eff CMOS process
  [Figure labels: Stream Controller; SRF Control; Network Interface; Micro-Controller; Memory System; Streambuffers; SRF; Streambuffers; die dimensions in mm lost]

17 VLSI Implementation: 22M Transistors with 7 grad students
  - Stream architecture reduces VLSI design complexity
    - Modularity / replication
    - Long wire delays converted to explicit communications
      - Exposed to microarchitecture, software
  - Design methodology
    - Standard ASIC flow with forced placement of datapaths
    - Bitslice Verilog
      - Improved area, delay
    - Pre-placement wire length estimates
      - Reduce design iterations

18 Status
  - Team accomplishments
    - Cycle-accurate simulator
    - Software tools
    - Completed synthesizable Verilog
    - Arithmetic units implemented in standard cells
  - Industrial partners
    - Texas Instruments: Fab
    - Intel
  - Future work
    - Circuits/Logic: expected completion 9/15/00
    - Tapeout: expected Q4/

19 Summary
  - Key stream architecture features
    - Stream register organization
    - Data bandwidth hierarchy
  - Performance density of a special-purpose processor
    - 10 GFLOPS sustained on several apps
    - > 2 GFLOPS/W, > 5 GOPS/W
  - VLSI implementation
    - Validate architectural concepts
    - Develop experimental prototype


IBM Cell Processor. Gilbert Hendry Mark Kretschmann IBM Cell Processor Gilbert Hendry Mark Kretschmann Architectural components Architectural security Programming Models Compiler Applications Performance Power and Cost Conclusion Outline Cell Architecture:

More information

The Processor: Instruction-Level Parallelism

The Processor: Instruction-Level Parallelism The Processor: Instruction-Level Parallelism Computer Organization Architectures for Embedded Computing Tuesday 21 October 14 Many slides adapted from: Computer Organization and Design, Patterson & Hennessy

More information

An Evaluation of an Energy Efficient Many-Core SoC with Parallelized Face Detection

An Evaluation of an Energy Efficient Many-Core SoC with Parallelized Face Detection An Evaluation of an Energy Efficient Many-Core SoC with Parallelized Face Detection Hiroyuki Usui, Jun Tanabe, Toru Sano, Hui Xu, and Takashi Miyamori Toshiba Corporation, Kawasaki, Japan Copyright 2013,

More information

Design Space Exploration for Memory Subsystems of VLIW Architectures

Design Space Exploration for Memory Subsystems of VLIW Architectures E University of Paderborn Dr.-Ing. Mario Porrmann Design Space Exploration for Memory Subsystems of VLIW Architectures Thorsten Jungeblut 1, Gregor Sievers, Mario Porrmann 1, Ulrich Rückert 2 1 System

More information

CAD for VLSI. Debdeep Mukhopadhyay IIT Madras

CAD for VLSI. Debdeep Mukhopadhyay IIT Madras CAD for VLSI Debdeep Mukhopadhyay IIT Madras Tentative Syllabus Overall perspective of VLSI Design MOS switch and CMOS, MOS based logic design, the CMOS logic styles, Pass Transistors Introduction to Verilog

More information

The T0 Vector Microprocessor. Talk Outline

The T0 Vector Microprocessor. Talk Outline Slides from presentation at the Hot Chips VII conference, 15 August 1995.. The T0 Vector Microprocessor Krste Asanovic James Beck Bertrand Irissou Brian E. D. Kingsbury Nelson Morgan John Wawrzynek University

More information

Intel Many Integrated Core (MIC) Matt Kelly & Ryan Rawlins

Intel Many Integrated Core (MIC) Matt Kelly & Ryan Rawlins Intel Many Integrated Core (MIC) Matt Kelly & Ryan Rawlins Outline History & Motivation Architecture Core architecture Network Topology Memory hierarchy Brief comparison to GPU & Tilera Programming Applications

More information

Intel released new technology call P6P

Intel released new technology call P6P P6 and IA-64 8086 released on 1978 Pentium release on 1993 8086 has upgrade by Pipeline, Super scalar, Clock frequency, Cache and so on But 8086 has limit, Hard to improve efficiency Intel released new

More information

GPU Computation Strategies & Tricks. Ian Buck NVIDIA

GPU Computation Strategies & Tricks. Ian Buck NVIDIA GPU Computation Strategies & Tricks Ian Buck NVIDIA Recent Trends 2 Compute is Cheap parallelism to keep 100s of ALUs per chip busy shading is highly parallel millions of fragments per frame 0.5mm 64-bit

More information

MEMORY AND CONTROL ORGANIZATIONS OF STREAM PROCESSORS

MEMORY AND CONTROL ORGANIZATIONS OF STREAM PROCESSORS MEMORY AND CONTROL ORGANIZATIONS OF STREAM PROCESSORS A DISSERTATION SUBMITTED TO THE DEPARTMENT OF ELECTRICAL ENGINEERING AND THE COMMITTEE ON GRADUATE STUDIES OF STANFORD UNIVERSITY IN PARTIAL FULFILLMENT

More information

A Key Theme of CIS 371: Parallelism. CIS 371 Computer Organization and Design. Readings. This Unit: (In-Order) Superscalar Pipelines

A Key Theme of CIS 371: Parallelism. CIS 371 Computer Organization and Design. Readings. This Unit: (In-Order) Superscalar Pipelines A Key Theme of CIS 371: arallelism CIS 371 Computer Organization and Design Unit 10: Superscalar ipelines reviously: pipeline-level parallelism Work on execute of one instruction in parallel with decode

More information

An Overview of Standard Cell Based Digital VLSI Design

An Overview of Standard Cell Based Digital VLSI Design An Overview of Standard Cell Based Digital VLSI Design With examples taken from the implementation of the 36-core AsAP1 chip and the 1000-core KiloCore chip Zhiyi Yu, Tinoosh Mohsenin, Aaron Stillmaker,

More information

Processor (IV) - advanced ILP. Hwansoo Han

Processor (IV) - advanced ILP. Hwansoo Han Processor (IV) - advanced ILP Hwansoo Han Instruction-Level Parallelism (ILP) Pipelining: executing multiple instructions in parallel To increase ILP Deeper pipeline Less work per stage shorter clock cycle

More information

Evaluation of Static and Dynamic Scheduling for Media Processors.

Evaluation of Static and Dynamic Scheduling for Media Processors. Evaluation of Static and Dynamic Scheduling for Media Processors Jason Fritts 1 and Wayne Wolf 2 1 Dept. of Computer Science, Washington University, St. Louis, MO 2 Dept. of Electrical Engineering, Princeton

More information

Embedded Computation

Embedded Computation Embedded Computation What is an Embedded Processor? Any device that includes a programmable computer, but is not itself a general-purpose computer [W. Wolf, 2000]. Commonly found in cell phones, automobiles,

More information

Windowing System on a 3D Pipeline. February 2005

Windowing System on a 3D Pipeline. February 2005 Windowing System on a 3D Pipeline February 2005 Agenda 1.Overview of the 3D pipeline 2.NVIDIA software overview 3.Strengths and challenges with using the 3D pipeline GeForce 6800 220M Transistors April

More information

Gemini: Sanjiv Kapil. A Power-efficient Chip Multi-Threaded (CMT) UltraSPARC Processor. Gemini Architect Sun Microsystems, Inc.

Gemini: Sanjiv Kapil. A Power-efficient Chip Multi-Threaded (CMT) UltraSPARC Processor. Gemini Architect Sun Microsystems, Inc. Gemini: A Power-efficient Chip Multi-Threaded (CMT) UltraSPARC Processor Sanjiv Kapil Gemini Architect Sun Microsystems, Inc. Design Goals Designed for compute-dense, transaction oriented systems (webservers,

More information

Serial. Parallel. CIT 668: System Architecture 2/14/2011. Topics. Serial and Parallel Computation. Parallel Computing

Serial. Parallel. CIT 668: System Architecture 2/14/2011. Topics. Serial and Parallel Computation. Parallel Computing CIT 668: System Architecture Parallel Computing Topics 1. What is Parallel Computing? 2. Why use Parallel Computing? 3. Types of Parallelism 4. Amdahl s Law 5. Flynn s Taxonomy of Parallel Computers 6.

More information

Lecture 4: RISC Computers

Lecture 4: RISC Computers Lecture 4: RISC Computers Introduction Program execution features RISC characteristics RISC vs. CICS Zebo Peng, IDA, LiTH 1 Introduction Reduced Instruction Set Computer (RISC) represents an important

More information