Low Power DSP Architectures

Trevor Mudge, Bredt Professor of Engineering, The University of Michigan, Ann Arbor
1st tubs.city Symposium, July 1-3, 2009, Braunschweig
Mark Woh (1), Sangwon Seo (1), Ron Dreslinski, Geoff Blake, Scott Mahlke (1), Chaitali Chakrabarti (2), Krisztian Flautner (3)
(1) University of Michigan ACAL, (2) Arizona State University, (3) ARM, Ltd

Overview
- AnySP (SODA++): increase the application domain of the wireless baseband architecture
- Diet-SODA (SODA--): taking the sugar out of SODA

The Old Mobile Phone and the Modern Mobile Phone
- Future phones are becoming more complex.
- Richer applications place much heavier demands on the hardware: video recording, video editing, higher data rates, 3D rendering, advanced image processing.
- How do phones handle this now?

Inside Today's Smart Phones
- Modern phones are looking like Frankenchips!
- Inside the OMAP3430 application processor: some cores are unused and functionality is duplicated.
- Inside the X-Gold 608 (representation of QCOM).

Cost for Multi-System Support
- Supporting multiple systems is reserved for the most expensive phones.
- The cost lies in supporting all the systems, which may or may not be used at once.
- Programmable unified architectures provide:
  - Lower cost
  - Faster time to market
  - Support for multiple applications (current and future)
  - Bug fixes after manufacturing
- So where do we start?

Data gathered from:
- Ramacher, U., "Software-Defined Radio Prospects for Multistandard Mobile Phones," Computer 40, 10 (Oct. 2007)
- Finchelstein, D.F.; Sze, V.; Sinangil, M.E.; Koken, Y.; Chandrakasan, A.P., "A low-power 0.7-V H p video decoder," Solid-State Circuits Conference, A-SSCC '

View of the Unified Architecture World
[Figure: three panels, each showing the blocks 3G, ISP, Video, WiFi, GSM, and 2D/3D]

Power/Performance Requirements for Multiple Systems
- Different applications have different power/performance characteristics!
- We need to design with each application in mind (not a GPP but a domain-specific processor).

The Applications
- Is there anything we can learn from the applications themselves?

H.264 Basics
T.-A. Liu, T.-M. Lin, S.-Z. Wang, et al., "A low-power dual-mode video decoder for mobile applications," IEEE Communications Magazine, volume 44, issue 8, pp , Aug

4G Wireless Basics
Three kernels make up the majority of the work:
- FFT: extract data from signals
- STBC: combine data into a more reliable stream
- LDPC: error correction on the data stream

Mobile Signal Processing Algorithm Characteristics
Table columns: Algorithm, SIMD Workload (%), Scalar Workload (%), Overhead Workload (%), SIMD Width (Elements), Amount of TLP. Values that survive in the transcription:
- 4G, FFT: low TLP
- 4G, STBC: high TLP
- 4G, LDPC: low TLP
- H.264, Deblocking Filter: medium TLP
- H.264, Intra Prediction: SIMD workload 85%, scalar workload 5%, medium TLP
- H.264, Inverse Transform: SIMD workload 80%, scalar workload 5%, overhead workload 15%, SIMD width 8 elements, high TLP
- H.264, Motion Compensation: high TLP

Observations:
- Algorithms have different SIMD widths, from very large to very small.
- Although SIMD width varies, all algorithms can exploit SIMD.
- A large percentage of the work can be SIMDized.
- Algorithms with larger SIMD widths tend to have less TLP.

SIMD comes at a cost!
- Register file power
- Data movement/alignment cost
- SIMD architectures have to deal with this!

Traditional SIMD Power Breakdown
- The register file consumes a large share of the power in a traditional 32-wide SIMD architecture.

Register File Access
[Figure: share of register file accesses (%) per kernel, split into Register Access, Bypass Read, and Bypass Write, for FFT, STBC, LDPC, Deblocking Filter, Intra Prediction, and Inverse Transform]
- Lots of power is wasted on unneeded register file accesses!
- Many register file accesses do not have to go back to the main register file.

Instruction Pair Frequency
- As with the multiply-accumulate (MAC) instruction, there is an opportunity to fuse other instructions.
- A few instruction pairs (3-5) make up the majority of all instruction pairs!
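One way to look for fusion candidates like those on the Instruction Pair Frequency slide is to count how often one opcode directly follows another in an instruction trace. The sketch below is only an illustration under stated assumptions: the opcode set (`OP_MUL`, `OP_ADD`, `OP_SUB`, `OP_SHIFT`) and the function `count_pairs` are hypothetical, and adjacency in the trace is used as a simple stand-in for a true producer-consumer dependence analysis, which the slides do not describe.

```c
#include <stddef.h>

/* Hypothetical opcode set, for illustration only. */
enum { OP_MUL, OP_ADD, OP_SUB, OP_SHIFT, N_OPS };

/*
 * Fill counts[a][b] with the number of times opcode b immediately
 * follows opcode a in the trace. Opcodes outside 0..N_OPS-1 are skipped.
 */
void count_pairs(const int *trace, size_t n, unsigned counts[N_OPS][N_OPS])
{
    for (int a = 0; a < N_OPS; a++)
        for (int b = 0; b < N_OPS; b++)
            counts[a][b] = 0;

    for (size_t i = 1; i < n; i++) {
        int prev = trace[i - 1];
        int cur = trace[i];
        if (prev >= 0 && prev < N_OPS && cur >= 0 && cur < N_OPS)
            counts[prev][cur]++;
    }
}
```

Sorting the resulting counts would show whether a handful of pairs dominate, which is the property the slide reports for these kernels.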

Data Alignment Problem! Intra Prediction
- Traditional SIMD machines take too long or cost too much to do this.
- Good news: there is a small, fixed number of patterns per kernel.
- H.264 intra-prediction has 9 different prediction modes.
- Each prediction mode requires a specific permutation.

More Data Alignment Problems! Inverse Transform
- An adder tree can accelerate more than just matrix operations!
- Many different video kernels can be accelerated too!
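The point about a small, fixed number of permutations per kernel can be sketched as a table of precomputed lane patterns applied to a SIMD vector. This is a minimal scalar model, not the AnySP swizzle network: the lane count, the function `shuffle_lanes`, and the two entries of `example_patterns` are hypothetical and are not the actual H.264 intra-prediction permutations.

```c
#include <stddef.h>
#include <stdint.h>

#define LANES 8  /* assumed vector width for this sketch */

/* Hypothetical fixed patterns: pattern[i] names the source lane for output lane i. */
const uint8_t example_patterns[2][LANES] = {
    {0, 0, 0, 0, 0, 0, 0, 0},  /* broadcast lane 0 to every lane */
    {7, 6, 5, 4, 3, 2, 1, 0},  /* reverse the lane order */
};

/* Apply one fixed pattern: out[i] = in[pattern[i]]. */
void shuffle_lanes(const int16_t in[LANES], const uint8_t pattern[LANES],
                   int16_t out[LANES])
{
    for (size_t lane = 0; lane < LANES; lane++)
        out[lane] = in[pattern[lane] % LANES];  /* modulo guards bad indices */
}
```

Because the set of patterns is fixed per kernel, selecting one reduces to indexing a small table rather than computing an arbitrary permutation at run time.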

Even More Data Alignment!
- Block-based decoding requires us to access different memory locations for each task.
- We cannot rely on fetching only contiguous sets of data.
- Techniques like 2D-Wave and 3D-Wave decoding for H.264 help increase the amount of parallelism, but we have to be able to access different macroblocks for each parallel computation.
C.H. Meenderinck, A. Azevedo, B.H.H. Juurlink, M. Alvarez, A. Ramirez, "Parallel Scalability of Video Decoders," Journal of Signal Processing Systems, August 2008

Summary
Conclusions about 4G and H.264:
- Parallelism comes in many different sizes, from 4-wide to 96-wide to 1024-wide SIMD, which means many different SIMD widths need to be supported.
- Values are very short-lived.
- There is a lot of potential for instruction fusing.
- Each kernel requires only a limited set of shuffle patterns.

AnySP Design

SODA SIMD Architecture
- Wide SIMD with a simple shuffle network

AnySP Architecture High Level
- Banked memory with SRAM-based crossbar
- 8 groups of 8-wide flexible function units
- Multiple-output adder tree
- 128x128 16-bit swizzle network
- Temporary buffer and bypass network
- Datapath
- AGU and scalar pipeline

Multi-Width Support
- Normal 64-wide SIMD mode: all lanes share one AGU.
- Each 8-wide SIMD group works on different memory locations of the same 8-wide code: AGU offsets.

AnySP FFU Datapath
The Flexible Functional Unit allows us to:
1. Exploit pipeline parallelism by joining two lanes together
2. Handle register bypass and the temporary buffer
3. Join multiple pipelines to process deeper subgraphs
4. Fuse instruction pairs

AnySP Results

Simulation Environment
- Traditional SIMD architecture for comparison: SODA at 90nm technology
- AnySP: synthesized at 90nm TSMC; power, timing, and area numbers were extracted
- Kernels were handwritten and optimized
- 4G: based on an NTT DoCoMo 4G test setup
- H.264: 4CIF@30fps

AnySP Speedup vs SIMD-based Architecture
- On all benchmarks, AnySP performs more than 2x better than a SIMD-based architecture.

AnySP Energy-Delay vs SIMD-based Architecture
- More importantly, energy efficiency is much better!

AnySP Power Breakdown
- We estimate that both H.264 and 4G wireless can be done in under 1 Watt at 45nm.

Conclusion & Future Work
Conclusion
- We have presented an example architecture that could meet the requirements of 100Mbps 4G and HD video on the same platform, staying under the power budget and meeting the performance target at 45nm.

Future and Ongoing Work
- Application-specific language
- Larger class of algorithms for AnySP
- Better utilization of resources for non-parallel kernels
- Speed up sequential parts

Diet-SODA

Diet SODA
- SODA, Ardbeg, and AnySP may be too powerful for some applications:
  - Simple image processing for cameras
  - Audio processing for voice
- Trade the flexibility and generality of Ardbeg and AnySP for performance with fewer gates.
- Build a modular design to which people can add SIMD groups and special function blocks, increasing performance at the cost of area while allowing voltage scaling.

Histogram Equalization
- Spreads out an unevenly distributed histogram
- Increases contrast

Histogram Equalization

    /* Pass 1: build the histogram of pixel values. */
    for (i = 0; i < length; i++) {
        for (j = 0; j < width; j++) {
            k = image[i][j];
            histogram[k] = histogram[k] + 1;
        }
    }

    /* Pass 2: remap each pixel through the cumulative histogram. */
    for (i = 0; i < length; i++) {
        for (j = 0; j < width; j++) {
            k = image[i][j];
            image[i][j] = sum_of_h[k] * constant;
        }
    }

- Hard to parallelize because of histogram[k].
- We can parallelize the load, but we may suffer on the SIMD-to-scalar transfer.

Basic Edge Detection
- detect_edges: 8 directional masks (Sobel) + thresholding
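Returning to the histogram equalization loops on this slide: they use `sum_of_h` and `constant` without showing where they come from. The sketch below fills that gap under stated assumptions: 8-bit grayscale pixels, a flat row-major buffer, `sum_of_h` as the running (cumulative) sum of the histogram, and `constant` taken to be 255 divided by the pixel count. The function name `equalize_histogram` is hypothetical, and this is a scalar reference version, not the Diet-SODA kernel.

```c
#include <stddef.h>
#include <stdint.h>

#define LEVELS 256  /* assumed: 8-bit grayscale pixel values */

/* Equalize an 8-bit image held as a flat array of n_pixels values. */
void equalize_histogram(uint8_t *image, size_t n_pixels)
{
    uint32_t histogram[LEVELS] = {0};
    uint32_t sum_of_h[LEVELS];

    if (n_pixels == 0)
        return;

    /* Pass 1: count pixels per level. The data-dependent index is
       what makes this loop hard to SIMDize. */
    for (size_t i = 0; i < n_pixels; i++)
        histogram[image[i]]++;

    /* Running sum of the histogram (assumed meaning of sum_of_h). */
    uint32_t running = 0;
    for (int k = 0; k < LEVELS; k++) {
        running += histogram[k];
        sum_of_h[k] = running;
    }

    /* Pass 2: remap. The assumed constant (LEVELS - 1) / n_pixels is
       applied as multiply-then-divide to stay in integer arithmetic. */
    for (size_t i = 0; i < n_pixels; i++)
        image[i] = (uint8_t)(((uint64_t)sum_of_h[image[i]] * (LEVELS - 1)) / n_pixels);
}
```

The second pass is a plain table lookup per pixel, so it maps onto SIMD far more readily than the first pass, which is consistent with the slide's remark about `histogram[k]`.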

Basic Edge Detection

    /* Apply one 3x3 mask around pixel (i, j). */
    for (a = -1; a < 2; a++) {
        for (b = -1; b < 2; b++) {
            sum = sum + image[i+a][j+b] * mask_0[a+1][b+1];
        }
    }

    /* Threshold the output image. */
    for (i = 0; i < rows; i++) {
        for (j = 0; j < cols; j++) {
            if (out_image[i][j] > high) {
                out_image[i][j] = new_hi;
            } else {
                out_image[i][j] = new_low;
            }
        }
    }

- Can be easily parallelized across multiple masks or across multiple images.

Basic Edge Detection
Operations are applied on 3x3 matrices:
1) Reduction tree (need a function of the sum of 9 values)
2) Select max function
3) Threshold operation: if (out_image > threshold) out_image = hi else out_image = low
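The three operations listed for edge detection (a 9-value reduction, a select-max across masks, and a threshold) can be put together in one scalar routine. The sketch below is an illustration under assumptions: a flat row-major 8-bit image, border pixels set to the low value, and a caller-supplied mask array. `detect_edges`, `apply_mask`, and the two Sobel masks in `example_masks` are hypothetical names and data; the slides mention 8 directional masks but do not list them.

```c
#include <limits.h>

/* Two Sobel masks as sample data; the slides use 8 directional masks. */
const int example_masks[2][3][3] = {
    { {-1, 0, 1}, {-2, 0, 2}, {-1, 0, 1} },
    { {-1, -2, -1}, {0, 0, 0}, {1, 2, 1} },
};

/* 1) Reduction: weighted sum of the 3x3 neighborhood around (i, j). */
static int apply_mask(const unsigned char *image, int cols, int i, int j,
                      const int mask[3][3])
{
    int sum = 0;
    for (int a = -1; a < 2; a++)
        for (int b = -1; b < 2; b++)
            sum += image[(i + a) * cols + (j + b)] * mask[a + 1][b + 1];
    return sum;
}

void detect_edges(const unsigned char *image, unsigned char *out_image,
                  int rows, int cols,
                  const int masks[][3][3], int n_masks,
                  int threshold, unsigned char new_hi, unsigned char new_low)
{
    for (int i = 0; i < rows; i++) {
        for (int j = 0; j < cols; j++) {
            int best = INT_MIN;
            int interior = i > 0 && i < rows - 1 && j > 0 && j < cols - 1;

            /* 2) Select max: strongest response over all masks.
               Border pixels are skipped and fall through to new_low. */
            if (interior) {
                for (int m = 0; m < n_masks; m++) {
                    int sum = apply_mask(image, cols, i, j, masks[m]);
                    if (sum > best)
                        best = sum;
                }
            }

            /* 3) Threshold the selected response. */
            out_image[i * cols + j] = (best > threshold) ? new_hi : new_low;
        }
    }
}
```

Each mask response is independent of the others, which matches the slide's note that the work parallelizes across masks or across images.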

Performance Estimate
Table columns: Kernel, Sequential (Mcycle), SIMDized (Mcycle). Rows: Histogram Equalization, Edge Detection, Homogeneity Edge Detection, Filter.
- Estimate of how many cycles we need on 2048x1536 (3.1 megapixel) images
- Unoptimized SIMDized kernels, assuming 16-wide SIMD
- Most work is done on 9-wide (3x3) data elements
- Adding multiple SIMD groups can increase performance in most kernels

Modular Design Consideration
- Build a highly application-specific modular core.
- Depending on which applications are supported, the user can add or remove specialized hardware.

Design requirement tradeoff:
- Smallest area: use the minimum number of SIMD groups, or run the SIMD groups faster.
- Faster performance: add more SIMD groups.
- Lower power: add more SIMD groups, and lower the frequency and VDD for the same throughput.
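The lower-power option in the tradeoff list follows from the standard first-order model of dynamic power. The derivation below is a sketch under assumptions the slides do not state: leakage is ignored, throughput scales linearly with both the number of SIMD groups and the clock frequency, and the lower frequency permits a lower supply voltage. The symbols $\alpha$, $C$, $N$, $V_1$, and $V_N$ and the voltages in the worked example are illustrative, not measurements from the talk.

```latex
% Dynamic power of one SIMD group at supply V_1 and frequency f:
P_1 = \alpha\, C\, V_1^{2}\, f

% N groups, each clocked at f/N (same total throughput), at supply V_N:
P_N = N \cdot \alpha\, C\, V_N^{2}\, \frac{f}{N} = \alpha\, C\, V_N^{2}\, f

% Ratio: the saving comes entirely from the lower supply voltage.
\frac{P_N}{P_1} = \left(\frac{V_N}{V_1}\right)^{2}

% Illustrative numbers only: V_1 = 1.0\,\mathrm{V},\; V_N = 0.8\,\mathrm{V}
\frac{P_N}{P_1} = (0.8)^{2} = 0.64
```

Under this model, adding groups without lowering VDD saves nothing, and the price of the saving is the area of the extra SIMD groups, which is the tradeoff the slide lists.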

The End

AnySP: Anytime Anywhere Anyway Signal Processing

AnySP: Anytime Anywhere Anyway Signal Processing 1 AnySP: Anytime Anywhere Anyway Signal Processing Mark Woh 1, Sangwon Seo 1, Scott Mahlke 1,Trevor Mudge 1, Chaitali Chakrabarti 2, Krisztian Flautner 3 University of Michigan ACAL 1 Arizona State University

More information

An Ultra Low Power SIMD Processor for Wireless Devices

An Ultra Low Power SIMD Processor for Wireless Devices An Ultra Low Power SIMD Processor for Wireless Devices Mark Woh 1, Sangwon Seo 1, Chaitali Chakrabarti 2, Scott Mahlke 1 and Trevor Mudge 1 1 Advanced Computer Architecture Laboratory 2 School of Electrical,

More information

ANYSP: ANYTIME ANYWHERE ANYWAY SIGNAL PROCESSING

ANYSP: ANYTIME ANYWHERE ANYWAY SIGNAL PROCESSING ... ANYSP: ANYTIME ANYWHERE ANYWAY SIGNAL PROCESSING... LOOKING FORWARD, THE COMPUTATION REQUIREMENTS OF MOBILE DEVICES WILL INCREASE BY ONE TO TWO ORDERS OF MAGNITUDE, BUT THEIR POWER REQUIREMENTS WILL

More information

AnySP: Anytime Anywhere Anyway Signal Processing

AnySP: Anytime Anywhere Anyway Signal Processing AnySP: Anytime Anywhere Anyway Signal Processing Mark Woh, Sangwon Seo, Scott Mahlke, Trevor Mudge, Chaitali Chakrabarti 2 and Krisztian Flautner 3 Advanced Computer Architecture Laboratory 2 Department

More information

ARCHITECTURE AND ANALYSIS FOR NEXT GENERATION MOBILE SIGNAL PROCESSING. Mark Woh

ARCHITECTURE AND ANALYSIS FOR NEXT GENERATION MOBILE SIGNAL PROCESSING. Mark Woh ARCHITECTURE AND ANALYSIS FOR NEXT GENERATION MOBILE SIGNAL PROCESSING by Mark Woh A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy (Electrical

More information

A scalable, fixed-shuffling, parallel FFT butterfly processing architecture for SDR environment

A scalable, fixed-shuffling, parallel FFT butterfly processing architecture for SDR environment LETTER IEICE Electronics Express, Vol.11, No.2, 1 9 A scalable, fixed-shuffling, parallel FFT butterfly processing architecture for SDR environment Ting Chen a), Hengzhu Liu, and Botao Zhang College of

More information

Process Variation in Near-Threshold Wide SIMD Architectures

Process Variation in Near-Threshold Wide SIMD Architectures Process Variation in Near-Threshold Wide SIMD Architectures Sangwon Seo 1, Ronald G. Dreslinski 1, Mark Woh 1, Yongjun Park 1, Chaitali Charkrabari 2, Scott Mahlke 1, David Blaauw 1, Trevor Mudge 1 1 University

More information

ARCHITECTURAL APPROACHES TO REDUCE LEAKAGE ENERGY IN CACHES

ARCHITECTURAL APPROACHES TO REDUCE LEAKAGE ENERGY IN CACHES ARCHITECTURAL APPROACHES TO REDUCE LEAKAGE ENERGY IN CACHES Shashikiran H. Tadas & Chaitali Chakrabarti Department of Electrical Engineering Arizona State University Tempe, AZ, 85287. tadas@asu.edu, chaitali@asu.edu

More information

Vertex Shader Design I

Vertex Shader Design I The following content is extracted from the paper shown in next page. If any wrong citation or reference missing, please contact ldvan@cs.nctu.edu.tw. I will correct the error asap. This course used only

More information

A Software LDPC Decoder Implemented on a Many-Core Array of Programmable Processors

A Software LDPC Decoder Implemented on a Many-Core Array of Programmable Processors A Software LDPC Decoder Implemented on a Many-Core Array of Programmable Processors Brent Bohnenstiehl and Bevan Baas Department of Electrical and Computer Engineering University of California, Davis {bvbohnen,

More information

Realizing Software Defined Radio A Study in Designing Mobile Supercomputers

Realizing Software Defined Radio A Study in Designing Mobile Supercomputers Realizing Software Defined Radio A Study in Designing Mobile Supercomputers by Yuan Lin A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy (Computer

More information

A System Solution for High-Performance, Low Power SDR

A System Solution for High-Performance, Low Power SDR A System Solution for High-Performance, Low Power SDR Yuan Lin 1, Hyunseok Lee 1, Yoav Harel 1, Mark Woh 1, Scott Mahlke 1, Trevor Mudge 1 and Krisztián Flautner 2 1 Advanced Computer Architecture Laboratory

More information

High performance, power-efficient DSPs based on the TI C64x

High performance, power-efficient DSPs based on the TI C64x High performance, power-efficient DSPs based on the TI C64x Sridhar Rajagopal, Joseph R. Cavallaro, Scott Rixner Rice University {sridhar,cavallar,rixner}@rice.edu RICE UNIVERSITY Recent (2003) Research

More information

On-Chip Vector Coprocessor Sharing for Multicores

On-Chip Vector Coprocessor Sharing for Multicores On-Chip Vector Coprocessor Sharing for Multicores Spiridon F. Beldianu and Sotirios G. Ziavras Electrical and Computer Engineering Department, New Jersey Institute of Technology, Newark, NJ, USA sfb2@njit.edu,

More information

Reconfigurable Multicore Server Processors for Low Power Operation

Reconfigurable Multicore Server Processors for Low Power Operation Reconfigurable Multicore Server Processors for Low Power Operation Ronald G. Dreslinski, David Fick, David Blaauw, Dennis Sylvester, Trevor Mudge University of Michigan, Advanced Computer Architecture

More information

Customizing Wide-SIMD Architectures for H.264

Customizing Wide-SIMD Architectures for H.264 Customizing Wide- Architectures for H.264 S. Seo, M. Woh, S. Mahlke, T. Mudge Department of Electrical and Computer Engineering University of Michigan, Ann Arbor, MI 48109 Email: swseo,mwoh,mahlke,tnm@umich.edu

More information

Linköping University Post Print. epuma: a novel embedded parallel DSP platform for predictable computing

Linköping University Post Print. epuma: a novel embedded parallel DSP platform for predictable computing Linköping University Post Print epuma: a novel embedded parallel DSP platform for predictable computing Jian Wang, Joar Sohl, Olof Kraigher and Dake Liu N.B.: When citing this work, cite the original article.

More information

Flexible wireless communication architectures

Flexible wireless communication architectures Flexible wireless communication architectures Sridhar Rajagopal Department of Electrical and Computer Engineering Rice University, Houston TX Faculty Candidate Seminar Southern Methodist University April

More information

Advanced Computer Architecture

Advanced Computer Architecture Advanced Computer Architecture Chapter 1 Introduction into the Sequential and Pipeline Instruction Execution Martin Milata What is a Processors Architecture Instruction Set Architecture (ISA) Describes

More information

A Fine-Grained Pipelined Implementation of LU Decomposition on SIMD Processors

A Fine-Grained Pipelined Implementation of LU Decomposition on SIMD Processors A Fine-Grained Pipelined Implementation of LU Decomposition on SIMD Processors Kai Zhang, ShuMing Chen*, Wei Liu, and Xi Ning School of Computer, National University of Defense Technology #109, Deya Road,

More information

Algorithm-Architecture Co- Design for Efficient SDR Signal Processing

Algorithm-Architecture Co- Design for Efficient SDR Signal Processing Algorithm-Architecture Co- Design for Efficient SDR Signal Processing Min Li, limin@imec.be Wireless Research, IMEC Introduction SDR Baseband Platforms Today are Usually Based on ILP + DLP + MP Massive

More information

Versal: AI Engine & Programming Environment

Versal: AI Engine & Programming Environment Engineering Director, Xilinx Silicon Architecture Group Versal: Engine & Programming Environment Presented By Ambrose Finnerty Xilinx DSP Technical Marketing Manager October 16, 2018 MEMORY MEMORY MEMORY

More information

LIBRA: ACHIEVING EFFICIENT INSTRUCTION- AND DATA- PARALLEL EXECUTION FOR MOBILE APPLICATIONS

LIBRA: ACHIEVING EFFICIENT INSTRUCTION- AND DATA- PARALLEL EXECUTION FOR MOBILE APPLICATIONS LIBRA: ACHIEVING EFFICIENT INSTRUCTION- AND DATA- PARALLEL EXECUTION FOR MOBILE APPLICATIONS by Yongjun Park A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor

More information

Modern Processor Architectures. L25: Modern Compiler Design

Modern Processor Architectures. L25: Modern Compiler Design Modern Processor Architectures L25: Modern Compiler Design The 1960s - 1970s Instructions took multiple cycles Only one instruction in flight at once Optimisation meant minimising the number of instructions

More information

Near-Threshold Computing: Reclaiming Moore s Law

Near-Threshold Computing: Reclaiming Moore s Law 1 Near-Threshold Computing: Reclaiming Moore s Law Dr. Ronald G. Dreslinski Research Fellow Ann Arbor 1 1 Motivation 1000000 Transistors (100,000's) 100000 10000 Power (W) Performance (GOPS) Efficiency (GOPS/W)

More information

A 167-processor Computational Array for Highly-Efficient DSP and Embedded Application Processing

A 167-processor Computational Array for Highly-Efficient DSP and Embedded Application Processing A 167-processor Computational Array for Highly-Efficient DSP and Embedded Application Processing Dean Truong, Wayne Cheng, Tinoosh Mohsenin, Zhiyi Yu, Toney Jacobson, Gouri Landge, Michael Meeuwsen, Christine

More information

A Low-Power DSP for Wireless Communications Hyunseok Lee, Member, IEEE, Chaitali Chakrabarti, Senior Member, IEEE, and Trevor Mudge, Fellow, IEEE

A Low-Power DSP for Wireless Communications Hyunseok Lee, Member, IEEE, Chaitali Chakrabarti, Senior Member, IEEE, and Trevor Mudge, Fellow, IEEE 1310 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 18, NO. 9, SEPTEMBER 2010 A Low-Power DSP for Wireless Communications Hyunseok Lee, Member, IEEE, Chaitali Chakrabarti, Senior

More information

EECS150 - Digital Design Lecture 09 - Parallelism

EECS150 - Digital Design Lecture 09 - Parallelism EECS150 - Digital Design Lecture 09 - Parallelism Feb 19, 2013 John Wawrzynek Spring 2013 EECS150 - Lec09-parallel Page 1 Parallelism Parallelism is the act of doing more than one thing at a time. Optimization

More information

SIMD Divergence Optimization through Intra-Warp Compaction. Aniruddha Vaidya Anahita Shayesteh Dong Hyuk Woo Roy Saharoy Mani Azimi ISCA 13

SIMD Divergence Optimization through Intra-Warp Compaction. Aniruddha Vaidya Anahita Shayesteh Dong Hyuk Woo Roy Saharoy Mani Azimi ISCA 13 SIMD Divergence Optimization through Intra-Warp Compaction Aniruddha Vaidya Anahita Shayesteh Dong Hyuk Woo Roy Saharoy Mani Azimi ISCA 13 Problem GPU: wide SIMD lanes 16 lanes per warp in this work SIMD

More information

Libra: Tailoring SIMD Execution using Heterogeneous Hardware and Dynamic Configurability

Libra: Tailoring SIMD Execution using Heterogeneous Hardware and Dynamic Configurability 2012 IEEE/ACM 45th Annual International Symposium on Microarchitecture Libra: Tailoring SIMD Execution using Heterogeneous Hardware and Dynamic Configurability Yongjun Park Jason Jong Kyu Park Hyunchul

More information

21 Baseband Processing Architectures for SDR

21 Baseband Processing Architectures for SDR Vijay/Wireless, Networking, Radar, Sensor Array Processing, and Nonlinear Signal Processing 46047_C021 Page Proof page 1 22.6.2009 4:08pm Compositor Name: BMani 21 Baseband Processing Architectures for

More information

Parallelization Techniques for Implementing Trellis Algorithms on Graphics Processors

Parallelization Techniques for Implementing Trellis Algorithms on Graphics Processors 1 Parallelization Techniques for Implementing Trellis Algorithms on Graphics Processors Qi Zheng*, Yajing Chen*, Ronald Dreslinski*, Chaitali Chakrabarti +, Achilleas Anastasopoulos*, Scott Mahlke*, Trevor

More information

EECS 151/251A Fall 2017 Digital Design and Integrated Circuits. Instructor: John Wawrzynek and Nicholas Weaver. Lecture 14 EE141

EECS 151/251A Fall 2017 Digital Design and Integrated Circuits. Instructor: John Wawrzynek and Nicholas Weaver. Lecture 14 EE141 EECS 151/251A Fall 2017 Digital Design and Integrated Circuits Instructor: John Wawrzynek and Nicholas Weaver Lecture 14 EE141 Outline Parallelism EE141 2 Parallelism Parallelism is the act of doing more

More information

SRAMs to Memory. Memory Hierarchy. Locality. Low Power VLSI System Design Lecture 10: Low Power Memory Design

SRAMs to Memory. Memory Hierarchy. Locality. Low Power VLSI System Design Lecture 10: Low Power Memory Design SRAMs to Memory Low Power VLSI System Design Lecture 0: Low Power Memory Design Prof. R. Iris Bahar October, 07 Last lecture focused on the SRAM cell and the D or D memory architecture built from these

More information

MPSOC 2011 BEAUNE, FRANCE

MPSOC 2011 BEAUNE, FRANCE MPSOC 2011 BEAUNE, FRANCE BOADRES: A SCALABLE BASEBAND PROCESSOR TEMPLATE FOR Gbps RADIOS VICE PRESIDENT, CHAIRMAN OF THE TECHNOLOGY OFFICE PROFESSOR AT THE KATHOLIEKE UNIVERSITEIT LEUVEN STATUS SDR BASEBAND

More information

ISSCC 2006 / SESSION 22 / LOW POWER MULTIMEDIA / 22.1

ISSCC 2006 / SESSION 22 / LOW POWER MULTIMEDIA / 22.1 ISSCC 26 / SESSION 22 / LOW POWER MULTIMEDIA / 22.1 22.1 A 125µW, Fully Scalable MPEG-2 and H.264/AVC Video Decoder for Mobile Applications Tsu-Ming Liu 1, Ting-An Lin 2, Sheng-Zen Wang 2, Wen-Ping Lee

More information

Inside Intel Core Microarchitecture

Inside Intel Core Microarchitecture White Paper Inside Intel Core Microarchitecture Setting New Standards for Energy-Efficient Performance Ofri Wechsler Intel Fellow, Mobility Group Director, Mobility Microprocessor Architecture Intel Corporation

More information

Frequency Domain Acceleration of Convolutional Neural Networks on CPU-FPGA Shared Memory System

Frequency Domain Acceleration of Convolutional Neural Networks on CPU-FPGA Shared Memory System Frequency Domain Acceleration of Convolutional Neural Networks on CPU-FPGA Shared Memory System Chi Zhang, Viktor K Prasanna University of Southern California {zhan527, prasanna}@usc.edu fpga.usc.edu ACM

More information

IBM Cell Processor. Gilbert Hendry Mark Kretschmann

IBM Cell Processor. Gilbert Hendry Mark Kretschmann IBM Cell Processor Gilbert Hendry Mark Kretschmann Architectural components Architectural security Programming Models Compiler Applications Performance Power and Cost Conclusion Outline Cell Architecture:

More information

ASIC Design of Shared Vector Accelerators for Multicore Processors

ASIC Design of Shared Vector Accelerators for Multicore Processors 26 th International Symposium on Computer Architecture and High Performance Computing 2014 ASIC Design of Shared Vector Accelerators for Multicore Processors Spiridon F. Beldianu & Sotirios G. Ziavras

More information

ELE 455/555 Computer System Engineering. Section 4 Parallel Processing Class 1 Challenges

ELE 455/555 Computer System Engineering. Section 4 Parallel Processing Class 1 Challenges ELE 455/555 Computer System Engineering Section 4 Class 1 Challenges Introduction Motivation Desire to provide more performance (processing) Scaling a single processor is limited Clock speeds Power concerns

More information

MAPPING VIDEO CODECS TO HETEROGENEOUS ARCHITECTURES. Mauricio Alvarez-Mesa Techische Universität Berlin - Spin Digital MULTIPROG 2015

MAPPING VIDEO CODECS TO HETEROGENEOUS ARCHITECTURES. Mauricio Alvarez-Mesa Techische Universität Berlin - Spin Digital MULTIPROG 2015 MAPPING VIDEO CODECS TO HETEROGENEOUS ARCHITECTURES Mauricio Alvarez-Mesa Techische Universität Berlin - Spin Digital MULTIPROG 2015 Video Codecs 70% of internet traffic will be video in 2018 [CISCO] Video

More information

Multi-threading technology and the challenges of meeting performance and power consumption demands for mobile applications

Multi-threading technology and the challenges of meeting performance and power consumption demands for mobile applications Multi-threading technology and the challenges of meeting performance and power consumption demands for mobile applications September 2013 Navigating between ever-higher performance targets and strict limits

More information

Centip3De: A 64-Core, 3D Stacked, Near-Threshold System

Centip3De: A 64-Core, 3D Stacked, Near-Threshold System 1 1 1 Centip3De: A 64-Core, 3D Stacked, Near-Threshold System Ronald G. Dreslinski David Fick, Bharan Giridhar, Gyouho Kim, Sangwon Seo, Matthew Fojtik, Sudhir Satpathy, Yoonmyung Lee, Daeyeon Kim, Nurrachman

More information

Implementation of DSP Algorithms

Implementation of DSP Algorithms Implementation of DSP Algorithms Main frame computers Dedicated (application specific) architectures Programmable digital signal processors voice band data modem speech codec 1 PDSP and General-Purpose

More information

Leveraging Mobile GPUs for Flexible High-speed Wireless Communication

Leveraging Mobile GPUs for Flexible High-speed Wireless Communication 0 Leveraging Mobile GPUs for Flexible High-speed Wireless Communication Qi Zheng, Cao Gao, Trevor Mudge, Ronald Dreslinski *, Ann Arbor The 3 rd International Workshop on Parallelism in Mobile Platforms

More information

04 - DSP Architecture and Microarchitecture

04 - DSP Architecture and Microarchitecture September 11, 2015 Memory indirect addressing (continued from last lecture) ; Reality check: Data hazards! ; Assembler code v3: repeat 256,endloop load r0,dm1[dm0[ptr0++]] store DM0[ptr1++],r0 endloop:

More information

Storage I/O Summary. Lecture 16: Multimedia and DSP Architectures

Storage I/O Summary. Lecture 16: Multimedia and DSP Architectures Storage I/O Summary Storage devices Storage I/O Performance Measures» Throughput» Response time I/O Benchmarks» Scaling to track technological change» Throughput with restricted response time is normal

More information

ESE532: System-on-a-Chip Architecture. Today. Message. Graph Cycles. Preclass 1. Reminder

ESE532: System-on-a-Chip Architecture. Today. Message. Graph Cycles. Preclass 1. Reminder ESE532: System-on-a-Chip Architecture Day 8: September 26, 2018 Spatial Computations Today Graph Cycles (from Day 7) Accelerator Pipelines FPGAs Zynq Computational Capacity 1 2 Message Custom accelerators

More information

Software Defined Modem A commercial platform for wireless handsets

Software Defined Modem A commercial platform for wireless handsets Software Defined Modem A commercial platform for wireless handsets Charles F Sturman VP Marketing June 22 nd ~ 24 th Brussels charles.stuman@cognovo.com www.cognovo.com Agenda SDM Separating hardware from

More information

Embedded Systems. 7. System Components

Embedded Systems. 7. System Components Embedded Systems 7. System Components Lothar Thiele 7-1 Contents of Course 1. Embedded Systems Introduction 2. Software Introduction 7. System Components 10. Models 3. Real-Time Models 4. Periodic/Aperiodic

More information

Mighty-Morphing Power-SIMD

Mighty-Morphing Power-SIMD Mighty-Morphing Power-SIMD Ganesh Dasika, Mark Woh, Sangwon Seo, Nathan Clark, Trevor Mudge, and Scott Mahlke Advanced Computer Architecture Laboratory College of Computing University of Michigan - Ann

More information

Specializing Hardware for Image Processing

Specializing Hardware for Image Processing Lecture 6: Specializing Hardware for Image Processing Visual Computing Systems So far, the discussion in this class has focused on generating efficient code for multi-core processors such as CPUs and GPUs.

More information

ECE 552 / CPS 550 Advanced Computer Architecture I. Lecture 13 Memory Part 2

ECE 552 / CPS 550 Advanced Computer Architecture I. Lecture 13 Memory Part 2 ECE 552 / CPS 550 Advanced Computer Architecture I Lecture 13 Memory Part 2 Benjamin Lee Electrical and Computer Engineering Duke University www.duke.edu/~bcl15 www.duke.edu/~bcl15/class/class_ece252fall12.html

More information

Modern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design

Modern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design Modern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design The 1960s - 1970s Instructions took multiple cycles Only one instruction in flight at once Optimisation meant

More information

Announcements. ECE4750/CS4420 Computer Architecture L6: Advanced Memory Hierarchy. Edward Suh Computer Systems Laboratory

Announcements. ECE4750/CS4420 Computer Architecture L6: Advanced Memory Hierarchy. Edward Suh Computer Systems Laboratory ECE4750/CS4420 Computer Architecture L6: Advanced Memory Hierarchy Edward Suh Computer Systems Laboratory suh@csl.cornell.edu Announcements Lab 1 due today Reading: Chapter 5.1 5.3 2 1 Overview How to

More information

High-performance and Low-power Consumption Vector Processor for LTE Baseband LSI

High-performance and Low-power Consumption Vector Processor for LTE Baseband LSI High-performance and Low-power Consumption Vector Processor for LTE Baseband LSI Yi Ge Mitsuru Tomono Makiko Ito Yoshio Hirose Recently, the transmission rate for handheld devices has been increasing by

More information

Versal: The New Xilinx Adaptive Compute Acceleration Platform (ACAP) in 7nm

Versal: The New Xilinx Adaptive Compute Acceleration Platform (ACAP) in 7nm Engineering Director, Xilinx Silicon Architecture Group Versal: The New Xilinx Adaptive Compute Acceleration Platform (ACAP) in 7nm Presented By Kees Vissers Fellow February 25, FPGA 2019 Technology scaling

More information
