Bridging Analog Neuromorphic and Digital von Neumann Computing

Size: px
Start display at page:

Download "Bridging Analog Neuromorphic and Digital von Neumann Computing"

Transcription

1 Bridging Analog Neuromorphic and Digital von Neumann Computing Amir Yazdanbakhsh, Bradley Thwaites Advisors: Hadi Esmaeilzadeh and Doug Burger Qualcomm Mentors: Manu Rastogiand Girish Varatkar Alternative Computing Technologies (ACT) Lab Georgia Institute of Technology Qualcomm Innovation Fellowship

2 Energy is a primary constraint Data Center Mobile Internet of Things

3 Data growth vs performance Data growth trends: IDC's Digital Universe Study, December 2012 Performance growth trends: Esmaeilzadeh et al, Dark Silicon and the End of Multicore Scaling, ISCA

4 Approximate computing Embracing error Relax the abstraction of near perfect accuracy in Processing Storage Communication Allows errors to happen to improve performance resource utilization efficiency 6

5 Avoiding overkill design Approximate Computing Application Programming Language Compiler Architecutre Microarchitecture Cost Precision Reliability Cost Circuit Physical Device

6 Adding a third dimension Embracing Error Processor Pareto.Fron0er Data center Energy Desktop Mobile IoT Performance

7 Navigating a three dimensional space Processor Pareto.Fron0er Data center Energy Desktop IoT Mobile Performance

8 Finding the Pareto surface Energy IoT Processor Pareto.Fron0er Mobile Data center Desktop Truffle [ALOS 12] FLEXJAVA [FSE 15] RFVP [PACT 14, IEEE D&T 15] Axilog [DATE 15, IEEE Micro 15] D- NPUs [MICRO 12] A- NPUs [ISCA 14] SNNAP [HPCA 15] GNPU [Micro 15] MITHRA [TechCon 15] Performance (13.5, 11.1, 10%)

9 Accelerating GPU Accelerators Bridging Neuromophic and von Neumann Computing Unleashing the Beast Amir Yazdanbakhsh et al., Neural Acceleration for GPU Throughput Processors, MICRO 2015.

10 Neural Transformation Analog Neural Network Analog Neural Network

11 Analog NPU Integration CPU x 0 x i x n DAC DAC DAC I(x 0 ) I(x i ) I(x n ) R 0 X (I(xi )R(w i )) ADC R(w i ) R(w n ) A-NPU V to I V to I V to I SM A-NPU SM A-NPU SM A-NPU SM SM SM A-NPU A-NPU A-NPU SM SM SM A-NPU A-NPU A-NPU SM SM SM A-NPU A-NPU A-NPU y sigmoid( X (I(x i )R(w i ))) SM SM SM SM A-NPU A-NPU A-NPU A-NPU General- Purpose Code Acceleration with Limited- Precision Analog Computation ISCA 2014 Neural Acceleration for GPU Throughput Processors Micro 2015

12 s w0 w 0 s x0 x 0 s wn w n s xn x n I( x 0 ) Current' Steering' DAC I( x n ) Current' Steering' DAC Resistor' Ladder Resistor' Ladder R( w 0 ) R( w n ) I + (w 0 x 0 ) Diff' Pair I (w 0 x 0 ) V + X wi x i V ( w 0 x 0 ) V X wi x i + -" I + (w n x n ) Diff' Amp Diff' Pair y sigmoid V V ( w n x n ) I (w n x n ) Flash ADC s y y X wi x i

13 Analog Compilation Workflow Limited Bit-Width Topology Restriction Circuit Non-idealities Annotated CUDA Code uchar4'p'='tex2d(img,'x,'y); #pragma(begin_approx) a=min(r,'min(g,b)); b=max(r,'max(g,b)); z=((a+b)'>'254)'?'255:'0; #pragma(end_approx) dst[img.width'*'y'+'x]'='z; Compiler + Customized Training Algorthim Application uchar4'p'='tex2d(img,'x,'y); send.n_data5%r0; send.n_data5%r1; send.n_data5%r2; recv.n_data5%r4; dst[img.width'*'y'+'x]'='z; Accelerator Config w 0 = 0.03,, w 8=0.10 SM SM SM SM A-NPU A-NPU A-NPU A-NPU SM SM SM SM A-NPU A-NPU A-NPU A-NPU SM SM SM SM A-NPU A-NPU A-NPU A-NPU SM SM SM SM A-NPU A-NPU A-NPU A-NPU Programming Compilation (Profiling, Training, Code Generation) Execution

14 Benchmarks Image Processing binarization 27 PTX instructions Finance blackscholes 96 PTX instructions Machine Learning convolution 886 PTX instructions Robotics inversek2j 132 PTX instructions 3D Gaming jmeint 2,250PTX instructions Error: 11.43% Error: 8.23% Error: 9.29% Error: 10.25% Error: 19.70% Image Processing laplacian 51 PTX instructions Machine Vision meanfilter 35 PTX instructions Numerical Analysis newton- raph 44 PTX instructions Image Processing sobel 86 PTX instructions Medical Imaging srad 110 PTX instructions Error: 9.87% Error: 9.21% Error: 11.23% Error: 8.03% Error: 9.87%

15 Analog Neuromorphic versus Conventional Computing

16 I 1 I 0 I 2 I out = I 0 + I 1 + I 2 Kirchhoff's Law + V o I(x n ) R(w n ) V o = I(x n ).R(w n ) Ohm s Law Saturation Property of Transistors

17 Speedup Energy Reduction Energy Delay Quality Reduction 10 % Publications [1] Amir Yazdanbakhsh et al., Neural Acceleration for GPU Throughput Processors Micro [2] Renée St. Amant et al., General- Purpose Code Acceleration with Limited- Precision Analog Computation ISCA 2014.

18 Application Programming Language Compiler Architecutre Microarchitecture Circuit Physical Device Software Architecture Memory Hardware Design FLEXJAVA: Language Support for Safe and Modular Approximate Programming [FSE 2015] ExpAX: A Framework for Automating Approximate Programming [Tech Report 2014] Neural Acceleration for GPU Throughput Processors [Micro 2015] MITHRA: Controlling Quality Tradeoffs in Approximate Acceleration [TechCon 2015] General- Purpose Code Acceleration with Limited- Precision Analog Computation [ISCA 2014] Mitigating the Bandwidth Bottleneck with Approximate Load Value Prediction [IEEE Design and Test 2015] Rollback- Free Value Prediction with Approximate Loads [PACT 2014] Axilog: Abstractions for Approximate Hardware Design and Reuse [IEEE Micro 2015] Axilog: Language Support for Approximate Hardware Support [DATE 2015]

19 Rollback Free Value Prediction Front End Pipelines Load / Store Unit Pipeline L1 Cache Write back Interconnection Network Memory Partition L2 Cache Off-chip DRAM Streaming Multiprocessor (SM)

20 Rollback Free Value Prediction Front End Pipelines Load / Store Unit Pipeline L1 Cache Write back Interconnection Network Memory Partition L2 Cache Off-chip DRAM Streaming Multiprocessor (SM)

21 Rollback Free Value Prediction Front End Pipelines Load / Store Unit Pipeline L1 Cache Write back Full Interconnection Network Memory Partition L2 Cache Off-chip DRAM Streaming Multiprocessor (SM)

22 Rollback Free Value Prediction Front End Pipelines RFVP Predictor Load / Store Unit Pipeline L1 Cache Write back Interconnection Network Memory Partition L2 Cache Off-chip DRAM Streaming Multiprocessor (SM) RFVP Predictor quickly predicts values for approximate load misses RFVP technique mitigates the memory bandwidth bottleneck

23 Speedup 1.4 Energy Reduction 1.3 Bandwidth Consumption Reduction 1.5 Quality Reduction 10 % Publications [1] Amir Yazdanbakhsh et al., Mitigating the Bandwidth Bottleneck with Approximate Load Value Prediction IEEE Design and Test [2] Amir Yazdanbakhsh et al., RFVP: Rollback- Free Value Prediction with Safe- to- Approximate Loads Architecture and Code Optimization (TACO) [submitted]. [3] Bradley Thwaites et al., Rollback- Free Value Prediction with Approximate Loads International Conference on Parallel Architectures and Compilation Techniques (PACT) 2014.

24 module fir (clk, rst, x, y) clk rst input clk, rst; d0 d1 d2 d3 x input [15:0] x; b0 b1 b2 b3 output [31:0] y; m0 m1 m2 m3 * w0 w1 w2 w3 multiplier m1 (b1, d1, w1); multiplier m2 (b2, d2, w2); a1 a2 a3 w4 w5 adder a1 (w0, w1, w4); adder a2 (w2, w4, w5); register r1 (clk, rst, x, d0); register r2 (clk, rst, d0, d1); endmodule y

25 module fir (clk, rst, x, y) clk input clk, rst; rst d0 d1 d2 d3 x input [15:0] x; output [31:0] y; b0 b1 b2 b3 m0 m1 m2 m3 * w0 w1 w2 w3 multiplier m1 (b1, d1, w1); multiplier m2 (b2, d2, w2); + a1 + a2 w4 + a3 w5 adder a1 (w0, w1, w4); adder a2 (w2, w4, w5); register r1 (clk, rst, x, d0); register r2 (clk, rst, d0, d1); relax(y); endmodule relax(y) y

26 module fir (clk, rst, x, y) clk input clk, rst; rst d0 d1 d2 d3 x input [15:0] x; output [31:0] y; b0 b1 b2 b3 m0 m1 m2 m3 * w0 w1 w2 w3 multiplier m1 (b1, d1, w1); multiplier m2 (b2, d2, w2); + a1 + a2 w4 + a3 w5 adder a1 (w0, w1, w4); adder a2 (w2, w4, w5); register r1 (clk, rst, x, d0); register r2 (clk, rst, d0, d1); relax(y); endmodule relax(y) y

27 module fir (clk, rst, x, y) clk input clk, rst; rst d0 d1 d2 d3 input [15:0] x; x output [31:0] y; b0 b1 b2 b3 m0 m1 m2 m3 * multiplier m1 (b1, d1, w1); w0 w1 w2 w3 restrict(w1) restrict(w2) restrict(w3) multiplier m2 (b2, d2, w2); a1 a2 a3 w4 w5 adder a1 (w0, w1, w4); adder a2 (w2, w4, w5); register r1 (clk, rst, x, d0); register r2 (clk, rst, d0, d1); relax(y); restrict(w1); restrict(w2); endmodule relax(y) y

28 module fir (clk, rst, x, y) clk input clk, rst; rst d0 d1 d2 d3 input [15:0] x; x output [31:0] y; b0 b1 b2 b3 m0 m1 m2 m3 * multiplier m1 (b1, d1, w1); w0 w1 w2 w3 restrict(w1) restrict(w2) restrict(w3) multiplier m2 (b2, d2, w2); a1 a2 a3 w4 w5 adder a1 (w0, w1, w4); adder a2 (w2, w4, w5); register r1 (clk, rst, x, d0); register r2 (clk, rst, d0, d1); relax(y); restrict(w1); restrict(w2); endmodule relax(y) y

29 Energy Reduction Area Reduction Code Annotations Quality Reduction 10 % Publications [1] Divya Mahajan et al., Axilog: Abstractions for Approximate Hardware Design and Reuse IEEE Micro [2] Amir Yazdanbakhsh et al., Axilog: Language Support for Approximate Hardware Design Design Automation and Test in Europe (DATE) 2015.

30 Finding the Pareto surface Energy IoT Processor Pareto.Fron0er Mobile Data center Desktop Truffle [ALOS 12] FLEXJAVA [FSE 15] RFVP [PACT 14, IEEE D&T 15] Axilog [DATE 15, IEEE Micro 15] D- NPUs [MICRO 12] A- NPUs [ISCA 14] SNNAP [HPCA 15] GNPU [Micro 15] MITHRA [TechCon 15] Performance (13.5, 11.1, 10%)

Axilog: Language Support for Approximate Hardware Design

Axilog: Language Support for Approximate Hardware Design Axilog: Language Support for Approximate Hardware Design Amir Yazdanbakhsh Divya Mahajan Bradley Thwaites Jongse Park Anandhavel Nagendrakumar Sindhuja Sethuraman Kar,k Ramkrishnan Nishanthi Ravindran

More information

AxBench: A Multiplatform Benchmark Suite for Approximate Computing

AxBench: A Multiplatform Benchmark Suite for Approximate Computing AxBench: A Multiplatform Benchmark Suite for Approximate Computing Amir Yazdanbakhsh, Divya Mahajan, and Hadi Esmaeilzadeh Georgia Institute of Technology Pejman Lotfi-Kamran Institute for Research in

More information

AxBench: A Benchmark Suite for Approximate Computing Across the System Stack

AxBench: A Benchmark Suite for Approximate Computing Across the System Stack AxBench: A Benchmark Suite for Approximate Computing Across the System Stack Amir Yazdanbakhsh Divya Mahajan Pejman Lotfi-Kamran Hadi Esmaeilzadeh Alternative Computing Technologies (ACT) Lab School of

More information

Core. Error Predictor. Figure 1: Architectural overview of our quality control approach. Approximate Accelerator. Precise.

Core. Error Predictor. Figure 1: Architectural overview of our quality control approach. Approximate Accelerator. Precise. Prediction-Based Quality Control for Approximate Accelerators Divya Mahajan Amir Yazdanbakhsh Jongse Park Bradley Thwaites Hadi Esmaeilzadeh Georgia Institute of Technology Abstract Approximate accelerators

More information

Neural Network based Energy-Efficient Fault Tolerant Architect

Neural Network based Energy-Efficient Fault Tolerant Architect Neural Network based Energy-Efficient Fault Tolerant Architectures and Accelerators University of Rochester February 7, 2013 References Flexible Error Protection for Energy Efficient Reliable Architectures

More information

Amir Yazdanbakhsh. (608)

Amir Yazdanbakhsh.   (608) Amir Yazdanbakhsh http://www.cc.gatech.edu/~ayazdanb/ a.yazdanbakhsh@gatech.edu (608).335.6884 RESEARCH INTERESTS Computer Architecture Approximate Computing Architecture Design for ASIC/FPGA Deep (Reinforcement)

More information

Microprocessor Trends and Implications for the Future

Microprocessor Trends and Implications for the Future Microprocessor Trends and Implications for the Future John Mellor-Crummey Department of Computer Science Rice University johnmc@rice.edu COMP 522 Lecture 4 1 September 2016 Context Last two classes: from

More information

RFVP: Rollback-Free Value Prediction with Safe-to-Approximate Loads

RFVP: Rollback-Free Value Prediction with Safe-to-Approximate Loads RFVP: Rollback-Free Value Prediction with Safe-to-Approximate Loads Amir Yazdanbakhsh Gennady Pekhimenko Bradley Thwaites Hadi Esmaeilzadeh Taesoo Kim Onur Mutlu Todd C. Mowry Georgia Institute of Technology

More information

Chapter 1: Fundamentals of Quantitative Design and Analysis

Chapter 1: Fundamentals of Quantitative Design and Analysis 1 / 12 Chapter 1: Fundamentals of Quantitative Design and Analysis Be careful in this chapter. It contains a tremendous amount of information and data about the changes in computer architecture since the

More information

Compilation and Hardware Support for Approximate Acceleration

Compilation and Hardware Support for Approximate Acceleration Compilation and Hardware Support for Approximate Acceleration Thierry Moreau Adrian Sampson Andre Baixo Mark Wyse Ben Ransford Jacob Nelson Luis Ceze Mark Oskin University of Washington Abstract Approximate

More information

Neural Acceleration for GPU Throughput Processors

Neural Acceleration for GPU Throughput Processors Appears in the Proceedings of the 48 th Annual IEEE/ACM International Symposium on Microarchitecture, 2015 Neural Acceleration for Throughput Processors Amir Yazdanbakhsh Jongse Park Hardik Sharma Pejman

More information

Lecture 1: Gentle Introduction to GPUs

Lecture 1: Gentle Introduction to GPUs CSCI-GA.3033-004 Graphics Processing Units (GPUs): Architecture and Programming Lecture 1: Gentle Introduction to GPUs Mohamed Zahran (aka Z) mzahran@cs.nyu.edu http://www.mzahran.com Who Am I? Mohamed

More information

Neural Acceleration for GPU Throughput Processors

Neural Acceleration for GPU Throughput Processors Neural Acceleration for Throughput Processors Amir Yazdanbakhsh Jongse Park Hardik Sharma Pejman Lotfi-Kamran Hadi Esmaeilzadeh Alternative Computing Technologies (ACT) Lab School of Computer Science,

More information

Approximate Overview of Approximate Computing

Approximate Overview of Approximate Computing Approximate Overview of Approximate Computing Luis Ceze University of Washington PL Architecture With thanks to many colleagues from whom I stole slides: Adrian Sampson, Hadi Esmaeilzadeh, Karin Strauss,

More information

PRIME: A Novel Processing-in-memory Architecture for Neural Network Computation in ReRAM-based Main Memory

PRIME: A Novel Processing-in-memory Architecture for Neural Network Computation in ReRAM-based Main Memory Scalable and Energy-Efficient Architecture Lab (SEAL) PRIME: A Novel Processing-in-memory Architecture for Neural Network Computation in -based Main Memory Ping Chi *, Shuangchen Li *, Tao Zhang, Cong

More information

Computer Architecture s Changing Definition

Computer Architecture s Changing Definition Computer Architecture s Changing Definition 1950s Computer Architecture Computer Arithmetic 1960s Operating system support, especially memory management 1970s to mid 1980s Computer Architecture Instruction

More information

Scale-Out Acceleration for Machine Learning

Scale-Out Acceleration for Machine Learning Scale-Out Acceleration for Machine Learning Jongse Park Hardik Sharma Divya Mahajan Joon Kyung Kim Preston Olds Hadi Esmaeilzadeh Alternative Computing Technologies (ACT) Lab Georgia Institute of Technology

More information

FLEXJAVA:)Language'Support' for'safe'and'modular' Approximate'Programming

FLEXJAVA:)Language'Support' for'safe'and'modular' Approximate'Programming FLEXJAVA:)Language'Support' for'safe'and'modular' Approximate'Programming Jongse Park,'Hadi'Esmaeilzadeh,'Xin'Zhang,' Mayur Naik,'William'Harris Alternative'Computing'Technologies'(ACT)'Lab Georgia'Institute'of'Technology

More information

Power dissipation! The VLSI Interconnect Challenge. Interconnect is the crux of the problem. Interconnect is the crux of the problem.

Power dissipation! The VLSI Interconnect Challenge. Interconnect is the crux of the problem. Interconnect is the crux of the problem. The VLSI Interconnect Challenge Avinoam Kolodny Electrical Engineering Department Technion Israel Institute of Technology VLSI Challenges System complexity Performance Tolerance to digital noise and faults

More information

Gables: A Roofline Model for Mobile SoCs

Gables: A Roofline Model for Mobile SoCs Gables: A Roofline Model for Mobile SoCs Mark D. Hill, Wisconsin & Former Google Intern Vijay Janapa Reddi, Harvard & Former Google Intern HPCA, Feb 2019 Outline Motivation Gables Model Example Balanced

More information

Neural Computer Architectures

Neural Computer Architectures Neural Computer Architectures 5kk73 Embedded Computer Architecture By: Maurice Peemen Date: Convergence of different domains Neurobiology Applications 1 Constraints Machine Learning Technology Innovations

More information

Neural Acceleration for General-Purpose Approximate Programs

Neural Acceleration for General-Purpose Approximate Programs 2012 IEEE/ACM 45th Annual International Symposium on Microarchitecture Neural Acceleration for General-Purpose Approximate Programs Hadi Esmaeilzadeh Adrian Sampson Luis Ceze Doug Burger University of

More information

System-on-Chip Architecture for Mobile Applications. Sabyasachi Dey

System-on-Chip Architecture for Mobile Applications. Sabyasachi Dey System-on-Chip Architecture for Mobile Applications Sabyasachi Dey Email: sabyasachi.dey@gmail.com Agenda What is Mobile Application Platform Challenges Key Architecture Focus Areas Conclusion Mobile Revolution

More information

Computer Architecture

Computer Architecture Computer Architecture Slide Sets WS 2013/2014 Prof. Dr. Uwe Brinkschulte M.Sc. Benjamin Betting Part 10 Thread and Task Level Parallelism Computer Architecture Part 10 page 1 of 36 Prof. Dr. Uwe Brinkschulte,

More information

Copyright 2012, Elsevier Inc. All rights reserved.

Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 1 Fundamentals of Quantitative Design and Analysis 1 Computer Technology Performance improvements: Improvements in semiconductor technology

More information

The Effect of Temperature on Amdahl Law in 3D Multicore Era

The Effect of Temperature on Amdahl Law in 3D Multicore Era The Effect of Temperature on Amdahl Law in 3D Multicore Era L Yavits, A Morad, R Ginosar Abstract This work studies the influence of temperature on performance and scalability of 3D Chip Multiprocessors

More information

Computer Architecture: Multi-Core Processors: Why? Onur Mutlu & Seth Copen Goldstein Carnegie Mellon University 9/11/13

Computer Architecture: Multi-Core Processors: Why? Onur Mutlu & Seth Copen Goldstein Carnegie Mellon University 9/11/13 Computer Architecture: Multi-Core Processors: Why? Onur Mutlu & Seth Copen Goldstein Carnegie Mellon University 9/11/13 Moore s Law Moore, Cramming more components onto integrated circuits, Electronics,

More information

Fundamentals of Quantitative Design and Analysis

Fundamentals of Quantitative Design and Analysis Fundamentals of Quantitative Design and Analysis Dr. Jiang Li Adapted from the slides provided by the authors Computer Technology Performance improvements: Improvements in semiconductor technology Feature

More information

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono Introduction to CUDA Algoritmi e Calcolo Parallelo References q This set of slides is mainly based on: " CUDA Technical Training, Dr. Antonino Tumeo, Pacific Northwest National Laboratory " Slide of Applied

More information

A Case for Core-Assisted Bottleneck Acceleration in GPUs Enabling Flexible Data Compression with Assist Warps

A Case for Core-Assisted Bottleneck Acceleration in GPUs Enabling Flexible Data Compression with Assist Warps A Case for Core-Assisted Bottleneck Acceleration in GPUs Enabling Flexible Data Compression with Assist Warps Nandita Vijaykumar Gennady Pekhimenko, Adwait Jog, Abhishek Bhowmick, Rachata Ausavarangnirun,

More information

Computer Architecture: Multi-Core Processors: Why? Prof. Onur Mutlu Carnegie Mellon University

Computer Architecture: Multi-Core Processors: Why? Prof. Onur Mutlu Carnegie Mellon University Computer Architecture: Multi-Core Processors: Why? Prof. Onur Mutlu Carnegie Mellon University Moore s Law Moore, Cramming more components onto integrated circuits, Electronics, 1965. 2 3 Multi-Core Idea:

More information

Computer Architecture A Quantitative Approach, Fifth Edition. Chapter 1. Copyright 2012, Elsevier Inc. All rights reserved. Computer Technology

Computer Architecture A Quantitative Approach, Fifth Edition. Chapter 1. Copyright 2012, Elsevier Inc. All rights reserved. Computer Technology Computer Architecture A Quantitative Approach, Fifth Edition Chapter 1 Fundamentals of Quantitative Design and Analysis 1 Computer Technology Performance improvements: Improvements in semiconductor technology

More information

Exploiting Hidden Layer Modular Redundancy for Fault-Tolerance in Neural Network Accelerators

Exploiting Hidden Layer Modular Redundancy for Fault-Tolerance in Neural Network Accelerators Exploiting Hidden Layer Modular Redundancy for Fault-Tolerance in Neural Network Accelerators Schuyler Eldridge Ajay Joshi Department of Electrical and Computer Engineering, Boston University schuye@bu.edu

More information

Outline Marquette University

Outline Marquette University COEN-4710 Computer Hardware Lecture 1 Computer Abstractions and Technology (Ch.1) Cristinel Ababei Department of Electrical and Computer Engineering Credits: Slides adapted primarily from presentations

More information

Performance Characterization, Prediction, and Optimization for Heterogeneous Systems with Multi-Level Memory Interference

Performance Characterization, Prediction, and Optimization for Heterogeneous Systems with Multi-Level Memory Interference The 2017 IEEE International Symposium on Workload Characterization Performance Characterization, Prediction, and Optimization for Heterogeneous Systems with Multi-Level Memory Interference Shin-Ying Lee

More information

Energy Efficient Computing Systems (EECS) Magnus Jahre Coordinator, EECS

Energy Efficient Computing Systems (EECS) Magnus Jahre Coordinator, EECS Energy Efficient Computing Systems (EECS) Magnus Jahre Coordinator, EECS Who am I? Education Master of Technology, NTNU, 2007 PhD, NTNU, 2010. Title: «Managing Shared Resources in Chip Multiprocessor Memory

More information

Approximate Computing on Programmable SoCs via Neural Acceleration

Approximate Computing on Programmable SoCs via Neural Acceleration University of Washington Computer Science and Engineering Technical Report UW-CSE-14-05-01 Approximate Computing on Programmable SoCs via Neural Acceleration Thierry Moreau Jacob Nelson Adrian Sampson

More information

Lecture 1: Introduction and Basics

Lecture 1: Introduction and Basics CS 515 Programming Language and Compilers I Lecture 1: Introduction and Basics Zheng (Eddy) Zhang Rutgers University Fall 2017, 9/5/2017 Class Information Instructor: Zheng (Eddy) Zhang Email: eddyzhengzhang@gmailcom

More information

DNN ENGINE: A 16nm Sub-uJ DNN Inference Accelerator for the Embedded Masses

DNN ENGINE: A 16nm Sub-uJ DNN Inference Accelerator for the Embedded Masses DNN ENGINE: A 16nm Sub-uJ DNN Inference Accelerator for the Embedded Masses Paul N. Whatmough 1,2 S. K. Lee 2, N. Mulholland 2, P. Hansen 2, S. Kodali 3, D. Brooks 2, G.-Y. Wei 2 1 ARM Research, Boston,

More information

Maximizing Server Efficiency from μarch to ML accelerators. Michael Ferdman

Maximizing Server Efficiency from μarch to ML accelerators. Michael Ferdman Maximizing Server Efficiency from μarch to ML accelerators Michael Ferdman Maximizing Server Efficiency from μarch to ML accelerators Michael Ferdman Maximizing Server Efficiency with ML accelerators Michael

More information

GPU Programming. Lecture 1: Introduction. Miaoqing Huang University of Arkansas 1 / 27

GPU Programming. Lecture 1: Introduction. Miaoqing Huang University of Arkansas 1 / 27 1 / 27 GPU Programming Lecture 1: Introduction Miaoqing Huang University of Arkansas 2 / 27 Outline Course Introduction GPUs as Parallel Computers Trend and Design Philosophies Programming and Execution

More information

EECS4201 Computer Architecture

EECS4201 Computer Architecture Computer Architecture A Quantitative Approach, Fifth Edition Chapter 1 Fundamentals of Quantitative Design and Analysis These slides are based on the slides provided by the publisher. The slides will be

More information

GPU Architecture. Alan Gray EPCC The University of Edinburgh

GPU Architecture. Alan Gray EPCC The University of Edinburgh GPU Architecture Alan Gray EPCC The University of Edinburgh Outline Why do we want/need accelerators such as GPUs? Architectural reasons for accelerator performance advantages Latest GPU Products From

More information

CUDA Optimizations WS Intelligent Robotics Seminar. Universität Hamburg WS Intelligent Robotics Seminar Praveen Kulkarni

CUDA Optimizations WS Intelligent Robotics Seminar. Universität Hamburg WS Intelligent Robotics Seminar Praveen Kulkarni CUDA Optimizations WS 2014-15 Intelligent Robotics Seminar 1 Table of content 1 Background information 2 Optimizations 3 Summary 2 Table of content 1 Background information 2 Optimizations 3 Summary 3

More information

CS 475: Parallel Programming Introduction

CS 475: Parallel Programming Introduction CS 475: Parallel Programming Introduction Wim Bohm, Sanjay Rajopadhye Colorado State University Fall 2014 Course Organization n Let s make a tour of the course website. n Main pages Home, front page. Syllabus.

More information

ELCT 912: Advanced Embedded Systems

ELCT 912: Advanced Embedded Systems ELCT 912: Advanced Embedded Systems Lecture 2-3: Embedded System Hardware Dr. Mohamed Abd El Ghany, Department of Electronics and Electrical Engineering Embedded System Hardware Used for processing of

More information

Computing architectures Part 2 TMA4280 Introduction to Supercomputing

Computing architectures Part 2 TMA4280 Introduction to Supercomputing Computing architectures Part 2 TMA4280 Introduction to Supercomputing NTNU, IMF January 16. 2017 1 Supercomputing What is the motivation for Supercomputing? Solve complex problems fast and accurately:

More information

Deep Learning Accelerators

Deep Learning Accelerators Deep Learning Accelerators Abhishek Srivastava (as29) Samarth Kulshreshtha (samarth5) University of Illinois, Urbana-Champaign Submitted as a requirement for CS 433 graduate student project Outline Introduction

More information

Computer and Hardware Architecture II. Benny Thörnberg Associate Professor in Electronics

Computer and Hardware Architecture II. Benny Thörnberg Associate Professor in Electronics Computer and Hardware Architecture II Benny Thörnberg Associate Professor in Electronics Parallelism Microscopic vs Macroscopic Microscopic parallelism hardware solutions inside system components providing

More information

Parallel Processing SIMD, Vector and GPU s cont.

Parallel Processing SIMD, Vector and GPU s cont. Parallel Processing SIMD, Vector and GPU s cont. EECS4201 Fall 2016 York University 1 Multithreading First, we start with multithreading Multithreading is used in GPU s 2 1 Thread Level Parallelism ILP

More information

Course web site: teaching/courses/car. Piazza discussion forum:

Course web site:   teaching/courses/car. Piazza discussion forum: Announcements Course web site: http://www.inf.ed.ac.uk/ teaching/courses/car Lecture slides Tutorial problems Courseworks Piazza discussion forum: http://piazza.com/ed.ac.uk/spring2018/car Tutorials start

More information

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono Introduction to CUDA Algoritmi e Calcolo Parallelo References This set of slides is mainly based on: CUDA Technical Training, Dr. Antonino Tumeo, Pacific Northwest National Laboratory Slide of Applied

More information

Big Data Systems on Future Hardware. Bingsheng He NUS Computing

Big Data Systems on Future Hardware. Bingsheng He NUS Computing Big Data Systems on Future Hardware Bingsheng He NUS Computing http://www.comp.nus.edu.sg/~hebs/ 1 Outline Challenges for Big Data Systems Why Hardware Matters? Open Challenges Summary 2 3 ANYs in Big

More information

Lecture 1: Introduction

Lecture 1: Introduction Contemporary Computer Architecture Instruction set architecture Lecture 1: Introduction CprE 581 Computer Systems Architecture, Fall 2016 Reading: Textbook, Ch. 1.1-1.7 Microarchitecture; examples: Pipeline

More information

CUDA. Matthew Joyner, Jeremy Williams

CUDA. Matthew Joyner, Jeremy Williams CUDA Matthew Joyner, Jeremy Williams Agenda What is CUDA? CUDA GPU Architecture CPU/GPU Communication Coding in CUDA Use cases of CUDA Comparison to OpenCL What is CUDA? What is CUDA? CUDA is a parallel

More information

CSE 591/392: GPU Programming. Introduction. Klaus Mueller. Computer Science Department Stony Brook University

CSE 591/392: GPU Programming. Introduction. Klaus Mueller. Computer Science Department Stony Brook University CSE 591/392: GPU Programming Introduction Klaus Mueller Computer Science Department Stony Brook University First: A Big Word of Thanks! to the millions of computer game enthusiasts worldwide Who demand

More information

Is There A Tradeoff Between Programmability and Performance?

Is There A Tradeoff Between Programmability and Performance? Is There A Tradeoff Between Programmability and Performance? Robert Halstead Jason Villarreal Jacquard Computing, Inc. Roger Moussalli Walid Najjar Abstract While the computational power of Field Programmable

More information

CUDA PROGRAMMING MODEL Chaithanya Gadiyam Swapnil S Jadhav

CUDA PROGRAMMING MODEL Chaithanya Gadiyam Swapnil S Jadhav CUDA PROGRAMMING MODEL Chaithanya Gadiyam Swapnil S Jadhav CMPE655 - Multiple Processor Systems Fall 2015 Rochester Institute of Technology Contents What is GPGPU? What s the need? CUDA-Capable GPU Architecture

More information

ECE 588/688 Advanced Computer Architecture II

ECE 588/688 Advanced Computer Architecture II ECE 588/688 Advanced Computer Architecture II Instructor: Alaa Alameldeen alaa@ece.pdx.edu Fall 2009 Portland State University Copyright by Alaa Alameldeen and Haitham Akkary 2009 1 When and Where? When:

More information

ECE 8823: GPU Architectures. Objectives

ECE 8823: GPU Architectures. Objectives ECE 8823: GPU Architectures Introduction 1 Objectives Distinguishing features of GPUs vs. CPUs Major drivers in the evolution of general purpose GPUs (GPGPUs) 2 1 Chapter 1 Chapter 2: 2.2, 2.3 Reading

More information

Marching Memory マーチングメモリ. UCAS-6 6 > Stanford > Imperial > Verify 中村維男 Based on Patent Application by Tadao Nakamura and Michael J.

Marching Memory マーチングメモリ. UCAS-6 6 > Stanford > Imperial > Verify 中村維男 Based on Patent Application by Tadao Nakamura and Michael J. UCAS-6 6 > Stanford > Imperial > Verify 2011 Marching Memory マーチングメモリ Tadao Nakamura 中村維男 Based on Patent Application by Tadao Nakamura and Michael J. Flynn 1 Copyright 2010 Tadao Nakamura C-M-C Computer

More information

Transistors and Wires

Transistors and Wires Computer Architecture A Quantitative Approach, Fifth Edition Chapter 1 Fundamentals of Quantitative Design and Analysis Part II These slides are based on the slides provided by the publisher. The slides

More information

Advanced and parallel architectures

Advanced and parallel architectures Cognome Nome Advanced and parallel architectures Prof. A. Massini June 11, 2015 Exercise 1a (2 points) Exercise 1b (2 points) Exercise 2 (5 points) Exercise 3 (3 points) Exercise 4a (3 points) Exercise

More information

Energy Efficient K-Means Clustering for an Intel Hybrid Multi-Chip Package

Energy Efficient K-Means Clustering for an Intel Hybrid Multi-Chip Package High Performance Machine Learning Workshop Energy Efficient K-Means Clustering for an Intel Hybrid Multi-Chip Package Matheus Souza, Lucas Maciel, Pedro Penna, Henrique Freitas 24/09/2018 Agenda Introduction

More information

Computer Architecture!

Computer Architecture! Informatics 3 Computer Architecture! Dr. Boris Grot and Dr. Vijay Nagarajan!! Institute for Computing Systems Architecture, School of Informatics! University of Edinburgh! General Information! Instructors

More information

Computer Architecture!

Computer Architecture! Informatics 3 Computer Architecture! Dr. Boris Grot and Dr. Vijay Nagarajan!! Institute for Computing Systems Architecture, School of Informatics! University of Edinburgh! General Information! Instructors:!

More information

Advanced Computer Architecture (CS620)

Advanced Computer Architecture (CS620) Advanced Computer Architecture (CS620) Background: Good understanding of computer organization (eg.cs220), basic computer architecture (eg.cs221) and knowledge of probability, statistics and modeling (eg.cs433).

More information

Embedded Systems: Hardware Components (part I) Todor Stefanov

Embedded Systems: Hardware Components (part I) Todor Stefanov Embedded Systems: Hardware Components (part I) Todor Stefanov Leiden Embedded Research Center Leiden Institute of Advanced Computer Science Leiden University, The Netherlands Outline Generic Embedded System

More information

High Performance Computing

High Performance Computing High Performance Computing 9th Lecture 2016/10/28 YUKI ITO 1 Selected Paper: vdnn: Virtualized Deep Neural Networks for Scalable, MemoryEfficient Neural Network Design Minsoo Rhu, Natalia Gimelshein, Jason

More information

Portland State University ECE 588/688. Graphics Processors

Portland State University ECE 588/688. Graphics Processors Portland State University ECE 588/688 Graphics Processors Copyright by Alaa Alameldeen 2018 Why Graphics Processors? Graphics programs have different characteristics from general purpose programs Highly

More information

New Challenges in Microarchitecture and Compiler Design

New Challenges in Microarchitecture and Compiler Design New Challenges in Microarchitecture and Compiler Design Contributors: Jesse Fang Tin-Fook Ngai Fred Pollack Intel Fellow Director of Microprocessor Research Labs Intel Corporation fred.pollack@intel.com

More information

Introduction. CSCI 4850/5850 High-Performance Computing Spring 2018

Introduction. CSCI 4850/5850 High-Performance Computing Spring 2018 Introduction CSCI 4850/5850 High-Performance Computing Spring 2018 Tae-Hyuk (Ted) Ahn Department of Computer Science Program of Bioinformatics and Computational Biology Saint Louis University What is Parallel

More information

REAL TIME DIGITAL SIGNAL PROCESSING

REAL TIME DIGITAL SIGNAL PROCESSING REAL TIME DIGITAL SIGNAL PROCESSING UTN - FRBA 2011 www.electron.frba.utn.edu.ar/dplab Introduction Why Digital? A brief comparison with analog. Advantages Flexibility. Easily modifiable and upgradeable.

More information

Dr. Yassine Hariri CMC Microsystems

Dr. Yassine Hariri CMC Microsystems Dr. Yassine Hariri Hariri@cmc.ca CMC Microsystems 03-26-2013 Agenda MCES Workshop Agenda and Topics Canada s National Design Network and CMC Microsystems Processor Eras: Background and History Single core

More information

Computer Architecture. R. Poss

Computer Architecture. R. Poss Computer Architecture R. Poss 1 ca01-10 september 2015 Course & organization 2 ca01-10 september 2015 Aims of this course The aims of this course are: to highlight current trends to introduce the notion

More information

EITF20: Computer Architecture Part1.1.1: Introduction

EITF20: Computer Architecture Part1.1.1: Introduction EITF20: Computer Architecture Part1.1.1: Introduction Liang Liu liang.liu@eit.lth.se 1 Course Factor Computer Architecture (7.5HP) http://www.eit.lth.se/kurs/eitf20 EIT s Course Service Desk (studerandeexpedition)

More information

Parallelizing FPGA Technology Mapping using GPUs. Doris Chen Deshanand Singh Aug 31 st, 2010

Parallelizing FPGA Technology Mapping using GPUs. Doris Chen Deshanand Singh Aug 31 st, 2010 Parallelizing FPGA Technology Mapping using GPUs Doris Chen Deshanand Singh Aug 31 st, 2010 Motivation: Compile Time In last 12 years: 110x increase in FPGA Logic, 23x increase in CPU speed, 4.8x gap Question:

More information

Computer Architecture!

Computer Architecture! Informatics 3 Computer Architecture! Dr. Vijay Nagarajan and Prof. Nigel Topham! Institute for Computing Systems Architecture, School of Informatics! University of Edinburgh! General Information! Instructors

More information

CO403 Advanced Microprocessors IS860 - High Performance Computing for Security. Basavaraj Talawar,

CO403 Advanced Microprocessors IS860 - High Performance Computing for Security. Basavaraj Talawar, CO403 Advanced Microprocessors IS860 - High Performance Computing for Security Basavaraj Talawar, basavaraj@nitk.edu.in Course Syllabus Technology Trends: Transistor Theory. Moore's Law. Delay, Power,

More information

PERFORMANCE METRICS. Mahdi Nazm Bojnordi. CS/ECE 6810: Computer Architecture. Assistant Professor School of Computing University of Utah

PERFORMANCE METRICS. Mahdi Nazm Bojnordi. CS/ECE 6810: Computer Architecture. Assistant Professor School of Computing University of Utah PERFORMANCE METRICS Mahdi Nazm Bojnordi Assistant Professor School of Computing University of Utah CS/ECE 6810: Computer Architecture Overview Announcement Sept. 5 th : Homework 1 release (due on Sept.

More information

CS 352H Computer Systems Architecture Exam #1 - Prof. Keckler October 11, 2007

CS 352H Computer Systems Architecture Exam #1 - Prof. Keckler October 11, 2007 CS 352H Computer Systems Architecture Exam #1 - Prof. Keckler October 11, 2007 Name: Solutions (please print) 1-3. 11 points 4. 7 points 5. 7 points 6. 20 points 7. 30 points 8. 25 points Total (105 pts):

More information

Fundamentals of Computer Design

Fundamentals of Computer Design CS359: Computer Architecture Fundamentals of Computer Design Yanyan Shen Department of Computer Science and Engineering 1 Defining Computer Architecture Agenda Introduction Classes of Computers 1.3 Defining

More information

AWS & Intel: A Partnership Dedicated to fueling your Innovations. Thomas Kellerer BDM CSP, Intel Central Europe

AWS & Intel: A Partnership Dedicated to fueling your Innovations. Thomas Kellerer BDM CSP, Intel Central Europe AWS & Intel: A Partnership Dedicated to fueling your Innovations Thomas Kellerer BDM CSP, Intel Central Europe The Digital Service Economy Growth in connected devices enables new business opportunities

More information

Achieving Lightweight Multicast in Asynchronous Networks-on-Chip Using Local Speculation

Achieving Lightweight Multicast in Asynchronous Networks-on-Chip Using Local Speculation Achieving Lightweight Multicast in Asynchronous Networks-on-Chip Using Local Speculation Kshitij Bhardwaj Dept. of Computer Science Columbia University Steven M. Nowick 2016 ACM/IEEE Design Automation

More information

h Coherence Controllers

h Coherence Controllers High-Throughput h Coherence Controllers Anthony-Trung Nguyen Microprocessor Research Labs Intel Corporation 9/30/03 Motivations Coherence Controller (CC) throughput is bottleneck of scalable systems. CCs

More information

Hardware Software Co-Design: Not Just a Cliché

Hardware Software Co-Design: Not Just a Cliché Hardware Software Co-Design: Not Just a Cliché Adrian Sampson James Bornholt Luis Ceze University of Washington SNAPL 2015 sa pa time immemorial 2005 2015 (not to scale) free lunch time immemorial 2005

More information

Parallel Computing: Parallel Architectures Jin, Hai

Parallel Computing: Parallel Architectures Jin, Hai Parallel Computing: Parallel Architectures Jin, Hai School of Computer Science and Technology Huazhong University of Science and Technology Peripherals Computer Central Processing Unit Main Memory Computer

More information

A 3-D CPU-FPGA-DRAM Hybrid Architecture for Low-Power Computation

A 3-D CPU-FPGA-DRAM Hybrid Architecture for Low-Power Computation A 3-D CPU-FPGA-DRAM Hybrid Architecture for Low-Power Computation Abstract: The power budget is expected to limit the portion of the chip that we can power ON at the upcoming technology nodes. This problem,

More information

Computer Architecture

Computer Architecture Informatics 3 Computer Architecture Dr. Boris Grot and Dr. Vijay Nagarajan Institute for Computing Systems Architecture, School of Informatics University of Edinburgh General Information Instructors: Boris

More information

Adaptable Intelligence The Next Computing Era

Adaptable Intelligence The Next Computing Era Adaptable Intelligence The Next Computing Era Hot Chips, August 21, 2018 Victor Peng, CEO, Xilinx Pervasive Intelligence from Cloud to Edge to Endpoints >> 1 Exponential Growth and Opportunities Data Explosion

More information

COSC 6385 Computer Architecture - Thread Level Parallelism (I)

COSC 6385 Computer Architecture - Thread Level Parallelism (I) COSC 6385 Computer Architecture - Thread Level Parallelism (I) Edgar Gabriel Spring 2014 Long-term trend on the number of transistor per integrated circuit Number of transistors double every ~18 month

More information

Toward a Memory-centric Architecture

Toward a Memory-centric Architecture Toward a Memory-centric Architecture Martin Fink EVP & Chief Technology Officer Western Digital Corporation August 8, 2017 1 SAFE HARBOR DISCLAIMERS Forward-Looking Statements This presentation contains

More information

SOLVING THE DRAM SCALING CHALLENGE: RETHINKING THE INTERFACE BETWEEN CIRCUITS, ARCHITECTURE, AND SYSTEMS

SOLVING THE DRAM SCALING CHALLENGE: RETHINKING THE INTERFACE BETWEEN CIRCUITS, ARCHITECTURE, AND SYSTEMS SOLVING THE DRAM SCALING CHALLENGE: RETHINKING THE INTERFACE BETWEEN CIRCUITS, ARCHITECTURE, AND SYSTEMS Samira Khan MEMORY IN TODAY S SYSTEM Processor DRAM Memory Storage DRAM is critical for performance

More information

BREAKING THE MEMORY WALL

BREAKING THE MEMORY WALL BREAKING THE MEMORY WALL CS433 Fall 2015 Dimitrios Skarlatos OUTLINE Introduction Current Trends in Computer Architecture 3D Die Stacking The memory Wall Conclusion INTRODUCTION Ideal Scaling of power

More information

Computer and Information Sciences College / Computer Science Department CS 207 D. Computer Architecture

Computer and Information Sciences College / Computer Science Department CS 207 D. Computer Architecture Computer and Information Sciences College / Computer Science Department CS 207 D Computer Architecture The Computer Revolution Progress in computer technology Underpinned by Moore s Law Makes novel applications

More information

Classification of Semiconductor LSI

Classification of Semiconductor LSI Classification of Semiconductor LSI 1. Logic LSI: ASIC: Application Specific LSI (you have to develop. HIGH COST!) For only mass production. ASSP: Application Specific Standard Product (you can buy. Low

More information

CSCI-GA Multicore Processors: Architecture & Programming Lecture 3: The Memory System You Can t Ignore it!

CSCI-GA Multicore Processors: Architecture & Programming Lecture 3: The Memory System You Can t Ignore it! CSCI-GA.3033-012 Multicore Processors: Architecture & Programming Lecture 3: The Memory System You Can t Ignore it! Mohamed Zahran (aka Z) mzahran@cs.nyu.edu http://www.mzahran.com Memory Computer Technology

More information

In-DRAM Near-Data Approximate Acceleration for GPUs

In-DRAM Near-Data Approximate Acceleration for GPUs Appears in the Proceedings of the 27 th International Conference on Parallel Architectures and Compilation Techniques, 2018 In-DRAM Near-Data Approximate Acceleration for GPUs Amir Yazdanbakhsh Choungki

More information

Chapter 7. Digital Design and Computer Architecture, 2 nd Edition. David Money Harris and Sarah L. Harris. Chapter 7 <1>

Chapter 7. Digital Design and Computer Architecture, 2 nd Edition. David Money Harris and Sarah L. Harris. Chapter 7 <1> Chapter 7 Digital Design and Computer Architecture, 2 nd Edition David Money Harris and Sarah L. Harris Chapter 7 Chapter 7 :: Topics Introduction (done) Performance Analysis (done) Single-Cycle Processor

More information

ADVANCES IN PROCESSOR DESIGN AND THE EFFECTS OF MOORES LAW AND AMDAHLS LAW IN RELATION TO THROUGHPUT MEMORY CAPACITY AND PARALLEL PROCESSING

ADVANCES IN PROCESSOR DESIGN AND THE EFFECTS OF MOORES LAW AND AMDAHLS LAW IN RELATION TO THROUGHPUT MEMORY CAPACITY AND PARALLEL PROCESSING ADVANCES IN PROCESSOR DESIGN AND THE EFFECTS OF MOORES LAW AND AMDAHLS LAW IN RELATION TO THROUGHPUT MEMORY CAPACITY AND PARALLEL PROCESSING Evan Baytan Department of Electrical Engineering and Computer

More information