Bridging Analog Neuromorphic and Digital von Neumann Computing
- Domenic Little
- 6 years ago
Transcription
1 Bridging Analog Neuromorphic and Digital von Neumann Computing. Amir Yazdanbakhsh, Bradley Thwaites. Advisors: Hadi Esmaeilzadeh and Doug Burger. Qualcomm Mentors: Manu Rastogi and Girish Varatkar. Alternative Computing Technologies (ACT) Lab, Georgia Institute of Technology. Qualcomm Innovation Fellowship.
2 Energy is a primary constraint: Data Center, Mobile, Internet of Things.
3 Data growth vs. performance. Data growth trends: IDC's Digital Universe Study, December 2012. Performance growth trends: Esmaeilzadeh et al., Dark Silicon and the End of Multicore Scaling, ISCA
4 Approximate computing: embracing error. Relax the abstraction of near-perfect accuracy in processing, storage, and communication. Allowing errors to happen improves performance, resource utilization, and efficiency.
5 Avoiding overkill design. Approximate computing across the stack: Application, Programming Language, Compiler, Architecture, Microarchitecture, Circuit, Physical Device. [Figure labels: Cost, Precision, Reliability]
6 Adding a third dimension: Embracing Error. [Figure: Processor Pareto Frontier; axes: Energy, Performance; points: Data center, Desktop, Mobile, IoT]
7 Navigating a three-dimensional space. [Figure: Processor Pareto Frontier; axes: Energy, Performance; points: Data center, Desktop, IoT, Mobile]
8 Finding the Pareto surface. [Figure: Processor Pareto Frontier; axes: Energy, Performance; points: IoT, Mobile, Data center, Desktop; labeled point: (13.5, 11.1, 10%)] Projects: Truffle [ASPLOS 12], FLEXJAVA [FSE 15], RFVP [PACT 14, IEEE D&T 15], Axilog [DATE 15, IEEE Micro 15], D-NPUs [MICRO 12], A-NPUs [ISCA 14], SNNAP [HPCA 15], GNPU [MICRO 15], MITHRA [TechCon 15].
9 Accelerating GPU Accelerators: Bridging Neuromorphic and von Neumann Computing. Unleashing the Beast. Amir Yazdanbakhsh et al., Neural Acceleration for GPU Throughput Processors, MICRO 2015.
10 Neural Transformation. [Figure: Analog Neural Network]
11 Analog NPU Integration. Inputs $x_0, \dots, x_i, \dots, x_n$ pass through DACs to produce currents $I(x_0), \dots, I(x_n)$; the weights are represented as resistances $R(w_0), \dots, R(w_n)$; voltage-to-current (V to I) stages feed the summation, and an ADC reads out the result: $y \approx \mathrm{sigmoid}\left(\sum_i I(x_i)\,R(w_i)\right)$. [Figure: an A-NPU attached to a CPU, and a GPU in which every SM is paired with an A-NPU] References: General-Purpose Code Acceleration with Limited-Precision Analog Computation, ISCA 2014; Neural Acceleration for GPU Throughput Processors, MICRO 2015.
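The limited-precision neuron on this slide can be modeled in software. The following is a minimal behavioral sketch in Python, not the authors' circuit model: the 8-bit widths, the value ranges, and the function names `quantize` and `analog_neuron` are assumptions chosen for illustration.

```python
import math


def quantize(value, bits, lo=-1.0, hi=1.0):
    """Clamp value to [lo, hi] and snap it to one of 2**bits levels.

    Stands in for a limited bit-width DAC, resistor ladder, or ADC
    (bit-widths here are hypothetical).
    """
    levels = (1 << bits) - 1
    clamped = min(max(value, lo), hi)
    step = (hi - lo) / levels
    return lo + round((clamped - lo) / step) * step


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def analog_neuron(inputs, weights, in_bits=8, w_bits=8, out_bits=8):
    """Behavioral model of y ~ sigmoid(sum_i I(x_i) * R(w_i))."""
    currents = [quantize(x, in_bits) for x in inputs]        # DAC stage
    resistances = [quantize(w, w_bits) for w in weights]     # weight stage
    summed = sum(i * r for i, r in zip(currents, resistances))
    return quantize(sigmoid(summed), out_bits, lo=0.0, hi=1.0)  # ADC stage


# Compare against the full-precision neuron for one example input.
xs, ws = [0.5, -0.25], [0.03, 0.10]
precise = sigmoid(sum(x * w for x, w in zip(xs, ws)))
approx = analog_neuron(xs, ws)
print(abs(precise - approx))
```

The difference printed at the end is the quantization error introduced by the limited bit-widths alone; circuit non-idealities are not modeled.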
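12 [Circuit diagram of an analog neuron: sign bits $s_{w_0}, \dots, s_{w_n}$ and $s_{x_0}, \dots, s_{x_n}$; inputs $x_0, \dots, x_n$ drive current-steering DACs producing $I(x_0), \dots, I(x_n)$; weights $w_0, \dots, w_n$ set resistor ladders $R(w_0), \dots, R(w_n)$; differential pairs produce $I^{+}(w_i x_i)$, $I^{-}(w_i x_i)$ and $V^{+}(w_i x_i)$, $V^{-}(w_i x_i)$, which combine into $V^{+}\left(\sum w_i x_i\right)$ and $V^{-}\left(\sum w_i x_i\right)$; a differential amplifier and a flash ADC output the sign $s_y$ and $y \approx \mathrm{sigmoid}\left(\sum w_i x_i\right)$]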
13 Analog Compilation Workflow. Challenges: limited bit-width, topology restriction, circuit non-idealities. Stages: Programming; Compilation (profiling, training, code generation); Execution.
Annotated CUDA code:
    uchar4 p = tex2d(img, x, y);
    #pragma(begin_approx)
    a = min(r, min(g, b));
    b = max(r, max(g, b));
    z = ((a + b) > 254) ? 255 : 0;
    #pragma(end_approx)
    dst[img.width * y + x] = z;
The compiler and a customized training algorithm produce the transformed application:
    uchar4 p = tex2d(img, x, y);
    send.n_data %r0;
    send.n_data %r1;
    send.n_data %r2;
    recv.n_data %r4;
    dst[img.width * y + x] = z;
and the accelerator configuration: w0 = 0.03, ..., w8 = 0.10. [Figure: GPU in which every SM is paired with an A-NPU]
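The profiling step of the workflow can be illustrated with a short Python sketch. It assumes that profiling means running the precise annotated region on sample inputs and recording input/output pairs for training; the function names `binarize` and `collect_training_pairs` are hypothetical and not part of the authors' toolchain.

```python
def binarize(r, g, b):
    """Precise version of the region between begin_approx and end_approx."""
    lo = min(r, min(g, b))
    hi = max(r, max(g, b))
    return 255 if (lo + hi) > 254 else 0


def collect_training_pairs(pixels):
    """Record (input, output) pairs that a training step could consume."""
    return [((r, g, b), binarize(r, g, b)) for (r, g, b) in pixels]


# Hypothetical sample pixels standing in for profiled image data.
samples = [(0, 0, 0), (127, 127, 127), (128, 127, 127), (255, 255, 255)]
pairs = collect_training_pairs(samples)
for inputs, output in pairs:
    print(inputs, "->", output)
```

A neural network trained on such pairs would then replace the region, which matches the three inputs sent and one output received in the transformed code on the slide.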
14 Benchmarks (domain: benchmark, size of the approximated region, error)
- Image Processing: binarization, 27 PTX instructions, error 11.43%
- Finance: blackscholes, 96 PTX instructions, error 8.23%
- Machine Learning: convolution, 886 PTX instructions, error 9.29%
- Robotics: inversek2j, 132 PTX instructions, error 10.25%
- 3D Gaming: jmeint, 2,250 PTX instructions, error 19.70%
- Image Processing: laplacian, 51 PTX instructions, error 9.87%
- Machine Vision: meanfilter, 35 PTX instructions, error 9.21%
- Numerical Analysis: newton-raph, 44 PTX instructions, error 11.23%
- Image Processing: sobel, 86 PTX instructions, error 8.03%
- Medical Imaging: srad, 110 PTX instructions, error 9.87%
15 Analog Neuromorphic versus Conventional Computing
16 Kirchhoff's law: $I_{out} = I_0 + I_1 + I_2$. Ohm's law: $V_o = I(x_n) \cdot R(w_n)$. Saturation property of transistors.
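As a worked illustration of how the two laws combine into a weighted sum, consider two inputs. All numeric values below are hypothetical, and the transconductance $g$ of the voltage-to-current stage is an assumed parameter, not a figure from the slides.

```latex
\begin{align*}
V_0 &= I(x_0)\,R(w_0) = 2\,\mathrm{mA} \times 100\,\Omega = 0.2\,\mathrm{V} \\
V_1 &= I(x_1)\,R(w_1) = 1\,\mathrm{mA} \times 300\,\Omega = 0.3\,\mathrm{V} \\
I_{out} &= g\,V_0 + g\,V_1 = 1\,\mathrm{mS} \times (0.2\,\mathrm{V} + 0.3\,\mathrm{V}) = 0.5\,\mathrm{mA}
\end{align*}
```

Ohm's law supplies each multiplication and Kirchhoff's law supplies the addition; the saturation property would then provide the nonlinear, sigmoid-like step.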
17 Speedup, Energy Reduction, Energy Delay. Quality Reduction: 10%. Publications: [1] Amir Yazdanbakhsh et al., Neural Acceleration for GPU Throughput Processors, MICRO. [2] Renée St. Amant et al., General-Purpose Code Acceleration with Limited-Precision Analog Computation, ISCA 2014.
18 System stack: Application, Programming Language, Compiler, Architecture, Microarchitecture, Circuit, Physical Device. Research areas: Software, Architecture, Memory, Hardware Design. Publications:
- FLEXJAVA: Language Support for Safe and Modular Approximate Programming [FSE 2015]
- ExpAX: A Framework for Automating Approximate Programming [Tech Report 2014]
- Neural Acceleration for GPU Throughput Processors [MICRO 2015]
- MITHRA: Controlling Quality Tradeoffs in Approximate Acceleration [TechCon 2015]
- General-Purpose Code Acceleration with Limited-Precision Analog Computation [ISCA 2014]
- Mitigating the Bandwidth Bottleneck with Approximate Load Value Prediction [IEEE Design and Test 2015]
- Rollback-Free Value Prediction with Approximate Loads [PACT 2014]
- Axilog: Abstractions for Approximate Hardware Design and Reuse [IEEE Micro 2015]
- Axilog: Language Support for Approximate Hardware Design [DATE 2015]
19 Rollback-Free Value Prediction. [Figure: Streaming Multiprocessor (SM) with front end, pipelines, load/store unit, L1 cache, and write back; interconnection network; memory partition with L2 cache; off-chip DRAM]
21 Rollback-Free Value Prediction. [Figure: same diagram as slide 19, with the label "Full" added]
22 Rollback-Free Value Prediction. [Figure: same diagram as slide 19, with an RFVP Predictor added in the SM] The RFVP predictor quickly predicts values for approximate load misses. The RFVP technique mitigates the memory bandwidth bottleneck.
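The idea of serving approximate load misses from a predictor can be sketched in Python. This is a toy model under stated assumptions, not the authors' design: it swaps in a simple last-value predictor, and the `fetch_every` parameter (only every n-th miss goes to memory, the rest are dropped) is a hypothetical knob used to show how memory traffic could shrink.

```python
class LastValuePredictor:
    """Toy predictor: remembers the last fetched value per load PC."""

    def __init__(self, default=0):
        self.last = {}
        self.default = default

    def predict(self, pc):
        return self.last.get(pc, self.default)

    def train(self, pc, value):
        self.last[pc] = value


def approx_load(pc, addr, cache, memory, predictor, stats, fetch_every=2):
    """Serve an approximate load; misses return a prediction immediately."""
    if addr in cache:
        stats["hits"] += 1
        return cache[addr]
    stats["misses"] += 1
    guess = predictor.predict(pc)
    if stats["misses"] % fetch_every == 0:
        # Only some misses reach memory; they fill the cache and train
        # the predictor. The other misses are dropped.
        value = memory[addr]
        stats["fetches"] += 1
        cache[addr] = value
        predictor.train(pc, value)
    # Rollback-free: the guess is used even if it differs from memory.
    return guess


memory = {0: 10, 1: 11, 2: 12, 3: 13}
cache, stats = {}, {"hits": 0, "misses": 0, "fetches": 0}
predictor = LastValuePredictor()
results = [approx_load(7, a, cache, memory, predictor, stats) for a in range(4)]
print(results, stats)
```

In this run four misses cause only two memory fetches, at the cost of returning inexact values, which is the quality-for-bandwidth trade the slide describes.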
23 Speedup: 1.4×. Energy Reduction: 1.3×. Bandwidth Consumption Reduction: 1.5×. Quality Reduction: 10%. Publications: [1] Amir Yazdanbakhsh et al., Mitigating the Bandwidth Bottleneck with Approximate Load Value Prediction, IEEE Design and Test. [2] Amir Yazdanbakhsh et al., RFVP: Rollback-Free Value Prediction with Safe-to-Approximate Loads, Architecture and Code Optimization (TACO) [submitted]. [3] Bradley Thwaites et al., Rollback-Free Value Prediction with Approximate Loads, International Conference on Parallel Architectures and Compilation Techniques (PACT) 2014.
24 [Figure: FIR filter datapath with inputs clk, rst, x; registers d0 to d3; coefficients b0 to b3; multipliers m0 to m3 producing w0 to w3; adders a1 to a3 producing w4, w5, and the output y] Code excerpt:
    module fir (clk, rst, x, y);
    input clk, rst;
    input [15:0] x;
    output [31:0] y;
    multiplier m1 (b1, d1, w1);
    multiplier m2 (b2, d2, w2);
    adder a1 (w0, w1, w4);
    adder a2 (w2, w4, w5);
    register r1 (clk, rst, x, d0);
    register r2 (clk, rst, d0, d1);
    endmodule
25 [Figure: FIR filter datapath as on slide 24, with the output annotated relax(y)] Code excerpt:
    module fir (clk, rst, x, y);
    input clk, rst;
    input [15:0] x;
    output [31:0] y;
    multiplier m1 (b1, d1, w1);
    multiplier m2 (b2, d2, w2);
    adder a1 (w0, w1, w4);
    adder a2 (w2, w4, w5);
    register r1 (clk, rst, x, d0);
    register r2 (clk, rst, d0, d1);
    relax(y);
    endmodule
27 [Figure: FIR filter datapath as on slide 24, with the output annotated relax(y) and the wires annotated restrict(w1), restrict(w2), restrict(w3)] Code excerpt:
    module fir (clk, rst, x, y);
    input clk, rst;
    input [15:0] x;
    output [31:0] y;
    multiplier m1 (b1, d1, w1);
    multiplier m2 (b2, d2, w2);
    adder a1 (w0, w1, w4);
    adder a2 (w2, w4, w5);
    register r1 (clk, rst, x, d0);
    register r2 (clk, rst, d0, d1);
    relax(y);
    restrict(w1);
    restrict(w2);
    endmodule
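One way to read the relax and restrict annotations is as a backward analysis over the netlist: approximation spreads from relaxed wires to the elements that drive them and stops at restricted wires. The Python sketch below illustrates that simplified reading only; it is an assumption about the semantics, not the Axilog compiler. The connections for m0, m3, and a3 do not appear in the slide's code excerpt and are guessed from the figure labels.

```python
def approximable_elements(drivers, relaxed, restricted):
    """Walk backward from relaxed wires, stopping at restricted wires.

    drivers maps an output wire to (element name, list of input wires).
    """
    approx, seen = set(), set()
    worklist = list(relaxed)
    while worklist:
        wire = worklist.pop()
        if wire in seen or wire in restricted or wire not in drivers:
            continue  # visited, restricted, or a primary input
        seen.add(wire)
        element, inputs = drivers[wire]
        approx.add(element)
        worklist.extend(inputs)
    return approx


# FIR netlist: m1, m2, a1, a2 follow the slide's code; m0, m3, a3 are guesses.
FIR_DRIVERS = {
    "w0": ("m0", ["b0", "d0"]),
    "w1": ("m1", ["b1", "d1"]),
    "w2": ("m2", ["b2", "d2"]),
    "w3": ("m3", ["b3", "d3"]),
    "w4": ("a1", ["w0", "w1"]),
    "w5": ("a2", ["w2", "w4"]),
    "y": ("a3", ["w3", "w5"]),
}

print(sorted(approximable_elements(FIR_DRIVERS, {"y"}, set())))
print(sorted(approximable_elements(FIR_DRIVERS, {"y"}, {"w1", "w2"})))
```

Under this reading, relax(y) alone marks every multiplier and adder as approximable, while adding restrict(w1) and restrict(w2) keeps multipliers m1 and m2 precise.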
29 Energy Reduction, Area Reduction, Code Annotations. Quality Reduction: 10%. Publications: [1] Divya Mahajan et al., Axilog: Abstractions for Approximate Hardware Design and Reuse, IEEE Micro. [2] Amir Yazdanbakhsh et al., Axilog: Language Support for Approximate Hardware Design, Design Automation and Test in Europe (DATE) 2015.
30 Finding the Pareto surface. [Figure: Processor Pareto Frontier; axes: Energy, Performance; points: IoT, Mobile, Data center, Desktop; labeled point: (13.5, 11.1, 10%)] Projects: Truffle [ASPLOS 12], FLEXJAVA [FSE 15], RFVP [PACT 14, IEEE D&T 15], Axilog [DATE 15, IEEE Micro 15], D-NPUs [MICRO 12], A-NPUs [ISCA 14], SNNAP [HPCA 15], GNPU [MICRO 15], MITHRA [TechCon 15].
A 3-D CPU-FPGA-DRAM Hybrid Architecture for Low-Power Computation Abstract: The power budget is expected to limit the portion of the chip that we can power ON at the upcoming technology nodes. This problem,
More informationComputer Architecture
Informatics 3 Computer Architecture Dr. Boris Grot and Dr. Vijay Nagarajan Institute for Computing Systems Architecture, School of Informatics University of Edinburgh General Information Instructors: Boris
More informationAdaptable Intelligence The Next Computing Era
Adaptable Intelligence The Next Computing Era Hot Chips, August 21, 2018 Victor Peng, CEO, Xilinx Pervasive Intelligence from Cloud to Edge to Endpoints >> 1 Exponential Growth and Opportunities Data Explosion
More informationCOSC 6385 Computer Architecture - Thread Level Parallelism (I)
COSC 6385 Computer Architecture - Thread Level Parallelism (I) Edgar Gabriel Spring 2014 Long-term trend on the number of transistor per integrated circuit Number of transistors double every ~18 month
More informationToward a Memory-centric Architecture
Toward a Memory-centric Architecture Martin Fink EVP & Chief Technology Officer Western Digital Corporation August 8, 2017 1 SAFE HARBOR DISCLAIMERS Forward-Looking Statements This presentation contains
More informationSOLVING THE DRAM SCALING CHALLENGE: RETHINKING THE INTERFACE BETWEEN CIRCUITS, ARCHITECTURE, AND SYSTEMS
SOLVING THE DRAM SCALING CHALLENGE: RETHINKING THE INTERFACE BETWEEN CIRCUITS, ARCHITECTURE, AND SYSTEMS Samira Khan MEMORY IN TODAY S SYSTEM Processor DRAM Memory Storage DRAM is critical for performance
More informationBREAKING THE MEMORY WALL
BREAKING THE MEMORY WALL CS433 Fall 2015 Dimitrios Skarlatos OUTLINE Introduction Current Trends in Computer Architecture 3D Die Stacking The memory Wall Conclusion INTRODUCTION Ideal Scaling of power
More informationComputer and Information Sciences College / Computer Science Department CS 207 D. Computer Architecture
Computer and Information Sciences College / Computer Science Department CS 207 D Computer Architecture The Computer Revolution Progress in computer technology Underpinned by Moore s Law Makes novel applications
More informationClassification of Semiconductor LSI
Classification of Semiconductor LSI 1. Logic LSI: ASIC: Application Specific LSI (you have to develop. HIGH COST!) For only mass production. ASSP: Application Specific Standard Product (you can buy. Low
More informationCSCI-GA Multicore Processors: Architecture & Programming Lecture 3: The Memory System You Can t Ignore it!
CSCI-GA.3033-012 Multicore Processors: Architecture & Programming Lecture 3: The Memory System You Can t Ignore it! Mohamed Zahran (aka Z) mzahran@cs.nyu.edu http://www.mzahran.com Memory Computer Technology
More informationIn-DRAM Near-Data Approximate Acceleration for GPUs
Appears in the Proceedings of the 27 th International Conference on Parallel Architectures and Compilation Techniques, 2018 In-DRAM Near-Data Approximate Acceleration for GPUs Amir Yazdanbakhsh Choungki
More informationChapter 7. Digital Design and Computer Architecture, 2 nd Edition. David Money Harris and Sarah L. Harris. Chapter 7 <1>
Chapter 7 Digital Design and Computer Architecture, 2 nd Edition David Money Harris and Sarah L. Harris Chapter 7 Chapter 7 :: Topics Introduction (done) Performance Analysis (done) Single-Cycle Processor
More informationADVANCES IN PROCESSOR DESIGN AND THE EFFECTS OF MOORES LAW AND AMDAHLS LAW IN RELATION TO THROUGHPUT MEMORY CAPACITY AND PARALLEL PROCESSING
ADVANCES IN PROCESSOR DESIGN AND THE EFFECTS OF MOORES LAW AND AMDAHLS LAW IN RELATION TO THROUGHPUT MEMORY CAPACITY AND PARALLEL PROCESSING Evan Baytan Department of Electrical Engineering and Computer
More information