Power Efficient Solutions w/ FPGAs. Bill Jenkins Altera Sr. Product Specialist for Programming Language Solutions
- Harry Campbell
- 5 years ago
Transcription
2 Power Efficient Solutions w/ FPGAs. Bill Jenkins, Altera Sr. Product Specialist for Programming Language Solutions
3 System Challenges. CPU architecture is inefficient for most parallel computing applications (big data, search). Result: excessive power consumption. Bottlenecks between I/O, memory, and CPU are starving the CPU for data. Result: slow performance (high latency). Market reaction: growth of customized hardware and architectures.
4 Role of FPGA. Resource sharing: virtualization of computation, storage, networking. Accelerators: network acceleration, hypervisor offload, data access acceleration, algorithm acceleration. Cluster computing: CPU and FPGA cluster fabric, cluster interconnect. [Diagram: host CPU, FPGA, DRAM]
5 FPGAs Increase Efficiency in the Data Center. FPGAs can greatly enhance CPU-based data center processing by accelerating algorithms and minimizing bottlenecks. 10X+ increase in performance per watt. Massively parallel architecture: has 10 to 100 times the number of computational units; enables pipelined designs that perform multiple, different instructions in a single clock cycle. Better localized memory avoids bottlenecks. Programmability enables application-specific accelerators. Device figures: >5M logic elements; 3200 Mbps DDR4 SDRAM / 2.5 Tbps HMC; 1.5 TFLOPs floating-point DSP; programmable I/O.
6 Mapping a simple program to an FPGA. High-level code: Mem[100] += 42 * Mem[101]. CPU instructions: R0 ← Load Mem[100]; R1 ← Load Mem[101]; R2 ← Load #42; R2 ← Mul R1, R2; R0 ← Add R2, R0; Store R0 → Mem[100].
7 First let's take a look at execution on a simple CPU. [Diagram: CPU data-path with PC, Fetch, Instruction (Op, Val), Registers (Aaddr, Baddr, Caddr, CWriteEnable, CData), ALU (A, B, C, Op), Load (LdAddr, LdData), Store (StAddr, StData)] Fixed and general architecture: general "cover-all-cases" data-paths; fixed data-widths; fixed operations.
8 Load constant value into register. [Diagram: the same CPU data-path] Very inefficient use of hardware!
9 CPU activity, step by step (the instructions are laid out along the time axis): R0 ← Load Mem[100]; R1 ← Load Mem[101]; R2 ← Load #42; R2 ← Mul R1, R2; R0 ← Add R2, R0; Store R0 → Mem[100].
10 On the FPGA we unroll the CPU hardware (the instructions are laid out in space): R0 ← Load Mem[100]; R1 ← Load Mem[101]; R2 ← Load #42; R2 ← Mul R1, R2; R0 ← Add R2, R0; Store R0 → Mem[100].
11 ... and specialize by position. [Diagram: the six-instruction sequence, unrolled in space] 1. Instructions are fixed: remove Fetch.
12 ... and specialize. 1. Instructions are fixed: remove Fetch. 2. Remove unused ALU ops.
13 ... and specialize. 1. Instructions are fixed: remove Fetch. 2. Remove unused ALU ops. 3. Remove unused Load / Store.
14 ... and specialize. 1. Instructions are fixed: remove Fetch. 2. Remove unused ALU ops. 3. Remove unused Load / Store. 4. Wire up registers properly, and propagate state.
15 ... and specialize. 1. Instructions are fixed: remove Fetch. 2. Remove unused ALU ops. 3. Remove unused Load / Store. 4. Wire up registers properly, and propagate state. 5. Remove dead data.
16 ... and specialize. 1. Instructions are fixed: remove Fetch. 2. Remove unused ALU ops. 3. Remove unused Load / Store. 4. Wire up registers properly, and propagate state. 5. Remove dead data. 6. Reschedule!
17 Custom Data-Path on the FPGA Matches Your Algorithm! High-level code: Mem[100] += 42 * Mem[101]. Custom data-path: [diagram: load, load, constant 42, store]. Build exactly what you need: operations, data widths, memory size and configuration. Efficiency: throughput / latency / power.
18 Architectural Example: Image Processing. $I_{new}(x, y) = \sum_{x'=-1}^{1} \sum_{y'=-1}^{1} I_{old}(x + x', y + y') \cdot F(x', y')$. Convolutions: dataflow can proceed in pipelined fashion. No need to wait until the entire execution is complete. Start a new set of data calculations as soon as the first stage completes its execution.
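19 Processor (CPU/GPU) Implementation
for (int y = 1; y < height-1; ++y) {
    for (int x = 1; x < width-1; ++x) {
        // 3x3 window: offsets -1..1 in each direction
        for (int y2 = -1; y2 <= 1; ++y2) {
            for (int x2 = -1; x2 <= 1; ++x2) {
                // filter is indexed 0..2, so shift the offsets by one
                i2[y][x] += i[y+y2][x+x2] * filter[y2+1][x2+1];
            }
        }
    }
}
[Diagram: CPU, cache, main memory] A cache can hide poor memory access patterns.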
20 FPGA Implementation. Example performance point: 1 pixel per cycle. Cache requirements: 9 reads + 1 write per cycle. [Diagram: memory, cache with 9 read ports, custom data-path] Expensive hardware! Power overhead. Cost overhead: more built-in addressing flexibility than we need. Why not customize the cache for the application?
21 Optimizing the Cache. Start out with the initial picture that is W pixels wide.
22 Optimizing the Cache. Let's remove all the lines that aren't in the neighborhood of the window.
23 Optimizing the Cache. Take all of the lines and arrange them as a 1D array of pixels.
24 Optimizing the Cache. Remove the pixels at the edges that we don't need for the computation.
25 Optimizing the Cache. What happens when we move the window one pixel to the right? We have created a shift register implementation.
26 Shift Registers in Software
[Diagram: shift register running from sr[0], where data_in enters, to sr[2*W+2], with nine taps feeding data_out[9]]
pixel_t sr[2*W+3];
while (keep_going) {
    // Shift data in, highest index first so each value moves exactly one slot
    #pragma unroll
    for (int i = 2*W+2; i > 0; --i)
        sr[i] = sr[i-1];
    sr[0] = data_in;
    // Tap output data
    data_out = {sr[0],   sr[1],     sr[2],
                sr[W],   sr[W+1],   sr[W+2],
                sr[2*W], sr[2*W+1], sr[2*W+2]};
    //...
}
Managing data movement to match the FPGA's architectural strengths is key to obtaining high performance.
27 Traditional OpenCL Implementation of a Pipeline (CPU/GPU). [Diagram: Kernel 1, Kernel 2, Kernel 3 exchanging buffers through global memory (DDR)] High latency: requires access to global memory. High memory bandwidth. Requires host coordination to pass buffers from one kernel to another. With a particular design example we achieved 183 images/s on a Stratix V PCIe card.
28 Leveraging Kernel-to-Kernel Channels
[Diagram: Kernel 1, Kernel 2, Kernel 3 connected by channels; buffers in global memory (DDR) remain only at the two ends]
Create a queue: value_type channel();
Push data into the queue: void write_channel_altera(channel &ch, value_type data);
Pop the first element from the queue: value_type read_channel_altera(channel &ch);
Channel declaration: channel int my_channel;
Channel write: write_channel_altera(my_channel, x);
Channel read: int y = read_channel_altera(my_channel);
Low-latency communication between kernels. Significantly less memory bandwidth requirements. Host is not involved in coordinating communication between kernels. This implementation on the same Stratix V PCIe card resulted in 400 images/s.
29 FPGA Code
#pragma OPENCL EXTENSION cl_altera_channels : enable
// Declaration of Channel API data types
channel float prod_k1_channel;
channel float k1_k2_channel;
channel float k2_k3_channel;
channel float k3_res_channel;
kernel void convolution_prod(int batch_id_begin, int batch_id_end, global const volatile float * restrict input_global) {
    for (...) {
        write_channel_altera(prod_k1_channel, input_global[...]);
    }
    write_channel_altera(k1_k2_channel, input_global[...]);
    ...
}
Kernels are written as standard building blocks that are connected together through channels. The concept of having multiple concurrent kernels executing simultaneously and communicating directly on a device is currently unique to FPGAs. Offered as a vendor extension; portable in OpenCL 2.0 through the concept of OpenCL pipes.
30 Migration Between FPGAs. In OpenCL, a float uses soft logic in older FPGAs. Gen10 FPGAs have hardened floating-point logic built into the DSP blocks. On Arria 10, using the same code results in processing 6800 images/s. Stratix 10 expectations: large increase in floating-point resources; higher internal frequencies achievable; 1.6x-2x performance increase; 12x-16x performance/watt efficiency versus Stratix V.
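For reference, the ratios implied by the three throughput figures quoted on these slides work out as follows. This assumes, as the slides suggest, that all three numbers refer to the same design and workload; the rounded values are simple arithmetic, not additional measurements.

```latex
\begin{align*}
\text{buffers} \to \text{channels (Stratix V):} \quad & \frac{400}{183} \approx 2.2\times \\
\text{Stratix V (channels)} \to \text{Arria 10:} \quad & \frac{6800}{400} = 17\times \\
\text{Stratix V (buffers)} \to \text{Arria 10:} \quad & \frac{6800}{183} \approx 37\times
\end{align*}
```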
31 Additional Improvements: IO Channels. Kernel channels are between OpenCL kernels. IO channels take data directly from and to IO interfaces in the FPGA. A camera or video feed could be processed directly in the FPGA without going through the host. The result could be passed out to the graphics card to be displayed, or back to host memory for the host to use. Private, local, and global memory can now be used to buffer as needed. [Diagram: IO channels feeding Kernel 1, Kernel 2, Kernel 3 inside the FPGA, connected by kernel channels/pipes]
32 Lessons Learned. Exploiting pipelining on the FPGA requires some attention to coding style to overcome the inherent assumptions of writing software. FPGAs do not have caches: data reuse needs to be exploited in a more explicit way. The concept of dataflow pipelining will not realize its full potential if we write intermediate results to memory: bandwidth limitations begin to dominate compute, so use direct kernel-to-kernel communication, called channels. Native support for floating point on the FPGA allows an order-of-magnitude performance increase. Code can be ported to newer FPGAs without modification to get a performance increase. IO channels can lower latency and improve performance even more by taking the host out of the processing chain.
More informationRegisters. Instruction Memory A L U. Data Memory C O N T R O L M U X A D D A D D. Sh L 2 M U X. Sign Ext M U X ALU CTL INSTRUCTION FETCH
PC Instruction Memory 4 M U X Registers Sign Ext M U X Sh L 2 Data Memory M U X C O T R O L ALU CTL ISTRUCTIO FETCH ISTR DECODE REG FETCH EXECUTE/ ADDRESS CALC MEMOR ACCESS WRITE BACK A D D A D D A L U
More informationComputer Architecture and Engineering CS152 Quiz #4 April 11th, 2012 Professor Krste Asanović
Computer Architecture and Engineering CS152 Quiz #4 April 11th, 2012 Professor Krste Asanović Name: ANSWER SOLUTIONS This is a closed book, closed notes exam. 80 Minutes 17 Pages Notes: Not all questions
More informationModule 4c: Pipelining
Module 4c: Pipelining R E F E R E N C E S : S T A L L I N G S, C O M P U T E R O R G A N I Z A T I O N A N D A R C H I T E C T U R E M O R R I S M A N O, C O M P U T E R O R G A N I Z A T I O N A N D A
More informationSpring 2018 :: CSE 502. Main Memory & DRAM. Nima Honarmand
Main Memory & DRAM Nima Honarmand Main Memory Big Picture 1) Last-level cache sends its memory requests to a Memory Controller Over a system bus of other types of interconnect 2) Memory controller translates
More informationUnit 2: High-Level Synthesis
Course contents Unit 2: High-Level Synthesis Hardware modeling Data flow Scheduling/allocation/assignment Reading Chapter 11 Unit 2 1 High-Level Synthesis (HLS) Hardware-description language (HDL) synthesis
More informationIntel Many Integrated Core (MIC) Matt Kelly & Ryan Rawlins
Intel Many Integrated Core (MIC) Matt Kelly & Ryan Rawlins Outline History & Motivation Architecture Core architecture Network Topology Memory hierarchy Brief comparison to GPU & Tilera Programming Applications
More informationComputer Architecture Spring 2016
Computer Architecture Spring 2016 Lecture 14: Speculation II Shuai Wang Department of Computer Science and Technology Nanjing University [Slides adapted from CS 246, Harvard University] Tomasulo+ROB Add
More informationNetwork service model. Network service model. Network Layer (part 1) Virtual circuits. By the end of this lecture, you should be able to.
Netork Layer (part ) y the end of this lecture, you should be able to. xplain the operation of distance vector routing algorithm xplain shortest path routing algorithm escribe the major points of RIP and
More informationTwo FPGA-DNN Projects: 1. Low Latency Multi-Layer Perceptrons using FPGAs 2. Acceleration of CNN Training on FPGA-based Clusters
Two FPGA-DNN Projects: 1. Low Latency Multi-Layer Perceptrons using FPGAs 2. Acceleration of CNN Training on FPGA-based Clusters *Argonne National Lab +BU & USTC Presented by Martin Herbordt Work by Ahmed
More informationProcessor: Superscalars Dynamic Scheduling
Processor: Superscalars Dynamic Scheduling Z. Jerry Shi Assistant Professor of Computer Science and Engineering University of Connecticut * Slides adapted from Blumrich&Gschwind/ELE475 03, Peh/ELE475 (Princeton),
More informationExploring different level of parallelism Instruction-level parallelism (ILP): how many of the operations/instructions in a computer program can be performed simultaneously 1. e = a + b 2. f = c + d 3.
More informationParallel Processing SIMD, Vector and GPU s cont.
Parallel Processing SIMD, Vector and GPU s cont. EECS4201 Fall 2016 York University 1 Multithreading First, we start with multithreading Multithreading is used in GPU s 2 1 Thread Level Parallelism ILP
More informationOpenACC (Open Accelerators - Introduced in 2012)
OpenACC (Open Accelerators - Introduced in 2012) Open, portable standard for parallel computing (Cray, CAPS, Nvidia and PGI); introduced in 2012; GNU has an incomplete implementation. Uses directives in
More informationWolfson Control Write Sequencer
Wolfson Control Write Sequencer The Control Write Sequencer is a function that executes pre-programmed sequences of register operations ith a high degree of autonomy from the host processor. This means
More informationPlug-in Board Editor for PLG150-DR/PLG150-PC
Plug-in Board Editor for PLG150-DR/PLG150-PC Oner s Manual Contents Introduction.........................................2 Starting Up.........................................3 Assigning the PLG150-DR/PLG150-PC
More informationLECTURE ON PASCAL GPU ARCHITECTURE. Jiri Kraus, November 14 th 2016
LECTURE ON PASCAL GPU ARCHITECTURE Jiri Kraus, November 14 th 2016 ACCELERATED COMPUTING CPU Optimized for Serial Tasks GPU Accelerator Optimized for Parallel Tasks 2 ACCELERATED COMPUTING CPU Optimized
More information