Power Efficient Solutions w/ FPGAs. Bill Jenkins Altera Sr. Product Specialist for Programming Language Solutions


2 Power Efficient Solutions w/ FPGAs Bill Jenkins Altera Sr. Product Specialist for Programming Language Solutions

3 System Challenges CPU architecture is inefficient for most parallel computing applications (big data, search). Result: excessive power consumption. Bottlenecks between the CPU, memory, and I/O are starving the CPU for data. Result: slow performance (high latency). Market reaction: growth of customized hardware and architectures [Diagram: CPU, Memory, and I/O blocks with bottlenecks at each interface]

4 Role of FPGA Resource Sharing: virtualization of computation, storage, networking. Accelerators: network acceleration, hypervisor offload, data access acceleration, algorithm acceleration. Cluster Computing: CPU and FPGA cluster fabric, cluster interconnect [Diagram: host CPU attached to an FPGA with its own DRAM]

5 FPGAs Increase Efficiency in the Data Center FPGAs can greatly enhance CPU-based data center processing by accelerating algorithms and minimizing bottlenecks: 10X+ increase in performance per watt. Massively parallel architecture has 10 to 100 times the number of computational units and enables pipelined designs that perform multiple / different instructions in a single clock cycle. Better localized memory avoids bottlenecks. Programmability enables application-specific accelerators. >5M logic elements, 3200 Mbps DDR4 SDRAM / 2.5 Tbps HMC, 1.5 TFLOPS floating-point DSP, programmable I/O

6 Mapping a simple program to an FPGA High-level code: Mem[100] += 42 * Mem[101] CPU instructions: R0 Load Mem[100]; R1 Load Mem[101]; R2 Load #42; R2 Mul R1, R2; R0 Add R2, R0; Store R0 Mem[100]

7 First let's take a look at execution on a simple CPU [Diagram: simple CPU datapath with PC and Fetch logic, Load/Store units (LdAddr/LdData, StAddr/StData), a register file (Aaddr, Baddr, Caddr, CData, CWriteEnable), and an ALU] Fixed and general architecture: general cover-all-cases data-paths, fixed data-widths, fixed operations

8 Load constant value into register [Same CPU datapath diagram as the previous slide] Very inefficient use of hardware!

9 CPU activity, step by step [Diagram: the six instructions executed one after another along a time axis] R0 Load Mem[100]; R1 Load Mem[101]; R2 Load #42; R2 Mul R1, R2; R0 Add R2, R0; Store R0 Mem[100]

10 On the FPGA we unroll the CPU hardware [Diagram: the same six instructions laid out in space, one hardware unit per instruction] R0 Load Mem[100]; R1 Load Mem[101]; R2 Load #42; R2 Mul R1, R2; R0 Add R2, R0; Store R0 Mem[100]

11 and specialize by position [Diagram: the unrolled instruction datapath from the previous slide] 1. Instructions are fixed. Remove Fetch

12 and specialize [Same unrolled datapath, further specialized] 1. Instructions are fixed. Remove Fetch 2. Remove unused ALU ops

13 and specialize [Same unrolled datapath, further specialized] 1. Instructions are fixed. Remove Fetch 2. Remove unused ALU ops 3. Remove unused Load / Store

14 and specialize [Same unrolled datapath, further specialized] 1. Instructions are fixed. Remove Fetch 2. Remove unused ALU ops 3. Remove unused Load / Store 4. Wire up registers properly! And propagate state.

15 and specialize [Same unrolled datapath, further specialized] 1. Instructions are fixed. Remove Fetch 2. Remove unused ALU ops 3. Remove unused Load / Store 4. Wire up registers properly! And propagate state. 5. Remove dead data.

16 and specialize [Same unrolled datapath, further specialized] 1. Instructions are fixed. Remove Fetch 2. Remove unused ALU ops 3. Remove unused Load / Store 4. Wire up registers properly! And propagate state. 5. Remove dead data. 6. Reschedule!

17 Custom Data-Path on the FPGA Matches Your Algorithm! High-level code: Mem[100] += 42 * Mem[101] [Diagram: custom data-path built from two loads, the constant 42, a multiply, an add, and a store] Build exactly what you need: operations, data widths, memory size & configuration. Efficiency: throughput / latency / power

18 Architectural Example: Image Processing  I_{new}(x, y) = \sum_{\Delta x = -1}^{1} \sum_{\Delta y = -1}^{1} I_{old}(x + \Delta x, y + \Delta y) \cdot F(\Delta x, \Delta y)  Convolutions: dataflow can proceed in pipelined fashion. No need to wait until the entire execution is complete; start a new set of data calculations as soon as the first stage completes its execution

19 Processor (CPU/GPU) Implementation

for (int y = 1; y < height-1; ++y) {
    for (int x = 1; x < width-1; ++x) {
        for (int y2 = -1; y2 <= 1; ++y2) {
            for (int x2 = -1; x2 <= 1; ++x2) {
                i2[y][x] += i[y+y2][x+x2] * filter[y2+1][x2+1];
            }
        }
    }
}

[Diagram: CPU with a cache in front of main memory] A cache can hide poor memory access patterns
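
A self-contained, compilable C version of the same loop nest is sketched below. The row-major array layout, the function name convolve3x3, and the filter argument are illustrative assumptions rather than anything taken from the slides.

/* Reference 3x3 convolution on the CPU: a plain loop nest that relies on the
 * cache hierarchy to catch the heavily overlapping reads of neighbouring pixels. */
void convolve3x3(const float *in, float *out,
                 const float filter[3][3],
                 int width, int height)
{
    for (int y = 1; y < height - 1; ++y) {
        for (int x = 1; x < width - 1; ++x) {
            float acc = 0.0f;
            for (int y2 = -1; y2 <= 1; ++y2)
                for (int x2 = -1; x2 <= 1; ++x2)
                    acc += in[(y + y2) * width + (x + x2)] * filter[y2 + 1][x2 + 1];
            out[y * width + x] = acc;
        }
    }
}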

20 FPGA Implementation Example performance point: 1 pixel per cycle. Cache requirements: 9 reads + 1 write per cycle [Diagram: memory feeding a cache with 9 read ports, which feeds the custom data-path] A 9-read-port cache is expensive hardware: it carries a power overhead and a cost overhead, with more built-in addressing flexibility than we need. Why not customize the cache for the application?

21 Optimizing the Cache Start out with the initial picture that is W pixels wide

22 Optimizing the Cache Let's remove all the lines that aren't in the neighborhood of the window

23 Optimizing the Cache Take all of the lines and arrange them as a 1D array of pixels

24 Optimizing the Cache Remove the pixels at the edges that we don't need for the computation

25 Optimizing the Cache What happens when we move the window one pixel to the right? We have created a shift register implementation

26 Shift Registers in Software [Diagram: shift register sr[0..2*w+2], with data_in entering at sr[0] and nine taps feeding data_out[9]]

pixel_t sr[2*w+3];
while (keep_going) {
    // Shift data in: the newest pixel enters at sr[0], the oldest leaves at sr[2*w+2]
    #pragma unroll
    for (int i = 2*w+2; i > 0; --i)
        sr[i] = sr[i-1];
    sr[0] = data_in;

    // Tap output data: the nine pixels of the 3x3 window
    data_out[0] = sr[0];    data_out[1] = sr[1];      data_out[2] = sr[2];
    data_out[3] = sr[w];    data_out[4] = sr[w+1];    data_out[5] = sr[w+2];
    data_out[6] = sr[2*w];  data_out[7] = sr[2*w+1];  data_out[8] = sr[2*w+2];
    // ...
}

Managing data movement to match the FPGA's architectural strengths is key to obtaining high performance
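
As a sketch of how this building block feeds the actual filter, the single work-item OpenCL kernel below combines the line-buffer shift register with the nine taps and a 3x3 filter. The kernel name, the compile-time width W, the example coefficients, and the handling of start-up and border pixels are all assumptions for illustration; this is not the code used in the presentation.

#define W 1024   // assumed compile-time image width

__kernel void conv3x3_fpga(__global const float * restrict in,
                           __global float * restrict out,
                           const int num_pixels)
{
    float sr[2 * W + 3];
    const float coeff[9] = { 0.0625f, 0.125f, 0.0625f,
                             0.125f,  0.25f,  0.125f,
                             0.0625f, 0.125f, 0.0625f };  // example smoothing filter

    #pragma unroll
    for (int j = 0; j < 2 * W + 3; ++j)
        sr[j] = 0.0f;                       // clear the line buffer

    for (int i = 0; i < num_pixels; ++i) {
        // Shift: the newest pixel enters sr[0], the oldest falls off at sr[2*W+2]
        #pragma unroll
        for (int j = 2 * W + 2; j > 0; --j)
            sr[j] = sr[j - 1];
        sr[0] = in[i];

        // The nine taps of the 3x3 window feed the filter every cycle
        float acc = coeff[0] * sr[2*W+2] + coeff[1] * sr[2*W+1] + coeff[2] * sr[2*W]
                  + coeff[3] * sr[W+2]   + coeff[4] * sr[W+1]   + coeff[5] * sr[W]
                  + coeff[6] * sr[2]     + coeff[7] * sr[1]     + coeff[8] * sr[0];

        // Output becomes valid once two full lines plus two pixels have arrived;
        // image borders are ignored in this sketch
        if (i >= 2 * W + 2)
            out[i - (W + 1)] = acc;
    }
}

Because every loop iteration issues exactly one global read and at most one global write, the compiler can pipeline the outer loop toward the 1-pixel-per-cycle target from the earlier slide.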

27 Traditional OpenCL Implementation of a Pipeline (CPU/GPU) [Diagram: Kernel 1, Kernel 2, and Kernel 3 each reading and writing buffers in global memory (DDR)] High latency: requires access to global memory. High memory bandwidth. Requires host coordination to pass buffers from one kernel to another. With a particular design example we achieved 183 Images/s on a Stratix V PCIe card
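
The sketch below illustrates this traditional style: each stage is its own NDRange kernel, and the intermediate result makes a round trip through a DDR buffer between stages. The kernel names, the placeholder arithmetic, and the buffer names are invented for illustration.

__kernel void stage1(__global const float * restrict in,
                     __global float * restrict tmp)
{
    int i = get_global_id(0);
    tmp[i] = in[i] * 2.0f;        // stand-in for the real stage-1 computation
}

__kernel void stage2(__global const float * restrict tmp,
                     __global float * restrict out)
{
    int i = get_global_id(0);
    out[i] = tmp[i] + 1.0f;       // stand-in for the real stage-2 computation
}

// Host side (conceptually): the host enqueues stage1, then enqueues stage2,
// and the tmp buffer travels through global DDR memory in between.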

28 Leveraging Kernel-to-Kernel Channels [Diagram: Kernel 1, Kernel 2, and Kernel 3 connected directly by channels; buffers in global memory (DDR) remain only at the ends of the pipeline] Create a queue with a channel declaration: channel int my_channel; Push data into the queue with a channel write: write_channel_altera(my_channel, x); (generic form: void write_channel_altera(channel &ch, value_type data);) Pop the first element from the queue with a channel read: int y = read_channel_altera(my_channel); (generic form: value_type read_channel_altera(channel &ch);) Low-latency communication between kernels, significantly less memory bandwidth requirements, and the host is not involved in coordinating communication between kernels. This implementation on the same Stratix V PCIe card resulted in 400 Images/s

29 FPGA Code

#pragma OPENCL EXTENSION cl_altera_channels : enable

// Declaration of Channel API data types
channel float prod_k1_channel;
channel float k1_k2_channel;
channel float k2_k3_channel;
channel float k3_res_channel;

kernel void convolution_prod(int batch_id_begin, int batch_id_end,
                             global const volatile float * restrict input_global)
{
    for (...) {
        write_channel_altera(prod_k1_channel, input_global[...]);
    }
    write_channel_altera(k1_k2_channel, input_global[...]);
    ...
}

Kernels are written as standard building blocks that are connected together through channels. The concept of having multiple concurrent kernels executing simultaneously and communicating directly on a device is currently unique to FPGAs. Offered as a vendor extension; portable in OpenCL 2.0 through the concept of OpenCL Pipes
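
For a self-contained picture of the pattern, here is a minimal, hypothetical producer/consumer pair connected by one channel; the kernel names, the stand-in arithmetic, and the Altera-era extension spelling are assumptions for illustration only.

#pragma OPENCL EXTENSION cl_altera_channels : enable

channel float stage_channel;

__kernel void producer(__global const float * restrict in, const int n)
{
    for (int i = 0; i < n; ++i)
        write_channel_altera(stage_channel, in[i]);          // blocking push into the FIFO
}

__kernel void consumer(__global float * restrict out, const int n)
{
    for (int i = 0; i < n; ++i)
        out[i] = 2.0f * read_channel_altera(stage_channel);  // blocking pop, then stand-in math
}

// The host launches producer and consumer on separate command queues so they
// run concurrently; the data passed between them never touches global DDR memory.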

30 Migration Between FPGAs In OpenCL, a float uses soft logic on older FPGAs; Gen 10 FPGAs have hardened floating-point logic built into the DSP blocks. On Arria 10, using the same code results in processing 6800 Images/s. Stratix 10 expectations: large increase in floating-point resources, higher internal frequencies achievable, 1.6x-2x performance increase, 12x-16x performance/watt efficiency versus Stratix V

31 Additional Improvements: IO Channels Kernel channels are between OpenCL kernels; IO channels take data directly from and to IO interfaces on the FPGA. A camera or video feed could be processed directly in the FPGA without going through the host, and the result could be passed out to the graphics card to be displayed or back to host memory for the host to use. Private, local and global memory can now be used to buffer as needed [Diagram: IO channels feeding Kernel 1, Kernel 2, and Kernel 3 inside the FPGA, which are linked by kernel channels/pipes]
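
A minimal sketch of how an IO channel can be declared, assuming the Altera-era extension spelling and an io attribute whose interface name comes from the board support package; the name "stream_in", the kernel, and the float payload are made up for illustration.

#pragma OPENCL EXTENSION cl_altera_channels : enable

// The io attribute binds the channel to a streaming interface defined by the
// BSP; "stream_in" is a placeholder, not a real board port name.
channel float video_in __attribute__((io("stream_in")));

__kernel void ingest(__global float * restrict frame, const int n)
{
    for (int i = 0; i < n; ++i)
        frame[i] = read_channel_altera(video_in);   // pixels arrive straight from the I/O interface, with no host hop
}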

32 Lessons Learned Exploiting pipelining on the FPGA requires some attention to coding style to overcome the inherent assumptions of writing software. FPGAs do not have caches, so data reuse needs to be exploited in a more explicit way. The concept of dataflow pipelining will not realize its full potential if we write intermediate results to memory: bandwidth limitations begin to dominate compute, so use direct kernel-to-kernel communication called channels. Native support for floating point on the FPGA allows an order-of-magnitude performance increase, and code can be ported to newer FPGAs without modification to get that performance increase. IO channels can lower latency and improve performance even more by taking the host out of the processing chain.
