High-Level Synthesis of a Fast Super-Resolution Convolutional Neural Net

Justin Selig (jss459), Patrick Wang (pw349), Shaurya Luthra (sl2462)
ECE 5775, Prof. Zhiru Zhang
December 11th, 2017

Introduction

With Dennard scaling and Moore's law coming to an end, hardware specialization and system heterogeneity pave the way for the continual advancement of processing speed and power. Specialization becomes especially useful for specific, compute-heavy applications and enables new problems to be solved with the extra horsepower. One such application is machine learning, specifically that involving Convolutional Neural Networks (CNNs). While CNNs have many uses, one of the most prominent is image processing and recognition: an image's two-dimensional data representation lends itself particularly well to convolution. One such use case is Image Super-Resolution, the process of taking one (or several) low-resolution (LR) images and creating a high-resolution (HR) image. This can be useful for clarifying blurry or LR images from surveillance cameras, crops of images or videos, and for providing more information to other applications that take images as input, such as facial recognition.

The goal of our project was to implement a hardware accelerator for Image Super-Resolution using one LR input image. We chose the Fast Super-Resolution CNN (FSRCNN) as described in [2]. As the name suggests, the primary difference between FSRCNN and the original SRCNN is the speed at which the algorithm runs; FSRCNN has also been shown to produce better results than standard SRCNN methods. This improvement in quality and speed, however, comes at the cost of complexity, which is described further below.

Our project involved implementing the FSRCNN algorithm in C++ and synthesizing hardware modules to run on an FPGA using the Xilinx SDSoC Development Environment. To do this, we had to consider principles in software-hardware co-design, hardware performance optimization, area/resource constraints, and evaluation methodology, all described in later sections. As a result of implementing this model, we achieved results comparable to the state-of-the-art software-only approach, but with higher throughput and lower latency in hardware.

Techniques

As described in the FSRCNN paper [2], the FSRCNN algorithm, while faster and more accurate than SRCNN, is far more complex. While the SRCNN algorithm consists of only 3 stages (pre-processing, feature extraction, and mapping), the FSRCNN algorithm consists of 8 stages: 7 convolution stages and 1 deconvolution stage. The weights, biases, and PReLU values are taken from the pre-trained models provided in [2]. The stages are illustrated and described in further detail below:

Figure 1: FSRCNN Layers Visualized

Stage Description:
- Input Image: LR, single channel
- Conv. Layer 1: Feature extraction
- Conv. Layer 2: Shrinking
- Conv. Layers 3-6: Mapping
- Conv. Layer 7: Expanding
- DeConv Layer 8: Deconvolution

The above 8 layers account for the network's total weight count, plus a small number of parameters for the PReLU activations (layers 1-7).

Our goal was to take the above 8 layers and develop a software model for them using the pre-trained model weights. We would then use these weights in our FSRCNN implementation and validate the model outputs using Peak Signal-to-Noise Ratio (PSNR) to quantify the quality of the results. In order to achieve an observable acceleration, we would try to offload as many layers as possible to the FPGA while remaining within resource limits. The equation for PSNR over N pixels is shown below, where error_n is the per-pixel difference between the ground truth and the output (with a peak value of 1 for normalized image data):

PSNR = 20 * log10( 1 / sqrt( (1/N) * sum_{n=1}^{N} error_n^2 ) )

We chose to offload layers 3-6 because these layers have the same parameters and come right after one another. This is important since data may be passed directly between hardware layers rather than communicated back to the CPU, which takes a non-trivial amount of time. Additionally, since these layers are identical in structure, accelerating one layer effectively accelerates all layers performed in hardware (there is no need to test different HLS optimization techniques for each layer). We chose not to move the deconvolution function to the FPGA because of the sheer size of the computation. Layer 8, which uses deconvolution, uses 4536 weights (over ⅓ of our total number of weights across all 8 layers) and incurs a 9x increase in image data due to the upscaling inherent to deconvolution, meaning that the bottleneck in accelerating layer 8 would most likely be data transfer: moving weights and input/output feature maps in and out of the FPGA as well as the CPU's memory system. A visualization of this can be seen in Figure 2.

Figure 2. System block diagram

Software Implementation

To start, we reviewed an existing MATLAB implementation of FSRCNN provided in [1]. We used the MATLAB output for numerical (PSNR) and visual performance comparisons; the release also included test images and the pre-trained model. From that model, we extracted the weights, biases, PReLU values, and input image values by converting them into 1D arrays and writing them to .dat files. Because the MATLAB model used multidimensional arrays and transforms 2D matrices into 1D arrays column by column, some simple matrix transformations and transposes were used to list the data in the correct order.

In order to match the convolution structure used in lab 4, output and input feature maps are listed by column (x) then row (y), and weights are listed by output fmap, channel (input fmap), column, then row. This simplifies the indexing into our 1D arrays. From there, the same setup for convolution from lab 4 was used. All parameters are passed as 1D arrays to easily accommodate different map and filter sizes, with important arguments such as the number of input fmaps, number of output fmaps, input edge size, output edge size, weights, biases, and the PReLU value. Within the function, nested loops perform the convolution. The loop ordering is as follows:

for each output fmap
  for each input fmap
    for each output fmap location
      for each filter location
        perform convolution

Additionally, the MATLAB model used PReLU as the activation function instead of ReLU. ReLU adds a bias to the output and caps the lower end at 0. PReLU takes that biased value and adds a PReLU weight multiplied by the minimum of the biased value and 0. Mathematically, this plays out as follows:

biased = output + bias
relu_output  = max(biased, 0)
prelu_output = max(biased, 0) + prelu_weight * min(biased, 0)

This convolution function on its own, however, will not replicate the MATLAB calculations, due to the issue of padding. The convolution function in lab 4 (and consequently the one we used) requires the input fmap to be at least as big as the output fmap, based on filter size. The MATLAB model, on the other hand, uses the same sized input and output fmaps for each layer, because the MATLAB convolution method pads the input to be big enough for the convolution based on the filter size. The thickness of the outer border b is related to the filter edge size K as follows:

b = (K - 1) / 2

The resulting padded input fmap then has edge size I_pad = I_o + 2 * b. The borders are made up of the nearest original edge element: the edges are replicated outward, with the corners consisting of the original corners of the fmap. A pad function was written to do just that, matching the MATLAB model, with padding occurring before each convolution layer (a sketch appears at the end of this section).

With the pad and convolution functions, layers 1 through 7 could now be executed. Layer 8, however, was a deconvolution layer. Deconvolution, or transposed (strided) convolution, is a method of reversing the effect of convolution by padding the input between pixels and performing a direct convolution. This operation is used as the last stage in the FSRCNN network and applies the largest number of weights of all the layers. The effect of this operation is that the output image ultimately returns to the size of the input. Since the input to this last layer is a dense set of pixels, the conversion to a coarser representation via deconvolution is effectively a form of upsampling: it reverses the shrinking effect of convolution by interpolating between input pixels via a strided convolution whose weights have been previously learned. The operation itself involves the temporary storage of a very large matrix, almost twice the size of the input. Since this layer has such a large number of entries in its intermediate operation, it was infeasible to move it to the FPGA: it would require a massive amount of memory for storage.
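As a reference for the replicate-padding just described, the following is a minimal C++ sketch (our reconstruction, not the exact source; it assumes row-major 1D storage as described above):

// Pad an o x o fmap into a (o + 2b) x (o + 2b) buffer, where b = (K - 1) / 2.
// Out-of-range source coordinates are clamped, so the borders replicate the
// nearest original edge pixel and the corners replicate the original corners.
void pad(const float *in, float *out, int o, int b) {
    int p = o + 2 * b;                         // padded edge size I_pad
    for (int y = 0; y < p; y++) {
        for (int x = 0; x < p; x++) {
            int sy = (y - b < 0) ? 0 : ((y - b >= o) ? o - 1 : y - b);
            int sx = (x - b < 0) ? 0 : ((x - b >= o) ? o - 1 : x - b);
            out[y * p + x] = in[sy * o + sx];  // row-major 1D indexing
        }
    }
}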

The above functions (pad, convolution, and deconvolution), called with their respective layer parameters shown in Table 1, form the FSRCNN function, along with the copying of input data into a temporary convolution memory and the copying of output data back out of it. With this, the fsrcnn.cpp file was complete.

Table 1. Expanded layer parameters (columns: F_Size, I_Size, I_Padded, I_Count, O_Count, I_Total, W_Total, Biases, PReLU)

Each function was incrementally verified by creating small test matrices and inputting them into both our C++ functions and the functions used in the MATLAB model. After incremental testing, the function as a whole was tested by passing one of the MATLAB test images through it and comparing our output data against MATLAB's. This was done by copying the data into a spreadsheet and calculating the magnitude of the raw error between the two.

A separate file, main.cpp, was used to call our FSRCNN method and supply different test images. In addition to being where input data is managed, this file is also where the output data is compared to the original ground-truth image. The error function used in the MATLAB model is peak signal-to-noise ratio (PSNR), a metric typically used to measure the reconstruction fidelity of compression/decompression algorithms; it is essentially the magnitude in decibels of the mean of the squared difference between the ground truth and our output. Using the MATLAB code as a reference, along with the math.h library, we recreated our own PSNR function. A shave function was also created to trim the border created by the deconvolution so that the dimensions of the final output and the ground-truth image match (sketches of both appear at the end of this section).

Software Results

After running the code on the ecelinux servers, we compiled and ran it on the Xilinx ZC706 boards provided by the instructor. This run produced the average PSNR over 3 images (bird, butterfly, and head from the Set5 folder of [1]) reported in Table 2. Note that the original MATLAB code from [1] reports mean PSNRs over the same 3 images for both the FSRCNN method and bicubic upscaling; numerically, our C++ implementation gets a far worse PSNR than even bicubic upscaling does in MATLAB. However, through experimentation we discovered that if we take the raw floating-point output of our FSRCNN function and do the rest of the processing in MATLAB (float to 8-bit integer conversion and shaving), the calculated PSNR is within 0.1 dB of the MATLAB PSNR. The way MATLAB does the float-to-int conversion is believed to be the culprit of the mismatch.
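For reference, minimal C++ sketches of the PSNR and shave routines described above (illustrative reconstructions using math.h, not the exact source; they assume images normalized to [0, 1] and stored row-major):

#include <math.h>

// PSNR in dB between ground truth and output over n pixels, following the
// equation given earlier (a peak value of 1.0 is an assumption for
// normalized image data).
float psnr(const float *truth, const float *test, int n) {
    float mse = 0.0f;
    for (int i = 0; i < n; i++) {
        float e = truth[i] - test[i];
        mse += e * e;
    }
    mse /= n;
    return 20.0f * log10f(1.0f / sqrtf(mse));
}

// Trim a border of width s from an e x e image so its dimensions match
// those of the ground truth.
void shave(const float *in, float *out, int e, int s) {
    int o = e - 2 * s;                          // trimmed edge size
    for (int y = 0; y < o; y++)
        for (int x = 0; x < o; x++)
            out[y * o + x] = in[(y + s) * e + (x + s)];
}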

Even when the byte version of our output is inspected visually (see Figure 4), it appears much closer to MATLAB's FSRCNN picture than to the bicubic upscaled picture.

Another note is that the individual layer times do not sum to the total runtime. The software timers are placed only around the padding and the actual layer itself. While the convolution is the operation of interest, the padding takes on the order of 1 msec to run and was therefore deemed negligible to the overall timing when grouped together with the layers. Within the FSRCNN function there is also data copying in and out of the function, which is not included; the main function additionally performs file writes, shaving, and the PSNR calculation.

Table 2. Software results from the board (columns: PSNR (dB), L1-2 Runtime, L3-6 Runtime, L7 Runtime, L8 Runtime, Total Runtime)

Hardware Implementation

To accelerate the system, we chose to move layers 3 through 6 into hardware. Because they have the same overall parameters (12 input channels, 12 output fmaps, 3 x 3 filter), they can share the same hardware module, saving area and development time. Additionally, in terms of layer count, they make up one half of all the layers. As single layers, they also have the 3rd-most weights (see Table 1), meaning they comprise a significant portion of the overall execution. In fact, looking at the runtime breakdown in Table 2, as a group layers 3 through 6 take up 44.8% of the total fsrcnn time.

In order to move these layers into hardware, we had to create two hardware functions: perform_conv_hw and perform_bias_hw. The single perform_conv function from before was broken into two hardware modules in order to support the in-order data access requirements of SDSoC. We first wrote perform_conv_hw, followed by perform_bias_hw, and initially used these functions only for layer 3. To minimize data transfer, the weights necessary for layer 3 were stored locally in hardware.

After running our first synthesis, we found that we could generate a bitstream, but it would immediately hang when run on the ZC706. The reason for this, discovered with the help of Sean (TA) and Professor Zhang, was that we were accessing data out of order and had not properly defined our data interfaces. After realizing this, we added the following pragma:

#pragma SDS data access_pattern(input:sequential,output:sequential)

We also used two for loops to copy our input and output into/out of temporary arrays in order to preserve the sequential access pattern. After adding this pragma and resynthesizing, we found that our code still hung. This turned out to be a result of partial data access: with SDSoC we cannot declare an array of size M and access only N of its elements when N != M. We needed to keep the flexible array size in order to make our function modular across different images, so we then used the following pragma:

#pragma SDS data copy(input[0:12*(o+k-1)*(o+k-1)], output[0:12*o*o])

This pragma allowed us to pass in a variable (pointer) to our array and access a variable amount of data dependent on function parameters (which can be resolved at compile time). With the above modifications we were able to run our code on the ZC706 boards, but did not achieve much of a runtime boost. Looking at the Vivado HLS reports, we saw a large amount of per-function latency spent in the data copies from the input into a temporary array and from the temporary array to the output; a sketch of this interface is shown below.
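As a hedged reconstruction (the real perform_conv_hw signature may differ, and MAX_PAD is a hypothetical size bound), the interface looked roughly like this:

// Hardware convolution entry point (sketch). The pragmas declare sequential
// streaming access and exact transfer sizes, both resolvable at compile time.
#define MAX_PAD 36                                 // hypothetical bound on padded edge size
#pragma SDS data access_pattern(input:sequential,output:sequential)
#pragma SDS data copy(input[0:12*(o+k-1)*(o+k-1)], output[0:12*o*o])
void perform_conv_hw(float input[], float output[], int o, int k) {
    static float in_buf[12 * MAX_PAD * MAX_PAD];   // local copy of the input fmaps
    for (int i = 0; i < 12 * (o + k - 1) * (o + k - 1); i++)
        in_buf[i] = input[i];                      // sequential read from input
    // ... convolution over in_buf; results were initially accumulated in a
    // second temporary and copied out to output in a final loop ...
}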
These copies were done to maintain in-order data access, as SDSoC does not support INOUT arrays as parameters to hardware functions. In order to reduce latency, we decided to restructure our convolution loop as shown below:
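The reordered nest, as we reconstruct it here (a sketch of the loop order of Figure 3; the 1D index math assumes the row-major layouts described earlier):

// For each output fmap, each output pixel is fully accumulated into a local
// temporary before being written, so output is written strictly in order and
// the separate output copy loop disappears.
void conv_reordered(const float *wgt, const float *in_buf, float *output, int o) {
    const int M = 12, K = 3;              // parameters shared by layers 3-6
    const int p = o + K - 1;              // padded input edge size
    for (int f = 0; f < M; f++)           // output fmaps
        for (int y = 0; y < o; y++)       // output rows
            for (int x = 0; x < o; x++) { // output columns
                float acc = 0.0f;
                for (int m = 0; m < M; m++)                // input fmaps
                    for (int ky = 0; ky < K; ky++)         // filter rows
                        for (int kx = 0; kx < K; kx++)     // filter columns
                            acc += wgt[((f * M + m) * K + ky) * K + kx]
                                 * in_buf[(m * p + y + ky) * p + x + kx];
                output[(f * o + y) * o + x] = acc;         // in-order write
            }
}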

Figure 3. Loop order

By reordering our loop and using a temporary variable to hold the accumulating sum for each output pixel, we were able to change our output from out-of-order to in-order access within the convolution and eliminate the output copy loop. This greatly reduced latency.

The next and final optimizations were unrolling parts of the inner loops. More specifically, we manually unrolled the two loops which iterate over filter pixels, decreasing runtime as well as increasing accuracy by reducing truncation between individual convolution sums. Finally, we unrolled the loop over each input map (m input maps) so that all of the values needed for a single output pixel could be computed in parallel.

Following these optimizations, we attempted to pipeline the loop which iterates over the input columns. Doing so required 12 line buffers (for parallel convolution over the input maps); unfortunately, partitioning these line buffers pushed our resource utilization above 100%, and not partitioning them actually worsened our overall minimum and maximum latencies. With this realization, we instead moved layers 4-6 into hardware as well. The results are discussed below.

Hardware Results (Evaluation)

After verifying that the design still compiled and ran in software, we ran HLS synthesis of our baseline hardware design and generated bitstreams to program the Xilinx ZC706 boards; the results can be seen in Table 3.

Table 3. Hardware baseline results (columns: PSNR (dB), Runtime, Period (nsec), Max Latency, BRAM, DSP, FF, LUT)

There is only a minor speedup over the software baseline, approximately 1.03x, but this is expected: floating-point operations are slow in hardware, and only 1 layer had been moved into hardware. However, this gave us a good baseline for the hardware optimizations to come.

Hardware Optimizations (Evaluation)

After making sure we had a working hardware baseline, we began optimizing our hardware module for better performance. We set certain constraints to meet, namely to maintain a mean PSNR of at least 26 dB and to keep resource utilization under 50%. The total resources of the ZC706 can be seen in Table 4.

Table 4. Total resources available on the ZC706 board (columns: BRAM, DSP, FF, LUT)

With the accuracy and area constraints in mind, the first optimization made after the hardware baseline design was the change from floating-point to fixed-point arithmetic for the hardware convolution and biasing functions. Experimenting with different values, we discovered that we needed at least 4 integer bits to accurately represent the image data and at least 16 total bits to keep the PSNR above 26 dB. At only 16 bits wide, this halves the area required to store the weights, input, and output data; due to its simplicity, it also provided a speed boost of about 3x in maximum latency.

At this point, our total BRAM utilization was still well below the 50% limit, so we also moved the weights of layers 4 through 6 into hardware and rewrote the fsrcnn function to run all 4 layers in hardware. This only increased the BRAM count (with negligible changes in FF and LUT usage due to select logic) and provides an even better overall speed increase, though this isn't apparent from the Vivado synthesis reports alone.

After deciding which fixed-point parameters to use, we next exploited the inherent parallelism of the calculations. First, the innermost filter loops were manually unrolled so that the summing of the weights multiplied with the input values can be parallelized. We found that this also increases our PSNR, as the accumulation happens in one step, reducing the accuracy lost to repeated accumulations; however, the PSNR increase was not enough to further drop the bit-width of our fixed-point type. We were also able to completely unroll the input fmap loop shown in Figure 3, as well as completely partition the input buffer in the first dimension to allow simultaneous access, facilitating parallel calculation. A sketch of the fixed-point type and the unrolled filter sum appears at the end of this section.

Hardware Optimization Results (Evaluation)

The results of the above-mentioned optimizations are shown below in Table 5. In terms of max latency, we achieved a 205x reduction compared to the baseline hardware design. For area, the optimized design uses fewer BRAMs despite holding more data, though DSP, FF, and LUT usage did increase; utilization remains well below the 50% limit at 20.5%, 13.7%, 2.7%, and 10.1% respectively. While the program has more parallelism than what we exploited, the sizes of our buffers made it infeasible to pipeline or partition further, due to the overhead of the muxes resulting from variable array accesses.

Table 5. Final optimized hardware results (columns: PSNR (dB), Runtime, Period (nsec), Max Latency, BRAM, DSP, FF, LUT)

For timing, we achieved an overall speedup over the software running on the ZC706 board. Looking at the timing breakdown of the optimized design in Table 6 and comparing it to that of the software in Table 2, our target layers (3-6) achieved a 46.7x speedup. While this is far less than the 205x reduction in maximum latency for the module itself, the bottleneck is likely the data transfer, as well as the padding function, which remains in software but is included in the runtime breakdown of the target layers. The results are still fairly significant: considering that eliminating the runtime of layers 3-6 entirely would cap the overall speedup at 1.673x, we came very close.

Table 6. Final optimized runtime breakdown (columns: L1-2 Runtime, L3-6 Runtime, L7 Runtime, L8 Runtime)
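As an illustration, the fixed-point type and the manually unrolled 3 x 3 filter sum can be sketched as follows (the bit-widths come from the text above; the function name and array layout are illustrative):

#include <ap_fixed.h>

typedef ap_fixed<16, 4> data_t;   // 16 total bits, 4 integer bits

// All nine products of a 3 x 3 filter summed in one expression, so the
// accumulation happens in a single step rather than nine dependent adds.
data_t conv3x3(const data_t w[9], const data_t p[9]) {
    return (w[0] * p[0] + w[1] * p[1] + w[2] * p[2])
         + (w[3] * p[3] + w[4] * p[4] + w[5] * p[5])
         + (w[6] * p[6] + w[7] * p[7] + w[8] * p[8]);
}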

For a visual representation of our final function output, Figure 4 shows a test image that we used to evaluate the output of our network. In the upper right is the raw ground-truth image. The left side shows the LR input to the network, which we obtained by taking the original ground-truth image, downsampling it by averaging neighboring pixels, then upsampling by inserting average-intensity pixels between existing pixels to return the image to its original size. The lower right shows the HR output of our network after passing in the LR image. It is observed that the FSRCNN network restored some image quality to the LR image.

The program writes three text files with the byte values tab-delimited. These can be converted into image files by our Python script once they are copied to a local machine, as the ecelinux server does not have the required Python libraries. The script is parse_image.py and expects the text files in the same directory.

Figure 4: Actual output from running an image through the FSRCNN network

Project Management

Division of Labor

Patrick Wang:
- MATLAB scripts to generate C++ convolution model
- MATLAB code to evaluate correctness of C++ model
- System framework
- perform_conv and matrix printing functions, fsrcnn.cpp, main.cpp
- General system debugging and bug squashing
- Data accumulation and statistics
- Hardware optimizations: fixed-point; line buffer (attempt, not successful)
- C++ debugging for convolution function

Shaurya Luthra:
- First pass at PSNR and matrix printing functions (corrected by Patrick)
- Software -> hardware: perform_bias_hw, perform_conv_hw
- SDSoC integration / Makefile creation
- Data transfer pragmas
- Hardware optimizations: loop reorder, loop unrolling
- Aided with data accumulation: ran synthesis / flashed FPGA

Justin Selig:
- Python .txt -> .PNG script
- Deconvolution function and testing
- Padding and convolution debug
- General system debugging
- Hardware optimizations: weight buffer (attempt, not successful)
- Attempt at hardware timing
- Project presentation setup, formatting, and editing
- Project writeup setup, formatting, and editing
- Writeup sections

Week 1 (11/6/2017):
- Reference model research (All)
- C++ software port (Patrick and Justin)

Week 2 (11/13/2017):
- Finished C++ software port (Patrick and Justin)
- PSNR (Shaurya)
- Debugging/verification (All)

Week 3 (11/20/2017):
- Began SDSoC exploration (Shaurya)
- perform_conv_hw, perform_bias_hw, SDSoC Makefile (Shaurya)

Week 4 (11/27/2017):
- Continued SDSoC exploration (Shaurya)
- Thanksgiving break (no work done)

Week 5 (12/4/2017):
- Working SW baseline (with SDSoC Makefile) (Shaurya and Patrick)
- Working HW baseline (Shaurya)
- HW optimizations (All)
- Results and data collection (All)

Project development:

Overall, the project developed at a fairly steady pace and took an incremental approach. We began by writing software (C++) based on a MATLAB baseline found on the website of the authors of the paper that served as our inspiration. In doing so we faced many challenges. The first was translation: MATLAB has many "magic" functions (like PSNR) that we had to write ourselves. Furthermore, as MATLAB is 1-indexed and can randomly access array elements quite efficiently (something we cannot do in hardware), we spent quite a bit of time restructuring the convolution and deconvolution functions for proper array access. Following this, we began moving software over to hardware (layer 3). Here we faced many challenges in setting up SDSoC and in getting the data transfer configured so that our software would not hang on the actual FPGAs. We were eventually able to write our own Makefile and set up the hardware functions so that we could run layers in hardware. After being able to run a basic layer in hardware, we moved on to optimizations, where the challenge was restructuring our program to minimize latency without overshooting resource utilization. In the end we were able to achieve all of our goals and offload 4 layers into hardware.

Conclusion

The FSRCNN is a compute-intensive deep neural net, which makes it an optimal candidate for hardware acceleration. By implementing a high-level synthesis of the FSRCNN, we achieved a significant speedup over the CPU for the inference phase of the network. In general, an implementation of inference in hardware is both power- and time-efficient; this type of model therefore has many use cases for image processing in real-time applications or in power-constrained devices such as embedded processors. Moving forward, there are several ways we might address issues we encountered while developing this project. The most efficient implementation would be one residing entirely in hardware: with a large enough FPGA, we could eliminate the CPU-FPGA communication bottleneck by storing all pre-trained weights in on-chip BRAM. Alternatively, large layers such as the deconvolution could be broken into multiple parts through a tiling process wherein convolution operations are performed on separate segments of an image and combined to form a single output. Similarly, there are algorithmic optimizations we could implement to speed up computation, such as using a line buffer for the convolution operations.

References

[1] C. Dong, C. C. Loy, and X. Tang, "Accelerating the Super-Resolution Convolutional Neural Network."
[2] V. Alberto, "An Example of a Convolutional Neural Network for Image Super-Resolution."
[3] V. Alberto, "An Example of a Convolutional Neural Network for Image Super-Resolution - Tutorial."
