Characterization of OpenCL on a Scalable FPGA Architecture


Shanyuan Gao and Jeremy Chritz
Pico Computing, Inc.
Email: {sgao, jchritz}@picocomputing.com

Abstract

The recent release of Altera's SDK for OpenCL has greatly eased the development of FPGA-based systems. Research has shown performance improvements brought by OpenCL using a single FPGA device. However, to meet the objectives of high performance computing, OpenCL needs to be evaluated using multiple FPGAs. This work proposes a scalable FPGA architecture for high performance computing. The design includes multiple FPGA modules and a high performance backplane. The modular nature of this architecture supports the combination of different FPGAs and provides for easy hardware updates. FPGA modules based on Stratix V are compatible with Altera's OpenCL tool flow. The evaluation tested the native IO performance of the architecture, and the results demonstrate scalability using six FPGAs. The host-to-device peak bandwidth is measured as 13.1 GB/s for read operations and 12.1 GB/s for write operations. The FPGA-to-memory bandwidth is measured as 64.5 GB/s in total. An OpenCL AES kernel is selected to test the scalable multi-FPGA architecture. The test results show that peak throughput is achieved when six FPGAs are used. The throughput per watt shows a 5× improvement over a general-purpose processor when four FPGAs are used.

I. INTRODUCTION

FPGAs have long been recognized for the important role they play in high performance computing (HPC), owing primarily to their inherent parallelism and low power consumption [1]-[3]. However, the application of FPGAs in the HPC domain has historically been limited to those with deep expertise in hardware, firmware, operating systems, and HPC applications. The FPGA community is striving to improve productivity and ease of use in all of these aspects. But the work that sits between a software application and the underlying FPGA device (defining layers of functionality, selecting communication protocols, developing algorithms, performing verification, and so on) still takes several engineers weeks or months to deliver and maintain a project. Moreover, porting or upgrading a design from one FPGA device to another means repeating the process described above. This is not a productive or sustainable engineering model, and it must be changed. To this end, an ideal FPGA system for HPC applications should address the aforementioned issues and bring:

- Short development time from a high level description to a system configuration (bitstream).
- Scalable multiple-FPGA support to realize maximum parallelism.
- Easy upgrade of hardware devices.

OpenCL (Open Computing Language) is an open specification widely used in many HPC systems that leverages the parallelism inherent to heterogeneous computing devices. The OpenCL standard is supported by various vendors for different devices. From a performance point of view: a) heterogeneous computing devices are capable of processing a large number of threads in parallel, while a general-purpose processor can only handle a limited number of threads; b) threads on heterogeneous computing devices are commonly lightweight, while threads on the host are heavyweight, so the cost of context switching on the computing device is smaller than on the host processor. From a software programmer's perspective: a) the OpenCL code running on the computing device is easy to write, employing the same syntax as high level programming languages (for example, the C language);
b) the OpenCL code is portable from one computing device to another, so it can easily benefit from a hardware upgrade or a tradeoff evaluation.

Altera's OpenCL solution is selected in this work because the tools automate the build process and tremendously reduce the overall development time. However, Altera's current OpenCL solution only supports PCI Express Gen2 ×8 with Configuration via Protocol (CvP) enabled. As most host machines now provide Gen3 ×16 and most Altera board vendors place only one or two FPGAs per board, at least half of the Gen3 ×16 bandwidth is wasted. To address the application of FPGAs in an HPC system, with all the issues described, in this work we have:

- Designed a scalable FPGA architecture that supports multiple FPGAs on a single backplane.
- Developed a solution using Altera's OpenCL on multiple FPGAs.
- Tested the performance of the architecture and experimented with the scalability of an OpenCL application.
- Calculated the power efficiency (throughput per watt).

The rest of the paper is organized as follows: Section II provides OpenCL background and OpenCL-related work on FPGAs. Section III presents the design and implementation of the multi-FPGA architecture. Section IV shows the experimental results. Section V concludes the paper and describes future work.

II. BACKGROUND AND RELATED WORK

A. OpenCL

OpenCL [4] is an open standard that enables applications to be executed on various computing architectures, such as general purpose CPUs, GPGPUs, and FPGAs, as well as other special-purpose processors. As a result, applications can benefit from the underlying hardware features without programmers spending a great deal of time learning hardware details. OpenCL adopts the processor-accelerator concept, defining a host and a device in the platform. The application is divided into the host program and the kernel program. When the application starts, the host executes the host program and passes the computationally intensive workload to the device; the device is programmed on the fly, either from sources or from binaries, executes the computationally intensive workload, and passes the result back to the host. The latency between the host and the device is often larger than that between the host and its memory. To overcome this issue, data should be properly sized so it can be pipelined into the device. The kernel is the piece of computationally intensive code within the application, identified by the programmer through profiling. The kernel source is written in a C-like language. Hardware vendors often have their OpenCL implementation optimized for their hardware, which can be applied by setting #pragma directives and attributes in the kernel code.

B. OpenCL on FPGA

Since the introduction of the OpenCL standard, there have been several attempts to implement OpenCL on FPGAs. However, several obstacles need to be addressed before implementing OpenCL on FPGAs.

C to Hardware Description Language (HDL): The OpenCL standard uses a C-like language to describe high level functions, whereas FPGAs use HDL to describe functions at the Register Transfer Level (RTL). Translating a high level language to HDL can be done by hand, but it requires expertise in HDL, and the process involves repeated development and verification cycles. To speed up the translation, several high level synthesis tools help convert C to HDL code [5]-[8]. High level synthesis tools take only seconds or minutes to convert a C function into HDL code.

Integration: The generated HDL core will not work without a framework that provides the necessary control and data IO. For example, if the FPGA board sits in a PCI Express slot, the framework should set up a stable PCI Express link, calibrate the DDR memory, connect all necessary peripheral devices, and control the generated HDL core. The integration process requires, at a minimum, FPGA knowledge of clocks, interfaces, and timing.

Building kernels: OpenCL has two ways of creating the kernel executables on the device: a) clCreateProgramWithSource, which compiles the kernel source and loads the executable at runtime; b) clCreateProgramWithBinary, which loads a pre-compiled binary of the kernel executable. GPGPUs generally use the former method because the compile time is negligible (seconds). Due to the place and route time, FPGA tools take much longer (hours) to generate a configuration. Thus pre-compiling the kernel design is the only efficient way to create executables.

Kernel reloading: Programming a new configuration onto the FPGA erases the original configuration, which disconnects the physical link between the device and the host. Special engineering work needs to be done to keep the physical link alive without rebooting the host machine.
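To illustrate the C-like kernel language mentioned in Section II-A, the following is a minimal, hypothetical kernel sketch (a simple element-wise addition). The kernel name and arguments are illustrative only and do not appear in this work.

    // Hypothetical OpenCL kernel sketch (not from this work): an element-wise
    // addition written in the C-like OpenCL kernel language.
    __kernel void vec_add(__global const float *a,
                          __global const float *b,
                          __global float *c)
    {
        int i = get_global_id(0);   // each work item processes one element
        c[i] = a[i] + b[i];
    }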
One can conclude from the above that, in order to run OpenCL on an FPGA, an ideal tool flow should automatically convert a kernel source into a system-level FPGA configuration, program the FPGA, and run the OpenCL application without disconnecting the physical link.

The initial exploration of OpenCL on FPGAs started with high level synthesis tools, which are designed to convert high level semantics such as C or C++ into HDL code [5]-[8]. Designers can direct the tools to create interfaces, utilize vendor primitives, or optimize hardware logic. Some tools can even simulate and verify the design. High level synthesis tools tremendously reduce the development time in creating HDL code. However, they still require that designers manually integrate the generated HDL code into the final system. Cartwright et al. [9] have created FSM SYS Builder, which assembles IP components into a system. They use the OpenCL standard, mapping OpenCL APIs onto hthread, a hardware-based micro OS in a system-on-chip (SoC) environment. Owaida et al. [10] have focused on the compiler tools: an architectural synthesis tool called Silicon OpenCL can generate hardware accelerators and SoC systems from OpenCL programs. The evaluation used a single FPGA and a single static kernel. The work in [11] has shifted the abstraction level. Using Convey's HC-1 platform, the onboard CPU is used as the compute device, while the four onboard FPGAs (Application Engines) are used as compute units. Kernels are replicated to test different configurations. Source-to-source translation is used to convert OpenCL into C source, which is then fed into AutoESL (now Xilinx Vivado HLS) to generate the HDL core. The final integration involves Convey's PDK framework and the Xilinx ISE tools. The evaluation used four FPGAs and a static kernel. Taneem [12], in his Master's thesis, has studied an OpenCL framework as a unified programming framework for CPU, GPGPU, and FPGA. Specifically for the FPGA, a static OpenCL framework with a kernel controller, host interface, and memory controller is proposed.

[Fig. 1. Block diagram of Pico Computing's architecture: FPGA modules 0 through 5 connect through a PCI Express switch on the backplane to the host.]

[Fig. 2. M56 modules and the EX7 backplane.]

Altera [13], [14] has introduced the industry's first FPGA support for OpenCL. An OpenCL framework similar to the one described in [12] is used for each FPGA board. Kernels are compiled into HDL code using Altera's proprietary tools and stitched into the framework. The kernel is built in a pipelined fashion to emulate parallel work items in progress. A middle layer library translates OpenCL APIs into FPGA transactions. With Altera's CvP, the host can program the FPGA on the fly without disconnecting the physical link.

III. DESIGN AND IMPLEMENTATION

A. Pico Computing's Architecture

We have previously designed several FPGA modules adopted in many projects [15], [16]. The philosophy behind this design approach is to pack as many FPGAs as possible onto a single backplane, while providing flexibility to change or upgrade the FPGA modules. As shown in Figure 1, FPGA modules on the backplane are connected through a central PCI Express switch to the host and appear to the host as independent PCI Express devices. Depending on the physical dimensions of the FPGA module, one full-length backplane (312 mm) can carry up to six FPGA modules. Therefore, a single 4U server with eight PCI Express backplanes is able to carry as many as 48 FPGAs. Additionally, because the architecture is based on modules that snap onto the backplane, designers can explore different FPGA options, including FPGAs of different types, sizes, or vendors.

B. M56 Module and EX7 Backplane

The M56 module has a Stratix V A3 FPGA (5SGXA3E3H29C2), 8 GB of DDR3 memory, 256 Mb of EPCQ flash, and a JTAG connector. The GPIO connector provides 46 LVDS signal pairs and two sets of MGT pairs. On the backside of the module, a Samtec connector provides a PCI Express connection capable of Gen3 ×8. The M56 measures 44.49 × 98.57 mm, so six M56s can fit on a single full-length PCI Express backplane. With the heat sink installed, the M56 module and the backplane occupy a double-slot depth. In this work, we have also designed the EX7 backplane, on which a PLX PEX8780 switch provides a PCI Express Gen3 ×16 connection to the host. Figure 2 shows a photo of the EX7 with four M56 modules.

C. OpenCL Framework and Tool Flow

The Altera OpenCL flow uses the Altera Offline Compiler (aoc) to compile an OpenCL kernel source into an HDL core and stitches the core into a firmware framework. The framework needs to be developed for each different FPGA board. We have developed our OpenCL framework based on Altera's reference design. As shown in Figure 3, within the framework, the PCI Express core communicates with the host, and the memory controller accesses the DDR3 memory. The blank area circled with a dotted line is where the compiled OpenCL kernels reside. To fulfill the goal of dynamic configuration during runtime, Altera's CvP is used. The framework (shaded area) is constrained as a LogicLock region and exported as a framework partition. All OpenCL kernel builds share the same framework partition, so a CvP update does not overwrite the framework partition and therefore does not affect the PCI Express link. The CvP function is currently only available with a PCI Express Gen2 interface. As such, the framework of the M56 module is configured with a Gen2 ×8 interface.
On the host side, the application calls clCreateProgramWithBinary instead of clCreateProgramWithSource to load the generated configuration (.aocx file) onto the FPGAs. The rest of the OpenCL flow in the application remains unchanged.
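As a rough illustration of this host-side step, the following is a minimal C sketch that reads a precompiled image from disk and hands it to clCreateProgramWithBinary; the file path, helper name, and omitted error handling are illustrative assumptions, not code from this work.

    /* Hypothetical host-side sketch (C, OpenCL 1.x API): load a precompiled
     * .aocx image with clCreateProgramWithBinary instead of compiling from
     * source. File and function names are illustrative; error handling is
     * omitted for brevity. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <CL/cl.h>

    cl_program load_aocx(cl_context ctx, cl_device_id dev, const char *path)
    {
        FILE *f = fopen(path, "rb");
        fseek(f, 0, SEEK_END);
        size_t size = ftell(f);
        rewind(f);
        unsigned char *image = malloc(size);
        fread(image, 1, size, f);
        fclose(f);

        cl_int bin_status, err;
        const unsigned char *bins[] = { image };
        cl_program prog = clCreateProgramWithBinary(ctx, 1, &dev, &size,
                                                    bins, &bin_status, &err);
        /* For a pre-placed-and-routed FPGA image this "build" step does not
           run a compiler, but the OpenCL API still requires the call. */
        clBuildProgram(prog, 1, &dev, "", NULL, NULL);
        free(image);
        return prog;
    }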

[Fig. 3. OpenCL framework and tool flow: the kernel source is compiled by aoc together with the framework (PCI Express Gen2 ×8 core and DDR3 controllers) into a <kernel>.aocx configuration.]

TABLE I
RESOURCE UTILIZATION OF THE OPENCL FRAMEWORK

                    Logic (ALMs)   Registers   Memory (M20K)   DSP blocks
Stratix V A3        128,300        513,200     957             256
OpenCL Framework    19%            8%          20%             0%
Pico's Framework    11%            6%          12%             0%

D. AES Application

To experiment with the architecture using OpenCL, the kernel application is selected with the following criteria:

- A well-known application: it should be a common application, so the experiment is easy to find and to repeat.
- Suitable for FPGA: the application should fit within the resources available on the FPGA.
- Scalability: the application should scale well across multiple FPGA devices.

In this work, an OpenCL AES kernel implementing the AES 256-bit algorithm in ECB mode is selected. The AES kernel is constructed as an engine (dynamic library) that can be linked into the OpenSSL framework. In the experiment, the host application generates a certain amount (2 GB) of data and encrypts the data on one or more FPGAs. Throughput (encryption rate) is calculated by dividing the workload by the execution time. The code was originally developed by Liu et al. [17] at Virginia Tech. With a couple of line changes, the benchmark is successfully built with a single-line command and runs on the M56 module. Through experiments, the best throughput is achieved when the chunk size used in the pipeline is 256 MB or below. It is understood that many modern processors have AES instructions built into the Instruction Set Architecture (ISA), which can achieve very high throughput. However, the experiments in this work do not target the AES operation itself, but general OpenCL applications that are not implemented as ISA instructions. The same OpenCL-AES test is therefore conducted on the CPU for reference.

IV. EVALUATION

A. Experimental Setup

In this work, the host system is set up with an Intel i7-4770K processor and 32 GB of DDR3 memory. The PCI Express slot on the Intel DH87RL motherboard provides a Gen3 ×16 connection. CentOS 6.5 with Linux 2.6.32 is installed as the operating system. For reference purposes, Intel's OpenCL 1.2 Development Kit is installed. Altera Quartus 13.1 is used for developing the OpenCL applications.

B. Native Performance

The first test measures the resource utilization of the OpenCL framework. To obtain the size of the OpenCL framework, a blank kernel file is used to build the design. The first row of Table I lists the resources available on the Stratix V A3 chip. The second row lists the resource usage reported by Altera's OpenCL tools. As a reference, the third row shows the resource usage of Pico Computing's framework. Pico Computing's framework is currently not compatible with Altera's OpenCL flow, but it has a similar infrastructure, including PCI Express, a DMA engine, an interconnect, and a memory controller. It can be observed that Altera's OpenCL framework occupies more of the resources than our framework.

The second test measures the bandwidth between the host machine and the FPGA devices. The read and write operations are from the perspective of the host. Note that PCI Express Gen3 ×16 has a theoretical bandwidth of 15.75 GB/s, while the theoretical bandwidth of the M56 with its PCI Express Gen2 ×8 connection is 4 GB/s. In the test, six M56 modules are deployed on the EX7 backplane. According to [18], threading is not safe with OpenCL, which means multiple threads could corrupt the shared runtime address space.
During the test, the Linux system call fork() is used to generate multiple processes, with each process accessing one FPGA device in its own independent address space. To ensure the transactions occur at the same time, the system time is reported in each process.
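As a concrete illustration of this one-process-per-device approach, the following is a minimal C sketch; the helper run_io_test, the device count, and the stubbed body are illustrative assumptions, not code from this work.

    /* Hypothetical sketch of the one-process-per-FPGA approach described
     * above: fork() one child per device, so each child drives exactly one
     * FPGA in an independent address space. */
    #include <stdio.h>
    #include <sys/wait.h>
    #include <unistd.h>

    static void run_io_test(int device_index)
    {
        /* In the real test each child would create its own OpenCL context
           for device `device_index` and time large clEnqueueWriteBuffer /
           clEnqueueReadBuffer transfers to compute GB/s. Stubbed here. */
        printf("process %d driving FPGA %d\n", (int)getpid(), device_index);
    }

    int main(void)
    {
        const int num_devices = 6;           /* e.g., six M56 modules on the EX7 */
        for (int i = 0; i < num_devices; i++) {
            pid_t pid = fork();
            if (pid == 0) {                   /* child: drive exactly one FPGA */
                run_io_test(i);
                _exit(0);
            }
        }
        for (int i = 0; i < num_devices; i++) /* parent: wait for all children */
            wait(NULL);
        return 0;
    }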

[Fig. 4. Bandwidth between the host and the FPGA devices for one to six devices (Gen3 and Gen2 backplanes, read and write).]

[Fig. 5. Bandwidth between the FPGA devices and DDR3 memory (ideal vs. measured) for one to six devices.]

In Figure 4, the blue lines (with circle and square marks) show the IO performance of multiple M56 modules on the EX7 backplane. The red lines (with asterisk and cross marks) show the same test using the same M56s on a Gen2 ×16 backplane. The horizontal lines depict the theoretical bandwidths of Gen3 and Gen2, respectively. It can be seen that the IO bandwidth grows linearly on the Gen3 backplane when fewer than four M56 modules are used. When more than four M56 modules are used, the read bandwidth saturates around 13.1 GB/s; the write bandwidth falls off its linear growth and reaches 12.1 GB/s. Running the same IO test on the Gen2 backplane, four M56s saturate the IO bandwidth.

The third experiment tests the combined read and write bandwidth between the FPGA and the DDR3 memory. Each M56 module has 8 GB of DDR3 memory running at 800 MHz; the ideal bandwidth between the FPGA and the DDR3 memory is 12.8 GB/s (a 64-bit interface at 800 MHz DDR, i.e., 1600 MT/s × 8 bytes). Figure 5 shows that the peak bandwidth grows linearly as more M56s are involved. The total bandwidth is 64.5 GB/s for six M56 modules.

C. AES on a Single M56

During the AES application test, we have experimented with the kernel optimization settings described in [19] on the AES encryption kernel. The attribute num_simd_work_items vectorizes the data path of the kernel, while num_compute_units duplicates the entire kernel. The num_simd_work_items attribute only takes powers of 2 (such as 2, 4, or 8) as input, while num_compute_units has no limit on the number of copies. Table II shows the resource utilization, build time, and power consumption of the different attribute settings. The first row is the original kernel code without optimization. The second and third rows (labeled SIMD 2 and SIMD 4) vectorize the data path by 2 and 4, respectively. The fourth and fifth rows (labeled COMP 2 and COMP 3) duplicate the kernel by 2 and 3, respectively. The tool fails to build when vectorizing the data path by 8 or duplicating the kernel by 4, because these optimizations require more resources than the Stratix V A3 can provide.
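For illustration, the sketch below shows how these Altera OpenCL attributes are attached to a kernel; the kernel body is a placeholder and is not the AES kernel from [17], and the particular attribute values are arbitrary examples.

    // Hypothetical sketch of the Altera OpenCL attributes discussed above,
    // attached to a placeholder kernel (not the AES kernel from [17]).
    // num_simd_work_items vectorizes the work-item data path;
    // num_compute_units replicates the whole kernel pipeline.
    __attribute__((num_simd_work_items(4)))
    __attribute__((num_compute_units(2)))
    __attribute__((reqd_work_group_size(64, 1, 1)))
    __kernel void example_kernel(__global const uint *in, __global uint *out)
    {
        size_t gid = get_global_id(0);
        out[gid] = in[gid] ^ 0xdeadbeefu;   // placeholder computation
    }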

TABLE II
RESOURCE UTILIZATION, BUILD TIME, AND POWER OF THE OPENCL AES KERNEL

attr.      Logic   Reg.   Memory   DSP   Time      Power
Original   44%     12%    30%      0%    70 min    29.6 W
SIMD 2     62%     13%    29%      0%    162 min   29.9 W
SIMD 4     98%     14%    31%      0%    523 min   30.1 W
COMP 2     68%     15%    35%      0%    82 min    30.6 W
COMP 3     89%     17%    41%      0%    94 min    31.0 W

[Fig. 6. Throughput and throughput/power of OpenCL-AES on the M56 for the CPU and the different attribute settings.]

Figure 6 presents the throughput (encryption rate) measured from the OpenCL-AES test. The first blue (dark) bar shows the OpenCL throughput of the i7-4770K executing on all eight logical cores at 3.4 GHz. The remaining blue (dark) bars show the OpenCL throughput of the M56 using the different attribute settings. As shown in Figure 6, vectorizing the data path by 4, which consumes the most resources in Table II, achieves the highest throughput of 5.1 Gb/s, while the same test achieves 4.4 Gb/s on the i7-4770K.

During this test, a DC power supply was used to measure the current draw of the FPGA and the backplane. For the i7-4770K processor, the thermal design power (TDP) of 84 W from the datasheet has been used in the calculation [20]. However, the TDP is a nominal value, and the actual power of the i7-4770K running all eight logical cores at 3.4 GHz may be higher. Listed in the last column of Table II, power consumption is calculated by multiplying the DC voltage and current. The green (light) bars plotted in Figure 6 show throughput/power as the power efficiency result. The power efficiency of a single FPGA is about 3× better than that of the i7-4770K: for example, the SIMD 4 kernel delivers 5.1 Gb/s at 30.1 W (about 0.17 Gb/s/W), versus 4.4 Gb/s at 84 W (about 0.05 Gb/s/W) for the CPU.

D. AES on Multiple M56s

While the EX7 backplane can accommodate multiple FPGA modules, six M56 modules are used to run the OpenCL-AES test in parallel. In this test, all FPGAs use the AES SIMD 4 kernel configuration. To access multiple FPGAs, the host application creates one process for each FPGA. The wall-clock time is reported to ensure the computation occurs in parallel.

[Fig. 7. Total throughput and throughput/power of OpenCL-AES on multiple M56 modules.]

TABLE III
POWER CONSUMPTION OF MULTIPLE FPGAS

# FPGAs     0      1      2      3      4      5      6
Power (W)   17.9   30.4   44.5   58.8   72.2   85.6   99.1

In Figure 7, the blue (dark) bars show that the peak throughput grows with the number of FPGAs. The peak throughput reaches 19.8 Gb/s using six FPGAs. However, the performance growth slows down as more FPGAs are adopted, due to the overhead introduced by managing multiple FPGAs. When five or six FPGAs are used, the throughput almost saturates because of the limited bandwidth of PCI Express Gen3 ×16, similar to the result observed in Figure 4. Table III lists the power consumed by the FPGAs and the backplane. In the first column, where the number of FPGAs is zero, the 17.9 W is consumed by the EX7 backplane alone. In Figure 7, the power efficiency (throughput/total power) is plotted as green (light) bars. The peak power efficiency is achieved when four FPGAs are used, which is about 5× that of the i7-4770K. When more than four FPGAs are used, the power efficiency decreases because the throughput gains diminish while the power continues to grow.

V. CONCLUSION AND FUTURE WORK

In this work, a scalable FPGA architecture for high performance computing has been designed. The design includes multiple FPGA modules and a high performance backplane. The modular nature of this architecture supports the combination of different FPGAs and provides for easy hardware updates. The FPGA module is based on a Stratix V, which is compatible with Altera's OpenCL tool flow. The evaluation tested the native IO performance, and the results demonstrate linear scalability using six FPGAs. The host-to-device peak bandwidth is measured as 13.1 GB/s for read operations and 12.1 GB/s for write operations. The total FPGA-to-memory bandwidth is measured as 64.5 GB/s. The OpenCL-AES test results show that the peak throughput is achieved when six FPGA modules are adopted. Compared against a general-purpose processor, the throughput per watt shows a 3× improvement using a single FPGA and a 5× improvement using four FPGAs.

In the course of the experiments, several OpenCL benchmark suites [21]-[23] were evaluated on the designed architecture. However, some benchmark kernels fail to compile because these benchmarks target GPGPUs. Such instruction-oriented code executes easily on a GPGPU but can yield a netlist too large to fit on the M56. We plan to investigate this no-fit issue in the future. For the multi-FPGA AES test, which continuously transfers data between the host and the devices, Figure 7 shows that the optimal number of FPGA modules on a Gen3 ×16 backplane is four; the overall performance is bounded by the IO between the host and the devices. To investigate how six or more M56s can benefit OpenCL applications, we plan to evaluate other, more compute-bound OpenCL applications on multiple EX7s in the future.

ACKNOWLEDGEMENT

We would like to thank Dr. Peter Yiannacouras from Altera for his help with the OpenCL flow.

REFERENCES

[1] M. Gokhale et al., Splash: A reconfigurable linear logic array, in ICPP (1), 1990, pp. 526-532.
[2] A. Krasnov et al., RAMP Blue: A message-passing manycore system in FPGAs, in FPL 2007, pp. 54-61.
[3] A. G. Schmidt et al., An evaluation of an integrated on-chip/off-chip network for high-performance reconfigurable computing, Int. J. Reconfig. Comp., 2012.
[4] Khronos Group, The OpenCL Specification, October 2009.
[5] Xilinx, Vivado High-Level Synthesis, Nov. 2013, URL: http://www.xilinx.com/produces/design-tools/vivado/integration/esl-design.
[6] Impulse Accelerated Technologies, Inc., Jul. 2014, URL: http://www.impulsec.com/.
[7] J. Villarreal et al., Designing modular hardware accelerators in C with ROCCC 2.0, in Field-Programmable Custom Computing Machines (FCCM), 2010 18th IEEE Annual International Symposium on, May 2010, pp. 127-134.
[8] J. Tripp et al., Trident: An FPGA compiler framework for floating-point algorithms, in Field Programmable Logic and Applications, 2005 International Conference on, Aug 2005, pp. 317-322.
[9] E. Cartwright et al., Creating HW/SW co-designed MPSoPCs from high level programming models, in High Performance Computing and Simulation (HPCS), 2011 International Conference on, July 2011.
[10] M. Owaida et al., Synthesis of platform architectures from OpenCL programs, in Field-Programmable Custom Computing Machines (FCCM), 2011 IEEE 19th Annual International Symposium on, May 2011, pp. 186-193.
[11] P. Athanas, K. Kepa, and K. Shagrithaya, Enabling development of OpenCL applications on FPGA platforms, in Proceedings of the 2013 IEEE 24th International Conference on Application-specific Systems, Architectures and Processors (ASAP), ser. ASAP '13, 2013.
[12] A. Taneem, OpenCL framework for a CPU, GPU, and FPGA platform, Master's thesis, University of Toronto, 2011.
[13] T. Czajkowski et al., From OpenCL to high-performance hardware on FPGAs, in Field Programmable Logic and Applications (FPL), 2012 22nd International Conference on, Aug 2012, pp. 531-534.
[14] Altera, Implementing FPGA Design with the OpenCL Standard, Nov. 2013, White Paper, URL: http://www.altera.com/literature/wp/wp-1173-opencl.pdf.
[15] R. Kirchgessner et al., VirtualRC: A virtual FPGA platform for applications and tools portability, in Proceedings of the ACM/SIGDA International Symposium on Field Programmable Gate Arrays, ser. FPGA '12, 2012.
[16] C. Olson et al., Hardware acceleration of short read mapping, in Field-Programmable Custom Computing Machines (FCCM), 2012 IEEE 20th Annual International Symposium on, April 2012.
[17] Z. Liu and A. R. M. Ganesh, OpenCL-AES, Dec. 2011, URL: http://www.github.com/softboysxp/opencl-aes.
[18] Altera, Altera SDK for OpenCL Programming Guide, Nov. 2013, URL: http://www.altera.com/literature/lit-opencl-sdk.jsp.
[19] Altera, Altera SDK for OpenCL Optimization Guide, Nov. 2013, URL: http://www.altera.com/literature/lit-opencl-sdk.jsp.
[20] Intel, Intel Core i7-4770K Processor, 2013, URL: http://ark.intel.com/products/75123/.

[21] W. Feng et al., OpenCL and the 13 dwarfs: A work in progress, in Proceedings of the 3rd ACM/SPEC International Conference on Performance Engineering, ser. ICPE '12, 2012.
[22] S. Seo, G. Jo, and J. Lee, Performance characterization of the NAS Parallel Benchmarks in OpenCL, in Workload Characterization (IISWC), 2011 IEEE International Symposium on, Nov 2011.
[23] A. Danalis et al., The Scalable Heterogeneous Computing (SHOC) benchmark suite, in Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units, ser. GPGPU '10. New York, NY, USA: ACM, 2010, pp. 63-74.