Understanding Peak Floating-Point Performance Claims


White Paper

Learn how to calculate and compare the peak floating-point capabilities of digital signal processors (DSPs), graphics processing units (GPUs), and FPGAs.

Author: Michael Parker, Principal DSP Planning Manager, Intel Programmable Solutions Group

Introduction

DSPs, GPUs, and FPGAs serve as accelerators for the CPU, providing both performance and power-efficiency benefits. Given the variety of computing architectures available, designers need a uniform method to compare performance and power efficiency. The accepted method is to measure floating-point operations per second (FLOPS), where a FLOP is defined as either an addition or a multiplication of single-precision (32 bit) or double-precision (64 bit) numbers in conformance with the IEEE 754 standard. All higher-order functions, such as divide, square root, and trigonometric operators, can be constructed from adders and multipliers. Because these operators, as well as other common functions such as fast Fourier transforms (FFTs) and matrix operations, require both adders and multipliers, there is commonly a 1:1 ratio of adders to multipliers in all these architectures.

DSP, GPU, and FPGA performance comparison

We compare the performance of the DSP, GPU, and FPGA architectures based on their peak FLOPS rating. The peak FLOPS rating is determined by multiplying the sum of the adders and multipliers by the maximum operating frequency. This represents the theoretical limit for computation, which can never be achieved in practice; it is generally not possible to implement a useful algorithm that keeps all the computational units occupied all the time. It does, however, provide a useful comparison metric.
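The peak-rating arithmetic applied to each device in the sections that follow can be sketched in a few lines of Python (the unit counts and clock rates are the figures quoted in this paper):

```python
def peak_gflops(flops_per_cycle: int, fmax_ghz: float) -> float:
    """Peak GFLOPS = FLOPs issued per clock cycle x maximum clock rate in GHz."""
    return flops_per_cycle * fmax_ghz

# TI TMS320C667x DSP: 64 adders + 64 multipliers at 1.25 GHz
dsp = peak_gflops(64 + 64, 1.25)          # 160 GFLOPS

# NVIDIA Tesla K20: 13 SMX x 192 CUDA cores x 2 FLOPs/cycle at 706 MHz
gpu = peak_gflops(13 * 192 * 2, 0.706)    # ~3,520 GFLOPS

# Intel Arria 10 10AX066: 1,688 DSP blocks x 2 FLOPs/cycle at 450 MHz
fpga = peak_gflops(1_688 * 2, 0.450)      # ~1,500 GFLOPS
```

These are theoretical ceilings: no real workload keeps every unit busy on every cycle.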
DSP peak GFLOPS

An example is provided by the Texas Instruments TMS320C667x DSP. This DSP contains eight C66x DSP cores, with each core containing two processing subsystems (sides).

Figure 1. TMS320C667x DSP architecture: eight C66x CorePacs at up to 1.25 GHz, each with 32 KB L1 P-cache, 32 KB L1 D-cache, and 512 KB L2 cache, sharing 4 MB MSMC SRAM; each core performs 16 fixed-point multiplies and 4 floating-point multiplies per cycle per side. (Source: www.ti.com)

Each subsystem contains four single-precision floating-point adders and four single-precision floating-point multipliers, for a device total of 64 adders and 64 multipliers. The fastest version available runs at 1.25 GHz, providing a peak of 160 GigaFLOPS (GFLOPS).

GPU peak GFLOPS

One of the most powerful GPUs is the NVIDIA Tesla K20. This GPU is based upon CUDA cores, each with a single floating-point multiply-adder unit that can execute one multiply-add per clock cycle in single-precision floating-point configuration. There are 192 CUDA cores in each Streaming Multiprocessor (SMX) processing engine. The K20 actually contains 15 SMX engines, although only 13 are available (for example, due to process yield issues). This gives a total of 2,496 available CUDA cores, each providing 2 FLOPS per clock cycle, running at a maximum of 706 MHz. This provides a peak single-precision floating-point performance of 3,520 GFLOPS.

FPGA peak GFLOPS

Intel offers hardened floating-point engines in their FPGAs. A single-precision floating-point multiplier and adder have been incorporated into the hard DSP blocks embedded throughout the programmable logic structures. A medium-sized, midrange Intel Arria 10 FPGA is the 10AX066. This device has 1,688 DSP blocks, each of which can perform 2 FLOPS per clock cycle, resulting in 3,376 FLOPS each clock cycle. At a rated speed of 450 MHz (for floating point; the fixed-point modes are higher), this provides about 1,500 GFLOPS. Computed in a similar fashion, Intel states that up to 10,000 GFLOPS, or 10 TeraFLOPS, of single-precision performance are available in the high-end Intel Stratix 10 FPGAs, achieved with a combination of clock rate increases and larger devices with much more DSP computing resources.

Floating point in FPGAs using programmable logic

Floating point has always been available in FPGAs using the programmable logic of the FPGA.
Furthermore, with programmable-logic-based floating point, an arbitrary precision level can be implemented; designs are not restricted to industry-standard single and double precision. Intel offers seven different levels of floating-point precision. However, determining the peak floating-point performance of a given FPGA using a programmable logic implementation is not at all straightforward. Therefore, the peak floating-point rating of Intel FPGAs is based solely on the capability of the hardened floating-point engines, and assumes that the programmable logic is not used for floating point, but rather for the other parts of a design, such as the data control and scheduling circuits, I/O interfaces, internal and external memory interfaces, and other needed functionality.

Figure 2. GP-GPU architecture: each CUDA core contains a dispatch port, operand collector, a floating-point unit (fused multiplier plus adder), and an integer unit feeding a result queue; cores share a register file (32,768 x 32 bit), an interconnect network, 64 KB of shared memory/L1 cache, load/store units, and special function units. (Source: www.nvidia.com)

Challenges to determine floating-point performance

There are several factors that make the calculation of floating-point performance using programmable logic very difficult. The amount of logic needed to build one single-precision floating-point multiplier and adder can be determined by consulting the FPGA vendor's floating-point intellectual property (IP) user guide. However, one vital piece of information is not reported in the user guide: the routing resources required. To implement floating point, large barrel shifters are required, and these happen to consume tremendous amounts of the programmable routing (the interconnect between the programmable logic elements). All FPGAs have a given amount of interconnect to support the logic, sized for what a typical fixed-point FPGA design will use. Unfortunately, floating point requires a much higher degree of this interconnect than most fixed-point designs.

Figure 3. Floating-point DSP block architecture in FPGAs: a pipelined IEEE 754 single-precision floating-point multiplier feeding a pipelined IEEE 754 single-precision floating-point adder, with 32-bit inputs and output.

When a single instance of a floating-point function is created, it can draw upon routing resources in the general region of the logic elements used. However, when large numbers of floating-point operators are packed together, the result is routing congestion. This causes a large reduction in achievable design clock rates, as well as logic usage much higher than that of a comparable fixed-point FPGA design. Intel has a proprietary synthesis technique known as fused datapath that mitigates this to some extent and allows very large floating-point designs to be implemented in the logic fabric, leveraging fixed-point 27 x 27 multipliers for single precision, or 54 x 54 multipliers for double precision. In addition, the FPGA logic cannot be fully used. As the design takes up a large percentage of the available logic resources, the clock rate, or fMAX, at which timing closure can be achieved is reduced, and eventually timing closure cannot be achieved at all. Typically, 70-90% of the logic can actually be used, and with dense floating-point designs, usable logic tends to be at the lower end of this range.
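As a rough illustration of this derating, the sketch below scales a theoretical fabric rating by a packing factor and a clock penalty; the 70% utilization and 40% clock reduction are illustrative assumptions, not vendor data:

```python
def usable_gflops(peak: float, logic_util: float, fmax_derate: float) -> float:
    """Scale a theoretical logic-fabric rating by achievable packing and clock rate."""
    return peak * logic_util * fmax_derate

# A 2,000 GFLOPS "paper" rating at 70% usable logic and 60% of single-operator fmax
# leaves roughly 840 GFLOPS before any algorithm-level inefficiency is counted.
print(usable_gflops(2_000, 0.70, 0.60))
```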
Benchmarking floating-point designs

For these reasons, it is nearly impossible to calculate the floating-point capacity of an FPGA when implemented in programmable logic. Instead, the best method is to build benchmark floating-point designs, which include the timing closure process. Alternatively, the FPGA vendor can supply such designs, which greatly aids in estimating what is possible in a given FPGA. Intel provides benchmark designs on 28 nm FPGAs, covering basic as well as complex floating-point designs. The published results show that with 28 nm FPGAs, several hundred GFLOPS can be achieved for simpler algorithms such as FFTs, and just over 100 GFLOPS for complex algorithms such as QR and Cholesky decomposition. For the benchmark results, refer to the Radar Processing: FPGAs or GPUs? (PDF) white paper. In addition, Berkeley Design Technology, Inc. (BDTI), a third-party technology analysis firm, conducted an independent analysis of complex, high-performance floating-point DSP designs on Intel's 28 nm FPGAs. For information from this independent analysis, refer to the An Independent Analysis of Floating-point DSP Design Flow and Performance on Altera 28-nm FPGAs (PDF) white paper by BDTI. Many other floating-point benchmark designs have also been implemented using OpenCL on Intel's 28 nm Stratix V FPGAs, and are available upon request. These designs are in the process of migrating to Arria 10 FPGAs, which will provide dramatic performance improvements due to the hardened floating-point DSP block architecture.

How not to calculate FPGA GFLOPS

To reiterate, floating point has always been possible in FPGAs, implemented largely in programmable logic and using fixed-point multiplier resources. But the degree of ambiguity in actual performance and throughput can allow for some very aggressive marketing claims. We present a real-life example of how such an artificial benchmark could be constructed, using Xilinx's 28 nm and 20 nm FPGAs as the case study.
Refer to Table 1 for resource information published by Xilinx for constructing a single floating-point function, and for determining its performance on the Kintex-7 FPGA, built on the 28 nm process.

Table 1. Xilinx Floating-Point Function Resources and fMAX (Characterization of Single-Precision Format on Kintex-7 FPGAs)

Operation            Implementation                          DSP48E1s   LUT-FF Pairs   LUTs    FFs     fMAX (MHz, Kintex-7 -1 speed grade)
Add/Subtract         DSP48E1 (speed optimized, full usage)   2          354            38      34      463
Add/Subtract         Logic (speed optimized, no DSP usage)   -          578            379     6       550
Add/Subtract         Logic (low latency)                     -          656            5       668     447
Fused Multiply-Add   DSP48E1 (full usage)                    4          1,83           1,46    8       766
Fused Multiply-Add   DSP48E1 (medium usage)                  -          1,33           1,38    488     438
Accumulator          DSP48E1 (full usage)                    5          1,537          894     1,551   41
Accumulator          DSP48E1 (medium usage)                  -          1,566          915     1,588   41
Accumulator          Logic                                   -          1,66           1,97    1,55    375
Fixed to float       Int32 input                             -          13             164     9       559
Float to fixed       Int32 result                            -          4              175     37      557
Float to float       Single to double                        -          7              1       7       65
Compare              Programmable                            -          51             48      1       64
Divide               RATE = 1                                -          1,5            777     1,656   45
Divide               RATE = 6                                -          56             9       9       46

Source Information: www.xilinx.com

Note: The Virtex-7 device shares the same core architecture as the Kintex-7 device, so the numbers in Table 1 can be used as a reference for the example that follows.

The largest logic-density 28 nm device offered by Xilinx is the Virtex-7 XC7V2000T, with 1,954.56K logic cells (LCs) and 2,160 DSP slices (each with an 18 x 25 fixed-point multiplier). If the fused multiply-add function were chosen for benchmarking, only 1,440 units could be built due to the scarcity of DSP slices. Also, the rated speed is 438 MHz for a single instantiation in the fastest speed grade, making this a poor choice for a peak GFLOPS claim, although the multiplier-adder is the standard function used in floating-point architectures. To maximize the floating-point rating, the best choice is the add/subtract function. The best strategy is to build as many adders as possible until the DSP48E1 slices are exhausted, and to build the remaining adders in pure logic. The DSP-resourced adders each require 2 DSP slices and 354 LUT-FF pairs (or LCs), and a single instantiation can operate at 463 MHz.
The logic-based adder uses 578 LCs, and a single instantiation can operate at 550 MHz. We assume 100% of the logic and 100% of the DSP slices are used (this requires enough routing to be available to utilize all of the logic). Using this method, Xilinx could claim 1,996 GFLOPS of single-precision floating-point performance. Similarly, at 20 nm, the largest logic-density device offered by Xilinx is the UltraScale XCVU440. It features 4,407.48K LCs and 2,880 DSP slices (each of which supports an 18 x 27 fixed-point multiplier). Table 1 can also be used for 20 nm UltraScale devices, as the device core architectures are very similar. The 20 nm process also offers an increase in maximum frequency, assumed here at 12%.

Table 2. Calculated Peak GFLOPS on 28 nm Xilinx FPGA

Number of Adders   Adder Type    fMAX      GFLOPS   LCs per Adder   LCs Required   DSP Slices Required
1,080              DSP based     463 MHz   500.04   354             382.3K         2,160
2,720              Logic based   550 MHz   1,496    578             1,572.2K       -
3,800 (total)      -             -         1,996    -               1,954.5K       2,160
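The adder-packing arithmetic behind Tables 2 and 3 can be reproduced directly. The per-adder costs are the Table 1 figures used in the text: 2 DSP48E1 slices plus 354 LCs at 463 MHz for a DSP-based adder, and 578 LCs at 550 MHz for a logic-based adder.

```python
def marketing_gflops(dsp_slices: int, logic_cells: int, uplift: float = 1.0):
    """Pack adders into every DSP slice, then fill the remaining fabric with logic adders."""
    dsp_adders = dsp_slices // 2                 # 2 DSP48E1 slices per DSP-based adder
    lcs_left = logic_cells - dsp_adders * 354    # 354 LCs consumed per DSP-based adder
    logic_adders = lcs_left // 578               # 578 LCs per logic-based adder
    gflops = dsp_adders * 0.463 * uplift + logic_adders * 0.550 * uplift
    return dsp_adders, logic_adders, gflops

# 28 nm Virtex-7 XC7V2000T (Table 2): 1,080 + 2,720 adders, ~1,996 GFLOPS
print(marketing_gflops(2_160, 1_954_560))
# 20 nm UltraScale XCVU440 (Table 3) with the 12% frequency uplift: ~4,900 GFLOPS
print(marketing_gflops(2_880, 4_407_480, uplift=1.12))
```

The totals agree with the tables to within rounding, which is exactly the point: the headline number follows mechanically from adder packing, not from any realizable design.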

Table 3. Calculated Peak GFLOPS on 20 nm Xilinx FPGA

Number of Adders   Adder Type    fMAX         GFLOPS    LCs per Adder   LCs Required   DSP Slices Required
1,440              DSP based     518.56 MHz   746.7     354             509.8K         2,880
6,743              Logic based   616 MHz      4,157.7   578             3,897.4K       -
8,183 (total)      -             -            4,904     -               4,407.2K       2,880

Just coincidentally, this number is what Xilinx claims, although the method used to determine the GFLOPS is not provided. There is a reason for the 1:1 ratio of floating-point multipliers to adders in most computer architectures. Applications using multipliers include matrix multiplication, FFTs, higher-order mathematical functions (such as square root, trigonometric, and log functions), matrix inversion, and all linear-algebra-intensive designs. Any useful floating-point application requires a large number of floating-point multipliers; hence the practical reason why computer architectures offer equal numbers of multipliers and adders. Although high ratings can be claimed by using only floating-point adders (no multipliers), such designs have no application benefit. As every experienced FPGA designer knows, the design fMAX decreases as the percentage of logic used increases. Generally, logic usage above 80% results in significant degradation of the clock rate at which timing closure can be achieved. Furthermore, a large amount of logic is required for other functions, which further reduces the logic available for floating-point computation. The choice of FPGA family is also significant. The family chosen here consists of very high-cost devices, generally used for ASIC prototyping applications, where device cost and power consumption are not important factors. Using the FPGA family optimized for DSP applications would dramatically lower the calculated GFLOPS. While not demonstrated here, early power estimators are available from FPGA vendors; a design using all of the resources and the fMAX defined in Table 3 would have a power consumption in the hundreds of watts.
This particular FPGA is intended for ASIC prototyping applications, where clock rates and toggle rates are low. High clock rates and toggle rates would dramatically increase power consumption beyond the VCC current and power dissipation capabilities of the device. In short, this is not an industry-recognized method of benchmarking, and the numbers obtained using it should not be compared with the peak floating-point performance claims of other semiconductor vendors. The peak GFLOPS figure should represent the achievable performance of a given device. Of course, some derating is still needed, as almost no algorithm can keep all the computational units performing useful calculations at 100% duty cycle.

Conclusion

FPGAs with hardened floating-point DSP blocks are now available, providing single-precision performance from 160 to 1,500 GFLOPS in midrange Arria 10 devices, and up to 10,000 GFLOPS in high-end Stratix 10 devices. These peak GFLOPS metrics are computed with the same transparent methodology used for CPUs, GPUs, and DSPs. This methodology provides designers a reliable technique for a baseline comparison of the peak floating-point computing capabilities of devices with very different architectures. The next level of comparison should be based on representative benchmark designs implemented on the platform of interest. For FPGAs lacking hard floating-point circuits, vendor-calculated theoretical GFLOPS numbers are quite unreliable.
Any FPGA floating-point claim based on a logic implementation at over 500 GFLOPS should be viewed with a high level of skepticism. In this case, a representative benchmark design implementation is essential to make a comparative judgment. The FPGA compilation report showing logic, memory, and other resources, along with the achieved clock rate, should also be provided. Better yet, any party claiming a specific performance level should be willing to make available the compiled design file, allowing the results to be replicated.

Table 4. Xilinx Floating-Point Performance Claims (Maximum DSP Capabilities)

                             Zynq-7000     7 Series                    UltraScale
Logic Cells                  444K          1,955K                      4,400K
Block RAM                    3, Kb         68 Mb                       115 Mb
DSP Slices                   2,020         3,600                       5,520
Fixed-Point Performance      2,622 GMACS   5,335 GMACS                 8,151 GMACS
Floating-Point Performance   778 GFLOPs    2,000 GFLOPs                4,930 GFLOPs
Transceivers                 16            96                          104
Transceiver Performance      1.313 Gbps    12.5 / 13.1 / 28.05 Gbps    16.3 / 32.75 Gbps

Source Information: www.xilinx.com

Please note the assumptions used in this calculation:
- Only floating-point adders, and no floating-point multipliers, were used
- 100% of the FPGA logic was used
- Timing for the complete design was closed at the fMAX of a single floating-point operator
- Zero logic resources were left available for memory interfacing, control circuits, and other functions
- The devices chosen were not the DSP-rich members of the FPGA family, but rather devices useful for emulating ASIC logic as a prototyping platform

Where to get more information

For more information about Intel and Arria 10 FPGAs, visit https://www.altera.com/products/fpga/arria-series/arria-10/overview.html

1. http://www.altera.com/literature/po/bg-floating-point-fpga.pdf
2. http://www.altera.com/literature/wp/wp-01197-radar-fpga-or-gpu.pdf
3. http://www.altera.com/literature/wp/wp-01187-bdti-altera-fp-dsp-design-flow.pdf

© 2017 Intel Corporation. All rights reserved. Intel, the Intel logo, Altera, ARRIA, CYCLONE, ENPIRION, MAX, NIOS, QUARTUS, and STRATIX words and logos are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries. Other marks and brands may be claimed as the property of others. Intel reserves the right to make changes to any products and services at any time without notice. Intel assumes no responsibility or liability arising out of the application or use of any information, product, or service described herein except as expressly agreed to in writing by Intel. Intel customers are advised to obtain the latest version of device specifications before relying on any published information and before placing orders for products or services.

WP-01222-1.1