Xilinx(Ultrascale) Vs. Altera(ARRIA 10) Test Bench

By Roy Messinger
www.hwdebugger.com
roy.messinger@hwdebugger.com

1 GENERAL

This document presents a thorough comparison I conducted between two FPGA families from two vendors: Altera ARRIA 10 and Xilinx Kintex UltraScale. The comparison emphasizes frequency, utilization, power and compilation time. I carried it out in an attempt to find the 'best' vendor for my needs, and I gave no 'discounts' to either vendor: all the tests used exactly the same code and the same software preferences. See the important notes on the last page for further information.

2 WHAT I'VE CHECKED WAS:

o Frequency.
o Utilization.
o Thermal power.
o Compilation time.

3 FPGA COMPONENTS

I chose these FPGAs in order to compare two similar components in terms of RAM, size and various other characteristics.

Vendor  Component                   System Logic [k]  RAM [Mb]  PCI-Gen 3  Transcv  I/O
Altera  GX480 (10AX048K1F35E1HG)    629               28        2*8 lanes  36       396
Xilinx  KU035 (XCKU035-1FFVA1156C)  444               25        2*8 lanes  16       520

4 TEST BENCH METHODOLOGY

How did I carry out the comparison?

For the comparison I used a VHDL component containing a state machine (about 20 states). This FSM implements some heavy logic and runs at 400MHz. I built two small projects containing only this component, one in Altera (Quartus) and one in Xilinx (Vivado). After each successful compilation I checked the timing analysis and replicated the component, to push the FPGA capabilities to the edge (space, frequency). I used virtual pins on all components, so there was no need to connect the component ports to the FPGA pins (no connection to I/O buffers). I did not alter anything in either tool: the default implementation/synthesis settings were left as they were.

[Flowchart: component with virtual pins -> compile in Vivado & Quartus -> passes timing req.? Yes: replicate component and compile again. No: compare to second vendor.]
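The compile, check and replicate flow described above can be sketched as a small loop. This is only an illustration: `achieved_mhz` is a hypothetical hook standing in for "compile n copies and read the achieved frequency from the timing report" (it is not a Quartus or Vivado API), and `toy_model` is made-up data used to exercise the loop.

```python
from typing import Callable, Optional


def max_passing_replications(
    achieved_mhz: Callable[[int], float],
    target_mhz: float,
    limit: int = 64,
) -> Optional[int]:
    """Return the largest replication count that met timing before the first failure.

    achieved_mhz(n) is a hypothetical hook: it should compile a project with
    n copies of the component and return the achieved frequency in MHz.
    """
    best = None
    for n in range(1, limit + 1):
        if achieved_mhz(n) < target_mhz:
            break  # timing failed: stop here and compare with the other vendor
        best = n   # timing passed: replicate the component once more
    return best


# Toy model (not measured data): frequency drops by 1 MHz per extra copy.
def toy_model(n: int) -> float:
    return 430.0 - n


print(max_passing_replications(toy_model, 400.0))  # 30
```

In a real flow the hook would launch the vendor tool in batch mode and parse its timing report. That part is deliberately left out here.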

5 TEST BENCH HARDWARE

Compilation computers (both with Windows 7 OS):
o Altera: Quartus version 17.0.0. E5-2643 @3.4GHz (Xeon), 32GB RAM.
o Xilinx: Vivado version 2016.4. I7-6700 @3.4GHz, 32GB RAM.

Components chosen were close to the same spec (to what I need):
o Altera: 10AX048K1F35E1HG; GX480, highest speed grade.
o Xilinx: XCKU035-1FFVA1156C; KU035, slowest speed grade (see notes at last page).
o Both components have the same package dimensions (35mm*35mm).

6 TEST RESULTS

I ran 3 sets of tests, defined as Test A, Test B and Test C.

Test A, 400MHz: Each input is connected to all instantiations. Outputs, obviously, are separated. [Figure: Test A connection diagram]

Test B, 500MHz: Each input is connected to all instantiations. Outputs, obviously, are separated. [Figure: Test B connection diagram]

Test C, 400MHz: Each instantiation has its own inputs (inputs are not shared between instantiations). Outputs, obviously, are separated. [Figure: Test C connection diagram]

Two clocks are created for the design in SDC (Quartus) and XDC (Vivado): 100MHz and 400MHz/500MHz.

This is NOT a real design, but one that can compare performance between both vendors, as it uses a real component and simulates the HW FPGA development phases. The code is the same. Test A and Test B are closer to a real-world implementation in my view, as they define relations between different instantiations inside the FPGA. Test B is intended to push the FPGA to the edge in terms of frequency: neither vendor reaches this frequency, but both tools are supposed to make their best effort. I also implemented Test C to ease the vendors' synthesis, optimization and place & route phases, and to see what happens when there is no relation between different instantiations.

The frequency comparison is between the WNS in Vivado (Worst Negative Slack, the worst of the worst) and the max frequency result in Quartus, which is based on the setup timing at the 100°C corner of the timing report (also the worst of the worst). Both vendor tools use the default preferences (no 'best efforts', etc.).
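To put a Vivado WNS figure next to a Quartus max frequency, the slack has to be turned into a frequency. The text does not say how this was done, so the sketch below assumes the usual conversion: the minimum achievable period is the constrained period minus the worst setup slack. The function name and the example slack values are illustrative only.

```python
def fmax_from_wns(target_mhz: float, wns_ns: float) -> float:
    """Convert a worst setup slack (ns) at a given clock constraint into MHz.

    Assumes the common relation: min_period = constrained_period - WNS.
    Positive WNS means timing is met with margin, negative WNS means it failed.
    """
    period_ns = 1000.0 / target_mhz       # e.g. 400 MHz -> 2.5 ns
    min_period_ns = period_ns - wns_ns    # the period the design could actually run at
    if min_period_ns <= 0:
        raise ValueError("slack is larger than the clock period")
    return 1000.0 / min_period_ns


# Illustrative slack values, not taken from the reports:
print(round(fmax_from_wns(400.0, 0.1), 2))   # 416.67 (timing met with 0.1 ns to spare)
print(round(fmax_from_wns(400.0, -0.1), 2))  # 384.62 (timing failed by 0.1 ns)
```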

Test A (at 400MHz):

These are the results for 400MHz (max. frequency in MHz):

Desired freq.  Replicated components  Altera ARRIA 10  Xilinx UltraScale
400            4                      430              423
400            5                      433              413
400            7                      417              409
400            8                      395              411
400            9                      433              414
400            10                     403              414
400            11                     419              411
400            12                     383              411
400            13                     401              411
400            14                     389              410
400            15                     420              409
400            16                     409              409
400            17                     402              410
400            18                     370              412
400            19                     316              417
400            20                     383              420
400            25                     362              411
400            30                     364              416
400            35                     315              410
400            37                     315              411
400            40                     315              387
400            45                     330              392

General notes & conclusions for Test A:
a. The same VHDL component was used with exactly the same parameters. The code is the same.
b. Compilation times of Vivado (Xilinx) were 20% faster than Quartus.
c. Frequency values above 400MHz show the maximum frequency achieved, even though it was not required.
d. The UltraScale (Xilinx) curve is much more stable and linear than the ARRIA 10 (Altera) curve, and stays steadily above the 400MHz target frequency until it can no longer hold.

Continuing from note c, I then compared both projects at 500MHz, where, even though neither vendor can reach such a high frequency, both tools will make their best effort to reach the highest frequency they can.
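The stability remark in note d can be checked against the Test A numbers with a few lines of Python. The rows below are copied from the table. The helper names and the two metrics (first replication count that misses the target, and the max-minus-min spread) are choices made for this sketch, not something defined in the original comparison.

```python
# (replicated components, Altera MHz, Xilinx MHz), copied from the Test A table
TEST_A = [
    (4, 430, 423), (5, 433, 413), (7, 417, 409), (8, 395, 411),
    (9, 433, 414), (10, 403, 414), (11, 419, 411), (12, 383, 411),
    (13, 401, 411), (14, 389, 410), (15, 420, 409), (16, 409, 409),
    (17, 402, 410), (18, 370, 412), (19, 316, 417), (20, 383, 420),
    (25, 362, 411), (30, 364, 416), (35, 315, 410), (37, 315, 411),
    (40, 315, 387), (45, 330, 392),
]
ALTERA, XILINX = 1, 2  # column indexes into each row


def first_failure(rows, column, target_mhz=400):
    """Smallest replication count whose achieved frequency is below the target."""
    for row in rows:
        if row[column] < target_mhz:
            return row[0]
    return None


def spread(rows, column):
    """Difference between the best and worst achieved frequency, in MHz."""
    values = [row[column] for row in rows]
    return max(values) - min(values)


print(first_failure(TEST_A, ALTERA), first_failure(TEST_A, XILINX))  # 8 40
print(spread(TEST_A, ALTERA), spread(TEST_A, XILINX))                # 118 36
```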

Test B (at 500MHz):

These are the results for 500MHz.

Columns: X = Xilinx, A = Altera. Freq. = desired frequency [MHz]; Repl. = replicated components; MHz = achieved frequency; % = utilization [%]; LUT / ALM = utilization in LUTs / ALMs; norm. = normalized utilization; Ratio % = Xilinx/Altera usage.

Freq.  Repl.  X MHz  A MHz  X %   A %   X LUT   A ALM   X norm.  A norm.  Ratio %
500    18     471    371    24.6  21    50,056  38,519  87,598   102,075  86
500    19     497    381    26    22.2  52,825  40,712  92,444   107,887  86
500    20     480    316    27.4  23.3  55,586  42,715  97,276   113,195  86
500    21     488    341    28.7  24.4  58,373  44,743  102,153  118,569  86
500    22     450    392    30.1  25.5  61,158  46,858  107,027  124,174  86
500    23     492    341    31.5  26.7  63,951  48,995  111,914  129,837  86
500    24     461    362    32.8  27.8  66,708  51,026  116,739  135,219  86
500    25     413    312    34.2  29    69,506  53,197  121,636  140,972  86
500    26     459    396    35.6  30.3  72,288  55,595  126,504  147,327  86
500    27     450    314    37    31.4  75,087  57,685  131,402  152,865  86
500    28     473    388    38.3  32.6  77,803  59,877  136,155  158,674  86
500    29     469    332    39.7  33.9  80,616  62,173  141,078  164,758  86
500    30     489    334    41.1  35.1  83,418  64,382  145,982  170,612  86
500    31     466    384    42.4  36.2  86,152  66,394  150,766  175,944  86

General notes & conclusions for Test B:
a. Neither vendor could reach 500MHz; nevertheless, UltraScale was well ahead of ARRIA 10 in terms of frequency, space and compilation time.
b. Regarding logic element usage, there is a fixed usage ratio of 86% between Xilinx logic usage and Altera logic usage (Xilinx usage is lower than Altera's). I used Xilinx formulas to compare CLBs (LUTs) to ALMs.
c. The ARRIA 10 (Altera) vs. UltraScale (Xilinx) logic usage ratio stays fixed throughout, showing that neither vendor's replication algorithm changes: the usage of logic elements rises linearly as replications increase, which is a good thing when comparing 'apples to apples'.
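The normalized utilization columns and the 86% ratio can be reproduced with a short calculation. The two scaling factors below (1.75 per LUT and 2.65 per ALM) are not stated in the text. They are inferred here from the table's own numbers (87,598 / 50,056 and 102,075 / 38,519), so treat them as an assumption about the "Xilinx formulas" mentioned in note b.

```python
# Assumed factors, inferred from the table rows rather than quoted from the author:
LUT_FACTOR = 1.75  # normalized units per Xilinx LUT
ALM_FACTOR = 2.65  # normalized units per Altera ALM


def normalized(luts: int, alms: int) -> tuple[float, float]:
    """Scale raw LUT and ALM counts onto a common (normalized) axis."""
    return luts * LUT_FACTOR, alms * ALM_FACTOR


def usage_ratio_percent(luts: int, alms: int) -> float:
    """Xilinx normalized usage as a percentage of Altera normalized usage."""
    x_norm, a_norm = normalized(luts, alms)
    return 100.0 * x_norm / a_norm


# Test B, 18 replications: 50,056 LUTs vs. 38,519 ALMs
x_norm, a_norm = normalized(50_056, 38_519)
print(round(x_norm), round(a_norm))                # 87598 102075
print(round(usage_ratio_percent(50_056, 38_519)))  # 86
```

With these factors the ratio depends only on the LUT/ALM count ratio, which is why it stays fixed as long as both counts grow linearly with the number of replications.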

Test C (at 400MHz):

Columns: X = Xilinx, A = Altera. Freq. = desired frequency [MHz]; Repl. = replicated components; MHz = achieved frequency; compile = compilation time (mm:ss, or hh:mm:ss from one hour up).

Freq.  Repl.  X MHz  A MHz  X compile  A compile
400    8      410    420    08:42      15:27
400    9      411    424    09:48      18:30
400    10     412    419    10:46      20:00
400    11     409    409    11:15      21:37
400    12     410    417    12:58      20:24
400    13     414    406    13:00      25:01
400    14     409    418    13:25      28:00
400    15     410    420    13:32      28:01
400    16     418    401    14:24      31:24
400    17     408    394    14:06      32:09
400    18     419    411    15:47      33:00
400    19     410    423    15:39      36:02
400    20     411    408    16:52      37:00
400    21     420    405    28:00      40:00
400    22     409    416    30:00      38:22
400    23     408    412    32:00      39:30
400    24     418    398    32:20      41:24
400    25     420    371    33:00      43:55
400    26     411    411    36:00      45:48
400    27     409    410    36:00      45:40
400    28     410    409    40:00      50:40
400    29     411    415    41:10      52:21
400    30     409    407    26:00      54:00
400    31     416    406    42:00      56:29
400    32     408    407    42:00      57:44
400    33     414    402    48:14      58:23
400    34     412    404    46:30      58:44
400    35     409    404    50:00      01:01:52
400    36     401    380    47:37      01:05:00
400    37     401    393    52:21      59:39
400    38     408    417    50:00      01:07:02
400    39     407    334    57:30      01:10:00
400    40     409    395    53:03      01:02:00
400    41     409    408    55:00      01:11:00
400    42     404    359    56:55      01:01:05
400    43     402    395    58:52      01:13:00
400    44     390    393    01:03:00   01:12:00
400    45     410    406    01:04:00   01:19:00
400    46     404    394    01:05:01   01:22:00
400    47     378    397    01:09:00   01:23:00
400    48     409    371    01:06:00   01:29:00

Utilization and power dissipation were recorded only for some replication counts; "-" marks a value that was not reported. Though the power dissipation is not 'real', because virtual pins are used, the comparison between the vendors is still 'legal'.

Columns: % = utilization [%]; LUT / ALM = utilization in LUTs / ALMs; norm. = normalized utilization; Ratio % = Xilinx/Altera utilization ratio; power = power dissipation [W].

Repl.  X %  A %  X LUT    A ALM    X norm.  A norm.  Ratio %  X power  A power
21     29   32   -        -        -        -        -        1.66     3.27
22     30   34   -        -        -        -        -        1.7      3.38
23     31   36   -        -        -        -        -        1.78     3.48
24     33   37   -        -        -        -        -        1.83     3.6
25     34   39   -        -        -        -        -        1.89     -
26     36   40   -        -        -        -        -        1.95     3.75
27     37   42   -        -        -        -        -        2        4
28     38   43   -        -        -        -        -        2        4
29     40   45   -        -        -        -        -        -        -
30     41   46   83,448   85,093   146,034  225,496  65       2.17     4.172
31     42   48   -        -        -        -        -        -        -
32     44   49   -        -        -        -        -        -        5.3
33     45   51   91,761   93,598   160,582  248,035  65       2.34     4.46
34     47   53   -        -        -        -        -        -        -
35     48   54   -        -        -        -        -        -        -
39     53   60   108,271  110,627  189,474  293,162  65       2.577    4.9
41     56   63   113,857  116,295  199,250  308,182  65       2.685    -
43     59   66   -        -        -        -        -        -        5.25
44     60   68   122,357  124,801  214,125  330,723  65       2.846    -
45     62   70   -        -        -        -        -        2.9      -
46     63   71   -        -        -        -        -        2.95     5.457
47     64   73   -        -        -        -        -        3.008    5.5
48     66   -    -        -        -        -        -        3.06     -

General notes & conclusions for Test C:
a. In this test, though less realistic in my view, both vendors can hold more replications before they fail the timing requirements. Nevertheless, ARRIA 10 (Altera) keeps failing at much earlier points than UltraScale (Xilinx).
b. Xilinx compilation times are about 20% faster than Altera's.
c. Regarding logic element usage, there is a fixed usage ratio of 65% between Xilinx logic usage and Altera logic usage (Xilinx usage is lower than Altera's). I used Xilinx formulas to compare LUTs to ALMs.
d. In this test I also compared thermal power: UltraScale consumes about 50% less power than ARRIA 10 (meaning less overall heat and less power supply current needed).
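The "about 50% less power" figure in note d can be sanity-checked against the rows of the Test C table that report power for both vendors. The values below are copied from that table. The helper name and the choice to average the per-row savings are illustrative.

```python
# replicated components -> (Xilinx power [W], Altera power [W]),
# only rows of the Test C table where both values are reported
POWER_W = {
    21: (1.66, 3.27), 22: (1.7, 3.38), 23: (1.78, 3.48), 24: (1.83, 3.6),
    26: (1.95, 3.75), 27: (2.0, 4.0), 28: (2.0, 4.0), 30: (2.17, 4.172),
    33: (2.34, 4.46), 39: (2.577, 4.9), 46: (2.95, 5.457), 47: (3.008, 5.5),
}


def saving_percent(xilinx_w: float, altera_w: float) -> float:
    """How much less power Xilinx dissipates, as a percentage of the Altera figure."""
    return 100.0 * (1.0 - xilinx_w / altera_w)


savings = {n: saving_percent(x, a) for n, (x, a) in POWER_W.items()}
mean_saving = sum(savings.values()) / len(savings)

for n, s in savings.items():
    print(f"{n} replications: Xilinx uses {s:.1f}% less power")
print(f"mean saving: {mean_saving:.1f}%")
```

Every row lands between 45% and 50%, which is consistent with the rounded "about 50%" in the note.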

7 TEST RESULTS SUMMARY

So, overall:

A. When comparing Altera ARRIA 10 GX480, F35, to Xilinx UltraScale KU035, A1156:
o Compilation time (Xilinx 20% less).
o Frequency (Xilinx was much more stable and reached higher frequencies).
o Thermal power (Xilinx almost 50% less power).
o Utilization (Xilinx to Altera ratio 86%).

B. Even when I compared Altera's GX320 to Xilinx's KU035 (a smaller Altera component against the 'same' Xilinx component), the KU035 had better results in all these characteristics. For example, when compiling for Altera's GX320, F35 (same package as Altera's GX480), which should be 'equal' to Xilinx's KU035, with 44 replications in Test C:
o Quartus utilization for GX320: logic utilization (in ALMs) 139,107 / 119,900 (116%), and compilation failed: not enough room in the device.
o Xilinx utilization for KU035: 60%.

C. When I compared ARRIA 10 GX270 to Xilinx's KU035, I had similar results in all characteristics (I did not check all replications).

Notes:

Two very important points I discovered after conducting this comparison (both should tip the scale in favor of Intel/Altera, and yet the Xilinx results are much better):
o The Xilinx FPGA chosen was smaller than the Altera one. This means the Xilinx P&R algorithm must work harder to reach the desired frequency (since less space is available). Nevertheless, the Xilinx results are much better.
o The Xilinx FPGA speed grade is the slowest, while the Altera one is the fastest. This means the Altera results should be better. Nevertheless, they are much worse.
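The GX320 fit failure quoted in point B is simple arithmetic, shown here as a small check. The numbers come from the Quartus figure quoted above; the function names are illustrative.

```python
def utilization_percent(used: int, available: int) -> float:
    """Resource utilization as a percentage of what the device offers."""
    return 100.0 * used / available


def fits(used: int, available: int) -> bool:
    """A design can only be placed if it needs no more resources than exist."""
    return used <= available


# GX320, Test C, 44 replications: 139,107 ALMs needed, 119,900 available
print(round(utilization_percent(139_107, 119_900)))  # 116
print(fits(139_107, 119_900))                        # False
```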