Xilinx (UltraScale) vs. Altera (ARRIA 10) Test Bench
By Roy Messinger
www.hwdebugger.com
roy.messinger@hwdebugger.com
1 GENERAL

This document presents a thorough comparison I conducted between two FPGAs from two vendors' families: Altera ARRIA 10 and Xilinx Kintex UltraScale. The comparison puts the emphasis on frequency, utilization, power and compilation time. I carried it out in an attempt to find the 'best' vendor for my needs, and I did not give any 'discounts' to either vendor. All the tests were identical: exactly the same code and the same software preferences. See the important notes at the end of the document for further info.

2 WHAT I'VE CHECKED WAS:

- Frequency.
- Utilization.
- Thermal power.
- Compilation time.

3 FPGA COMPONENTS

I've chosen these FPGAs in order to compare two similar components in terms of RAM, size and various other characteristics.

                    Altera                      Xilinx
Component           GX480 (10AX048K1F35E1HG)    KU035 (XCKU035-1FFVA1156C)
System Logic [k]    629                         444
RAM [Mb]            28                          25
PCI-Gen 3           2*8 lanes                   2*8 lanes
Transcv             36                          16
I/O                 396                         520
4 TEST BENCH METHODOLOGY

How did I carry out the comparison?

For the comparison I used a VHDL component of a state machine (about 20 states). This FSM implements some heavy logic and runs at 400MHz. I designed two small projects containing only this component, one in Altera (Quartus) and one in Xilinx (Vivado). After each successful compilation I checked the timing analysis and replicated the component, to push the FPGA capabilities to the edge (space, frequency). I used virtual pins on all components, so there was no need to connect the component ports to the FPGA pins (no connection to I/O buffers). I did not alter anything in either tool; the default implementation/synthesis settings were left as they were.

[Figure: flow chart. The component (on virtual pins) is compiled in Vivado & Quartus. Passes timing req.? Yes: replicate the component and compile again. No: compare to the second vendor.]
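The replicate-and-recompile loop described above can be sketched roughly as follows. This is only an illustration of the flow, not the scripts used for the tests: `meets_timing` is a hypothetical callback standing in for "compile the project with n replications and check the timing report", and the `start` and `limit` values are assumptions.

```python
from typing import Callable


def max_passing_replications(
    meets_timing: Callable[[int], bool],
    start: int = 1,
    limit: int = 64,
) -> int:
    """Return the last replication count that passed timing before the
    first failure, or 0 if the very first compilation already fails.

    meets_timing(n) is a hypothetical stand-in for compiling the design
    with n copies of the component and reading the timing report.
    """
    last_ok = 0
    for n in range(start, limit + 1):
        if not meets_timing(n):
            break  # timing failed: stop and compare to the second vendor
        last_ok = n  # timing passed: replicate once more
    return last_ok


# Stub for illustration only: pretend timing is met up to 7 replications.
print(max_passing_replications(lambda n: n <= 7))
```

In the tests themselves the compilations were continued past the first failure, because a failing count is sometimes followed by a passing one.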
5 TEST BENCH HARDWARE

- Compilation computers (both with Windows 7 OS):
  o Altera: Quartus version 17.0.0. E5-2643 @3.4GHz (Xeon), 32GB RAM.
  o Xilinx: Vivado version 2016.4. I7-6700 @3.4GHz, 32GB RAM.
- The components chosen were close to the same spec (relative to what I need):
  o Altera: 10AX048K1F35E1HG; GX480, highest speed grade.
  o Xilinx: XCKU035-1FFVA1156C; KU035, slowest speed grade (see the notes at the end of the document).
  o Both components have the same package dimensions (35mm*35mm).
6 TEST RESULTS

I ran 3 sets of tests, defined as Test A, Test B and Test C.

- Test A, 400MHz: each input is connected to all instantiations. Outputs, obviously, are separated. [Figure: Test A connection diagram]
- Test B, 500MHz: each input is connected to all instantiations. Outputs, obviously, are separated. [Figure: Test B connection diagram]
- Test C, 400MHz: each input is connected to each instantiation separately. Outputs, obviously, are separated. [Figure: Test C connection diagram]

Two clocks are created for the design in SDC (Quartus) and XDC (Vivado): 100MHz and 400MHz/500MHz.

This is NOT a real design, but one that can compare the performance of both vendors, as it uses a real component and simulates the HW FPGA development phases. The code is the same.

Test A and Test B are closer to a real-world implementation in my view, as they define relations between different instantiations inside the FPGA. Test B is intended to push the FPGA to the edge in terms of frequency: neither vendor reaches this frequency, but both are supposed to make their best effort.

I also implemented Test C, to ease the vendors' synthesis, optimization and place & route phases and see what happens when there is no relation between different instantiations.

The frequency comparison is between the WNS in Vivado (Worst Negative Slack, the worst of the worst) and the max frequency result in Quartus, which is based on the setup timing at 100°C in the timing report (also the worst of the worst).

Both vendor tools were left with their default preferences (no 'best efforts', etc.).
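Since Vivado reports slack while Quartus reports a maximum frequency, the two have to be brought to the same unit. A minimal sketch of the usual conversion, assuming a single clock and a setup-limited path: the minimum achievable period is the target period minus the WNS. The function name and the slack values in the example are illustrative assumptions, not figures from the timing reports.

```python
def fmax_from_wns_mhz(target_mhz: float, wns_ns: float) -> float:
    """Convert a worst (setup) slack into an achieved frequency in MHz.

    target_mhz: the constrained clock frequency (e.g. 400 or 500).
    wns_ns:     worst slack in ns; positive means timing is met with margin,
                negative means the constraint is violated.
    """
    period_ns = 1000.0 / target_mhz      # 400 MHz -> 2.5 ns
    min_period_ns = period_ns - wns_ns   # shortest period the design can run at
    if min_period_ns <= 0:
        raise ValueError("slack is larger than the clock period")
    return 1000.0 / min_period_ns


# Illustrative values only:
print(round(fmax_from_wns_mhz(400, 0.1), 1))   # positive slack: above 400 MHz
print(round(fmax_from_wns_mhz(400, -0.1), 1))  # negative slack: below 400 MHz
```

This is why the frequency columns in the tables can exceed the requested frequency: positive slack translates into headroom above the target.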
Test A (at 400MHz):
These are the results for 400MHz:

Desired freq.   Replicated    Max. frequency [MHz]
[MHz]           components    Altera ARRIA 10    Xilinx UltraScale
400             4             430                423
400             5             433                413
400             7             417                409
400             8             395                411
400             9             433                414
400             10            403                414
400             11            419                411
400             12            383                411
400             13            401                411
400             14            389                410
400             15            420                409
400             16            409                409
400             17            402                410
400             18            370                412
400             19            316                417
400             20            383                420
400             25            362                411
400             30            364                416
400             35            315                410
400             37            315                411
400             40            315                387
400             45            330                392

General notes & conclusions for Test A:

a. The same VHDL component was used, with exactly the same parameters. The code is the same.
b. Compilation times of Vivado (Xilinx) were 20% faster than Quartus.
c. Frequency values above 400MHz show the maximum frequency achieved, even though it was not required.
d. The UltraScale (Xilinx) slope is much more stable and linear than that of ARRIA 10 (Altera), and stays steadily above the 400MHz target frequency until it can no longer hold.

Following on from note c, I next compared both projects at 500MHz. Even though neither vendor can reach such a high frequency, both will make their best effort to reach the highest frequency they can.
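As a quick cross-check of note d, the Test A table can be scanned for the replication counts at which each vendor falls below the 400MHz target. The data below is copied from the table above; the helper names are only illustrative.

```python
TARGET_MHZ = 400

# (replicated components, Altera max freq [MHz], Xilinx max freq [MHz])
TEST_A = [
    (4, 430, 423), (5, 433, 413), (7, 417, 409), (8, 395, 411),
    (9, 433, 414), (10, 403, 414), (11, 419, 411), (12, 383, 411),
    (13, 401, 411), (14, 389, 410), (15, 420, 409), (16, 409, 409),
    (17, 402, 410), (18, 370, 412), (19, 316, 417), (20, 383, 420),
    (25, 362, 411), (30, 364, 416), (35, 315, 410), (37, 315, 411),
    (40, 315, 387), (45, 330, 392),
]

ALTERA, XILINX = 1, 2  # column indexes into each row


def failing_counts(rows, column, target=TARGET_MHZ):
    """Replication counts whose achieved frequency is below the target."""
    return [row[0] for row in rows if row[column] < target]


altera_fails = failing_counts(TEST_A, ALTERA)
xilinx_fails = failing_counts(TEST_A, XILINX)

print("Altera first misses timing at", altera_fails[0], "replications")
print("Xilinx first misses timing at", xilinx_fails[0], "replications")
print("Failing rows:", len(altera_fails), "vs", len(xilinx_fails))
```

On this data ARRIA 10 first drops below 400MHz at 8 replications and misses the target in 12 of the 22 rows, while UltraScale first drops below it at 40 replications and misses it in 2 rows.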
Test B (at 500MHz):
These are the results for 500MHz (desired frequency 500MHz in all rows):

Replicated   Achieved freq. [MHz]   Utilization [%]    Utilization          Normalized utilization   Xilinx/Altera
components   Xilinx    Altera       Xilinx   Altera    Xilinx    Altera     Xilinx     Altera        usage [%]
                                                       [LUT]     [ALM]
18           471       371          24.6     21        50,056    38,519     87,598     102,075       86
19           497       381          26       22.2      52,825    40,712     92,444     107,887       86
20           480       316          27.4     23.3      55,586    42,715     97,276     113,195       86
21           488       341          28.7     24.4      58,373    44,743     102,153    118,569       86
22           450       392          30.1     25.5      61,158    46,858     107,027    124,174       86
23           492       341          31.5     26.7      63,951    48,995     111,914    129,837       86
24           461       362          32.8     27.8      66,708    51,026     116,739    135,219       86
25           413       312          34.2     29        69,506    53,197     121,636    140,972       86
26           459       396          35.6     30.3      72,288    55,595     126,504    147,327       86
27           450       314          37       31.4      75,087    57,685     131,402    152,865       86
28           473       388          38.3     32.6      77,803    59,877     136,155    158,674       86
29           469       332          39.7     33.9      80,616    62,173     141,078    164,758       86
30           489       334          41.1     35.1      83,418    64,382     145,982    170,612       86
31           466       384          42.4     36.2      86,152    66,394     150,766    175,944       86

General notes & conclusions for Test B:

a. Neither vendor could reach 500MHz. Nevertheless, UltraScale came out well ahead of ARRIA 10 in terms of frequency, space and compilation time.
b. Regarding logic element usage, there is a fixed usage ratio of 86% between Xilinx logic usage and Altera logic usage (Xilinx usage is lower than Altera). I used Xilinx formulas to compare CLBs (LUTs) to ALMs.
c. The ARRIA 10 (Altera) vs. UltraScale (Xilinx) logic usage ratio stays fixed all along. This shows that neither vendor's replication algorithm changes: the usage of logic elements rises linearly as replications increase, which is a good thing when comparing 'apples to apples'.
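The normalized columns and the 86% ratio can be reproduced from the raw LUT and ALM counts. The scale factors in this sketch (1.75 per LUT, 2.65 per ALM) are not stated anywhere in the text; they are inferred from the table rows, which they match after rounding, so treat them as an assumption about the 'Xilinx formulas' mentioned in note b.

```python
# Scale factors inferred from the Test B table (assumption, see lead-in).
LUT_FACTOR = 1.75
ALM_FACTOR = 2.65

# (replications, Xilinx LUTs, Altera ALMs, Xilinx normalized, Altera normalized)
SAMPLE_ROWS = [
    (18, 50_056, 38_519, 87_598, 102_075),
    (31, 86_152, 66_394, 150_766, 175_944),
]


def normalize(luts: int, alms: int) -> tuple[int, int]:
    """Scale raw LUT and ALM counts to comparable logic units."""
    return round(luts * LUT_FACTOR), round(alms * ALM_FACTOR)


def usage_ratio_percent(luts: int, alms: int) -> int:
    """Xilinx/Altera normalized usage, as a rounded percentage."""
    x_norm, a_norm = normalize(luts, alms)
    return round(100 * x_norm / a_norm)


for n, luts, alms, x_expected, a_expected in SAMPLE_ROWS:
    assert normalize(luts, alms) == (x_expected, a_expected)
    print(n, "replications:", usage_ratio_percent(luts, alms), "%")
```

Because both raw counts grow linearly with the number of replications, the ratio stays at 86% across the rows, which is the point made in note c.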
Test C (at 400MHz):
Frequency and compilation time (desired frequency 400MHz in all rows; times are mm:ss or hh:mm:ss):

Replicated   Achieved freq. [MHz]   Compilation time
components   Xilinx    Altera       Xilinx      Altera
8            410       420          08:42       15:27
9            411       424          09:48       18:30
10           412       419          10:46       20:00
11           409       409          11:15       21:37
12           410       417          12:58       20:24
13           414       406          13:00       25:01
14           409       418          13:25       28:00
15           410       420          13:32       28:01
16           418       401          14:24       31:24
17           408       394          14:06       32:09
18           419       411          15:47       33:00
19           410       423          15:39       36:02
20           411       408          16:52       37:00
21           420       405          28:00       40:00
22           409       416          30:00       38:22
23           408       412          32:00       39:30
24           418       398          32:20       41:24
25           420       371          33:00       43:55
26           411       411          36:00       45:48
27           409       410          36:00       45:40
28           410       409          40:00       50:40
29           411       415          41:10       52:21
30           409       407          26:00       54:00
31           416       406          42:00       56:29
32           408       407          42:00       57:44
33           414       402          48:14       58:23
34           412       404          46:30       58:44
35           409       404          50:00       01:01:52
36           401       380          47:37       01:05:00
37           401       393          52:21       59:39
38           408       417          50:00       01:07:02
39           407       334          57:30       01:10:00
40           409       395          53:03       01:02:00
41           409       408          55:00       01:11:00
42           404       359          56:55       01:01:05
43           402       395          58:52       01:13:00
44           390       393          01:03:00    01:12:00
45           410       406          01:04:00    01:19:00
46           404       394          01:05:01    01:22:00
47           378       397          01:09:00    01:23:00
48           409       371          01:06:00    01:29:00

Utilization and power dissipation (recorded from 21 replications onward; "-" means not recorded):

Replicated   Utilization [%]    Utilization          Normalized utilization   Xilinx/Altera   Power dissipation [W]
components   Xilinx   Altera    Xilinx    Altera     Xilinx     Altera        utilization     Xilinx    Altera
                                [LUT]     [ALM]                               ratio [%]
21           29       32        -         -          -          -             -               1.66      3.27
22           30       34        -         -          -          -             -               1.7       3.38
23           31       36        -         -          -          -             -               1.78      3.48
24           33       37        -         -          -          -             -               1.83      3.6
25           34       39        -         -          -          -             -               1.89      -
26           36       40        -         -          -          -             -               1.95      3.75
27           37       42        -         -          -          -             -               2         4
28           38       43        -         -          -          -             -               2         4
29           40       45        -         -          -          -             -               -         -
30           41       46        83,448    85,093     146,034    225,496       65              2.17      4.172
31           42       48        -         -          -          -             -               -         -
32           44       49        -         -          -          -             -               -         5.3
33           45       51        91,761    93,598     160,582    248,035       65              2.34      4.46
34           47       53        -         -          -          -             -               -         -
35           48       54        -         -          -          -             -               -         -
39           53       60        108,271   110,627    189,474    293,162       65              2.577     4.9
41           56       63        113,857   116,295    199,250    308,182       65              2.685     -
43           59       66        -         -          -          -             -               -         5.25
44           60       68        122,357   124,801    214,125    330,723       65              2.846     -
45           62       70        -         -          -          -             -               2.9       -
46           63       71        -         -          -          -             -               2.95      5.457
47           64       73        -         -          -          -             -               3.008     5.5
48           66       -         -         -          -          -             -               3.06      -

Though the power dissipation is not 'real', because virtual pins are used, the comparison between vendors is still 'legal', as the two can be compared with each other.
General notes & conclusions for Test C:

a. In this test, though it is less realistic in my view, both vendors can hold more replications before they fail the timing requirements. Nevertheless, ARRIA 10 (Altera) keeps failing at much earlier points than UltraScale (Xilinx).
b. Xilinx compilation times are about 20% faster than Altera's.
c. Regarding logic element usage, there is a fixed usage ratio of 65% between Xilinx logic usage and Altera logic usage (Xilinx usage is lower than Altera). I used Xilinx formulas to compare LUTs to ALMs.
d. In this test I also compared thermal power: UltraScale consumes about 50% less power than ARRIA 10 (meaning less overall heat and less power supply current needed).
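Notes b and d can be checked against individual rows of the Test C tables. The sketch below parses the compilation times (mm:ss or hh:mm:ss) and computes how much lower the Xilinx figure is than the Altera one, for both time and power. The three rows are copied from the tables; the helper names are illustrative.

```python
def to_seconds(stamp: str) -> int:
    """Parse 'mm:ss' or 'hh:mm:ss' into seconds."""
    seconds = 0
    for part in stamp.split(":"):
        seconds = seconds * 60 + int(part)
    return seconds


def percent_less(xilinx: float, altera: float) -> float:
    """How much lower the Xilinx value is, as a percentage of the Altera value."""
    return 100.0 * (1.0 - xilinx / altera)


# (replications, Xilinx time, Altera time, Xilinx power [W], Altera power [W])
ROWS = [
    (21, "28:00", "40:00", 1.66, 3.27),
    (33, "48:14", "58:23", 2.34, 4.46),
    (47, "01:09:00", "01:23:00", 3.008, 5.5),
]

for n, x_time, a_time, x_pwr, a_pwr in ROWS:
    time_saving = percent_less(to_seconds(x_time), to_seconds(a_time))
    power_saving = percent_less(x_pwr, a_pwr)
    print(f"{n} replications: {time_saving:.0f}% less time, "
          f"{power_saving:.0f}% less power")
```

For these rows the time saving ranges from roughly 17% to 30% and the power saving from roughly 45% to 49%, in line with the rounded figures quoted in the notes.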
7 TEST RESULTS SUMMARY

So, overall:

A. When comparing Altera ARRIA 10 GX480, F35, to Xilinx UltraScale KU035, A1156:
- Compilation time (Xilinx 20% less).
- Frequency (Xilinx was much more stable and reached higher frequencies).
- Thermal power (Xilinx almost 50% less power).
- Utilization (Xilinx to Altera ratio 86%).

B. Even when I compared Altera's GX320 to Xilinx's KU035 (a smaller Altera component against the 'same' Xilinx component), the KU035 had better results in all these characteristics. For example, when compiling for Altera's GX320, F35 (the same package as Altera's GX480), which should be 'equal' to Xilinx's KU035, with 44 replications:
- Quartus utilization for GX320 for 44 replications, Test C: Logic utilization (in ALMs) 139,107 / 119,900 (116%). Compilation failed: not enough room in the device.
- Xilinx utilization for KU035 for 44 replications, Test C: 60%.

C. When I compared ARRIA 10 GX270 to Xilinx's KU035, I had similar results in all characteristics (I did not check all replications).

Notes:

Two very important points I discovered after conducting this comparison (both should tip the scale in favor of Intel/Altera, and nevertheless the Xilinx results are much better):
- The Xilinx FPGA chosen was smaller than the Altera one. This means the Xilinx P&R algorithm must work harder to reach the desired frequency (since less space is available). Nevertheless, the Xilinx results are much better.
- The Xilinx FPGA speed grade is the slowest, while the Altera one is the fastest. This means the Altera results should be better. Nevertheless, they are much worse.