DINI Group
FPGA-based Cluster Computing with Spartan-6
Mike Dini
mdini@dinigroup.com
www.dinigroup.com
Sept 2010

The DINI Group
- We make big FPGA boards
- Xilinx, Altera

The DINI Group
- 15 employees in downtown La Jolla
  - A little north of San Diego, California
- Started as ASIC/FPGA design consultants in 1995
- First product was the DN250k10 (1998)
  - 6 FPGAs
  - Based on 4000-series Xilinx FPGAs
  - 6 XC4085s
- And then:
  - Xilinx Virtex, Virtex-E, V2Pro, V4, V5, V6
  - Altera Stratix, Stratix2, S3/4
- We are FPGA specialists

Overview of Product Line
- Goal: Provide customers a cost-effective vehicle to use the biggest and fastest state-of-the-art FPGAs
- Large, expensive FPGAs (>$5000)
  - Xilinx: Virtex-6
  - Altera: Stratix IV
- Cheap FPGAs (~$100)
  - Xilinx: Spartan-6
  - Altera: Cyclone III/IV

Altera Stratix IV
- 130M ASIC gates
- Uncle of Monster

DN7020k10: 20 Stratix-IV FPGAs
- Largest FPGA board ever shipped
- 13 million LUT/FFs (130 million ASIC gates)
- $xxx with 20 4SE820s

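The headline numbers on this slide can be cross-checked against the 4SE820 row of the device table later in the deck. The following is a minimal sketch, assuming the 130 million figure is the "practical" (60% utilization) gate estimate summed over all 20 FPGAs. The constant and function names are illustrative and do not come from the original slides.

```python
# Values copied from the 4SE820 row of the device comparison table in this deck.
FPGAS_ON_BOARD = 20
FFS_PER_4SE820 = 656_000             # LUT/FF count per device
PRACTICAL_KGATES_PER_4SE820 = 6_508  # practical gate estimate, in thousands


def board_totals(n_fpgas, ffs_per_fpga, kgates_per_fpga):
    """Return (total LUT/FFs, total ASIC gates) for n identical FPGAs."""
    total_ffs = n_fpgas * ffs_per_fpga
    total_gates = n_fpgas * kgates_per_fpga * 1_000
    return total_ffs, total_gates


total_ffs, total_gates = board_totals(
    FPGAS_ON_BOARD, FFS_PER_4SE820, PRACTICAL_KGATES_PER_4SE820
)
print(f"{total_ffs / 1e6:.2f} million LUT/FFs")
print(f"{total_gates / 1e6:.1f} million ASIC gates")
```

Under that assumption the totals land close to the rounded figures quoted on the slide.
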
FPGAs applied to HPC: Algorithms Analyzed
- Bioinformatics/Genomic
- SW/BLAST
- V6 P&R
- Encryption/decryption
- Monte Carlo
- Atomic modeling
- Encryption/SSL
- Graphics (3D)
- Imaging (ultrasound/CAT)
- Oil exploration
- DSP stuff
- ImpulseC
- Celoxica/HandelC
- Matlab to gates
- CDMA decode
- GPS correlation
- Video compression

So, we need FPGAs, big high-speed memories, small high-speed memories, and a way to move large amounts of data, all at the best logic/speed/capacity price point and within a power budget.

FPGA Choices for HPC
- Xilinx and Altera
  - Xilinx: Virtex-6, Spartan-6
  - Altera: Stratix-4, Cyclone III/IV
- Virtex-6 and Stratix-4 are bigger and faster
  - And 5x-10x more expensive, measured in $$$/performance
- So for HPC, Spartan and Cyclone are the only viable choices.

FPGA device comparison
Speed grades are listed slowest to fastest. Max = gate estimate at 100% utilization (1000's). Pract = practical gate estimate at 60% utilization (1000's). I/O = max I/O's. Mult = multipliers (18x18 or 25x18). kbit and kB = total block memory in kbits and kbytes.

Xilinx (Blk = 18 kbit memory blocks)

Family        Grp  Device    Speed grades  LUT         FF's    Max  Pract    I/O   Mult    Blk    kbit     kB
Virtex-6      LX   LX760     -1L,-1,-2     6-input  948,480  9,105  5,509  1,200    864  1,440  25,920  3,240
Virtex-6      LX   LX550(T)  -1L,-1,-2     6-input  687,360  6,599  4,000  1,200    864  1,264  22,752  2,844
Virtex-6      LXT  LX365T    -1L,-1,-2,-3  6-input  455,040  4,368  2,621    600    576    832  14,976  1,872
Virtex-6      LXT  LX240T    -1L,-1,-2,-3  6-input  301,440  2,894  1,736    600    768    832  14,976  1,872
Virtex-6      LXT  LX195T    -1L,-1,-2,-3  6-input  249,600  2,396  1,438    600    640    688  12,384  1,548
Virtex-6      LXT  LX130T    -1L,-1,-2,-3  6-input  160,000  1,536    922    600    480    528   9,504  1,188
Virtex-6      SXT  SX475T    -1L,-1,-2     6-input  595,200  5,714  3,428    600  2,016  2,128  38,304  4,788
Virtex-6      SXT  SX315T    -1L,-1,-2,-3  6-input  394,000  3,782  2,269    600  1,344  1,408  25,344  3,168
Spartan-6     LX   LX150     -1L,-2,-3     6-input  184,464  1,771  1,063    338    182    268   4,824    603
Spartan-6     LX   LX100     -1L,-2,-3     6-input  126,576  1,215    729    326    182    268   4,824    603
Spartan-6     LX   LX75      -1L,-2,-3     6-input   93,000    893    536    270    134    172   3,096    387
Spartan-6     LX   LX45      -1L,-2,-3     6-input   54,576    524    314    316     58    116   2,088    261
Spartan-6     LX   LX25      -1L,-2,-3     6-input   30,064    289    173    266     38     52     936    117
Virtex-5      LX   LX330     -1,-2         6-input  207,360  3,320  1,990  1,200    192    576  10,368  1,296
Virtex-5      LX   LX220     -1,-2         6-input  138,240  2,210  1,330    800    128    384   6,912    864
Virtex-5      LX   LX155     -1,-2,-3      6-input   97,280  1,556    934    800    128    384   6,912    864
Virtex-5      LX   LX110     -1,-2,-3      6-input   69,120  1,110    670    800     64    256   4,608    576
Virtex-5      LXT  LX155T    -1,-2,-3      6-input   97,280  1,556    934    640    128    424   7,632    954
Virtex-5      LXT  LX110T    -1,-2,-3      6-input   69,120  1,110    666    640     64    296   5,328    666
Virtex-5      LXT  LX85T     -1,-2,-3      6-input   51,840    830    498    480     48    216   3,888    486
Virtex-5      LXT  LX50T     -1,-2,-3      6-input   28,800    460    276    480     48    120   2,160    270
Virtex-5      LXT  LX30T     -1,-2,-3      6-input   19,200    307    184    360     32     72   1,296    162
Virtex-5      SXT  SX95T     -1,-2,-3      6-input   58,880    940    564    640    640    488   8,784  1,098
Virtex-5      SXT  SX50T     -1,-2,-3      6-input   32,640    522    313    480    288    264   4,752    594
Virtex-5      SXT  SX35T     -1,-2,-3      6-input   21,760    392    235    360    192    168   3,024    378
Virtex-5      FXT  FX100T    -1,-2,-3      6-input   64,000  1,024    614    640    256    456   8,208  1,026
Virtex-5      FXT  FX70T     -1,-2,-3      6-input   44,800    717    430    640    128    296   5,328    666
Virtex-5      FXT  FX30T     -1,-2,-3      6-input   20,480    328    197    360     64    136   2,448    306
Virtex-4      LX   LX200     -10,-11       4-input  178,176  2,490  1,490    960     96    336   6,048    756
Virtex-4      LX   LX160     -10,-11,-12   4-input  135,168  1,890  1,130    960     96    288   5,184    648
Virtex-4      LX   LX100     -10,-11,-12   4-input   98,304  1,380    830    960     96    240   4,320    540
Virtex-4      FX   FX100     -10,-11,-12   4-input   84,352  1,180    710    768    160    376   6,768    846
Virtex-4      FX   FX60      -10,-11,-12   4-input   50,560    710    430    576    128    232   4,176    522
Virtex-4      LX   LX160     -10,-11,-12   4-input  135,168  1,890  1,130    768     96    288   5,184    648
Virtex-4      LX   LX100     -10,-11,-12   4-input   98,304  1,380    830    768     96    240   4,320    540
Virtex-4      LX   LX80      -10,-11,-12   4-input   71,680  1,000    600    768     80    200   3,600    450
Virtex-4      LX   LX60      -10,-11,-12   4-input   53,248    750    450    640     64    160   2,880    360
Virtex-4      LX   LX40      -10,-11,-12   4-input   36,864    520    310    640     64     96   1,728    216
Virtex-4      SX   SX55      -10,-11,-12   4-input   49,152    690    410    640    512    320   5,760    720
VirtexII Pro       2vp100    -5,-6         4-input   88,192  1,230    740  1,040    444    444   7,992    999
VirtexII Pro       2vp70     -5,-6,-7      4-input   66,176    930    560    996    328    328   5,904    738
VirtexII Pro       2vp50     -5,-6,-7      4-input   47,232    660    400    692    232    232   4,176    522

Altera Stratix IV / Stratix III (memory blocks: MLAB (640), M9K (9 kbit), M144K (144 kbit))

Family        Device   Speed grades  LUT         FF's     Max  Pract    I/O   Mult    MLAB    M9K  M144K    kbit     kB
Stratix IV    4SE820   -4,-3         6-input  656,000  10,496  6,508  1,120    960  16,261  1,610     60  23,130  2,891
Stratix IV    4SE530   -4,-3,-2      6-input  424,960   6,799  4,080    960  1,024  10,624  1,280     64  20,736  2,592
Stratix III   3SL340   -4,-3,-2      6-input  270,000   4,320  2,592  1,120    576   6,750  1,040     48  16,272  2,034

Altera StratixII GX / StratixII (memory blocks: M512 (32x18), M4K (128x36), M-RAM (4kx144))

Family        Device   Speed grades  LUT         FF's     Max  Pract    I/O   Mult    M512    M4K  M-RAM    kbit     kB
StratixII GX  2SGX90E  -5,-4,-3      6-input   72,768   1,020    610    558    192     488    408      4   4,415    552
StratixII     2S180    -5,-4,-3      6-input  143,520   2,010  1,210  1,170    384     930    768      9   9,163  1,145

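The "$$$/performance" argument made before this table can be illustrated with a rough calculation. The sketch below is only an order-of-magnitude estimate: the gate counts are the practical (60% utilization) figures from the table above, while the prices are assumptions taken from the loose ">$5000" and "~$100" figures on the product-line overview slide. The dictionary layout and function name are hypothetical.

```python
# Practical gate estimates (in thousands) come from the comparison table.
# Prices are ASSUMED round numbers based on the ">$5000" and "~$100" figures
# quoted earlier in the deck; real pricing varies by speed grade and volume.
devices = {
    "Virtex-6 LX760": {"practical_kgates": 5_509, "price_usd": 5_000},
    "Spartan-6 LX150": {"practical_kgates": 1_063, "price_usd": 100},
}


def kgates_per_dollar(device):
    """Thousands of practical ASIC gates bought per dollar."""
    return device["practical_kgates"] / device["price_usd"]


for name, device in devices.items():
    print(f"{name}: {kgates_per_dollar(device):.2f} kgates per dollar")

ratio = (kgates_per_dollar(devices["Spartan-6 LX150"])
         / kgates_per_dollar(devices["Virtex-6 LX760"]))
print(f"Spartan-6 advantage: about {ratio:.1f}x")
```

With these assumed prices the Spartan-6 part buys close to ten times more logic per dollar, which is in line with the 5x-10x range claimed on the slide. Because the Virtex-6 price is quoted as a lower bound, the real gap may be larger.
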
We use the Spartan-6 LX150 and LX150T
- Largest planned/announced device in family
- FGG484 package, RoHS
- LX150: Field FPGAs, 12 total
  - Can have identical or different bit files
- LX150T: Dataflow Manager
  - FGG676
- Three speed grades
  - LX150: -1L, -2, -3
  - LX150T: -2, -3, -4
  - Relevance?

FPGA Status (Xilinx Spartan-6)
- Shipping now with ES parts
- Supply is very, very tight
  - Story about how Xilinx botched this is entertaining
  - And a little depressing
- Quantity shipments (production parts) in ~Sept 10
- Serious questions about routing
  - Useful maximum utilization percentages questionable
- SSO issues in LX150T et al.

Power/Cooling
- Number 1 constraint for FPGA-based acceleration is power/cooling
  - We solve this issue.
- We ignore the 25W/slot maximum from the PCIe specification
  - Board power supplied from topside connector
  - Passive heatsinks assume LOTS of airflow
- Goal/Spec is to allow 50W per board

Memory
- Spartan-6 has integrated external memory controllers with LOTS of functionality
- We use a single DDR3 memory per field FPGA
  - 2 DDR3 chips for Dataflow controller
  - Presently stuffing 2Gb device (128M x 16)
- Goal is to get to 400 MHz (800 Mb/s per pin)
  - Frequency depends on the speed grade stuffed
  - Some specmanship and characterization will probably reduce this number a bit.
- 100% of the Memory Controller Block is dedicated to the user application
- Much reference material provided

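The memory figures on this slide imply a peak bandwidth and capacity per field FPGA. A minimal sketch of that arithmetic follows, assuming the 400 MHz goal is reached and ignoring refresh, bank conflicts, and read/write turnaround, so real sustained throughput will be lower. The 12 field FPGAs come from the earlier LX150 slide; all names in the code are illustrative.

```python
FIELD_FPGAS = 12               # field FPGAs per board, from the LX150 slide
DDR3_CLOCK_MHZ = 400           # stated goal, not a guaranteed figure
DDR3_WIDTH_BITS = 16           # x16 device
DDR3_DENSITY_MBIT = 128 * 16   # 128M x 16 = 2048 Mbit (2 Gb)


def ddr_peak_mbytes_per_s(clock_mhz, width_bits):
    """Peak DDR transfer rate: two transfers per clock on every data pin."""
    mbits_per_pin = clock_mhz * 2
    return mbits_per_pin * width_bits / 8


per_fpga = ddr_peak_mbytes_per_s(DDR3_CLOCK_MHZ, DDR3_WIDTH_BITS)
aggregate = per_fpga * FIELD_FPGAS
capacity_mbytes = DDR3_DENSITY_MBIT // 8

print(f"Peak per field FPGA: {per_fpga:.0f} MB/s")
print(f"Peak across {FIELD_FPGAS} field FPGAs: {aggregate:.0f} MB/s")
print(f"Capacity per field FPGA: {capacity_mbytes} MB")
```
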
Hosting via PCIe
- Standard, homegrown 4-lane PCIe core
  - Virtex-6 LX130T
  - GEN1/GEN2
  - Master moding engines
- PCIe core is fixed and NOT modifiable by user
  - Don't want user **anywhere** near this function.
  - Timing, Xilinx bugs, et al.
- PCIe bridge is field upgradeable

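For a sense of scale, the peak host bandwidth of a 4-lane link can be estimated from the standard per-lane PCI Express signalling rates. This sketch uses the Gen1 and Gen2 rates and 8b/10b line coding from the PCI Express specification. It ignores packet headers and flow-control overhead, so achievable throughput through the bridge will be lower. The function and constant names are illustrative.

```python
# Per-lane signalling rates (mega-transfers per second) for PCIe Gen1 and Gen2.
PCIE_MTRANSFERS_PER_S = {"GEN1": 2_500, "GEN2": 5_000}


def pcie_peak_mbytes_per_s(gen, lanes):
    """Peak payload bandwidth per direction, before packet overhead."""
    raw_mbits = PCIE_MTRANSFERS_PER_S[gen] * lanes
    payload_mbits = raw_mbits * 8 / 10  # 8b/10b coding: 8 data bits per 10 line bits
    return payload_mbits / 8


for gen in ("GEN1", "GEN2"):
    rate = pcie_peak_mbytes_per_s(gen, lanes=4)
    print(f"{gen} x4: {rate:.0f} MB/s per direction")
```
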
Interconnect
- FPGA-to-FPGA I/O performance
  - All single-ended
  - Nearest-neighbor connections
  - 77 horizontal, 64 vertical
- Recommend using I/O FF
- Goal is to get to 150 MHz
  - Source synchronous
  - With DDR, this is 300 Mbits/sec per pin
  - I/O FF and DDR functionality built into I/O block

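The per-pin rate and the pin counts on this slide give an upper bound on nearest-neighbor bandwidth. The sketch below assumes the 150 MHz goal is met and that every pin in a bundle carries data in one direction. In a source-synchronous design some pins would carry forwarded clocks, so usable bandwidth will be somewhat lower. Names are illustrative.

```python
IO_CLOCK_MHZ = 150  # stated goal for FPGA-to-FPGA I/O
HORIZONTAL_PINS = 77
VERTICAL_PINS = 64


def link_peak_mbytes_per_s(clock_mhz, pins, ddr=True):
    """Peak bandwidth of a parallel link if every pin carries data."""
    mbits_per_pin = clock_mhz * (2 if ddr else 1)
    return mbits_per_pin * pins / 8


horizontal = link_peak_mbytes_per_s(IO_CLOCK_MHZ, HORIZONTAL_PINS)
vertical = link_peak_mbytes_per_s(IO_CLOCK_MHZ, VERTICAL_PINS)

print(f"Horizontal neighbor link: up to {horizontal:.1f} MB/s")
print(f"Vertical neighbor link:   up to {vertical:.1f} MB/s")
```
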
Inter-Chassis Communication and Expansion
- Dataflow Manager FPGA (LX150T) has 8 high-speed GTP serial transceivers
  - 3.125 Gb/s per lane
  - Transmit and receive are independent
  - 4 lanes each on two topside connectors
  - Aurora protocol
- 4 lanes bonded should get to ~1 Gbyte/sec
  - So there is ~4 Gbyte/s throughput capability on the 2 connectors

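The throughput estimates on this slide can be reproduced from the lane rate. This is a sketch that assumes 8b/10b line coding on the serial links (an assumption, since the slide only says "Aurora protocol") and ignores framing and clock-compensation overhead, which is why it comes out somewhat above the ~1 Gbyte/sec and ~4 Gbyte/s figures quoted. Names are illustrative.

```python
LANE_RATE_MBITS = 3_125   # 3.125 Gb/s line rate per GTP lane
LANES_PER_CONNECTOR = 4
CONNECTORS = 2
DIRECTIONS = 2            # transmit and receive are independent


def bonded_peak_mbytes_per_s(lane_rate_mbits, lanes):
    """Peak payload rate of bonded lanes, ASSUMING 8b/10b line coding."""
    payload_mbits_per_lane = lane_rate_mbits * 8 / 10
    return payload_mbits_per_lane * lanes / 8


per_connector = bonded_peak_mbytes_per_s(LANE_RATE_MBITS, LANES_PER_CONNECTOR)
total = per_connector * CONNECTORS * DIRECTIONS

print(f"4 bonded lanes, one direction: {per_connector:.0f} MB/s upper bound")
print(f"Both connectors, both directions: {total:.0f} MB/s upper bound")
```
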
Inter-Chassis Communication and Expansion
- Board-to-board dataflow completely independent from the host processor
  - Inter-chassis
  - External peripheral expansion
- Requires user intervention
  - DINI to provide libraries and reference designs

Clocking and Debug
- Configurable global clock
  - 31.25 MHz to 350 MHz in 1 MHz increments
- 100 MHz clock
- MB (main bus) clock
- JTAG is connected for use with ChipScope
  - Or other third-party debug solutions

What We Provide vs. What You Need
- Customer tool flow:
  - Simulation (Verilog/VHDL)
    - Most often: ModelSim
  - Synthesis
    - Xilinx/Altera tools work fine
    - Expensive third-party synthesis tools are no longer necessary
  - Place/Route
    - Comes from FPGA vendor: Xilinx/Altera
  - Debug
    - ChipScope, SignalTap, and other third-party solutions