DINI Group. FPGA-based Cluster computing with Spartan-6. Mike Dini Sept 2010


DINI Group
FPGA-based Cluster computing with Spartan-6
Mike Dini
mdini@dinigroup.com
www.dinigroup.com
Sept 2010

The DINI Group
We make big FPGA boards
Xilinx, Altera

The DINI Group
15 employees in downtown La Jolla
  A little north of San Diego, California
Started as ASIC/FPGA design consultants in 1995
First product was the DN250k10 (1998)
  6 FPGAs
  Based on 4000-series Xilinx FPGAs
  6 XC4085s
And then
  Xilinx: Virtex, Virtex-E, V2Pro, V4, V5, V6
  Altera: Stratix, Stratix2, S3/4
We are FPGA specialists

Overview of Product Line
Goal: Provide customers a cost-effective vehicle to use the biggest and fastest state-of-the-art FPGAs
Large expensive FPGAs (>$5000)
  Xilinx: Virtex-6
  Altera: Stratix IV
Cheap FPGAs (~$100)
  Xilinx: Spartan-6
  Altera: Cyclone III/IV

Altera Stratix IV
130M ASIC gates
Uncle of Monster

DN7020k10: 20 Stratix-IV FPGAs
  Largest FPGA board ever shipped
  13 million LUT/FFs (130 million ASIC gates)
  $xxx with 20 4SE820s


FPGAs applied to HPC: Algorithms Analyzed
  Bioinformatics/Genomic
  SW/BLAST
  V6 P&R
  encryption/decryption
  Monte Carlo
  atomic modeling
  encryption/SSL
  graphics (3D)
  imaging (ultrasound/CAT)
  oil exploration
  DSP stuff
  ImpulseC
  Celoxica/HandelC
  Matlab to gates
  CDMA decode
  GPS correlation
  video compression
So, we need FPGAs, high-speed memories (big), high-speed memories (small), and a way to move large amounts of data. All at the best logic/speed/capacity price point and within a power budget.

FPGA Choices for HPC
Xilinx and Altera
  Xilinx: Virtex-6, Spartan-6
  Altera: Stratix-4, Cyclone III/IV
Virtex-6, Stratix-4 bigger and faster
  And 5x-10x more expensive, measured in $$$/performance
So for HPC, Spartan and Cyclone are the only viable choices.
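The "5x-10x more expensive" claim can be sanity-checked with a rough cost-per-flip-flop calculation. The sketch below uses the deck's ballpark prices (>$5000 and ~$100) and the flip-flop counts from its device table. Pairing those prices with the LX760 and LX150 specifically is an assumption made for illustration, and the helper name `dollars_per_kff` is hypothetical.

```python
# Rough cost-per-flip-flop comparison using the deck's own ballpark figures.
# ASSUMPTION: the ">$5000" price applies to a Virtex-6 LX760 and the "~$100"
# price to a Spartan-6 LX150; real prices vary by grade, package and volume.

def dollars_per_kff(price_usd: float, flip_flops: int) -> float:
    """Return device cost in dollars per 1000 flip-flops."""
    return price_usd / (flip_flops / 1000.0)

virtex6_lx760 = dollars_per_kff(5000.0, 948_480)   # FF count from the device table
spartan6_lx150 = dollars_per_kff(100.0, 184_464)   # FF count from the device table
ratio = virtex6_lx760 / spartan6_lx150

print(f"Virtex-6 LX760 : ${virtex6_lx760:.2f} per 1000 FFs")
print(f"Spartan-6 LX150: ${spartan6_lx150:.2f} per 1000 FFs")
print(f"Cost ratio     : {ratio:.1f}x")
```

Under these assumptions the ratio lands near the top of the 5x-10x range quoted on the slide. It ignores clock speed, so it is a capacity comparison only.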

Xilinx

Device | Speed grades (slowest to fastest) | LUT size | FFs | Gate estimate, max at 100% util (1000's) | Gate estimate, practical at 60% util (1000's) | Max I/Os | Multipliers (18x18 or 25x18) | Memory blocks (18 kbit) | Memory total (kbits) | Memory total (kbytes)

Virtex-6
LX760 | -1L,-1,-2 | 6-input | 948,480 | 9,105 | 5,509 | 1,200 | 864 | 1,440 | 25,920 | 3,240
LX550(T) | -1L,-1,-2 | 6-input | 687,360 | 6,599 | 4,000 | 1,200 | 864 | 1,264 | 22,752 | 2,844
LX365T | -1L,-1,-2,-3 | 6-input | 455,040 | 4,368 | 2,621 | 600 | 576 | 832 | 14,976 | 1,872
LX240T | -1L,-1,-2,-3 | 6-input | 301,440 | 2,894 | 1,736 | 600 | 768 | 832 | 14,976 | 1,872
LX195T | -1L,-1,-2,-3 | 6-input | 249,600 | 2,396 | 1,438 | 600 | 640 | 688 | 12,384 | 1,548
LX130T | -1L,-1,-2,-3 | 6-input | 160,000 | 1,536 | 922 | 600 | 480 | 528 | 9,504 | 1,188
SX475T | -1L,-1,-2 | 6-input | 595,200 | 5,714 | 3,428 | 600 | 2,016 | 2,128 | 38,304 | 4,788
SX315T | -1L,-1,-2,-3 | 6-input | 394,000 | 3,782 | 2,269 | 600 | 1,344 | 1,408 | 25,344 | 3,168

Spartan-6
LX150 | -1L,-2,-3 | 6-input | 184,464 | 1,771 | 1,063 | 338 | 182 | 268 | 4,824 | 603
LX100 | -1L,-2,-3 | 6-input | 126,576 | 1,215 | 729 | 326 | 182 | 268 | 4,824 | 603
LX75 | -1L,-2,-3 | 6-input | 93,000 | 893 | 536 | 270 | 134 | 172 | 3,096 | 387
LX45 | -1L,-2,-3 | 6-input | 54,576 | 524 | 314 | 316 | 58 | 116 | 2,088 | 261
LX25 | -1L,-2,-3 | 6-input | 30,064 | 289 | 173 | 266 | 38 | 52 | 936 | 117

Virtex-5
LX330 | -1,-2 | 6-input | 207,360 | 3,320 | 1,990 | 1,200 | 192 | 576 | 10,368 | 1,296
LX220 | -1,-2 | 6-input | 138,240 | 2,210 | 1,330 | 800 | 128 | 384 | 6,912 | 864
LX155 | -1,-2,-3 | 6-input | 97,280 | 1,556 | 934 | 800 | 128 | 384 | 6,912 | 864
LX110 | -1,-2,-3 | 6-input | 69,120 | 1,110 | 670 | 800 | 64 | 256 | 4,608 | 576
LX155T | -1,-2,-3 | 6-input | 97,280 | 1,556 | 934 | 640 | 128 | 424 | 7,632 | 954
LX110T | -1,-2,-3 | 6-input | 69,120 | 1,110 | 666 | 640 | 64 | 296 | 5,328 | 666
LX85T | -1,-2,-3 | 6-input | 51,840 | 830 | 498 | 480 | 48 | 216 | 3,888 | 486
LX50T | -1,-2,-3 | 6-input | 28,800 | 460 | 276 | 480 | 48 | 120 | 2,160 | 270
LX30T | -1,-2,-3 | 6-input | 19,200 | 307 | 184 | 360 | 32 | 72 | 1,296 | 162
SX95T | -1,-2,-3 | 6-input | 58,880 | 940 | 564 | 640 | 640 | 488 | 8,784 | 1,098
SX50T | -1,-2,-3 | 6-input | 32,640 | 522 | 313 | 480 | 288 | 264 | 4,752 | 594
SX35T | -1,-2,-3 | 6-input | 21,760 | 392 | 235 | 360 | 192 | 168 | 3,024 | 378
FX100T | -1,-2,-3 | 6-input | 64,000 | 1,024 | 614 | 640 | 256 | 456 | 8,208 | 1,026
FX70T | -1,-2,-3 | 6-input | 44,800 | 717 | 430 | 640 | 128 | 296 | 5,328 | 666
FX30T | -1,-2,-3 | 6-input | 20,480 | 328 | 197 | 360 | 64 | 136 | 2,448 | 306

Virtex-4
LX200 | -10,-11 | 4-input | 178,176 | 2,490 | 1,490 | 960 | 96 | 336 | 6,048 | 756
LX160 | -10,-11,-12 | 4-input | 135,168 | 1,890 | 1,130 | 960 | 96 | 288 | 5,184 | 648
LX100 | -10,-11,-12 | 4-input | 98,304 | 1,380 | 830 | 960 | 96 | 240 | 4,320 | 540
FX100 | -10,-11,-12 | 4-input | 84,352 | 1,180 | 710 | 768 | 160 | 376 | 6,768 | 846
FX60 | -10,-11,-12 | 4-input | 50,560 | 710 | 430 | 576 | 128 | 232 | 4,176 | 522
LX160 | -10,-11,-12 | 4-input | 135,168 | 1,890 | 1,130 | 768 | 96 | 288 | 5,184 | 648
LX100 | -10,-11,-12 | 4-input | 98,304 | 1,380 | 830 | 768 | 96 | 240 | 4,320 | 540
LX80 | -10,-11,-12 | 4-input | 71,680 | 1,000 | 600 | 768 | 80 | 200 | 3,600 | 450
LX60 | -10,-11,-12 | 4-input | 53,248 | 750 | 450 | 640 | 64 | 160 | 2,880 | 360
LX40 | -10,-11,-12 | 4-input | 36,864 | 520 | 310 | 640 | 64 | 96 | 1,728 | 216
SX55 | -10,-11,-12 | 4-input | 49,152 | 690 | 410 | 640 | 512 | 320 | 5,760 | 720

Virtex-II Pro
2vp100 | -5,-6 | 4-input | 88,192 | 1,230 | 740 | 1040 | 444 | 444 | 7,992 | 999
2vp70 | -5,-6,-7 | 4-input | 66,176 | 930 | 560 | 996 | 328 | 328 | 5,904 | 738
2vp50 | -5,-6,-7 | 4-input | 47,232 | 660 | 400 | 692 | 232 | 232 | 4,176 | 522

Altera

Device | Speed grades (slowest to fastest) | LUT size | FFs | Gate estimate, max at 100% util (1000's) | Gate estimate, practical at 60% util (1000's) | Max I/Os | Multipliers (18x18) | MLAB (640) | M9K (9 kbit) | M144K (144 kbit) | Memory total (kbits) | Memory total (kbytes)

Stratix IV
4SE820 | -4,-3 | 6-input | 656,000 | 10,496 | 6,508 | 1120 | 960 | 16261 | 1610 | 60 | 23,130 | 2,891
4SE530 | -4,-3,-2 | 6-input | 424,960 | 6,799 | 4,080 | 960 | 1024 | 10624 | 1280 | 64 | 20,736 | 2,592

Stratix III
3SL340 | -4,-3,-2 | 6-input | 270,000 | 4,320 | 2,592 | 1120 | 576 | 6750 | 1040 | 48 | 16,272 | 2,034

Device | Speed grades (slowest to fastest) | LUT size | FFs | Gate estimate, max at 100% util (1000's) | Gate estimate, practical at 60% util (1000's) | Max I/Os | Multipliers (18x18) | M512 (32x18) | M4K (128x36) | M-RAM (4kx144) | Memory total (kbits) | Memory total (kbytes)

Stratix II GX
2SGX90E | -5,-4,-3 | 6-input | 72,768 | 1,020 | 610 | 558 | 192 | 488 | 408 | 4 | 4,415 | 552

Stratix II
2S180 | -5,-4,-3 | 6-input | 143,520 | 2,010 | 1,210 | 1,170 | 384 | 930 | 768 | 9 | 9,163 | 1,145
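Several columns in the table above are derived from others. The sketch below shows the arithmetic that appears to be behind them: the practical gate estimate as 60% of the maximum, and the memory totals from the 18 kbit block count. The function names are hypothetical. A few rows in the source (the LX760, for example) deviate slightly from an exact 60%, so treat this as an approximation of how the numbers were produced rather than the author's actual method.

```python
# Derived columns of the device table, reconstructed as simple arithmetic.
# ASSUMPTION: "practical" gates are 60% of the maximum estimate, as the column
# header states; some source rows differ by a small amount.

def practical_kgates(max_kgates: int, utilization: float = 0.60) -> int:
    """Practical gate estimate (in 1000's) at the given utilization."""
    return round(max_kgates * utilization)

def block_ram_totals(blocks: int, kbits_per_block: int = 18) -> tuple:
    """Return (total kbits, total kbytes) for a count of block RAMs."""
    kbits = blocks * kbits_per_block
    return kbits, kbits / 8

# Spartan-6 LX150 row: 1,771k max gates and 268 blocks of 18 kbit.
lx150_practical = practical_kgates(1771)
lx150_kbits, lx150_kbytes = block_ram_totals(268)

print(f"LX150 practical gates: {lx150_practical}k")
print(f"LX150 block RAM: {lx150_kbits} kbits = {lx150_kbytes:.0f} kbytes")
```

For the LX150 these reproduce the table's 1,063k practical gates, 4,824 kbits and 603 kbytes.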


FPGA
We use the Spartan-6 LX150 and LX150T
  Largest planned/announced device in family
  FGG484 package, RoHS
LX150: Field FPGAs, 12 total
  Can have identical or different bit files
LX150T: Dataflow Manager
  FGG676
Three speed grades
  LX150: -1L, -2, -3
  LX150T: -2, -3, -4
  Relevance?

FPGA Status (Xilinx Spartan-6)
Shipping now with ES parts
  Supply is very, very tight
  Story about how Xilinx botched this is entertaining, and a little depressing
Quantity shipments (production parts) in ~Sept 2010
Serious questions about routing
  Useful maximum utilization percentages questionable
SSO issues in LX150T et al.

Power/Cooling
Number 1 constraint for FPGA-based acceleration is power/cooling
  We solve this issue.
We ignore the 25W/slot maximum from the PCIe specification
  Board power supplied from topside connector
  Passive heatsinks assume LOTS of airflow
Goal/Spec is to allow 50W per board

Memory
Spartan-6 has integrated external memory controllers with LOTS of functionality
We use a single DDR3 memory per field FPGA
  2 DDR3 chips for Dataflow controller
  Presently stuffing 2Gb device (128M x 16)
Goal is to get to 400 MHz (800 Mb/s per pin)
  Freq is dependent on speed grade stuffed
  Some specmanship and characterization will probably reduce this number a bit.
100% of the Memory Controller Block (MCB) is dedicated to the user application
Much reference material provided
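The memory figures above imply a peak bandwidth and capacity per field FPGA. A minimal sketch of that arithmetic follows, assuming the 400 MHz goal is actually reached and ignoring refresh, bank conflicts and controller overhead. The variable names are illustrative only.

```python
# Peak DDR3 bandwidth and capacity per field FPGA, from the slide's figures.
# ASSUMPTION: the 400 MHz goal is met; sustained throughput will be lower.

CLOCK_MHZ = 400        # target memory clock
BUS_WIDTH_BITS = 16    # x16 device (128M x 16)
DEPTH_MWORDS = 128     # 128M words
FIELD_FPGAS = 12       # LX150 field FPGAs per board

per_pin_mbps = CLOCK_MHZ * 2                      # DDR: two transfers per clock
peak_gbps = per_pin_mbps * BUS_WIDTH_BITS / 1000  # whole bus, Gb/s
peak_gbytes_per_s = peak_gbps / 8                 # GB/s per field FPGA
capacity_mbytes = DEPTH_MWORDS * BUS_WIDTH_BITS // 8
board_peak_gbytes_per_s = peak_gbytes_per_s * FIELD_FPGAS

print(f"Per pin        : {per_pin_mbps} Mb/s")
print(f"Per field FPGA : {peak_gbytes_per_s:.1f} GB/s peak, {capacity_mbytes} MB")
print(f"12 field FPGAs : {board_peak_gbytes_per_s:.1f} GB/s aggregate peak")
```

This gives 1.6 GB/s and 256 MB per field FPGA. The aggregate figure only matters for workloads where every field FPGA streams from its own memory at once.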

Hosting via PCIe
Standard, homegrown 4-lane PCIe core
  Virtex-6 LX130T
  GEN1/GEN2
  Master-mode engines
PCIe core is fixed and NOT modifiable by user
  Don't want user **anywhere** near this function.
  Timing, Xilinx bugs, et al.
PCIe bridge is field upgradeable

Interconnect
FPGA-to-FPGA I/O performance
All single-ended
Nearest neighbor connections
  77 horizontal, 64 vertical
Recommend using I/O FF
Goal is to get to 150 MHz
  Source synchronous
  With DDR, this is 300 Mbits/sec per pin
I/O FF and DDR functionality built in to I/O block
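The per-pin rate above scales to an aggregate link rate between neighboring FPGAs. The sketch below works that out under two assumptions not stated on the slide: every signal in a link is used for data, and all of them run in one direction at once. A real design would reserve some pins for source-synchronous clocks and control.

```python
# Aggregate nearest-neighbor bandwidth from the slide's interconnect figures.
# ASSUMPTION: all pins of a link carry data in the same direction; pins used
# for forwarded clocks or handshaking would reduce these totals.

IO_CLOCK_MHZ = 150     # target source-synchronous clock
HORIZONTAL_PINS = 77   # signals to the horizontal neighbor
VERTICAL_PINS = 64     # signals to the vertical neighbor

per_pin_mbps = IO_CLOCK_MHZ * 2                       # DDR: both clock edges
horizontal_gbps = HORIZONTAL_PINS * per_pin_mbps / 1000
vertical_gbps = VERTICAL_PINS * per_pin_mbps / 1000

print(f"Per pin        : {per_pin_mbps} Mb/s")
print(f"Horizontal link: {horizontal_gbps:.1f} Gb/s ({horizontal_gbps / 8:.2f} GB/s)")
print(f"Vertical link  : {vertical_gbps:.1f} Gb/s ({vertical_gbps / 8:.2f} GB/s)")
```

Under these assumptions a horizontal link peaks at about 23.1 Gb/s and a vertical link at about 19.2 Gb/s.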

Inter Chassis communication and Expansion
Dataflow Manager FPGA (LX150T) has 8 high-speed GTP serial transceivers
  3.125 Gb/s per lane
  Transmit and receive are independent
4 lanes each on two topside connectors
  Aurora protocol
4 lanes bonded should get to ~1 Gbyte/sec
  So there is ~4 Gbyte/s throughput capability on the 2 connectors
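The ~1 Gbyte/s and ~4 Gbyte/s estimates can be reproduced from the lane rate. The sketch below assumes 8b/10b line coding (10 line bits per payload byte), which is an assumption about the Aurora variant in use and is not stated on the slide. It computes an upper bound before protocol overhead, which is why it comes out slightly above the slide's round numbers.

```python
# Upper-bound payload throughput of the bonded GTP links.
# ASSUMPTION: 8b/10b line coding; Aurora framing and clock compensation
# overhead are ignored, so real throughput is somewhat lower.

LINE_RATE_GBPS = 3.125   # per lane, from the slide
LANES_PER_CONNECTOR = 4
CONNECTORS = 2
DIRECTIONS = 2           # transmit and receive are independent

lane_payload_gbps = LINE_RATE_GBPS * 8 / 10           # strip 8b/10b overhead
lane_gbytes_per_s = lane_payload_gbps / 8
bonded_gbytes_per_s = lane_gbytes_per_s * LANES_PER_CONNECTOR
total_gbytes_per_s = bonded_gbytes_per_s * CONNECTORS * DIRECTIONS

print(f"Per lane payload  : {lane_payload_gbps} Gb/s")
print(f"4 lanes bonded    : {bonded_gbytes_per_s} GB/s per direction (slide: ~1)")
print(f"Both connectors   : {total_gbytes_per_s} GB/s, both directions (slide: ~4)")
```

The slide's ~4 Gbyte/s total therefore counts transmit and receive on both connectors together, roughly 1 Gbyte/s for each of the four unidirectional 4-lane bundles.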

Inter Chassis communication and Expansion
Board-to-board dataflow completely independent from the host processor
  Inter chassis
  External peripheral expansion
Requires user intervention
  DINI to provide libraries and reference designs

Clocking and Debug
Configurable global clock
  31.25 MHz to 350 MHz in 1 MHz increments
100 MHz clock
  MB (main bus) clock
JTAG is connected for use with ChipScope
  Or other third-party debug solutions

What We Provide vs. What You Need
Customer tool flow:
Simulation (Verilog/VHDL)
  Most often: ModelSim
Synthesis
  Xilinx/Altera tools work fine
  Expensive third-party synthesis tools are no longer necessary
Place/Route
  Comes from FPGA vendor: Xilinx/Altera
Debug
  ChipScope, SignalTap, and other third-party solutions