A HT3 Platform for Rapid Prototyping and High Performance Reconfigurable Computing
Second International Workshop on HyperTransport Research and Application (WHTRA 2011)
University of Heidelberg, Computer Architecture Group
Sven Kapferer, Alexander Giese, Holger Fröning, Ulrich Brüning
09.02.2011
Outline
- Motivation
- HyperTransport
- Board Architecture
- HT3 Implementation
- Measurements
- Conclusion & Outlook
Accelerated Computing
- Most accelerated computing implementations focus on GPU usage
- GPUs are a mass-market product providing a huge number of parallel processing units
- GPUs offer optimized and easy programming support
- FPGAs lead to higher costs
- FPGA usage is more difficult
FPGA Advantages
- FPGAs are evolving remarkably and provide several advantages
  - Flexibility and complete reconfigurability
  - Support for large amounts of memory
  - Fine-grained access to and from a host system
  - Higher efficiency measured in GFLOPS/Watt
HyperTransport
- HyperTransport is the easiest and best way to connect a device to a processor
  - The only public specification
  - Free for academic use
  - Low-latency communication without bridges or protocol conversion
- HT link frequencies
  - HT200 (2 bit / 200 MHz) up to HT3200 (32 bit / 3.2 GHz)
  - Theoretical maximum unidirectional bandwidth of 12.8 GB/s
  - HT3 begins at HT1200
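The link-rate arithmetic behind these figures can be sketched as follows. HyperTransport transfers data on both clock edges, so the transfer rate is twice the link clock. The function name and its parameters are illustrative, not part of any HT tool. Under this model, 12.8 GB/s corresponds to a 16-bit link at 3.2 GHz, which is also the width of the HTX interface used on the board.

```python
def ht_unidirectional_bandwidth_gbytes(link_clock_mhz: float, width_bits: int) -> float:
    """Peak one-way bandwidth in GB/s for a double-data-rate link (illustrative helper)."""
    transfers_per_second = 2 * link_clock_mhz * 1e6  # data moves on both clock edges
    bits_per_second = transfers_per_second * width_bits
    return bits_per_second / 8 / 1e9


# Link configurations chosen for illustration
for name, clock_mhz, width in [("HT200", 200, 2), ("HT1200", 1200, 16), ("HT3200", 3200, 16)]:
    bandwidth = ht_unidirectional_bandwidth_gbytes(clock_mhz, width)
    print(f"{name}, {width} bit: {bandwidth:.1f} GB/s per direction")
```

The same function gives twice the value for a 32-bit link at the same clock, since bandwidth scales linearly with link width.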
HT3 Block Diagram
- HT3 introduces fault detection and recovery mechanisms for higher reliability during high-speed operation
  - Periodic CRC window => per-packet CRC
  - Link training
  - Link deskewing necessary
  - Retry protocol
  - Stomping
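The idea of a per-packet CRC combined with a retry can be illustrated with a toy model. This is a minimal sketch and not the HT3 protocol: `zlib.crc32` stands in for the CRC defined by the specification, the one-byte sequence number and the helper names `frame`, `check`, and `transfer` are hypothetical, and the retry is modelled as a simple resend loop. The real packet format, acknowledgement scheme, and stomping rules are defined by the HT3 specification and are not modelled here.

```python
import zlib


def frame(seq: int, payload: bytes) -> bytes:
    """Append a CRC to one packet (hypothetical layout: seq byte, payload, CRC32)."""
    body = bytes([seq % 256]) + payload
    return body + zlib.crc32(body).to_bytes(4, "little")


def check(wire: bytes):
    """Return (seq, payload) if the CRC matches, otherwise None."""
    body, crc = wire[:-4], wire[-4:]
    if zlib.crc32(body).to_bytes(4, "little") != crc:
        return None
    return body[0], body[1:]


def transfer(packets, corrupt_once):
    """Send packets in order; a packet that fails its CRC check is sent again."""
    received, retries = [], 0
    pending_errors = set(corrupt_once)  # sequence numbers damaged on the first attempt
    for seq, payload in enumerate(packets):
        while True:
            wire = frame(seq, payload)  # sender still holds the packet, so it can resend
            if seq in pending_errors:
                pending_errors.discard(seq)
                wire = wire[:1] + bytes([wire[1] ^ 0x01]) + wire[2:]  # inject a single-bit error
            result = check(wire)
            if result is not None:
                received.append(result[1])
                break
            retries += 1  # CRC mismatch: the receiver drops the packet and a retry follows
    return received, retries


data, retry_count = transfer([b"read", b"write", b"fence"], corrupt_once={1})
print(data, "retries:", retry_count)
```

Checking each packet individually lets the receiver reject a damaged packet immediately instead of discovering the error at the end of a periodic CRC window.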
Altera Ulysses Rev 2
- HTX connector with 16-bit transceiver-based bidirectional interface
- Altera Stratix IV GX 230 (F1517 footprint)
- 256 MB DDR3-1066 memory
- 2 CX4 connectors routed to FPGA transceivers
- Marvell 88E1111 Ethernet solution
- USB2 connectivity via Cypress CY7C68013A High-Speed USB
- Additional external connectivity with Stratix LVDS interfaces
Ulysses Extension Options
- Extension possibilities for usage as a prototype development platform
- SEAF connector (Samtec), 500 pins
  - Single-ended signaling up to 9.5 GHz (114)
  - Differential pair signaling up to 10.5 GHz (55)
- Three QTH connectors (Samtec), 3x120 pins
  - 9 GHz single-ended capability
  - 8 GHz differential pair capability
- Up to 108 differential pairs plus sideband signals
HT3 Implementation - Requirements
- Porting the HT3 core onto the Altera Ulysses board has two major requirements
- Hardware has to be capable of carrying high-speed signals
  - Special design methodology
  - HSPICE simulations
- Physical interface development
Simulation Setup
- HT tracks between Opteron processor and Stratix IV
  - HSPICE model of the FPGA high-speed serial transceiver (Altera)
  - Opteron processor IBIS models (AMD)
  - HTX connector SPICE model (Samtec)
- Cadence tool chain extracts design-specific data
HT3.1 channel data eye specification

| Parameter | Min | Max | Unit | Description |
|---|---|---|---|---|
| TCH-EYE | 0.55 | | UI | Eye width for 2.4-5.2 Gbps |
| TCH-EYE-6.4 | 0.65 | | UI | Eye width for 5.6-6.4 Gbps |
| TCH-CLK-TJ | | 0.1 | UI | Jitter additive to CLK |
| VCH-EYE-DC | 140 | | mV | Eye height for 2.4-5.2 Gbps |
| VCH-EYE-DC-6.4 | 170 | | mV | Eye height for 5.6-6.4 Gbps |

Unit Interval (UI): 416 ps at 2.4 Gbps, 156 ps at 6.4 Gbps
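The unit interval is the reciprocal of the per-lane data rate, so the eye-width limits given in UI can be converted to picoseconds directly. The short sketch below reproduces that arithmetic; the function names are illustrative.

```python
def unit_interval_ps(data_rate_gbps: float) -> float:
    """Duration of one bit in picoseconds for a given per-lane data rate."""
    return 1000.0 / data_rate_gbps


def min_eye_width_ps(data_rate_gbps: float, eye_width_ui: float) -> float:
    """Convert an eye-width limit given in UI into picoseconds."""
    return eye_width_ui * unit_interval_ps(data_rate_gbps)


for rate, width_ui in [(2.4, 0.55), (6.4, 0.65)]:
    print(f"{rate} Gbps: UI = {unit_interval_ps(rate):.1f} ps, "
          f"minimum eye width = {min_eye_width_ps(rate, width_ui):.1f} ps")
```

The slide values of 416 ps and 156 ps are these results truncated to whole picoseconds.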
HTX Track Simulated at HT1200
- Eye width: 0.55 UI at 2.4 Gbps = 229 ps
  - 375 ps > 229 ps
- Eye height: range 531 mV to 998 mV, above the 140 mV minimum
Critical HTX Track at HT3200
- Eye width: 0.65 UI at 6.4 Gbps = 101 ps
  - 107 ps > 101 ps
- Eye height: 224 mV, above the 170 mV minimum
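The pass/fail comparison for both simulated tracks can be written as a small margin check. This is a sketch under stated assumptions: the limits are taken from the channel data eye specification table (0.55 UI and 140 mV up to 5.2 Gbps, 0.65 UI and 170 mV from 5.6 Gbps), the lower bound of the HT1200 eye-height range is used as the measured height, and `EyeSpec`, `spec_for`, and `eye_margins` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EyeSpec:
    min_width_ui: float   # minimum eye width in unit intervals
    min_height_mv: float  # minimum eye height in millivolts


def spec_for(data_rate_gbps: float) -> EyeSpec:
    """Select the limits by data-rate range (rates step in 0.4 Gbps, so 5.2 is the split)."""
    if data_rate_gbps <= 5.2:
        return EyeSpec(min_width_ui=0.55, min_height_mv=140.0)
    return EyeSpec(min_width_ui=0.65, min_height_mv=170.0)


def eye_margins(data_rate_gbps: float, width_ps: float, height_mv: float):
    """Return (width margin in ps, height margin in mV); both must be positive to pass."""
    spec = spec_for(data_rate_gbps)
    ui_ps = 1000.0 / data_rate_gbps
    return width_ps - spec.min_width_ui * ui_ps, height_mv - spec.min_height_mv


for name, rate, width_ps, height_mv in [("HT1200", 2.4, 375.0, 531.0),
                                        ("HT3200", 6.4, 107.0, 224.0)]:
    width_margin, height_margin = eye_margins(rate, width_ps, height_mv)
    verdict = "pass" if width_margin > 0 and height_margin > 0 else "fail"
    print(f"{name}: width margin {width_margin:.1f} ps, "
          f"height margin {height_margin:.0f} mV -> {verdict}")
```

The HT3200 track passes with only a few picoseconds of width margin, which is consistent with the slide calling it the critical track.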
PHY challenges
- PHY must support both HT1 and HT3
- Two inherently different operation modes:
  - HT1 is source synchronous; a link clock is transmitted
  - HT3 uses CDR to recover the embedded clock
- Both low-speed (200 MHz) and high-speed (3200 MHz) links must be supported
  - LVDS is too slow for HT3
  - Stratix IV transceivers must be used
- PHY must support frequency switching
Stratix IV Transceivers
HT1 operation
- HT200 data rate is below the minimum supported rate of Stratix IV transceivers
  - 5x oversampling is used
- No scrambling or 8b/10b encoding, therefore no CDR
  - Lock to reference clock to create sampling points
- TX link clock treated as a data channel
  - Simply created by applying a clock pattern
  - Transceivers in PMA mode to provide deterministic latency
  - 90-degree clock shift by padding clock data
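The oversampling scheme can be modelled in a few lines. This is a behavioural sketch, not the PHY implementation: it assumes a double-data-rate link (so HT200 carries 400 Mbit/s per lane), a fixed sampling phase in the middle of each bit, and a clock lane that simply alternates once per bit time. All function names are hypothetical, and the 90-degree clock shift is not modelled.

```python
OVERSAMPLING = 5  # samples per link bit, as stated on the slide


def serial_rate_mbps(link_clock_mhz: float) -> float:
    """Transceiver rate needed to oversample a double-data-rate link."""
    return 2 * link_clock_mhz * OVERSAMPLING


def oversample_tx(bits: list[int]) -> list[int]:
    """Transmit side: repeat every link bit OVERSAMPLING times."""
    return [bit for bit in bits for _ in range(OVERSAMPLING)]


def recover_rx(samples: list[int], phase: int = OVERSAMPLING // 2) -> list[int]:
    """Receive side: keep one sample per bit at a fixed offset (assumed mid-bit)."""
    return samples[phase::OVERSAMPLING]


def clock_pattern(bit_times: int) -> list[int]:
    """Link clock sent as ordinary data: one level per bit time, alternating."""
    return oversample_tx([(i + 1) % 2 for i in range(bit_times)])


bits = [1, 0, 1, 1, 0]
print(serial_rate_mbps(200))            # transceiver rate for HT200 in Mbit/s
print(recover_rx(oversample_tx(bits)))  # recovers the original bit sequence
print(clock_pattern(2))                 # one full clock period as samples
```

Because no clock is recovered from the data in this mode, the sampling points come from the reference clock, which is why the model uses a fixed phase instead of an adaptive one.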
HT3 operation
- Reconfiguration of transceiver logic for the switch to HT3
  - Bypass oversampling
  - Switch to CDR
  - Enable elastic buffers to compensate for phase differences between the lanes
  - Inter-lane skew is handled in the HT3 core logic
- No support for error detection in PHY
  - Signal integrity issues are detected by the HT3 core
  - Link reliability features are defined by the HT3 protocol
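The difference between the two PHY operation modes can be summarised as a small settings table. This is only an illustrative data model: the type and field names are hypothetical, and the values are read off the HT1 and HT3 operation slides (HT1 with elastic buffers disabled is an assumption based on the deterministic-latency requirement). It does not reflect the actual transceiver reconfiguration interface.

```python
from dataclasses import dataclass
from enum import Enum


class LinkMode(Enum):
    HT1 = "HT1"
    HT3 = "HT3"


@dataclass(frozen=True)
class PhySettings:
    oversampling: int      # samples per bit; 1 means oversampling is bypassed
    use_cdr: bool          # recover the clock from the data stream
    elastic_buffers: bool  # absorb phase differences between lanes


# Hypothetical settings table derived from the two operation slides
SETTINGS = {
    LinkMode.HT1: PhySettings(oversampling=5, use_cdr=False, elastic_buffers=False),
    LinkMode.HT3: PhySettings(oversampling=1, use_cdr=True, elastic_buffers=True),
}


def reconfigure(mode: LinkMode) -> PhySettings:
    """Return the settings the PHY would apply when the link changes mode."""
    return SETTINGS[mode]


print(reconfigure(LinkMode.HT1))
print(reconfigure(LinkMode.HT3))
```

Keeping both configurations side by side makes explicit that a frequency switch changes the whole receive path and not just the line rate.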
Measurements
- Round-trip latency of a single PIO access to the device is 655 ns
- Higher latency than the older HT1 system
  - More pipeline stages for decoding
  - Serializer must be used within the FPGA
  - Several clock domain crossings
- Less than half of the available bandwidth
  - Caused by credit starvation
- Credit redistribution => 2 GB/s DMA write and 1.6 GB/s DMA read
=> This improvement shows that full HT utilization can only be achieved by using a device with higher performance than an FPGA
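Credit starvation can be understood with a simple first-order model: a sender may only have as many packets in flight as it holds credits, so the sustained rate is bounded by the in-flight data divided by the time until a credit returns. The sketch below illustrates that bound. All numbers in it (credit counts, 64-byte payload, 400 ns credit round trip, 4.8 GB/s link rate) are hypothetical values chosen for illustration and are not measurements from this work.

```python
def credit_limited_bandwidth(credits: int, payload_bytes: int,
                             credit_round_trip_ns: float,
                             link_bandwidth_gbytes: float) -> float:
    """Sustained bandwidth in GB/s under credit-based flow control (first-order model)."""
    # Bytes per nanosecond is numerically equal to GB/s.
    in_flight_limit = credits * payload_bytes / credit_round_trip_ns
    return min(link_bandwidth_gbytes, in_flight_limit)


# Hypothetical example values, not taken from the measurements above
for credits in (8, 16, 32):
    rate = credit_limited_bandwidth(credits, payload_bytes=64,
                                    credit_round_trip_ns=400.0,
                                    link_bandwidth_gbytes=4.8)
    print(f"{credits:2d} credits -> {rate:.2f} GB/s")
```

In this model, assigning more credits to the busy traffic class raises throughput until the link rate itself becomes the limit, which is the effect credit redistribution aims for.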
Stratix IV GX 230

| Resources | Used | Total | Percent |
|---|---|---|---|
| Combinational ALUTs | 42,534 | 182,400 | 23 % |
| Memory ALUTs | 49 | 91,200 | < 1 % |
| Dedicated logic registers | 40,009 | 182,400 | 22 % |
| Logic utilization | | | 34 % |
| Total block memory bits | 739,154 | 14,625,792 | 5 % |
| Total PLLs | 3 | 8 | 38 % |
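The percentage column follows directly from the used and total counts, which the short check below reproduces. The dictionary and helper names are illustrative. The overall logic utilization figure is not recomputed, because its used and total counts are not listed.

```python
import math

# (used, total) pairs copied from the resource table
RESOURCES = {
    "Combinational ALUTs": (42_534, 182_400),
    "Memory ALUTs": (49, 91_200),
    "Dedicated logic registers": (40_009, 182_400),
    "Total block memory bits": (739_154, 14_625_792),
    "Total PLLs": (3, 8),
}


def rounded_percent(used: int, total: int) -> int:
    """Utilization in percent, rounded half up to a whole number."""
    return math.floor(100 * used / total + 0.5)


for name, (used, total) in RESOURCES.items():
    print(f"{name}: {rounded_percent(used, total)} %")
```

Memory ALUTs round to 0 % here, which corresponds to the "< 1 %" entry in the table.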
Conclusion & Outlook
- HT1200 and HT1600 (8 bit) implementations are stable
  - Higher link speeds are hard to achieve because of hardware complexity
- The HT3 platform for rapid prototyping and high performance reconfigurable computing was developed successfully
  - Ideal environment for development and research in the areas of coprocessors and FPGA accelerators
  - Extension connectors enable the realization of adapter cards such as a network search engine
Thank you for your attention! Questions?
Back-up Slides
NSE Design
HT3 Xilinx Board
- XC5VLX330T & 2x FX70T
- 16-bit wide HT3 link
- 2 CX-4 connectors (IB, 10GE)
- SO-DIMM connector
- HT3 PHY design
  - Based on GTPs (up to 4 Gbit/s)