A HT3 Platform for Rapid Prototyping and High Performance Reconfigurable Computing


Second International Workshop on HyperTransport Research and Application (WHTRA 2011)
University of Heidelberg, Computer Architecture Group
Sven Kapferer, Alexander Giese, Holger Fröning, Ulrich Brüning
09.02.2011

Outline
- Motivation
- HyperTransport
- Board Architecture
- HT3 Implementation
- Measurements
- Conclusion & Outlook

Accelerated Computing
- Most accelerated computing implementations focus on GPU usage
  - GPUs are mass-market products providing a huge number of parallel processing units; FPGAs lead to higher costs
  - GPUs have optimized and easy programming support; FPGA usage is more difficult

FPGA Advantages
- FPGAs are evolving remarkably and offer several advantages
  - Flexibility and complete reconfigurability
  - Enable usage of large amounts of memory
  - Fine-grained access to and from a host system
  - Higher efficiency measured in GFLOPS/Watt

HyperTransport
- HyperTransport is the easiest and best way to connect a device to a processor
  - The only public specification
  - Free for academic use
  - Low-latency communication without bridges or protocol conversion
- HT link frequencies
  - HT200 (2bit/200MHz) up to HT3200 (32bit/3.2GHz)
  - Theoretical maximum unidirectional bandwidth of 12.8 GB/s
  - HT3 begins at HT1200
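The bandwidth figure above can be checked with a short calculation. This is a minimal sketch that assumes double-data-rate signaling (two transfers per link clock cycle on each lane); the function name is made up for illustration. Under that assumption, 12.8 GB/s per direction corresponds to a 16-bit link at 3.2 GHz, and a 32-bit link at the same clock would give twice that.

```python
def ht_unidirectional_bandwidth_gbytes(link_clock_ghz, link_width_bits):
    """Theoretical one-direction bandwidth in GB/s (hypothetical helper).

    Assumes double data rate: two transfers per clock cycle per lane.
    """
    transfers_per_ns = 2.0 * link_clock_ghz              # GT/s per lane
    return transfers_per_ns * link_width_bits / 8.0      # bits -> bytes


# HT3200 on a 16-bit link: 6.4 GT/s * 16 bit / 8 = 12.8 GB/s
ht3200_16bit = ht_unidirectional_bandwidth_gbytes(3.2, 16)

# HT1200, where HT3 begins, on a 16-bit link: 2.4 GT/s * 16 bit / 8 = 4.8 GB/s
ht1200_16bit = ht_unidirectional_bandwidth_gbytes(1.2, 16)
```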

HT3 Block Diagram
- HT3 introduces fault detection and recovery mechanisms for higher reliability during high-speed operation
  - Periodic CRC window => per-packet CRC
  - Link training
  - Link deskewing necessary
  - Retry protocol
  - Stomping

Altera Ulysses Rev 2
- HTX connector with 16-bit transceiver-based bidirectional interface
- Altera Stratix IV GX 230 (F1517 footprint)
- 256 MB DDR3-1066 memory
- 2 CX4 connectors routed to FPGA transceivers
- Marvell 88E1111 Ethernet solution
- USB2 connectivity via Cypress CY7C68013A High-Speed USB
- Additional external connectivity with Stratix LVDS interfaces

Ulysses Extension Options
- Extension possibilities for usage as a prototype development platform
- SEAF connector (Samtec), 500 pins
  - Single-ended signaling up to 9.5 GHz (114)
  - Differential pair signaling up to 10.5 GHz (55)
- Three QTH connectors (Samtec), 3x120 pins
  - 9 GHz single-ended capability
  - 8 GHz differential pair capability
  - Up to 108 differential pairs plus sideband signals

HT3 Implementation - Requirements
- Porting the HT3 core onto the Altera Ulysses board has two major requirements
- HW has to be capable of high-speed signals
  - Special design methodology
  - HSpice simulations
- Physical interface development

Simulation Setup
- HT tracks between Opteron processor and Stratix IV
- HSpice model of FPGA high-speed serial transceiver (Altera)
- Opteron processor IBIS models (AMD)
- HTX connector Spice model (Samtec)
- Cadence tool chain extracts design-specific data

HT3.1 channel data eye specification

Parameter        Min    Max   Unit  Description
TCH-EYE          0.55         UI    Eye width for 2.4-5.2 Gbps
TCH-EYE-6.4      0.65         UI    Eye width for 5.6-6.4 Gbps
TCH-CLK-TJ              0.1   UI    Jitter additive to CLK
VCH-EYE-DC       140          mV    Eye height for 2.4-5.2 Gbps
VCH-EYE-DC-6.4   170          mV    Eye height for 5.6-6.4 Gbps

Unit Interval (UI): 416 ps at 2.4 Gbps, 156 ps at 6.4 Gbps

HTX Track Simulated at HT1200
- Eye width: specification is 0.55 UI at 2.4 Gbps = 229 ps; simulated 375 ps > 229 ps
- Eye height: ranges from 531 mV to 998 mV, above the 140 mV specification for 2.4 Gbps

Critical HTX Track at HT3200
- Eye width: specification is 0.65 UI at 6.4 Gbps = 101 ps; simulated 107 ps > 101 ps
- Eye height: 224 mV, above the 170 mV specification for 6.4 Gbps
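The eye-width checks on the two preceding slides follow from the unit interval, which is the reciprocal of the per-lane data rate. The sketch below reproduces that arithmetic; the function names are illustrative, and the simulated widths (375 ps and 107 ps) are taken from the slides.

```python
def unit_interval_ps(rate_gbps):
    """Length of one unit interval in picoseconds for a per-lane rate in Gbps."""
    return 1000.0 / rate_gbps


def eye_width_margin_ps(rate_gbps, min_eye_ui, simulated_eye_ps):
    """Simulated eye width minus the required minimum, in picoseconds.

    A positive result means the track meets the eye-width specification.
    """
    required_ps = min_eye_ui * unit_interval_ps(rate_gbps)
    return simulated_eye_ps - required_ps


# HT1200: 0.55 UI of about 416.7 ps is about 229 ps; the simulated eye is 375 ps wide.
margin_ht1200 = eye_width_margin_ps(2.4, 0.55, 375.0)

# HT3200: 0.65 UI of 156.25 ps is about 101.6 ps; the simulated eye is 107 ps wide.
margin_ht3200 = eye_width_margin_ps(6.4, 0.65, 107.0)
```

Both margins are positive, but the HT3200 margin is only a few picoseconds, which is consistent with that track being called critical.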

PHY challenges
- PHY must support both HT1 and HT3
- Two inherently different operation modes:
  - HT1 is source synchronous, a link clock is transmitted
  - HT3 uses CDR to recover the embedded clock
- Both low-speed (200 MHz) and high-speed (3200 MHz) links must be supported
  - LVDS too slow for HT3
  - Stratix IV transceivers must be used
- PHY must support frequency switching

Stratix IV Transceivers

HT1 operation
- HT200 data rate is below the minimum supported rate of the Stratix IV transceivers
  - 5x oversampling is used
  - No scrambling or 8b10b encoding, therefore no CDR
  - Lock to reference clock to create sampling points
- TX link clock is treated as a data channel
  - Simply created by applying a clock pattern
  - Transceivers in PMA mode to provide deterministic latency
  - 90 degree clock shift by padding clock data
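The oversampling idea can be illustrated in a few lines. This is a simplified sketch, not the actual PHY logic: it assumes each HT1 bit is repeated five times on the transmit side and that the receiver picks the middle sample of each group of five. All names are hypothetical, and the 90 degree clock shift is not modeled.

```python
OVERSAMPLING = 5  # samples per HT1 bit time, as stated on the slide


def tx_oversample(bits, factor=OVERSAMPLING):
    """Repeat every bit `factor` times so the transceiver runs at a higher line rate."""
    return [bit for bit in bits for _ in range(factor)]


def rx_downsample(samples, factor=OVERSAMPLING):
    """Recover one bit per group by taking the middle sample (assumed sampling point)."""
    return samples[factor // 2::factor]


def clock_pattern(bit_times, factor=OVERSAMPLING):
    """Link clock sent as data: it toggles once per bit time (double data rate)."""
    return tx_oversample([(i + 1) % 2 for i in range(bit_times)], factor)


# HT200 carries 400 Mb/s per lane (200 MHz, double data rate); 5x gives 2000 Mb/s.
line_rate_mbps = 2 * 200 * OVERSAMPLING

data = [1, 0, 0, 1, 1]
recovered = rx_downsample(tx_oversample(data))
```

Because no clock is recovered from the data in this mode, the sampling point has to come from the reference clock, which is why the slide mentions locking to the reference clock.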

HT3 operation
- Reconfiguration of transceiver logic for the switch to HT3
  - Bypass oversampling
  - Switch to CDR
  - Enable elastic buffers to compensate for phase differences on the lanes
  - Inter-lane skew will be handled in HT3 core logic
- No support for error detection in PHY
  - Signal integrity issues detected by HT3 core
  - Link reliability features are defined by HT3 protocol

Measurements
- Round-trip single PIO access to device takes 655 ns
  - Higher latency than the old HT1 system
  - More pipeline stages for decoding
  - Serializer must be used within FPGA
  - Several clock domain crossings
- Less than half of the available bandwidth is achieved
  - Caused by credit starvation
  - Credit redistribution => 2 GB/s DMA write and 1.6 GB/s DMA read
  => This improvement shows that full HT utilization can only be achieved by using a device with higher performance than an FPGA
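One common way to reason about credit starvation is a simple bound: a sender can have at most as many packets in flight as it holds credits, so sustained throughput cannot exceed credits times payload divided by the time a credit takes to return. The sketch below applies that bound. Every number in it is hypothetical and chosen only for illustration; none of them are measured values from this platform.

```python
def sustained_bandwidth(credits, payload_bytes, credit_loop_s, link_bw_bytes_per_s):
    """Upper bound on throughput under credit-based flow control (simplified model).

    credits             -- packets the sender may have outstanding (hypothetical)
    payload_bytes       -- payload carried per packet
    credit_loop_s       -- time until a consumed credit is returned (hypothetical)
    link_bw_bytes_per_s -- raw link bandwidth that caps the result (hypothetical)
    """
    credit_limited = credits * payload_bytes / credit_loop_s
    return min(credit_limited, link_bw_bytes_per_s)


LINK_BW = 4.8e9   # hypothetical link bandwidth in bytes/s
LOOP = 200e-9     # hypothetical credit return time in seconds

few_credits = sustained_bandwidth(4, 64, LOOP, LINK_BW)    # about 1.28e9 bytes/s
more_credits = sustained_bandwidth(8, 64, LOOP, LINK_BW)   # about 2.56e9 bytes/s
capped = sustained_bandwidth(16, 64, LOOP, LINK_BW)        # limited by the link
```

In this model, assigning more credits to the busy traffic class raises throughput until the link itself becomes the limit, and a shorter credit return time has the same effect.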

Stratix IV GX 230 Resources

Resource                    Used      Total        Percent
Combinational ALUTs         42,534    182,400      23 %
Memory ALUTs                49        91,200       < 1 %
Dedicated logic registers   40,009    182,400      22 %
Logic utilization                                  34 %
Total block memory bits     739,154   14,625,792   5 %
Total PLLs                  3         8            38 %

Conclusion & Outlook
- HT1200 and HT1600 (8 bit) implementations are stable
- Higher link speeds are hard to achieve because of HW complexity
- The HT3 platform for rapid prototyping and high performance reconfigurable computing was a successful development
- Ideal environment for development and research in the areas of coprocessors or FPGA accelerators
- Extension connectors enable realization of adapter cards like a network search engine

Thank you for your attention! Questions?

Back-up Slides

NSE Design

HT3 Xilinx Board
- XC5VLX330T & 2x FX70T
- 16-bit wide HT3 link
- 2 CX-4 connectors (IB, 10GE)
- SO-DIMM connector
- HT3 PHY design
  - Based on GTPs (up to 4 GBit/s)