Leveraging HyperTransport for a custom high-performance cluster network

1 Leveraging HyperTransport for a custom high-performance cluster network. Mondrian Nüssle, HTCE Symposium.

2 Outline. Background & Motivation; Architecture; Hardware Implementation; Software Stack; Results; Conclusion. (Block diagram labels: Host Interface with HyperTransport IP Core and HTAX; NIC with ATU, VELO, RMA, and C&S Registerfile; Network with EXTOLL.)

3 EXTOLL: Background & Motivation. High-performance computing is synonymous with parallel computing. Interconnection networks between processors are a key component in parallel systems. Patterson stated: "Latency lags bandwidth." The EXTOLL project at the CAG aims to significantly lower communication latency and improve communication in parallel systems.

4 Goals. Enable communication with extremely low latency, close to that of a main memory access. Enable overlap of communication and computation. Design a balanced system, both in terms of CPU on-loading versus offloading and in terms of system complexity. Adding bandwidth is much easier than reducing latency.

5 Key design facts. Leverage HT as the host interface for the lowest latency of data transport between CPU and device. Leverage a modified HT as the on-chip communication protocol. Implement a lean network interface controller: minimize state information on the NIC; provide user-level, virtualized access (avoid the kernel); minimize the number of CPU-device and memory-device transactions. Provide a network layer that offers a reliable, in-order, low-latency transport service.

7 Block diagram. Host Interface block: HyperTransport IP Core and HTAX. NIC block: several communication functions (ATU, VELO, RMA, C&S Registerfile). Network block (EXTOLL): 6 links, 9x9 crossbar. Flexible architecture: configurable data path.

8 Communication functions: VELO. VELO stands for Virtualized Engine for Low Overhead. It enables ultra-low-latency send/receive communication. Messages of up to 64 bytes (one cache line) are supported directly. A single PIO transaction triggers sending of a message. Message completion at the receiver is usually performed with a single DMA transaction. Traffic between host and device is minimized!

9 Communication functions: RMA. EXTOLL Remote Memory Architecture. Enables access to remote memory using put, get, and atomic transactions. A transaction is triggered by a single 128-bit SSE2 store, minimizing start-up latency. Flexible notifications: at the requester, the completer, the responder, or any combination thereof.
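To make the "single 128-bit store" idea concrete, here is a minimal sketch of a transaction descriptor that fits in exactly 16 bytes. The field layout, the opcode values, and the flag values are all hypothetical and chosen only for illustration; the slides do not describe the real EXTOLL descriptor format.

```python
import struct

# Hypothetical opcodes: the slides only name put, get and atomic transactions.
OP_PUT, OP_GET, OP_ATOMIC = 0, 1, 2

# Hypothetical notification flags for requester, completer and responder.
NOTIFY_REQUESTER = 0x1
NOTIFY_COMPLETER = 0x2
NOTIFY_RESPONDER = 0x4

# Assumed little-endian layout, 16 bytes in total:
# opcode (1), notification flags (1), destination node (2),
# length in bytes (4), remote address (8).
DESCRIPTOR = struct.Struct("<BBHIQ")


def pack_descriptor(op, notify, node, length, remote_addr):
    """Pack one transaction request into a single 128-bit value."""
    return DESCRIPTOR.pack(op, notify, node, length, remote_addr)


def unpack_descriptor(raw):
    """Recover the fields from a 16-byte descriptor."""
    return DESCRIPTOR.unpack(raw)
```

Because the whole request fits in 16 bytes, a host could in principle hand it to the device with one 128-bit store, and the notification flags can be OR-ed together to request any combination of notifications.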

10 Supporting modules. Address Translation Unit (ATU): provides address translation services for RMA; registration/unregistration latency in prototype systems starts at ~2 µs; translation uses an on-chip TLB and main-memory tables. Control and Status (C&S) Registerfile: automatically generated from a high-level specification (including kernel code); local and remote access possible (network management software).
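The translation path described for the ATU (a small on-chip TLB backed by tables in main memory) can be sketched in software as follows. This is a simplified model under stated assumptions: the page size, the TLB capacity, the LRU replacement policy, and all names are hypothetical and not taken from the EXTOLL design.

```python
from collections import OrderedDict

PAGE_SIZE = 4096  # assumed page size, not stated in the slides


class TranslationUnit:
    """Toy model: a small LRU TLB in front of a larger translation table."""

    def __init__(self, tlb_entries=4):
        self.table = {}           # stands in for the main-memory tables
        self.tlb = OrderedDict()  # stands in for the on-chip TLB
        self.tlb_entries = tlb_entries
        self.hits = 0
        self.misses = 0

    def register(self, page, phys_page):
        """Register a page so that it can be translated later."""
        self.table[page] = phys_page

    def unregister(self, page):
        """Remove a page from the table and invalidate any cached copy."""
        self.table.pop(page, None)
        self.tlb.pop(page, None)

    def translate(self, addr):
        page, offset = divmod(addr, PAGE_SIZE)
        if page in self.tlb:
            self.hits += 1
            self.tlb.move_to_end(page)  # mark as most recently used
        else:
            self.misses += 1
            if page not in self.table:
                raise KeyError(f"page {page} is not registered")
            self.tlb[page] = self.table[page]  # table walk, then fill the TLB
            if len(self.tlb) > self.tlb_entries:
                self.tlb.popitem(last=False)   # evict least recently used
        return self.tlb[page] * PAGE_SIZE + offset
```

The point of the split is that repeated accesses to the same registered page are served from the small cache, while only the first access pays for the lookup in the larger table.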

11 HT Interface. HT-Core: interface to the host. All functional units need to communicate with the host. Avoid protocol conversion for the on-chip network. HTAX: crossbar running an on-chip protocol that is simplified, has more source tags, and uses a fixed format.

12 Network layer. Fully parametrizable width of data paths and number of ports. In-order delivery of packets. Virtual channels. Hardware retransmission. Cut-through switching. Credit-based flow control. Current implementations: 6 ports used to connect to external links; 16+2 bit data path width.
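Credit-based flow control, one of the network-layer features listed above, can be illustrated with a small software model. This is a generic sketch of the technique, not the EXTOLL link protocol; the class names and the one-credit-per-buffer-slot granularity are assumptions.

```python
from collections import deque


class Receiver:
    """Holds a fixed number of buffer slots for incoming packets."""

    def __init__(self, buffer_slots):
        self.buffer_slots = buffer_slots
        self.buffer = deque()

    def accept(self, packet):
        # With correct credit accounting this can never overflow.
        assert len(self.buffer) < self.buffer_slots, "buffer overflow"
        self.buffer.append(packet)

    def drain(self):
        """Consume one packet; return the number of credits freed (0 or 1)."""
        if self.buffer:
            self.buffer.popleft()
            return 1
        return 0


class Sender:
    """Sends only while it holds credits, one credit per free receiver slot."""

    def __init__(self, receiver):
        self.receiver = receiver
        self.credits = receiver.buffer_slots

    def try_send(self, packet):
        if self.credits == 0:
            return False  # stall: no free slot at the receiver
        self.credits -= 1
        self.receiver.accept(packet)
        return True

    def add_credits(self, count):
        self.credits += count
```

The sender never transmits without a credit, so the receiver buffer cannot overflow and no packet has to be dropped because of a full buffer.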

14 Implementation I. The EXTOLL prototype is implemented on the HTX-Board: Virtex-4 FX100 FPGA, speed grade -11 or -12; 6 SFP optical transceivers. Currently: 16 bit width, 180 MHz core frequency, 3.6 Gb/s links.
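One way to relate the three figures on this slide is the short calculation below. It assumes that the serial links use 8b/10b line coding, which the slides do not state; under that assumption, a 16-bit data path at 180 MHz corresponds to the quoted 3.6 Gb/s link rate.

```python
DATA_PATH_BITS = 16            # from the slide
CORE_CLOCK_HZ = 180_000_000    # from the slide


def payload_rate_bps(width_bits, clock_hz):
    """Bits per second delivered by the internal data path."""
    return width_bits * clock_hz


def line_rate_bps(payload_bps, coded_bits=10, data_bits=8):
    """Serial line rate, assuming 8b/10b coding (an assumption, see above)."""
    return payload_bps * coded_bits // data_bits


payload = payload_rate_bps(DATA_PATH_BITS, CORE_CLOCK_HZ)  # 2.88 Gb/s
line = line_rate_bps(payload)                              # 3.6 Gb/s
```

Read this way, 3.6 Gb/s is the raw link rate, and the data path delivers at most 2.88 Gb/s, or 360 MB/s, per link before any protocol overhead.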

15 Implementation II. More than 90% of all slices of the FPGA are in use for the design. The HT-Core runs at 200 MHz internal frequency and HT400. EXTOLL modules run at 180 MHz on a speed-grade -12 device.

17 Software Stack. Diagram labels: Application, User-Application, Middleware (library; e.g. MPI, GASNet), libvelo, librma, Management, seru, velodrv, rmadrv, atudrv, extoll_rf, Basedriver, EXTOLL PCI Configspace, EXTOLL Hardware (VELO, RMA, Registerfile, ATU), User Space, Kernel Space. Key points: NIC OS bypass; layered approach; PGAS support through GASNet; MPI support through Open MPI; Linux kernel driver.

19 Results: Latency. [Plot: latency in µs versus message size in bytes (logarithmic scale) for EXTOLL VELO, EXTOLL RMA Put, and EXTOLL RMA Get.] Start-up latency is ~1 µs. An RMA Put transaction beats VELO at 256 bytes. Get latency is a full round trip.

20 Results: Bandwidth. [Plot: bandwidth in MB/s versus message size in bytes (logarithmic scale) for EXTOLL VELO, EXTOLL Put, and EXTOLL Get, with peak and half-peak payload bandwidth marked.] Maximum bandwidth is reached at a message size of 4k. More than half of the peak bandwidth (n½) is already reached at 32 bytes!
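The n½ figure on this slide is the message size at which half of the peak bandwidth is reached. A minimal sketch of how it can be read off a bandwidth sweep follows. The latency model and its parameters are hypothetical and serve only to generate synthetic input; they are not the measurements from the slide.

```python
def half_peak_size(samples):
    """Return the smallest message size that reaches half the peak bandwidth.

    samples: iterable of (size_in_bytes, bandwidth) pairs from one sweep.
    """
    samples = sorted(samples)
    peak = max(bandwidth for _, bandwidth in samples)
    for size, bandwidth in samples:
        if bandwidth >= peak / 2:
            return size
    return None  # only reached for an empty sweep


def model_bandwidth(size, startup_s, asymptotic_bytes_per_s):
    """Synthetic bandwidth from a simple model: time = start-up + size / rate."""
    return size / (startup_s + size / asymptotic_bytes_per_s)


# Synthetic sweep over power-of-two sizes with made-up model parameters.
sizes = [2 ** k for k in range(3, 13)]  # 8 bytes to 4096 bytes
sweep = [(n, model_bandwidth(n, 1e-6, 100e6)) for n in sizes]
n_half = half_peak_size(sweep)
```

In this simple model, n½ grows with the product of start-up latency and asymptotic bandwidth, which is why a low start-up latency shows up directly as a small n½.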

21 Technology Scaling. Reference: Mellanox ConnectX DDR IB. Configurations compared: FPGA, HT400, 180 MHz; optimized FPGA, HT400, 200 MHz; HT800 ASIC, 500 MHz internal (estimated); HT1000 ASIC, 800 MHz (estimated). Already beats the best IB silicon. An ASIC would show 3 times lower latency!

23 Conclusion. EXTOLL is an architecture for ultra-low-latency communication in parallel systems. Prototype hardware is up and running. A basic software environment is up and running. Performance numbers are excellent: ~1 µs start-up latency on the FPGA prototype. Bandwidth is limited by the serializers and the board, but can be improved with a new platform.

24 Next Steps. More software is being added; the most interesting is GASNet. Evaluation on the 1024-core Valencia cluster. On the hardware side, the next step is a new revision with a more powerful base technology. Evaluation of the next platform for hardware.

25 Thanks! Questions?

Building blocks for custom HyperTransport solutions

Building blocks for custom HyperTransport solutions Holger Fröning 2 nd Symposium of the HyperTransport Center of Excellence Feb. 11-12 th 2009, Mannheim, Germany Motivation Back in 2005: Quite some experience