Fermi Cluster for Real-Time Hyperspectral Scene Generation
1 Fermi Cluster for Real-Time Hyperspectral Scene Generation
Gary McMillian, Ph.D.
Crossfield Technology LLC, 9390 Research Blvd, Suite I200, Austin, TX
(512) x151
AF SBIR Program, Donald Snyder, III, Program Manager
Funding provided by Frank Carlen, Multi-Spectral Test
2 System Architecture & Approach
- Scenes are generated by heterogeneous processors, then transported over InfiniBand to the projector(s) using the RDMA protocol for high throughput and low latency
- Network interfaces aggregate data from multiple heterogeneous processors in high-speed frame buffers
- Contents of the frame buffers are output to the projector through an FPGA Mezzanine Card (FMC) interface
- IEEE 1588 Precision Time Protocol (PTP) provides global time synchronization
- Heterogeneous processors and projector network interfaces scale independently
7/20/11 Crossfield Technology LLC
3 Scalable System Architecture
(Block diagram: processor nodes with CPU/GPU network interfaces connect over fiber to an InfiniBand switch; network interface adapters drive the projector and HWIL via LVDS and DVI.)
4 HWIL Simulation System
Interconnect bandwidths (from block diagram):
- QuickPath Interconnect (QPI): ~100 Gbps
- PCI Express x8: ~32 Gbps (x16: ~64 Gbps)
- DDR3 SDRAM: ~85 Gbps/ch x 3 ch
- GDDR5 SDRAM: ~192 Gbps/ch x 6 ch
- QDR InfiniBand: ~32 Gbps
- VITA 57.1 / FMC: ~100 Gbps SERDES; LVDS I/O
Diagram elements: projector/HWIL with user-definable PHY and frame synch/request over the FMC; 1U-4U heterogeneous processors (CPUs with DDR3 SDRAM, GPUs with GDDR5 SDRAM, PCIe bridges, SSD, network adapters); 1U Crossfield Network Interface (FPGA with DDR3 SDRAM, network adapter); IEEE 1588 PTP server + Ethernet; InfiniBand switch.
5 REAL-TIME HIGH PERFORMANCE COMPUTER (HPC)
6 Real-Time HPC Requirements
- Deterministic & synchronous: synthesized images complete & ready at the HWIL frame rate
- High floating-point performance: implement physics-based algorithms
- High bandwidth: inter-processor communications for data exchange; stream high-resolution images to the projector at high frame rates
- High memory capacity & performance: processor memory for code, model parameters, data; non-volatile storage for code, model parameters, data, logging
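The bandwidth requirement above can be made concrete with a back-of-the-envelope calculation. This is an illustrative sketch only: the resolution, pixel depth, and frame rate below are assumptions, not figures from the presentation.

```python
# Hedged sketch: raw link bandwidth needed to stream uncompressed frames
# to a projector. All parameter values are illustrative assumptions.

def link_bandwidth_gbps(width, height, bits_per_pixel, frames_per_sec):
    """Raw payload bandwidth in Gb/s for an uncompressed image stream."""
    return width * height * bits_per_pixel * frames_per_sec / 1e9

# Example: a 1024x1024 frame, 16-bit pixels, 400 Hz frame rate.
bw = link_bandwidth_gbps(1024, 1024, 16, 400)
print(f"{bw:.2f} Gb/s")  # -> 6.71 Gb/s
```

At these assumed parameters a single stream already consumes a large fraction of a QDR InfiniBand link (~32 Gbps), which is why the architecture aggregates frame data in high-speed buffers at the network interface.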
7 Intel Xeon Processor Roadmap
- Westmere microarchitecture: 32 nm process, 6 cores, 40 lanes PCI Express Gen 2, 3 channels DDR3
- Sandy Bridge microarchitecture: 32 nm process, 4-8 cores, 40 lanes PCI Express Gen 3, 4 channels DDR3
8 Nvidia CUDA GPU Roadmap (21 SEP 2010)
- Kepler: to be released sometime in 2011, 28 nm process; estimated performance of 4-6 DP GFLOPS/W
- Maxwell: to be released sometime in 2013, 22 nm process; estimated performance of DP GFLOPS/W
9 Nvidia Tesla (Fermi Architecture)
- CUDA programming environment: C/C++, Fortran, OpenCL, Java, Python, or DirectX Compute
- GigaThread engine: 515 GFLOPS double precision, 1030 GFLOPS single precision (C2050/C2070)
- Parallel DataCache technology: 3-6 GB GDDR5 memory, 384-bit bus, ECC option
- GPUDirect with InfiniBand (M2050/M2070)
- PCI Express 2.0 (16 lanes), two DMA engines for bi-directional data transfer
10 Nvidia Tesla Comparison

                                       Tesla C2070   Tesla M2070   Tesla M2090
Peak double precision FP performance   515 GFLOPS    515 GFLOPS    665 GFLOPS
Peak single precision FP performance   1030 GFLOPS   1030 GFLOPS   1331 GFLOPS
CUDA cores                             448           448           512
Memory size (GDDR5)                    6 GB          6 GB          6 GB
Memory bandwidth (ECC off)             144 GB/s      150 GB/s      177 GB/s
Total Dissipated Power (TDP)           247 W         225 W         250 W
Retail price                           $2300         ~$2300        ~$3500
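The table's figures can be turned into the efficiency metric used on the roadmap slide (DP GFLOPS/W). A small sketch, using only values from the comparison table:

```python
# Sketch: DP GFLOPS per watt for each Fermi card in the comparison table,
# for context against the Kepler (4-6 GFLOPS/W) roadmap projection.
cards = {
    "C2070": (515, 247),   # (peak DP GFLOPS, TDP in watts)
    "M2070": (515, 225),
    "M2090": (665, 250),
}
for name, (gflops, watts) in cards.items():
    print(f"{name}: {gflops / watts:.2f} DP GFLOPS/W")
```

The Fermi generation lands around 2-2.7 DP GFLOPS/W, so the projected Kepler figure of 4-6 roughly doubles efficiency.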
11 InfiniBand Roadmap
- SDR - Single Data Rate
- DDR - Double Data Rate
- QDR - Quad Data Rate
- FDR - Fourteen Data Rate
- EDR - Enhanced Data Rate
- HDR - High Data Rate
- NDR - Next Data Rate
12 Mellanox ConnectX-2 Network Adapters
- Nvidia GPUDirect: InfiniBand adapter and Nvidia GPU share a CPU memory region
- OpenFabrics Enterprise Distribution (OFED) software
- Bandwidth: 10G Ethernet; 10/20/40G InfiniBand
- Protocol support: Remote Direct Memory Access (RDMA); OpenMPI, OSU MVAPICH, HP-MPI, Intel MPI, MS MPI, Scali MPI; TCP/UDP, IPoIB, SDP, RDS; SRP, iSER, NFS-RDMA, FCoIB, FCoE
- PCIe 2.0 (8 lanes)
- Performance: 1 µs ping latency; 50M MPI messages/s
13 Mellanox IS5200 InfiniBand Switch
- Non-blocking, full bisectional bandwidth
- Nanosecond-scale latency
- Up to 216 QSFP ports
- Aggregate throughput in the Tb/s range
- 9U cabinet: 6 spine modules, 12 leaf modules
- 1 kW
14 Remote Direct Memory Access (RDMA)
- RDMA enables data to be transferred from one processor's memory to another processor's memory across a network, without significantly involving either operating system
- RDMA supports zero-copy data transfers by enabling the network adapter to transfer data directly to or from application memory, eliminating the need to copy data between application memory and data buffers in the operating system kernel
- RDMA defines READ, WRITE, and SEND/RECEIVE operations
- RDMA adapters support thousands of concurrent transactions using work queues
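The work-queue flow described above can be sketched conceptually. This is a toy model, not real RDMA code: the class and method names are invented for illustration, and a real implementation would use libibverbs (queue pairs, `ibv_post_send`, completion queues). It shows the key semantic: the application posts a work request, the adapter places data directly into a pre-registered remote region, and the application learns of completion by polling a completion queue.

```python
# Conceptual sketch only: verbs-style work queues and direct data placement.
# All names here are invented for illustration; this is not the verbs API.
from collections import deque

class RdmaQueuePair:
    def __init__(self, remote_memory: bytearray):
        self.remote_memory = remote_memory   # stand-in for a registered region
        self.send_queue = deque()            # posted work requests
        self.completion_queue = deque()      # completed work requests

    def post_rdma_write(self, local_buf: bytes, remote_offset: int):
        # Post the work request; the OS kernel is not involved in the data path.
        self.send_queue.append(("RDMA_WRITE", local_buf, remote_offset))

    def process(self):
        # Stand-in for the adapter's DMA engine draining the send queue.
        while self.send_queue:
            op, buf, off = self.send_queue.popleft()
            self.remote_memory[off:off + len(buf)] = buf  # direct placement
            self.completion_queue.append((op, len(buf)))

remote = bytearray(16)
qp = RdmaQueuePair(remote)
qp.post_rdma_write(b"frame001", 0)
qp.process()
print(bytes(remote[:8]), list(qp.completion_queue))
```

The point of the model is that the WRITE lands in the target buffer with no intermediate kernel copy, which is what makes RDMA attractive for streaming frames at projector rates.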
15 OpenFabrics Alliance (OFA) Open Source
(Software-stack diagram spanning user space and kernel space over InfiniBand HCA and iWARP R-NIC hardware.)
- User space: diag tools, OpenSM, user-level MAD API, user-level verbs API (InfiniBand and iWARP R-NIC), IP-based and sockets-based access (SDP library), uDAPL (User Direct Access Programming Library), various MPIs, block storage access, clustered DB access, access to file systems
- Kernel mid-layer: MAD, SMA, SA client, Connection Manager, Connection Manager Abstraction (CMA); kernel-bypass paths for verbs
- Upper layer protocols: IPoIB (IP over InfiniBand), SDP (Sockets Direct Protocol), SRP (SCSI RDMA Protocol, initiator), iSER (iSCSI RDMA Protocol, initiator), RDS (Reliable Datagram Service), NFS-RDMA RPC, cluster file systems
- Management agents: SA (Subnet Administrator), MAD (Management Datagram), SMA (Subnet Manager Agent), PMA (Performance Manager Agent)
- Hardware: hardware-specific drivers for the InfiniBand HCA (Host Channel Adapter) and the iWARP R-NIC (RDMA NIC)
16 GPU Server Options
- 1U server: dual Xeon 5600 processors & 5520 chipsets; three 16-lane + one 8-lane PCIe slots; supports 1-3 M-series GPUs + IB HCA
- 2U server: dual Xeon 5600 processors & 5520 chipsets; four 16-lane + two 8-lane PCIe slots (PLX 8647 switch); supports 1-4 M-series GPUs + IB HCA
- 4U server: dual Xeon 5600 processors & 5520 chipsets; eight 16-lane PCIe slots (4 PLX 8647 switches); supports 4-7 C-series GPUs + IB HCA
17 HPC System Configuration
- 4U servers (64 + 1):
  - Dual 6-core, 2.66 GHz Intel Xeon 5650 (Westmere) CPUs
  - Dual Intel 5520 (Tylersburg-36D) IOH with 6.4 GT/s QPI
  - Four 16-lane PCI Express Gen 2 slots
  - Six 8 GB DDR3 DIMMs (48 GB)
  - Four Nvidia Tesla C2070 (Fermi) GPUs
  - One Mellanox 40G InfiniBand Host Channel Adapter
  - One 300 GB, 10K RPM disk drive
- Mellanox 40G InfiniBand switch (216 ports max)
- Symmetricom IEEE 1588 PTP master clock
- APC Smart-UPS RT 6000VA (18), 76 kW*
- 42U racks (9)
*65 nodes x 1.4 kW/node = 91 kW
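The power footnote above is a simple product; a quick sketch reproduces it and the per-rack load (the nine-rack split is the slide's, the even division is an assumption for illustration):

```python
# Sketch reproducing the slide's power footnote: 65 nodes at ~1.4 kW each.
nodes = 64 + 1            # 64 compute servers + 1 (head node assumption)
kw_per_node = 1.4
total_kw = nodes * kw_per_node
racks = 9
print(f"{nodes} nodes x {kw_per_node} kW = {total_kw:.0f} kW "
      f"(~{total_kw / racks:.1f} kW/rack over {racks} racks)")
```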
18 Advanced HPC System Configuration
- 2U servers (64 + 1):
  - Dual 6-core, 2.66 GHz Intel Xeon 5650 (Westmere) CPUs
  - Dual Intel 5520 (Tylersburg-36D) IOH with 6.4 GT/s QPI
  - Four 16-lane + two 8-lane PCI Express Gen 2 slots (with switch)
  - Six 8 GB DDR3 DIMMs (48 GB)
  - Three Nvidia Tesla M2090 (Fermi) GPUs
  - Two Mellanox 40G InfiniBand Host Channel Adapters
  - One 250 GB SSD (solid-state disk)
- Mellanox 40G InfiniBand switch (216 ports max)
- Symmetricom IEEE 1588 PTP master clock
- APC Symmetra PX SY100K100F UPS (100 kW)
- 42U racks (4+1)
19 Future HPC System Configuration
- 2U servers (64 + 1):
  - Dual 8-core, 2.3 GHz Intel Xeon E5 (Sandy Bridge) CPUs
  - Four 16-lane + two 8-lane PCI Express Gen 3 slots (with switch)
  - Eight 8 GB DDR3 DIMMs (64 GB)
  - Three Nvidia Tesla M2090 (Fermi) GPUs
  - Two Mellanox 56G InfiniBand Host Channel Adapters
  - One 250 GB SSD (solid-state disk)
- Mellanox 56G InfiniBand switch (648 ports max)
- Symmetricom IEEE 1588 PTP master clock
- APC Symmetra PX SY100K100F UPS (100 kW)
- 42U racks (4+1)
20 IEEE 1588 Precision Time Protocol
- IEEE 1588 Precision Time Protocol (PTP) Version 2 overcomes network and application latency and jitter through hardware time stamping at the physical layer of the network.
- IEEE 1588 provides time transfer accuracy down to the sub-100 ns range, a significant improvement in time synchronization accuracy over Network Time Protocol (NTP).
- The Symmetricom XLi Grandmaster is IEEE 1588 PTP V2 compliant and time stamps PTP packets with a time stamp accuracy of 50 ns to UTC. Measured synchronization accuracy at a PTP client has been shown to be as good as a 17 ns offset from the XLi Grandmaster.
- Operating at 100BaseT line speed with deep time stamp packet buffers, the XLi Grandmaster can support thousands of 1588 clients.
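The synchronization itself rests on the standard PTP delay-request exchange: the client combines four timestamps to separate its clock offset from the path delay, assuming a symmetric path. A minimal sketch of that arithmetic (the timestamps below are invented for illustration):

```python
# Standard IEEE 1588 delay-request arithmetic. t1 = master sends Sync,
# t2 = client receives it, t3 = client sends Delay_Req, t4 = master
# receives it. Assumes the network path is symmetric.
def ptp_offset_and_delay(t1, t2, t3, t4):
    offset = ((t2 - t1) - (t4 - t3)) / 2   # client clock minus master clock
    delay  = ((t2 - t1) + (t4 - t3)) / 2   # one-way path delay
    return offset, delay

# Illustrative: client running 500 ns fast over a 200 ns symmetric path.
print(ptp_offset_and_delay(1000, 1700, 2000, 1700))  # -> (500.0, 200.0)
```

Hardware time stamping matters because it makes t2 and t3 reflect the wire, not software queueing, which is what pushes accuracy from NTP's millisecond scale down to the tens of nanoseconds quoted above.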
21 Uninterruptable Power Supply (UPS)
- APC Symmetra PX 100kW
- Scalable to 100kW/100kVA
- 208V 3PH, 332A service
22 APC Symmetra PX Performance (performance chart)
23 HPC Performance

                            Node           System
Cores (CPU/GPU)             12/1536        768/98304
CPU SP FP performance       128 GFLOPS     8 TFLOPS
CPU DP FP performance       64 GFLOPS      4 TFLOPS
GPU SP FP performance       3990 GFLOPS    255 TFLOPS
GPU DP FP performance       1995 GFLOPS    128 TFLOPS
Main memory size            48 GB          3 TB
Main memory BW              64 GB/s        4 TB/s
Disk size                   250 GB         16 TB
Disk IOPS (4 KB)            20K            1.28M
Disk R/W BW                 500/315 MB/s   32/20 GB/s
Network BW                  50 Gb/s        3.2 Tb/s
Power                       1.5 kW         100 kW
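The system column is the node column scaled by 64 compute nodes (treating the 65th server as a head node is an assumption here). A quick sketch verifies the scaling for a few rows:

```python
# Sketch: check node-to-system scaling for 64 compute nodes.
# Node figures match three M2090 GPUs: 3 x 665 = 1995 DP GFLOPS,
# 3 x 512 = 1536 CUDA cores.
node = {"gpu_dp_gflops": 1995, "gpu_cores": 1536, "mem_gb": 48}
system = {k: v * 64 for k, v in node.items()}
print(system)
```

This gives 127,680 GFLOPS (~128 TFLOPS DP), 98,304 GPU cores, and 3,072 GB (~3 TB) of main memory, matching the table.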
24 HPC Procurement Schedule
- Breadboard performance evaluation: 15 JUL
- Finalize HPC configuration: 15 JUL
  - # Fermi processors (4 -> 3); # IB adapters (1 -> 2); UPS (100 kW); server (4U -> 2U); SSD
- Request final vendor quotes: 1 AUG
- HPC vendor selection; issue HPC system purchase order
- HPC system integration & test by vendor: by OCT 31 (6-12 week delivery ARO)
- Installation: by DEC 31; prepare electrical supply for UPS
25 REAL-TIME LINUX
26 Real-Time Operating System (RTOS) Requirements
- No dropped frames during a simulation run
- Support Nvidia's CUDA
- Support InfiniBand adapter with GPUDirect
- Support Precision Time Protocol (PTP) IEEE 1588
Candidate RTOS:
- Concurrent Computer RedHawk
- RedHat MRG (Messaging, Real-Time, Grid)
27 Interrupt Dispatch Latency* (chart)
*Ravi Malhotra, Real-Time Performance on Linux-based Systems, 2011 Freescale Technology Forum
28 Real-Time Support on Linux*
- Traditionally, Linux is not a real-time operating system
- Designed for server throughput performance rather than embedded-systems latency
- Scheduling latencies can be unbounded
- The big kernel lock and other mechanisms (softirq) typically end up blocking real-time-critical tasks
- Processes cannot be pre-empted while executing system calls
*Ravi Malhotra, Real-Time Performance on Linux-based Systems, 2011 Freescale Technology Forum
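One facility stock Linux does offer, and which the RT patch builds on, is the fixed-priority real-time scheduling class. A hedged sketch of requesting it (this needs CAP_SYS_NICE; the fallback keeps the script runnable without privileges, and it is a sketch, not the RedHawk or MRG configuration):

```python
# Hedged sketch: ask the kernel for the SCHED_FIFO real-time class.
# Without CAP_SYS_NICE this raises PermissionError; on platforms without
# the call it raises AttributeError. Both fall back gracefully.
import os

def try_sched_fifo(priority=10):
    try:
        os.sched_setscheduler(0, os.SCHED_FIFO, os.sched_param(priority))
        return "SCHED_FIFO"
    except (PermissionError, AttributeError, OSError):
        return "SCHED_OTHER (real-time class unavailable)"

print("scheduler:", try_sched_fifo())
```

Under SCHED_FIFO a runnable task preempts all normal-priority work, but it only bounds scheduling latency, not the kernel-side blocking (big kernel lock, softirqs) listed above; eliminating those is what the RT patch addresses.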
29 Sources of Latency & How the RT Patch Helps* (chart)
*Ravi Malhotra, Real-Time Performance on Linux-based Systems, 2011 Freescale Technology Forum
30 HPC PERFORMANCE MODEL
31 Hyperformix Workbench Performance Model (model diagram)
32 Workbench Model Steps
The application consists of 9 steps that comprise the generation and transfer of a frame:
1. Projector requests a frame (provides state data)
2. CPU sets up the frame generation process
3. CPU writes task data to CPU memory (DDR3 SDRAM)
4. CPU tasks the GPU to synthesize the frame
5. GPU reads the task data from CPU memory
6. GPU synthesizes the frame
7. GPU transfers the frame data to CPU memory
8. CPU tasks the InfiniBand network adapters to transfer the frame to the Crossfield Network Interface via the InfiniBand switch
9. Network adapters transfer the frame to FPGA memory using the RDMA protocol
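The nine steps above form a serial latency budget per frame. A toy sketch of that budget: the per-step times below are placeholders except for steps 6 and 9, which use the two response figures that survive on the model-results slide (1000 µs and 2259 µs).

```python
# Illustrative latency budget for the 9-step frame pipeline. All values
# marked "placeholder" are invented; steps 6 and 9 use the surviving
# figures from the Workbench model results.
steps_us = {
    "1 frame request":             10,    # placeholder
    "2-3 setup + write to memory": 50,    # placeholder
    "4 CPU tasks GPU":             10,    # placeholder
    "5 GPU reads task data":       20,    # placeholder
    "6 GPU synthesizes frame":     1000,  # from model results
    "7 frame to CPU memory":       200,   # placeholder
    "8 CPU tasks IB adapter":      10,    # placeholder
    "9 RDMA to NI FPGA memory":    2259,  # from model results
}
total = sum(steps_us.values())
print(f"total {total} us -> max frame rate ~{1e6 / total:.0f} Hz")
```

Even with the placeholder steps near zero, the two modeled steps alone (~3.3 ms) bound the serial frame rate near 300 Hz, which is why the architecture pipelines frames across multiple nodes rather than serializing on one.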
33 Hyperformix Workbench Performance Model (model diagram)
34 Workbench Model Results

Application step                                              Response (µs)
Step 1: frame request from projector                          —
Steps 2-3: setup process and write data to memory             —
Step 4: CPU tasks GPU                                         —
Step 5: GPU reads data from CPU memory                        —
Step 6: GPU synthesizes frame (first transfer)                1000
Step 7: GPU transfers frame to CPU memory                     —
Step 8: CPU tasks network adapter to transfer frame to NI     —
Step 9: network adapter transfers frame to NI FPGA memory     2259
All steps (Main_RT_App)                                       —
35 PROJECTOR INTERFACE
36 Projector Interfaces
FPGA Mezzanine Cards (FMC):
1. Two dual DVI
2. Parallel fiber optic ports (8-10)
3. Digital Micromirror Device (DMD) interface
All modules provide 2 user-definable I/Os, e.g. HWIL synchronization signal, Output Next Frame
More informationThe Future of Interconnect Technology
The Future of Interconnect Technology Michael Kagan, CTO HPC Advisory Council Stanford, 2014 Exponential Data Growth Best Interconnect Required 44X 0.8 Zetabyte 2009 35 Zetabyte 2020 2014 Mellanox Technologies
More informationLustre2.5 Performance Evaluation: Performance Improvements with Large I/O Patches, Metadata Improvements, and Metadata Scaling with DNE
Lustre2.5 Performance Evaluation: Performance Improvements with Large I/O Patches, Metadata Improvements, and Metadata Scaling with DNE Hitoshi Sato *1, Shuichi Ihara *2, Satoshi Matsuoka *1 *1 Tokyo Institute
More informationSupport for GPUs with GPUDirect RDMA in MVAPICH2 SC 13 NVIDIA Booth
Support for GPUs with GPUDirect RDMA in MVAPICH2 SC 13 NVIDIA Booth by D.K. Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda Outline Overview of MVAPICH2-GPU
More informationMemory Management Strategies for Data Serving with RDMA
Memory Management Strategies for Data Serving with RDMA Dennis Dalessandro and Pete Wyckoff (presenting) Ohio Supercomputer Center {dennis,pw}@osc.edu HotI'07 23 August 2007 Motivation Increasing demands
More informationA first look at 100 Gbps LAN technologies, with an emphasis on future DAQ applications.
21st International Conference on Computing in High Energy and Nuclear Physics (CHEP21) IOP Publishing Journal of Physics: Conference Series 664 (21) 23 doi:1.188/1742-696/664//23 A first look at 1 Gbps
More informationPerformance Analysis of Memory Transfers and GEMM Subroutines on NVIDIA TESLA GPU Cluster
Performance Analysis of Memory Transfers and GEMM Subroutines on NVIDIA TESLA GPU Cluster Veerendra Allada, Troy Benjegerdes Electrical and Computer Engineering, Ames Laboratory Iowa State University &
More informationIntel Enterprise Processors Technology
Enterprise Processors Technology Kosuke Hirano Enterprise Platforms Group March 20, 2002 1 Agenda Architecture in Enterprise Xeon Processor MP Next Generation Itanium Processor Interconnect Technology
More informationOceanStor 9000 InfiniBand Technical White Paper. Issue V1.01 Date HUAWEI TECHNOLOGIES CO., LTD.
OceanStor 9000 Issue V1.01 Date 2014-03-29 HUAWEI TECHNOLOGIES CO., LTD. Copyright Huawei Technologies Co., Ltd. 2014. All rights reserved. No part of this document may be reproduced or transmitted in
More informationCan Memory-Less Network Adapters Benefit Next-Generation InfiniBand Systems?
Can Memory-Less Network Adapters Benefit Next-Generation InfiniBand Systems? Sayantan Sur, Abhinav Vishnu, Hyun-Wook Jin, Wei Huang and D. K. Panda {surs, vishnu, jinhy, huanwei, panda}@cse.ohio-state.edu
More informationFuture Routing Schemes in Petascale clusters
Future Routing Schemes in Petascale clusters Gilad Shainer, Mellanox, USA Ola Torudbakken, Sun Microsystems, Norway Richard Graham, Oak Ridge National Laboratory, USA Birds of a Feather Presentation Abstract
More informationIn-Network Computing. Paving the Road to Exascale. 5th Annual MVAPICH User Group (MUG) Meeting, August 2017
In-Network Computing Paving the Road to Exascale 5th Annual MVAPICH User Group (MUG) Meeting, August 2017 Exponential Data Growth The Need for Intelligent and Faster Interconnect CPU-Centric (Onload) Data-Centric
More informationChelsio Communications. Meeting Today s Datacenter Challenges. Produced by Tabor Custom Publishing in conjunction with: CUSTOM PUBLISHING
Meeting Today s Datacenter Challenges Produced by Tabor Custom Publishing in conjunction with: 1 Introduction In this era of Big Data, today s HPC systems are faced with unprecedented growth in the complexity
More informationExploiting Full Potential of GPU Clusters with InfiniBand using MVAPICH2-GDR
Exploiting Full Potential of GPU Clusters with InfiniBand using MVAPICH2-GDR Presentation at Mellanox Theater () Dhabaleswar K. (DK) Panda - The Ohio State University panda@cse.ohio-state.edu Outline Communication
More informationE4-ARKA: ARM64+GPU+IB is Now Here Piero Altoè. ARM64 and GPGPU
E4-ARKA: ARM64+GPU+IB is Now Here Piero Altoè ARM64 and GPGPU 1 E4 Computer Engineering Company E4 Computer Engineering S.p.A. specializes in the manufacturing of high performance IT systems of medium
More informationNVIDIA GPUDirect Technology. NVIDIA Corporation 2011
NVIDIA GPUDirect Technology NVIDIA GPUDirect : Eliminating CPU Overhead Accelerated Communication with Network and Storage Devices Peer-to-Peer Communication Between GPUs Direct access to CUDA memory for
More informationRDMA in Embedded Fabrics
RDMA in Embedded Fabrics Ken Cain, kcain@mc.com Mercury Computer Systems 06 April 2011 www.openfabrics.org 2011 Mercury Computer Systems, Inc. www.mc.com Uncontrolled for Export Purposes 1 Outline Embedded
More informationIntel Workstation Technology
Intel Workstation Technology Turning Imagination Into Reality November, 2008 1 Step up your Game Real Workstations Unleash your Potential 2 Yesterday s Super Computer Today s Workstation = = #1 Super Computer
More informationN V M e o v e r F a b r i c s -
N V M e o v e r F a b r i c s - H i g h p e r f o r m a n c e S S D s n e t w o r k e d f o r c o m p o s a b l e i n f r a s t r u c t u r e Rob Davis, VP Storage Technology, Mellanox OCP Evolution Server
More informationOCTOPUS Performance Benchmark and Profiling. June 2015
OCTOPUS Performance Benchmark and Profiling June 2015 2 Note The following research was performed under the HPC Advisory Council activities Special thanks for: HP, Mellanox For more information on the
More informationPerformance of HPC Applications over InfiniBand, 10 Gb and 1 Gb Ethernet. Swamy N. Kandadai and Xinghong He and
Performance of HPC Applications over InfiniBand, 10 Gb and 1 Gb Ethernet Swamy N. Kandadai and Xinghong He swamy@us.ibm.com and xinghong@us.ibm.com ABSTRACT: We compare the performance of several applications
More informationFast packet processing in the cloud. Dániel Géhberger Ericsson Research
Fast packet processing in the cloud Dániel Géhberger Ericsson Research Outline Motivation Service chains Hardware related topics, acceleration Virtualization basics Software performance and acceleration
More informationMemcached Design on High Performance RDMA Capable Interconnects
Memcached Design on High Performance RDMA Capable Interconnects Jithin Jose, Hari Subramoni, Miao Luo, Minjia Zhang, Jian Huang, Md. Wasi- ur- Rahman, Nusrat S. Islam, Xiangyong Ouyang, Hao Wang, Sayantan
More informationOptimized Distributed Data Sharing Substrate in Multi-Core Commodity Clusters: A Comprehensive Study with Applications
Optimized Distributed Data Sharing Substrate in Multi-Core Commodity Clusters: A Comprehensive Study with Applications K. Vaidyanathan, P. Lai, S. Narravula and D. K. Panda Network Based Computing Laboratory
More informationSwitchX Virtual Protocol Interconnect (VPI) Switch Architecture
SwitchX Virtual Protocol Interconnect (VPI) Switch Architecture 2012 MELLANOX TECHNOLOGIES 1 SwitchX - Virtual Protocol Interconnect Solutions Server / Compute Switch / Gateway Virtual Protocol Interconnect
More informationEvaluating the Impact of RDMA on Storage I/O over InfiniBand
Evaluating the Impact of RDMA on Storage I/O over InfiniBand J Liu, DK Panda and M Banikazemi Computer and Information Science IBM T J Watson Research Center The Ohio State University Presentation Outline
More informationiscsi or iser? Asgeir Eiriksson CTO Chelsio Communications Inc
iscsi or iser? Asgeir Eiriksson CTO Chelsio Communications Inc Introduction iscsi is compatible with 15 years of deployment on all OSes and preserves software investment iser and iscsi are layered on top
More informationOpenFabrics Alliance Interoperability Logo Group (OFILG) May 2012 Logo Event Report
OpenFabrics Alliance Interoperability Logo Group (OFILG) May 2012 Logo Event Report UNH-IOL 121 Technology Drive, Suite 2 Durham, NH 03824 - +1-603-862-0090 OpenFabrics Interoperability Logo Group (OFILG)
More informationNVMe Takes It All, SCSI Has To Fall. Brave New Storage World. Lugano April Alexander Ruebensaal
Lugano April 2018 NVMe Takes It All, SCSI Has To Fall freely adapted from ABBA Brave New Storage World Alexander Ruebensaal 1 Design, Implementation, Support & Operating of optimized IT Infrastructures
More information