7 DAYS AND 8 NIGHTS WITH THE CARMA DEV KIT

Cecil Fitzgerald

2 Murex Analytics
- Only global vendor of trading, risk management and processing systems that also focuses on analytics.
- A team of quants with math and finance skills who are fully hardware-aware.
- Pioneer in GPU programming for financial derivatives since 2008.
- We have learned that we need to program with disruptive technologies in mind, to be ready in time when they become mainstream.
- We need to consume less power to deliver more results at the same cost.

3 Murex basic compute architecture
- Database
- Business layer (MapReduce-like)
- Analytic layer (grid-computing-like)
- 1 grid engine

4 Murex single grid engine evolution
Always faster, to become real time on a big problem.
Murex architecture evolution for the computation of 1 single query:
- Legacy: 1 mono-threaded task attached to 1 core.
- GPU or multiple cores: 1 mono-threaded task launching multi-core or GPU kernels on the same computer.
- Multiple CPU & GPU: 1 multithreaded task launching kernels on a cluster of CPU- and GPU-powered computers using MPI.

5 Current Grid Engine Cluster version analyzed when GPUs are used
- Engine head node: fast x86 computer specialized in sequential code treatment and I/O (Java, C, Python, MPI).
- Compute nodes 1 to N: 2 M2090 + 2 x86 each (C, OpenCL, MPI).
- The head node and the compute nodes are linked through MPI.

6 Current Grid Engine Cluster version analyzed when GPUs are used
- Engine head node: fast x86 computer specialized in sequential code treatment and I/O.
- Compute nodes 1 to N (2 M2090 + 2 x86 each, linked through MPI): they only make kernel calls and run very small code.
- The x86 CPUs, hard disks and memory are useless at the level of the slave nodes.

7 Port the compute nodes to CARMA to keep only the essential: the GPU
- Build a BG-like architecture.
- Murex received 4 CARMA dev kits on 22 October 2012, and a mini cluster has been built.
- Easy setup, since Ubuntu and the CUDA samples are preinstalled.

8 Port the compute nodes to CARMA: fast track
- Recompile MPICH2 directly on CARMA (32-bit Ubuntu).
- Install MPICH2 on a 32-bit head node; we chose a PC running Windows XP.
- Automatically convert the Murex OpenCL code to CUDA with a tool developed in-house.
- Build a PTX file on a 64-bit Ubuntu server.
- Compile the compute node code directly on CARMA. No cross-compilation is needed, thanks to the use of the CUDA driver API.
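The slides do not describe how the in-house OpenCL-to-CUDA converter works. As a rough illustration of the idea only, the sketch below applies a small, hypothetical table of textual rewrites to a one-dimensional OpenCL kernel. The function name `opencl_to_cuda`, the rewrite table and the sample kernel are all assumptions made for this example; a real converter would need a proper parser and would have to cover far more of the language.

```python
import re

# Hypothetical, deliberately tiny rewrite table (illustration only, not the
# Murex tool). Each entry maps an OpenCL construct to a CUDA counterpart.
_REWRITES = [
    # extern "C" keeps the kernel name unmangled, which makes it easy to
    # look up by name once the code has been compiled to PTX.
    (r"\b__kernel\b", 'extern "C" __global__'),
    # CUDA device pointers need no address-space qualifier.
    (r"\b__global\s+", ""),
    (r"\b__local\b", "__shared__"),
    # Valid for 1-D launches without a global offset.
    (r"\bget_global_id\(0\)", "(blockIdx.x * blockDim.x + threadIdx.x)"),
    (r"\bget_local_id\(0\)", "threadIdx.x"),
    (r"\bget_group_id\(0\)", "blockIdx.x"),
    (r"\bbarrier\(CLK_LOCAL_MEM_FENCE\)", "__syncthreads()"),
]


def opencl_to_cuda(source: str) -> str:
    """Apply the rewrite table, in order, to OpenCL C source text."""
    for pattern, replacement in _REWRITES:
        source = re.sub(pattern, replacement, source)
    return source


# Sample kernel invented for this example.
OPENCL_KERNEL = """__kernel void scale(__global float* x, const float a, const int n) {
    int i = get_global_id(0);
    if (i < n) x[i] = a * x[i];
}"""

CUDA_KERNEL = opencl_to_cuda(OPENCL_KERNEL)
print(CUDA_KERNEL)
```

The converted source would then go through the PTX build step listed above, and the compute node would load the resulting PTX at run time through the CUDA driver API.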

9 Toward a first benchmark on a basic case
- On all fronts except consumption and memory, the M2090 is better than 4 CARMA dev kits. Let's see how close we can get. Very challenging.
- Configurations compared: 1 x86 core + 1 x86 host + 1 M2090 (MPI) versus 1 x86 core + 4 CARMA dev kits.
- Mem (GB): 6 versus 8.
- Peak SP GFlops: 1331 versus 1080 (#0.8 ratio).
- Device-to-device bandwidth (GB/s): #0.7 ratio.
- Host-to-device bandwidth (GB/s): cannot be multiplied by 4, since it is only used for kernel arguments.
- Consumption (W), split into GPU, x86 GPU host and maximum consumption: figures quoted are 225 (theoretical usage), standby, 160 (full usage) and # peak 160; #2.0 ratio.
- Network: IB versus 1 Gbit Ethernet.
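The "#0.8" ratio can be checked against the two peak figures that survive in the comparison (1331 and 1080 SP GFlops, 6 and 8 GB). The short sketch below does that arithmetic and derives the per-kit share, assuming the four kits are identical and that the first figure of each pair belongs to the M2090 configuration; the variable names are chosen for this example.

```python
# Figures quoted in the comparison: M2090 setup first, 4 CARMA kits second.
M2090_PEAK_SP_GFLOPS = 1331.0
CARMA_X4_PEAK_SP_GFLOPS = 1080.0
M2090_MEM_GB = 6.0
CARMA_X4_MEM_GB = 8.0
N_KITS = 4  # assumption: the four kits are identical


def ratio(carma_value: float, m2090_value: float) -> float:
    """Ratio of the 4-kit CARMA figure to the M2090 figure."""
    return carma_value / m2090_value


flops_ratio = ratio(CARMA_X4_PEAK_SP_GFLOPS, M2090_PEAK_SP_GFLOPS)
mem_ratio = ratio(CARMA_X4_MEM_GB, M2090_MEM_GB)
per_kit_gflops = CARMA_X4_PEAK_SP_GFLOPS / N_KITS
per_kit_mem_gb = CARMA_X4_MEM_GB / N_KITS

print(f"peak SP GFlops ratio: {flops_ratio:.2f}")
print(f"GPU memory ratio:     {mem_ratio:.2f}")
print(f"per kit: {per_kit_gflops:.0f} GFlops, {per_kit_mem_gb:.0f} GB")
```

The GFlops ratio comes out at about 0.81, consistent with the "#0.8" on the slide, while the memory ratio is above 1, which matches the remark that memory is one of the two fronts where the kits are ahead.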

10 A first test case: 365-day scripted Asian option on a single stock with BS MC
Real production code subset ported: very small load
- Parallel random number generator
- Brownian bridge
- BS diffusion
- Payoff scripting subset
Pure compute time is 0. Only sequential time, network time and kernel launch time matter. The test case was built to break and show:
- Network traffic through MPI has a bigger effect when using CARMA. Its usage will have to be balanced by bigger or slower compute tasks.
- The speed of the I/O node matters. The CARMA kits were linked to a developer PC node slower than the high-end Xeon server driving the M2090 cluster.
- But most of all, the PCIe link is very slow, so kernel calls and host-to/from-device transfers will have to be limited.
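For readers unfamiliar with the test case, here is a minimal CPU-only sketch of pricing an arithmetic-average Asian call on one stock by Monte Carlo under Black-Scholes dynamics with 365 daily fixings. It is not the production code: the one-year maturity, the call payoff, the parameter values and the function name `asian_call_mc` are assumptions made for this example, and the Brownian bridge and payoff scripting steps listed above are replaced by a plain incremental path and a hard-coded payoff.

```python
import math
import random


def asian_call_mc(s0, strike, rate, vol, n_days=365, n_paths=1000, seed=42):
    """Monte Carlo price of an arithmetic-average Asian call.

    Assumes Black-Scholes dynamics, a one-year maturity and one fixing per
    day (all assumptions of this sketch, not taken from the slides).
    """
    rng = random.Random(seed)
    maturity = 1.0
    dt = maturity / n_days
    drift = (rate - 0.5 * vol * vol) * dt
    diffusion = vol * math.sqrt(dt)

    payoff_sum = 0.0
    for _ in range(n_paths):
        log_s = math.log(s0)
        running_sum = 0.0
        for _ in range(n_days):
            # Exact log-normal step of the Black-Scholes diffusion.
            log_s += drift + diffusion * rng.gauss(0.0, 1.0)
            running_sum += math.exp(log_s)
        average = running_sum / n_days
        payoff_sum += max(average - strike, 0.0)

    # Discount the average payoff back to today.
    return math.exp(-rate * maturity) * payoff_sum / n_paths


# Hypothetical market data, for illustration only.
price = asian_call_mc(s0=100.0, strike=100.0, rate=0.05, vol=0.2)
print(f"Asian call estimate: {price:.4f}")
```

Each path is independent of the others, which is what makes the problem map well onto a GPU: in a port, the outer loop over paths becomes the thread index, and only the final sum has to leave the device.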

11 A first test case: 365-day scripted Asian option on a single stock with BS MC
Real production code subset ported: usual load
- Speed: with 4 CARMA kits, the evaluation time is close to the one obtained with the M2090.
- Consumption: the CARMA dev kits are always more efficient. The numbers shown are very conservative, since the peak consumption used to compute them was rarely reached during our tests.

12 Why has it worked so well?
- We don't have a good PCIe link. But we try our best to generate and keep data on the GPU.
- We don't have IB. But with MC, most calls are non-blocking, and fine-tuned grouping of small messages makes us far less sensitive to network latency.
- We have only 1 GB of RAM available on the ARM side once Linux is loaded. But we only care about the GPU memory, and the memory of the Quadro card is large relative to the flops generated by the GPU.
- ARM will be slow at running sequential code. But we keep that code at the level of the I/O node, which is still x86-based. We will investigate shortly whether we can offload more code to the Tegra.
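Why grouping small messages helps can be seen with a simple cost model in which every send pays a fixed latency plus a payload time proportional to its size. The sketch below uses that model; the function name `total_send_time`, the 50 microsecond latency and the message sizes are hypothetical values picked for illustration, not measurements from the CARMA cluster. The 125 MB/s bandwidth is simply 1 Gbit/s expressed in bytes.

```python
import math


def total_send_time(n_messages, bytes_per_message, latency_s,
                    bandwidth_bytes_per_s, group_size=1):
    """Time to ship n_messages when group_size of them share one send.

    Simple latency + size / bandwidth model; ignores overlap with compute.
    """
    if group_size < 1:
        raise ValueError("group_size must be at least 1")
    n_sends = math.ceil(n_messages / group_size)
    payload_bytes = n_messages * bytes_per_message
    return n_sends * latency_s + payload_bytes / bandwidth_bytes_per_s


# Hypothetical workload: 10,000 messages of 64 bytes over 1 Gbit Ethernet.
LATENCY_S = 50e-6          # assumed per-send latency
BANDWIDTH_BPS = 125e6      # 1 Gbit/s in bytes per second

ungrouped = total_send_time(10_000, 64, LATENCY_S, BANDWIDTH_BPS)
grouped = total_send_time(10_000, 64, LATENCY_S, BANDWIDTH_BPS, group_size=100)

print(f"one send per message: {ungrouped:.5f} s")
print(f"100 messages per send: {grouped:.5f} s")
```

In this model the payload time is the same in both cases; only the number of latency hits changes, which is why grouping matters most when messages are small and the network has high latency.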

13 TO BE CONTINUED THANKS
