Asian Option Pricing on cluster of GPUs: First Results

1 Asian Option Pricing on cluster of GPUs: First Results (ANR project «GCPMF»)
S. Vialle (SUPELEC), L. Abbas-Turki (ENPC)
With the help of P. Mercier (SUPELEC). Previous work of G. Noaje, March-June 2008.

2 1 Building a GPU cluster for experimentations

3 1 Building a cluster of GPUs: 1.1 Objectives & Strategy
1. Build a 16-node cluster to experiment with distributed computing on GPUs.
2. Equip each node with a multi-core CPU and a GPU.
3. Choose hardware that supports asynchronous communications and a maximal overlapping strategy, that is, parallelization and overlapping of GPU computation, CPU-GPU communications, CPU computations, and CPU-CPU communications (across the interconnection network).
[Diagram: nodes, each with a CPU, a GPU and RAM, linked by an interconnection network]

4 1 Building a cluster of GPUs: 1.2 Hardware choice
GPU on each node: ASUS GeForce 8800 GT
    Product model: EN8800GT/G/HTDP/512M
    Multiprocessors: 14
    Stream processors: 112
    Core clock: 600 MHz
    Memory clock: 900 MHz
    Memory amount: 512 MB
    Memory interface: 256-bit
    Memory bandwidth: 57.6 GB/s
    Texture fill rate: 33.6 billion/s
    Asynchronous communications and CUDA 1.1 supported
CPU on each node: one dual-core Intel E8200 processor, 2.66 GHz
    Front-side bus: 1333 MHz
    RAM: 4 GB DDR3
    Cache: 6 MB

5 1 Building a cluster of GPUs: 1.3 Software installed
Software                           Installed & available
MPICH-2                            yes
OpenMPI                            yes
GCC (with OpenMP)                  yes
ICC (with OpenMP)                  no, coming soon
CUDA 1.1                           yes
OAR                                no, coming soon
Linux Fedora Core 8 (64-bit kernel) yes
This software is installed to support various experiments. Other software can be installed: contact us.

6 1 Building a cluster of GPUs: 1.4 Interconnection networks
Network                Installed & available
Gigabit Ethernet       yes
Infiniband             yes, on half of the cluster
10-Gigabit Ethernet    no, next year
GPUs compute very fast, so network communication times are not negligible. The goal is to experiment and identify the best (or least bad) network.

7 2 First benchmarks on a GPU cluster

8 2 First benchmarks: Distributed matrix product
Principles (C = A x B):
1. Matrices A and B are partitioned over P PCs.
2. The B partition is static.
3. The A partition circulates around the ring of PCs.
4. The algorithm takes P steps.
5. At each step, each PC (0, 1, ..., P-1) computes a part of the C matrix.
6. At the end, the matrix C = A x B is distributed over the P PCs (0, 1, ..., P-1).
Each local computation runs on the GPU.
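The P-step ring scheme above can be sketched in plain C as a single-process simulation. This is only an illustration under stated assumptions: the slides do not say how A and B are cut, so the sketch assumes A is split into row slabs and B into column slabs, with PE `pe` keeping column slab `pe` of B and receiving row slab `(pe + step) % P` of A at each step. The names `multiply_tile` and `ring_matrix_product` are hypothetical, and the real code would use MPI for the circulation and CUDA for the local product.

```c
#include <assert.h>
#include <stddef.h>

/* Multiply one row slab of A by one column slab of B and write the
   matching tile of C. All matrices are n x n, stored row-major. */
static void multiply_tile(const double *A, const double *B, double *C,
                          size_t n, size_t row0, size_t col0, size_t bs)
{
    for (size_t i = row0; i < row0 + bs; i++) {
        for (size_t j = col0; j < col0 + bs; j++) {
            double acc = 0.0;
            for (size_t k = 0; k < n; k++)
                acc += A[i * n + k] * B[k * n + j];
            C[i * n + j] = acc;
        }
    }
}

/* Single-process simulation of the ring algorithm (assumed layout):
   PE `pe` owns column slab `pe` of B for the whole run, and at step
   `step` it holds row slab (pe + step) % P of A. After P steps every
   tile of C has been computed exactly once. */
void ring_matrix_product(const double *A, const double *B, double *C,
                         size_t n, size_t P)
{
    assert(P > 0 && n % P == 0);
    size_t bs = n / P;                            /* slab width */
    for (size_t step = 0; step < P; step++) {     /* P steps */
        for (size_t pe = 0; pe < P; pe++) {       /* each PE works in parallel */
            size_t a_slab = (pe + step) % P;      /* slab currently held by pe */
            multiply_tile(A, B, C, n, a_slab * bs, pe * bs, bs);
        }
    }
}
```

In a distributed version the inner loop over `pe` disappears (each process runs its own iteration) and the change of `a_slab` between steps becomes a send to one ring neighbour and a receive from the other.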

9 2 First benchmarks: Distributed matrix product
One step on PE i:
    Circulation of the A partition between PE i and its ring neighbours PE i-1 and PE i+1
    CPU-GPU data transfers
    Computation of C on the GPU

10 2 First benchmarks: Distributed matrix product
MPI on a cluster of CPUs: 1 core/node + Gigabit Ethernet.
Computation time >> communication time, so overlapping has no impact.
The execution time decreases regularly as nodes are added (good scalability).
[Figure, two panels: "MPI - No Overlap (1 core / node)" and "MPI - Overlap (1 core / node)". Axes: Texec (s) versus number of nodes. Curves: MPI-NoOverlap-LoopTime, MPI-NoOverlap-ComputTime, MPI-NoOverlap-CommTime; MPI-Overlap-LoopTime, MPI-Overlap-ComputTime, MPI-Overlap-WaitTime.]

11 2 First benchmarks: Distributed matrix product
MPI+CUDA on a cluster of CPU-GPUs: 1 core/node + 1 GPU/node + Gigabit Ethernet.
Computation time is no longer much larger than communication time: Gigabit Ethernet is not fast enough for GPU communications!
Overlapping CPU communications with GPU computation has an impact: the overlap is incomplete, but it seems to be the right strategy.
[Figure, two panels: "MPI - No Overlap" and "MPI - Overlap". Axes: Texec (s) versus number of nodes. Curves: MPI+CUDA-NoOverlap-LoopTime, MPI+CUDA-NoOverlap-ComputTime, MPI+CUDA-NoOverlap-CommTime; MPI+CUDA-Overlap-LoopTime, MPI+CUDA-Overlap-ComputTime, MPI+CUDA-Overlap-WaitTime.]

12 2 First benchmarks: Distributed matrix product
Finally: 2.1 GFlops on 1 CPU versus 155 GFlops on 8 GPUs, but much of the time is spent in cluster communications.
[Table: matrix product (6080x6080) on the GPLEC cluster, GFlops by number of nodes for CPUs (1 core/node) and GPUs (1 GPU/node). Surviving values: 77 GFlops and 155 GFlops.]
The network is slow and the overlap is incomplete!
The Infiniband interconnect does not improve performance (a difference appears for a larger number of nodes).
Results are encouraging but far from peak performance!
[Figure: performance (GFlops) versus number of nodes. Curves: MPI+CUDA - Overlap, MPI+CUDA - No overlap, MPI - Overlap (1 core/node).]
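The GFlops figures for the matrix product can be related to execution times with a short helper. This is a sketch under an assumption the slides do not state: that an n x n by n x n product is counted as 2n^3 floating-point operations (one multiply and one add per inner-loop iteration), which is the usual convention. The function name `matmul_gflops` is hypothetical.

```c
#include <assert.h>

/* Sustained rate of an n x n matrix product, in GFlops, given its
   execution time in seconds. Assumes the usual count of 2*n^3
   floating-point operations; the slides do not state their convention. */
double matmul_gflops(double n, double seconds)
{
    assert(n > 0.0 && seconds > 0.0);
    double flops = 2.0 * n * n * n;
    return flops / seconds / 1e9;
}
```

Independently of the operation count, the two rates quoted on the slide give a ratio of 155 / 2.1, that is about 74 between 8 GPUs and 1 CPU core.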

13 3 Parallelization of an «Asian Option Pricer» on a GPU cluster

14 3 Parallelization of an Asian Option Pricer: Parallelization principle
1. Read input data (PE-0).
2. Broadcast input data (from PE-0 to PE-1 ... PE-P-1).
3. Transfer data to the GPU (on each PE).
4. Run computation on the GPU (on each PE).
5. Transfer results to the CPU (on each PE).
6. Make final computations (PE-0 ... PE-P-1).
7. Print result (PE-0).

15 3 Parallelization of an Asian Option Pricer: Implementation (1)

    int main(int argc, char **argv)
    {
      ...                                  // Variable declarations

      // MPI initializations.
      MPI_Init(&argc, &argv);
      MPI_Comm_size(MPI_COMM_WORLD, &NbPE);
      MPI_Comm_rank(MPI_COMM_WORLD, &Me);

      // Input data file reading from PE 0.
      if (Me == 0) {
        InitStockParaCPU();
      }

      // Broadcast input data to all PEs.
      BroadcastInputData();

      // Transfer input data to the GPU.
      InitStockParaGPU();

16 3 Parallelization of an Asian Option Pricer: Implementation (2)

      // Call «kernels» on the GPU, and transfer results to the CPU.
      for (int jj = 0; jj <= N; jj++) {
        for (int k = 0; k < NbStocks; k++) {
          ComputeUniformRandom();
          GaussPRNG(k);
          OutputInputPRNG();
        }
        ActStock(jj);
        for (int k = 0; k < NbStocks; k++) {
          AsianSum(k, jj);
          OutputInputSum(k);
        }
      }
      ComputeIntegralSum();
      ComputePriceSum();

      // CPU computations: accumulate the payoff and its square.
      for (int i = 0; i < Nx; i++) {
        for (int j = 0; j < Ny; j++) {
          value = maxi((float)(BasketPriceCPU[i][j] - BasketSumCPU[i][j]), 0);
          sum = sum + value;
          sum2 = sum2 + value * value;
        }
      }

17 3 Parallelization of an Asian Option Pricer: Implementation (3)

      // Collect all results on PE-0.
      MPI_Reduce(&sum, &TotalSum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
      MPI_Reduce(&sum2, &TotalSum2, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

      // Compute the last result on PE-0.
      if (Me == 0) {
        value = exp(-r) * (TotalSum / (((double)Nx) * Ny * NbPE));
        fprintf(stdout, "computed price: %f\n", (float)value);
      }

      // Close MPI mechanisms.
      MPI_Finalize();
      return(EXIT_SUCCESS);
    }

18 3 Parallelization of an Asian Option Pricer: Compilation
OpenMPI + CUDA:

    nvcc -O3 \
         -I/opt/openmpi/include/ \
         -I/usr/include/c++/4.1.2/ \
         -DOMPI_SKIP_MPICXX \
         -o AsianPricer *.cu

    -O3                          Serial automatic optimizations
    -I/opt/openmpi/include/      MPI include files
    -I/usr/include/c++/4.1.2/    Include files required by MPI
    -DOMPI_SKIP_MPICXX           NVCC does not support «exceptions»
    -o AsianPricer *.cu          ALL source files are .cu files

Compilation using OpenMPI+CUDA appears easy when all files have the .cu extension.

19 4 Usage and performances of an «Asian Option Pricer» on a GPU cluster

20 4 Usage and performances: Experimental performances
From 1 to 4 nodes: size up (to reach the accuracy of 10^6 trajectories).
Beyond 4 nodes: speedup (to complete the computation faster).
[Figure: "Asian Pricing on GPU cluster", T (s) versus number of CPU+GPU nodes, with a size-up region followed by a speedup region. Curves: T-TotalExec (s), T-CalculGPU, T-Transfert, T-DataBcast, T-IO, T-CalculComm.]

21 4 Usage and performances: Experimental performances
GPU computations and CPU-GPU transfers scale well, but the data broadcast could become a problem.
[Figure: "Asian Pricing on GPU cluster", T (s) versus number of CPU+GPU nodes. Curves: T-TotalExec (s), T-CalculGPU, T-Transfert, T-DataBcast, T-IO, T-CalculComm.]

22 4 Usage and performances: Experimental speedup
The «speedup» part of the experiment (from 4 nodes to 16 nodes) exhibits a correct relative speedup (compared to the execution on 4 nodes).
[Figure: "Asian pricing on GPU cluster", speedup versus 4 nodes (SU-vs-4Nodes) against the number of 4-node groups (Nb of 4Nodes, from 1 to 4). Curves: SU-ideal(X) = X and GPU-SU vs 4Nodes.]
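The relative speedup plotted against SU-ideal(X) = X can be computed as below. This is a minimal sketch: the function names are hypothetical, and the times must come from actual runs since the slide's measured values are not reproduced here.

```c
#include <assert.h>

/* Relative speedup of a run against the baseline run
   (the 4-node execution in the experiment above). */
double relative_speedup(double t_baseline, double t_run)
{
    assert(t_baseline > 0.0 && t_run > 0.0);
    return t_baseline / t_run;
}

/* Efficiency: measured speedup divided by the ideal one,
   SU-ideal(X) = X, where X = run_nodes / baseline_nodes. */
double relative_efficiency(double t_baseline, double t_run,
                           int baseline_nodes, int run_nodes)
{
    assert(baseline_nodes > 0 && run_nodes >= baseline_nodes);
    double x = (double)run_nodes / baseline_nodes;
    return relative_speedup(t_baseline, t_run) / x;
}
```

With a 4-node baseline, a 16-node run corresponds to X = 4, so an efficiency of 1.0 means the run sits exactly on the ideal line.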

23 4 Usage and performances: Experimental speedup (to do!)
The «speedup» of the GPU cluster compared to an execution on 1 CPU and 1 core is: ???
Requires a sequential execution on the CPU of one node of the GPU cluster. TO DO!
Previously measured close to 100 on other systems. The GPU cluster could achieve a speedup close to 360, compared to a sequential execution on one CPU of the same cluster.
The «speedup» of the GPU cluster compared to an execution on a cluster of P multi-core CPUs is: ???
Requires a parallel MPI+OpenMP execution on the CPUs of the GPU cluster. TO DO!

24 5 Conclusion and perspectives

25 5 Conclusion and perspectives
Current results are promising.
Size up + speedup seems the realistic way to use a cluster of GPUs.
Future work:
    Optimize parallel algorithms and source code (many issues to investigate in MPI+CUDA programming).
    Measure performance on a cluster of multi-core CPUs, and compare.
    Measure the energy consumed, and compare CPU and GPU energy performance.
Next events:
    2nd JTE-GPGPU (December 4, 2008, Paris)
    PDCoF 09 (May 2009, Rome, Italy)

26 Asian Option Pricing on cluster of GPUs: First Results (ANR project «GCPMF») Questions?
