Data-intensive computing in radiative transfer modelling

1 German Aerospace Center (DLR), Remote Sensing Technology Institute (IMF). Data-intensive computing in radiative transfer modelling. Dmitry Efremenko, Diego Loyola, Adrian Doicu, Thomas Trautmann

2 Motivation
Instrument | Spatial resolution | Level 1 data per year
SCIAMACHY | 30 km x 60 km | 2 TB
Sentinel 5 Precursor (S5P) | 7 km x 7 km | 180 TB

3 Performance requirements
Earth surface area: ~5x10^8 km^2
Pixel area: ~50 km^2
Number of ground pixels per day: ~10,000,000
Processing time for one pixel: ~0.01 sec
O spectral lines 1 spectral line per sec
Radiative transfer solvers have to be accelerated to be used in near-real-time trace gas retrieval algorithms.
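The pixel count and the per-pixel time budget on this slide follow from simple arithmetic. The sketch below reproduces them, assuming one full coverage of the Earth's surface per day (an assumption; the slide only gives the rounded figures). The constant and function names are illustrative.

```python
# Rounded figures taken from the slide.
EARTH_SURFACE_KM2 = 5e8
PIXEL_AREA_KM2 = 50.0
# Assumption: one global coverage per day.
SECONDS_PER_DAY = 24 * 60 * 60


def pixels_per_day(surface_km2=EARTH_SURFACE_KM2, pixel_km2=PIXEL_AREA_KM2):
    """Number of ground pixels needed to tile the surface once."""
    return surface_km2 / pixel_km2


def time_budget_per_pixel(n_pixels, seconds=SECONDS_PER_DAY):
    """Wall-clock seconds available per pixel to keep up with the data stream."""
    return seconds / n_pixels


n_pixels = pixels_per_day()
budget = time_budget_per_pixel(n_pixels)
print(f"{n_pixels:.0f} pixels per day, {budget:.5f} s per pixel")
```

The budget of slightly under 0.01 s per pixel is what motivates the acceleration techniques in the rest of the deck.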

4 Loop hierarchy
// loop over ground pixels
foreach ground_pixel:
    // loop over wavelengths
    for wl_start to wl_end:
        // loop over cloud fractions
        for cloud_free and cloudy:
            call_rte_solver( );
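The loop hierarchy on this slide can be written as a runnable sketch to see how quickly the number of solver calls grows. The solver here is a hypothetical stub (its signature is an assumption, not the authors' interface); the 80 wavelengths are the figure quoted on the Background slide.

```python
def call_rte_solver(pixel, wavelength, cloudy):
    """Hypothetical stand-in for the radiative transfer solver."""
    return 0.0  # placeholder radiance


def count_solver_calls(n_pixels, n_wavelengths):
    """Walk the three nested loops and count how often the solver runs."""
    calls = 0
    for pixel in range(n_pixels):              # loop over ground pixels
        for wl in range(n_wavelengths):        # loop over wavelengths
            for cloudy in (False, True):       # cloud-free and cloudy cases
                call_rte_solver(pixel, wl, cloudy)
                calls += 1
    return calls


calls = count_solver_calls(n_pixels=3, n_wavelengths=80)
print(calls)
```

Each of the three loops is the target of one acceleration technique discussed on the following slides.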

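5 // loop over wavelengths: PCA-based radiative transfer solver
Correction factor: $f(x_w) = \dfrac{I_N(x_w)}{I_2(x_w)}$, where $I_N$ is the multi-stream model and $I_2$ is the two-stream model.
Approximation in the reduced data space:
$$f(x_w) = f(\bar{x} + \Delta x_w) \approx f(\bar{x}) + \Delta x_w^T \nabla f(\bar{x}) + \frac{1}{2} \Delta x_w^T \nabla^2 f(\bar{x}) \Delta x_w$$
$$f(x_w) \approx f(\bar{x}) + \frac{1}{2} \sum_{k=1} \left[ f(\bar{x} + a_k) - f(\bar{x} - a_k) \right] y_{wk} + \frac{1}{2} \sum_{k=1} \left[ f(\bar{x} + a_k) - 2 f(\bar{x}) + f(\bar{x} - a_k) \right] y_{wk}^2$$
Computation of the spectrum: $I_N$ x 100 -> PCA (LPP, LPE, LEA, ...) -> $I_N$ x 5, $I_2$ x 100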
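The second-order expansion in the reduced data space can be illustrated with a small sketch. It assumes the central-difference reading of the slide's formula: the mean state x̄, the directions a_k and the principal component scores y_wk enter as plain Python lists, and `toy_f` is a made-up quadratic standing in for the correction factor f. None of these names come from the authors' code.

```python
def second_order_correction(f, x_mean, directions, scores):
    """Approximate f(x_mean + sum_k scores[k] * directions[k]) to second order.

    Uses central differences along each direction a_k, so f is evaluated
    only at x_mean and at x_mean +/- a_k.
    """
    def shifted(a, sign):
        return [xm + sign * ai for xm, ai in zip(x_mean, a)]

    f0 = f(x_mean)
    total = f0
    for a_k, y_k in zip(directions, scores):
        f_plus = f(shifted(a_k, +1.0))
        f_minus = f(shifted(a_k, -1.0))
        total += 0.5 * (f_plus - f_minus) * y_k                   # first-order term
        total += 0.5 * (f_plus - 2.0 * f0 + f_minus) * y_k ** 2   # second-order term
    return total


def toy_f(x):
    """Illustrative quadratic without cross terms (not a real RT model)."""
    return 1.0 + 2.0 * x[0] + 3.0 * x[0] ** 2 - x[1] + 0.5 * x[1] ** 2


x_mean = [0.0, 0.0]
directions = [[1.0, 0.0], [0.0, 1.0]]
scores = [0.3, -0.2]

approx = second_order_correction(toy_f, x_mean, directions, scores)
exact = toy_f([0.3, -0.2])
print(approx, exact)
```

For this toy function the expansion is exact because it is quadratic with no cross terms. In the actual application the evaluations of f at x̄ and x̄ ± a_k would be computed once and reused for every wavelength, which is where the saving comes from.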

6 // loop over wavelengths: PCA-based radiative transfer solver. The error of the results is below 0.5%, while the performance enhancement is about a factor of 8 for the ozone total column retrieval. Natraj et al. // JQSRT, 111 (2010); Efremenko et al. // JQSRT V.133. P

7 // loop over cloud fraction: Optimization of the independent pixel approximation
$I = c I_{cloudy} + (1 - c) I_{clear\_sky}$, where $c$ is the cloud fraction.
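The independent pixel approximation is a weighted sum of two solver results, which is why the innermost loop calls the solver twice per wavelength. A minimal sketch, with an illustrative function name and made-up radiance values:

```python
def ipa_radiance(cloud_fraction, i_cloudy, i_clear_sky):
    """Independent pixel approximation: I = c * I_cloudy + (1 - c) * I_clear_sky."""
    if not 0.0 <= cloud_fraction <= 1.0:
        raise ValueError("cloud fraction must lie in [0, 1]")
    return cloud_fraction * i_cloudy + (1.0 - cloud_fraction) * i_clear_sky


# Made-up radiances for a pixel that is 30% cloud covered.
radiance = ipa_radiance(0.3, i_cloudy=1.0, i_clear_sky=0.5)
print(radiance)
```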

8 // loop over cloud fraction: Optimization of the independent pixel approximation
$U_j^1 i_j + U_j^2 i_{j+1} = B_j$, ..., $U_N^1 i_N + U_N^2 i_{N+1} = B_N$
$\hat{U}_j^1 i_j + \hat{U}_j^2 i_{j+1} = \hat{B}_j$, ..., $U_N^1 i_N + U_N^2 i_{N+1} = B_N \exp(-\tau)$
$I^{cloud}(\lambda) = I^{clear}(\lambda) \dfrac{I_2^{cloud}(\lambda)}{I_2^{clear}(\lambda)} K(\lambda)$
Efremenko et al. // JQSRT 135 (2014) 58-65

9 // loop over ground pixels: CUDA-based implementation. Stand-alone C version of the discrete ordinate method; single-precision code; CPU/GPU overlapping; overlapping of the data transfer between CPU and GPU.

10 // loop over ground pixels: Performance for matrix operations
[Figure: three panels, for matrix sizes 8x8, 32x32 and 128x128.]
SGEMM: matrix multiplication; EIG: eigenvalue problem; LU: LU factorization. For CUBLAS, dynamic parallelism was used.

11 // loop over ground pixels: Performance comparison Efremenko et al. Multi-core-CPU and GPU-accelerated // Computer Physics Communications, 185 (2014)

12 // loop over ground pixels: Performance comparison
Component | Workload | Speedup | Reduced workload
Multi-stream RTM | 50% | | %
Two-stream RTM | 25% | 53 | 7%
PCA | 20% | 6 | 52%
Rest | 5% | 10 | 6%
According to Amdahl's law, the speedup of the whole algorithm is limited: $S_{tot} < \dfrac{1}{\frac{\beta}{S} + (1 - \beta)}$, where $\beta$ is the accelerated part and $S$ is the speedup factor for the $\beta$-part of the algorithm. Further speedup of the PCA-based RTM due to improving the multi-stream RTM cannot be larger than 2.
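The bound of 2 quoted on this slide can be checked numerically. The sketch below evaluates the Amdahl bound for the multi-stream RTM, taking its 50% workload share from the table and, as an illustrative value, the speedup factor of 20 reported for the multi-stream model in the summary. The function name is an assumption.

```python
def amdahl_bound(beta, s):
    """Upper bound on total speedup when a fraction beta is accelerated by s."""
    return 1.0 / (beta / s + (1.0 - beta))


# Multi-stream RTM takes 50% of the workload (from the table).
with_factor_20 = amdahl_bound(0.5, 20.0)
ideal_limit = amdahl_bound(0.5, float("inf"))  # infinitely fast multi-stream RTM
print(with_factor_20, ideal_limit)
```

Even an infinitely fast multi-stream solver leaves the other half of the workload untouched, so the total gain saturates at a factor of 2.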

13 Cumulative performance enhancement
Acceleration technique | Performance enhancement
PCA-based radiative transfer | 8
IPA-optimization | 2
GPU-computing | 15
Total | 240 (220)
The maximum error introduced by all acceleration techniques does not exceed 0.1%.
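The total of 240 is the product of the individual factors, since each technique accelerates a different loop level. A short check is given below; reading the bracketed 220 as the measured overall value is an assumption, based on the summary quoting a total of about 220.

```python
import math

# Individual enhancement factors from the slide.
factors = {
    "PCA-based radiative transfer": 8,
    "IPA-optimization": 2,
    "GPU-computing": 15,
}

# Independent techniques on nested loops multiply.
total = math.prod(factors.values())
print(total)
```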

14 Example of ozone retrieval

15 Summary
A fast radiative transfer solver has been developed. It combines dimensionality reduction of the optical parameters with the discrete ordinate method;
We introduced two optimization techniques for the independent pixel approximation;
The GPU solver achieves a speedup factor of 50 for the two-stream model and 20 for the multi-stream model with 8 discrete ordinates, compared with single-threaded CPU code;
GPU memory management is crucial for performance. Memory size is a major limitation of the current generation of GPU cards;
The total performance enhancement is about a factor of 220.

16 Background
The discrete ordinate method is used to obtain a numerically stable solution of the RTE.
The RT problem has a matrix solution and relies mostly on the following LAPACK subroutines:
?gemm (matrix multiplication)
?getrf + ?getri (matrix inversion)
?getrf + ?getrs (system of linear equations)
?geev (eigenvalue problem)
The number of discrete ordinates is the main parameter governing accuracy and performance. For ozone retrieval in the UV spectral range, 4-8 discrete ordinates per hemisphere are used. The computations are carried out for several wavelengths (80) and pixels (20 000, in the near future ).

17 Parallel computing strategies

18 Parallelization on the level of math libraries
Tool: Math Kernel Library (MKL)
Idea: the Math Kernel Library is used instead of LAPACK. It is optimized for multicore processors.
Implementation: -L ${MKLPATH} -I ${MKLINCLUDE} -Wl,--start-group ${MKLPATH}/libmkl_intel.a ${MKLPATH}/libmkl_intel_thread.a ${MKLPATH}/libmkl_core.a ${MKLPATH}/libmkl_scalapack_core.a -Wl,--end-group -liomp5 -lpthread
Result: on a 2-core CPU we obtained a speedup of ~2, on a 4-core CPU ~4. Only one task (one wavelength for one pixel) is processed at a time. The code does not need to be changed (FORTRAN and C are supported); only the linking options are added.

19 Parallelization on the level of tasks
Tool: Open Multi-Processing (OpenMP)
Idea: a master thread forks a specified number of slave threads, and the task is divided among them.
Implementation:
#include <omp.h>
...
omp_set_num_threads(total_n_threads);
#pragma omp parallel private(th_id)
{
    th_id = omp_get_thread_num();
    ...
    call_rtmodel(th_id);
}
Only slight changes are required: one adds OpenMP directives and the compiler flag -fopenmp.

20 Running on several CPUs
White dots refer to OpenMP-accelerated code; black dots refer to several single-threaded codes executed concurrently (GNU Parallel tool).
Drepper U. What every programmer should know about memory
Intel Xeon CPU E5-1620 3.60GHz with 8 CPUs on 4 cores (250 euro)
AMD Opteron 4176 HE 2.4GHz with 12 CPUs on 6 cores
AMD Opteron 6282 SE 2.6GHz with 32 CPUs on 8 cores (1000 euro)

21 GPU: possible performance enhancement
Mielikainen et al. // IEEE Selected Topics in Applied Earth Observations and Remote Sensing. 2011. V.4. P. 691. 3000 times
Chen et al. // Computers & Geosciences V. 46. P times
Sui et al. // Computers & Geosciences V. 43. P times
Humphrey et al. // Proc. SPIE V P
Lee et al. Debunking the 100x GPU vs. CPU myth... // SIGARCH Comput. Archit. News V.38. P times

22 Memory management

23 Summary
The PCA-based radiative transfer model has been developed for GPU and multicore CPU hardware.
The GPU solver achieves a speedup factor of 50 for the two-stream model and 20 for the multi-stream model with 8 discrete ordinates, compared with single-threaded CPU code.
GPU memory management is crucial for performance. Memory size is a major limitation of the current generation of GPU cards.
Future generations of GPUs are a promising option for Sentinel 5 Precursor.

24 Philosophical conclusion: "If you optimize everything, you will always be unhappy." Donald Knuth

25 Thread and memory hierarchies (CUDA C Programming Guide)
