Parallel Processing
Majid AlMeshari, John W. Conklin
Outline
Challenge, Requirements, Resources, Approach, Status, Tools for Processing
Challenge
A computationally intensive algorithm is applied to a huge set of data. Verification of the filter's correctness takes a long time and hinders progress.
Requirements
Achieve a 5-10x speed-up over the serial version. Minimize parallelization overhead. Minimize the time needed to parallelize new releases of the code. Achieve reasonable scalability.
[Figure: SAC 19 run time (sec) vs. number of processors]
Resources
Hardware: a computer cluster (Regelation) with 44 64-bit CPUs. 64-bit enables us to address a memory space beyond 4 GB. We use this cluster because (a) we likely will not need more than 44 processors and (b) it is the same platform as our Linux boxes.
Software: MatlabMPI, a Matlab parallel computing toolbox created by MIT Lincoln Laboratory. It eliminates the need to port our code into another language.
Approach - Serial Code Structure
Initialization
Loop over iterations
  Loop over gyros
    Loop over segments
      Loop over orbits
        Build h(x) (the right-hand side)
        Build Jacobian, J(x)
        Load data (Z), perform fit
(The inner loops are the computationally intensive components: parallelize!)
Bayesian estimation, display results
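The serial structure above can be written as a set of nested loops. A minimal Python sketch follows (the real code is Matlab; `build_h`, `build_jacobian`, `load_data`, `fit`, and `bayesian_estimate` are hypothetical stand-ins, not the actual filter routines):

```python
# Hypothetical stand-ins for the real Matlab routines (sketch only).
def build_h(orbit):
    return orbit * 2.0          # right-hand side h(x)

def build_jacobian(orbit):
    return [[orbit]]            # Jacobian J(x): the expensive part

def load_data(orbit):
    return [orbit]              # measurement data Z

def fit(J, h, Z):
    return h                    # placeholder for the per-orbit fit

def bayesian_estimate(results):
    return sum(results) / len(results)

def serial_estimation(n_iter, gyros, segments, orbits):
    # Serial structure: every orbit's h(x), J(x), and fit run on one CPU.
    results = []
    for _ in range(n_iter):
        for _gyro in gyros:
            for _seg in segments:
                for orbit in orbits:
                    h = build_h(orbit)
                    J = build_jacobian(orbit)
                    Z = load_data(orbit)
                    results.append(fit(J, h, Z))
    return bayesian_estimate(results)
```

Everything inside the orbit loop is independent work, which is what makes the parallelization on the next slide possible.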
Approach - Parallel Code Structure
Initialization
Loop over iterations
  Loop over gyros
    Loop over segments
      Loop over orbits
        Build h(x)
        Build Jacobian, J(x) (in parallel, split up by columns of J(x))
      Distribute J(x) to all nodes (custom data distribution code required: overhead)
      Loop over partial orbits (in parallel, split up by orbits)
        Load data (Z), perform fit
      Send information matrices to master and combine
Bayesian estimation, display results
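"Split up by columns of J(x)" means each processor owns a contiguous block of Jacobian columns. A small Python sketch of such a partitioner (the function name and even-split policy are assumptions, not the project's actual distribution code):

```python
def split_columns(n_cols, n_procs):
    # Evenly partition Jacobian columns among processors; the first
    # (n_cols % n_procs) processors each get one extra column.
    base, extra = divmod(n_cols, n_procs)
    bounds, start = [], 0
    for p in range(n_procs):
        width = base + (1 if p < extra else 0)
        bounds.append(range(start, start + width))
        start += width
    return bounds
```

For example, `split_columns(10, 3)` assigns 4, 3, and 3 columns to the three processors, covering every column exactly once.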
Code in Action
Initialization, then on each of Proc 1 through Proc n:
  Split by columns of J: compute h and partial J
  Sync point: get complete J
  Split by rows of J: perform fit
  Sync point: combine info matrix
Bayesian estimation
Another iteration? If so, go for another iteration; otherwise go for the next gyro and segment.
Parallelization Techniques - 1
Minimize inter-processor communication. Use MatlabMPI for one-to-one communication only; use a custom approach for one-to-many communication.
[Figure: communication time (sec) vs. number of processors (1-16) with the custom MPI approach, broken down into Comp J, Dist J (I/O), and Est comp info (I/O)]
Parallelization Techniques - 2
Balance the processing load among processors:
load per processor = length(reduced_vector_k) * (compute_weight_k + disk_io_weight), for k in {RT, Tel, Cg, TFM, S0}
[Figure: time (sec) vs. number of processors (1-16) for Comp J, Dist J (I/O), and Est comp info (I/O), showing that CPUs carry uniform load]
Parallelization Techniques - 3
Minimize the memory footprint of the slave code. Tightly manage memory to prevent swapping to disk and out-of-memory errors. We saved ~40% of RAM per processor and have not seen an out-of-memory error since.
Parallelization Techniques - 4
Find contention points: cache what can be cached, optimize what can be optimized. Example: the compute-h optimization.
[Figure: run time (sec) of the old code vs. the optimized code for compute h]
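One standard way to "cache what can be cached" is memoization: an expensive routine that is called repeatedly with the same inputs returns a stored result instead of recomputing. A Python sketch using the standard library (`compute_h` here is a made-up stand-in, not the project's actual compute-h code):

```python
from functools import lru_cache

calls = {"n": 0}

@lru_cache(maxsize=None)
def compute_h(orbit_id):
    # Stand-in for an expensive h(x) computation; the counter only
    # demonstrates that repeat calls hit the cache instead of recomputing.
    calls["n"] += 1
    return orbit_id * 2.0
```

Calling `compute_h(3)` twice performs the computation once; the second call is a dictionary lookup, which is how caching removes a contention point without changing the caller.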
Status
Current speed-up:
[Figure: speedup vs. number of processors (1, 8, 16, 20 CPUs) for the Sep. '09, May '10, and July '10 code versions; speedup is 1.00 on 1 CPU and reaches 14.50 at best]
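Speedup figures like those on this slide come from S(n) = T_serial / T_parallel(n), and the parallel efficiency S(n)/n shows how far the scaling is from ideal. A small sketch (the timings in the example are made-up, chosen only to reproduce a 14.5x speedup):

```python
def speedup(t_serial, t_parallel):
    # S(n) = T_serial / T_parallel(n)
    return t_serial / t_parallel

def efficiency(t_serial, t_parallel, n_procs):
    # S(n) / n: 1.0 means perfect linear scaling on n CPUs.
    return speedup(t_serial, t_parallel) / n_procs
```

A 14.5x speedup on 20 CPUs corresponds to an efficiency of about 0.73, i.e. reasonable but sub-linear scaling, consistent with the overhead the earlier slides work to minimize.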
Tools for Processing - 1
Run Manager / Scheduler / Version Controller, built by Ahmad Aljadaan. Given a batch of options files, it schedules batches of runs, serial or parallel, on the cluster. It facilitates version control by creating a unique directory for each run containing its options file, code, and results. Used with a large number of test cases, it has helped us schedule 1048+ runs since January 2010, and counting.
Tools for Processing - 2
Result Viewer, built by Ahmad Aljadaan. It allows searching for runs, displays results, and plots related charts, facilitating comparison of results.
Questions