Digital Earth Routine on Tegra K1: Aerosol Optical Depth Retrieval, Performance Comparison and Energy Efficiency
Energy matters! A topic that affects us all
Reasons: ecological, economical, practical, curiosity
My background:
- Many years of research in High Performance Computing at Fraunhofer SCAI, Germany
- Compiler development
- Remote sensing together with the Academy of Science, China
AOD Retrieval Method
Research cooperation: Fraunhofer SCAI (Germany) and the Academy of Science (China)
- Aerosol Optical Depth (AOD) is a significant optical property of aerosols.
- AOD is applied to the atmospheric correction of remotely sensed surface features, to monitoring volcanic eruptions, forest fires and air quality in general, and to climate predictions from satellites.
- Measurements at different wavelengths for each pixel on Earth (with a spatial resolution of e.g. 1 km) are stacked into a data cube and form the input for remote sensing algorithms.
Image credits: sites.ieee.org, mx.nthu.edu.tw
AOD Retrieval Method: Input Data Collection
- Daily observations of the Moderate Resolution Imaging Spectroradiometer (MODIS) on the NASA satellites TERRA and AQUA (i = 1, 2)
- Three wavelengths from the visible spectrum (470, 550, 660 nm) are considered (j = 1, 2, 3)
- The satellites were placed into a near-polar, sun-synchronous orbit at an altitude of 705 km
- Both complement each other as they observe the same regions of the Earth at different times of the day
Image credit: sites.ieee.org
AOD Retrieval Method: Background
Consider the atmosphere as a turbid medium following the Lambert-Beer law.
Optical depth: the total thickness $\tau$ consists of three parts, $\tau = \tau_R + \tau_G + \tau_A$
- Rayleigh scattering: $\tau_R = 0.008735 \, \lambda_j^{-4.085}$ ($\lambda_j$: wavelength of channel $j$)
- Gas absorption $\tau_G$: quasi constant (ozone + water + oxygen + others, taken from tables listing channel, wavelength (nm), transmissivity and gas optical thickness)
- Absorption or (mainly) Mie scattering by aerosols (AOD) $\tau_A$, given by Ångström's turbidity formula: $\tau_A(\lambda_j) = \beta_i \, \lambda_j^{-\alpha}$ ($\beta_i$: AOD for $\lambda_j = 1\,\mu\mathrm{m}$)
Example: cloud droplets. The particles are relatively large, so $\alpha$ is small and the scattering is nearly constant over $\lambda_j$.
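The two wavelength laws on this slide can be evaluated directly. The following is a minimal sketch, assuming wavelengths in micrometres and negative exponents (the usual form of both laws); the function names and the sample values for $\alpha$ and $\beta$ are hypothetical and not taken from the talk.

```python
# Sketch only: evaluates the Rayleigh and Angstrom formulas from the slide.
# Assumptions: wavelengths in micrometres, negative exponents.

MODIS_BANDS_UM = (0.47, 0.55, 0.66)  # the three visible channels (470, 550, 660 nm)


def rayleigh_optical_depth(wavelength_um: float) -> float:
    """tau_R = 0.008735 * lambda^(-4.085)."""
    return 0.008735 * wavelength_um ** -4.085


def angstrom_aod(beta: float, alpha: float, wavelength_um: float) -> float:
    """Angstrom's turbidity formula: tau_A = beta * lambda^(-alpha)."""
    return beta * wavelength_um ** -alpha


if __name__ == "__main__":
    beta, alpha = 0.15, 1.2  # hypothetical aerosol parameters
    for lam in MODIS_BANDS_UM:
        tau_r = rayleigh_optical_depth(lam)
        tau_a = angstrom_aod(beta, alpha, lam)
        print(f"{lam:.2f} um: tau_R = {tau_r:.4f}, tau_A = {tau_a:.4f}")
```

With $\alpha = 0$ the aerosol term no longer depends on the wavelength, which is the limiting case behind the cloud droplet example.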
AOD Retrieval Method: SRAP-MODIS Algorithm (Synergetic Retrieval of Aerosol Properties)
- Difference between top-of-atmosphere (TOA) reflectance and surface reflectance (atmospheric distortion), expressed in terms of $\tau_R$ and $\tau_A$
- Ratio of two observations is constant for all wavelengths
- Estimate the parameters $\alpha, \beta_1, \beta_2$ by a quasi-Newton method
- Approximation of the Jacobian matrix: the influence of the atmosphere decreases rapidly with increasing wavelength, so it is approximated by TOA values of large wavelengths with minor influences
- Derive the AOD for different wavelengths with Ångström's turbidity formula
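To illustrate what an iterative parameter estimate with an approximated Jacobian looks like, here is a toy sketch. It is not the SRAP-MODIS system and not the authors' quasi-Newton solver: it recovers only $\alpha$ and a single $\beta$ from two hypothetical aerosol optical depths via Ångström's formula, using a Newton iteration whose Jacobian is approximated by forward differences. All names and numbers are assumptions made for the example.

```python
# Toy sketch: Newton iteration with a forward-difference Jacobian.
# Stand-in for illustration only; the real retrieval solves for alpha, beta_1, beta_2.


def residuals(params, wavelengths_um, observed_tau):
    """Mismatch between Angstrom's formula and the observed optical depths."""
    alpha, beta = params
    return [beta * lam ** -alpha - tau for lam, tau in zip(wavelengths_um, observed_tau)]


def fd_jacobian(func, params, h=1e-7):
    """Forward-difference approximation; result[i][k] = d f_i / d p_k."""
    base = func(params)
    columns = []
    for k in range(len(params)):
        shifted = list(params)
        shifted[k] += h
        columns.append([(fs - fb) / h for fs, fb in zip(func(shifted), base)])
    return [[columns[k][i] for k in range(len(params))] for i in range(len(base))]


def solve_alpha_beta(wavelengths_um, observed_tau, start=(1.0, 0.1), tol=1e-10, max_iter=50):
    """Return (alpha, beta, number of Newton steps) for a 2x2 system."""
    params = list(start)

    def func(p):
        return residuals(p, wavelengths_um, observed_tau)

    for steps in range(max_iter + 1):
        f = func(params)
        if max(abs(v) for v in f) < tol:
            return params[0], params[1], steps
        (a, b), (c, d) = fd_jacobian(func, params)
        det = a * d - b * c
        # Cramer's rule for J * step = -f
        params[0] += (-f[0] * d + b * f[1]) / det
        params[1] += (-a * f[1] + c * f[0]) / det
    raise RuntimeError("no convergence within max_iter steps")


if __name__ == "__main__":
    bands = (0.47, 0.66)  # micrometres
    true_alpha, true_beta = 1.2, 0.15  # hypothetical values used to build test data
    taus = [true_beta * lam ** -true_alpha for lam in bands]
    print(solve_alpha_beta(bands, taus))
```

The returned step count depends on the input data, which is the per-pixel variation in convergence that matters for the parallelization.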
AOD Retrieval Method: IMPORTANT for the parallelization of the retrieval method
- The AOD calculation is independent for each pixel and can be performed solely based on the respective wavelength vector in the data cube
- A quasi-Newton method runs for each pixel to determine $\alpha, \beta_1, \beta_2$
- The rate of convergence may vary considerably between pixels, e.g. between OL pixels (over land), OS pixels (over sea) and masked pixels (more about that later)
- Additionally, the control flow may follow different paths in the AOD kernel (diverging branches)
AOD Retrieval on multi-core processors
Shared memory parallelization with OpenMP, static OpenMP scheduling
Problem: load imbalance across the cores
- Reason on the one hand: the quasi-Newton method converges at different rates (load imbalance)
- Reason on the other hand: varying pixel data may lead to different branches in the AOD kernel, e.g. cloud masking (branch divergence)
[Figure: power intake (watt) over time (seconds) for 1, 2, 4, 8 and 16 threads with static scheduling]
AOD Retrieval on multi-core processors
Shared memory parallelization with OpenMP
Solution: adapted scheduling of the pixels (AOD threads)
- Similar pixels show similar convergence, and similar pixels lie near each other
- Instead of blocking the iterations/pixels statically in large chunks, use small blocks (e.g. of size 1) or dynamic scheduling
- As each kernel run is relatively work-intensive, the overhead this introduces is insignificant
[Figure: power intake (watt) over time (seconds), static vs. dynamic scheduling; annotation: cloud-masking dependencies]
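The effect of the scheduling choice can be reproduced with a small simulation. The sketch below does not run OpenMP; it only models how iterations would be distributed by large static chunks, by static chunks of size 1, and by dynamic scheduling, and reports the time of the busiest worker. The cost vector is hypothetical: a long run of cheap pixels followed by a run of expensive ones, mimicking similar pixels lying near each other.

```python
# Sketch: models three ways of distributing loop iterations over workers.
# The per-pixel costs are invented for illustration.
import heapq


def static_block_makespan(costs, workers):
    """One contiguous chunk per worker, comparable to schedule(static)."""
    chunk = -(-len(costs) // workers)  # ceiling division
    return max(sum(costs[w * chunk:(w + 1) * chunk]) for w in range(workers))


def static_cyclic_makespan(costs, workers):
    """Round-robin assignment, comparable to schedule(static, 1)."""
    return max(sum(costs[w::workers]) for w in range(workers))


def dynamic_makespan(costs, workers):
    """Next iteration goes to the worker that is free first, comparable to schedule(dynamic, 1)."""
    finish_times = [0] * workers
    heapq.heapify(finish_times)
    for cost in costs:
        earliest = heapq.heappop(finish_times)
        heapq.heappush(finish_times, earliest + cost)
    return max(finish_times)


if __name__ == "__main__":
    pixel_costs = [1] * 300 + [20] * 100  # cheap region, then slowly converging region
    for name, fn in (("static blocks", static_block_makespan),
                     ("static, chunk 1", static_cyclic_makespan),
                     ("dynamic", dynamic_makespan)):
        print(name, fn(pixel_costs, 4))
```

In this model the large static chunks leave one worker with all expensive pixels, while both alternatives spread them evenly. Scheduling overhead is ignored here, which matches the slide's point only because each kernel run is work-intensive.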
AOD Retrieval on GPUs
Similar to multi-core. Solution again: adapted scheduling of the pixels (AOD threads)
- Similar pixels show similar convergence, and similar pixels lie near each other
- Thread blocks: only little branch divergence by construction
- Not too many pixels per thread block, so that the pixels in a block are nearby and therefore similar
- Thread block tuning: *NOT* "more is better"; there are restrictions on the registers per block
- The programmer can (and has to) optimize these parameters
- GPUs are very well suited for the retrieval kernel, but not necessarily for other parts of the workflow
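Why grouping similar pixels helps on a GPU can be shown with a simplified lockstep model. The sketch assumes that the 32 threads of an NVIDIA warp execute together and that a warp stays busy until its slowest pixel has converged; the iteration counts are hypothetical, and real kernels are affected by many more factors (occupancy, memory access patterns).

```python
# Sketch: simplified lockstep model of divergence inside a warp.
# Assumption: a warp of 32 lanes runs as long as its slowest lane.

WARP_SIZE = 32


def warp_iterations(iters_per_pixel, warp_size=WARP_SIZE):
    """Total iterations issued when each warp runs as long as its slowest pixel."""
    return sum(max(iters_per_pixel[start:start + warp_size])
               for start in range(0, len(iters_per_pixel), warp_size))


def lane_efficiency(iters_per_pixel, warp_size=WARP_SIZE):
    """Useful lane-iterations divided by issued lane-iterations (full warps assumed)."""
    issued = warp_iterations(iters_per_pixel, warp_size) * warp_size
    return sum(iters_per_pixel) / issued


if __name__ == "__main__":
    clustered = [5] * 64 + [50] * 64   # similar pixels next to each other
    interleaved = [5, 50] * 64         # fast and slow pixels mixed in every warp
    print("clustered  ", warp_iterations(clustered), lane_efficiency(clustered))
    print("interleaved", warp_iterations(interleaved), lane_efficiency(interleaved))
```

Both orderings contain the same pixels, yet the mixed ordering keeps every warp busy for the full 50 iterations.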
AOD Retrieval on GPUs
Speedup with increasing input size
[Figure: speedup over input size; series: data transfer, GPU overall, MC overall, GPU calc, MC calc]
AOD Retrieval on GPUs
Comparison of CPU and GPU architecture
- CPU vs. GPU: problem-dependent (in part)
- Low latency vs. high throughput
- Lots of automatisms vs. (still) lots of manual tuning: optimization of thread blocks, register assignment, occupancy (e.g. registers vs. threads) and memory accesses (shared memory bank conflicts, global memory coalescing)
[Figure: CPU and GPU block diagrams showing ALUs, control unit, cache levels vs. L2, and DRAM]
AOD Workflow on HYBRID systems
More than the retrieval is needed. Workflow parts and where they can run:
- Multi-core
- Multi-core or GPU
- Multi-core or GPU or HYBRID
Why EMBEDDED?
EMBEDDED architectures are interesting in various fields of research
- Energy plays a major role today
- Satellite on-board observations
- Automotive sector, e.g. high performance embedded systems for in-vehicle applications
- "The convergence of HPC and embedded systems in our heterogeneous computing future" (Kaeli et al. 2011)
- The Exascale Challenge (Moore's Law) and future HPC systems
- Relatively cheap combination of multi-cores and GPUs today
http://www.eetimes.com/document.asp?doc_id=1272780
AOD on EMBEDDED architectures: NVIDIA Jetson TK1
Jetson TK1: energy-efficient SoC for high performance under strong energy constraints
AOD Retrieval on MIXED EMBEDDED JetsonTK1
AOD Retrieval Method: Jetson TK1 Runtime
- 1xSoC: CPU 1HPCore 2861.37, CPU 4HPCores 717.37, GPU 1HPCore 46.93
- 4xSoC: GPU 1HPCore 18.13
- XeonWS: CPU 1T 192.15, CPU 4T 48.28, GPU 3.92
[Figure: three bar charts on a common axis from 0 to 3000]
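The runtimes on this slide translate into speedups relative to the single-core run of each platform. The snippet below is a small sketch of that arithmetic; the pairing of values and bar labels follows their order on the slide, and the helper name is made up for the example.

```python
# Sketch: speedup relative to the single-core baseline of each platform.
# Runtimes copied from the slide (unit not stated there).

RUNTIMES = {
    "1xSoC": {"CPU 1HPCore": 2861.37, "CPU 4HPCores": 717.37, "GPU 1HPCore": 46.93},
    "XeonWS": {"CPU 1T": 192.15, "CPU 4T": 48.28, "GPU": 3.92},
}
BASELINES = {"1xSoC": "CPU 1HPCore", "XeonWS": "CPU 1T"}


def speedups(platform):
    """Baseline runtime divided by the runtime of every configuration."""
    base = RUNTIMES[platform][BASELINES[platform]]
    return {name: base / runtime for name, runtime in RUNTIMES[platform].items()}


if __name__ == "__main__":
    for platform in RUNTIMES:
        for name, factor in speedups(platform).items():
            print(f"{platform:7s} {name:13s} {factor:6.2f}x")
```

On both platforms the four-core runs come out close to a factor of four, and the GPU runs are faster than one core by a factor of roughly 50 to 60.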
AOD Retrieval Method: Jetson TK1 Runtime (Scaling)
- 1xSoC: 47.08
- 2xSoC: 28.92
- 3xSoC: 22.67
- 4xSoC: 18.13
[Figure: bar chart on an axis from 0 to 50]
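From the scaling runtimes one can derive the speedup and the parallel efficiency per number of SoCs. This is a sketch under one assumption that the flattened slide does not confirm explicitly: the four values are assigned to 1, 2, 3 and 4 SoCs in decreasing order (47.08, 28.92, 22.67, 18.13).

```python
# Sketch: speedup and parallel efficiency across 1 to 4 SoCs.
# Assumption: runtimes are assigned to SoC counts in decreasing order.

SCALING_RUNTIMES = {1: 47.08, 2: 28.92, 3: 22.67, 4: 18.13}


def scaling_table(runtimes):
    """Return {n: (speedup, efficiency)} relative to the single-SoC runtime."""
    base = runtimes[1]
    return {n: (base / t, base / t / n) for n, t in sorted(runtimes.items())}


if __name__ == "__main__":
    for n, (speedup, efficiency) in scaling_table(SCALING_RUNTIMES).items():
        print(f"{n}xSoC: speedup {speedup:.2f}, efficiency {efficiency:.0%}")
```

Under that assignment the efficiency drops as SoCs are added, which is consistent with the slide's earlier remark that GPUs suit the retrieval kernel better than the remaining parts of the workflow.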
AOD Retrieval Method: Jetson TK1 Energy
- 1xSoC: CPU 1HPCore 12309.49, CPU 4HPCores 3650.93, GPU 1HPCore 339.61
- 4xSoC: GPU 1HPCore 610.88
- XeonWS: CPU 1T 15336.89, CPU 4T 5268.56, GPU 880.95
[Figure: three bar charts on a common axis from 0 to 16000]
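Combining the energy values with the runtimes of the same configurations gives the average power intake and shows the trade-off between speed and energy. The sketch assumes that runtimes are in seconds and energies in joules (the slides do not label the units), so the quotient is in watts; the function names are hypothetical.

```python
# Sketch: average power = energy / runtime, plus a runtime/energy comparison.
# Assumption: runtimes in seconds, energies in joules.

MEASUREMENTS = {  # label: (runtime, energy)
    "1xSoC CPU 1HPCore": (2861.37, 12309.49),
    "1xSoC CPU 4HPCores": (717.37, 3650.93),
    "1xSoC GPU 1HPCore": (46.93, 339.61),
    "4xSoC GPU 1HPCore": (18.13, 610.88),
    "XeonWS CPU 1T": (192.15, 15336.89),
    "XeonWS CPU 4T": (48.28, 5268.56),
    "XeonWS GPU": (3.92, 880.95),
}


def average_power(label):
    """Energy divided by runtime for one configuration."""
    runtime, energy = MEASUREMENTS[label]
    return energy / runtime


def compare(label_a, label_b):
    """Return (runtime ratio, energy ratio) of configuration a relative to b."""
    (time_a, energy_a), (time_b, energy_b) = MEASUREMENTS[label_a], MEASUREMENTS[label_b]
    return time_a / time_b, energy_a / energy_b


if __name__ == "__main__":
    for label in MEASUREMENTS:
        print(f"{label:20s} average power {average_power(label):7.2f}")
    print(compare("1xSoC GPU 1HPCore", "XeonWS GPU"))
```

Read this way, the single SoC with GPU takes about twelve times as long as the workstation GPU but needs well under half of its energy.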
Publications
- 2015: Multi-Core Processors and Graphics Processing Unit Accelerators for Parallel Retrieval of Aerosol Optical Depth from Satellite Data: Implementation, Performance and Energy Efficiency. J. Liu, D. Feld, Y. Xue, J. Garcke and T. Soddemann. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing
- 2016: Design of a Hybrid Parallel Workflow for Efficient Aerosol Optical Depth Retrieval from MODIS Satellite Data for Computers with Multi-core Processors and GPUs. J. Liu, D. Feld, Y. Xue, J. Garcke, T. Soddemann and P. Pan. International Journal of Digital Earth
- 2016: Energy-Efficiency and Performance Comparison of Aerosol Optical Depth (AOD) retrieval on distributed Embedded SoC architectures with Nvidia GPUs. D. Feld, E. Schricker, J. Liu, Y. Xue, J. Garcke and T. Soddemann. SCAI Book (Springer) [t.b.a.]
[Figures: two bar charts with the series CPU, MC, GPU, HYBRID and DYNAMIC for the systems Workstation1 Core i7-960 3.20 GHz 8T(HT), GTX460; 1xSoC ARM Cortex A15 2.30 GHz 4T, Kepler "192"; Workstation32Xeon E3-1275 V2 3.50 GHz 8T(HT), GTX680]
Energy matters!
Workflow as a tool
- Restrictions: power intake, energy consumption, runtime restriction (real-time), minimal runtime (post-processing); e.g. on-board missions
- Goals influence each other
- Extension of the methods, e.g. pixel sorting to reduce divergence while respecting dependencies
- Further methods, other input data, other goals
Thanks for your attention! Questions?
dustin.feld@scai.fraunhofer.de
https://www.researchgate.net/profile/dustin_feld
https://de.linkedin.com/in/d3feld
Image credit: NASA Earth Observatory