Digital Earth Routine on Tegra K1

Digital Earth Routine on Tegra K1: Aerosol Optical Depth Retrieval, Performance Comparison and Energy Efficiency

Energy matters!
A topic that affects us all: ecological, economical and practical reasons; curiosity.
My background:
- Many years of research in High Performance Computing at Fraunhofer SCAI, Germany
- Compiler development
- Remote sensing together with the Academy of Science, China

AOD Retrieval Method
- Research cooperation: Fraunhofer SCAI (Germany) and the Academy of Science (China)
- Aerosol Optical Depth (AOD) is a significant optical property of aerosols
- AOD is applied to the atmospheric correction of remotely sensed surface features, to monitoring volcanic eruptions, forest fires and air quality in general, and to climate predictions from satellites
- Measurements at different wavelengths for each pixel on Earth (with a spatial resolution of e.g. 1 km) are stacked into a data cube and form the input for remote sensing algorithms
Image sources: sites.ieee.org, mx.nthu.edu.tw

AOD Retrieval Method: Input Data Collection
- Daily observations of the Moderate Resolution Imaging Spectroradiometer (MODIS) on the NASA satellites Terra and Aqua (i = 1, 2)
- Three different wavelengths from the visible spectrum (470, 550, 660 nm) are considered (j = 1, 2, 3)
- The satellites were placed into a near-polar, sun-synchronous orbit at an altitude of 705 km
- Both complement each other as they observe the same Earth regions at different times of the day
Image source: sites.ieee.org

AOD Retrieval Method: Background
- Consider the atmosphere as a turbid medium following the Beer-Lambert law
- The total optical depth τ consists of three parts: $\tau = \tau_R + \tau_G + \tau_A$
- Rayleigh scattering: $\tau_R = 0.008735\,\lambda_j^{-4.085}$ ($\lambda_j$: wavelength of channel $j$)
- Gas absorption $\tau_G$: quasi-constant (ozone + water + oxygen + others), taken from tables (columns: channel, wavelength (nm), transmissivity, gas optical thickness)
- Mie scattering $\tau_A$: absorption or (mainly) scattering through aerosols (AOD)
- Ångström's turbidity formula: $\tau_A(\lambda_j) = \beta_i\,\lambda_j^{-\alpha}$ ($\beta_i$: AOD for $\lambda_j = 1\,\mu\mathrm{m}$)
- Example: cloud droplets. The particles are relatively large, so α is small and the scattering is nearly constant over $\lambda_j$
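
The two wavelength-dependent terms on this slide can be evaluated directly. The following is a minimal sketch, assuming wavelengths in micrometres (consistent with β being the AOD at 1 μm) and negative exponents as in the standard forms of both formulas; the function names and the sample values for α and β are illustrative and not taken from the slides.

```python
# Sketch of the two wavelength-dependent terms of tau = tau_R + tau_G + tau_A.
# Assumption: wavelengths are in micrometres and both exponents are negative.

MODIS_VISIBLE_UM = (0.47, 0.55, 0.66)  # the three visible channels, in micrometres


def rayleigh_optical_depth(wavelength_um):
    """Rayleigh term: tau_R = 0.008735 * lambda^(-4.085)."""
    return 0.008735 * wavelength_um ** -4.085


def angstrom_aod(beta, alpha, wavelength_um):
    """Angstrom turbidity formula: tau_A = beta * lambda^(-alpha)."""
    return beta * wavelength_um ** -alpha


if __name__ == "__main__":
    beta, alpha = 0.2, 1.3  # hypothetical aerosol parameters, not from the slides
    for lam in MODIS_VISIBLE_UM:
        print(f"{lam:.2f} um: tau_R={rayleigh_optical_depth(lam):.4f}, "
              f"tau_A={angstrom_aod(beta, alpha, lam):.4f}")
```

With α = 0 the aerosol term no longer depends on wavelength, which corresponds to the cloud-droplet example of large particles.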

AOD Retrieval Method: SRAP-MODIS Algorithm (Synergetic Retrieval of Aerosol Properties)
- Difference between top-of-atmosphere (TOA) reflectance and surface reflectance (atmospheric distortion), caused by τ_R and τ_A
- The ratio of the two observations is constant for all wavelengths
- Estimate the parameters; approximation of the Jacobian matrix
- The influence of the atmosphere decreases rapidly with increasing wavelength, which allows an approximation by TOA values at large wavelengths, where the influence is minor
- Determine α, β_1, β_2 with a quasi-Newton method
- Derive the AOD for different wavelengths with Ångström's turbidity formula

AOD Retrieval Method: IMPORTANT for the parallelization of the retrieval method
- The AOD calculation is independent for each pixel and can be performed solely on the basis of that pixel's wavelength vector in the data cube
- A quasi-Newton method is run for each pixel to determine α, β_1, β_2
- The rate of convergence may vary considerably between pixels, e.g. between OL pixels (over land), OS pixels (over sea) and masked pixels (more about that later)
- In addition, the control flow may follow different paths in the AOD kernel: diverging branches
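
To make the per-pixel structure concrete, here is a toy sketch. It uses a one-dimensional secant iteration, the simplest quasi-Newton scheme, on a hypothetical residual; it is not the SRAP model, and `secant_solve`, `retrieve_pixel` and the sample observations are assumptions for illustration only. It shows that each pixel is solved from its own data alone and that the solver reports how many iterations it needed.

```python
# Toy stand-in for the per-pixel solve: a 1-D secant iteration (the simplest
# quasi-Newton scheme). The residual below is hypothetical, NOT the SRAP model.

def secant_solve(residual, x0, x1, tol=1e-10, max_iter=100):
    """Return (approximate root, iterations used) for residual(x) = 0."""
    f0, f1 = residual(x0), residual(x1)
    iteration = 0
    for iteration in range(1, max_iter + 1):
        if f1 == f0:  # flat secant: no further step is possible
            break
        x0, x1 = x1, x1 - f1 * (x1 - x0) / (f1 - f0)
        f0, f1 = f1, residual(x1)
        if abs(f1) < tol:
            break
    return x1, iteration


def retrieve_pixel(observation):
    """Solve x**3 = observation; each call needs only its own pixel's data."""
    return secant_solve(lambda x: x ** 3 - observation, 1.0, 2.0)


if __name__ == "__main__":
    for obs in (2.0, 50.0):  # two hypothetical pixels
        root, iterations = retrieve_pixel(obs)
        print(f"observation={obs}: root={root:.6f} after {iterations} iterations")
```

Because the calls share no state, they can be distributed over threads in any order; the differing iteration counts are what later shows up as load imbalance.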

AOD Retrieval on multi-core processors
Shared-memory parallelization with OpenMP: static OpenMP scheduling
- Problem: imbalance on the cores
- Reason on the one hand: quasi-Newton (convergence), leading to load imbalance
- Reason on the other hand: varying pixel data may lead to different branches in the AOD kernel (e.g. cloud masking), leading to branch divergence
[Figure: power intake (watt) over time (second) with static scheduling, for 1, 2, 4, 8 and 16 threads]

AOD Retrieval on multi-core processors
Shared-memory parallelization with OpenMP
- Solution: adapted scheduling of the pixels to AOD threads
- Similar pixels show similar convergence, and similar pixels lie near each other
- Instead of blocking the iterations/pixels statically in large chunks: small blocks, e.g. of size 1, OR dynamic scheduling
- As each kernel run is relatively work-intensive, the overhead this introduces is insignificant
[Figure: power intake (watt) over time (second), static vs. dynamic scheduling; annotation: cloud-masking dependencies]
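
The effect of the chunk size can be sketched with a small model. The code below assigns chunks of pixels to threads round-robin, which is how OpenMP's `schedule(static, chunk)` distributes loop iterations; the per-pixel costs are hypothetical and `thread_loads` is an illustrative helper, not part of the authors' code.

```python
# Sketch: how chunk size affects load balance under a static, round-robin
# assignment of chunks to threads (as with OpenMP schedule(static, chunk)).

def thread_loads(costs, n_threads, chunk):
    """Sum the per-pixel costs that each thread receives."""
    loads = [0] * n_threads
    for start in range(0, len(costs), chunk):
        thread = (start // chunk) % n_threads
        loads[thread] += sum(costs[start:start + chunk])
    return loads


if __name__ == "__main__":
    # Hypothetical costs: neighbouring pixels are similar (cheap half, costly half).
    costs = [1] * 500 + [5] * 500
    print("large chunks:", thread_loads(costs, 4, 250))
    print("chunk size 1:", thread_loads(costs, 4, 1))
```

With four threads and one large chunk each, the loads in this example are 250, 250, 1250 and 1250, so the run takes as long as the slowest thread. With chunks of size 1 every thread receives 750, because neighbouring (similar) pixels are spread across all threads.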

AOD Retrieval on GPUs
- Similar to multi-core
- Solution again: adapted scheduling of the pixels to AOD threads
- Similar pixels show similar convergence, and similar pixels lie near each other
- Thread blocks: only little branch divergence by construction
- Not too many pixels per thread block: nearby pixels are similar and converge similarly
- Thread-block tuning: *NOT* "more is better"; registers-per-block restrictions apply
- The programmer can (and has to) optimize the parameters
- GPUs are very well suited for the retrieval kernel, but not necessarily for other parts of the workflow
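
Why grouping similar pixels matters on a GPU can be illustrated with a simple cost model. The sketch assumes that the threads of one group advance in lockstep, so every thread stays occupied until the slowest pixel of its group has converged; the group size of 32 and the iteration counts are hypothetical.

```python
# Sketch of why similar pixels should share a thread group on a GPU.
# Assumption: threads of one group advance in lockstep, so every thread is
# occupied until the slowest pixel of its group has converged.

def lockstep_cost(iterations, group_size):
    """Thread-iterations spent when each group waits for its slowest pixel."""
    total = 0
    for start in range(0, len(iterations), group_size):
        group = iterations[start:start + group_size]
        total += max(group) * len(group)
    return total


if __name__ == "__main__":
    coherent = [1] * 32 + [5] * 32  # hypothetical: neighbours converge alike
    scattered = [1, 5] * 32         # the same pixels, interleaved
    print("useful work:", sum(coherent))
    print("coherent groups:", lockstep_cost(coherent, 32))
    print("scattered groups:", lockstep_cost(scattered, 32))
```

In this example the coherent layout spends 192 thread-iterations, exactly the useful work, while the interleaved layout spends 320 because every group waits for a slow pixel.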

AOD Retrieval on GPUs: speedup with increasing input size
[Figure: series Data Transfer, GPU overall, MC overall, GPU calc, MC calc; axis range 0 to 120]

AOD Retrieval on GPUs: comparison of CPU and GPU architecture
- CPU vs. GPU: problem-dependent (in part)
- Low latency vs. high throughput
- Lots of automatic mechanisms vs. (still) lots of manual tuning
- Optimization of thread blocks, register assignment, occupancy (e.g. registers vs. threads), memory accesses (shared-memory bank conflicts, global-memory coalescing), ...
[Figure: CPU and GPU block diagrams with ALUs, control unit, cache levels vs. L2, and DRAM]

AOD Workflow on HYBRID systems: more than the retrieval is needed
[Figure: workflow stages labelled with their possible targets: Multi-Core; Multi-Core or GPU; Multi-Core or GPU or HYBRID]

Why EMBEDDED?
- EMBEDDED architectures are interesting in various fields of research
- Energy plays a major role today
- Satellite on-board observations
- Automotive sector, e.g. high-performance embedded systems for in-vehicle applications
- "The convergence of HPC and embedded systems in our heterogeneous computing future" (Kaeli et al. 2011)
- The Exascale Challenge (Moore's Law) and future HPC systems
- Relatively cheap combination of multi-cores and GPUs today
Image source: http://www.eetimes.com/document.asp?doc_id=1272780

AOD on EMBEDDED architectures: NVIDIA Jetson TK1
- Jetson TK1: energy-efficient SoC for high performance under strong energy constraints

AOD Retrieval on MIXED EMBEDDED JetsonTK1

AOD Retrieval Method: Jetson TK1 Runtime
- 1xSoC: CPU 1HPCore 2861.37; CPU 4HPCores 717.37; GPU 1HPCore 46.93
- 4xSoC: GPU 1HPCore 18.13
- XeonWS: CPU 1T 192.15; CPU 4T 48.28; GPU 3.92
[Figure: three bar charts, axis range 0 to 3000]

AOD Retrieval Method: Jetson TK1 Runtime (Scaling)
- 1xSoC: 47.08
- 2xSoC: 28.92
- 3xSoC: 22.67
- 4xSoC: 18.13
[Figure: bar chart, axis range 0 to 50]
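
Speedup and parallel efficiency follow directly from the four runtimes on this slide. The sketch below assumes the chart pairs 1, 2, 3 and 4 SoCs with the runtimes 47.08, 28.92, 22.67 and 18.13 (the order is garbled in the transcript); `scaling_table` is an illustrative helper.

```python
# Speedup and parallel efficiency from the runtimes on the scaling slide.
# Assumption: 1..4 SoCs map to monotonically falling runtimes.

RUNTIME_BY_SOCS = {1: 47.08, 2: 28.92, 3: 22.67, 4: 18.13}


def scaling_table(runtimes):
    """Return {n: (speedup, efficiency)} relative to the single-SoC run."""
    base = runtimes[1]
    return {n: (base / t, base / t / n) for n, t in sorted(runtimes.items())}


if __name__ == "__main__":
    for n, (speedup, efficiency) in scaling_table(RUNTIME_BY_SOCS).items():
        print(f"{n}xSoC: speedup {speedup:.2f}, efficiency {efficiency:.0%}")
```

Under that pairing, four SoCs reach a speedup of roughly 2.6 over one SoC, i.e. a parallel efficiency of about 65%.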

AOD Retrieval Method: Jetson TK1 Energy
- 1xSoC: CPU 1HPCore 12309.49; CPU 4HPCores 3650.93; GPU 1HPCore 339.61
- 4xSoC: GPU 1HPCore 610.88
- XeonWS: CPU 1T 15336.89; CPU 4T 5268.56; GPU 880.95
[Figure: three bar charts, axis range 0 to 16000]
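
Combining the runtime and energy figures gives the average power draw of each configuration. This is a sketch under stated assumptions: the transcript gives no units, so runtimes are taken to be seconds and energies joules, and the numbers are paired with the bar labels in slide order.

```python
# Derived metrics from the runtime and energy slides.
# Assumptions: runtimes are in seconds and energies in joules (units are not
# stated in the transcript); numbers are paired with bar labels in slide order.

MEASUREMENTS = {  # configuration: (runtime, energy)
    "1xSoC CPU 1HPCore": (2861.37, 12309.49),
    "1xSoC CPU 4HPCores": (717.37, 3650.93),
    "1xSoC GPU 1HPCore": (46.93, 339.61),
    "4xSoC GPU 1HPCore": (18.13, 610.88),
    "XeonWS CPU 1T": (192.15, 15336.89),
    "XeonWS CPU 4T": (48.28, 5268.56),
    "XeonWS GPU": (3.92, 880.95),
}


def average_power(runtime, energy):
    """Mean power draw = energy / runtime (watts under the stated assumptions)."""
    return energy / runtime


if __name__ == "__main__":
    for name, (runtime, energy) in MEASUREMENTS.items():
        print(f"{name}: {average_power(runtime, energy):.1f} W average")
```

Read this way, the single-SoC GPU run needs about 2.6 times less energy than the Xeon workstation's GPU run while taking about 12 times longer, which is the trade-off between runtime and energy that the talk is about.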

Publications
- 2015: "Multi-Core Processors and Graphics Processing Unit Accelerators for Parallel Retrieval of Aerosol Optical Depth from Satellite Data: Implementation, Performance and Energy Efficiency". J. Liu, D. Feld, Y. Xue, J. Garcke and T. Soddemann. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing.
- 2016: "Design of a Hybrid Parallel Workflow for Efficient Aerosol Optical Depth Retrieval from MODIS Satellite Data for Computers with Multi-core Processors and GPUs". J. Liu, D. Feld, Y. Xue, J. Garcke, T. Soddemann and P. Pan. International Journal of Digital Earth.
- 2016: "Energy-Efficiency and Performance Comparison of Aerosol Optical Depth (AOD) retrieval on distributed Embedded SoC architectures with Nvidia GPUs". D. Feld, E. Schricker, J. Liu, Y. Xue, J. Garcke and T. Soddemann. SCAI Book (Springer) [t.b.a.]
[Figure: two bar charts comparing CPU, MC, GPU, HYBRID and DYNAMIC on Workstation1 Core i7-960 3.20GHz 8T(HT), GTX460; Workstation32Xeon E3-1275 V2 3.50 GHz 8T(HT), GTX680; and 1xSoC ARM Cortex A15 2.30 GHz 4T, Kepler "192"]

Energy matters!
- Workflow as a tool
- Restrictions: power intake, energy consumption, runtime restriction (real-time), minimize runtime (post-processing), e.g. on-board missions
- Goals influence each other
- Extension of the methods, e.g. pixel sorting to reduce divergence (respecting dependencies)
- Further methods, other input data, other goals

Thanks for your attention! Questions?
Image: NASA Earth Observatory
dustin.feld@scai.fraunhofer.de
https://www.researchgate.net/profile/dustin_feld
https://de.linkedin.com/in/d3feld