
High Performance Data Analytics for Numerical Simulations. Bruno Raffin, DataMove, bruno.raffin@inria.fr. April 2016

About this Talk
- HPC for analyzing the results of large-scale parallel numerical simulations (and not Big Data applications on HPC platforms)
- Most of my examples are taken from molecular dynamics
- Good overview document: the 2013 DOE report on Synergistic Challenges in Data-Intensive Science and Exascale Computing

Exascale: 2016 -> 2020
- Tianhe-2 (China), #1 @ Top 500 (2016): 33 PetaFLOPS, 3 120 000 cores, 17.6 MW
- Exascale machine (~2020): 1 ExaFLOPS, O(1 000 000 000) cores, 20 MW

The Data Challenge
More compute capability -> larger simulations -> more data
Usability challenge: how to extract meaningful information from this huge amount of data in a reasonable time?
- Analysis tools have not been considered first-class citizens so far: they did not receive the same attention as simulation codes. Today analysis codes are either:
  - embedded in the simulation codes,
  - scripts (with limited parallelism), or
  - built on scientific visualization tools like ParaView/VTK or VisIt (reasonable parallelism support).
Performance challenge: moving data becomes the bottleneck for simulation as well as data analytics
- Compute capabilities increase faster than data transfer capabilities
- Data movements and storage consume 50%-70% of the total energy (SciDAC Review 1001)

Traditional Workflow (diagram): job submission -> job scheduler -> execution of the numerical simulation on a big machine (simulation codes may include some analysis) -> data written to disks -> data analytics and visualization on a small machine (laptop), with limited support for parallelization. Not sustainable at Exascale!

A Data Challenge Already Present
Scientists already spend a significant part of their effort on data analysis:
- Computational biology (2013): molecular dynamics simulation with Gromacs, 21 000 000 CPU hours (Curie supercomputer), more than 5 TB of data; analysis (VMD, MDAnalysis) still on-going work.
- Material science: molecular dynamics simulation with Stamps, 700 million atoms on 4096 cores, 1 million iterations; output: one snapshot every 10 000 iterations, 100 GB each; analysis (in-simulation code, ParaView/VTK) takes about 30% of the simulation wall clock time.
A simple but classical strategy to limit the impact of the data challenge: reduce the output frequency (see the back-of-the-envelope sketch below).
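As a rough illustration of the I/O volume implied by the Stamps run above (a back-of-the-envelope sketch using only the numbers quoted on this slide):

    # back-of-the-envelope from the Stamps numbers above: one 100 GB output
    # every 10 000 iterations, over 1 000 000 iterations
    outputs = 1_000_000 // 10_000        # = 100 snapshots
    total_tb = outputs * 100 / 1000      # = 10 TB written overall
    print(outputs, "outputs,", total_tb, "TB")

Halving the output frequency halves that volume, which is why reducing output frequency remains the default mitigation despite the information it discards.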

Big Data: Google Map/Reduce
Google Map/Reduce (2004):
- Two data-parallel operators: map and reduce
- Values are indexed with a key (key/value model)
- Parallel execution on a cluster (distributed memory)
- The runtime takes care of task scheduling, load balancing and fault tolerance
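A minimal sketch of this programming model in Python (illustrative only, not Google's implementation): a word count where map emits (word, 1) pairs and reduce sums the values emitted for each key.

    # Minimal map/reduce sketch: data flows as (key, value) pairs through a
    # user-defined map and a user-defined reduce; the "shuffle" groups by key.
    from collections import defaultdict

    def map_fn(_, line):
        # emit one (word, 1) pair per word of the input line
        for word in line.split():
            yield word, 1

    def reduce_fn(word, counts):
        # combine all values emitted for the same key
        yield word, sum(counts)

    def run_mapreduce(records, map_fn, reduce_fn):
        groups = defaultdict(list)               # shuffle: group values by key
        for key, value in records:
            for k, v in map_fn(key, value):
                groups[k].append(v)
        return [out for k, vs in groups.items() for out in reduce_fn(k, vs)]

    lines = enumerate(["the quick brown fox", "the lazy dog"])
    print(run_mapreduce(lines, map_fn, reduce_fn))

In the real system the same two functions run distributed over a cluster, with the runtime handling scheduling, load balancing and fault tolerance.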

Big Data: Beyond Map/Reduce
The original model has been extended in different ways (Spark, Flink) to support complex analysis plans:
- More operators (join, union, ...)
- In-memory data store
- Iterative scripts
- Streaming (interactive scripts)
Augmented with specialization layers to support:
- SQL queries
- Large graph processing
- Machine learning
But tailored for:
- Running on cloud infrastructures (they do not leverage supercomputer specifics)
- Processing web data (web pages, tweets, ...)
- And Java-based
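A hedged sketch of what this extended model looks like from the user side, here with PySpark (assumes a local Spark installation; the dataset and the analysis are illustrative, not taken from the talk):

    # Spark-style plan with operators beyond map/reduce (join, filter) and an
    # in-memory cache that iterative scripts can reuse.
    from pyspark import SparkContext

    sc = SparkContext("local[2]", "beyond-mapreduce")

    energies = sc.parallelize([(0, 1.2), (1, 0.9), (2, 1.1)])   # (timestep, energy)
    labels   = sc.parallelize([(0, "eq"), (1, "prod"), (2, "prod")])

    # richer operators: join the two keyed datasets, keep production timesteps
    prod = energies.join(labels).filter(lambda kv: kv[1][1] == "prod")

    # in-memory data store: cache the intermediate result for reuse
    cached = prod.mapValues(lambda v: v[0]).cache()
    mean = cached.values().sum() / cached.count()
    print("mean production energy:", mean)

    sc.stop()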

HiMach [Tu et al., SC 2008]
- A map/reduce-like framework for analyzing molecular dynamics trajectories
- Key/value store + map/reduce-like operators
- Implementation: Python + MPI
- No fault tolerance
- Uses VMD for some compute kernels
- Some analyses only need to keep one timestep at a time in memory (e.g., counting ions passing through a channel); others need a sliding window of timesteps (e.g., RMSD over a sliding window)
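An illustrative Python + MPI pattern in the spirit of such a framework (the actual HiMach API is not reproduced here; the frame count and per-frame kernel are made up): each rank maps an analysis kernel over its share of timesteps, then the keyed results are gathered for a final reduce.

    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()

    n_frames = 1000                            # hypothetical trajectory length
    my_frames = range(rank, n_frames, size)    # round-robin assignment of timesteps

    def analyze_frame(t):
        # stand-in for a per-timestep kernel (e.g. a VMD-based measurement)
        coords = np.random.default_rng(t).random((100, 3))
        return t, float(coords.mean())

    local = [analyze_frame(t) for t in my_frames]   # "map" over (timestep, frame)
    gathered = comm.gather(local, root=0)           # collect partial results
    if rank == 0:
        results = dict(kv for part in gathered for kv in part)   # final "reduce"
        print(len(results), "frames analyzed")

A sliding-window analysis (e.g. RMSD) would instead assign contiguous blocks of timesteps per rank so each window stays local.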

VelaSSco (FP7)
- Query-based scientific visualization of FEM/DEM simulation data
- Hadoop software suite (MapReduce, HDFS, HBase, YARN, Thrift)
- Key/value model: (timestep + rank-id, data)
- The scientist requests some visualization (e.g., an isosurface for a given timestep): vis client <-> front server <-> map/reduce job <-> HBase
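A small sketch of the (timestep + rank-id, data) keying idea (the key format and the in-memory dict standing in for the HBase table are assumptions for illustration, not VelaSSco's actual schema):

    def make_key(timestep, rank_id):
        # compose the row key from the timestep and the writing rank
        return f"{timestep:010d}#{rank_id:05d}"

    store = {}   # stands in for the HBase table
    store[make_key(42, 3)] = b"partition 3 of the results at timestep 42"

    # a visualization request for timestep 42 scans all partitions of that timestep
    prefix = f"{42:010d}#"
    hits = {k: v for k, v in store.items() if k.startswith(prefix)}
    print(sorted(hits))

Prefixing keys with the timestep keeps all partitions of one timestep contiguous, so a per-timestep query maps naturally onto a range scan feeding the map/reduce job.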

Traditional Workflow (recap of the diagram above): simulation on a big machine, data written to disks, analytics and visualization on a small machine with limited support for parallelization. Not sustainable at Exascale!

Workflow with Map/Reduce (diagram): job submission -> job scheduler -> execution of the numerical simulation on a big machine (simulation codes may include some analysis) -> data written to disks -> data analytics and visualization on a Map/Reduce cluster, with a high-level parallel programming model. This does not fix the data movement bottleneck.

Workflow with In-situ Analytics (diagram): job submission -> job scheduler -> execution on the big machine, with the numerical simulation interleaved with analytics -> reduced data movements -> data written to disks -> data analytics and visualization. In-situ analytics enables data reduction, large-scale parallel analytics and on-line monitoring.

In Situ Processing: What for?
- Data compression (Isabela [Lehmann et al., LDAV 14])
- Indexing (FastBit, DIRAQ [Lakshminarasimhan et al., HPDC 13])
- Storage (DataSpaces [Docan et al., Cluster Computing 12])
- Analytics (1D, 2D, 3D descriptor computing) [Dreher et al., Faraday Discussions 14]

In-simulation Processing (timeline): without analytics, each step is simulation iteration(s) followed by I/O; with in-simulation processing, the analytics runs inside the simulation code between the simulation iterations and the I/O. Simulation slowdown is mainly due to cache thrashing.

In-situ Processing (timeline): the simulation copies its data, and analytics plus I/O then run concurrently with the next simulation iterations. In-situ means simulation and analytics share the same nodes; the slowdown comes from the concurrent use of some resources by analytics and I/O. Resource allocation strategies: time sharing or space sharing (dedicated helper core); see the sketch below for the space-sharing variant.
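A minimal sketch of the space-sharing (helper core) idea in Python (illustrative only; a real in-situ framework would use shared memory or MPI rather than a multiprocessing queue, and would pin the helper to a dedicated core):

    import multiprocessing as mp
    import numpy as np

    def helper(queue):
        # dedicated helper process: consumes copies of the simulation output
        while True:
            data = queue.get()
            if data is None:                 # sentinel: simulation finished
                break
            print("in-situ analytics: mean density =", float(data.mean()))

    if __name__ == "__main__":
        queue = mp.Queue()
        analytics = mp.Process(target=helper, args=(queue,))
        analytics.start()
        for it in range(5):                  # stand-in for simulation iterations
            field = np.random.random((64, 64, 64))
            queue.put(field)                 # data copy; the simulation then continues
        queue.put(None)
        analytics.join()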

In-transit Processing (timeline): on the simulation node, each set of iterations is followed by a data copy and a communication to a staging node; on the staging node, analytics and I/O run asynchronously. In-transit means simulation and analytics run on different nodes (staging nodes); a sketch follows.
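A hedged mpi4py sketch of this pattern (the rank layout, message tag and payload are assumptions for illustration): one rank plays the staging node, the others run the simulation and ship a copy of their data after each iteration.

    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()
    staging = size - 1                       # last rank plays the staging node
    iterations = 3

    if rank == staging and size > 1:
        for _ in range(iterations * (size - 1)):
            frame = comm.recv(source=MPI.ANY_SOURCE, tag=0)
            # analytics + I/O would run here, off the simulation nodes
            print("staging node received", frame.shape)
    elif size > 1:
        for it in range(iterations):
            field = np.random.random(1024)          # stand-in for one iteration's output
            comm.send(field, dest=staging, tag=0)   # data copy + communication
            # the simulation continues with the next iteration

Run with at least two MPI ranks; production systems use one staging node for many compute nodes (e.g. 1 for 64 in the Gromacs experiments below).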

In-Sim vs. In-Situ I/O [Dreher, CCGRID 14]
- Compared configurations: Gromacs with no I/O, with in-situ I/O on a helper core, and with Gromacs native I/O (in-simulation); 2048 cores (froggy@ciment)
- Gromacs without I/O on 15 cores/node is 3% slower than on 16 cores/node (it would be 6% slower if scalability were perfect)

Parallel In-Situ Isosurface Extraction [Dreher, CCGRID 14]
- Compute a molecular surface based on atom density
- Tested different distributions of the processing steps between in-situ and in-transit nodes

Performance [Dreher, CCGRID 14]
Gromacs + isosurface versus Gromacs with no I/O; in-transit: 1 staging node every 64 compute nodes (froggy@ciment).
- Density-in-transit: costs 7% compared to Gromacs on 15 cores/node
- Density-in-situ: costs 8% but uses 1.5% fewer nodes than density-in-transit
- Atoms-in-transit: costs 8.6% but enables other in-transit analytics (3x more data to move to the staging nodes than density-in-transit)

In-Situ Analytics Status
- ParaView and VisIt support in-simulation data processing
- Advanced prototypes supporting in-situ and in-transit: FlexIO (IPDPS 13), Damaris (Cluster 12), FlowVR (CCGrid 14)
- In-memory data storage on staging nodes: DataSpaces
- Programming models: MPI level (Damaris), in the I/O library (ADIOS), data-flow (FlowVR)
No standard yet.

Conclusion and Discussion
- Map/Reduce model: successful in Big Data, why not in HPC? High-level programming model, efficient executions.
- In-situ analytics: a paradigm shift, and an opportunity to rethink the use of the I/O budget.
- In-situ versus post-mortem analysis: different tools or the same one? Interface between the two worlds with an in-memory database (à la DataSpaces)?
- Programming model: data-flow oriented (à la Map/Reduce) or a more classical HPC approach (à la MPI)?
- Reuse Big Data software stacks or develop HPC-specific ones?
