Automating Real-time Seismic Analysis

Automating Real-time Seismic Analysis Through Streaming and High Throughput Workflows. Rafael Ferreira da Silva, Ph.D. http://pegasus.isi.edu

Do we need seismic analysis?

USArray: a continental-scale seismic observatory. US Array TA (IRIS service): 836 stations; 394 stations have online data available. http://ds.iris.edu/ds/nodes/dmc/earthscope/usarray/_us-ta-operational/

The development of reliable risk assessment methods for these hazards requires real-time analysis of seismic data.

So, how do we efficiently process these data?

Experiment Timeline: Scientific Problem (Earth Science, Astronomy, Neuroinformatics, Bioinformatics, etc.); Computational Scripts (shell scripts, Python, Matlab, etc.); Distributed Computing (clusters, HPC, cloud, grid, etc.); Scientific Result (models, quality control, image analysis, etc.); Analytical Solution: Automation (workflows, MapReduce, etc.), Monitoring and Debug (fault tolerance, provenance, etc.).

What is involved in an experiment execution?

Why scientific workflows? Automate, recover, and debug: automates complex, multi-stage processing pipelines; enables parallel, distributed computations; automatically executes data transfers; reusable, aids reproducibility; records how data was produced (provenance); handles failures to provide reliability; keeps track of data and files.

Taking a closer look into a workflow: a workflow is a DAG (directed acyclic graph). Jobs are command-line programs; dependencies are usually data dependencies; common patterns include split, merge, and pipeline.
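
This structure can be sketched in a few lines of plain Python (illustrative only; the job and file names are hypothetical): jobs declare the files they read and write, and the edges of the DAG fall out of those data dependencies.

# Minimal sketch of a workflow DAG: jobs declare the files they read/write,
# and dependencies are derived from those data dependencies.
jobs = {
    "split":  {"reads": ["input.txt"],      "writes": ["part.0", "part.1"]},
    "work_0": {"reads": ["part.0"],         "writes": ["out.0"]},
    "work_1": {"reads": ["part.1"],         "writes": ["out.1"]},
    "merge":  {"reads": ["out.0", "out.1"], "writes": ["result.txt"]},
}

def dependencies(jobs):
    # A job depends on every job that produces one of its input files.
    producers = {f: name for name, j in jobs.items() for f in j["writes"]}
    edges = []
    for name, j in jobs.items():
        for f in j["reads"]:
            if f in producers:
                edges.append((producers[f], name))
    return edges

print(dependencies(jobs))
# [('split', 'work_0'), ('split', 'work_1'), ('work_0', 'merge'), ('work_1', 'merge')]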

From the abstraction to execution! The abstract workflow is converted into an executable workflow. A stage-in job transfers the workflow input data; a stage-out job transfers the workflow output data; a registration job registers the workflow output data.

Optimizing storage usage: a cleanup job removes unused data.

Workflow systems provide tools to generate the abstract workflow:

from Pegasus.DAX3 import ADAG, File, Job, Link

dax = ADAG("test_dax")

# Root job: reads input.txt and produces tmp.txt
firstJob = Job(name="first_job")
firstInputFile = File("input.txt")
firstOutputFile = File("tmp.txt")
firstJob.addArguments("input=input.txt", "output=tmp.txt")
firstJob.uses(firstInputFile, link=Link.INPUT)
firstJob.uses(firstOutputFile, link=Link.OUTPUT)
dax.addJob(firstJob)

# Fan-out: five simulation jobs, each reading tmp.txt
for i in range(0, 5):
    simulJob = Job(id="%s" % (i + 1), name="simul_job")
    simulInputFile = File("tmp.txt")
    simulOutputFile = File("output.%d.dat" % i)
    simulJob.addArguments("parameter=%d" % i, "input=tmp.txt",
                          "output=%s" % simulOutputFile.name)
    simulJob.uses(simulInputFile, link=Link.INPUT)
    simulJob.uses(simulOutputFile, link=Link.OUTPUT)
    dax.addJob(simulJob)
    dax.depends(parent=firstJob, child=simulJob)

fp = open("test.dax", "w")
dax.writeXML(fp)
fp.close()
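
Once written out, the abstract workflow (the DAX) is handed to the planner, which produces the executable workflow described on the preceding slides. A minimal sketch of driving the planner from Python; the flags follow Pegasus 4.x-era conventions and may differ in newer releases, and the site names are placeholders:

import subprocess

# Plan (and submit) the abstract workflow written above.
# "condorpool" and "local" are placeholder site names from a site catalog.
subprocess.run(
    ["pegasus-plan", "--dax", "test.dax",
     "--sites", "condorpool", "--output-site", "local", "--submit"],
    check=True,
)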

Which workflow management system? (e.g., Nimrod, Pegasus)

...and which model to use?
Data: task-oriented uses files (data transfers via files); stream-based uses streams (data transfers via memory or messages).
Parallelism: task-oriented has no concurrency; stream-based has concurrency (tasks run concurrently).
Execution: task-oriented supports heterogeneous execution (tasks can run on heterogeneous resources); stream-based assumes homogeneous execution (tasks should run on homogeneous resources).
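
To make the contrast concrete, here is a library-agnostic Python sketch (not the API of any particular system): the task-oriented style moves data between steps via files, while the stream-based style pushes records through memory as they arrive.

# Task-oriented style: each step reads an input file and writes an output file.
def preprocess_task(in_path, out_path):
    with open(in_path) as fin, open(out_path, "w") as fout:
        for line in fin:
            fout.write(line.strip().lower() + "\n")

# Stream-based style: steps are connected in memory; records flow as they arrive.
def preprocess_stream(records):
    for rec in records:            # generator: no intermediate files
        yield rec.strip().lower()

def analyze_stream(records):
    for rec in records:
        yield (rec, len(rec))      # placeholder per-record analysis

# Stages operate concurrently over the flow of records.
stream = analyze_stream(preprocess_stream(iter(["A b C", "d E f"])))
print(list(stream))                # [('a b c', 5), ('d e f', 5)]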

What does Pegasus provide?
Automation: automates pipeline executions; parallel, distributed computations; automatically executes data transfers; heterogeneous resources; task-oriented model; the application is seen as a black box.
Debug: workflow execution and job performance metrics; a set of debugging tools to unveil issues; real-time monitoring, graphs, provenance.
Recovery: job failure detection; checkpoint files; job retry; rescue DAGs.
Optimization: job clustering; data cleanup.

...and dispel4py?
Automation: automates pipeline executions; concurrent, distributed computations; stream-based model.
Workflow composition: Python library; grouping (all-to-all, all-to-one, one-to-all).
Mappings: sequential; multiprocessing (shared memory); distributed memory, message passing (MPI); distributed real-time (Apache Storm); Apache Spark (prototype).
Optimization: multiple streams (in/out); avoids I/O (shared memory or message passing).

Asterism combines Pegasus and dispel4py: dispel4py workflows run as sub-workflows of a Pegasus workflow. Asterism greatly simplifies the effort required to develop data-intensive applications that run across multiple heterogeneous resources distributed in the wide area.

Where to run scientific workflows?

High-Performance Computing: there are several possible configurations. Typically, most HPC sites use a shared filesystem, with the workflow engine running on the submit host (e.g., the user's laptop).

Cloud Computing: highly scalable object storage. The typical cloud computing deployment stages data through an object store (Amazon S3, Google Storage), with the workflow engine running on the submit host (e.g., the user's laptop).

Grid Computing: typical Open Science Grid (OSG) sites use local data management, with the workflow engine running on the submit host (e.g., the user's laptop).

And yes, you can mix everything! For example, compute site A with a shared filesystem, compute site B with object storage, and the workflow engine on the submit host (e.g., the user's laptop).

How do we use Asterism to automate seismic analysis?

Workflow: Seismic Ambient Noise Cross-Correlation. It preprocesses and cross-correlates traces (sequences of measurements of acceleration in three dimensions) from multiple seismic stations (IRIS database). Phase 1 (pre-process): data preparation, using statistics to extract information from the noise. Phase 2 (cross-correlation): computes correlations, identifying the time for signals to travel between stations, from which properties of the intervening rock are inferred.
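
The core of Phase 2 can be illustrated with a toy example: cross-correlate two (already preprocessed) traces and read the inter-station travel time off the lag of the correlation peak. A minimal NumPy sketch on synthetic data; the sampling rate and traces are made up for illustration:

import numpy as np

fs = 20.0                                    # sampling rate in Hz (assumed)
rng = np.random.default_rng(0)
noise = rng.standard_normal(2000)

trace_a = noise                              # station A
trace_b = np.roll(noise, 80)                 # station B: same signal, delayed 80 samples

# Normalize and cross-correlate; the peak lag estimates the travel time.
a = (trace_a - trace_a.mean()) / trace_a.std()
b = (trace_b - trace_b.mean()) / trace_b.std()
xcorr = np.correlate(b, a, mode="full")
lag = np.argmax(xcorr) - (len(a) - 1)        # lag in samples
print("travel time ~ %.2f s" % (lag / fs))   # ~4.00 s for this synthetic example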

Seismic Ambient Noise Cross-Correlation: the streaming phase runs on Apache Storm, a distributed computation framework for event stream processing. It is designed for massive scalability and supports fault tolerance with a fail-fast, auto-restart approach to processes. It offers a rich array of available spouts specialized for receiving data from all types of sources; computations are expressed as spouts and bolts. It is often described as the "Hadoop of real-time processing" and is very scalable.

Seismic workflow execution: input data (~150 MB) comes from the IRIS database of stations; Phase 1 runs at compute site A (MPI-based) and Phase 2 at compute site B (Apache Storm); data transfers between sites are performed by Pegasus; the workflow is managed from the submit host (e.g., the user's laptop); output data is ~40 GB.

Workflow: the Southern California Earthquake Center's CyberShake. Builders ask seismologists: what will the peak ground motion be at my new building in the next 50 years? Seismologists answer this question using Probabilistic Seismic Hazard Analysis (PSHA). 286 sites, 4 models; each workflow has 420,000 tasks.

A few more workflow features: workflow restructuring, workflow reduction, hierarchical workflows, and pegasus-mpi-cluster.

Performance, why not improve it? Workflow restructuring: a clustered job groups small, fine-granularity tasks together to improve performance.
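
Clustering can be pictured as batching many short tasks into fewer, larger jobs, so that per-job overhead is paid once per batch rather than once per task. An illustrative sketch (hypothetical task names, not Pegasus's implementation):

# Group fine-grained tasks into clustered jobs of at most `size` tasks each.
def cluster(tasks, size):
    return [tasks[i:i + size] for i in range(0, len(tasks), size)]

tasks = ["task_%02d" % i for i in range(10)]
for batch in cluster(tasks, 4):
    # Each batch would be submitted as a single clustered job.
    print(batch)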

What about data reuse? Workflow reduction: jobs whose output data is already available are pruned from the DAG.
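
Data reuse can be sketched as a pruning pass over the DAG: any job whose outputs are all already available is dropped. A minimal illustration (hypothetical job table; Pegasus's actual data-reuse algorithm is more involved):

# Prune jobs whose outputs are all already available.
jobs = {
    "preprocess": {"writes": ["tmp.txt"]},
    "simulate":   {"writes": ["output.0.dat"]},
    "summarize":  {"writes": ["report.txt"]},
}
available = {"tmp.txt", "output.0.dat"}       # outputs found from an earlier run

reduced = {name: j for name, j in jobs.items()
           if not set(j["writes"]) <= available}
print(sorted(reduced))                        # ['summarize']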

Handling large-scale workflows: hierarchical workflows, in which jobs can themselves be sub-workflows; the recursion ends when a workflow containing only compute jobs is encountered.

Running fine-grained workflows on HPC systems: pegasus-mpi-cluster wraps the workflow as an MPI job, allowing sub-graphs of a Pegasus workflow to be submitted as monolithic jobs to remote resources (from the submit host, e.g., the user's laptop, to the HPC system).

Pegasus: automate, recover, and debug scientific computations. Website: http://pegasus.isi.edu. Get started: users mailing list pegasus-users@isi.edu; support pegasus-support@isi.edu; HipChat.

Automating Real-time Seismic Analysis Through Streaming and High Throughput Workflows. Thank you! Questions? Rafael Ferreira da Silva, Ph.D. (rafsilva@isi.edu). Ewa Deelman, Rafael Ferreira da Silva, Karan Vahi, Mats Rynge, Rajiv Mayani.