Automating Real-time Seismic Analysis

Size: px

Start display at page:

Download "Automating Real-time Seismic Analysis"

Jeremy Davis
6 years ago
Views:

1 Automating Real-time Seismic Analysis Through Streaming and High Throughput Workflows Rafael Ferreira da Silva, Ph.D.

2 Do we need seismic analysis? Pegasus 2

3 USArray A continental-scale Seismic Observatory US Array TA (IRIS Service) 836 stations 394 stations have online data available Pegasus 3

4 The development of reliable risk assessment methods for these hazards requires real-time analysis of seismic data Pegasus 4

5 So, how to efficiently process these data? Pegasus 5

6 Experiment Timeline Scientific Problem Earth Science, Astronomy, Neuroinformatics, Bioinformatics, etc. Computational Scripts Shell scripts, Python, Matlab, etc. Distributed Computing Clusters, HPC, Cloud, Grid, etc. Scientific Result Models, Quality Control, Image Analysis, etc. Analytical Solution Automation Workflows, MapReduce, etc. Monitoring and Debug Fault-tolerance, Provenance, etc. Pegasus 6

7 What is involved in an experiment execution? Pegasus 7

8 Why Scientific Workflows? Automates complex, multi-stage processing pipelines Automate Enables parallel, distributed computations Automatically executes data transfers Reusable, aids reproducibility Recover Records how data was produced (provenance) Handles failures with to provide reliability Keeps track of data and files Debug Pegasus 8

9 Taking a closer look into a workflow DAG directed-acyclic graphs abstract workflow executable workflow optimizations storage constraints job Command-line programs split merge dependency Usually data dependencies pipeline Pegasus 9

10 From the abstraction to execution! stage-in job Transfers the workflow input data abstract workflow executable workflow optimizations storage constraints stage-out job Transfers the workflow output data registration job Registers the workflow output data Pegasus 10

11 Optimizing storage usage abstract workflow executable workflow optimizations storage constraints cleanup job Removes unused data Pegasus 11

Workflow systems provide tools to generate the abstract workflow dax = ADAG("test_dax") firstjob = Job(name="first_job") firstinputfile = File("input.txt") firstoutputfile = File("tmp.txt") firstjob.

addjob(firstjob) for i in range(0, 5): simuljob = Job(id="%s" % (i+1), name="simul_job ) simulinputfile = File("tmp.txt ) simuloutputfile = File("output.%d.dat" % i) simuljob.

12 Workflow systems provide tools to generate the abstract workflow dax = ADAG("test_dax") firstjob = Job(name="first_job") firstinputfile = File("input.txt") firstoutputfile = File("tmp.txt") firstjob.addargument("input=input.txt", "output=tmp.txt") firstjob.uses(firstinputfile, link=link.input) firstjob.uses(firstoutputfile, link=link.output) dax.addjob(firstjob) for i in range(0, 5): simuljob = Job(id="%s" % (i+1), name="simul_job ) simulinputfile = File("tmp.txt ) simuloutputfile = File("output.%d.dat" % i) simuljob.addargument("parameter=%d" % i, "input=tmp.txt, output=%s" % simuloutputfile.getname()) simuljob.uses(simulinputfile, link=link.input) simuljob.uses(simuloutputfile, line=link.output) dax.addjob(simuljob) dax.depends(parent=firstjob, child=simuljob) fp = open("test.dax", "w ) dax.writexml(fp) fp.close() abstract workflow Pegasus 12

13 Nimrod Pegasus Which Workflow Management System? Pegasus 13

14 and which model to use? task-oriented stream-based files data transfers via files streams data transfers via memory or message parallelism no concurrency concurrency tasks run concurrently heterogeneous execution tasks can run in heterogeneous resources homogeneous execution tasks should run in homogeneous resources Pegasus 14

15 What does Pegasus provide? Automation Automates pipeline executions Parallel, distributed computations Automatically executes data transfers Heterogeneous resources Task-oriented model Application is seen as a black box Debug Workflow execution and job performance metrics Set of debugging tools to unveil issues Real-time monitoring, graphs, provenance Recovery Job failure detection Checkpoint Files Job Retry Rescue DAGs Optimization Job clustering Data cleanup Pegasus 15

16 and dispel4py? Automation Automates pipeline executions Concurrent, distributed computations Stream-based model Workflow Composition Python Library Grouping (all-to-all, all-to-one, one-to-all) Mapping Sequential Multiprocessing (shared memory) Distributed memory, message passing (MPI) Distributed Real-time (Apache Storm) Apache Spark (Prototype) Optimization Multiple streams (in/out) Avoids I/O (shared memory or message passing) Pegasus 16

17 Pegasus ASTERISM sub-workflow dispel4py workflow Asterism greatly simplifies the effort required to develop data-intensive applications that run across multiple heterogeneous resources distributed in the wide area Pegasus 17

18 Where to run scientific workflows? Pegasus 18

19 High Performance Computing There are several possible configurations shared filesystem Workflow Engine submit host (e.g., user s laptop) typically most HPC sites Pegasus 19

20 Cloud Computing High-scalable object storages object storage Workflow Engine submit host (e.g., user s laptop) Typical cloud computing deployment (Amazon S3, Google Storage) Pegasus 20

21 Grid Computing local data management Workflow Engine Typical OSG sites Open Science Grid submit host (e.g., user s laptop) Pegasus 21

22 And yes you can mix everything! shared filesystem Compute site A Workflow Engine Compute site B submit host (e.g., user s laptop) Pegasus object storage 22

23 How do we use Asterism to automate seismic analysis? Pegasus 23

WORKFLOW Seismic Ambient Noise Cross-Correlation Preprocesses

acceleration in three dimensions) from multiple seismic

statistics for extracting information from the noise Phase 1

time for signals to travel between stations.

24 WORKFLOW Seismic Ambient Noise Cross-Correlation Preprocesses and cross-correlates traces (sequences of measurements of acceleration in three dimensions) from multiple seismic stations (IRIS database) Phase 1: data preparation using statistics for extracting information from the noise Phase 1 (pre-process) Phase 2: compute correlation, identifying the time for signals to travel between stations. Infers properties of the intervening rock Phase 2 (cross-correlation) Pegasus 24

fast, auto restart approach to processes Rich

receiving data from all types of sources spout

25 Seismic Ambient Noise Cross-Correlation Distributed computation framework for event stream processing Designed for massive scalability, supports faulttolerance with a fail fast, auto restart approach to processes Rich array of available spouts specialized for receiving data from all types of sources spout bolt Hadoop of real-time processing, very scalable Pegasus 25

26 Seismic workflow execution input data (~150MB) Compute site A (MPI-based) IRIS database (stations) Phase 1 data transfers between sites performed by Pegasus Phase 2 submit host (e.g., user s laptop) Compute site B (Apache Storm) output data (~40GB) Pegasus 26

27 Southern California Earthquake Center s CyberShake W ORKFLO W Builders ask seismologists: What will the peak ground motion be at my new building in the next 50 years? Seismologists answer this question using Probabilistic Seismic Hazard Analysis (PSHA) 286 sites, 4 models each workflow has 420,000 tasks Pegasus 27

28 A few more workflow features Pegasus 28

29 Performance, why not improve it? workflow restructuring workflow reduction hierarchical workflows pegasus-mpi-cluster clustered job Groups small jobs together to improve performance task small granularity Pegasus 29

30 What about data reuse? workflow restructuring workflow reduction data already available data reuse hierarchical workflows pegasus-mpi-cluster data also available workflow reduction data reuse Jobs which output data is already available are pruned from the DAG Pegasus 30

31 Handling large-scale workflows workflow restructuring workflow reduction hierarchical workflows pegasus-mpi-cluster sub-workflow sub-workflow recursion ends when workflow with only compute jobs is encountered 31

32 Running fine-grained workflows on HPC systems workflow restructuring workflow reduction hierarchical workflows pegasus-mpi-cluster submit host (e.g., user s laptop) HPC System workflow wrapped as an MPI job Allows sub-graphs of a Pegasus workflow to be submitted as monolithic jobs to remote resources Pegasus 32

Pegasus Automate, recover, and debug scientific computations. Pegasus Website http://pegasus.isi.

33 Pegasus Automate, recover, and debug scientific computations. Pegasus Website Get Started Users Mailing List Support HipChat

34 Automating Real-time Seismic Analysis Through Streaming and High Throughput Workflows Thank You Questions? Rafael Ferreira da Silva, Ph.D. Ewa Deelman Rafael Ferreira da Silva Karan Vahi Mats Rynge Rajiv Mayani

Pegasus. Automate, recover, and debug scientific computations. Rafael Ferreira da Silva.

Pegasus. Automate, recover, and debug scientific computations. Rafael Ferreira da Silva. Pegasus Automate, recover, and debug scientific computations. Rafael Ferreira da Silva http://pegasus.isi.edu Experiment Timeline Scientific Problem Earth Science, Astronomy, Neuroinformatics, Bioinformatics,