An exceedingly high-level overview of ambient noise processing with Spark and Hadoop




1 IRIS: USArray Short Course in Bloomington, Indiana
Special focus: Oklahoma Wavefields

An exceedingly high-level overview of ambient noise processing with Spark and Hadoop

Presented by Rob Mellors but based on work by Steven Magana-Zook, Douglas Knapp and Eric Matzel
August 9, 2016
Previously presented at the 2017 Fall AGU

This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under contract DE-AC52-07NA
Lawrence Livermore National Security, LLC


2 Goals
- Can big data technologies reduce the time needed for extensive seismological time series computations? (Yes.)
- How difficult is it to implement? (Not simple.)
- Is it worthwhile? (I think so.)

Test data and setup
- 14 days of 500 sps continuous data from 500 stations
- 16-node Hadoop cluster
- Available Java code for ambient noise processing, designed and tested on a single workstation (e.g. data prep, cross-correlation, stacking)
- [We are not using GPUs.]
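To get a feel for the scale of the test setup, the numbers above can be turned into rough counts. This is a back-of-the-envelope sketch only: the 4 bytes per sample and one channel per station are assumptions that the slides do not state, and the hourly window follows the "1 hour" basic unit mentioned later in the talk.

```python
from math import comb

DAYS = 14
SAMPLE_RATE_SPS = 500
STATIONS = 500
BYTES_PER_SAMPLE = 4        # assumption: 32-bit samples (not stated in the slides)
CHANNELS_PER_STATION = 1    # assumption: one channel per station

# Raw sample counts and storage under the assumptions above
samples_per_station = DAYS * 24 * 3600 * SAMPLE_RATE_SPS
total_samples = samples_per_station * STATIONS * CHANNELS_PER_STATION
total_bytes = total_samples * BYTES_PER_SAMPLE

# Work units: every unordered station pair, for every one-hour window
station_pairs = comb(STATIONS, 2)
hourly_windows = DAYS * 24
pair_hours = station_pairs * hourly_windows

print(f"samples per station : {samples_per_station:,}")
print(f"total samples       : {total_samples:,}")
print(f"approx. raw size    : {total_bytes / 1e12:.2f} TB")
print(f"station pairs       : {station_pairs:,}")
print(f"hourly windows      : {hourly_windows}")
print(f"pair-hour correlations: {pair_hours:,}")
```

Under these assumptions the job amounts to tens of millions of hourly cross-correlations, which is the workload that is later divided into Spark tasks.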

3 What is big data (and how does it differ from HPC)?
$0.5 M and up
From Magana-Zook,


4 Big data
Move computation, not data

Fundamental tools
- Hadoop distributed filesystem (HDFS)
  - Filesystem for data distributed across nodes
  - Not like a normal filesystem (some Unix commands but no directories)
  - Write-once, read-many
  - Resilient to failures (mostly)
- YARN (Yet Another Resource Negotiator)
  - Handles resources (CPU, memory)
  - Scheduling
- Spark, MapReduce
  - Interface to YARN (and HDFS) that you can connect programs to
  - We are using Spark

Adapted from Magana-Zook,

5 Hadoop: HDFS and YARN
HDFS
- Replicates data across nodes
- Assumes hardware failures occur
- Written mostly in Java
YARN
- Resource manager
- Handles system resources (CPU, memory) and node managers
- Based on MapReduce but allows more complex applications
Average users should not have to worry about this too much
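Why replication makes HDFS resilient "mostly" can be illustrated with a toy model. The sketch below is not how HDFS actually places blocks (real placement is rack-aware); it simply puts each block on three randomly chosen nodes, three being the HDFS default replication factor, and checks which blocks become unreadable when nodes fail. The function names, the node names, and the 16-node layout used here are illustrative assumptions.

```python
import random
from itertools import combinations

def place_blocks(num_blocks, nodes, replication=3, seed=0):
    """Toy placement: each block gets `replication` distinct, randomly chosen nodes."""
    rng = random.Random(seed)
    return {block: set(rng.sample(nodes, replication)) for block in range(num_blocks)}

def lost_blocks(placement, failed_nodes):
    """A block is lost only if every node holding a replica has failed."""
    failed = set(failed_nodes)
    return [block for block, replicas in placement.items() if replicas <= failed]

nodes = [f"node{i:02d}" for i in range(16)]   # hypothetical 16-node cluster
placement = place_blocks(1000, nodes)

# With three replicas, no combination of two failed nodes can lose a block.
worst_two = max(len(lost_blocks(placement, pair)) for pair in combinations(nodes, 2))
print("blocks lost in the worst two-node failure:", worst_two)

# Losing all three nodes that hold block 0 does lose data.
print("blocks lost when block 0's three nodes fail:",
      len(lost_blocks(placement, placement[0])))
```

In this toy model any two simultaneous failures are survivable, while an unlucky third failure can remove the last copy of a block.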

6 Spark
- General-purpose distributed computation platform
- Scales from small to big
- Handles Scala, Java, and Python
- Data structure is the resilient distributed dataset (RDD)
- Handles iterative algorithms and keeps results in memory
- Needs a cluster manager and a distributed storage system
- Runs tasks in Java Virtual Machines (JVMs)
- Has a machine learning library

7 Spark
- Define distributed dataset (user)
- Write driver program (user)
- Connects to data nodes (in YARN)
- Assigns tasks (JVMs)
Vendors and source (open-source): Hortonworks, Cloudera

8 Example: Spark shell in Python

    pyspark --master local[*]

Load data

    datasetrdd = sc.textFile("file:///home/iris/okwavefields/dataset.tsv")
    # Count how many records are in the dataset
    datasetrdd.cache().count()
    # Create tuples of (key, average)
    # (sumbykey and countbykey are pair RDDs built in steps not shown on this slide)
    avgbykey = sumbykey.join(countbykey).map(lambda j: (j[0], j[1][0]/j[1][1]))
    # View the results
    avgbykey.take(2)

Adapted from Magana-Zook,
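The slide does not show how sumbykey and countbykey are built. As a rough illustration of what the join-and-divide step computes, here is the same arithmetic in plain Python, with no Spark or cluster needed. The two-column "key<TAB>value" layout and the sample rows are hypothetical; the real layout of dataset.tsv is not given. In PySpark the per-key sums and counts would typically come from reduceByKey on pair RDDs.

```python
from collections import defaultdict

# Hypothetical rows standing in for lines of dataset.tsv ("key<TAB>value")
lines = ["STA1\t2.0", "STA2\t10.0", "STA1\t4.0", "STA2\t30.0", "STA1\t6.0"]

# Parse each line into a (key, value) pair, as a map step would
pairs = []
for line in lines:
    key, value = line.split("\t")
    pairs.append((key, float(value)))

# Per-key sum and per-key count (the roles of sumbykey and countbykey)
sum_by_key = defaultdict(float)
count_by_key = defaultdict(int)
for key, value in pairs:
    sum_by_key[key] += value
    count_by_key[key] += 1

# Join the two on key and divide, giving (key, average)
avg_by_key = {key: sum_by_key[key] / count_by_key[key] for key in sum_by_key}
print(sorted(avg_by_key.items()))
```

Spark performs the same steps, but with the pairs partitioned across executors and the per-key reductions combined over the network.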

9 Example: Submitting a job to YARN

    spark-submit \
      --master yarn \
      --num-executors 2 \
      --executor-cores 2 \
      --class IU.IRIS.okalahoma.SparkBatchExercise \
      OK_wavefields.jar \
      dataset.tsv \
      SparkResults

Adapted from Magana-Zook,

10 Application to ambient noise correlation (ANC)
Overview:
- Need to calculate possible pairs, cross-correlate, and stack
- These will be stages in Spark
- Time required for each stage varies
- All parts of one stage must complete before going to the next stage
Write driver
- Read in all data information
- Calculate pairs
- Divide into tasks to be assigned
- Use 1 hour of all station pairs as a basic unit
- Handles missing data
- When complete, stack
- Includes calls to jar files for the computations
- On the order of 200 lines for the driver
Script passes driver to YARN (less than 20 lines)
Then wait...
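The pair, cross-correlate, and stack sequence can be sketched on a single machine in a few lines. This is an illustrative stand-in, not the LLNL driver or its Java routines: it uses a naive time-domain cross-correlation, skips the preprocessing that real ambient noise workflows apply, and runs sequentially rather than as Spark stages. All function names and the tiny synthetic traces are hypothetical.

```python
from itertools import combinations

def cross_correlate(a, b, max_lag):
    """Naive time-domain cross-correlation for lags -max_lag..+max_lag."""
    result = []
    for lag in range(-max_lag, max_lag + 1):
        total = 0.0
        for i in range(len(a)):
            j = i + lag
            if 0 <= j < len(b):
                total += a[i] * b[j]
        result.append(total)
    return result

def build_tasks(hourly_data):
    """One task per (hour, station pair); stations with no data that hour are skipped."""
    tasks = []
    for hour in sorted(hourly_data):
        present = sorted(hourly_data[hour])
        for sta_a, sta_b in combinations(present, 2):
            tasks.append((hour, sta_a, sta_b))
    return tasks

def correlate_and_stack(hourly_data, max_lag):
    """Cross-correlate every task, then sum the hourly results per station pair."""
    stacks, counts = {}, {}
    for hour, sta_a, sta_b in build_tasks(hourly_data):
        cc = cross_correlate(hourly_data[hour][sta_a], hourly_data[hour][sta_b], max_lag)
        pair = (sta_a, sta_b)
        previous = stacks.get(pair, [0.0] * len(cc))
        stacks[pair] = [s + c for s, c in zip(previous, cc)]
        counts[pair] = counts.get(pair, 0) + 1
    return stacks, counts

# Tiny synthetic example: station C has no data in hour 1
hourly_data = {
    0: {"A": [1.0, 0.0, 0.0, 0.0], "B": [0.0, 1.0, 0.0, 0.0], "C": [0.0, 0.0, 1.0, 0.0]},
    1: {"A": [1.0, 0.0, 0.0, 0.0], "B": [0.0, 1.0, 0.0, 0.0]},
}
stacks, counts = correlate_and_stack(hourly_data, max_lag=2)
for pair in sorted(stacks):
    print(pair, stacks[pair], "hours stacked:", counts[pair])
```

In a Spark version, the task list would be the distributed dataset, the correlation would run as a map over it, and the stacking would be a per-pair reduction, so the stacking stage cannot start until all correlations are finished.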

11 Results so far
- 500 stations, 14 days: about a week to 10 days (~16-node cluster)
- Have had some problems: handled two drive failures but not three
- Be careful with the driver: easy to accidentally skip a pair
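One way to guard against accidentally skipping a pair is to compare the pairs that actually produced output against the full set of expected pairs. The check below is a minimal sketch of that idea; the function name and the example station list are hypothetical and are not part of the original driver.

```python
from itertools import combinations

def missing_pairs(stations, completed_pairs):
    """Return expected station pairs that do not appear in the completed results."""
    expected = {tuple(sorted(pair)) for pair in combinations(stations, 2)}
    done = {tuple(sorted(pair)) for pair in completed_pairs}
    return sorted(expected - done)

stations = ["A", "B", "C", "D"]
# Pair order should not matter, so ("C", "A") counts as ("A", "C")
completed = [("A", "B"), ("C", "A"), ("A", "D"), ("B", "C"), ("C", "D")]

print(missing_pairs(stations, completed))
```

Running a check like this after the stacking stage makes a silently dropped pair visible before the results are used.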

12 Monitoring Hadoop cluster

13 YARN

14 Spark - stages

15 Conclusions
- If it is Dec. 1 and you want to process a bunch of data for AGU, probably not the best choice
- If you have an interest in big data or have a truly large dataset (1000s of stations), it is worthwhile
- Easy to start
- Cluster not required to learn
- Can run Hadoop, Spark, etc. on a laptop for testing
- Might be able to do it on Amazon cloud (but $ to transfer)
- Technology advancing quickly
- Or you can apply to be a summer student at LLNL
