RESILIENT DISTRIBUTED DATASETS: A FAULT-TOLERANT ABSTRACTION FOR IN-MEMORY CLUSTER COMPUTING

1 RESILIENT DISTRIBUTED DATASETS: A FAULT-TOLERANT ABSTRACTION FOR IN-MEMORY CLUSTER COMPUTING
  Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, Ion Stoica
  Remzi Can Aksoy

2 TABLE OF CONTENTS
  Background
  RDD in a nutshell
  Spark: the implementation
  Evaluation
  What's new in Spark

3 RELATED WORKS
  General purpose: MapReduce, Dryad
    Problem: external storage is needed to reuse data across computations (iterative algorithms, ad-hoc queries)
  Specialized: Pregel (iterative graph computation), HaLoop (loops of MapReduce steps)
    Problem: don't generalize
  In-memory storage: distributed shared memory, key-value stores / databases, Piccolo
    Problem: hard to implement efficient fault tolerance

4 RDD IN A NUTSHELL
  Resilient Distributed Dataset
  General-purpose distributed memory abstraction
  In-memory
  Immutable
    Can only be created through deterministic operations (transformations)
  Atomic piece of data: partition
  Fault-tolerant

5 RDD - OPERATIONS
  Transformations: map, filter, union, join, etc.
  Actions: count, collect, reduce, lookup, save
  Diagram: External Source -> (transformation) -> RDD1 -> (transformation) -> RDD2 -> (action) -> External Result
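The split between transformations and actions can be illustrated with a small pure-Python sketch. This is not Spark code: `LazyDataset`, `read_source`, and the `reads` log are hypothetical names made up for illustration, and the class only mimics the idea that transformations record work while actions trigger it.

```python
class LazyDataset:
    """Toy stand-in for an RDD: records transformations, runs them only on an action."""

    def __init__(self, source, ops=()):
        self._source = source    # callable returning an iterable (the "external source")
        self._ops = tuple(ops)   # recorded transformations, applied in order

    # Transformations: return a new dataset and compute nothing.
    def map(self, f):
        return LazyDataset(self._source, self._ops + (("map", f),))

    def filter(self, pred):
        return LazyDataset(self._source, self._ops + (("filter", pred),))

    def _iterate(self):
        data = iter(self._source())
        for kind, fn in self._ops:
            data = map(fn, data) if kind == "map" else filter(fn, data)
        return data

    # Actions: trigger the computation and return a result.
    def collect(self):
        return list(self._iterate())

    def count(self):
        return sum(1 for _ in self._iterate())


reads = []  # one entry per actual read of the source


def read_source():
    reads.append("read")
    return ["error a", "info b", "error c"]


lines = LazyDataset(read_source)
errors = lines.filter(lambda s: s.startswith("error"))
upper = errors.map(str.upper)
reads_before_action = len(reads)  # nothing has been read yet
result = upper.collect()          # the action forces evaluation
```

Because each transformation returns a new object and leaves the old one untouched, the sketch also reflects the immutability of RDDs.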

6 RDD IN ACTION
  An RDD does not have to store the actual data; it stores lineage: how its partitions were derived from other datasets.

    val lines = spark.textFile("hdfs://...")
    val errors = lines.filter(_.startsWith("ERROR"))
    errors.persist()
    // Return the time fields of errors mentioning HDFS
    errors.filter(_.contains("HDFS")).map(_.split("\t")(3)).collect()

7 ADVANTAGES OF RDD
  DSM systems: applications read and write to arbitrary locations in a global address space
  RDDs: can only be created through coarse-grained transformations
  (i) Storing lineage is enough for recovery: only the lost partitions of an RDD need to be recomputed upon failure, and this can be done in parallel without rolling back the whole program.
  (ii) Slow nodes (stragglers) can be mitigated by running backup copies of their tasks. In DSM this is hard because the two copies of a task would access the same memory locations and conflict with each other.
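Point (i) can be sketched in a few lines of plain Python. The `LineageDataset` class and its `computed` log are hypothetical, written only to illustrate the idea; the sketch assumes a one-to-one (narrow) relationship between parent and child partitions and is not how Spark stores lineage internally.

```python
class LineageDataset:
    """Toy partitioned dataset that remembers how each partition is derived."""

    def __init__(self, num_partitions, compute, parent=None):
        self.num_partitions = num_partitions
        self.compute = compute   # source: index -> list; derived: parent partition -> list
        self.parent = parent     # lineage pointer (None for a source dataset)
        self.cache = {}          # partition index -> materialized data
        self.computed = []       # log of partition indices computed, in order

    def partition(self, i):
        if i not in self.cache:
            self.computed.append(i)
            if self.parent is None:
                self.cache[i] = self.compute(i)
            else:
                self.cache[i] = self.compute(self.parent.partition(i))
        return self.cache[i]

    def collect(self):
        return [x for i in range(self.num_partitions) for x in self.partition(i)]


source = LineageDataset(3, lambda i: list(range(i * 2, i * 2 + 2)))
squares = LineageDataset(3, lambda part: [x * x for x in part], parent=source)

before = squares.collect()   # materializes all three partitions
squares.computed.clear()

del squares.cache[1]         # simulate losing one partition to a worker failure
after = squares.collect()    # only the lost partition is rebuilt from its lineage
```

After the simulated failure, `squares.computed` holds only the index of the lost partition, and the result is the same as before the failure.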

8 PAGERANK

    val links = spark.textFile(...).map(...).persist()
    var ranks = // RDD of (URL, rank) pairs
    for (i <- 1 to ITERATIONS) {
      // Contributions sent by each page to the pages it links to
      val contribs = links.join(ranks).flatMap {
        case (url, (links, rank)) =>
          links.map(dest => (dest, rank / links.size))
      }
      // Sum contributions by URL and compute the new ranks
      ranks = contribs.reduceByKey((x, y) => x + y)
                      .mapValues(sum => a / N + (1 - a) * sum)
    }

  * Checkpoint if the lineage is too long
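For readers who want to trace the update rule by hand, here is a single-machine Python sketch of the same iteration, using the slide's formula a/N + (1 - a) * sum. The function name, the default values for `a` and `iterations`, and the three-page graph are assumptions chosen for illustration. Unlike the Spark version, the sketch keeps every page in `ranks` on each iteration and assumes every link target is itself a key of `links`.

```python
def pagerank(links, iterations=10, a=0.15):
    """links maps each URL to the list of URLs it points to."""
    n = len(links)
    ranks = {url: 1.0 / n for url in links}
    for _ in range(iterations):
        # Each page sends rank / (number of out-links) to every page it links to.
        contribs = {url: 0.0 for url in links}
        for url, dests in links.items():
            for dest in dests:
                contribs[dest] += ranks[url] / len(dests)
        # Same update as the slide: a/N + (1 - a) * sum of contributions.
        ranks = {url: a / n + (1 - a) * total for url, total in contribs.items()}
    return ranks


# Hypothetical three-page graph.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(graph)
```

In this graph every page has at least one out-link, so the ranks keep summing to 1 across iterations.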

9 RDD - PERSISTENCE & PARTITIONING
  Persistence level
    In-memory
    Disk-backed
    Replicated
  Partitioning
    Default: hashing
    User-defined
  Advantages
    Persistence pays off when recomputation is more costly than storage I/O
    Partitioning helps improve data locality
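The locality benefit of partitioning can be sketched as follows. When two keyed datasets are hash-partitioned the same way, matching keys land in the same partition index, so a join can proceed partition by partition. The helper names and the two tiny datasets are hypothetical, and `zlib.crc32` is used only because Python's built-in string hash varies between runs; Spark's own partitioners work differently.

```python
import zlib


def hash_partition(key, num_partitions):
    """Deterministic partition index for a string key."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def partition_by(pairs, num_partitions):
    parts = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        parts[hash_partition(key, num_partitions)].append((key, value))
    return parts


def local_join(left_parts, right_parts):
    """Join co-partitioned datasets one partition at a time, with no cross-partition lookups."""
    joined = []
    for left, right in zip(left_parts, right_parts):
        right_index = {}
        for k, v in right:
            right_index.setdefault(k, []).append(v)
        for k, v in left:
            for w in right_index.get(k, []):
                joined.append((k, (v, w)))
    return joined


links = partition_by([("a.com", ["b.com"]), ("b.com", ["a.com"])], 4)
ranks = partition_by([("a.com", 0.5), ("b.com", 0.5)], 4)
joined = sorted(local_join(links, ranks))
```

This is the reason the PageRank example benefits from keeping `links` partitioned and persisted across iterations.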

10 IMPLEMENTATION DETAILS

11 SPARK: THE IMPLEMENTATION
  Job scheduler
  Memory manager
  Interactive interpreter
  Not a cluster manager; runs on:
    Mesos
    YARN
    Standalone (added later)

12 SPARK: THE SCHEDULER
  Maps the high-level logical representation to low-level tasks
  Builds a DAG of stages to execute, based on the RDD's lineage
  Transformations are lazily computed
  Factors
    Data locality
    Pipelining
    Worker fault tolerance

13 REVISIT RDD DEPENDENCIES
  Narrow
    Pipelined execution
    Partition-wise
    Easy recovery
  Wide
    All parents must be present to compute any partition
    Full recomputation needed for recovery
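The difference between the two dependency types can be made concrete with a small Python sketch. Both functions are hypothetical helpers: the first rebuilds one child partition of a map (narrow), the second rebuilds one child partition of a group-by-key (wide). Integer keys and modulo partitioning are assumptions made to keep the example short.

```python
def map_partition(parent_parts, i, f):
    """Narrow: child partition i is computed from parent partition i alone."""
    return [f(x) for x in parent_parts[i]]


def group_by_key_partition(parent_parts, i, num_out):
    """Wide: child partition i pulls its keys from every parent partition."""
    groups = {}
    for part in parent_parts:          # all parents are scanned
        for key, value in part:
            if key % num_out == i:     # keep only the keys routed to partition i
                groups.setdefault(key, []).append(value)
    return sorted(groups.items())


parents = [[(1, "a"), (2, "b")], [(1, "c"), (3, "d")]]

mapped0 = map_partition(parents, 0, lambda kv: (kv[0], kv[1].upper()))
grouped1 = group_by_key_partition(parents, 1, 2)
```

Rebuilding `mapped0` touches only `parents[0]`, while rebuilding `grouped1` needs both parent partitions, which is why recovery across a wide dependency is more expensive.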

14 SPARK: THE SCHEDULER
  Stage: pipelined operations with narrow dependencies
  Stage boundaries
    Shuffle operations required by wide dependencies
    Already computed partitions
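A rough sketch of how stage boundaries fall out of a linear chain of operations is given below. It is a simplification made for illustration: the `WIDE` set is an assumption (a join, for example, can be narrow when its inputs are co-partitioned), the function handles only a straight-line lineage rather than a general DAG, and it ignores already computed partitions.

```python
# Operations assumed to need a shuffle in this sketch.
WIDE = {"groupByKey", "reduceByKey", "sortByKey", "join"}


def split_into_stages(ops):
    """Cut a linear chain of operation names into stages at each wide operation."""
    stages, current = [], []
    for op in ops:
        if op in WIDE and current:
            stages.append(current)   # close the pipelined stage before the shuffle
            current = []
        current.append(op)
    if current:
        stages.append(current)
    return stages


chain = ["textFile", "flatMap", "map", "reduceByKey", "mapValues", "sortByKey", "map"]
stages = split_into_stages(chain)
```

Within each resulting stage, the operations have narrow dependencies and can be pipelined over one partition at a time.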

15 SPARK: THE SCHEDULER
  Fault tolerance
    If a task fails, it is re-run on another node
    Tasks for missing partitions are resubmitted in parallel
    Only worker failures are tolerated
    Scheduler (master) failure can be recovered with an additional service such as ZooKeeper, or with a simple checkpoint on the local filesystem
  Optimization for long lineage: checkpointing
    The user decides which RDDs to checkpoint

16 SPARK: THE MEMORY MANAGER
  Options for storage of persistent RDDs
    Deserialized in-memory: the JVM can access each element natively
    Serialized in-memory: can be more space-efficient
    On disk: if there is not enough memory and recomputation is costly
  Insufficient memory
    LRU eviction (skipping the RDD currently being operated on)
    User-defined priority via a persistence priority
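The LRU policy with the "skip the current RDD" rule can be sketched in plain Python. `RddLruCache` is a hypothetical class written for this illustration; it counts capacity in partitions rather than bytes and ignores persistence priorities, so it is a simplified model rather than Spark's memory manager.

```python
from collections import OrderedDict


class RddLruCache:
    """Toy cache that evicts partitions of the least recently used RDD."""

    def __init__(self, capacity):
        self.capacity = capacity         # maximum number of partitions held
        self.partitions = OrderedDict()  # rdd_id -> {index: data}, least recently used first

    def _size(self):
        return sum(len(parts) for parts in self.partitions.values())

    def get(self, rdd_id, index):
        parts = self.partitions.get(rdd_id)
        if parts is None or index not in parts:
            return None
        self.partitions.move_to_end(rdd_id)  # mark the whole RDD as recently used
        return parts[index]

    def put(self, rdd_id, index, data):
        while self._size() >= self.capacity:
            # Pick the least recently used RDD, skipping the one being written.
            victim = next((r for r in self.partitions if r != rdd_id), None)
            if victim is None:
                return False  # only the current RDD is cached: keep its old partitions
            parts = self.partitions[victim]
            parts.pop(next(iter(parts)))  # evict one partition of the victim
            if not parts:
                del self.partitions[victim]
        self.partitions.setdefault(rdd_id, {})[index] = data
        self.partitions.move_to_end(rdd_id)
        return True


cache = RddLruCache(capacity=3)
cache.put("A", 0, "a0")
cache.put("A", 1, "a1")
cache.put("B", 0, "b0")
cache.get("A", 0)          # A becomes the most recently used RDD
cache.put("C", 0, "c0")    # the cache is full, so a partition of B is evicted
```

Refusing to evict partitions of the RDD being written avoids cycling partitions of the same dataset in and out of memory.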

17 EVALUATION
  Three of the paper's evaluations:
    Iterative machine learning
    Limited memory
    Interactive data mining

18 EVALUATION
  Iterative machine learning
    Spark: first vs. later iterations
    Hadoop, Spark: first iteration
    HadoopBinMem, Spark: later iterations
  EECS 582 F16

19 EVALUATION Limited memory

20 EVALUATION Interactive data mining

21 EXAMPLE - INVERTED INDEX
  Spark version

    """InvertedIndex.py"""
    from pyspark import SparkContext

    sc = SparkContext("local", "Inverted Index")
    docfile = "path/to/input/file"  # Should be some file on your system
    # each record is <docid, doccontent>; lines are assumed to be "docid<TAB>doccontent"
    docdata = sc.textFile(docfile).map(lambda line: line.split("\t", 1))
    # split words, of type <word, docid>
    docwords = docdata.flatMap(lambda kv: [(wd, kv[0]) for wd in kv[1].split()])
    # sort and then group by key, invindex is of type <word, list<docid>>
    invindex = docwords.sortByKey().groupByKey().mapValues(list)
    # persist
    invindex.saveAsTextFile("path/to/output/file")

22 EXAMPLE - INVERTED INDEX
  MapReduce version (pseudo code)

    map(String key, String value):
      // key: document id
      // value: document contents
      for each word w in value:
        EmitIntermediate(w, key);

    reduce(String key, Iterator values):
      // key: a word
      // values: a list of document ids
      sort(values)
      Emit(key, values)
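The pseudo code above can be tried on a single machine with a small Python sketch. `map_fn`, `reduce_fn`, and `run_mapreduce` are hypothetical names, the in-memory dictionary stands in for the shuffle phase, and the two sample documents are made up; this only mimics the MapReduce programming model, not a real framework.

```python
from collections import defaultdict


def map_fn(doc_id, contents):
    """Emit (word, doc_id) for each word in the document."""
    for word in contents.split():
        yield word, doc_id


def reduce_fn(word, doc_ids):
    """Sort the document ids collected for one word."""
    return word, sorted(doc_ids)


def run_mapreduce(docs):
    # Shuffle phase stand-in: group intermediate values by key.
    intermediate = defaultdict(list)
    for doc_id, contents in docs.items():
        for word, value in map_fn(doc_id, contents):
            intermediate[word].append(value)
    return dict(reduce_fn(word, values) for word, values in intermediate.items())


docs = {"d2": "rdd lineage", "d1": "spark rdd"}
index = run_mapreduce(docs)
```

As in the pseudo code, a word that occurs several times in one document would list that document id several times; removing duplicates would be an extra step.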

23 WHAT'S NEW IN SPARK
  Language bindings
    Java, Scala (original paper), Python, R
  Libraries built on top of Spark
    Spark SQL: work with structured data, mix SQL queries with Spark programs
    Spark Streaming: build scalable, fault-tolerant streaming applications
    MLlib: scalable machine learning library
    GraphX: API for graphs and graph-parallel computation
    SparkNet: distributed neural networks for Spark

24 TAKE AWAY
  RDD
    Immutable in-memory data partitions
    Fault tolerance using lineage, with optional checkpointing
    Lazily computed: nothing runs until the user requests a result
    Limited set of operations, but still quite expressive
  Spark
    Schedules computation tasks
    Moves data and code around the cluster
    Interactive interpreter
