Pig on Spark
Mohit Sabharwal and Xuefu Zhang, 06/30/2015

Objective

The initial patch of the Pig on Spark feature was delivered by Sigmoid Analytics in September. Since then, a small team of developers from Intel, Sigmoid Analytics and Cloudera has been working towards feature completeness. This document gives a broad overview of the project. It describes the current design, identifies remaining feature gaps and, finally, defines project milestones.

Introduction

The Pig on Spark project proposes to add Spark as an execution engine option for Pig, alongside the current options of MapReduce and Tez. Pig Latin commands can be easily translated to Spark transformations and actions. Each command carries out a single data transformation such as filtering, grouping or aggregation. This characteristic translates well to Spark, where the data flow model enables step-by-step transformations of Resilient Distributed Datasets (RDDs). Spark will simply be plugged in as a new execution engine, so any optimizations or features added to Pig (like new UDFs, or logical and physical plan optimizations) will be automatically available to the Spark engine. For more information on the Pig and Spark projects, see References.

Motivation

The main motivations for enabling Pig to run on Spark are:

- Increase Pig adoption amongst users who would like to standardize on one (Spark) execution backend for operational convenience.
- Improve performance:
  - For Pig query plans that result in multiple MapReduce jobs, such jobs can be combined into a single Spark job so that each intermediate shuffle output (working dataset) is stored on local disk, and not replicated across the network on HDFS only to be read back again.
  - Spark re-uses YARN containers, so it does not need to launch new AppMaster and Task JVMs for each job.
  - Spark allows explicit in-memory caching of RDD datasets, which supports the multi-query implementation in Pig.
- Spark features like broadcast variables support the implementation of specialized joins in Pig, such as the fragment-replicate join.

Functionality

Pig on Spark users can expect all existing Pig functionality. Users may switch to the Spark execution engine by setting the SPARK_MASTER environment variable to point to the user's Spark cluster and specifying the -x spark argument on the pig command line.

Note: At this stage of development, testing has only been done in Spark local mode (i.e. with SPARK_MASTER set to local). Additional code changes and environment settings may be required to configure Pig with a Spark cluster.

The Spark engine will support:

- The EXPLAIN command, which displays the Spark execution engine operator plan.
- Progress, statistics and completion status for commands, as well as error and debug logs.

Design

The design approach is to implement Pig Latin semantics using Spark primitives. Since a Pig Latin command approximates a Spark RDD transformation, expressing Pig semantics directly as Spark primitives is a natural option.
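To illustrate this command-to-transformation mapping, here is a minimal sketch in plain Python, with lists standing in for RDDs. The script, field names and data are all invented for illustration; this is not actual Pig or Spark code.

```python
# Illustrative only: plain Python lists stand in for RDDs, and each
# step mirrors one Pig Latin command. Data and field names are invented.

# A = LOAD 'input' AS (name, dept, salary);   -> an RDD (bag) of tuples
A = [("ann", "eng", 100), ("bob", "eng", 80), ("cal", "ops", 60)]

# B = FILTER A BY salary > 70;                -> rdd.filter(...)
B = [t for t in A if t[2] > 70]

# C = GROUP B BY dept;                        -> a shuffle (e.g. CoGroupRDD)
C = {}
for t in B:
    C.setdefault(t[1], []).append(t)

# D = FOREACH C GENERATE group, COUNT(B);     -> rdd.map(...)
D = sorted((dept, len(ts)) for dept, ts in C.items())
print(D)  # [('eng', 2)]
```

Each Pig command becomes one small transformation over the previous dataset, which is exactly the shape of a Spark RDD pipeline.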
Moreover, like Pig, Spark supports lazy execution, which is triggered only when certain commands (actions, in Spark) are invoked. This design was part of the initial patch, and is in line with that of Pig on Tez. Note that this approach differs from the one adopted by the Hive on Spark project, which implements HiveQL semantics as MapReduce primitives that are, in turn, translated to Spark primitives.

Design Components

Pig Input Data Files as Spark RDDs

The first step in a Pig Latin program is to specify what the input data is and how its contents are to be deserialized, i.e., converted from the input format into Pig's data model, which views input as a sequence of Tuples (a.k.a. a Bag). This step is carried out by Pig's LOAD command, which returns a handle to the bag. This bag is then processed by the next Pig command, and so on. For the Spark engine, an input Pig bag is simply an RDD of Tuples, and each subsequent Pig command can be translated to one or more RDD transformations.

InputFormat and OutputFormat

PigInputFormat abstracts out the underlying input format for the execution engine such that it always returns Pig Tuples. It is a wrapper around Pig's LoadFunc which, in turn, is a wrapper around the underlying Hadoop InputFormat. All input and output formats supported by Pig should work with the Spark engine. No changes are expected related to input or output formats.

Logical Plan

A Pig Latin program is translated in a one-to-one fashion to a query plan called the LogicalPlan, containing LogicalRelationalOperators. Pig builds a logical plan for every independent bag defined in the program. No changes are expected to the logical plan.

Physical Plan

The LogicalPlan is converted to a PhysicalPlan containing PhysicalOperators. Note that some operators in the LogicalPlan and PhysicalPlan may carry optimization information that is available to be used by the execution engine (Spark). One such scenario is when ORDER BY is followed by a LIMIT operator.
This is discussed in the optimizations section later in the document.

Spark Plan

Spark Plan Compilation

The SparkCompiler identifies the Spark jobs that need to be run for a given physical plan. It groups physical operators into one or more SparkOperators such that each SparkOperator represents a distinct Spark job. A SparkPlan is simply a DAG of SparkOperators. The physical plan inside each SparkOperator forms the operator pipeline that gets executed by the execution engine. The purpose of creating a SparkPlan is twofold:

- It identifies all Spark jobs that need to be run.
- It allows Spark-specific optimizations to be performed on the plan before execution.

The design for the SparkPlan needs improvement. In the current implementation, we convert the SparkPlan into a pipeline of RDD transformations and immediately execute the RDD pipeline (by performing a Spark action). There is no intermediate step that allows optimization of the RDD pipeline, if deemed necessary, before execution. This task will entail re-working the current sparkPlanToRDD() code, for example by introducing an RDDPlan of RDDOperators.
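To make the grouping step concrete, here is a hypothetical sketch of cutting an operator pipeline into per-job groups. It is deliberately simplified: the real SparkCompiler works on a DAG of physical operators, and the operator names and the choice of "boundary" operators below are invented for illustration.

```python
# Hypothetical simplification: cut a linear pipeline of physical operators
# at boundary (shuffle-like) operators, so that each resulting group
# corresponds to one SparkOperator, i.e. one distinct Spark job.
BOUNDARY_OPS = {"POGlobalRearrange", "POSort", "PODistinct"}  # assumed set

def group_into_jobs(pipeline):
    jobs, current = [], []
    for op in pipeline:
        current.append(op)
        if op in BOUNDARY_OPS:   # a boundary operator ends the current job
            jobs.append(current)
            current = []
    if current:
        jobs.append(current)
    return jobs

plan = ["POLoad", "POFilter", "POGlobalRearrange", "POPackage", "POStore"]
print(group_into_jobs(plan))
# [['POLoad', 'POFilter', 'POGlobalRearrange'], ['POPackage', 'POStore']]
```

Each inner list would then be compiled into one RDD pipeline and triggered by a single Spark action.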
Spark Plan Execution

Executing a SparkPlan entails converting the underlying PhysicalOperators to RDDs and then triggering execution via a Spark action. Execution begins by converting the POLoad operator into an RDD<Tuple> using the Spark API that accepts a Hadoop input format (PigInputFormat). Next, we move down the SparkPlan's operator pipeline and perform a series of RDD transformations, resulting in a new RDD at each step. Step-by-step conversion of physical operators to RDDs is shown in the example below. For a given physical operator, an RDD transformation generally involves taking input tuples one by one from the predecessor RDD, attaching each to the underlying physical plan, and calling getNextTuple() on the leaf operator of the physical plan to do the actual processing. (Pig uses the pull model for execution of the physical operator pipeline.)

Physical Operator to Spark RDD Conversion

Every Pig Latin command translates to one or more physical operators. Converting every physical operator to a Spark RDD is an important milestone in feature completion.

- POLoad: Creates an RDD for a given HDFS file, read with PigInputFormat. FileSpec info is passed to the input format via the config. Returns an RDD of Pig Tuples, i.e. RDD<Tuple>. Spark APIs: sc.newAPIHadoopFile(), rdd.map()
- PODistinct: Shuffles using reduceByKey(). Note that Spark has an rdd.distinct() API as well; whether using distinct() is more optimal needs investigation. Spark APIs: rdd.reduceByKey()
- POForEach: Calls getNextTuple(). When GROUP BY is followed by FOREACH with algebraic UDFs or nested DISTINCT, there is an opportunity to use a combiner function. This optimization remains to be done for the Spark engine. Spark APIs: rdd.mapPartitions(...)
- POFilter: Calls getNextTuple(). Spark APIs: rdd.filter(...)
- POCross: N/A, since POCross is only used when CROSS is used inside a nested FOREACH.
Note that for (non-nested) CROSS, Pig parallelizes the operation by generating a randomized synthetic key (in the GFCross UDF) for every record, replicating the records, performing a shuffle based on the synthetic key, and then joining records in each reducer. The Spark engine simply re-uses this physical plan without any changes.

- POLimit: Calls getNextTuple(). Spark APIs: rdd.coalesce(), rdd.mapPartitions()
- POSort: Sorts using JavaPairRDD.sortByKey() with POSort.SortComparator. Spark APIs: rdd.map(), rdd.sortByKey(), rdd.mapPartitions()
- POSplit: Used either explicitly or implicitly in the case of multiple stores (the multi-query execution feature).
- POStore: Persists Pig Tuples (i.e. RDD<Tuple>) to HDFS using PigOutputFormat. Spark APIs: PairRDDFunctions.saveAsNewAPIHadoopFile()
- POUnion: Returns the union of all predecessor RDDs as a new UnionRDD. Spark APIs: new UnionRDD()
- POStream: Calls getNextTuple(). Spark APIs: rdd.mapPartitions()
- POSkewedJoin: Optimizes the case where there is significant skew in the number of records per key. Currently implemented as a regular RDD join. Spark APIs: JavaPairRDD.join()
- POFRJoin: A no-shuffle join when one input fits in memory. Currently implemented as a regular RDD join. Spark APIs: JavaPairRDD.join(), JavaPairRDD.leftOuterJoin()
- POMergeJoin: A no-shuffle join when both inputs are already sorted. Currently implemented as a regular RDD join. Spark APIs: JavaPairRDD.join()
- POLocalRearrange: Calls getNextTuple(). Generates tuples of the form (index, key, (tuple without key)). Spark APIs: rdd.map()
- POGlobalRearrange: Creates a new CoGroupRDD of the predecessor RDDs. Generates tuples of the form (bag index, key, {tuple without key}). The output is always processed next by the POPackage operator. Spark APIs: new CoGroupRDD(), rdd.map()
- POPackage: Packages globally rearranged tuples into the format required by co-group. Attaches each Pig tuple as input to the underlying physical operator and calls getNextTuple(). Spark APIs: rdd.map()

Note that Pig represents ordinary shuffle operations like GROUP BY as three physical operators: LocalRearrange (to identify the key and source), GlobalRearrange (to do the actual shuffle) and Package (to generate the output in each reducer). We use a Spark API (CoGroupRDD) to do the shuffle, and we only need to identify the key, not the sources, so the packaging step can be combined with the GlobalRearrange step for Spark. This optimization remains to be done for the Spark engine.

- PONative: Native MR. Follow up with native Spark.
- POCollectedGroup: Calls getNextTuple().
- POMergeGroup
- POCounter: Supports the RANK command and appears right before the PORank operator in the plan. The output is an RDD of tuples of the form (partition index, counter, tuple), where the counter is incremented for every record (there is special handling for DENSE rank).
- PORank: Appears right after the POCounter operator. Runs two Spark jobs: the first computes the number of records per partition index; the second computes the rank of each tuple by adding an offset to the counter values in the tuples, based on the output of the first job.
Spark APIs for the above: POCollectedGroup uses rdd.mapPartitions(); POCounter uses rdd.mapPartitionsWithIndex(); PORank uses rdd.mapToPair().groupByKey().sortByKey().collectAsMap() followed by rdd.map().

Status legend: Implemented, Needs optimal implementation, Not Implemented.

Special Considerations

Multi-Query Execution

Multi-query execution in Pig is motivated by the fact that users often process the same data set in multiple ways, but do not want to pay the cost of reading it multiple times. To address this case, Pig inserts a SPLIT operator for every logical operator that has multiple outputs, which essentially means "materialize state at this point". For the Spark engine, a SPLIT can be translated to an optimization step where the RDD data set is pulled into Spark's cluster-wide in-memory cache, such that child operators read from the cache. (In the MapReduce engine, child operators read from disk.) Without this optimization, Spark engine Pig jobs will run significantly slower in the multi-query case because RDDs will need to be recomputed. This optimization needs to be implemented.

Remaining Optimizations

Specialized Joins and Groups

Pig supports specialized joins like fragment-replicate join, merge join and skew join, as well as specialized grouping like collected groups and merge groups. These are explicitly specified by the user with the USING clause in the Pig command. (Pig does not automatically choose a specific join or group based on the input data set.)
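As an illustration of one of these specialized joins, a fragment-replicate join replicates the small input to every task (in Spark, e.g. via a broadcast variable) and joins map-side, avoiding a shuffle entirely. A plain-Python sketch, with invented relations and an assumed inner-join semantics:

```python
# Sketch of a fragment-replicate (map-side) join: the small relation is
# built into an in-memory hash table and replicated to every task; each
# partition of the large relation then joins locally, with no shuffle.
small = [("eng", "Engineering"), ("ops", "Operations")]   # fragment to replicate
large = [("ann", "eng"), ("bob", "eng"), ("cal", "hr")]   # one partition shown

lookup = dict(small)   # what a Spark broadcast variable would carry

def frjoin_partition(partition):
    # inner join: drop records whose key is absent from the small side
    return [(name, dept, lookup[dept])
            for name, dept in partition if dept in lookup]

print(frjoin_partition(large))
# [('ann', 'eng', 'Engineering'), ('bob', 'eng', 'Engineering')]
```

This is why broadcast-variable support in Spark (mentioned under Motivation) maps so directly onto Pig's fragment-replicate join.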
These are currently implemented as regular joins and groups. Specialized versions need to be implemented.

Secondary Sort

In Pig with the MapReduce engine, there are several map-side performance optimizations. A good example is secondary key sort:

B = GROUP A by FOO;
C = FOREACH B {
    D = ORDER A by BAR;
    GENERATE D;
}

MapReduce provides a specialized API to support secondary key sort within groups. Spark currently does not have support for secondary sort (SPARK-3655). Currently, secondary sort in the Spark engine is implemented using two shuffles. This needs to be fixed.

Combiner Optimizations

Using a combiner lowers shuffle volume and skew on the reduce side. The Pig combiner is an optimization that applies to certain FOREACH cases:

- In a nested FOREACH when the only nested operation is DISTINCT (i.e. dedupes in the map phase to reduce shuffle volume).
- In a non-nested FOREACH where all projections are either expressions on the GROUP column, or UDFs implementing the Algebraic interface.

The combiner either translates to an MR combiner or to a special Pig operator that does in-memory combining in the map stage (the "partial aggregation" feature). Combiner support is currently not implemented for the Spark engine.

Limit after Sort

In the MapReduce engine, a sort entails three MapReduce jobs: the first computes quantiles from samples of the input data, the second performs the shuffle partitioned on the quantile ranges, and the third is a 1-reduce-task shuffle that generates the final output. In the scenario where ORDER BY is followed by LIMIT n, the logical and physical plans do not have the POLimit operator. Instead, the sort operator (POSort) contains the limit information (see LimitOptimizer and LimitAdjuster). MapReduce uses the limit information to optimize the cost of sorting in the second MR job, where the combiner and the reducer stages output just the top n items. Currently, the Spark sort API does not take limit information.
Hence no limit-related optimization is implemented for the Spark engine.

Use Optimal Spark API for Shuffling

Currently, shuffle is implemented using Spark's groupBy, sortByKey and CoGroupRDD APIs. However, Spark has since added other variants like aggregateByKey (which also supports combiner functions).

Parallelism during Shuffle

Currently, no parallelism estimate is made when calling Spark's shuffle APIs, leaving Spark to set it.

Native Operator for Spark

For several reasons (performance, difficult translation to Pig, legacy code, etc.), a user may want to directly run Spark code written in Scala, Python or Java from a Pig script. This entails breaking the Pig pipeline, writing data to disk (an added POStore), invoking the native Spark script, and then reading the data back from disk (an added POLoad). Some issues:
- Coordination between Pig's Spark jobs and native Spark jobs.
- Adding stats and progress for native Spark jobs.
- Handling any security implications when running Spark code natively.

This is a low-priority item for the first version.

Packaging as Part of GlobalRearrange

As described earlier, the Packaging operator does not necessarily need its own RDD transformation in Spark and may be made part of the GlobalRearrange RDD transformation. This is an optimization step which can save a few extra transformations, though it might make things more confusing to diverge the behavior from MR and Tez.

Progress Reporting and Statistics

Basic support for Spark job progress reporting, statistics and logs has been implemented. It needs more work for comprehensive support.

Test Infrastructure

Unit Tests

The status of the latest unit test run is here. Unit tests with the Spark engine use the standard minidfs cluster. However, currently unit tests run in Spark local mode. Spark offers a way to run jobs in local-cluster mode, where a cluster is made up of a given number of processes on the local machine. Unit test execution needs to be switched to local-cluster mode once local mode tests pass. More Spark-specific unit tests need to be added. No testing has been done so far with an actual Spark cluster. Not much thought has been given so far to benchmark and performance testing.

Summary of Remaining Tasks

Design

The current design for the SparkPlan needs improvement, as mentioned earlier.

Functionality

All physical operators are supported at this point, except PONative. Unit test failures point to some important gaps in the existing implementation. These are highlighted as items that need to be implemented as part of Milestone 1 below.

Optimizations

- Specialized versions of joins and cogroups.
- Running the combiner for Algebraic UDFs, and the FOREACH optimization.
- Computing optimal parallelism for shuffles.
- Spark now has several similar shuffle APIs; we need to choose the optimal ones.
- More efficient implementation of secondary sort.
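The secondary sort item can be illustrated with a single sort on a composite key followed by local grouping, instead of the current two shuffles. A plain-Python sketch with invented data (the actual Spark-engine work is a separate implementation):

```python
# One-shuffle secondary sort, sketched with plain Python: sort once on the
# composite (group key, secondary key), then walk the sorted run and slice
# it into groups whose values come out already ordered. Data is invented.
from itertools import groupby

records = [("b", 3), ("a", 2), ("b", 1), ("a", 1)]

# single "shuffle": one sort on the (key, secondary key) composite
shuffled = sorted(records)

grouped = [(k, [v for _, v in run])
           for k, run in groupby(shuffled, key=lambda t: t[0])]
print(grouped)  # [('a', [1, 2]), ('b', [1, 3])]
```

Compare this with the two-shuffle approach: one shuffle to group by key, then a second per-group sort of the values.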
Spark Integration

Progress and error reporting support is implemented but needs improvement.

Tests

- Test with Spark local-cluster mode.
- Add remaining unit tests for Spark engine specific code.
- Test on a Spark cluster.
- Benchmarking and performance tests.

Comparison with Pig on Tez

Tez, as a backend execution engine, is very similar to Spark in that it offers the same optimizations that Spark does (it speeds up scenarios that
require multiple shuffles by storing intermediate output on local disk or in memory, re-uses YARN containers, and supports distributed in-memory caching). The main implementation difference when using Tez as a backend engine is that Tez offers a much lower-level API for expressing computation. From the direct user perspective, Tez also does not offer a built-in shell.

The Pig on Tez design is very similar to the current Pig on Spark design, in that it constructs a new plan directly from the PhysicalPlan. In Pig on Tez, every shuffle boundary translates into two Tez vertices, and the connecting edge expresses the fact that we are shuffling. In Pig on Spark, the API is not as low-level, so every shuffle is expressed as a high-level call to Spark like reduceByKey or CoGroupRDD. Significant refactoring of code was done in Pig 0.13 to support backends other than MapReduce, starting with Pig on Tez. Pig on Spark builds on that effort.

Milestones

This document lists milestones for ongoing work on Pig on Spark.

Milestone 1

Goal: Functional completeness of major items
ETA: ~ developer weeks

Missing Operators

- POCross (top-level and nested CROSS): PIG-4549 (top level) and PIG-4552 (nested). Owner: Mohit. ETA: 3 days.
- POFRJoin: PIG-4278. Owner: Mohit. ETA: 1 day.
- POMergeJoin: PIG-4422. Owner: Mohit. ETA: 1 day.
- PONative: Low-priority item for the first version; needs a (no-op) implementation. ETA: 1 day.
- POMergeGroup: ETA: 1 day.
- POSkewJoin: PIG-4421. Owner: Kelly.
- Fix or disable tests for specialized JOIN and GROUP/COGROUP operators.

Missing Features

- Support for custom Partitioner: PIG-4565. Used in DISTINCT, GROUP, JOIN, CROSS. Need to wrap the user's custom MR Partitioner in a Spark Partitioner object. Test: TestCustomPartitioner. Owner: Mohit. ETA: 1 week.
- Combiner support for Algebraic UDFs: Test: TestCombiner. Owner: TBD.
- Spark should call cleanup in the MR OutputCommitter API: Low-priority clean-up item. Not a must-do for the first version. Test: TestStore. ETA: 3 days.
- Support for HBase storage: PIG-4585, PIG-4611. Test: TestHBaseStorage. Owner: Mohit/Kelly. ETA: 4 days.
- Support for secondary sort: PIG-4504. Test: TestAccumulator#testAccumWithSort. Owner: Kelly. ETA: 2 days. Blocked by SPARK-7953 and multi-query.
- Use optimal shuffle API: Currently we use groupByKey, which assumes all values for a key will fit in memory. Use aggregateByKey or reduceByKey instead. ETA: 2 days.
- Secondary sort using one shuffle: PIG-4504 implements secondary sort using two shuffles. We should do it in one (PIG-4553). This is a performance item, but a well-used feature, so we should do it in the first milestone. ETA: 4 days.
- Multi-query support: TestMultiQuery, TestMultiQueryLocal broken. Owner: Kelly. ETA: 1 week.

Critical Tests

Corner cases failing in already-implemented features:

- TestJoin: Joining empty tuples fails. Owner: Kelly. ETA: 1 day.
- TestPigContext, TestGrunt: Deserializing the UDF classpath config fails in Spark because it is thread-local. Owner: Kelly. ETA: 3 days.
- TestProjectRange: PIG-4297. Range expressions fail with GROUP BY. Owner: Kelly. ETA: 2 days.
- TestAssert: FilterConverter issue? ETA: 1 day.
- TestLocationInPhysicalPlan: ETA: 1 day.

Other Tests

ETA: 1 week

Several tests are failing due to either ordering differences in shuffle results (MR returns sorted results, Spark doesn't), or gaps in SparkPigStats. We should fix these tests as we find time, as they are low-hanging fruit and might help us uncover other issues. These include TestScriptLanguage, TestPigRunner, TestJoinLocal, TestEvalPipelineLocal, TestDefaultDateTimeZone, etc.

Investigate and Fix If Needed

- Streaming: Tests are passing, but we need confirmation that this feature works. ETA: 1 day.

Spark Unit Tests

Few Spark engine specific unit tests have been written so far (for features that have Spark-specific implementations). The following is a partial list of what we need to add; it should be updated as we add more Spark-specific code. We should also add tests for the POConverter implementations.

- TestSparkLauncher
- TestSparkPigStats
- TestSecondarySortSpark: Owner: Kelly.

Enhance Test Infrastructure

ETA: ~2 weeks (additional test failures expected)

Use local-cluster mode to run unit tests and fix the resulting failures.

Milestone 2

Goal: Spark integration and remaining functionality items
ETA: ~5 developer weeks

Spark Integration

ETA: 2 weeks. Includes error reporting, and improved progress and stats reporting.

Fix Remaining Tests

ETA: 3 weeks

- TestScriptLanguageJavaScript
- TestPigRunner
- TestPruneColumn: Fixed in PIG-4582
- TestForEachNestedPlanLocal: Fixed in PIG-4552
- TestRank1
- TestPoissonSampleLoader
- TestPigServerLocal
- TestNullConstant: Fixed in PIG-4597
- TestCase: Fixed in PIG-4589
- TestOrcStoragePushdown

Milestone 3

Goal: Performance optimization and code cleanup
ETA: TBD

Performance Tests

TBD

Performance Optimizations

- Split / multi-query using RDD.cache()
- In GROUP + FOREACH aggregations, use aggregateByKey or reduceByKey for much better performance. For example, COUNT or DISTINCT aggregation inside a nested FOREACH is handled by Pig code; we should use Spark to do it more efficiently.
- Compute optimal shuffle parallelism. Currently we let Spark pick the default.
- Combiner support for GROUP + FOREACH.
- Multiple GROUP BYs on the same data set can avoid multiple shuffles. See MultiQueryPackager.
- Switch to Kryo for Spark data serialization. Are all Pig serializable classes compatible with Kryo?
- FR join
- Merge join (including sparse merge join)
- Skew join
- Merge cogroup
- Collected cogroup

Note that there is ongoing work in Spark SQL to support specialized joins. As an example, support for merge join is in Spark SQL in Spark 1.4 (SPARK-2213 and SPARK-7165). This implies that the Spark community will not be adding support for these joins in the Spark Core library.

Re-design Spark Plan

Currently, the SparkLauncher converts the SparkPlan to an RDD pipeline and immediately executes it. There is no intermediate step that allows optimization of the RDD pipeline, if deemed necessary, before execution. This will need re-working of sparkPlanToRDD(), perhaps by introducing an RDDPlan of RDDOperators.

Other Features

- Native Spark operator support.
- Allow a Spark Partitioner to be specified using PARTITION BY.

Getting Started

Github: Please refer to PIG-4059 for instructions on how to set up your development environment, PIG-4266 for instructions on how to run unit tests, and PIG-4604 for instructions on package import order.

References

[1] Pig
[2] Pig Latin
[3] Pig Execution Model
[4] Apache Spark wiki
[5] Spark
[6] Spark blog post
[7] Hive on Spark design doc
Title Duration : Apache Spark Development : 4 days Overview Spark is a fast and general cluster computing system for Big Data. It provides high-level APIs in Scala, Java, Python, and R, and an optimized
More informationData-intensive computing systems
Data-intensive computing systems University of Verona Computer Science Department Damiano Carra Acknowledgements q Credits Part of the course material is based on slides provided by the following authors
More informationOutline. CS-562 Introduction to data analysis using Apache Spark
Outline Data flow vs. traditional network programming What is Apache Spark? Core things of Apache Spark RDD CS-562 Introduction to data analysis using Apache Spark Instructor: Vassilis Christophides T.A.:
More informationThe Pig Experience. A. Gates et al., VLDB 2009
The Pig Experience A. Gates et al., VLDB 2009 Why not Map-Reduce? Does not directly support complex N-Step dataflows All operations have to be expressed using MR primitives Lacks explicit support for processing
More informationResearch challenges in data-intensive computing The Stratosphere Project Apache Flink
Research challenges in data-intensive computing The Stratosphere Project Apache Flink Seif Haridi KTH/SICS haridi@kth.se e2e-clouds.org Presented by: Seif Haridi May 2014 Research Areas Data-intensive
More informationCertified Big Data Hadoop and Spark Scala Course Curriculum
Certified Big Data Hadoop and Spark Scala Course Curriculum The Certified Big Data Hadoop and Spark Scala course by DataFlair is a perfect blend of indepth theoretical knowledge and strong practical skills
More informationHadoop. Introduction to BIGDATA and HADOOP
Hadoop Introduction to BIGDATA and HADOOP What is Big Data? What is Hadoop? Relation between Big Data and Hadoop What is the need of going ahead with Hadoop? Scenarios to apt Hadoop Technology in REAL
More informationSpark: A Brief History. https://stanford.edu/~rezab/sparkclass/slides/itas_workshop.pdf
Spark: A Brief History https://stanford.edu/~rezab/sparkclass/slides/itas_workshop.pdf A Brief History: 2004 MapReduce paper 2010 Spark paper 2002 2004 2006 2008 2010 2012 2014 2002 MapReduce @ Google
More informationShark. Hive on Spark. Cliff Engle, Antonio Lupher, Reynold Xin, Matei Zaharia, Michael Franklin, Ion Stoica, Scott Shenker
Shark Hive on Spark Cliff Engle, Antonio Lupher, Reynold Xin, Matei Zaharia, Michael Franklin, Ion Stoica, Scott Shenker Agenda Intro to Spark Apache Hive Shark Shark s Improvements over Hive Demo Alpha
More informationL3: Spark & RDD. CDS Department of Computational and Data Sciences. Department of Computational and Data Sciences
Indian Institute of Science Bangalore, India भ रत य व ज ञ न स स थ न ब गल र, भ रत Department of Computational and Data Sciences L3: Spark & RDD Department of Computational and Data Science, IISc, 2016 This
More informationHadoop. copyright 2011 Trainologic LTD
Hadoop Hadoop is a framework for processing large amounts of data in a distributed manner. It can scale up to thousands of machines. It provides high-availability. Provides map-reduce functionality. Hides
More informationResilient Distributed Datasets
Resilient Distributed Datasets A Fault- Tolerant Abstraction for In- Memory Cluster Computing Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael Franklin,
More informationHadoop. Introduction / Overview
Hadoop Introduction / Overview Preface We will use these PowerPoint slides to guide us through our topic. Expect 15 minute segments of lecture Expect 1-4 hour lab segments Expect minimal pretty pictures
More informationCERTIFICATE IN SOFTWARE DEVELOPMENT LIFE CYCLE IN BIG DATA AND BUSINESS INTELLIGENCE (SDLC-BD & BI)
CERTIFICATE IN SOFTWARE DEVELOPMENT LIFE CYCLE IN BIG DATA AND BUSINESS INTELLIGENCE (SDLC-BD & BI) The Certificate in Software Development Life Cycle in BIGDATA, Business Intelligence and Tableau program
More informationYARN: A Resource Manager for Analytic Platform Tsuyoshi Ozawa
YARN: A Resource Manager for Analytic Platform Tsuyoshi Ozawa ozawa.tsuyoshi@lab.ntt.co.jp ozawa@apache.org About me Tsuyoshi Ozawa Research Engineer @ NTT Twitter: @oza_x86_64 Over 150 reviews in 2015
More informationIBM Data Science Experience White paper. SparkR. Transforming R into a tool for big data analytics
IBM Data Science Experience White paper R Transforming R into a tool for big data analytics 2 R Executive summary This white paper introduces R, a package for the R statistical programming language that
More informationCompile-Time Code Generation for Embedded Data-Intensive Query Languages
Compile-Time Code Generation for Embedded Data-Intensive Query Languages Leonidas Fegaras University of Texas at Arlington http://lambda.uta.edu/ Outline Emerging DISC (Data-Intensive Scalable Computing)
More informationExpert Lecture plan proposal Hadoop& itsapplication
Expert Lecture plan proposal Hadoop& itsapplication STARTING UP WITH BIG Introduction to BIG Data Use cases of Big Data The Big data core components Knowing the requirements, knowledge on Analyst job profile
More informationAbout Codefrux While the current trends around the world are based on the internet, mobile and its applications, we try to make the most out of it. As for us, we are a well established IT professionals
More information6.830 Lecture Spark 11/15/2017
6.830 Lecture 19 -- Spark 11/15/2017 Recap / finish dynamo Sloppy Quorum (healthy N) Dynamo authors don't think quorums are sufficient, for 2 reasons: - Decreased durability (want to write all data at
More information2/26/2017. Originally developed at the University of California - Berkeley's AMPLab
Apache is a fast and general engine for large-scale data processing aims at achieving the following goals in the Big data context Generality: diverse workloads, operators, job sizes Low latency: sub-second
More informationIn-memory data pipeline and warehouse at scale using Spark, Spark SQL, Tachyon and Parquet
In-memory data pipeline and warehouse at scale using Spark, Spark SQL, Tachyon and Parquet Ema Iancuta iorhian@gmail.com Radu Chilom radu.chilom@gmail.com Big data analytics / machine learning 6+ years
More informationSpark. Cluster Computing with Working Sets. Matei Zaharia, Mosharaf Chowdhury, Michael Franklin, Scott Shenker, Ion Stoica.
Spark Cluster Computing with Working Sets Matei Zaharia, Mosharaf Chowdhury, Michael Franklin, Scott Shenker, Ion Stoica UC Berkeley Background MapReduce and Dryad raised level of abstraction in cluster
More informationIntroduction to BigData, Hadoop:-
Introduction to BigData, Hadoop:- Big Data Introduction: Hadoop Introduction What is Hadoop? Why Hadoop? Hadoop History. Different types of Components in Hadoop? HDFS, MapReduce, PIG, Hive, SQOOP, HBASE,
More information빅데이터기술개요 2016/8/20 ~ 9/3. 윤형기
빅데이터기술개요 2016/8/20 ~ 9/3 윤형기 (hky@openwith.net) D4 http://www.openwith.net 2 Hive http://www.openwith.net 3 What is Hive? 개념 a data warehouse infrastructure tool to process structured data in Hadoop. Hadoop
More informationHadoop & Big Data Analytics Complete Practical & Real-time Training
An ISO Certified Training Institute A Unit of Sequelgate Innovative Technologies Pvt. Ltd. www.sqlschool.com Hadoop & Big Data Analytics Complete Practical & Real-time Training Mode : Instructor Led LIVE
More informationMapReduce, Hadoop and Spark. Bompotas Agorakis
MapReduce, Hadoop and Spark Bompotas Agorakis Big Data Processing Most of the computations are conceptually straightforward on a single machine but the volume of data is HUGE Need to use many (1.000s)
More informationHADOOP COURSE CONTENT (HADOOP-1.X, 2.X & 3.X) (Development, Administration & REAL TIME Projects Implementation)
HADOOP COURSE CONTENT (HADOOP-1.X, 2.X & 3.X) (Development, Administration & REAL TIME Projects Implementation) Introduction to BIGDATA and HADOOP What is Big Data? What is Hadoop? Relation between Big
More informationJaql. Kevin Beyer, Vuk Ercegovac, Eugene Shekita, Jun Rao, Ning Li, Sandeep Tata. IBM Almaden Research Center
Jaql Running Pipes in the Clouds Kevin Beyer, Vuk Ercegovac, Eugene Shekita, Jun Rao, Ning Li, Sandeep Tata IBM Almaden Research Center http://code.google.com/p/jaql/ 2009 IBM Corporation Motivating Scenarios
More informationPrincipal Software Engineer Red Hat Emerging Technology June 24, 2015
USING APACHE SPARK FOR ANALYTICS IN THE CLOUD William C. Benton Principal Software Engineer Red Hat Emerging Technology June 24, 2015 ABOUT ME Distributed systems and data science in Red Hat's Emerging
More informationPig A language for data processing in Hadoop
Pig A language for data processing in Hadoop Antonino Virgillito THE CONTRACTOR IS ACTING UNDER A FRAMEWORK CONTRACT CONCLUDED WITH THE COMMISSION Apache Pig: Introduction Tool for querying data on Hadoop
More informationAn Introduction to Apache Spark
An Introduction to Apache Spark 1 History Developed in 2009 at UC Berkeley AMPLab. Open sourced in 2010. Spark becomes one of the largest big-data projects with more 400 contributors in 50+ organizations
More informationHadoop Map Reduce 10/17/2018 1
Hadoop Map Reduce 10/17/2018 1 MapReduce 2-in-1 A programming paradigm A query execution engine A kind of functional programming We focus on the MapReduce execution engine of Hadoop through YARN 10/17/2018
More informationDATA SCIENCE USING SPARK: AN INTRODUCTION
DATA SCIENCE USING SPARK: AN INTRODUCTION TOPICS COVERED Introduction to Spark Getting Started with Spark Programming in Spark Data Science with Spark What next? 2 DATA SCIENCE PROCESS Exploratory Data
More information2/4/2019 Week 3- A Sangmi Lee Pallickara
Week 3-A-0 2/4/2019 Colorado State University, Spring 2019 Week 3-A-1 CS535 BIG DATA FAQs PART A. BIG DATA TECHNOLOGY 3. DISTRIBUTED COMPUTING MODELS FOR SCALABLE BATCH COMPUTING SECTION 1: MAPREDUCE PA1
More informationA Tutorial on Apache Spark
A Tutorial on Apache Spark A Practical Perspective By Harold Mitchell The Goal Learning Outcomes The Goal Learning Outcomes NOTE: The setup, installation, and examples assume Windows user Learn the following:
More informationCloud Computing & Visualization
Cloud Computing & Visualization Workflows Distributed Computation with Spark Data Warehousing with Redshift Visualization with Tableau #FIUSCIS School of Computing & Information Sciences, Florida International
More informationSpark. In- Memory Cluster Computing for Iterative and Interactive Applications
Spark In- Memory Cluster Computing for Iterative and Interactive Applications Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael Franklin, Scott Shenker,
More informationHadoop Online Training
Hadoop Online Training IQ training facility offers Hadoop Online Training. Our Hadoop trainers come with vast work experience and teaching skills. Our Hadoop training online is regarded as the one of the
More informationData processing in Apache Spark
Data processing in Apache Spark Pelle Jakovits 5 October, 2015, Tartu Outline Introduction to Spark Resilient Distributed Datasets (RDD) Data operations RDD transformations Examples Fault tolerance Frameworks
More informationAn Introduction to Apache Spark
An Introduction to Apache Spark Anastasios Skarlatidis @anskarl Software Engineer/Researcher IIT, NCSR "Demokritos" Outline Part I: Getting to know Spark Part II: Basic programming Part III: Spark under
More informationImpala. A Modern, Open Source SQL Engine for Hadoop. Yogesh Chockalingam
Impala A Modern, Open Source SQL Engine for Hadoop Yogesh Chockalingam Agenda Introduction Architecture Front End Back End Evaluation Comparison with Spark SQL Introduction Why not use Hive or HBase?
More information1 Big Data Hadoop. 1. Introduction About this Course About Big Data Course Logistics Introductions
Big Data Hadoop Architect Online Training (Big Data Hadoop + Apache Spark & Scala+ MongoDB Developer And Administrator + Apache Cassandra + Impala Training + Apache Kafka + Apache Storm) 1 Big Data Hadoop
More informationBig Data Architect.
Big Data Architect www.austech.edu.au WHAT IS BIG DATA ARCHITECT? A big data architecture is designed to handle the ingestion, processing, and analysis of data that is too large or complex for traditional
More informationShark: Hive (SQL) on Spark
Shark: Hive (SQL) on Spark Reynold Xin UC Berkeley AMP Camp Aug 21, 2012 UC BERKELEY SELECT page_name, SUM(page_views) views FROM wikistats GROUP BY page_name ORDER BY views DESC LIMIT 10; Stage 0: Map-Shuffle-Reduce
More informationBig Data and Hadoop. Course Curriculum: Your 10 Module Learning Plan. About Edureka
Course Curriculum: Your 10 Module Learning Plan Big Data and Hadoop About Edureka Edureka is a leading e-learning platform providing live instructor-led interactive online training. We cater to professionals
More informationSpark. In- Memory Cluster Computing for Iterative and Interactive Applications
Spark In- Memory Cluster Computing for Iterative and Interactive Applications Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael Franklin, Scott Shenker,
More informationLambda Architecture with Apache Spark
Lambda Architecture with Apache Spark Michael Hausenblas, Chief Data Engineer MapR First Galway Data Meetup, 2015-02-03 2015 MapR Technologies 2015 MapR Technologies 1 Polyglot Processing 2015 2014 MapR
More informationApril Copyright 2013 Cloudera Inc. All rights reserved.
Hadoop Beyond Batch: Real-time Workloads, SQL-on- Hadoop, and the Virtual EDW Headline Goes Here Marcel Kornacker marcel@cloudera.com Speaker Name or Subhead Goes Here April 2014 Analytic Workloads on
More informationDatabases 2 (VU) ( / )
Databases 2 (VU) (706.711 / 707.030) MapReduce (Part 3) Mark Kröll ISDS, TU Graz Nov. 27, 2017 Mark Kröll (ISDS, TU Graz) MapReduce Nov. 27, 2017 1 / 42 Outline 1 Problems Suited for Map-Reduce 2 MapReduce:
More informationIndex. Symbols A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
Symbols A B C D E F G H I J K L M N O P Q R S T U V W X Y Z Symbols + addition operator?: bincond operator /* */ comments - multi-line -- comments - single-line # deference operator (map). deference operator
More informationHadoop: The Definitive Guide
THIRD EDITION Hadoop: The Definitive Guide Tom White Q'REILLY Beijing Cambridge Farnham Köln Sebastopol Tokyo labte of Contents Foreword Preface xv xvii 1. Meet Hadoop 1 Daw! 1 Data Storage and Analysis
More informationEvolution From Shark To Spark SQL:
Evolution From Shark To Spark SQL: Preliminary Analysis and Qualitative Evaluation Xinhui Tian and Xiexuan Zhou Institute of Computing Technology, Chinese Academy of Sciences and University of Chinese
More informationShuffling, Partitioning, and Closures. Parallel Programming and Data Analysis Heather Miller
Shuffling, Partitioning, and Closures Parallel Programming and Data Analysis Heather Miller What we ve learned so far We extended data parallel programming to the distributed case. We saw that Apache Spark
More informationDistributed Systems. 22. Spark. Paul Krzyzanowski. Rutgers University. Fall 2016
Distributed Systems 22. Spark Paul Krzyzanowski Rutgers University Fall 2016 November 26, 2016 2015-2016 Paul Krzyzanowski 1 Apache Spark Goal: generalize MapReduce Similar shard-and-gather approach to
More informationIntroduction to Apache Spark. Patrick Wendell - Databricks
Introduction to Apache Spark Patrick Wendell - Databricks What is Spark? Fast and Expressive Cluster Computing Engine Compatible with Apache Hadoop Efficient General execution graphs In-memory storage
More informationAn Overview of Apache Spark
An Overview of Apache Spark CIS 612 Sunnie Chung 2014 MapR Technologies 1 MapReduce Processing Model MapReduce, the parallel data processing paradigm, greatly simplified the analysis of big data using
More informationPractical Big Data Processing An Overview of Apache Flink
Practical Big Data Processing An Overview of Apache Flink Tilmann Rabl Berlin Big Data Center www.dima.tu-berlin.de bbdc.berlin rabl@tu-berlin.de With slides from Volker Markl and data artisans 1 2013
More informationDistributed Computation Models
Distributed Computation Models SWE 622, Spring 2017 Distributed Software Engineering Some slides ack: Jeff Dean HW4 Recap https://b.socrative.com/ Class: SWE622 2 Review Replicating state machines Case
More informationBIG DATA ANALYTICS USING HADOOP TOOLS APACHE HIVE VS APACHE PIG
BIG DATA ANALYTICS USING HADOOP TOOLS APACHE HIVE VS APACHE PIG Prof R.Angelin Preethi #1 and Prof J.Elavarasi *2 # Department of Computer Science, Kamban College of Arts and Science for Women, TamilNadu,
More informationDatabase Systems CSE 414
Database Systems CSE 414 Lecture 26: Spark CSE 414 - Spring 2017 1 HW8 due next Fri Announcements Extra office hours today: Rajiv @ 6pm in CSE 220 No lecture Monday (holiday) Guest lecture Wednesday Kris
More informationPig Latin: A Not-So-Foreign Language for Data Processing
Pig Latin: A Not-So-Foreign Language for Data Processing Christopher Olston, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, Andrew Tomkins (Yahoo! Research) Presented by Aaron Moss (University of Waterloo)
More informationImproving the MapReduce Big Data Processing Framework
Improving the MapReduce Big Data Processing Framework Gistau, Reza Akbarinia, Patrick Valduriez INRIA & LIRMM, Montpellier, France In collaboration with Divyakant Agrawal, UCSB Esther Pacitti, UM2, LIRMM
More informationData Engineering. How MapReduce Works. Shivnath Babu
Data Engineering How MapReduce Works Shivnath Babu Lifecycle of a MapReduce Job Map function Reduce function Run this program as a MapReduce job Lifecycle of a MapReduce Job Map function Reduce function
More informationCOSC 6339 Big Data Analytics. Introduction to Spark. Edgar Gabriel Fall What is SPARK?
COSC 6339 Big Data Analytics Introduction to Spark Edgar Gabriel Fall 2018 What is SPARK? In-Memory Cluster Computing for Big Data Applications Fixes the weaknesses of MapReduce Iterative applications
More information