Survey of Big Data Frameworks for Different Application Characteristics


Praveen Kumar Singh, TCS Research, Mumbai, India. Rekha Singhal, TCS Research, Mumbai, India.

Abstract: Applications are migrating from the traditional three-tier architecture to Big data platforms, which are widely available as open source and can perform parallel data processing on clusters of commodity machines. The challenge is to choose the right Big data framework for an application, given the features each framework offers. We propose a rule-based methodology for making this choice. We first categorize an application into one of four major categories based on its features: iterative computation based applications, SQL query based analytic workloads, online transaction processing applications and streaming based applications. The proposed rule base records the level of support that popular Big data frameworks provide for each application feature. We present the rule base for each of these application categories in this paper.

Keywords: SQL, NoSQL, Machine Learning, Rule Base, Hadoop, Spark, Flink, Iterative Computation, Streaming, Graph, Framework

I. INTRODUCTION

In the world of Big data, enormous volumes of data, on the order of petabytes and beyond, are generated by sources such as social networks, satellites, sensors, user devices and search engines. To analyze this large data and extract useful information from it, large-scale data processing frameworks have emerged. The open source community provides tools to implement and execute parallel, complex and scientific applications, and most scientific applications need parallel and distributed computation. Hadoop, however, is not well suited to iterative computation, while most scientific computations and machine learning algorithms require iteration.
Spark [13], Twister [7], HaLoop [6] and Flink [1] have emerged as alternatives to the MapReduce framework, providing better support for applications that require iterative computation. These frameworks are parallel and distributed in nature: programmers can distribute their data across a cluster and execute tasks in parallel over the data present in the cluster. Frameworks like Hadoop and Twister involve heavy disk I/O, whereas Spark reduces I/O activity through its in-memory design. Performance does not depend on the framework alone; it also depends on the application, which needs to be categorized as I/O intensive or CPU intensive. There is a plethora of Big data frameworks available for executing various types of applications, so choosing the right data processing platform for a given problem statement is always a challenge for the user. This paper provides a mapping from different types of applications and workloads to suitable sets of data processing frameworks. We further categorize applications based on features such as speed, latency, dependency and language. To help users choose an appropriate framework for their real-world problems, we compare different computation frameworks and map application classes to frameworks based on suitability and features. We focus on the following application classes: machine learning and iterative computation based applications, SQL query based applications, online transaction processing (OLTP) applications and streaming based applications. We also discuss open source Big data frameworks such as Hadoop [3], Spark [13] and Flink [1], with their pros and cons, and present a rule base for choosing among them.
The contribution of this paper is to define features for different types of applications and match them to appropriate Big data frameworks based on several studies available in the literature. The rest of the paper is organized as follows. Section 2 discusses the various application classes, Section 3 explains different Big data frameworks, Section 4 presents the rule base for choosing among these frameworks, and Section 5 concludes.

The copyright of this paper is owned by the author(s). The author(s) hereby grants Computer Measurement Group Inc a royalty-free right to publish this paper in CMG India Annual Conference Proceedings.

II. APPLICATION CLASSES

All domains, such as banking, media, health care, education, manufacturing, insurance, government, retail and transportation, face the challenge of processing Big Data. Most applications in these domains have the following types of use cases for processing large data sizes. 1) Knowledge discovery in databases: applications that identify hidden patterns or extract useful information from given data, applicable to identifying current market trends and gaining more insight into customer data. 2) Fraud detection and prevention: used to identify fraudulent cases, which happen in every sector, so that they can be prevented before they occur. 3) Device-generated data analytics: enormous amounts of data are generated by sensors, remote devices, mobile phones and satellites; the generated data can be used for weather forecasting and for providing location-based intelligence. 4) Social network and relationship analytics: social networking sites generate huge amounts of data every second that need to be analyzed in real time, with correct identification of relationships and a response back to the user. 5) Recommendation-based analytics: in the digital world, users spend most of their time on their devices; their data can be analyzed to recommend useful information and keep users engaged.

In this paper we provide a rule base for choosing a framework, categorizing applications broadly into iterative computation based applications, SQL query based analytic workloads, online transaction processing applications and streaming based applications.

A. Machine Learning and Iterative Computation Based Applications

Machine learning is a part of artificial intelligence in which a machine is trained on past data to make decisions about the future; we apply an algorithm to the data to obtain an output. Applying machine learning involves a few steps: 1) collect data from various sources; 2) prepare the data by analyzing it and handling outliers and missing values; 3) split the data into training and testing parts to build the model and test it; 4) check the accuracy of the model by evaluating the algorithm's outcome; 5) improve performance, if possible, by selecting a different model altogether. Supervised learning, also known as task-driven learning, learns from past data to make future predictions. The task can be characterized by the class attribute: if the attribute is discrete the task is classification, and if the attribute is continuous it is regression. Decision trees, SVM and regression fall into this category.
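The supervised learning workflow in steps 1 to 5 above can be sketched in plain Python. The synthetic dataset and the simple nearest-centroid classifier below are illustrative assumptions for this sketch, not part of any framework discussed in this paper:

```python
import random

random.seed(0)

# Step 1: collect data -- here, a synthetic two-class toy dataset of 2-D points.
data = [([random.gauss(0, 1), random.gauss(0, 1)], 0) for _ in range(50)] + \
       [([random.gauss(3, 1), random.gauss(3, 1)], 1) for _ in range(50)]

# Step 2: prepare data -- shuffle; a real pipeline would also handle
# outliers and missing values here.
random.shuffle(data)

# Step 3: split into training and testing parts.
split = int(0.8 * len(data))
train, test = data[:split], data[split:]

# Build the model: a nearest-centroid classifier (one centroid per class).
def centroid(points):
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(2)]

centroids = {label: centroid([x for x, y in train if y == label])
             for label in (0, 1)}

def predict(x):
    dist = lambda c: sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda label: dist(centroids[label]))

# Step 4: evaluate accuracy on the held-out test set.
accuracy = sum(predict(x) == y for x, y in test) / len(test)
print(f"test accuracy: {accuracy:.2f}")
# Step 5 would compare this result against alternative models and keep the best.
```

Because the two classes are well separated, the held-out accuracy is high; step 5 would matter when a simple model like this one underperforms.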
Unsupervised learning is used for cluster analysis on a given input data set with the help of some objective function. The well-known K-means clustering algorithm and Principal Component Analysis (PCA) fall into this category.

B. SQL Query Based Applications

Big data frameworks provide functionality for users to run and analyze queries over data stored on the Hadoop Distributed File System (HDFS) or S3. Hive on Hadoop MapReduce, Spark SQL on Spark and many other frameworks provide APIs for SQL queries. Through these frameworks, data can be viewed and analyzed with basic queries such as aggregations and joins, even on the fly for business purposes.

C. Online Transaction Processing (OLTP) Applications

These applications have read, write and update operations, and each transaction execution is associated with semantics such as ACID or CAP. Most NoSQL frameworks, such as HBase, Neo4j and MongoDB, support OLTP workloads. Some, such as GraphLab and Pregel, are specialized for graph-structured data, which is required by social networking applications. The amount of data generated by social networking sites and other sources is growing at an exponential rate. Graph-based models used for parallel computation on graphs must keep the data consistent, and different techniques are applied to extract useful information from these huge data volumes. Graph analysis is one way to understand complex relations and to detect the different patterns present in large data. Available frameworks for graph processing include GraphX [8], GraphLab [9], Giraph [2] and Pregel [11]. PageRank, Connected Components, SVD++, Strongly Connected Components and Triangle Count are applications that belong to this category.

D. Streaming Based Applications

Processing large amounts of data is a challenging task, but making decisions on the fly in real time is even more challenging; this is where stream processing comes into the picture. Many business firms focus on streaming analysis to make business decisions, such as real-time recommendations for users, customers and clients. Stream processing mainly performs calculations, statistical analysis and continuous queries over a stream of data in real time. Available frameworks for stream processing include Spark Streaming [14], Apache Storm [4] (originally from Twitter) and IBM InfoSphere Streams [5]. Fraud detection, e-commerce, feeds, page views and trading are typical streaming based applications.

III. BIG DATA FRAMEWORKS

In this section, we discuss features of a few popular open source Big data frameworks.

A. Hadoop

Hadoop is generally used for batch processing and is designed for the parallel algorithms used in scientific applications. To overcome the limitation of running only MapReduce on a single cluster, an improved version of Hadoop [3] was introduced as Hadoop YARN [3]. The associated ML tool is Mahout. Hadoop is not suitable, first, when communication between splits is required; second, for long-running MapReduce jobs; and, last but not least, MapReduce loops have no concept of in-memory processing or caching.

B. Spark

Spark is another programming model used to process large data. According to the Spark community [13], it is much faster than Hadoop; the main reason behind this claim is that instead of placing intermediate data on disk, it stores it in memory, saving the I/O time otherwise spent storing and retrieving data from disk. Spark introduces a memory abstraction called resilient distributed datasets (RDDs); functions run in parallel over each record of HDFS or any other available storage. Two types of operations are performed on an RDD: transformations, which are lazily evaluated, and actions, which are applied to the transformed RDD. An RDD is a collection of objects partitioned across the cluster; a lost partition can be rebuilt using its lineage [12]. Spark scheduling uses a directed acyclic graph (DAG): a job has multiple stages, and independent stages can be scheduled simultaneously. Spark supports a wide range of application categories, including machine learning, Spark SQL, Spark Streaming and graph computation. The associated ML tools are MLlib, Mahout and H2O.

C. Flink

Flink is an open source platform for hybrid, interactive, real-time streaming, real-world streaming (out-of-order streams, windowing, back-pressure) and native iterative processing [1]. Its major abstraction is cyclic dataflows. Flink is an in-memory, low-latency engine. The associated ML tools are Flink-ML and SAMOA.

D. GraphX

The GraphX [8] framework is used for graph processing on top of Spark. It provides low-cost fault tolerance, and its support for iterative graph processing reduces memory overhead. Different frameworks expose different APIs for graph processing.

E. Spark Streaming

Hadoop does not have a fully fledged solution for streaming applications, even though it can process micro-batch jobs [10]. Spark Streaming provides strong consistency, scalability, parallelism and fault recovery through the discretized streams (D-Streams) [14] programming model, which intermixes streaming, batch and interactive queries.

IV. RULE BASE FOR CHOOSING A FRAMEWORK

This section provides a rule-based mapping between the applications and frameworks discussed in Sections 2 and 3 respectively. The first column of Table I presents the features of SQL query based analytic workloads, based on our experience in client projects. The remaining columns represent different Big data frameworks that support SQL query based analytic workloads; each cell in the table shows the match between a feature and a framework. Similarly, Tables II, III and IV present the rule-based mappings for NoSQL workloads, iterative computation based applications and streaming applications respectively.

V. CONCLUSIONS AND FUTURE WORK

A plethora of open source Big Data frameworks is available for parallel processing of CPU- and/or data-intensive applications, and selecting the right platform for a given workload is a daunting task for a user. In this paper we have categorized applications broadly into SQL query based analytic workloads, NoSQL workloads, iterative computation based applications and streaming applications, and have specified features for each of these application categories. Finally, we have presented a rule-based mapping for each category by specifying the level of support provided by various available Big data frameworks for each of the specified features. In future work we shall benchmark these available Big data platforms for different types of applications with the features mentioned in this paper and corroborate the results with actual measurements.
We also plan to integrate this rule base mapping with our larger project on migrating applications deployed on traditional systems to Big data platforms.

REFERENCES

[1]
[2]
[3]
[4]
[5]
[6] Y. Bu, B. Howe, M. Balazinska, and M. D. Ernst. HaLoop: Efficient iterative data processing on large clusters. PVLDB, 3(1).
[7] J. Ekanayake, H. Li, B. Zhang, T. Gunarathne, S.-H. Bae, J. Qiu, and G. Fox. Twister: a runtime for iterative MapReduce. In S. Hariri and K. Keahey, editors, HPDC. ACM.
[8] J. E. Gonzalez, R. S. Xin, A. Dave, D. Crankshaw, M. J. Franklin, and I. Stoica. GraphX: Graph processing in a distributed dataflow framework. In J. Flinn and H. Levy, editors, OSDI. USENIX Association.
[9] Y. Low, J. Gonzalez, A. Kyrola, D. Bickson, C. Guestrin, and J. M. Hellerstein. Distributed GraphLab: A framework for machine learning in the cloud. PVLDB, 5(8).
[10] S. Shahrivari. Beyond batch processing: Towards real-time and streaming big data. Computers, 3(4).
[11] C. E. Tsourakakis. Pegasus: A system for large-scale graph processing. In S. Sakr and M. M. Gaber, editors, Large Scale and Big Data. Auerbach Publications.
[12] M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M. J. Franklin, S. Shenker, and I. Stoica. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In S. D. Gribble and D. Katabi, editors, NSDI. USENIX Association.
[13] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica. Spark: Cluster computing with working sets. In E. M. Nahum and D. Xu, editors, HotCloud. USENIX Association.
[14] M. Zaharia, T. Das, H. Li, S. Shenker, and I. Stoica. Discretized streams: An efficient and fault-tolerant model for stream processing on large clusters. In R. Fonseca and D. A. Maltz, editors, HotCloud. USENIX Association, 2012.

TABLE I: Rule Base for SQL Analytic Workload
(Frameworks compared: Hive, Presto, Drill, Spark)

Execution framework: Hive = MapReduce; Presto = pipelined query execution; Drill = pipelined query execution; Spark = MapReduce and DAG
Model: Hive = batch processing; Presto = real-time interaction using query pipelines; Drill = real-time interaction; Spark = micro-batch processing and streaming
Speed: Hive = fast; Presto = about 10x faster than Hive; Drill = fast; Spark = about 10x faster than MapReduce on disk and 100x faster in memory
Data size handled: Hive = terabytes; Presto = petabytes; Drill = gigabytes to petabytes; Spark = petabytes
ML support: Hive = no; Presto = no; Drill = no; Spark = yes
Language support: Hive = SQL; Presto = C, Java, PHP, Python, R, Ruby, SQL; Drill = ANSI SQL, Mongo QL, Java API; Spark = Scala, Java, Python, SQL
Horizontal cluster scalability: Hive = scalable (100+ nodes); Presto = scalable (1000+ nodes); Drill = scalable (100+ nodes); Spark = scalable (1000+ nodes)
Query optimization: Hive = rule based; Presto = cost based (not yet available); Drill = cost based; Spark = Catalyst
YARN support: Hive = yes; Presto = not yet; Drill = yes; Spark = yes
Storage support: Hive = HDFS, local file system, S3; Presto = local, HDFS, S3; Drill = local, HDFS, S3, MapR-FS; Spark = HDFS, S3, HBase, Cassandra

TABLE II: Rule Base for NoSQL Workload
(Key-value stores: Redis, Riak; document stores: MongoDB, CouchDB; graph based: Titan, Neo4j; wide column: HBase, Cassandra)

Schema flexibility: Redis = high; Riak = high; MongoDB = high; CouchDB = high; Titan = high; Neo4j = high; HBase = moderate; Cassandra = moderate
Implementation language: Redis = C; Riak = Erlang; MongoDB = C++; CouchDB = Erlang; Titan = Java; Neo4j = Java; HBase = Java; Cassandra = Java
Notable design features: Redis = sorted sets; Riak = vector clocks; MongoDB = indexing, GridFS; CouchDB = master-master replication; Titan = pluggable storage backends (Cassandra, HBase, MapR, Hazelcast); HBase = built-in data compression and MapReduce support; Cassandra = partitioning with tunable consistency
API and access methods: Redis = proprietary protocol, key commands; Riak = HTTP API, native Erlang interface, MapReduce with term matching; MongoDB = proprietary protocol using JSON/binary (BSON), dynamic object-based language and MapReduce; CouchDB = RESTful HTTP/JSON API, MapReduce of JavaScript functions; Titan = Blueprints, Gremlin, SparQL, Python, Clojure; Neo4j = Cypher query language, SparQL, native Java API, RESTful HTTP API; HBase = internal API, RESTful HTTP API, Thrift; Cassandra = SQL-like CQL, Thrift, custom binary protocol
Concurrency: Redis = in memory; Riak = eventually consistent; MongoDB = update in place (master-slave with multi-granularity locking); CouchDB = MVCC (application can select optimistic or pessimistic locking); Titan = ACID, tunable consistency; Neo4j = ACID (non-blocking reads; write locks on involved nodes/relationships until commit); HBase = optimistic locking with MVCC; Cassandra = tunable consistency
Complexity: Redis = none; Riak = none; MongoDB = low; CouchDB = low; Titan = high; Neo4j = high; HBase = low; Cassandra = low
CAP: Redis = AP; Riak = AP; MongoDB = CP; CouchDB = CP; Titan = AP/CP; Neo4j = CP; HBase = CP; Cassandra = AP/CP
Storage: Redis = volatile memory; Riak = Bitcask, LevelDB; MongoDB = volatile memory, file system; CouchDB = volatile memory, file system; Titan = backing store (Cassandra, HBase); Neo4j = volatile memory, file system; HBase = HDFS; Cassandra = file system
MapReduce support: Redis = no; Riak = yes; MongoDB = yes; CouchDB = yes; Titan = yes; Neo4j = no; HBase = yes; Cassandra = yes
Horizontally scalable: Redis = yes; Riak = yes; MongoDB = yes; CouchDB = yes; Titan = yes; Neo4j = no; HBase = yes; Cassandra = yes
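A rule base like Table I can also be consulted programmatically. The sketch below is a simplified, hypothetical encoding of a few Table I features; the feature names and the 0/1/2 support scores are assumptions for illustration, not the paper's exact rule base:

```python
# Illustrative rule base: framework -> feature -> support level
# (0 = no support, 1 = partial, 2 = full). Scores loosely follow Table I;
# this is a simplified sketch, not a complete encoding of the tables.
RULE_BASE = {
    "Hive":   {"ml_support": 0, "real_time": 0, "petabyte_scale": 1, "yarn": 2},
    "Presto": {"ml_support": 0, "real_time": 2, "petabyte_scale": 2, "yarn": 0},
    "Drill":  {"ml_support": 0, "real_time": 2, "petabyte_scale": 2, "yarn": 2},
    "Spark":  {"ml_support": 2, "real_time": 1, "petabyte_scale": 2, "yarn": 2},
}

def choose_framework(required_features):
    """Rank frameworks by their total support for the required features."""
    scores = {
        name: sum(support.get(feature, 0) for feature in required_features)
        for name, support in RULE_BASE.items()
    }
    return sorted(scores, key=scores.get, reverse=True)

ranking = choose_framework(["ml_support", "petabyte_scale", "yarn"])
print(ranking[0])  # prints "Spark"
```

A workload needing ML support at petabyte scale on YARN ranks Spark first under these scores; a real rule base would encode every feature row of Tables I to IV and could weight features by importance.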

TABLE III: Rule Base for Iterative Computation Applications
(Frameworks compared: Hadoop, Twister, Flink, Spark)

Framework: Hadoop = MapReduce; Twister = MapReduce; Flink = PACT; Spark = MapReduce with DAG scheduling
Model: Hadoop = batch; Twister = iterative MapReduce; Flink = operator based; Spark = micro-batch
Optimization / data caching: Hadoop = manual, no caching; Twister = auto-tuned, caches static data; Flink = automatic, static data path; Spark = manual, static data cached across iterations
Latency: Hadoop = high; Twister = medium; Flink = medium (better than Spark); Spark = medium
Fault tolerance: Hadoop = medium; Twister = medium; Flink = high; Spark = high (through lineage)
Memory management: Hadoop = manual; Twister = manual; Flink = automatic; Spark = automatic
Iteration support: Hadoop = no; Twister = yes; Flink = yes; Spark = yes
General purpose use: Hadoop = ETL; Twister = iterative algorithms; Flink = ETL, machine learning, iterative algorithms; Spark = ETL, machine learning, iterative algorithms
Storage support: Hadoop = HDFS, local file system, S3; Twister = local file systems; Flink = HDFS, S3, MapR, Tachyon; Spark = HDFS, local file system, S3, HBase, Cassandra
Language support: Hadoop = Java; Twister = Java; Flink = Java, Scala, Python; Spark = Java, Scala, Python
Horizontal cluster scalability: Hadoop = scalable (100 to 1000 nodes); Twister = scalable (100 to 1000 nodes); Flink = highly scalable; Spark = highly scalable

TABLE IV: Rule Base for Streaming Applications
(Frameworks compared: Hadoop, Spark Streaming, Storm, Flink, Samza)

Processing framework: Hadoop = batch; Spark Streaming = batch and streaming; Storm = real-time event based; Flink = batch and real-time stream processing; Samza = real-time event based
Streaming model: Hadoop = batch; Spark Streaming = micro-batching (accumulates stream messages); Storm = native (micro-batching with the Trident API); Flink = native; Samza = native (relies on Kafka for internal messaging)
Response time: Hadoop = minutes; Spark Streaming = seconds; Storm = milliseconds; Flink = milliseconds; Samza = sub-second
Stream source / primitive / computation: Hadoop = NA; Spark Streaming = receivers / DStreams / transformations and window operations; Storm = spouts / tuples / bolts; Flink = DataStream; Samza = consumers / messages / tasks
Hadoop integration: Hadoop = YARN, HDFS, MapReduce; Spark Streaming = YARN, HDFS, Mesos; Storm = runs on YARN and interacts with Hadoop and HDFS; Flink = runs on YARN as an application; Samza = requires YARN and HDFS
Delivery semantics: Hadoop = NA; Spark Streaming = exactly once (except in some failure scenarios); Storm = at least once (exactly once with Trident); Flink = exactly once; Samza = at least once
State management: Hadoop = stateless; Spark Streaming = stateful (writes state to storage; dedicated DStream); Storm = not built in, stateless (roll your own or use Trident); Flink = stateful operators; Samza = stateful (embedded key-value store)
Language support: Hadoop = Java; Spark Streaming = Scala, Java, Python; Storm = JVM languages, Ruby, Python, JavaScript, Perl; Flink = Java, Scala, Python; Samza = Scala, Java (JVM languages only)
API: Hadoop = NA; Spark Streaming = declarative; Storm = compositional; Flink = declarative; Samza = compositional
Latency: Hadoop = minutes; Spark Streaming = seconds; Storm = milliseconds; Flink = milliseconds; Samza = milliseconds
Throughput: Hadoop = NA; Spark Streaming = 100k+ records per node per second; Storm = 10k+ records per node per second; Flink = 100k+ records per node per second; Samza = 100k+ records per node per second
Scalability to large input volumes (streams): Hadoop = no; Spark Streaming = no; Storm = yes; Flink = yes; Samza = yes
Fault tolerance (completing the computation correctly under failure): Hadoop = yes; Spark Streaming = yes; Storm = yes; Flink = yes; Samza = yes
Accuracy and repeatability: Hadoop = no; Spark Streaming = no; Storm = no; Flink = yes; Samza = no
Queryable (querying results inside the stream processor without exporting them to an external database): Hadoop = no; Spark Streaming = no; Storm = no; Flink = no (upcoming feature); Samza = no
In-memory processing: Hadoop = no; Spark Streaming = yes; Storm = yes; Flink = yes; Samza = no
Resource manager: Hadoop = YARN; Spark Streaming = YARN, Mesos; Storm = YARN, Mesos; Flink = YARN; Samza = YARN
Supported ML tools: Hadoop = Mahout; Spark Streaming = Mahout, MLlib, H2O; Storm = SAMOA; Flink = Flink-ML, SAMOA; Samza = SAMOA
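Section III.B describes RDD transformations as lazily evaluated and actions as the operations that trigger computation. The toy class below mimics that evaluation model in plain Python; it is an illustration of the lazy-transformation idea only, not Spark's actual RDD implementation:

```python
class MiniRDD:
    """Toy stand-in for an RDD: transformations are recorded lazily,
    actions replay the recorded lineage over the data."""

    def __init__(self, data, ops=None):
        self._data = data
        self._ops = ops or []  # the recorded "lineage" of transformations

    # Transformations: return a new MiniRDD; nothing is computed yet.
    def map(self, fn):
        return MiniRDD(self._data, self._ops + [("map", fn)])

    def filter(self, pred):
        return MiniRDD(self._data, self._ops + [("filter", pred)])

    # Actions: replay the lineage over the data and return a result.
    def collect(self):
        items = iter(self._data)
        for kind, fn in self._ops:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return list(items)

    def count(self):
        return len(self.collect())

rdd = MiniRDD(range(10)).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
# No computation has happened yet; count() is the action that triggers it.
print(rdd.count())  # prints 5
```

Here `map` and `filter` only append to the lineage; the work happens when `count()` replays the recorded operations. Recomputing a lost partition from its lineage, as described in Section III.B, works on the same principle.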


More information

CIB Session 12th NoSQL Databases Structures

CIB Session 12th NoSQL Databases Structures CIB Session 12th NoSQL Databases Structures By: Shahab Safaee & Morteza Zahedi Software Engineering PhD Email: safaee.shx@gmail.com, morteza.zahedi.a@gmail.com cibtrc.ir cibtrc cibtrc 2 Agenda What is

More information

Lecture 11 Hadoop & Spark

Lecture 11 Hadoop & Spark Lecture 11 Hadoop & Spark Dr. Wilson Rivera ICOM 6025: High Performance Computing Electrical and Computer Engineering Department University of Puerto Rico Outline Distributed File Systems Hadoop Ecosystem

More information

Apache Flink Big Data Stream Processing

Apache Flink Big Data Stream Processing Apache Flink Big Data Stream Processing Tilmann Rabl Berlin Big Data Center www.dima.tu-berlin.de bbdc.berlin rabl@tu-berlin.de XLDB 11.10.2017 1 2013 Berlin Big Data Center All Rights Reserved DIMA 2017

More information

COMPARATIVE EVALUATION OF BIG DATA FRAMEWORKS ON BATCH PROCESSING

COMPARATIVE EVALUATION OF BIG DATA FRAMEWORKS ON BATCH PROCESSING Volume 119 No. 16 2018, 937-948 ISSN: 1314-3395 (on-line version) url: http://www.acadpubl.eu/hub/ http://www.acadpubl.eu/hub/ COMPARATIVE EVALUATION OF BIG DATA FRAMEWORKS ON BATCH PROCESSING K.Anusha

More information

Blended Learning Outline: Developer Training for Apache Spark and Hadoop (180404a)

Blended Learning Outline: Developer Training for Apache Spark and Hadoop (180404a) Blended Learning Outline: Developer Training for Apache Spark and Hadoop (180404a) Cloudera s Developer Training for Apache Spark and Hadoop delivers the key concepts and expertise need to develop high-performance

More information

Spark: A Brief History. https://stanford.edu/~rezab/sparkclass/slides/itas_workshop.pdf

Spark: A Brief History. https://stanford.edu/~rezab/sparkclass/slides/itas_workshop.pdf Spark: A Brief History https://stanford.edu/~rezab/sparkclass/slides/itas_workshop.pdf A Brief History: 2004 MapReduce paper 2010 Spark paper 2002 2004 2006 2008 2010 2012 2014 2002 MapReduce @ Google

More information

I ++ Mapreduce: Incremental Mapreduce for Mining the Big Data

I ++ Mapreduce: Incremental Mapreduce for Mining the Big Data IOSR Journal of Computer Engineering (IOSR-JCE) e-issn: 2278-0661,p-ISSN: 2278-8727, Volume 18, Issue 3, Ver. IV (May-Jun. 2016), PP 125-129 www.iosrjournals.org I ++ Mapreduce: Incremental Mapreduce for

More information

Certified Big Data Hadoop and Spark Scala Course Curriculum

Certified Big Data Hadoop and Spark Scala Course Curriculum Certified Big Data Hadoop and Spark Scala Course Curriculum The Certified Big Data Hadoop and Spark Scala course by DataFlair is a perfect blend of indepth theoretical knowledge and strong practical skills

More information

Survey on Incremental MapReduce for Data Mining

Survey on Incremental MapReduce for Data Mining Survey on Incremental MapReduce for Data Mining Trupti M. Shinde 1, Prof.S.V.Chobe 2 1 Research Scholar, Computer Engineering Dept., Dr. D. Y. Patil Institute of Engineering &Technology, 2 Associate Professor,

More information

Stages of Data Processing

Stages of Data Processing Data processing can be understood as the conversion of raw data into a meaningful and desired form. Basically, producing information that can be understood by the end user. So then, the question arises,

More information

Hadoop Development Introduction

Hadoop Development Introduction Hadoop Development Introduction What is Bigdata? Evolution of Bigdata Types of Data and their Significance Need for Bigdata Analytics Why Bigdata with Hadoop? History of Hadoop Why Hadoop is in demand

More information

RESILIENT DISTRIBUTED DATASETS: A FAULT-TOLERANT ABSTRACTION FOR IN-MEMORY CLUSTER COMPUTING

RESILIENT DISTRIBUTED DATASETS: A FAULT-TOLERANT ABSTRACTION FOR IN-MEMORY CLUSTER COMPUTING RESILIENT DISTRIBUTED DATASETS: A FAULT-TOLERANT ABSTRACTION FOR IN-MEMORY CLUSTER COMPUTING Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin,

More information

Accelerate MySQL for Demanding OLAP and OLTP Use Cases with Apache Ignite. Peter Zaitsev, Denis Magda Santa Clara, California April 25th, 2017

Accelerate MySQL for Demanding OLAP and OLTP Use Cases with Apache Ignite. Peter Zaitsev, Denis Magda Santa Clara, California April 25th, 2017 Accelerate MySQL for Demanding OLAP and OLTP Use Cases with Apache Ignite Peter Zaitsev, Denis Magda Santa Clara, California April 25th, 2017 About the Presentation Problems Existing Solutions Denis Magda

More information

CompSci 516: Database Systems

CompSci 516: Database Systems CompSci 516 Database Systems Lecture 12 Map-Reduce and Spark Instructor: Sudeepa Roy Duke CS, Fall 2017 CompSci 516: Database Systems 1 Announcements Practice midterm posted on sakai First prepare and

More information

Spark. In- Memory Cluster Computing for Iterative and Interactive Applications

Spark. In- Memory Cluster Computing for Iterative and Interactive Applications Spark In- Memory Cluster Computing for Iterative and Interactive Applications Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael Franklin, Scott Shenker,

More information

Research challenges in data-intensive computing The Stratosphere Project Apache Flink

Research challenges in data-intensive computing The Stratosphere Project Apache Flink Research challenges in data-intensive computing The Stratosphere Project Apache Flink Seif Haridi KTH/SICS haridi@kth.se e2e-clouds.org Presented by: Seif Haridi May 2014 Research Areas Data-intensive

More information

Announcements. Reading Material. Map Reduce. The Map-Reduce Framework 10/3/17. Big Data. CompSci 516: Database Systems

Announcements. Reading Material. Map Reduce. The Map-Reduce Framework 10/3/17. Big Data. CompSci 516: Database Systems Announcements CompSci 516 Database Systems Lecture 12 - and Spark Practice midterm posted on sakai First prepare and then attempt! Midterm next Wednesday 10/11 in class Closed book/notes, no electronic

More information

L3: Spark & RDD. CDS Department of Computational and Data Sciences. Department of Computational and Data Sciences

L3: Spark & RDD. CDS Department of Computational and Data Sciences. Department of Computational and Data Sciences Indian Institute of Science Bangalore, India भ रत य व ज ञ न स स थ न ब गल र, भ रत Department of Computational and Data Sciences L3: Spark & RDD Department of Computational and Data Science, IISc, 2016 This

More information

Turning Relational Database Tables into Spark Data Sources

Turning Relational Database Tables into Spark Data Sources Turning Relational Database Tables into Spark Data Sources Kuassi Mensah Jean de Lavarene Director Product Mgmt Director Development Server Technologies October 04, 2017 3 Safe Harbor Statement The following

More information

Stream Processing on IoT Devices using Calvin Framework

Stream Processing on IoT Devices using Calvin Framework Stream Processing on IoT Devices using Calvin Framework by Ameya Nayak A Project Report Submitted in Partial Fulfillment of the Requirements for the Degree of Master of Science in Computer Science Supervised

More information

CISC 7610 Lecture 2b The beginnings of NoSQL

CISC 7610 Lecture 2b The beginnings of NoSQL CISC 7610 Lecture 2b The beginnings of NoSQL Topics: Big Data Google s infrastructure Hadoop: open google infrastructure Scaling through sharding CAP theorem Amazon s Dynamo 5 V s of big data Everyone

More information

Delving Deep into Hadoop Course Contents Introduction to Hadoop and Architecture

Delving Deep into Hadoop Course Contents Introduction to Hadoop and Architecture Delving Deep into Hadoop Course Contents Introduction to Hadoop and Architecture Hadoop 1.0 Architecture Introduction to Hadoop & Big Data Hadoop Evolution Hadoop Architecture Networking Concepts Use cases

More information

Distributed Computing with Spark and MapReduce

Distributed Computing with Spark and MapReduce Distributed Computing with Spark and MapReduce Reza Zadeh @Reza_Zadeh http://reza-zadeh.com Traditional Network Programming Message-passing between nodes (e.g. MPI) Very difficult to do at scale:» How

More information

IBM Data Science Experience White paper. SparkR. Transforming R into a tool for big data analytics

IBM Data Science Experience White paper. SparkR. Transforming R into a tool for big data analytics IBM Data Science Experience White paper R Transforming R into a tool for big data analytics 2 R Executive summary This white paper introduces R, a package for the R statistical programming language that

More information

A Tutorial on Apache Spark

A Tutorial on Apache Spark A Tutorial on Apache Spark A Practical Perspective By Harold Mitchell The Goal Learning Outcomes The Goal Learning Outcomes NOTE: The setup, installation, and examples assume Windows user Learn the following:

More information

Massive Online Analysis - Storm,Spark

Massive Online Analysis - Storm,Spark Massive Online Analysis - Storm,Spark presentation by R. Kishore Kumar Research Scholar Department of Computer Science & Engineering Indian Institute of Technology, Kharagpur Kharagpur-721302, India (R

More information

Analytics in Spark. Yanlei Diao Tim Hunter. Slides Courtesy of Ion Stoica, Matei Zaharia and Brooke Wenig

Analytics in Spark. Yanlei Diao Tim Hunter. Slides Courtesy of Ion Stoica, Matei Zaharia and Brooke Wenig Analytics in Spark Yanlei Diao Tim Hunter Slides Courtesy of Ion Stoica, Matei Zaharia and Brooke Wenig Outline 1. A brief history of Big Data and Spark 2. Technical summary of Spark 3. Unified analytics

More information

Big Data Hadoop Course Content

Big Data Hadoop Course Content Big Data Hadoop Course Content Topics covered in the training Introduction to Linux and Big Data Virtual Machine ( VM) Introduction/ Installation of VirtualBox and the Big Data VM Introduction to Linux

More information

Practical Big Data Processing An Overview of Apache Flink

Practical Big Data Processing An Overview of Apache Flink Practical Big Data Processing An Overview of Apache Flink Tilmann Rabl Berlin Big Data Center www.dima.tu-berlin.de bbdc.berlin rabl@tu-berlin.de With slides from Volker Markl and data artisans 1 2013

More information

MapR Enterprise Hadoop

MapR Enterprise Hadoop 2014 MapR Technologies 2014 MapR Technologies 1 MapR Enterprise Hadoop Top Ranked Cloud Leaders 500+ Customers 2014 MapR Technologies 2 Key MapR Advantage Partners Business Services APPLICATIONS & OS ANALYTICS

More information

Topics. Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples

Topics. Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples Hadoop Introduction 1 Topics Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples 2 Big Data Analytics What is Big Data?

More information

Jargons, Concepts, Scope and Systems. Key Value Stores, Document Stores, Extensible Record Stores. Overview of different scalable relational systems

Jargons, Concepts, Scope and Systems. Key Value Stores, Document Stores, Extensible Record Stores. Overview of different scalable relational systems Jargons, Concepts, Scope and Systems Key Value Stores, Document Stores, Extensible Record Stores Overview of different scalable relational systems Examples of different Data stores Predictions, Comparisons

More information

Hadoop, Yarn and Beyond

Hadoop, Yarn and Beyond Hadoop, Yarn and Beyond 1 B. R A M A M U R T H Y Overview We learned about Hadoop1.x or the core. Just like Java evolved, Java core, Java 1.X, Java 2.. So on, software and systems evolve, naturally.. Lets

More information

Cloud Computing & Visualization

Cloud Computing & Visualization Cloud Computing & Visualization Workflows Distributed Computation with Spark Data Warehousing with Redshift Visualization with Tableau #FIUSCIS School of Computing & Information Sciences, Florida International

More information

Parallel HITS Algorithm Implemented Using HADOOP GIRAPH Framework to resolve Big Data Problem

Parallel HITS Algorithm Implemented Using HADOOP GIRAPH Framework to resolve Big Data Problem I J C T A, 9(41) 2016, pp. 1235-1239 International Science Press Parallel HITS Algorithm Implemented Using HADOOP GIRAPH Framework to resolve Big Data Problem Hema Dubey *, Nilay Khare *, Alind Khare **

More information

Big Data Hadoop Developer Course Content. Big Data Hadoop Developer - The Complete Course Course Duration: 45 Hours

Big Data Hadoop Developer Course Content. Big Data Hadoop Developer - The Complete Course Course Duration: 45 Hours Big Data Hadoop Developer Course Content Who is the target audience? Big Data Hadoop Developer - The Complete Course Course Duration: 45 Hours Complete beginners who want to learn Big Data Hadoop Professionals

More information

Hadoop An Overview. - Socrates CCDH

Hadoop An Overview. - Socrates CCDH Hadoop An Overview - Socrates CCDH What is Big Data? Volume Not Gigabyte. Terabyte, Petabyte, Exabyte, Zettabyte - Due to handheld gadgets,and HD format images and videos - In total data, 90% of them collected

More information

Over the last few years, we have seen a disruption in the data management

Over the last few years, we have seen a disruption in the data management JAYANT SHEKHAR AND AMANDEEP KHURANA Jayant is Principal Solutions Architect at Cloudera working with various large and small companies in various Verticals on their big data and data science use cases,

More information

MapReduce Spark. Some slides are adapted from those of Jeff Dean and Matei Zaharia

MapReduce Spark. Some slides are adapted from those of Jeff Dean and Matei Zaharia MapReduce Spark Some slides are adapted from those of Jeff Dean and Matei Zaharia What have we learnt so far? Distributed storage systems consistency semantics protocols for fault tolerance Paxos, Raft,

More information

EXTRACT DATA IN LARGE DATABASE WITH HADOOP

EXTRACT DATA IN LARGE DATABASE WITH HADOOP International Journal of Advances in Engineering & Scientific Research (IJAESR) ISSN: 2349 3607 (Online), ISSN: 2349 4824 (Print) Download Full paper from : http://www.arseam.com/content/volume-1-issue-7-nov-2014-0

More information

NoSQL Databases MongoDB vs Cassandra. Kenny Huynh, Andre Chik, Kevin Vu

NoSQL Databases MongoDB vs Cassandra. Kenny Huynh, Andre Chik, Kevin Vu NoSQL Databases MongoDB vs Cassandra Kenny Huynh, Andre Chik, Kevin Vu Introduction - Relational database model - Concept developed in 1970 - Inefficient - NoSQL - Concept introduced in 1980 - Related

More information

Summary of Big Data Frameworks Course 2015 Professor Sasu Tarkoma

Summary of Big Data Frameworks Course 2015 Professor Sasu Tarkoma Summary of Big Data Frameworks Course 2015 Professor Sasu Tarkoma www.cs.helsinki.fi Course Schedule Tuesday 10.3. Introduction and the Big Data Challenge Tuesday 17.3. MapReduce and Spark: Overview Tuesday

More information

BIG DATA COURSE CONTENT

BIG DATA COURSE CONTENT BIG DATA COURSE CONTENT [I] Get Started with Big Data Microsoft Professional Orientation: Big Data Duration: 12 hrs Course Content: Introduction Course Introduction Data Fundamentals Introduction to Data

More information

Specialist ICT Learning

Specialist ICT Learning Specialist ICT Learning APPLIED DATA SCIENCE AND BIG DATA ANALYTICS GTBD7 Course Description This intensive training course provides theoretical and technical aspects of Data Science and Business Analytics.

More information

Using the SDACK Architecture to Build a Big Data Product. Yu-hsin Yeh (Evans Ye) Apache Big Data NA 2016 Vancouver

Using the SDACK Architecture to Build a Big Data Product. Yu-hsin Yeh (Evans Ye) Apache Big Data NA 2016 Vancouver Using the SDACK Architecture to Build a Big Data Product Yu-hsin Yeh (Evans Ye) Apache Big Data NA 2016 Vancouver Outline A Threat Analytic Big Data product The SDACK Architecture Akka Streams and data

More information

Introduction to Big-Data

Introduction to Big-Data Introduction to Big-Data Ms.N.D.Sonwane 1, Mr.S.P.Taley 2 1 Assistant Professor, Computer Science & Engineering, DBACER, Maharashtra, India 2 Assistant Professor, Information Technology, DBACER, Maharashtra,

More information

Data Clustering on the Parallel Hadoop MapReduce Model. Dimitrios Verraros

Data Clustering on the Parallel Hadoop MapReduce Model. Dimitrios Verraros Data Clustering on the Parallel Hadoop MapReduce Model Dimitrios Verraros Overview The purpose of this thesis is to implement and benchmark the performance of a parallel K- means clustering algorithm on

More information

Khadija Souissi. Auf z Systems November IBM z Systems Mainframe Event 2016

Khadija Souissi. Auf z Systems November IBM z Systems Mainframe Event 2016 Khadija Souissi Auf z Systems 07. 08. November 2016 @ IBM z Systems Mainframe Event 2016 Acknowledgements Apache Spark, Spark, Apache, and the Spark logo are trademarks of The Apache Software Foundation.

More information

Backtesting with Spark

Backtesting with Spark Backtesting with Spark Patrick Angeles, Cloudera Sandy Ryza, Cloudera Rick Carlin, Intel Sheetal Parade, Intel 1 Traditional Grid Shared storage Storage and compute scale independently Bottleneck on I/O

More information

Databases 2 (VU) ( / )

Databases 2 (VU) ( / ) Databases 2 (VU) (706.711 / 707.030) MapReduce (Part 3) Mark Kröll ISDS, TU Graz Nov. 27, 2017 Mark Kröll (ISDS, TU Graz) MapReduce Nov. 27, 2017 1 / 42 Outline 1 Problems Suited for Map-Reduce 2 MapReduce:

More information

Introduction to MapReduce Algorithms and Analysis

Introduction to MapReduce Algorithms and Analysis Introduction to MapReduce Algorithms and Analysis Jeff M. Phillips October 25, 2013 Trade-Offs Massive parallelism that is very easy to program. Cheaper than HPC style (uses top of the line everything)

More information

Embedded Technosolutions

Embedded Technosolutions Hadoop Big Data An Important technology in IT Sector Hadoop - Big Data Oerie 90% of the worlds data was generated in the last few years. Due to the advent of new technologies, devices, and communication

More information

Data Science and Open Source Software. Iraklis Varlamis Assistant Professor Harokopio University of Athens

Data Science and Open Source Software. Iraklis Varlamis Assistant Professor Harokopio University of Athens Data Science and Open Source Software Iraklis Varlamis Assistant Professor Harokopio University of Athens varlamis@hua.gr What is data science? 2 Why data science is important? More data (volume, variety,...)

More information

microsoft

microsoft 70-775.microsoft Number: 70-775 Passing Score: 800 Time Limit: 120 min Exam A QUESTION 1 Note: This question is part of a series of questions that present the same scenario. Each question in the series

More information

An Overview of Apache Spark

An Overview of Apache Spark An Overview of Apache Spark CIS 612 Sunnie Chung 2014 MapR Technologies 1 MapReduce Processing Model MapReduce, the parallel data processing paradigm, greatly simplified the analysis of big data using

More information

Cloud Computing 2. CSCI 4850/5850 High-Performance Computing Spring 2018

Cloud Computing 2. CSCI 4850/5850 High-Performance Computing Spring 2018 Cloud Computing 2 CSCI 4850/5850 High-Performance Computing Spring 2018 Tae-Hyuk (Ted) Ahn Department of Computer Science Program of Bioinformatics and Computational Biology Saint Louis University Learning

More information

IMPLEMENTING A LAMBDA ARCHITECTURE TO PERFORM REAL-TIME UPDATES

IMPLEMENTING A LAMBDA ARCHITECTURE TO PERFORM REAL-TIME UPDATES IMPLEMENTING A LAMBDA ARCHITECTURE TO PERFORM REAL-TIME UPDATES by PRAMOD KUMAR GUDIPATI B.E., OSMANIA UNIVERSITY (OU), INDIA, 2012 A REPORT submitted in partial fulfillment of the requirements of the

More information

Shen PingCAP 2017

Shen PingCAP 2017 Shen Li @ PingCAP About me Shen Li ( 申砾 ) Tech Lead of TiDB, VP of Engineering Netease / 360 / PingCAP Infrastructure software engineer WHY DO WE NEED A NEW DATABASE? Brief History Standalone RDBMS NoSQL

More information

International Journal of Advance Engineering and Research Development. Performance Comparison of Hadoop Map Reduce and Apache Spark

International Journal of Advance Engineering and Research Development. Performance Comparison of Hadoop Map Reduce and Apache Spark Scientific Journal of Impact Factor (SJIF): 5.71 International Journal of Advance Engineering and Research Development Volume 5, Issue 03, March -2018 e-issn (O): 2348-4470 p-issn (P): 2348-6406 Performance

More information

The SMACK Stack: Spark*, Mesos*, Akka, Cassandra*, Kafka* Elizabeth K. Dublin Apache Kafka Meetup, 30 August 2017.

The SMACK Stack: Spark*, Mesos*, Akka, Cassandra*, Kafka* Elizabeth K. Dublin Apache Kafka Meetup, 30 August 2017. Dublin Apache Kafka Meetup, 30 August 2017 The SMACK Stack: Spark*, Mesos*, Akka, Cassandra*, Kafka* Elizabeth K. Joseph @pleia2 * ASF projects 1 Elizabeth K. Joseph, Developer Advocate Developer Advocate

More information

Cloud, Big Data & Linear Algebra

Cloud, Big Data & Linear Algebra Cloud, Big Data & Linear Algebra Shelly Garion IBM Research -- Haifa 2014 IBM Corporation What is Big Data? 2 Global Data Volume in Exabytes What is Big Data? 2005 2012 2017 3 Global Data Volume in Exabytes

More information

Introduction to NoSQL Databases

Introduction to NoSQL Databases Introduction to NoSQL Databases Roman Kern KTI, TU Graz 2017-10-16 Roman Kern (KTI, TU Graz) Dbase2 2017-10-16 1 / 31 Introduction Intro Why NoSQL? Roman Kern (KTI, TU Graz) Dbase2 2017-10-16 2 / 31 Introduction

More information

Shark: SQL and Rich Analytics at Scale. Michael Xueyuan Han Ronny Hajoon Ko

Shark: SQL and Rich Analytics at Scale. Michael Xueyuan Han Ronny Hajoon Ko Shark: SQL and Rich Analytics at Scale Michael Xueyuan Han Ronny Hajoon Ko What Are The Problems? Data volumes are expanding dramatically Why Is It Hard? Needs to scale out Managing hundreds of machines

More information

Nowcasting. D B M G Data Base and Data Mining Group of Politecnico di Torino. Big Data: Hype or Hallelujah? Big data hype?

Nowcasting. D B M G Data Base and Data Mining Group of Politecnico di Torino. Big Data: Hype or Hallelujah? Big data hype? Big data hype? Big Data: Hype or Hallelujah? Data Base and Data Mining Group of 2 Google Flu trends On the Internet February 2010 detected flu outbreak two weeks ahead of CDC data Nowcasting http://www.internetlivestats.com/

More information

Oracle GoldenGate for Big Data

Oracle GoldenGate for Big Data Oracle GoldenGate for Big Data The Oracle GoldenGate for Big Data 12c product streams transactional data into big data systems in real time, without impacting the performance of source systems. It streamlines

More information