Survey of Big Data Frameworks for Different Application Characteristics


Praveen Kumar Singh, TCS Research, Mumbai, India. Rekha Singhal, TCS Research, Mumbai, India.

Abstract: Applications are migrating from the traditional three-tier architecture to Big data platforms, which are widely available as open source and can perform parallel data processing on clusters of commodity machines. The challenge is to choose the right Big data framework for an application, given the features each framework offers. We propose a rule-based methodology for making this choice. We first categorize an application into one of four major categories based on its features: iterative computation based applications, SQL query based analytic workloads, online transaction processing applications and streaming based applications. The proposed rule base records the level of support that popular Big data frameworks provide for each application feature. We present the rule base for each of these application categories in this paper.

Keywords: SQL, NoSQL, Machine Learning, Rule Base, Hadoop, Spark, Flink, Iterative Computation, Streaming, Graph, Framework

I. INTRODUCTION

In the world of Big data, enormous volumes of data, on the order of petabytes and beyond, are generated by sources such as social networks, satellites, sensors, user devices and search engines. To analyze this large data and extract useful information from it, large-scale data processing frameworks have emerged. The open source community provides tools to implement and execute parallel, complex and scientific applications, and most scientific applications need parallel and distributed computation. Hadoop, however, is not well suited to iterative computation, while most scientific computations and machine learning algorithms require iteration.
Spark [13], Twister [7], HaLoop [6] and Flink [1] have emerged as alternatives to the MapReduce framework, providing better support for applications that require iterative computation. These frameworks are parallel and distributed in nature: programmers can distribute their data across a cluster and execute tasks in parallel over the data present in the cluster. Frameworks like Hadoop and Twister involve heavy disk I/O, whereas Spark reduces I/O activity through its in-memory design. Performance does not depend on the framework alone; it also depends on the application, which needs to be categorized as I/O intensive or CPU intensive. There is a plethora of Big data frameworks available for executing various types of applications, so choosing the right data processing platform for a given problem statement is always a challenge for the user. This paper provides a mapping from different types of applications and workloads to suitable sets of data processing frameworks. We further categorize applications based on features such as speed, latency, dependency and language. To help users choose an appropriate framework for their real-world problems, we compare different computation frameworks and map application classes to frameworks based on suitability and features. We focus on the following application classes: machine learning and iterative computation based applications, SQL query based applications, online transaction processing (OLTP) applications and streaming based applications. We also discuss open source Big data frameworks such as Hadoop [3], Spark [13] and Flink [1], with their pros and cons, and present a rule base for choosing among them.
The contribution of this paper is to define features for different types of applications and match them to appropriate Big data frameworks based on several studies available in the literature. The rest of the paper is organized as follows. Section 2 discusses the various application classes, Section 3 explains different Big data frameworks, Section 4 presents the rule base for choosing among these frameworks, and Section 5 concludes.

The copyright of this paper is owned by the author(s). The author(s) hereby grants Computer Measurement Group Inc a royalty-free right to publish this paper in CMG India Annual Conference Proceedings.

II. APPLICATION CLASSES

All domains, such as banking, media, health care, education, manufacturing, insurance, government, retail and transportation, face the challenge of processing Big Data. Most applications in these domains have the following types of use cases for processing large data sizes. 1) Knowledge discovery in databases: applications that identify hidden patterns or extract useful information from given data, applicable to identifying current market trends and gaining more insight into customer data. 2) Fraud detection and prevention: used to identify fraudulent cases, which happen in every sector, so that they can be prevented before they occur. 3) Device-generated data analytics: enormous amounts of data are generated by sensors, remote devices, mobile phones and satellites; the generated data can be used for weather forecasting and for providing location-based intelligence. 4) Social network and relationship analytics: social networking sites generate huge amounts of data every second that need to be analyzed in real time, with correct identification of relationships and a response back to the user. 5) Recommendation-based analytics: in the digital world, users spend most of their time on their devices; their data can be analyzed to recommend useful information and keep users engaged.

In this paper we provide a rule base for choosing a framework, categorizing applications broadly into iterative computation based applications, SQL query based analytic workloads, online transaction processing applications and streaming based applications.

A. Machine Learning and Iterative Computation Based Applications

Machine learning is a part of artificial intelligence in which a machine is trained on past data to make decisions about the future; we apply an algorithm to the data to obtain an output. Applying machine learning involves a few steps: 1) collect data from various sources; 2) prepare the data by analyzing it and handling outliers and missing values; 3) split the data into training and testing parts to build the model and test it; 4) check the accuracy of the model by evaluating the algorithm's outcome; 5) improve performance, if possible, by selecting a different model altogether. Supervised learning, also known as task-driven learning, learns from past data to make future predictions. The task can be characterized by the class attribute: if the attribute is discrete the task is classification, and if the attribute is continuous it is regression. Decision trees, SVM and regression fall into this category.
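The supervised learning workflow in steps 1 to 5 above can be sketched in plain Python. The synthetic dataset and the simple nearest-centroid classifier below are illustrative assumptions for this sketch, not part of any framework discussed in this paper:

```python
import random

random.seed(0)

# Step 1: collect data -- here, a synthetic two-class toy dataset of 2-D points.
data = [([random.gauss(0, 1), random.gauss(0, 1)], 0) for _ in range(50)] + \
       [([random.gauss(3, 1), random.gauss(3, 1)], 1) for _ in range(50)]

# Step 2: prepare data -- shuffle; a real pipeline would also handle
# outliers and missing values here.
random.shuffle(data)

# Step 3: split into training and testing parts.
split = int(0.8 * len(data))
train, test = data[:split], data[split:]

# Build the model: a nearest-centroid classifier (one centroid per class).
def centroid(points):
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(2)]

centroids = {label: centroid([x for x, y in train if y == label])
             for label in (0, 1)}

def predict(x):
    dist = lambda c: sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda label: dist(centroids[label]))

# Step 4: evaluate accuracy on the held-out test set.
accuracy = sum(predict(x) == y for x, y in test) / len(test)
print(f"test accuracy: {accuracy:.2f}")
# Step 5 would compare this result against alternative models and keep the best.
```

Because the two classes are well separated, the held-out accuracy is high; step 5 would matter when a simple model like this one underperforms.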
Unsupervised learning is used for cluster analysis on a given input data set with the help of some objective function. The well-known K-means clustering algorithm and Principal Component Analysis (PCA) fall into this category.

B. SQL Query Based Applications

Big data frameworks provide functionality for users to run and analyze queries over data stored on the Hadoop Distributed File System (HDFS) or S3. Hive on Hadoop MapReduce, Spark SQL on Spark and many other frameworks provide APIs for SQL queries. Through these frameworks, data can be viewed and analyzed with basic queries such as aggregations and joins, even on the fly for business purposes.

C. Online Transaction Processing (OLTP) Applications

These applications have read, write and update operations, and each transaction execution is associated with semantics such as ACID or CAP. Most NoSQL frameworks, such as HBase, Neo4j and MongoDB, support OLTP workloads. Some, such as GraphLab and Pregel, are specialized for graph-structured data, which is required by social networking applications. The amount of data generated by social networking sites and other sources is growing at an exponential rate. Graph-based models used for parallel computation on graphs must keep the data consistent, and different techniques are applied to extract useful information from these huge data volumes. Graph analysis is one way to understand complex relations and to detect the different patterns present in large data. Available frameworks for graph processing include GraphX [8], GraphLab [9], Giraph [2] and Pregel [11]. PageRank, Connected Components, SVD++, Strongly Connected Components and Triangle Count are applications that belong to this category.

D. Streaming Based Applications

Processing large amounts of data is a challenging task, but making decisions on the fly in real time is even more challenging; this is where stream processing comes into the picture. Many business firms focus on streaming analysis to make business decisions, such as real-time recommendations for users, customers and clients. Stream processing mainly performs calculations, statistical analysis and continuous queries over a stream of data in real time. Available frameworks for stream processing include Spark Streaming [14], Apache Storm [4] (originally from Twitter) and IBM InfoSphere Streams [5]. Fraud detection, e-commerce, feeds, page views and trading are typical streaming based applications.

III. BIG DATA FRAMEWORKS

In this section, we discuss features of a few popular open source Big data frameworks.

A. Hadoop

Hadoop is generally used for batch processing and is designed for the parallel algorithms used in scientific applications. To overcome the limitation of running only MapReduce on a single cluster, an improved version of Hadoop [3] was introduced as Hadoop YARN [3]. The associated ML tool is Mahout. Hadoop is not suitable, first, when communication between splits is required; second, for long-running MapReduce jobs; and, last but not least, MapReduce loops have no concept of in-memory processing or caching.

B. Spark

Spark is another programming model used to process large data. According to the Spark community [13], it is much faster than Hadoop; the main reason behind this claim is that instead of placing intermediate data on disk, it stores it in memory, saving the I/O time otherwise spent storing and retrieving data from disk. Spark introduces a memory abstraction called resilient distributed datasets (RDDs); functions run in parallel over each record of HDFS or any other available storage. Two types of operations are performed on an RDD: transformations, which are lazily evaluated, and actions, which are applied to the transformed RDD. An RDD is a collection of objects partitioned across the cluster; a lost partition can be rebuilt using its lineage [12]. Spark scheduling uses a directed acyclic graph (DAG): a job has multiple stages, and independent stages can be scheduled simultaneously. Spark supports a wide range of application categories, including machine learning, Spark SQL, Spark Streaming and graph computation. The associated ML tools are MLlib, Mahout and H2O.

C. Flink

Flink is an open source platform for hybrid, interactive, real-time streaming, real-world streaming (out-of-order streams, windowing, back-pressure) and native iterative processing [1]. Its major abstraction is cyclic dataflows. Flink is an in-memory, low-latency engine. The associated ML tools are Flink-ML and SAMOA.

D. GraphX

The GraphX [8] framework is used for graph processing on top of Spark. It provides low-cost fault tolerance, and its support for iterative graph processing reduces memory overhead. Different frameworks expose different APIs for graph processing.

E. Spark Streaming

Hadoop does not have a fully fledged solution for streaming applications, even though it can process micro-batch jobs [10]. Spark Streaming provides strong consistency, scalability, parallelism and fault recovery through the discretized streams (D-Streams) [14] programming model, which intermixes streaming, batch and interactive queries.

IV. RULE BASE FOR CHOOSING A FRAMEWORK

This section provides a rule-based mapping between the applications and frameworks discussed in Sections 2 and 3 respectively. The first column of Table I presents the features of SQL query based analytic workloads, based on our experience in client projects. The remaining columns represent different Big data frameworks that support SQL query based analytic workloads; each cell in the table shows the match between a feature and a framework. Similarly, Tables II, III and IV present the rule-based mappings for NoSQL workloads, iterative computation based applications and streaming applications respectively.

V. CONCLUSIONS AND FUTURE WORK

A plethora of open source Big Data frameworks is available for parallel processing of CPU- and/or data-intensive applications, and selecting the right platform for a given workload is a daunting task for a user. In this paper we have categorized applications broadly into SQL query based analytic workloads, NoSQL workloads, iterative computation based applications and streaming applications, and have specified features for each of these application categories. Finally, we have presented a rule-based mapping for each category by specifying the level of support provided by various available Big data frameworks for each of the specified features. In future work we shall benchmark these available Big data platforms for different types of applications with the features mentioned in this paper and corroborate the results with actual measurements.
We also plan to integrate this rule base mapping with our larger project on migrating applications deployed on traditional systems to Big data platforms.

REFERENCES

[1]
[2]
[3]
[4]
[5]
[6] Y. Bu, B. Howe, M. Balazinska, and M. D. Ernst. HaLoop: Efficient iterative data processing on large clusters. PVLDB, 3(1).
[7] J. Ekanayake, H. Li, B. Zhang, T. Gunarathne, S.-H. Bae, J. Qiu, and G. Fox. Twister: a runtime for iterative MapReduce. In S. Hariri and K. Keahey, editors, HPDC. ACM.
[8] J. E. Gonzalez, R. S. Xin, A. Dave, D. Crankshaw, M. J. Franklin, and I. Stoica. GraphX: Graph processing in a distributed dataflow framework. In J. Flinn and H. Levy, editors, OSDI. USENIX Association.
[9] Y. Low, J. Gonzalez, A. Kyrola, D. Bickson, C. Guestrin, and J. M. Hellerstein. Distributed GraphLab: A framework for machine learning in the cloud. PVLDB, 5(8).
[10] S. Shahrivari. Beyond batch processing: Towards real-time and streaming big data. Computers, 3(4).
[11] C. E. Tsourakakis. Pegasus: A system for large-scale graph processing. In S. Sakr and M. M. Gaber, editors, Large Scale and Big Data. Auerbach Publications.
[12] M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M. J. Franklin, S. Shenker, and I. Stoica. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In S. D. Gribble and D. Katabi, editors, NSDI. USENIX Association.
[13] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica. Spark: Cluster computing with working sets. In E. M. Nahum and D. Xu, editors, HotCloud. USENIX Association.
[14] M. Zaharia, T. Das, H. Li, S. Shenker, and I. Stoica. Discretized streams: An efficient and fault-tolerant model for stream processing on large clusters. In R. Fonseca and D. A. Maltz, editors, HotCloud. USENIX Association, 2012.

TABLE I: Rule Base for SQL Analytic Workload
(Frameworks compared: Hive, Presto, Drill, Spark)

Execution framework: Hive = MapReduce; Presto = pipelined query execution; Drill = pipelined query execution; Spark = MapReduce and DAG
Model: Hive = batch processing; Presto = real-time interaction using query pipelines; Drill = real-time interaction; Spark = micro-batch processing and streaming
Speed: Hive = fast; Presto = about 10x faster than Hive; Drill = fast; Spark = about 10x faster than MapReduce on disk and 100x faster in memory
Data size handled: Hive = terabytes; Presto = petabytes; Drill = gigabytes to petabytes; Spark = petabytes
ML support: Hive = no; Presto = no; Drill = no; Spark = yes
Language support: Hive = SQL; Presto = C, Java, PHP, Python, R, Ruby, SQL; Drill = ANSI SQL, Mongo QL, Java API; Spark = Scala, Java, Python, SQL
Horizontal cluster scalability: Hive = scalable (100+ nodes); Presto = scalable (1000+ nodes); Drill = scalable (100+ nodes); Spark = scalable (1000+ nodes)
Query optimization: Hive = rule based; Presto = cost based (not yet available); Drill = cost based; Spark = Catalyst
YARN support: Hive = yes; Presto = not yet; Drill = yes; Spark = yes
Storage support: Hive = HDFS, local file system, S3; Presto = local, HDFS, S3; Drill = local, HDFS, S3, MapR-FS; Spark = HDFS, S3, HBase, Cassandra

TABLE II: Rule Base for NoSQL Workload
(Key-value stores: Redis, Riak; document stores: MongoDB, CouchDB; graph based: Titan, Neo4j; wide column: HBase, Cassandra)

Schema flexibility: Redis = high; Riak = high; MongoDB = high; CouchDB = high; Titan = high; Neo4j = high; HBase = moderate; Cassandra = moderate
Implementation language: Redis = C; Riak = Erlang; MongoDB = C++; CouchDB = Erlang; Titan = Java; Neo4j = Java; HBase = Java; Cassandra = Java
Notable design features: Redis = sorted sets; Riak = vector clocks; MongoDB = indexing, GridFS; CouchDB = master-master replication; Titan = pluggable storage backends (Cassandra, HBase, MapR, Hazelcast); HBase = built-in data compression and MapReduce support; Cassandra = partitioning with tunable consistency
API and access methods: Redis = proprietary protocol, key commands; Riak = HTTP API, native Erlang interface, MapReduce with term matching; MongoDB = proprietary protocol using JSON/binary (BSON), dynamic object-based language and MapReduce; CouchDB = RESTful HTTP/JSON API, MapReduce of JavaScript functions; Titan = Blueprints, Gremlin, SparQL, Python, Clojure; Neo4j = Cypher query language, SparQL, native Java API, RESTful HTTP API; HBase = internal API, RESTful HTTP API, Thrift; Cassandra = SQL-like CQL, Thrift, custom binary protocol
Concurrency: Redis = in memory; Riak = eventually consistent; MongoDB = update in place (master-slave with multi-granularity locking); CouchDB = MVCC (application can select optimistic or pessimistic locking); Titan = ACID, tunable consistency; Neo4j = ACID (non-blocking reads; write locks on involved nodes/relationships until commit); HBase = optimistic locking with MVCC; Cassandra = tunable consistency
Complexity: Redis = none; Riak = none; MongoDB = low; CouchDB = low; Titan = high; Neo4j = high; HBase = low; Cassandra = low
CAP: Redis = AP; Riak = AP; MongoDB = CP; CouchDB = CP; Titan = AP/CP; Neo4j = CP; HBase = CP; Cassandra = AP/CP
Storage: Redis = volatile memory; Riak = Bitcask, LevelDB; MongoDB = volatile memory, file system; CouchDB = volatile memory, file system; Titan = backing store (Cassandra, HBase); Neo4j = volatile memory, file system; HBase = HDFS; Cassandra = file system
MapReduce support: Redis = no; Riak = yes; MongoDB = yes; CouchDB = yes; Titan = yes; Neo4j = no; HBase = yes; Cassandra = yes
Horizontally scalable: Redis = yes; Riak = yes; MongoDB = yes; CouchDB = yes; Titan = yes; Neo4j = no; HBase = yes; Cassandra = yes
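A rule base like Table I can also be consulted programmatically. The sketch below is a simplified, hypothetical encoding of a few Table I features; the feature names and the 0/1/2 support scores are assumptions for illustration, not the paper's exact rule base:

```python
# Illustrative rule base: framework -> feature -> support level
# (0 = no support, 1 = partial, 2 = full). Scores loosely follow Table I;
# this is a simplified sketch, not a complete encoding of the tables.
RULE_BASE = {
    "Hive":   {"ml_support": 0, "real_time": 0, "petabyte_scale": 1, "yarn": 2},
    "Presto": {"ml_support": 0, "real_time": 2, "petabyte_scale": 2, "yarn": 0},
    "Drill":  {"ml_support": 0, "real_time": 2, "petabyte_scale": 2, "yarn": 2},
    "Spark":  {"ml_support": 2, "real_time": 1, "petabyte_scale": 2, "yarn": 2},
}

def choose_framework(required_features):
    """Rank frameworks by their total support for the required features."""
    scores = {
        name: sum(support.get(feature, 0) for feature in required_features)
        for name, support in RULE_BASE.items()
    }
    return sorted(scores, key=scores.get, reverse=True)

ranking = choose_framework(["ml_support", "petabyte_scale", "yarn"])
print(ranking[0])  # prints "Spark"
```

A workload needing ML support at petabyte scale on YARN ranks Spark first under these scores; a real rule base would encode every feature row of Tables I to IV and could weight features by importance.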

TABLE III: Rule Base for Iterative Computation Applications
(Frameworks compared: Hadoop, Twister, Flink, Spark)

Framework: Hadoop = MapReduce; Twister = MapReduce; Flink = PACT; Spark = MapReduce with DAG scheduling
Model: Hadoop = batch; Twister = iterative MapReduce; Flink = operator based; Spark = micro-batch
Optimization / data caching: Hadoop = manual, no caching; Twister = auto-tuned, caches static data; Flink = automatic, static data path; Spark = manual, static data cached across iterations
Latency: Hadoop = high; Twister = medium; Flink = medium (better than Spark); Spark = medium
Fault tolerance: Hadoop = medium; Twister = medium; Flink = high; Spark = high (through lineage)
Memory management: Hadoop = manual; Twister = manual; Flink = automatic; Spark = automatic
Iteration support: Hadoop = no; Twister = yes; Flink = yes; Spark = yes
General purpose use: Hadoop = ETL; Twister = iterative algorithms; Flink = ETL, machine learning, iterative algorithms; Spark = ETL, machine learning, iterative algorithms
Storage support: Hadoop = HDFS, local file system, S3; Twister = local file systems; Flink = HDFS, S3, MapR, Tachyon; Spark = HDFS, local file system, S3, HBase, Cassandra
Language support: Hadoop = Java; Twister = Java; Flink = Java, Scala, Python; Spark = Java, Scala, Python
Horizontal cluster scalability: Hadoop = scalable (100 to 1000 nodes); Twister = scalable (100 to 1000 nodes); Flink = highly scalable; Spark = highly scalable

TABLE IV: Rule Base for Streaming Applications
(Frameworks compared: Hadoop, Spark Streaming, Storm, Flink, Samza)

Processing framework: Hadoop = batch; Spark Streaming = batch and streaming; Storm = real-time event based; Flink = batch and real-time stream processing; Samza = real-time event based
Streaming model: Hadoop = batch; Spark Streaming = micro-batching (accumulates stream messages); Storm = native (micro-batching with the Trident API); Flink = native; Samza = native (relies on Kafka for internal messaging)
Response time: Hadoop = minutes; Spark Streaming = seconds; Storm = milliseconds; Flink = milliseconds; Samza = sub-second
Stream source / primitive / computation: Hadoop = NA; Spark Streaming = receivers / DStreams / transformations and window operations; Storm = spouts / tuples / bolts; Flink = DataStream; Samza = consumers / messages / tasks
Hadoop integration: Hadoop = YARN, HDFS, MapReduce; Spark Streaming = YARN, HDFS, Mesos; Storm = runs on YARN and interacts with Hadoop and HDFS; Flink = runs on YARN as an application; Samza = requires YARN and HDFS
Delivery semantics: Hadoop = NA; Spark Streaming = exactly once (except in some failure scenarios); Storm = at least once (exactly once with Trident); Flink = exactly once; Samza = at least once
State management: Hadoop = stateless; Spark Streaming = stateful (writes state to storage; dedicated DStream); Storm = not built in, stateless (roll your own or use Trident); Flink = stateful operators; Samza = stateful (embedded key-value store)
Language support: Hadoop = Java; Spark Streaming = Scala, Java, Python; Storm = JVM languages, Ruby, Python, JavaScript, Perl; Flink = Java, Scala, Python; Samza = Scala, Java (JVM languages only)
API: Hadoop = NA; Spark Streaming = declarative; Storm = compositional; Flink = declarative; Samza = compositional
Latency: Hadoop = minutes; Spark Streaming = seconds; Storm = milliseconds; Flink = milliseconds; Samza = milliseconds
Throughput: Hadoop = NA; Spark Streaming = 100k+ records per node per second; Storm = 10k+ records per node per second; Flink = 100k+ records per node per second; Samza = 100k+ records per node per second
Scalability to large input volumes (streams): Hadoop = no; Spark Streaming = no; Storm = yes; Flink = yes; Samza = yes
Fault tolerance (completing the computation correctly under failure): Hadoop = yes; Spark Streaming = yes; Storm = yes; Flink = yes; Samza = yes
Accuracy and repeatability: Hadoop = no; Spark Streaming = no; Storm = no; Flink = yes; Samza = no
Queryable (querying results inside the stream processor without exporting them to an external database): Hadoop = no; Spark Streaming = no; Storm = no; Flink = no (upcoming feature); Samza = no
In-memory processing: Hadoop = no; Spark Streaming = yes; Storm = yes; Flink = yes; Samza = no
Resource manager: Hadoop = YARN; Spark Streaming = YARN, Mesos; Storm = YARN, Mesos; Flink = YARN; Samza = YARN
Supported ML tools: Hadoop = Mahout; Spark Streaming = Mahout, MLlib, H2O; Storm = SAMOA; Flink = Flink-ML, SAMOA; Samza = SAMOA
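Section III.B describes RDD transformations as lazily evaluated and actions as the operations that trigger computation. The toy class below mimics that evaluation model in plain Python; it is an illustration of the lazy-transformation idea only, not Spark's actual RDD implementation:

```python
class MiniRDD:
    """Toy stand-in for an RDD: transformations are recorded lazily,
    actions replay the recorded lineage over the data."""

    def __init__(self, data, ops=None):
        self._data = data
        self._ops = ops or []  # the recorded "lineage" of transformations

    # Transformations: return a new MiniRDD; nothing is computed yet.
    def map(self, fn):
        return MiniRDD(self._data, self._ops + [("map", fn)])

    def filter(self, pred):
        return MiniRDD(self._data, self._ops + [("filter", pred)])

    # Actions: replay the lineage over the data and return a result.
    def collect(self):
        items = iter(self._data)
        for kind, fn in self._ops:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return list(items)

    def count(self):
        return len(self.collect())

rdd = MiniRDD(range(10)).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
# No computation has happened yet; count() is the action that triggers it.
print(rdd.count())  # prints 5
```

Here `map` and `filter` only append to the lineage; the work happens when `count()` replays the recorded operations. Recomputing a lost partition from its lineage, as described in Section III.B, works on the same principle.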


More information

CIB Session 12th NoSQL Databases Structures

CIB Session 12th NoSQL Databases Structures CIB Session 12th NoSQL Databases Structures By: Shahab Safaee & Morteza Zahedi Software Engineering PhD Email: safaee.shx@gmail.com, morteza.zahedi.a@gmail.com cibtrc.ir cibtrc cibtrc 2 Agenda What is

More information

Lecture 11 Hadoop & Spark

Lecture 11 Hadoop & Spark Lecture 11 Hadoop & Spark Dr. Wilson Rivera ICOM 6025: High Performance Computing Electrical and Computer Engineering Department University of Puerto Rico Outline Distributed File Systems Hadoop Ecosystem

More information

Apache Flink Big Data Stream Processing

Apache Flink Big Data Stream Processing Apache Flink Big Data Stream Processing Tilmann Rabl Berlin Big Data Center www.dima.tu-berlin.de bbdc.berlin rabl@tu-berlin.de XLDB 11.10.2017 1 2013 Berlin Big Data Center All Rights Reserved DIMA 2017

More information

COMPARATIVE EVALUATION OF BIG DATA FRAMEWORKS ON BATCH PROCESSING

COMPARATIVE EVALUATION OF BIG DATA FRAMEWORKS ON BATCH PROCESSING Volume 119 No. 16 2018, 937-948 ISSN: 1314-3395 (on-line version) url: http://www.acadpubl.eu/hub/ http://www.acadpubl.eu/hub/ COMPARATIVE EVALUATION OF BIG DATA FRAMEWORKS ON BATCH PROCESSING K.Anusha

More information

Blended Learning Outline: Developer Training for Apache Spark and Hadoop (180404a)

Blended Learning Outline: Developer Training for Apache Spark and Hadoop (180404a) Blended Learning Outline: Developer Training for Apache Spark and Hadoop (180404a) Cloudera s Developer Training for Apache Spark and Hadoop delivers the key concepts and expertise need to develop high-performance

More information

Spark: A Brief History. https://stanford.edu/~rezab/sparkclass/slides/itas_workshop.pdf

Spark: A Brief History. https://stanford.edu/~rezab/sparkclass/slides/itas_workshop.pdf Spark: A Brief History https://stanford.edu/~rezab/sparkclass/slides/itas_workshop.pdf A Brief History: 2004 MapReduce paper 2010 Spark paper 2002 2004 2006 2008 2010 2012 2014 2002 MapReduce @ Google

More information

I ++ Mapreduce: Incremental Mapreduce for Mining the Big Data

I ++ Mapreduce: Incremental Mapreduce for Mining the Big Data IOSR Journal of Computer Engineering (IOSR-JCE) e-issn: 2278-0661,p-ISSN: 2278-8727, Volume 18, Issue 3, Ver. IV (May-Jun. 2016), PP 125-129 www.iosrjournals.org I ++ Mapreduce: Incremental Mapreduce for

More information

Certified Big Data Hadoop and Spark Scala Course Curriculum

Certified Big Data Hadoop and Spark Scala Course Curriculum Certified Big Data Hadoop and Spark Scala Course Curriculum The Certified Big Data Hadoop and Spark Scala course by DataFlair is a perfect blend of indepth theoretical knowledge and strong practical skills

More information

Survey on Incremental MapReduce for Data Mining

Survey on Incremental MapReduce for Data Mining Survey on Incremental MapReduce for Data Mining Trupti M. Shinde 1, Prof.S.V.Chobe 2 1 Research Scholar, Computer Engineering Dept., Dr. D. Y. Patil Institute of Engineering &Technology, 2 Associate Professor,

More information

Stages of Data Processing

Stages of Data Processing Data processing can be understood as the conversion of raw data into a meaningful and desired form. Basically, producing information that can be understood by the end user. So then, the question arises,

More information

Hadoop Development Introduction

Hadoop Development Introduction Hadoop Development Introduction What is Bigdata? Evolution of Bigdata Types of Data and their Significance Need for Bigdata Analytics Why Bigdata with Hadoop? History of Hadoop Why Hadoop is in demand

More information

RESILIENT DISTRIBUTED DATASETS: A FAULT-TOLERANT ABSTRACTION FOR IN-MEMORY CLUSTER COMPUTING

RESILIENT DISTRIBUTED DATASETS: A FAULT-TOLERANT ABSTRACTION FOR IN-MEMORY CLUSTER COMPUTING RESILIENT DISTRIBUTED DATASETS: A FAULT-TOLERANT ABSTRACTION FOR IN-MEMORY CLUSTER COMPUTING Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin,

More information

Accelerate MySQL for Demanding OLAP and OLTP Use Cases with Apache Ignite. Peter Zaitsev, Denis Magda Santa Clara, California April 25th, 2017

Accelerate MySQL for Demanding OLAP and OLTP Use Cases with Apache Ignite. Peter Zaitsev, Denis Magda Santa Clara, California April 25th, 2017 Accelerate MySQL for Demanding OLAP and OLTP Use Cases with Apache Ignite Peter Zaitsev, Denis Magda Santa Clara, California April 25th, 2017 About the Presentation Problems Existing Solutions Denis Magda

More information

CompSci 516: Database Systems

CompSci 516: Database Systems CompSci 516 Database Systems Lecture 12 Map-Reduce and Spark Instructor: Sudeepa Roy Duke CS, Fall 2017 CompSci 516: Database Systems 1 Announcements Practice midterm posted on sakai First prepare and

More information

Spark. In- Memory Cluster Computing for Iterative and Interactive Applications

Spark. In- Memory Cluster Computing for Iterative and Interactive Applications Spark In- Memory Cluster Computing for Iterative and Interactive Applications Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael Franklin, Scott Shenker,

More information

Research challenges in data-intensive computing The Stratosphere Project Apache Flink

Research challenges in data-intensive computing The Stratosphere Project Apache Flink Research challenges in data-intensive computing The Stratosphere Project Apache Flink Seif Haridi KTH/SICS haridi@kth.se e2e-clouds.org Presented by: Seif Haridi May 2014 Research Areas Data-intensive

More information

Announcements. Reading Material. Map Reduce. The Map-Reduce Framework 10/3/17. Big Data. CompSci 516: Database Systems

Announcements. Reading Material. Map Reduce. The Map-Reduce Framework 10/3/17. Big Data. CompSci 516: Database Systems Announcements CompSci 516 Database Systems Lecture 12 - and Spark Practice midterm posted on sakai First prepare and then attempt! Midterm next Wednesday 10/11 in class Closed book/notes, no electronic

More information

L3: Spark & RDD. CDS Department of Computational and Data Sciences. Department of Computational and Data Sciences

L3: Spark & RDD. CDS Department of Computational and Data Sciences. Department of Computational and Data Sciences Indian Institute of Science Bangalore, India भ रत य व ज ञ न स स थ न ब गल र, भ रत Department of Computational and Data Sciences L3: Spark & RDD Department of Computational and Data Science, IISc, 2016 This

More information

Turning Relational Database Tables into Spark Data Sources

Turning Relational Database Tables into Spark Data Sources Turning Relational Database Tables into Spark Data Sources Kuassi Mensah Jean de Lavarene Director Product Mgmt Director Development Server Technologies October 04, 2017 3 Safe Harbor Statement The following

More information

Stream Processing on IoT Devices using Calvin Framework

Stream Processing on IoT Devices using Calvin Framework Stream Processing on IoT Devices using Calvin Framework by Ameya Nayak A Project Report Submitted in Partial Fulfillment of the Requirements for the Degree of Master of Science in Computer Science Supervised

More information

CISC 7610 Lecture 2b The beginnings of NoSQL

CISC 7610 Lecture 2b The beginnings of NoSQL CISC 7610 Lecture 2b The beginnings of NoSQL Topics: Big Data Google s infrastructure Hadoop: open google infrastructure Scaling through sharding CAP theorem Amazon s Dynamo 5 V s of big data Everyone

More information

Delving Deep into Hadoop Course Contents Introduction to Hadoop and Architecture

Delving Deep into Hadoop Course Contents Introduction to Hadoop and Architecture Delving Deep into Hadoop Course Contents Introduction to Hadoop and Architecture Hadoop 1.0 Architecture Introduction to Hadoop & Big Data Hadoop Evolution Hadoop Architecture Networking Concepts Use cases

More information

Distributed Computing with Spark and MapReduce

Distributed Computing with Spark and MapReduce Distributed Computing with Spark and MapReduce Reza Zadeh @Reza_Zadeh http://reza-zadeh.com Traditional Network Programming Message-passing between nodes (e.g. MPI) Very difficult to do at scale:» How

More information

IBM Data Science Experience White paper. SparkR. Transforming R into a tool for big data analytics

IBM Data Science Experience White paper. SparkR. Transforming R into a tool for big data analytics IBM Data Science Experience White paper R Transforming R into a tool for big data analytics 2 R Executive summary This white paper introduces R, a package for the R statistical programming language that

More information

A Tutorial on Apache Spark

A Tutorial on Apache Spark A Tutorial on Apache Spark A Practical Perspective By Harold Mitchell The Goal Learning Outcomes The Goal Learning Outcomes NOTE: The setup, installation, and examples assume Windows user Learn the following:

More information

Massive Online Analysis - Storm,Spark

Massive Online Analysis - Storm,Spark Massive Online Analysis - Storm,Spark presentation by R. Kishore Kumar Research Scholar Department of Computer Science & Engineering Indian Institute of Technology, Kharagpur Kharagpur-721302, India (R

More information

Analytics in Spark. Yanlei Diao Tim Hunter. Slides Courtesy of Ion Stoica, Matei Zaharia and Brooke Wenig

Analytics in Spark. Yanlei Diao Tim Hunter. Slides Courtesy of Ion Stoica, Matei Zaharia and Brooke Wenig Analytics in Spark Yanlei Diao Tim Hunter Slides Courtesy of Ion Stoica, Matei Zaharia and Brooke Wenig Outline 1. A brief history of Big Data and Spark 2. Technical summary of Spark 3. Unified analytics

More information

Big Data Hadoop Course Content

Big Data Hadoop Course Content Big Data Hadoop Course Content Topics covered in the training Introduction to Linux and Big Data Virtual Machine ( VM) Introduction/ Installation of VirtualBox and the Big Data VM Introduction to Linux

More information

Practical Big Data Processing An Overview of Apache Flink

Practical Big Data Processing An Overview of Apache Flink Practical Big Data Processing An Overview of Apache Flink Tilmann Rabl Berlin Big Data Center www.dima.tu-berlin.de bbdc.berlin rabl@tu-berlin.de With slides from Volker Markl and data artisans 1 2013

More information

MapR Enterprise Hadoop

MapR Enterprise Hadoop 2014 MapR Technologies 2014 MapR Technologies 1 MapR Enterprise Hadoop Top Ranked Cloud Leaders 500+ Customers 2014 MapR Technologies 2 Key MapR Advantage Partners Business Services APPLICATIONS & OS ANALYTICS

More information

Topics. Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples

Topics. Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples Hadoop Introduction 1 Topics Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples 2 Big Data Analytics What is Big Data?

More information

Jargons, Concepts, Scope and Systems. Key Value Stores, Document Stores, Extensible Record Stores. Overview of different scalable relational systems

Jargons, Concepts, Scope and Systems. Key Value Stores, Document Stores, Extensible Record Stores. Overview of different scalable relational systems Jargons, Concepts, Scope and Systems Key Value Stores, Document Stores, Extensible Record Stores Overview of different scalable relational systems Examples of different Data stores Predictions, Comparisons

More information

Hadoop, Yarn and Beyond

Hadoop, Yarn and Beyond Hadoop, Yarn and Beyond 1 B. R A M A M U R T H Y Overview We learned about Hadoop1.x or the core. Just like Java evolved, Java core, Java 1.X, Java 2.. So on, software and systems evolve, naturally.. Lets

More information

Cloud Computing & Visualization

Cloud Computing & Visualization Cloud Computing & Visualization Workflows Distributed Computation with Spark Data Warehousing with Redshift Visualization with Tableau #FIUSCIS School of Computing & Information Sciences, Florida International

More information

Parallel HITS Algorithm Implemented Using HADOOP GIRAPH Framework to resolve Big Data Problem

Parallel HITS Algorithm Implemented Using HADOOP GIRAPH Framework to resolve Big Data Problem I J C T A, 9(41) 2016, pp. 1235-1239 International Science Press Parallel HITS Algorithm Implemented Using HADOOP GIRAPH Framework to resolve Big Data Problem Hema Dubey *, Nilay Khare *, Alind Khare **

More information

Big Data Hadoop Developer Course Content. Big Data Hadoop Developer - The Complete Course Course Duration: 45 Hours

Big Data Hadoop Developer Course Content. Big Data Hadoop Developer - The Complete Course Course Duration: 45 Hours Big Data Hadoop Developer Course Content Who is the target audience? Big Data Hadoop Developer - The Complete Course Course Duration: 45 Hours Complete beginners who want to learn Big Data Hadoop Professionals

More information

Hadoop An Overview. - Socrates CCDH

Hadoop An Overview. - Socrates CCDH Hadoop An Overview - Socrates CCDH What is Big Data? Volume Not Gigabyte. Terabyte, Petabyte, Exabyte, Zettabyte - Due to handheld gadgets,and HD format images and videos - In total data, 90% of them collected

More information

Over the last few years, we have seen a disruption in the data management

Over the last few years, we have seen a disruption in the data management JAYANT SHEKHAR AND AMANDEEP KHURANA Jayant is Principal Solutions Architect at Cloudera working with various large and small companies in various Verticals on their big data and data science use cases,

More information

MapReduce Spark. Some slides are adapted from those of Jeff Dean and Matei Zaharia

MapReduce Spark. Some slides are adapted from those of Jeff Dean and Matei Zaharia MapReduce Spark Some slides are adapted from those of Jeff Dean and Matei Zaharia What have we learnt so far? Distributed storage systems consistency semantics protocols for fault tolerance Paxos, Raft,

More information

EXTRACT DATA IN LARGE DATABASE WITH HADOOP

EXTRACT DATA IN LARGE DATABASE WITH HADOOP International Journal of Advances in Engineering & Scientific Research (IJAESR) ISSN: 2349 3607 (Online), ISSN: 2349 4824 (Print) Download Full paper from : http://www.arseam.com/content/volume-1-issue-7-nov-2014-0

More information

NoSQL Databases MongoDB vs Cassandra. Kenny Huynh, Andre Chik, Kevin Vu

NoSQL Databases MongoDB vs Cassandra. Kenny Huynh, Andre Chik, Kevin Vu NoSQL Databases MongoDB vs Cassandra Kenny Huynh, Andre Chik, Kevin Vu Introduction - Relational database model - Concept developed in 1970 - Inefficient - NoSQL - Concept introduced in 1980 - Related

More information

Summary of Big Data Frameworks Course 2015 Professor Sasu Tarkoma

Summary of Big Data Frameworks Course 2015 Professor Sasu Tarkoma Summary of Big Data Frameworks Course 2015 Professor Sasu Tarkoma www.cs.helsinki.fi Course Schedule Tuesday 10.3. Introduction and the Big Data Challenge Tuesday 17.3. MapReduce and Spark: Overview Tuesday

More information

BIG DATA COURSE CONTENT

BIG DATA COURSE CONTENT BIG DATA COURSE CONTENT [I] Get Started with Big Data Microsoft Professional Orientation: Big Data Duration: 12 hrs Course Content: Introduction Course Introduction Data Fundamentals Introduction to Data

More information

Specialist ICT Learning

Specialist ICT Learning Specialist ICT Learning APPLIED DATA SCIENCE AND BIG DATA ANALYTICS GTBD7 Course Description This intensive training course provides theoretical and technical aspects of Data Science and Business Analytics.

More information

Using the SDACK Architecture to Build a Big Data Product. Yu-hsin Yeh (Evans Ye) Apache Big Data NA 2016 Vancouver

Using the SDACK Architecture to Build a Big Data Product. Yu-hsin Yeh (Evans Ye) Apache Big Data NA 2016 Vancouver Using the SDACK Architecture to Build a Big Data Product Yu-hsin Yeh (Evans Ye) Apache Big Data NA 2016 Vancouver Outline A Threat Analytic Big Data product The SDACK Architecture Akka Streams and data

More information

Introduction to Big-Data

Introduction to Big-Data Introduction to Big-Data Ms.N.D.Sonwane 1, Mr.S.P.Taley 2 1 Assistant Professor, Computer Science & Engineering, DBACER, Maharashtra, India 2 Assistant Professor, Information Technology, DBACER, Maharashtra,

More information

Data Clustering on the Parallel Hadoop MapReduce Model. Dimitrios Verraros

Data Clustering on the Parallel Hadoop MapReduce Model. Dimitrios Verraros Data Clustering on the Parallel Hadoop MapReduce Model Dimitrios Verraros Overview The purpose of this thesis is to implement and benchmark the performance of a parallel K- means clustering algorithm on

More information

Khadija Souissi. Auf z Systems November IBM z Systems Mainframe Event 2016

Khadija Souissi. Auf z Systems November IBM z Systems Mainframe Event 2016 Khadija Souissi Auf z Systems 07. 08. November 2016 @ IBM z Systems Mainframe Event 2016 Acknowledgements Apache Spark, Spark, Apache, and the Spark logo are trademarks of The Apache Software Foundation.

More information

Backtesting with Spark

Backtesting with Spark Backtesting with Spark Patrick Angeles, Cloudera Sandy Ryza, Cloudera Rick Carlin, Intel Sheetal Parade, Intel 1 Traditional Grid Shared storage Storage and compute scale independently Bottleneck on I/O

More information

Databases 2 (VU) ( / )

Databases 2 (VU) ( / ) Databases 2 (VU) (706.711 / 707.030) MapReduce (Part 3) Mark Kröll ISDS, TU Graz Nov. 27, 2017 Mark Kröll (ISDS, TU Graz) MapReduce Nov. 27, 2017 1 / 42 Outline 1 Problems Suited for Map-Reduce 2 MapReduce:

More information

Introduction to MapReduce Algorithms and Analysis

Introduction to MapReduce Algorithms and Analysis Introduction to MapReduce Algorithms and Analysis Jeff M. Phillips October 25, 2013 Trade-Offs Massive parallelism that is very easy to program. Cheaper than HPC style (uses top of the line everything)

More information

Embedded Technosolutions

Embedded Technosolutions Hadoop Big Data An Important technology in IT Sector Hadoop - Big Data Oerie 90% of the worlds data was generated in the last few years. Due to the advent of new technologies, devices, and communication

More information

Data Science and Open Source Software. Iraklis Varlamis Assistant Professor Harokopio University of Athens

Data Science and Open Source Software. Iraklis Varlamis Assistant Professor Harokopio University of Athens Data Science and Open Source Software Iraklis Varlamis Assistant Professor Harokopio University of Athens varlamis@hua.gr What is data science? 2 Why data science is important? More data (volume, variety,...)

More information

microsoft

microsoft 70-775.microsoft Number: 70-775 Passing Score: 800 Time Limit: 120 min Exam A QUESTION 1 Note: This question is part of a series of questions that present the same scenario. Each question in the series

More information

An Overview of Apache Spark

An Overview of Apache Spark An Overview of Apache Spark CIS 612 Sunnie Chung 2014 MapR Technologies 1 MapReduce Processing Model MapReduce, the parallel data processing paradigm, greatly simplified the analysis of big data using

More information

Cloud Computing 2. CSCI 4850/5850 High-Performance Computing Spring 2018

Cloud Computing 2. CSCI 4850/5850 High-Performance Computing Spring 2018 Cloud Computing 2 CSCI 4850/5850 High-Performance Computing Spring 2018 Tae-Hyuk (Ted) Ahn Department of Computer Science Program of Bioinformatics and Computational Biology Saint Louis University Learning

More information

IMPLEMENTING A LAMBDA ARCHITECTURE TO PERFORM REAL-TIME UPDATES

IMPLEMENTING A LAMBDA ARCHITECTURE TO PERFORM REAL-TIME UPDATES IMPLEMENTING A LAMBDA ARCHITECTURE TO PERFORM REAL-TIME UPDATES by PRAMOD KUMAR GUDIPATI B.E., OSMANIA UNIVERSITY (OU), INDIA, 2012 A REPORT submitted in partial fulfillment of the requirements of the

More information

Shen PingCAP 2017

Shen PingCAP 2017 Shen Li @ PingCAP About me Shen Li ( 申砾 ) Tech Lead of TiDB, VP of Engineering Netease / 360 / PingCAP Infrastructure software engineer WHY DO WE NEED A NEW DATABASE? Brief History Standalone RDBMS NoSQL

More information

International Journal of Advance Engineering and Research Development. Performance Comparison of Hadoop Map Reduce and Apache Spark

International Journal of Advance Engineering and Research Development. Performance Comparison of Hadoop Map Reduce and Apache Spark Scientific Journal of Impact Factor (SJIF): 5.71 International Journal of Advance Engineering and Research Development Volume 5, Issue 03, March -2018 e-issn (O): 2348-4470 p-issn (P): 2348-6406 Performance

More information

The SMACK Stack: Spark*, Mesos*, Akka, Cassandra*, Kafka* Elizabeth K. Dublin Apache Kafka Meetup, 30 August 2017.

The SMACK Stack: Spark*, Mesos*, Akka, Cassandra*, Kafka* Elizabeth K. Dublin Apache Kafka Meetup, 30 August 2017. Dublin Apache Kafka Meetup, 30 August 2017 The SMACK Stack: Spark*, Mesos*, Akka, Cassandra*, Kafka* Elizabeth K. Joseph @pleia2 * ASF projects 1 Elizabeth K. Joseph, Developer Advocate Developer Advocate

More information

Cloud, Big Data & Linear Algebra

Cloud, Big Data & Linear Algebra Cloud, Big Data & Linear Algebra Shelly Garion IBM Research -- Haifa 2014 IBM Corporation What is Big Data? 2 Global Data Volume in Exabytes What is Big Data? 2005 2012 2017 3 Global Data Volume in Exabytes

More information

Introduction to NoSQL Databases

Introduction to NoSQL Databases Introduction to NoSQL Databases Roman Kern KTI, TU Graz 2017-10-16 Roman Kern (KTI, TU Graz) Dbase2 2017-10-16 1 / 31 Introduction Intro Why NoSQL? Roman Kern (KTI, TU Graz) Dbase2 2017-10-16 2 / 31 Introduction

More information

Shark: SQL and Rich Analytics at Scale. Michael Xueyuan Han Ronny Hajoon Ko

Shark: SQL and Rich Analytics at Scale. Michael Xueyuan Han Ronny Hajoon Ko Shark: SQL and Rich Analytics at Scale Michael Xueyuan Han Ronny Hajoon Ko What Are The Problems? Data volumes are expanding dramatically Why Is It Hard? Needs to scale out Managing hundreds of machines

More information

Nowcasting. D B M G Data Base and Data Mining Group of Politecnico di Torino. Big Data: Hype or Hallelujah? Big data hype?

Nowcasting. D B M G Data Base and Data Mining Group of Politecnico di Torino. Big Data: Hype or Hallelujah? Big data hype? Big data hype? Big Data: Hype or Hallelujah? Data Base and Data Mining Group of 2 Google Flu trends On the Internet February 2010 detected flu outbreak two weeks ahead of CDC data Nowcasting http://www.internetlivestats.com/

More information

Oracle GoldenGate for Big Data

Oracle GoldenGate for Big Data Oracle GoldenGate for Big Data The Oracle GoldenGate for Big Data 12c product streams transactional data into big data systems in real time, without impacting the performance of source systems. It streamlines

More information