Apache Spark: A Literature Review. Presenter: Aaron Sarson

Similar documents
Big Data with Hadoop Ecosystem

Stages of Data Processing

Big Data com Hadoop. VIII Sessão - SQL Bahia. Impala, Hive e Spark. Diógenes Pires 03/03/2018

Big Data Architect.

BIG DATA COURSE CONTENT

D DAVID PUBLISHING. Big Data; Definition and Challenges. 1. Introduction. Shirin Abbasi

Overview. Prerequisites. Course Outline. Course Outline :: Apache Spark Development::

Hadoop, Yarn and Beyond

Introduction to Big Data. NoSQL Databases. Instituto Politécnico de Tomar. Ricardo Campos

This is a brief tutorial that explains how to make use of Sqoop in Hadoop ecosystem.

Databases and Big Data Today. CS634 Class 22

Jargons, Concepts, Scope and Systems. Key Value Stores, Document Stores, Extensible Record Stores. Overview of different scalable relational systems

Latest Trends in Database Technology NoSQL and Beyond

Cloud Computing & Visualization

Microsoft Big Data and Hadoop

Distributed Databases: SQL vs NoSQL

Comparative Analysis of Range Aggregate Queries In Big Data Environment

Juxtaposition of Apache Tez and Hadoop MapReduce on Hadoop Cluster - Applying Compression Algorithms

Study of NoSQL Database Along With Security Comparison

An Introduction to Apache Spark

Scaling Up HBase. Duen Horng (Polo) Chau Assistant Professor Associate Director, MS Analytics Georgia Tech. CSE6242 / CX4242: Data & Visual Analytics

Data Science and Open Source Software. Iraklis Varlamis Assistant Professor Harokopio University of Athens

Twitter data Analytics using Distributed Computing

Hadoop course content

Big Data Analytics using Apache Hadoop and Spark with Scala

Big Data Hadoop Developer Course Content. Big Data Hadoop Developer - The Complete Course Course Duration: 45 Hours

Apache Spark and Hadoop Based Big Data Processing System for Clinical Research

Time Series Storage with Apache Kudu (incubating)

CSE 444: Database Internals. Lecture 23 Spark

Oracle GoldenGate for Big Data

A data-driven framework for archiving and exploring social media data

A Security Audit Module for HBase

The Hadoop Ecosystem. EECS 4415 Big Data Systems. Tilemachos Pechlivanoglou

Analytic Cloud with. Shelly Garion. IBM Research -- Haifa IBM Corporation

CERTIFICATE IN SOFTWARE DEVELOPMENT LIFE CYCLE IN BIG DATA AND BUSINESS INTELLIGENCE (SDLC-BD & BI)

The age of Big Data Big Data for Oracle Database Professionals

NOSQL DATABASE SYSTEMS: DECISION GUIDANCE AND TRENDS. Big Data Technologies: NoSQL DBMS (Decision Guidance) - SoSe

Big Data Syllabus. Understanding big data and Hadoop. Limitations and Solutions of existing Data Analytics Architecture

Why NoSQL? Why Riak?

Nowcasting. D B M G Data Base and Data Mining Group of Politecnico di Torino. Big Data: Hype or Hallelujah? Big data hype?

Delving Deep into Hadoop Course Contents Introduction to Hadoop and Architecture

BIG DATA TECHNOLOGIES: WHAT EVERY MANAGER NEEDS TO KNOW ANALYTICS AND FINANCIAL INNOVATION CONFERENCE JUNE 26-29,

"Big Data... and Related Topics" John S. Erickson, Ph.D The Rensselaer IDEA Rensselaer Polytechnic Institute

Scalable Tools - Part I Introduction to Scalable Tools

Oracle Big Data. A NA LYT ICS A ND MA NAG E MENT.

Review - Relational Model Concepts

New Approaches to Big Data Processing and Analytics

CS639: Data Management for Data Science. Lecture 1: Intro to Data Science and Course Overview. Theodoros Rekatsinas

Splunk Review. 1. Introduction

CIB Session 12th NoSQL Databases Structures

DATABASE DESIGN II - 1DL400

TOOLS FOR INTEGRATING BIG DATA IN CLOUD COMPUTING: A STATE OF ART SURVEY

Virtual IMS user group: Newsletter 57

Big Data Technology Ecosystem. Mark Burnette Pentaho Director Sales Engineering, Hitachi Vantara

Oracle Big Data Connectors

Cloud Storage with AWS: EFS vs EBS vs S3 AHMAD KARAWASH

Lecture 25 Overview. Last Lecture Query optimisation/query execution strategies

Big Data Using Hadoop

Logging Reservoir Evaluation Based on Spark. Meng-xin SONG*, Hong-ping MIAO and Yao SUN

Innovatus Technologies

Oracle Database and Application Solutions

Cloud, Big Data & Linear Algebra

Real-time Calculating Over Self-Health Data Using Storm Jiangyong Cai1, a, Zhengping Jin2, b

Wearable Technology Orientation Using Big Data Analytics for Improving Quality of Human Life

Appropches used in efficient migrption from Relptionpl Dptpbpse to NoSQL Dptpbpse

Introduction to Data Science Day 2

STATE OF MODERN APPLICATIONS IN THE CLOUD

MapR Enterprise Hadoop

Unit 10 Databases. Computer Concepts Unit Contents. 10 Operational and Analytical Databases. 10 Section A: Database Basics

/ Cloud Computing. Recitation 10 March 22nd, 2016

COMPARATIVE EVALUATION OF BIG DATA FRAMEWORKS ON BATCH PROCESSING

Hadoop. Introduction / Overview

Increase Value from Big Data with Real-Time Data Integration and Streaming Analytics

Big Data and FrameWorks; Perspectives to Applied Machine Learning

BUSINESS INTELLIGENCE FOR EVALUATION E-VOUCHER AIRLINE REPORT

Hierarchy of knowledge BIG DATA 9/7/2017. Architecture

Security and Performance advances with Oracle Big Data SQL

Big Data Hadoop Stack

IT directors, CIO s, IT Managers, BI Managers, data warehousing professionals, data scientists, enterprise architects, data architects

Survey Paper on Traditional Hadoop and Pipelined Map Reduce

Election Analysis and Prediction Using Big Data Analytics

Management Information Systems Review Questions. Chapter 6 Foundations of Business Intelligence: Databases and Information Management

Online Bill Processing System for Public Sectors in Big Data

Introduction to Graph Databases

In-Memory Data processing using Redis Database

Big Data Performance and Comparison with Different DB Systems

What Next for DBAs in the Big Data Era

RAMCloud. Scalable High-Performance Storage Entirely in DRAM. by John Ousterhout et al. Stanford University. presented by Slavik Derevyanko

E6895 Advanced Big Data Analytics Lecture 4:

Challenges for Data Driven Systems

Parallel Programming Principle and Practice. Lecture 10 Big Data Processing with MapReduce

THE ATLAS DISTRIBUTED DATA MANAGEMENT SYSTEM & DATABASES

Microsoft Perform Data Engineering on Microsoft Azure HDInsight.

Research Works to Cope with Big Data Volume and Variety. Jiaheng Lu University of Helsinki, Finland

Adoption of E-Governance Applications towards Big Data Approach

Embedded Technosolutions

Introduction to NoSQL by William McKnight

Relational databases


Performance Comparison of NOSQL Database Cassandra and SQL Server for Large Databases

Transcription:

Apache Spark: A Literature Review Presenter: Aaron Sarson

Outline Introduction to Spark Problem to be addressed Proposed Approach Ø Research Questions Contributions Results Ø RQ1, RQ2, RQ3 Conclusion & Future Work

Introduction to Apache Spark Apache Spark is an open source big data processing framework Spark has become increasingly popular due to: Ø Avoids I/O bottlenecks Ø In- memory computations Uses linage- based recovery which is more efficient recovery method employed in Hadoop Apache Spark has a streaming component useful for providing input for real time analytics

Problem Lack of systematic survey that focuses on Apache Spark. What is state of the art work being done with Spark? Perform the gap analysis based on current literature. Discover limitations within current literature about Apache Spark Provide direction to researchers where the unexplored areas are i. No analysis of the domains where Apache Spark is being utilized ii. Lack of examination into the types/categories of papers iii. No exploration of the tools/components that have been used with Spark

Proposed Approach Number of papers to collect: 20 These papers will should represent a snapshot of the research area Databases Searched: ACM Digital Library, IEEE Library, Google Scholar Query Strings: { spark, apache spark, spark- based } x {, health, networking, BI, system, prediction } Inclusion/exclusion criteria: Journal articles, conference papers/proceedings Publication Year: [2014, 2016] Books, articles which only briefly mention spark

Proposed Approach- Research Questions The questions this literature review is considering: RQ1. What domains does the literature about Spark encompass? RQ2. What categories of papers are in literature? RQ3. What tools are used in the literature with Apache Spark?

Contributions My contributions to this research problem: Ø Systematic literature review on the current literature about Apache Spark Ø Literature review will evaluate the state of the art research as well as present a gap analysis

Results RQ1 Domain Count Health 2 i. A Spark- based workflow for probabilistic record linkage of health care data ii. SparkGIS: Efficient Comparison and Evaluation of Algorithm Results in Tissue Image Analysis Studies E- Commerce 1 i. A Reference Architecture and Road map for Enabling E- commerce on Apache Spark Media 2 i. A Spark- based Political Event Coding ii. Real- time News Recommendations using Apache Spark Network Monitoring 1 i. A Cloud Computing Based Network Monitoring and Threat Detection System for Critical Infrastructures Other 1 i. Apache Spark: A Unified Engine for Big Data Processing

Results - RQ2 Category Count Comparison/Informative 2 i. Apache Spark: A Unified Engine for Big Data Processing ii. SparkGIS: Efficient Comparison and Evaluation of Algorithm Results in Tissue Image Analysis Studies Systems 3 i. A Reference Architecture and Road map for Enabling E- commerce on Apache Spark ii. A Cloud Computing Based Network Monitoring and Threat Detection System for Critical Infrastructures iii. Real- time News Recommendations using Apache Spark Records Management 1 i. A Spark- based workflow for probabilistic record linkage of health care data Natural Language Processing 1 i. A Spark- based Political Event Coding

Results RQ3 NoSQL document database MongoDB CouchDB Terrastore NoSQL key- value database S3 Cassandra NoSQL columnar database HBASE Riak Casandra Hadoop (HDFS) Apache Spark Core SQL Streaming MLib GraphX CRM suite Standford CoreNLP PERTRARCH MySQL database Web Server

Conclusion & Future Work Preliminary Results: RQ1 à Domains where Apache Spark is used are concentrated in Health and Media RQ3 à a trend is emerging that the Apache Spark s streaming component is a heavily utilized component RQ3 à there are not a great deal of papers using Apache Spark s GraphX component. Future Work: Current state: 7 papers. Goal: 20 papers. Graph and analyze the results of RQ3 in more detail Try and find papers in other domains such as finance/economics relating to Spark

References [1] M. Zaharia, et al., "Apache Spark", Communications of the ACM, vol. 59, no. 11, pp. 56-65, 2016. [2]M. Solaimani, et al., "Spark- Based Political Event Coding", 2016 IEEE Second International Conference on Big Data Computing Service and Applications (BigDataService), 2016. [3]R. Pitta, C. Pinto and P. Melo, "A Spark- based workflow for probabilistic record linkage of Healthcare data", Workshop Proceedings of the EDBT/ICDT 2015 Joint Conference, pp. 1-10, 2015.

References [4] J. Domann, et al., "Real- time news recommendations using apache spark", Working Notes of the 7th International Conference of the CLEF Initiative, Evora, Portugal, 2016. [5]M. Sewak and S. Singh, "A Reference Architecture and Road map for Enabling E- commerce on Apache Spark", Communications on Applied Electronics (CAE), vol. 2, no. 1, pp. 37-42, 2015. [6]F. Baig, et al., "SparkGIS: Efficient Comparison and Evaluation of Algorithm Results in Tissue Image Analysis Studies", Lecture Notes in Computer Science, pp. 134-146, 2016. [7]Z. Chen, et al., "A Cloud Computing Based Network Monitoring and Threat Detection System for Critical Infrastructures", Big Data Research, vol. 3, pp. 10-23, 2016.