Apache Spark: A Literature Review. Presenter: Aaron Sarson

Apache Spark: A Literature Review Presenter: Aaron Sarson

Outline Introduction to Spark Problem to be addressed Proposed Approach Ø Research Questions Contributions Results Ø RQ1, RQ2, RQ3 Conclusion & Future Work

Introduction to Apache Spark Apache Spark is an open source big data processing framework Spark has become increasingly popular due to: Ø Avoids I/O bottlenecks Ø In- memory computations Uses linage- based recovery which is more efficient recovery method employed in Hadoop Apache Spark has a streaming component useful for providing input for real time analytics

Problem Lack of systematic survey that focuses on Apache Spark. What is state of the art work being done with Spark? Perform the gap analysis based on current literature. Discover limitations within current literature about Apache Spark Provide direction to researchers where the unexplored areas are i. No analysis of the domains where Apache Spark is being utilized ii. Lack of examination into the types/categories of papers iii. No exploration of the tools/components that have been used with Spark

Proposed Approach Number of papers to collect: 20 These papers will should represent a snapshot of the research area Databases Searched: ACM Digital Library, IEEE Library, Google Scholar Query Strings: { spark, apache spark, spark- based } x {, health, networking, BI, system, prediction } Inclusion/exclusion criteria: Journal articles, conference papers/proceedings Publication Year: [2014, 2016] Books, articles which only briefly mention spark

Proposed Approach- Research Questions The questions this literature review is considering: RQ1. What domains does the literature about Spark encompass? RQ2. What categories of papers are in literature? RQ3. What tools are used in the literature with Apache Spark?

Contributions My contributions to this research problem: Ø Systematic literature review on the current literature about Apache Spark Ø Literature review will evaluate the state of the art research as well as present a gap analysis

Results RQ1 Domain Count Health 2 i. A Spark- based workflow for probabilistic record linkage of health care data ii. SparkGIS: Efficient Comparison and Evaluation of Algorithm Results in Tissue Image Analysis Studies E- Commerce 1 i. A Reference Architecture and Road map for Enabling E- commerce on Apache Spark Media 2 i. A Spark- based Political Event Coding ii. Real- time News Recommendations using Apache Spark Network Monitoring 1 i. A Cloud Computing Based Network Monitoring and Threat Detection System for Critical Infrastructures Other 1 i. Apache Spark: A Unified Engine for Big Data Processing

Results - RQ2 Category Count Comparison/Informative 2 i. Apache Spark: A Unified Engine for Big Data Processing ii. SparkGIS: Efficient Comparison and Evaluation of Algorithm Results in Tissue Image Analysis Studies Systems 3 i. A Reference Architecture and Road map for Enabling E- commerce on Apache Spark ii. A Cloud Computing Based Network Monitoring and Threat Detection System for Critical Infrastructures iii. Real- time News Recommendations using Apache Spark Records Management 1 i. A Spark- based workflow for probabilistic record linkage of health care data Natural Language Processing 1 i. A Spark- based Political Event Coding

Results RQ3 NoSQL document database MongoDB CouchDB Terrastore NoSQL key- value database S3 Cassandra NoSQL columnar database HBASE Riak Casandra Hadoop (HDFS) Apache Spark Core SQL Streaming MLib GraphX CRM suite Standford CoreNLP PERTRARCH MySQL database Web Server

Conclusion & Future Work Preliminary Results: RQ1 à Domains where Apache Spark is used are concentrated in Health and Media RQ3 à a trend is emerging that the Apache Spark s streaming component is a heavily utilized component RQ3 à there are not a great deal of papers using Apache Spark s GraphX component. Future Work: Current state: 7 papers. Goal: 20 papers. Graph and analyze the results of RQ3 in more detail Try and find papers in other domains such as finance/economics relating to Spark

References [1] M. Zaharia, et al., "Apache Spark", Communications of the ACM, vol. 59, no. 11, pp. 56-65, 2016. [2]M. Solaimani, et al., "Spark- Based Political Event Coding", 2016 IEEE Second International Conference on Big Data Computing Service and Applications (BigDataService), 2016. [3]R. Pitta, C. Pinto and P. Melo, "A Spark- based workflow for probabilistic record linkage of Healthcare data", Workshop Proceedings of the EDBT/ICDT 2015 Joint Conference, pp. 1-10, 2015.

References [4] J. Domann, et al., "Real- time news recommendations using apache spark", Working Notes of the 7th International Conference of the CLEF Initiative, Evora, Portugal, 2016. [5]M. Sewak and S. Singh, "A Reference Architecture and Road map for Enabling E- commerce on Apache Spark", Communications on Applied Electronics (CAE), vol. 2, no. 1, pp. 37-42, 2015. [6]F. Baig, et al., "SparkGIS: Efficient Comparison and Evaluation of Algorithm Results in Tissue Image Analysis Studies", Lecture Notes in Computer Science, pp. 134-146, 2016. [7]Z. Chen, et al., "A Cloud Computing Based Network Monitoring and Threat Detection System for Critical Infrastructures", Big Data Research, vol. 3, pp. 10-23, 2016.