
Processing of big data with Apache Spark
JavaSkop 18, Aleksandar Donevski

AGENDA
What is Apache Spark?
Spark vs Hadoop MapReduce
Application Requirements
Example Architecture
Application Challenges

WHAT IS APACHE SPARK?
Engine for processing large-scale data
Open source
APIs in Java, Scala, Python, and R
Runs standalone or on YARN, Kubernetes, and Mesos
Accesses HDFS, HBase, Cassandra, S3, etc.

SPARK ARCHITECTURE
(architecture diagram)

RESILIENT DISTRIBUTED DATASET (RDD)
Characteristics: immutable, distributed, partitioned, and resilient
API:
1. Transformations: map(), filter(), distinct(), union(), subtract(), etc.
2. Actions: reduce(), collect(), count(), first(), take(), etc.

RDD OPERATIONS
Transformations are executed on the workers
Actions may transfer data from the workers to the driver
collect() sends all the partitions to the single driver
Persistence: persist() and cache()


SPARK VS MAPREDUCE
Spark:
Real-time and streaming processing
Processes data in-memory
Handles structures that cannot be decomposed into key-value pairs
MapReduce:
Batch mode, not real-time
Persists to disk after the map phase
Limited to key-value pairs


APPLICATION REQUIREMENTS
Verify that the application is thread-safe
Use synchronization blocks appropriately
Avoid duplicating objects
Prefer arrays of objects and primitive types
Avoid keeping unneeded data in objects
Always remember that the application is executed in parallel!

APPLICATION PIPELINE
Define the application pipeline using a SparkContext object
Encapsulate the common data needed across all pipeline steps
Prepare the common data and broadcast it to the workers as needed


EXAMPLE ARCHITECTURE
(diagram)
Driver: List of Files -> Balance -> Process Metrics
Worker 1 .. Worker N: Load Content -> Process & Enhance -> Persist Content


DEPENDENCY ISSUES
Example: Spark bundles Log4j 1.x while the application uses Log4j 2.x (async loggers); the application fails at the very beginning
Resolution:
Shade the conflicting dependencies
Provide the dependencies via the --jars property and set the spark.driver.userClassPathFirst=true and spark.executor.userClassPathFirst=true properties
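The second resolution could look roughly like this on the command line (the class name, jar names, and application jar are placeholders):

```shell
spark-submit \
  --class com.example.MyApp \
  --jars log4j-api-2.x.jar,log4j-core-2.x.jar \
  --conf spark.driver.userClassPathFirst=true \
  --conf spark.executor.userClassPathFirst=true \
  my-app.jar
```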

MEMORY ISSUES
Example: the default of 1 GB of RAM per executor is used; executors fail with OOM errors, so the application fails
Resolution:
Verify the memory available on the cluster
Monitor and measure memory usage
Tune memory per application case
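Executor and driver memory are typically tuned at submission time; a sketch with illustrative values (the right numbers depend on the cluster and the workload):

```shell
spark-submit \
  --driver-memory 2g \
  --executor-memory 4g \
  my-app.jar
```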

PERFORMANCE ISSUES
Example: execution takes too long for a simple set of data; the last task takes much longer than the rest
Resolution:
Verify the partitioning
Balance the processing time of each task

APPLICATION ISSUES
Example: the default Java serialization is used and serialization takes too long
Resolution:
Review the objects and data structures used
Use Kryo serialization
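Switching to Kryo is a configuration change; a spark-defaults.conf sketch (the registered class is a placeholder):

```
spark.serializer              org.apache.spark.serializer.KryoSerializer
spark.kryo.classesToRegister  com.example.Record
```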

API ISSUES
Releases come in three-to-four-month cycles
Lots of under-the-hood changes
Verify whether the application is affected by each release
