DATA SCIENCE USING SPARK: AN INTRODUCTION


TOPICS COVERED
- Introduction to Spark
- Getting Started with Spark
- Programming in Spark
- Data Science with Spark
- What next?

DATA SCIENCE PROCESS
Real world -> raw data is collected -> data is processed -> clean data -> exploratory data analysis -> machine learning algorithms / statistical models -> build data product
Communicate: visualizations, report findings, make decisions
Source: Doing Data Science by Rachel Schutt & Cathy O'Neil

DATA SCIENCE & DATA MINING
The distinctions are blurred among: Business Analytics, Data Science, Knowledge Discovery, Data Mining, Visual Data Mining, (Structured) Data Mining, (Unstructured) Data Mining, Big Data Engineering, Natural Language Processing, Text Mining, Web Mining, Statistics, Machine Learning, Database Management, Management Science, and Domain Knowledge.

WHAT DO WE NEED TO SUPPORT DATA SCIENCE WORK?
Data Input/Output
- Ability to read data in multiple formats
- Ability to read data from multiple sources
- Ability to deal with Big Data (Volume, Velocity, Veracity, and Variety)
Data Transformations
- Easy to describe and perform transformations on rows and columns of data
- Requires an abstraction of data and a dataflow paradigm
Model Development
- Library of data science algorithms
- Ability to import/export models from other sources
- Data science pipelines / workflow development
Analytics Application Development
- Seamless integration with programming languages / IDEs

INTRODUCTION TO SPARK (Sharan Kalwani)

WHAT IS SPARK?
A distributed computing platform designed to be:
- Fast
- General purpose: a general engine that allows combining multiple types of computation: batch, interactive, iterative, SQL queries, text processing, machine learning

Fast/Speed
- Computations run in memory
- Faster than MapReduce, even for on-disk computations
Generality
- Designed for a wide range of workloads
- A single engine combines batch, interactive, iterative, and streaming algorithms
- Rich high-level libraries and simple native APIs in Java, Scala, and Python
- Reduces the management burden of maintaining separate tools

SPARK UNIFIED STACK (diagram)

CLUSTER MANAGERS
Spark can run on a variety of cluster managers:
- Hadoop YARN (Yet Another Resource Negotiator): a cluster management technology and one of the key features of Hadoop 2
- Apache Mesos: abstracts CPU, memory, storage, and other compute resources away from machines, enabling fault-tolerant and elastic distributed systems
- Spark Standalone Scheduler: provides an easy way to get started on an empty set of machines
Spark can leverage existing Hadoop infrastructure.

SPARK HISTORY
- Started in 2009 as a research project in UC Berkeley's RAD Lab, which became the AMPLab
- Spark researchers found that Hadoop MapReduce was inefficient for iterative and interactive computing
- Spark was designed from the beginning to be fast for interactive and iterative workloads, with support for in-memory storage and fault tolerance
- Apart from UC Berkeley, Databricks, Yahoo!, and Intel are major contributors
- Spark was open sourced in March 2010 and became an Apache Foundation project in June 2013

SPARK VS HADOOP
Hadoop MapReduce:
- Mostly suited for batch jobs
- Difficult to program directly in MapReduce
- Batch doesn't compose well for large apps; specialized systems are needed as a workaround
Spark:
- Handles batch, interactive, and real-time workloads within a single framework
- Native integration with Java, Python, and Scala
- Programming at a higher level of abstraction
- More general than MapReduce

GETTING STARTED WITH SPARK

GETTING STARTED WITH SPARK
There are multiple ways of using Spark:
- Certified Spark distributions (not covered today): Datastax Enterprise (Cassandra + Spark), Hortonworks HDP, MapR
- Local/standalone
- Databricks Cloud
- Amazon AWS EC2

LOCAL MODE
- Install Java JDK 6/7 on Mac OS X or Windows: http://www.oracle.com/technetwork/java/javase/downloads/jdk7-downloads-1880260.html
- Install Python 2.7 using Anaconda (only on Windows): https://store.continuum.io/cshop/anaconda/
- Download Apache Spark from Databricks and unzip the downloaded file to a convenient location: http://training.databricks.com/workshop/usb.zip
- Connect to the newly created spark-training directory
- Run the interactive Scala shell (REPL): ./spark/bin/spark-shell

  val data = 1 to 1000
  val distData = sc.parallelize(data)
  val filteredData = distData.filter(s => s < 25)
  filteredData.collect()

DATABRICKS CLOUD
A hosted data platform powered by Apache Spark. Features:
- Exploration and visualization
- Managed Spark clusters
- Production pipelines
- Support for 3rd-party apps (Tableau, Pentaho, QlikView)
Databricks Cloud trial: http://databricks.com/registration

DATABRICKS CLOUD
Main components: Workspace, Tables, Clusters

DATABRICKS CLOUD
Notebooks:
- Python, Scala, SQL
- Visualizations
- Markup comments
- Collaboration

DATABRICKS CLOUD
Tables: Hive tables, SQL, DBFS, S3, CSV, databases

AMAZON EC2
Launch a Linux instance on EC2 and set up EC2 keys.

AMAZON EC2
Set up an EC2 key pair from the AWS console.

AMAZON EC2
The Spark binary ships with a spark-ec2 script to manage clusters on EC2.
Launching a Spark cluster on EC2:
  ./spark-ec2 -k <keypair> -i <key-file> -s <num-slaves> launch <cluster-name>
Running applications:
  ./spark-ec2 -k <keypair> -i <key-file> login <cluster-name>
Terminating a cluster:
  ./spark-ec2 destroy <cluster-name>
Accessing data in S3:
  s3n://<bucket>/path

PROGRAMMING IN SPARK (Sharan Kalwani)

Spark Cluster (diagram): runs on Mesos, YARN, or Standalone

Scala (Scalable Language)
- Scala is a multi-paradigm programming language with a focus on the functional programming paradigm.
- In functional programming, functions are first-class values and variables are immutable.
- Every operator, variable, and function is an object.
- Scala generates bytecode that runs on top of any JVM and can also use any Java library.
- Spark is completely written in Scala; Spark SQL, GraphX, Spark Streaming, etc. are libraries written in Scala.
- Scala Crash Course by Holden Karau @databricks: lintool.github.io/sparktutorial/slides/day1_scala_crash_course.pdf
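The functional style described above can be sketched in a few lines of plain Scala (a minimal illustration, not taken from the slides):

```scala
// Immutable values and higher-order functions in plain Scala
val numbers = List(1, 2, 3, 4, 5)         // an immutable collection
val doubled = numbers.map(_ * 2)          // pass a function as an argument
val evens   = doubled.filter(_ % 2 == 0)  // chain transformations
println(evens.sum)                        // prints 30
```

This map/filter style carries over directly to Spark's RDD API, which deliberately mirrors the Scala collections API.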

Spark Model
- Write programs in terms of transformations on distributed datasets
- Resilient Distributed Datasets (RDDs): read-only collections of objects that can be stored in memory or on disk across a cluster
- Partitions are automatically rebuilt on failure
- Parallel functional transformations (map, filter, ...)
- Familiar Scala collections API for distributed data and computation
- Lazy transformations

Spark Core: RDD (Resilient Distributed Dataset)
The primary abstraction in Spark: a fault-tolerant collection of elements that can be operated on in parallel.
Two types:
- Parallelized Scala collections
- Hadoop datasets
Transformations and actions can be performed on RDDs:
- Transformations operate on an RDD and return a new RDD; they are lazily evaluated
- Actions return a value after running a computation on an RDD; the DAG is evaluated only when an action takes place
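The transformation/action split can be mimicked locally with Scala's lazy views (an analogy only, not real Spark; in spark-shell the dataset would come from sc.parallelize instead):

```scala
// Local analogy of RDD laziness using a Scala view (not a real RDD)
var evaluations = 0
val data = (1 to 100).view                               // nothing computed yet
val small = data.filter { n => evaluations += 1; n < 5 } // "transformation": recorded lazily
println(evaluations)                                     // prints 0 -- still no work done
val result = small.toList                                // "action": forces evaluation
println(result)                                          // prints List(1, 2, 3, 4)
println(evaluations)                                     // prints 100 -- every element was visited
```

The same deferred-evaluation behavior is what lets Spark build a DAG of transformations and only schedule cluster work when an action such as collect() or count() runs.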

Spark Shell
- Interactive queries and prototyping
- Runs local, on YARN, or on Mesos
- Static type checking and auto-complete

Spark compared to Java (native Hadoop): code comparison (diagram)

Spark Streaming
- Real-time computation, similar to Storm
- Input distributed to memory for fault tolerance
- Streaming input is chopped into sliding windows of RDDs
- Input sources: Kafka, Flume, Kinesis, HDFS
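The sliding-window idea can be sketched with plain Scala collections (a local analogy; real Spark Streaming windows DStreams of RDDs by time duration, not by element count):

```scala
// Local analogy of windowing a stream: group arriving values
// into overlapping windows and compute something per window
val stream  = List(3, 1, 4, 1, 5, 9, 2)
val windows = stream.sliding(3, 1).toList  // windows of 3 values, sliding by 1
val sums    = windows.map(_.sum)           // a per-window computation, e.g. a windowed sum
println(sums)                              // prints List(8, 6, 10, 15, 16)
```

In Spark Streaming the equivalent is a windowed operation over a DStream, with the window length and slide interval given as durations.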



DATA SCIENCE USING SPARK

WHY SPARK FOR DATA SCIENCE?
- Fast: distributed in-memory platform
- Scalable: small to big data; well integrated into the Big Data ecosystem
- SPARK is HOT
- Expressive: simple, higher-level abstractions for describing computations
- Flexible: extendible, with multiple language bindings (Scala, Java, Python, R)

Traditional Data Science Tools
- Matlab, SAS, RapidMiner, R, SPSS, and many others
- Designed to work on single machines
- Proprietary and expensive

What is available in Spark?
- Analytics workflows (ML Pipeline)
- Library of algorithms (MLlib, R packages, Mahout?, graph algorithms)
- Extensions to RDD (SchemaRDD, RRDD, RDPG, DStreams)
- Basic RDD (transformations & actions)

DATA TYPES FOR DATA SCIENCE (MLLIB)
Single-machine data types:
- Local vector
- Labeled point
- Local matrix
Distributed data types (backed by RDDs):
- Distributed matrix: RowMatrix, IndexedRowMatrix, CoordinateMatrix
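As a rough sketch of what a labeled point holds, here are hypothetical plain-Scala stand-ins (illustration only; these are not MLlib's actual classes, which live under org.apache.spark.mllib and are built with Vectors.dense):

```scala
// Hypothetical stand-ins for MLlib's single-machine types (not the real API)
case class LocalVector(values: Array[Double])
case class LabeledPoint(label: Double, features: LocalVector)

// one training example: label 1.0 with two features
val point = LabeledPoint(1.0, LocalVector(Array(0.5, 1.2)))
println(point.label)                   // prints 1.0
println(point.features.values.length)  // prints 2
```

The distributed matrix types listed above play the same role at cluster scale: each row of a RowMatrix, for example, is a local vector stored inside an RDD.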

Schema RDDs


R to Spark Dataflow
(diagram) Local side: R, with a Spark Context referenced in R and backed by a Java Spark Context. Worker side: Spark Executors run tasks, with broadcast variables and R packages available on each worker.


(Stack diagram: Shark and Spark on Mesos)