CS 696 Intro to Big Data: Tools and Methods Fall Semester, 2016 Doc 25 Spark 2 Nov 29, 2016


CS 696 Intro to Big Data: Tools and Methods Fall Semester, 2016 Doc 25 Spark 2 Nov 29, 2016

Copyright 2016, all rights reserved. SDSU & Roger Whitney, 5500 Campanile Drive, San Diego, CA 92182-7700 USA. The OpenContent (http://www.opencontent.org/openpub/) license defines the copyright on this document.

Transformations

JavaSparkContext sc = new JavaSparkContext();
JavaRDD<Integer> rdd = sc.parallelize(Arrays.asList(1, 2, 3, 3));

rdd.map(x -> x + 1)      [2, 3, 4, 4]
rdd.distinct()           [1, 2, 3]
rdd.sample(false, 0.5)   varies

Transformations

JavaSparkContext sc = new JavaSparkContext();
JavaRDD<Integer> rdd = sc.parallelize(Arrays.asList(1, 2, 3));
JavaRDD<Integer> other = sc.parallelize(Arrays.asList(3, 4, 5));

rdd.union(other)          [1, 2, 3, 3, 4, 5]
rdd.intersection(other)   [3]
rdd.subtract(other)       [1, 2]

Transformations & Output

JavaSparkContext sc = new JavaSparkContext();
JavaRDD<Integer> rdd = sc.parallelize(Arrays.asList(1, 2, 3));
JavaRDD<Integer> other = sc.parallelize(Arrays.asList(3, 4, 5));
JavaRDD<Integer> result = rdd.union(other);
result.saveAsTextFile(somePath);

Actions & Output

JavaSparkContext sc = new JavaSparkContext();
JavaRDD<Integer> rdd = sc.parallelize(Arrays.asList(1, 2, 3));
JavaRDD<Integer> other = sc.parallelize(Arrays.asList(3, 4, 5));
JavaRDD<Integer> result = rdd.union(other);
List<Integer> onMaster = result.collect();

collect() returns a java.util.List, which has no Spark saveAsTextFile method.

Writing to HDFS

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.Path;

static void saveAsTextFile(String destination, Object contents) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create(destination), conf);
    FSDataOutputStream out = fs.create(new Path(destination));
    out.writeChars(contents.toString());
    out.close();
}

destination
  /full/path/to/file.txt
  local/path/file.txt
  s3://bucket/file.txt
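A minimal usage sketch of the helper above, assuming it is defined in the same class as the driver's main method; the class name, the collected data, and the HDFS URI are placeholders.

import java.util.Arrays;
import java.util.List;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class SaveCollectedResult {
    public static void main(String[] args) throws Exception {
        JavaSparkContext sc = new JavaSparkContext();
        JavaRDD<Integer> rdd = sc.parallelize(Arrays.asList(1, 2, 3));
        // collect() brings the data back to the driver as a java.util.List ...
        List<Integer> onMaster = rdd.collect();
        // ... which the helper can then write to HDFS, a local path, or S3.
        saveAsTextFile("hdfs://namenode:8020/user/demo/out.txt", onMaster);
    }
}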

Actions

JavaSparkContext sc = new JavaSparkContext();
JavaRDD<Integer> rdd = sc.parallelize(Arrays.asList(3, 1, 2, 3));

rdd.collect()             [3, 1, 2, 3]
rdd.reduce((a, b) -> a + b)   9
rdd.first()               3
rdd.top(2)                [3, 3]
rdd.take(2)               [3, 1]
rdd.takeOrdered(2)        [1, 2]
rdd.count()               4
rdd.countByValue()        {3=2, 1=1, 2=1}

Actions on DoubleRDD

JavaSparkContext sc = new JavaSparkContext();
JavaDoubleRDD rdd = sc.parallelizeDoubles(Arrays.asList(5.0, 1.0, 2.0, 3.0, 4.0, 5.0));

rdd.mean()
rdd.min()
rdd.sum()
rdd.variance()
rdd.stdev()
rdd.stats()
rdd.histogram(bucketCount)

JavaPairRDD

RDD of key-value pairs

JavaSparkContext sc = new JavaSparkContext();
JavaRDD<Integer> a = sc.parallelize(Arrays.asList(1, 3, 3));
JavaRDD<Integer> b = sc.parallelize(Arrays.asList(2, 4, 6));
JavaPairRDD<Integer, Integer> rdd = a.zip(b);
rdd.collect();   // [(1,2), (3,4), (3,6)]

Transformations on Pair RDD

rdd.collect();   // [(1,2), (3,4), (3,6)]

rdd.reduceByKey((x, y) -> x + y)   [(1, 2), (3, 10)]
rdd.groupByKey()                   [(1, [2]), (3, [4, 6])]
rdd.mapValues(x -> x + 1)          [(1, 3), (3, 5), (3, 7)]
rdd.keys()                         [1, 3, 3]
rdd.values()                       [2, 4, 6]
rdd.sortByKey()                    [(1, 2), (3, 4), (3, 6)]

Transformations on Pair RDD

rdd     [(1,2), (3,4), (3,6)]
other   [(3,9)]

rdd.subtractByKey(other)   [(1, 2)]
rdd.join(other)            [(3, (4, 9)), (3, (6, 9))]
rdd.cogroup(other)         [(1, ([2], [])), (3, ([4, 6], [9]))]

Shared Variables

Broadcast Variables
  Read-only variable on each machine
  Sent from master
  Broadcast<int[]> broadcastVar = sc.broadcast(new int[] {1, 2, 3});

Accumulators
  Write-only on each machine - using add()
  Master can read
  LongAccumulator accum = jsc.sc().longAccumulator();
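A minimal sketch showing both kinds of shared variable together; the class name, data values, and variable names are illustrative only.

import java.util.Arrays;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.broadcast.Broadcast;
import org.apache.spark.util.LongAccumulator;

public class SharedVariablesDemo {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext();

        // Broadcast: read-only lookup table shipped once to each executor
        Broadcast<int[]> lookup = sc.broadcast(new int[] {1, 2, 3});

        // Accumulator: workers only call add(); the driver reads the total
        LongAccumulator sum = sc.sc().longAccumulator();

        JavaRDD<Integer> indexes = sc.parallelize(Arrays.asList(0, 1, 2));
        indexes.foreach(i -> sum.add(lookup.value()[i]));

        System.out.println(sum.value());   // 6, read back on the driver
    }
}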

Some Debugging Help

40  JavaRDD<Integer> rdd = sc.parallelize(Arrays.asList(1, 2, 3, 3));
41  JavaRDD<Integer> result = rdd.map(x -> x + 2);
42  String test = result.toDebugString();

test:
(2) MapPartitionsRDD[1] at map at JavaWordCount.java:41 []
    ParallelCollectionRDD[0] at parallelize at JavaWordCount.java:40 []

Spark vs Hadoop MapReduce

Hadoop MapReduce
  Implements two common functional programming functions in an OO manner

Spark
  Uses a set of common functional programming functions in a functional way

Spark SQL

Spark SQL

Structured data processing
Structure allows extra optimizations via Spark SQL's optimized Catalyst engine

Structured data
  SQL
  Datasets & DataFrames

Data Sources
  Hive (distributed data warehouse)
  Parquet (compressed columnar data)
  SQL databases
  JSON, CSV, text files
  RDDs


Datasets & DataFrames

Dataset
  Distributed collection of structured data
  Like an RDD, but uses the Spark SQL optimized execution engine
  Scala, Java, and other JVM languages only (no Python or R)

DataFrame
  A Dataset organized into named columns
  Like a SQL table or a dataframe in Julia
  Dataset<Row> in Java

Creating DataFrames

SparkSession spark = SparkSession.builder().appName("JavaWordCount").getOrCreate();
Dataset<Row> df = spark.read().json("people.json");
df.show();

+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+

people.json
{"name":"Michael"}
{"name":"Andy", "age":30}
{"name":"Justin", "age":19}

Reading CSV

SparkSession spark = SparkSession.builder().appName("JavaWordCount").getOrCreate();
DataFrameReader reader = spark.read();
reader.option("header", true);
reader.option("inferSchema", true);
Dataset<Row> df = reader.csv(args[0]);
df.show();
df.printSchema();

people.csv
name,age
Andy,30
Justin,19
Michael,

+-------+----+
|   name| age|
+-------+----+
|   Andy|  30|
| Justin|  19|
|Michael|null|
+-------+----+

root
 |-- name: string (nullable = true)
 |-- age: integer (nullable = true)

Types We Can Read

csv
jdbc
json
orc
parquet
text

Some CSV Options

encoding
sep (separator)
header
inferSchema
ignoreLeadingWhiteSpace
nullValue
dateFormat
timestampFormat
mode
  PERMISSIVE - keeps corrupt records, placing the malformed text in a special column
  DROPMALFORMED - ignores whole corrupt records
  FAILFAST - throws an exception on a corrupt record
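A minimal sketch chaining several of the options above, assuming the SparkSession spark from the earlier slide; the separator, null marker, and file path are placeholders.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

Dataset<Row> df = spark.read()
        .option("header", true)          // first line holds column names
        .option("inferSchema", true)     // guess column types from the data
        .option("sep", ";")              // non-default separator
        .option("nullValue", "NA")       // treat "NA" as null
        .option("mode", "DROPMALFORMED") // silently drop corrupt records
        .csv("people.csv");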

Dataset Functions Common with RDDs

collect
count
distinct
filter
flatMap
foreach
groupBy
map
reduce

We Can Work with Columns

import static org.apache.spark.sql.functions.col;

Dataset<Row> df = reader.csv("people.csv");
df.select("name").show();
df.select(col("name")).show();

people.csv
name,age
Andy,30
Justin,19
Michael,

+-------+
|   name|
+-------+
|   Andy|
| Justin|
|Michael|
+-------+

// Select everybody, but increment the age by 1
df.select(col("name"), col("age").plus(1)).show();

+-------+---------+
|   name|(age + 1)|
+-------+---------+
|   Andy|       31|
| Justin|       20|
|Michael|     null|
+-------+---------+

// Select people older than 21
df.filter(col("age").gt(21)).show();

+----+---+
|name|age|
+----+---+
|Andy| 30|
+----+---+

// Count people by age
df.groupBy("age").count().show();

+----+-----+
| age|count|
+----+-----+
|null|    1|
|  19|    1|
|  30|    1|
+----+-----+

Using SQL

df.createOrReplaceTempView("people");
Dataset<Row> sqlDF = spark.sql("SELECT * FROM people");
sqlDF.show();

+-------+----+
|   name| age|
+-------+----+
|   Andy|  30|
| Justin|  19|
|Michael|null|
+-------+----+

Saving DataFrames

df.write().format("json").save("people2.json");

{"name":"Andy","age":30}
{"name":"Justin","age":19}
{"name":"Michael"}

Formats
json, parquet, jdbc, orc, libsvm, csv, text

DataFrames & Objects

Can define encoders/decoders so we can have DataFrames/Datasets of our own objects
  Efficient transmission over the network
  Efficient in-memory allocation
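A minimal sketch of a typed Dataset using a bean encoder; the Person bean is hypothetical, and spark and df are assumed to come from the earlier slides.

import java.io.Serializable;
import java.util.Arrays;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoder;
import org.apache.spark.sql.Encoders;

// Hypothetical JavaBean; Spark derives the schema from its getters/setters
public static class Person implements Serializable {
    private String name;
    private Integer age;
    public String getName() { return name; }
    public void setName(String name) { this.name = name; }
    public Integer getAge() { return age; }
    public void setAge(Integer age) { this.age = age; }
}

Encoder<Person> personEncoder = Encoders.bean(Person.class);

// Build a typed Dataset directly from objects on the driver
Dataset<Person> people = spark.createDataset(Arrays.asList(new Person()), personEncoder);

// Or view an existing DataFrame as a Dataset of objects
Dataset<Person> typed = df.as(personEncoder);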

DataFrames & RDDs

Can convert RDDs to DataFrames/Datasets, but a DataFrame needs a schema

Converting a DataFrame to an RDD
df.toJavaRDD()
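A minimal conversion sketch, assuming the JavaSparkContext sc from the RDD slides, the SparkSession spark, and the hypothetical Person bean from the previous sketch (its class supplies the schema).

import java.util.Arrays;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

JavaRDD<Person> peopleRdd = sc.parallelize(Arrays.asList(new Person()));

// RDD -> DataFrame: the bean class provides the schema
Dataset<Row> df = spark.createDataFrame(peopleRdd, Person.class);

// DataFrame -> RDD of Rows
JavaRDD<Row> rows = df.toJavaRDD();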

More Statistics

DataFrameStatFunctions stat = df.stat();

approxQuantile
bloomFilter
corr (Pearson correlation coefficient)
cov (covariance)
freqItems - find frequent items in column(s)
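A short sketch of a few of these calls; the column names are placeholders for two numeric columns in df.

import org.apache.spark.sql.DataFrameStatFunctions;

DataFrameStatFunctions stat = df.stat();
double pearson = stat.corr("age", "salary");      // Pearson correlation
double covariance = stat.cov("age", "salary");    // covariance
double[] quartiles = stat.approxQuantile("age", new double[] {0.25, 0.5, 0.75}, 0.01);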

Spark vs Hadoop MapReduce

Hadoop MapReduce
  Implements two common functional programming functions in an OO manner

Spark
  Uses a set of common functional programming functions in a functional way
  Reads more data formats
  SQL
  More queries
  Interacts with SQL databases directly
  10 to 100 times faster
  But there is more - machine learning!

MLlib

Linear algebra
Statistics
  Correlations, hypothesis testing, sampling, density estimation
Classification & regression
  Linear, logistic, isotonic, & ridge regression
  Decision trees, naive Bayes
Collaborative filtering (recommender systems)
Clustering
Dimensionality reduction
Feature extraction
Frequent pattern mining
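A minimal clustering sketch using the RDD-based MLlib API, assuming a JavaSparkContext sc as in the earlier RDD slides; the 2-D points, number of clusters, and iteration count are made up for illustration.

import java.util.Arrays;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.mllib.clustering.KMeans;
import org.apache.spark.mllib.clustering.KMeansModel;
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.Vectors;

JavaRDD<Vector> points = sc.parallelize(Arrays.asList(
        Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.1),
        Vectors.dense(9.0, 9.0), Vectors.dense(9.1, 9.1)));

// k = 2 clusters, at most 20 iterations
KMeansModel model = KMeans.train(points.rdd(), 2, 20);
Vector[] centers = model.clusterCenters();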