spark-testing-java Documentation, Release latest


Nov 20, 2018


Contents

1 Input data preparation
2 Java
   2.1 Context creation
   2.2 Data preparation
   2.3 Comparison
3 Scala
   3.1 Context creation
   3.2 Data preparation
   3.3 Comparison


Apache Spark has become widely used, Spark code has grown more complex, and integration tests have become important for checking code quality. Below are integration testing approaches with code samples. Two languages are covered, Java and Scala, in separate sections.

Testing steps

1. Resource allocation: a SparkContext/SparkSession is created for the test. It can be created manually, or an existing framework can be used;
2. Data preparation: an RDD/DataFrame is created in code or read from disk;
3. Running the functionality. There are two functionality types: a) reading data from storage, for which files have to be prepared; b) transformations, for which data can be created in code;
4. Comparison of expected and actual results.

Examples

Apple weight and color manipulation is used as the running example. Each section contains links to test projects available for download. A minimal sketch of the four steps is shown below.
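The following sketch walks through the four steps in Java. It assumes the Apple bean and an ApplesRepository with a mass method as in the example project (the exact signatures there may differ), plus the usual org.apache.spark.sql imports and JUnit's assertEquals; treat it as illustrative rather than the project's exact code.

// 1. Resource allocation: a local session with two cores
SparkSession spark = SparkSession.builder()
        .master("local[2]").appName("mass-test").getOrCreate();

// 2. Data preparation: a DataFrame built in code
List<Apple> apples = Arrays.asList(new Apple("green", 70), new Apple("red", 110));
Dataset<Row> input = spark.createDataFrame(apples, Apple.class);

// 3. Run the functionality under test
long actual = new ApplesRepository().mass(input);

// 4. Compare expected and actual
assertEquals(180L, actual);

spark.stop();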


CHAPTER 1

Input data preparation

Several approaches can be used.

Dynamically created in code, from:

- an empty dataset with a predefined structure
- one primitive
- a list of primitives
- a list of tuples
- a list of data objects

Pre-saved on disk, for checking reading and for data with complex structures, in the formats:

- parquet
- csv

Pre-saved and dynamically modified on the fly, to improve visibility when testing complex structures. A sketch of the pre-saved approach is shown below.
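As an illustration of the pre-saved approach, the sketch below writes a small parquet fixture once and reads it back in a test. The path and the Apple bean are assumptions in the spirit of the example project, not its exact code; a SparkSession named spark is taken as given.

// One-off fixture generation (run from a helper, not per test):
List<Apple> apples = Arrays.asList(new Apple("green", 70), new Apple("red", 110));
spark.createDataFrame(apples, Apple.class)
        .write().mode("overwrite")
        .parquet("src/test/resources/apples.parquet");

// In the test itself, read the pre-saved file back:
Dataset<Row> input = spark.read().parquet("src/test/resources/apples.parquet");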


CHAPTER 2

Java

Project with code examples on GitLab: spark-testing-java

The functionality under test is located in the package repository

2.1 Context creation

Code examples in the package: context

Manual

The asterisk means all available processor cores will be used. For most cases two cores are enough, and local[2] is appropriate.

SparkConf conf = new SparkConf()
        .setMaster("local[*]")
        .setAppName("test")
        .set("spark.ui.enabled", "false");
JavaSparkContext jsc = new JavaSparkContext(conf);
// close after usage
jsc.stop();

Code example: JavaSparkContextCreationTest.java

Framework: spark-testing-base

The spark-testing-base framework is hosted on GitHub. Drawback: the current version provides SparkContext and JavaSparkContext only, so some additional code is required to get a SparkSession; one possible approach is sketched below.
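One possible way to obtain a SparkSession on top of the context that spark-testing-base already started is the following sketch. It is an assumption about the wiring, not code from the framework; it relies on getOrCreate() reusing the active SparkContext rather than starting a second one.

// getOrCreate() picks up the SparkContext already created by
// SharedJavaSparkContext instead of allocating a new one.
SparkSession spark = SparkSession.builder().getOrCreate();
Dataset<Row> df = spark.range(3).toDF("id");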

Inclusion in the project with Maven

<dependency>
    <groupId>com.holdenkarau</groupId>
    <artifactId>spark-testing-base_2.11</artifactId>
    <version>2.3.1_0.10.0</version>
    <scope>test</scope>
</dependency>

Example of pom.xml

Enabling in test cases

Test classes have to extend SharedJavaSparkContext.

public class ApplesRepositoryTestingBaseIT extends SharedJavaSparkContext

SparkContext and JavaSparkContext

Can be obtained with the methods sc() and jsc():

Integer[] weights = {120, 150};
long expected = Arrays.stream(weights).reduce(0, (a, v) -> a + v);
Dataset<Row> df = new SQLContext(jsc())
        .createDataset(Arrays.asList(weights), Encoders.INT())
        .toDF("weight");
long actual = repository.mass(df);
assertEquals(expected, actual);

SQLContext creation

new SQLContext(jsc())

Code example: ApplesRepositoryTestingBaseIT.java

2.2 Data preparation

Code examples in the package: data_preparation

JavaRDD

Empty

jsc().emptyRDD();

From a primitive list

List<String> data = Lists.newArrayList("green", "red");
jsc().parallelize(data);

From an entity list

List<Apple> rows = Lists.newArrayList(new Apple("green", 70), new Apple("red", 110));
JavaRDD<Apple> result = jsc().parallelize(rows);

Example: RDDCreationTest.java

DataFrame (Dataset<Row> in Java)

Empty with a predefined structure

final Dataset<Row> actual = spark().emptyDataFrame().withColumn("color", lit("green"));
actual.printSchema();

Single primitive

Long value = 12L;
List<Row> rows = Collections.singletonList(RowFactory.create(value));
final Dataset<Row> actual = spark().createDataFrame(rows, Encoders.LONG().schema());

Primitive list

List<String> data = Lists.newArrayList("green", "red");
Dataset<Row> actual = spark().createDataset(data, Encoders.STRING()).toDF("color");

Row list with an assigned schema

List<Row> rows = Arrays.asList(RowFactory.create("green"), RowFactory.create("red"));
StructType schema = DataTypes.createStructType(
        new StructField[]{DataTypes.createStructField("color", DataTypes.StringType, false)});
final Dataset<Row> actual = spark().createDataFrame(rows, schema);

List of entities

List<Apple> rows = Lists.newArrayList(new Apple("green", 70), new Apple("red", 110));
final Dataset<Row> actual = spark().createDataFrame(rows, Apple.class);

Example: DataFrameCreationTest.java
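The Scala chapter below also shows a DataFrame with null values; a possible Java counterpart (my sketch, not from the example project) needs an explicit schema, since Spark cannot infer a column type from a null alone:

// The cast to Integer disambiguates the varargs call; the schema marks
// the column nullable so the null row is legal.
List<Row> rows = Collections.singletonList(RowFactory.create((Integer) null));
StructType schema = DataTypes.createStructType(new StructField[]{
        DataTypes.createStructField("weight", DataTypes.IntegerType, true)});
final Dataset<Row> actual = spark().createDataFrame(rows, schema);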

2.2.3 Dataset

From a list of entities

List<Apple> rows = Lists.newArrayList(new Apple("green", 70), new Apple("red", 110));
final Dataset<Apple> actual = spark().createDataset(rows, Encoders.bean(Apple.class));

Example: DataSetCreationTest.java

Dynamically modified on the fly

If the input data structure is complex, and only several fields take part in the transformation, pre-saved data can be read and the relevant fields modified on the fly:

@Test
public void testMass_whenWeightSpecified_thenWeightsSum() {
    int expected = 85;
    Dataset<Row> input = getBulkData().withColumn("weight", lit(85));
    assertEquals(expected, repository.mass(input));
}

@Test(expected = Exception.class)
public void testMass_whenWeightIsNull_thenNPE() {
    Dataset<Row> input = getBulkData().withColumn("weight", lit(null).cast(DataTypes.IntegerType));
    repository.mass(input);
}

private Dataset<Row> getBulkData() {
    return spark().read().option("header", "true").csv(getTestDataFolder());
}

Example: DynamicallyModifiedOnFlyTest.java

2.3 Comparison

Check the DataFrame structure

List<Apple> data = Lists.newArrayList(new Apple("Green", 85));
Dataset<Row> df = spark().createDataFrame(data, Apple.class);
assertEquals(Encoders.bean(Apple.class).schema(), df.schema());

Get one value

Apple apple = new Apple("Green", 85);
List<Apple> data = Lists.newArrayList(apple);
Dataset<Row> df = spark().createDataFrame(data, Apple.class);
Integer actual = df.first().getAs("weight");
assertEquals(apple.getWeight(), actual);

2.3.3 Primitives list - with as

List<String> data = Lists.newArrayList("green", "red");
Dataset<Row> df = spark().createDataset(data, Encoders.STRING()).toDF("color");
List<String> actual = df.select("color").as(Encoders.STRING()).collectAsList();

Compare DataFrames

List<Apple> data = Lists.newArrayList(new Apple("Green", 85));
Dataset<Row> expected = spark().createDataFrame(data, Apple.class);
Dataset<Row> actual = spark().createDataFrame(data, Apple.class);
assertEquals(0, expected.except(actual).count());

Example: ComparisonTest.java
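Note that except() is one-directional: the assertion above would not catch extra rows that appear only in actual. A hedged variant (my addition, suitable for small test data) checks both directions:

// Rows present in expected but missing from actual...
assertEquals(0, expected.except(actual).count());
// ...and rows present only in actual.
assertEquals(0, actual.except(expected).count());

Both checks are still set-based, so differences in duplicate row counts remain invisible to them.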


CHAPTER 3

Scala

ScalaTest FunSuite is used as the testing framework. Project with code examples on GitLab: spark-testing-scala

The functionality under test is located in the package repository

3.1 Context creation

The test library shipped with the Spark distribution is the best choice as a base for integration testing; it is referred to here as spark-test-jar.

Code examples in the package: context

Manual

Two cores (local[2]) are used in this example.

val spark = SparkSession.builder
  .appName(getClass().getSimpleName)
  .master("local[2]")
  .getOrCreate()
// use it
spark.close()

Code example: ManualSpec.scala
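For a manually managed session, a common wiring (a sketch assuming ScalaTest 3.0-era imports, not code from ManualSpec.scala) creates the session once per suite and stops it in afterAll:

import org.apache.spark.sql.SparkSession
import org.scalatest.{BeforeAndAfterAll, FunSuite}

class ManualSessionSpec extends FunSuite with BeforeAndAfterAll {
  @transient private var spark: SparkSession = _

  override def beforeAll(): Unit = {
    // One session for the whole suite keeps test startup cost down.
    spark = SparkSession.builder
      .appName(getClass.getSimpleName)
      .master("local[2]")
      .getOrCreate()
  }

  override def afterAll(): Unit = {
    // Release the context even if a test failed.
    if (spark != null) spark.stop()
  }

  test("session is usable") {
    assert(spark.range(3).count() === 3)
  }
}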

3.1.2 Framework: spark-test-jar

Inclusion in the project with Maven

<dependency>
    <groupId>${spark.groupId}</groupId>
    <artifactId>spark-sql_${scala.suffix}</artifactId>
    <version>${spark.version}</version>
    <scope>provided</scope>
    <type>test-jar</type>
</dependency>
<dependency>
    <groupId>${spark.groupId}</groupId>
    <artifactId>spark-core_${scala.suffix}</artifactId>
    <version>${spark.version}</version>
    <scope>provided</scope>
    <type>test-jar</type>
</dependency>
<dependency>
    <groupId>${spark.groupId}</groupId>
    <artifactId>spark-catalyst_${scala.suffix}</artifactId>
    <version>${spark.version}</version>
    <scope>provided</scope>
    <type>test-jar</type>
</dependency>

Example of pom.xml

Enabling in test cases

QueryTest and SharedSQLContext have to be extended to enable the resources.

class SparkTestJarSpec extends QueryTest with SharedSQLContext with Matchers

SparkSession and SparkContext

Can be obtained with the methods spark and sparkContext:

val weights = List(120, 150)
val expected = weights.reduce((a, v) => a + v)
val df = spark.createDataset(weights).toDF("weight")

Code example: SparkTestJarSpec.scala

3.2 Data preparation

Code examples in the package: data_preparation

3.2.1 RDD

Empty RDD with a predefined schema

val rdd = sparkContext.emptyRDD[Apple]

List of primitives

val data = List(1, 2, 3)
val rdd = sparkContext.parallelize(data)

List of entities

val expected = Apple("Green", 120)
val rdd = sparkContext.parallelize(List(expected))

Example: RDDCreationSpec.scala

DataFrame

Empty with a predefined structure

val df = Seq.empty[(String, Int)].toDF("color", "weight")

Empty with a case class structure

val df = Seq.empty[Apple].toDF()

Empty with a struct field

val df = spark.emptyDataFrame.withColumn("apple",
  struct(lit("green") as "color", lit(110) as "weight"))

From a primitive list

val df = List(1, 2, 3).toDF("values")

From tuples (often used)

val df = List(("green", 70), ("red", 110)).toDF("color", "weight")
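When a tuple column needs to be nullable, Option values are a convenient variant (my sketch, assuming the usual spark.implicits._ import behind the examples in this chapter): Some becomes a value and None becomes null.

// "weight" is inferred as a nullable integer column; the None row ends up as null.
val df = List(("green", Some(70)), ("rotten", None)).toDF("color", "weight")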

Array field

val df = List(
  Array("red", "green", "yellow"),
  Array("green", "yellow")
).toDF()

With null values

List(null.asInstanceOf[Integer]).toDF("color")

Example: DataFrameCreationSpec.scala

Dataset

Empty

val df = Seq.empty[Apple].toDS()

From a list

List(Apple("green", 70), Apple("red", 110)).toDS()

From a DataFrame

List(("green", 70)).toDF("color", "weight").as[Apple]

Example: DataSetCreationSpec.scala

Dynamically modified on the fly

If the input data structure is complex, and only several fields take part in the transformation, pre-saved data can be read and the relevant fields modified on the fly.

test("mass When weight specified Then weights sum") {
  val expected = 85
  val input = getBulkData.withColumn("weight", lit(85))
  repository.mass(input) shouldBe expected
}

test("mass When weight is null Then NPE") {
  val input = getBulkData.withColumn("weight", lit(null).cast(DataTypes.IntegerType))
  intercept[RuntimeException] {
    repository.mass(input)
  }
}

private def getBulkData = spark.read.option("header", "true").csv(getTestDataFolder)

Example: DynamicallyModifiedOnFlySpec.scala

3.3 Comparison

One value - with first

val apple = Apple("Green", 85)
val df = List(apple).toDF()
val actual: Int = df.first.getAs("weight")
actual shouldEqual apple.weight

Primitives list - with as

val df = List("Green", "Red").toDF("color")
val actual = df.select("color").as(Encoders.STRING).collect()

DataFrames - with checkAnswer

val apples = List(Apple("Green", 85))
val expected = apples.toDF()
val actual = apples.toDF()
checkAnswer(expected, actual)

Example: ComparisonSpec.scala
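checkAnswer compares the collected results ignoring row order. Where the spark-test-jar helpers are not on the classpath, a hedged alternative for small test data (my sketch, assuming the Matchers trait is mixed in as above) is to collect both sides and use ScalaTest matchers:

val expectedApples = List(Apple("Green", 85), Apple("Red", 110))
val actualDf = expectedApples.toDF()
// Order-insensitive comparison of the collected rows.
actualDf.as[Apple].collect() should contain theSameElementsAs expectedApples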
