PySpark standalone code

COSC 6339 Big Data Analytics
Introduction to Spark (II)
Edgar Gabriel
Spring 2017

PySpark standalone code

from pyspark import SparkConf, SparkContext
from operator import add

conf = SparkConf()
conf.setAppName("Wordcount")
conf.set("spark.executor.memory", "2g")
sc = SparkContext(conf=conf)

text = sc.textFile("/gabriel/simple-input.txt")
words = text.flatMap(lambda line: line.split())
wcounts = words.map(lambda w: (w, 1))
counts = wcounts.reduceByKey(add, numPartitions=1)
counts.saveAsTextFile("/gabriel/wordcount")
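A quick way to sanity-check the job before writing anything to HDFS is to pull a few results back to the driver from the interactive pyspark shell. This is only an illustrative sketch; the take() and count() calls below are not part of the original listing:

>>> counts.take(3)    # a few (word, count) pairs brought back to the driver
>>> counts.count()    # number of distinct words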

Submitting Spark jobs

For small test cases:
  spark-submit wordcount_pyspark2.py /gabriel/simple-input.txt /gabriel/output
The job will run locally on the front-end node!

For anything non-trivial in size, submit the Spark job through the YARN resource manager; this will use the cluster:
  spark-submit --master yarn wordcount_pyspark2.py /gabriel/simple-input.txt /gabriel/output

Other important options:
  --num-executors NUM    Number of executors to launch
  --executor-cores NUM   Number of cores per executor
  --executor-memory MEM  Memory per executor
  --py-files             Add .py, .zip or .egg files to be distributed with your application
(A full submission command combining these options is sketched after the K-means example below.)

K-means example

from __future__ import print_function
import sys
import numpy as np
from pyspark import SparkConf, SparkContext

def parseVector(line):
    return np.array([float(x) for x in line.split(' ')])

def closestPoint(p, centers):
    bestIndex = 0
    closest = float("+inf")
    for i in range(len(centers)):
        tempDist = np.sum((p - centers[i]) ** 2)
        if tempDist < closest:
            closest = tempDist
            bestIndex = i
    return bestIndex

K-means example (continued)

if __name__ == "__main__":
    conf = SparkConf()
    conf.setAppName("kmeans_pyspark")
    sc = SparkContext(conf=conf)

    text = sc.textFile(sys.argv[1])
    data = text.map(parseVector)
    K = 2
    convergeDist = 0.1
    kPoints = data.takeSample(False, K, 1)
    tempDist = 1.0

    while tempDist > convergeDist:
        closest = data.map(
            lambda p: (closestPoint(p, kPoints), (p, 1)))
        pointStats = closest.reduceByKey(
            lambda p1, p2: (p1[0] + p2[0], p1[1] + p2[1]))
        newPoints = pointStats.map(
            lambda st: (st[0], st[1][0] / st[1][1])).collect()
        tempDist = sum(np.sum((kPoints[iK] - p) ** 2) for (iK, p) in newPoints)
        for (iK, p) in newPoints:
            kPoints[iK] = p

    print("Final centers: " + str(kPoints))
    sc.stop()
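To run the complete script on the cluster, it can be submitted through YARN with the resource options listed above. This is a minimal sketch: the file name kmeans_pyspark.py and the resource numbers are placeholders for illustration, not values from the original slides:

spark-submit --master yarn \
    --num-executors 4 \
    --executor-cores 2 \
    --executor-memory 2g \
    kmeans_pyspark.py /gabriel/datapoints.txt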

>>> text = sc.textFile("/gabriel/datapoints.txt")
>>> text.collect()
[u'1 1', u'2 2', u'3 3', u'4 4', u'3 4', u'4 3', u'1 2']
>>> data = text.map(parseVector)
>>> data.collect()
[array([ 1., 1.]), array([ 2., 2.]), array([ 3., 3.]), array([ 4., 4.]), array([ 3., 4.]), array([ 4., 3.]), array([ 1., 2.])]
>>> kPoints = data.takeSample(False, 2, 1)
>>> kPoints
[array([ 3., 3.]), array([ 1., 2.])]
>>> closest = data.map(
...     lambda p: (closestPoint(p, kPoints), (p, 1)))
>>> closest.collect()
[(1, (array([ 1., 1.]), 1)), (1, (array([ 2., 2.]), 1)), (0, (array([ 3., 3.]), 1)), (0, (array([ 4., 4.]), 1)), (0, (array([ 3., 4.]), 1)), (0, (array([ 4., 3.]), 1)), (1, (array([ 1., 2.]), 1))]
>>> pointStats = closest.reduceByKey(
...     lambda p1_c1, p2_c2: (p1_c1[0] + p2_c2[0], p1_c1[1] + p2_c2[1]))
>>> pointStats.collect()
[(0, (array([ 14., 14.]), 4)), (1, (array([ 4., 5.]), 3))]
>>> newPoints = pointStats.map(
...     lambda st: (st[0], st[1][0] / st[1][1])).collect()
>>> newPoints
[(0, array([ 3.5, 3.5])), (1, array([ 1.33333333, 1.66666667]))]
>>> tempDist = sum(np.sum((kPoints[iK] - p) ** 2) for (iK, p) in newPoints)
>>> tempDist
0.7222222222222221
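The while loop in the script would now replace the old centers with the new means and re-test the convergence criterion. A short sketch continuing the session above; note that for this data set the cluster assignments do not change on the next pass, so the recomputed tempDist becomes 0 and the loop stops (0 < convergeDist = 0.1):

>>> for (iK, p) in newPoints:
...     kPoints[iK] = p
>>> kPoints
[array([ 3.5, 3.5]), array([ 1.33333333, 1.66666667])]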

Spark software: MLlib

MLlib is Spark's machine learning (ML) library. It provides support for:
  Basic statistics
  Classification and regression
  Clustering
  Feature extraction
  Frequent pattern mining
  Optimization

Two sets of APIs are available:
  RDD based:              import pyspark.mllib
  DataFrames based (new): import pyspark.ml
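As a small illustration of the RDD-based API, the basic-statistics module can summarize a data set column by column. This is only a sketch, assuming the data RDD of NumPy arrays built in the earlier K-means example:

>>> from pyspark.mllib.stat import Statistics
>>> summary = Statistics.colStats(data)   # column-wise summary statistics
>>> summary.mean()                        # per-column means
>>> summary.variance()                    # per-column variances
>>> summary.count()                       # number of rows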

import sys
import numpy as np
from pyspark import SparkContext
from pyspark.mllib.clustering import KMeans

def parseVector(line):
    return np.array([float(x) for x in line.split(' ')])

if __name__ == "__main__":
    sc = SparkContext(appName="KMeans")
    lines = sc.textFile(sys.argv[1])
    data = lines.map(parseVector)
    model = KMeans.train(data, 2, maxIterations=10)
    sc.stop()

What is a Model?

A model is a complex pipeline of components:
  Data sources
  Joins
  Featurization
  Logic
  Algorithm(s)
  Transformers
  Estimators
  Tuning Parameters
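Transformers and Estimators are the building blocks of the DataFrame-based pyspark.ml API, where they can be chained into a Pipeline. The following is only a sketch of that idea, not code from the original slides; the input DataFrame df and its raw columns x and y are assumed for illustration:

from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

# Transformer: assembles the raw columns into a single feature vector column
assembler = VectorAssembler(inputCols=["x", "y"], outputCol="features")
# Estimator: k-means consumes the "features" column produced by the assembler
kmeans = KMeans(k=2, seed=1)

pipeline = Pipeline(stages=[assembler, kmeans])
model = pipeline.fit(df)          # fits all stages and returns a PipelineModel
clustered = model.transform(df)   # adds a "prediction" column with cluster ids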

MLlib k-means clustering model

Parameters:
  rdd                 Training points as an RDD of Vector
  k                   Number of clusters to create
  maxIterations       Maximum number of iterations allowed. (default: 100)
  initializationMode  The initialization algorithm. This can be either "random" or "k-means||". (default: "k-means||")
  seed                Random seed value for cluster initialization. Set as None to generate a seed based on system time. (default: None)
  epsilon             Distance threshold within which a center will be considered to have converged. If all centers move less than this Euclidean distance, iterations are stopped. (default: 1e-4)
  initialModel        Initial cluster centers can be provided as a KMeansModel object rather than using the random or k-means|| initializationMode. (default: None)

>>> text = sc.textFile("/gabriel/datapoints.txt")
>>> text.collect()
[u'1 1', u'2 2', u'3 3', u'4 4', u'3 4', u'4 3', u'1 2']
>>> data = text.map(parseVector)
>>> data.collect()
[array([ 1., 1.]), array([ 2., 2.]), array([ 3., 3.]), array([ 4., 4.]), array([ 3., 4.]), array([ 4., 3.]), array([ 1., 2.])]
>>> model = KMeans.train(data, 2, maxIterations=2)
>>> model.clusterCenters
[array([ 3.5, 3.5]), array([ 1.33333333, 1.66666667])]
>>> model.predict([0, 4])
1
>>> model.save(sc, "/gabriel/clustermodel")
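A saved model can later be loaded back into another SparkContext and reused; a minimal sketch, assuming the path used in the save() call above (outputs omitted):

>>> from pyspark.mllib.clustering import KMeansModel
>>> sameModel = KMeansModel.load(sc, "/gabriel/clustermodel")
>>> sameModel.clusterCenters        # same centers as the saved model
>>> sameModel.predict([0, 4])       # cluster index for a new point
>>> sameModel.computeCost(data)     # sum of squared distances to the nearest center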

whale:~> hdfs dfs -ls /gabriel/clustermodel/
/gabriel/clustermodel/data
/gabriel/clustermodel/metadata

whale:~> hdfs dfs -ls /gabriel/clustermodel/data/
_SUCCESS
part-r-00000-42276161-3641-bab1-bdd07c7325a3.snappy.parquet
part-r-00001-42276161-3641-bab1-bdd07c7325a3.snappy.parquet
part-r-00002-42276161-3641-bab1-bdd07c7325a3.snappy.parquet
part-r-00003-42276161-3641-bab1-bdd07c7325a3.snappy.parquet

whale:~> hdfs dfs -ls /gabriel/clustermodel/metadata/
_SUCCESS
part-00000

whale:~> hdfs dfs -cat /gabriel/clustermodel/metadata/part-00000
{"class":"org.apache.spark.mllib.clustering.KMeansModel","version":"1.0","k":2}

Parquet is a columnar format that is supported by many data processing systems. Spark provides support for both reading and writing Parquet files and automatically preserves the schema of the original data.

Snappy is a compression/decompression library developed by Google. It is very fast and achieves reasonable compression: compared to zlib, Snappy is an order of magnitude faster for most inputs, but the resulting compressed files are anywhere from 20% to 100% bigger.
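Because the model data is stored as ordinary Parquet files, it can also be inspected directly with Spark's generic reader; a minimal sketch, assuming a Spark 2.x session named spark (use sqlContext.read.parquet on older versions):

>>> df = spark.read.parquet("/gabriel/clustermodel/data")
>>> df.printSchema()   # the schema of the saved cluster centers is preserved
>>> df.show()          # one row per cluster center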

DataFrames

  Distributed collection of rows under named columns
  Conceptually similar to a table in a relational database
  Can be constructed from a wide array of sources, such as: structured data files, Hive tables, external databases, existing RDDs
  The DataFrame API is available in Scala, Java, Python, and R

Common characteristics between RDDs and DataFrames:
  Distributed
  Immutable
  Lazy evaluation

>>> rdd = sc.parallelize([(0,1), (0,1), (0,2), (1,2), (1,10)])
>>> df = rdd.toDF(['id', 'score'])
>>> df.show()
+---+-----+
| id|score|
+---+-----+
|  0|    1|
|  0|    1|
|  0|    2|
|  1|    2|
|  1|   10|
+---+-----+

>>> df.printSchema()
root
 |-- id: long (nullable = true)
 |-- score: long (nullable = true)
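Since DataFrames carry named columns, relational-style operations can be expressed directly on them, either through the DataFrame API or through SQL. A small sketch continuing the df from above; the queries are added for illustration and assume a Spark 2.x shell where spark is the SparkSession:

>>> df.filter(df.score > 1).show()          # rows with score greater than 1
>>> df.groupBy("id").avg("score").show()    # average score per id
>>> df.createOrReplaceTempView("scores")    # Spark 2.x (registerTempTable on 1.x)
>>> spark.sql("SELECT id, COUNT(*) AS n FROM scores GROUP BY id").show()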

K-means DataFrames example

from pyspark.ml.clustering import KMeans
from pyspark.sql import SparkSession

if __name__ == "__main__":
    spark = SparkSession\
        .builder\
        .appName("KMeansExample")\
        .getOrCreate()

    dataset = \
        spark.read.format("libsvm").load("/gabriel/datapoints.txt")

    kmeans = KMeans().setK(2).setSeed(1)
    model = kmeans.fit(dataset)
    centers = model.clusterCenters()

    spark.stop()

More information

Project webpage: http://spark.apache.org/
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/