COSC 6339 Big Data Analytics
Introduction to Spark (II)
Edgar Gabriel
Spring 2017

PySpark standalone code

from pyspark import SparkConf, SparkContext
from operator import add

conf = SparkConf()
conf.setAppName("Wordcount")
conf.set("spark.executor.memory", "2g")
sc = SparkContext(conf=conf)

text = sc.textFile("/gabriel/simple-input.txt")
words = text.flatMap(lambda line: line.split())
wcounts = words.map(lambda w: (w, 1))
counts = wcounts.reduceByKey(add, numPartitions=1)
counts.saveAsTextFile("/gabriel/wordcount")
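To see what each transformation produces, the same pipeline can be run step by step in the interactive pyspark shell. The session below is only a sketch: it assumes a hypothetical two-line /gabriel/simple-input.txt, and the printed values are what that input would yield, not captured output.

>>> text = sc.textFile("/gabriel/simple-input.txt")
>>> text.collect()                          # assumed two-line input file
[u'hello world', u'hello spark']
>>> words = text.flatMap(lambda line: line.split())
>>> words.collect()                         # one element per word
[u'hello', u'world', u'hello', u'spark']
>>> wcounts = words.map(lambda w: (w, 1))
>>> wcounts.collect()                       # (word, 1) pairs
[(u'hello', 1), (u'world', 1), (u'hello', 1), (u'spark', 1)]
>>> counts = wcounts.reduceByKey(add, numPartitions=1)
>>> counts.collect()                        # counts summed per word
[(u'hello', 2), (u'world', 1), (u'spark', 1)]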
Submitting spark jobs

For small test cases:
    spark-submit wordcount_pyspark2.py /gabriel/simple-input.txt /gabriel/output
The job will run locally on the front-end node!

For anything non-trivial in size, submit the spark job through the YARN resource manager, which will use the cluster:
    spark-submit --master yarn wordcount_pyspark2.py /gabriel/simple-input.txt /gabriel/output

Other important options (an example command combining them is shown after the k-means program below):
    --num-executors NUM     Number of executors to launch
    --executor-cores NUM    Number of cores per executor
    --executor-memory MEM   Memory per executor
    --py-files              Add .py, .zip or .egg files to be distributed with your application

K-means example

from __future__ import print_function
import sys
import numpy as np
from pyspark import SparkConf, SparkContext

def parsevector(line):
    return np.array([float(x) for x in line.split(' ')])

def closestpoint(p, centers):
    bestindex = 0
    closest = float("+inf")
    for i in range(len(centers)):
        tempdist = np.sum((p - centers[i]) ** 2)
        if tempdist < closest:
            closest = tempdist
            bestindex = i
    return bestindex
K-means example (cont.)

if __name__ == "__main__":
    conf = SparkConf()
    conf.setAppName("kmeans_pyspark")
    sc = SparkContext(conf=conf)

    text = sc.textFile(sys.argv[1])
    data = text.map(parsevector)
    K = 2
    convergedist = 0.1
    kpoints = data.takeSample(False, K, 1)
    tempdist = 1.0

    while tempdist > convergedist:
        closest = data.map(
            lambda p: (closestpoint(p, kpoints), (p, 1)))
        pointstats = closest.reduceByKey(
            lambda p1, p2: (p1[0] + p2[0], p1[1] + p2[1]))
        newpoints = pointstats.map(
            lambda st: (st[0], st[1][0] / st[1][1])).collect()
        tempdist = sum(np.sum((kpoints[ik] - p) ** 2) for (ik, p) in newpoints)
        for (ik, p) in newpoints:
            kpoints[ik] = p

    print("Final centers: " + str(kpoints))
    sc.stop()
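To run this program on the cluster, it can be submitted through YARN with the options from the earlier slide. The command below is only an illustration: kmeans_pyspark.py is an assumed file name for the program above, and the executor counts and sizes are example values, not requirements.

spark-submit --master yarn \
    --num-executors 4 \
    --executor-cores 2 \
    --executor-memory 2g \
    kmeans_pyspark.py /gabriel/datapoints.txt

The following slides trace the first iteration of the while loop interactively in the pyspark shell.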
>>> text = sc.textFile("/gabriel/datapoints.txt")
>>> text.collect()
[u'1 1', u'2 2', u'3 3', u'4 4', u'3 4', u'4 3', u'1 2']
>>> data = text.map(parsevector)
>>> data.collect()
[array([ 1.,  1.]), array([ 2.,  2.]), array([ 3.,  3.]), array([ 4.,  4.]),
 array([ 3.,  4.]), array([ 4.,  3.]), array([ 1.,  2.])]
>>> kpoints = data.takeSample(False, 2, 1)
>>> kpoints
[array([ 3.,  3.]), array([ 1.,  2.])]
>>> closest = data.map(
...     lambda p: (closestpoint(p, kpoints), (p, 1)))
>>> closest.collect()
[(1, (array([ 1.,  1.]), 1)), (1, (array([ 2.,  2.]), 1)), (0, (array([ 3.,  3.]), 1)),
 (0, (array([ 4.,  4.]), 1)), (0, (array([ 3.,  4.]), 1)), (0, (array([ 4.,  3.]), 1)),
 (1, (array([ 1.,  2.]), 1))]
>>> pointstats = closest.reduceByKey(
...     lambda p1_c1, p2_c2: (p1_c1[0] + p2_c2[0], p1_c1[1] + p2_c2[1]))
>>> pointstats.collect()
[(0, (array([ 14.,  14.]), 4)), (1, (array([ 4.,  5.]), 3))]
>>> newpoints = pointstats.map(
...     lambda st: (st[0], st[1][0] / st[1][1])).collect()
>>> newpoints
[(0, array([ 3.5,  3.5])), (1, array([ 1.33333333,  1.66666667]))]
>>> tempdist = sum(np.sum((kpoints[ik] - p) ** 2) for (ik, p) in newpoints)
>>> tempdist
0.7222222222222221
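Since tempdist (0.72) is still above convergedist (0.1), the loop runs once more. Continuing the session by hand, one would expect something like the sketch below; the printed values are worked out from the data above rather than captured from an actual run, and the ordering of the collected pairs may differ.

>>> for (ik, p) in newpoints:
...     kpoints[ik] = p
>>> closest = data.map(lambda p: (closestpoint(p, kpoints), (p, 1)))
>>> pointstats = closest.reduceByKey(
...     lambda p1_c1, p2_c2: (p1_c1[0] + p2_c2[0], p1_c1[1] + p2_c2[1]))
>>> newpoints = pointstats.map(
...     lambda st: (st[0], st[1][0] / st[1][1])).collect()
>>> newpoints
[(0, array([ 3.5,  3.5])), (1, array([ 1.33333333,  1.66666667]))]
>>> tempdist = sum(np.sum((kpoints[ik] - p) ** 2) for (ik, p) in newpoints)
>>> tempdist
0.0

The centers did not move, so tempdist drops below convergedist and the loop terminates with the final centers (3.5, 3.5) and (1.33, 1.67).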
SPARK software: MLlib

MLlib is Spark's machine learning (ML) library. It provides support for:
- Basic statistics
- Classification and regression
- Clustering
- Feature extraction
- Frequent pattern mining
- Optimization

Two sets of APIs are available:
- RDD based: import pyspark.mllib
- DataFrames based (new): import pyspark.ml
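As a small taste of the RDD-based API, the sketch below uses MLlib's basic-statistics support to summarize the data points from the k-means example. The printed numbers are what one would expect for that data set, not captured output.

>>> from pyspark.mllib.stat import Statistics
>>> data = sc.textFile("/gabriel/datapoints.txt").map(parsevector)
>>> summary = Statistics.colStats(data)   # column-wise summary statistics
>>> summary.count()
7
>>> summary.mean()                        # per-column mean of the 7 points
array([ 2.57142857,  2.71428571])
>>> summary.numNonzeros()
array([ 7.,  7.])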
import sys
import numpy as np
from pyspark import SparkContext
from pyspark.mllib.clustering import KMeans

def parsevector(line):
    return np.array([float(x) for x in line.split(' ')])

if __name__ == "__main__":
    sc = SparkContext(appName="KMeans")
    lines = sc.textFile(sys.argv[1])
    data = lines.map(parsevector)
    model = KMeans.train(data, 2, maxIterations=10)
    sc.stop()

What is a Model?

A model is a complex pipeline of components:
- Data sources
- Joins
- Featurization logic
- Algorithm(s)
- Transformers
- Estimators
- Tuning parameters
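The Transformer and Estimator terminology comes from the DataFrame-based pyspark.ml API, where a model is literally assembled as a Pipeline of such stages. The sketch below only illustrates that idea and is not code from the lecture: it assumes a DataFrame df with raw columns x and y, and chains a VectorAssembler (a Transformer) with KMeans (an Estimator).

from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

# Transformer: turns the raw columns into a single feature vector column
assembler = VectorAssembler(inputCols=["x", "y"], outputCol="features")

# Estimator: fitting it produces a KMeansModel (itself a Transformer)
kmeans = KMeans(k=2, seed=1, featuresCol="features")

# The pipeline chains featurization and the algorithm; fit() runs every stage
pipeline = Pipeline(stages=[assembler, kmeans])
model = pipeline.fit(df)           # df is an assumed DataFrame with x, y columns
predictions = model.transform(df)  # adds the features and prediction columns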
MLlib k-means clustering model

Parameters of KMeans.train():
- rdd: Training points as an RDD of Vector.
- k: Number of clusters to create.
- maxIterations: Maximum number of iterations allowed. (default: 100)
- initializationMode: The initialization algorithm. This can be either "random" or "k-means||". (default: "k-means||")
- seed: Random seed value for cluster initialization. Set as None to generate the seed based on system time. (default: None)
- epsilon: Distance threshold within which a center will be considered to have converged. If all centers move less than this Euclidean distance, iterations are stopped. (default: 1e-4)
- initialModel: Initial cluster centers can be provided as a KMeansModel object rather than using the random or k-means|| initialization mode. (default: None)

>>> text = sc.textFile("/gabriel/datapoints.txt")
>>> text.collect()
[u'1 1', u'2 2', u'3 3', u'4 4', u'3 4', u'4 3', u'1 2']
>>> data = text.map(parsevector)
>>> data.collect()
[array([ 1.,  1.]), array([ 2.,  2.]), array([ 3.,  3.]), array([ 4.,  4.]),
 array([ 3.,  4.]), array([ 4.,  3.]), array([ 1.,  2.])]
>>> model = KMeans.train(data, 2, maxIterations=2)
>>> model.clusterCenters
[array([ 3.5,  3.5]), array([ 1.33333333,  1.66666667])]
>>> model.predict([0, 4])
1
>>> model.save(sc, "/gabriel/clustermodel")
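Besides predicting the cluster of a single point, the trained mllib model can also assign every point of the training RDD at once and report the within-set sum of squared errors. A brief sketch; the printed values are what the data above should give, not captured output.

>>> model.predict(data).collect()   # cluster index for every training point
[1, 1, 0, 0, 0, 0, 1]
>>> model.computeCost(data)         # sum of squared distances to the centers
3.3333333333333335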
whale:~> hdfs dfs -ls /gabriel/clustermodel/
/gabriel/clustermodel/data
/gabriel/clustermodel/metadata
whale:~> hdfs dfs -ls /gabriel/clustermodel/data/
_SUCCESS
part-r-00000-42276161-3641-bab1-bdd07c7325a3.snappy.parquet
part-r-00001-42276161-3641-bab1-bdd07c7325a3.snappy.parquet
part-r-00002-42276161-3641-bab1-bdd07c7325a3.snappy.parquet
part-r-00003-42276161-3641-bab1-bdd07c7325a3.snappy.parquet
whale:~> hdfs dfs -ls /gabriel/clustermodel/metadata/
_SUCCESS
part-00000
whale:~> hdfs dfs -cat /gabriel/clustermodel/metadata/part-00000
{"class":"org.apache.spark.mllib.clustering.KMeansModel","version":"1.0","k":2}

Parquet is a columnar format that is supported by many data processing systems. Spark provides support for both reading and writing Parquet files and automatically preserves the schema of the original data.

Snappy is a compression/decompression library developed by Google. It is very fast with reasonable compression: compared to zlib, Snappy is an order of magnitude faster for most inputs, but the resulting compressed files are anywhere from 20% to 100% bigger.
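The saved directory can be read back into a model in a later session. A minimal sketch using the mllib loader; the printed centers are those one would expect from the model saved above.

>>> from pyspark.mllib.clustering import KMeansModel
>>> samemodel = KMeansModel.load(sc, "/gabriel/clustermodel")
>>> samemodel.clusterCenters
[array([ 3.5,  3.5]), array([ 1.33333333,  1.66666667])]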
DataFrames

Distributed collection of rows under named columns; conceptually similar to a table in a relational database.
Can be constructed from a wide array of sources such as: structured data files, Hive tables, external databases, existing RDDs.
The DataFrame API is available in Scala, Java, Python, and R.

Common characteristics between RDDs and DataFrames:
- Distributed
- Immutable
- Lazy evaluation

>>> rdd = sc.parallelize([(0, 1), (0, 1), (0, 2), (1, 2), (1, 10)])
>>> df = rdd.toDF(['id', 'score'])
>>> df.show()
+---+-----+
| id|score|
+---+-----+
|  0|    1|
|  0|    1|
|  0|    2|
|  1|    2|
|  1|   10|
+---+-----+
>>> df.printSchema()
root
 |-- id: long (nullable = true)
 |-- score: long (nullable = true)
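Because a DataFrame behaves like a relational table, columns can be aggregated directly with SQL-style operations. As a small illustrative continuation of the session above (the output shown is what one would expect for this data; row order may differ):

>>> df.groupBy("id").avg("score").show()
+---+------------------+
| id|        avg(score)|
+---+------------------+
|  0|1.3333333333333333|
|  1|               6.0|
+---+------------------+
>>> df.filter(df.score > 1).count()
3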
K-means DataFrames example

from pyspark.ml.clustering import KMeans
from pyspark.sql import SparkSession

if __name__ == "__main__":
    spark = SparkSession \
        .builder \
        .appName("KMeansExample") \
        .getOrCreate()

    dataset = spark.read.format("libsvm").load("/gabriel/datapoints.txt")

    kmeans = KMeans().setK(2).setSeed(1)
    model = kmeans.fit(dataset)
    centers = model.clusterCenters()
    spark.stop()

More information

Project webpage: http://spark.apache.org/
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/