Lijuan Zhuge & Kailai Xu
May 3, 2017

In this short article, we describe how to set up Spark on a cluster and the basic usage of PySpark.

Set up Spark

The key to setting up Spark is to make several machines talk to each other. We need a master machine, which manages the cluster, and slave machines, which provide extra workers. Typically every machine provides several cores, and when the machines work together they can provide as many cores as we want. The bottleneck is then not the computation capacity but the network bandwidth.[1]

1. To make communication easy, give every machine a meaningful hostname. Assuming we have three machines, we rename them master, slave1 and slave2. On each machine, run

    $ sudo vim /etc/hostname

   and replace ALL the content with master (or slave1, slave2, respectively).

2. Make the machines aware of each other. On each machine, run

    $ sudo vim /etc/hosts

   and add the following three lines to the file, one per machine, each mapping an IP address to a hostname

    <IP address of master>  master
    <IP address of slave1>  slave1
    <IP address of slave2>  slave2

3. Disable the firewall.

    $ sudo ufw disable

4. Add the public key of every machine to ALL machines. In this way the machines can log in to each other without passwords. We can test the communication between the three machines with the ping command

    $ ping master
    $ ping slave1
    $ ping slave2

5. Now install Java, Scala and Spark. Several settings are needed in the .bashrc file.

    export JAVA_HOME=/usr/local/java/jdk1.8.0_131
    export JRE_HOME=/usr/local/java/jdk1.8.0_131/jre
    export CLASSPATH=.:$JAVA_HOME/lib:$JRE_HOME/lib:$CLASSPATH
    export PATH=$JAVA_HOME/bin:$JRE_HOME/bin:$JAVA_HOME:$PATH
    export JDK_HOME=/usr/local/java/jdk1.8.0_131
    export SCALA_HOME=/usr/local/scala
    export PATH=$PATH:$SCALA_HOME/bin
    export SPARK_HOME=/usr/local/spark
    export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin

   Adjust the paths to match your own installation.

6. Configure Spark. Some extra work is needed to configure Spark. In the /usr/local/spark/conf directory, remove the .template suffix from all the *.template files. Then edit the spark-env.sh file

    export SCALA_HOME=/usr/local/scala
    # set JDK path
    export JAVA_HOME=/usr/local/java/jdk1.8.0_131
    export PATH=$PATH:$JAVA_HOME/bin
    SPARK_MASTER_HOST=
    SPARK_LOCAL_IP=

   In addition, change the slaves file

    # A Spark Worker will be started on each of the machines listed below.
    #localhost
    master
    slave1
    slave2

7. Start the service. First, activate all the settings with

    $ source ~/.bashrc

   Then use start-all.sh to start the cluster. We can use jps to see the workers.

8. You can now access the Web UI.
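Steps 1, 2 and 4 above rely on every machine being able to resolve the names master, slave1 and slave2. As a complement to ping, the following is a minimal Python sketch that checks name resolution. The function names resolve_hosts and report are hypothetical helpers introduced here for illustration; they are not part of Spark.

```python
import socket


def resolve_hosts(names):
    """Map each hostname to its resolved IPv4 address, or None if lookup fails."""
    result = {}
    for name in names:
        try:
            result[name] = socket.gethostbyname(name)
        except OSError:  # socket.gaierror (unknown host) is a subclass of OSError
            result[name] = None
    return result


def report(names):
    """Print one line per host and return True only if every name resolved."""
    resolved = resolve_hosts(names)
    for name in names:
        print("%-8s -> %s" % (name, resolved[name] or "NOT RESOLVED"))
    return all(resolved[name] is not None for name in names)
```

Calling report(["master", "slave1", "slave2"]) on each machine should print an address for all three names if /etc/hosts was edited as in step 2. This only checks name resolution; it does not test passwordless login or the firewall.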

Assume we have a file computepi.py. We can run it in two ways.

Local

    $ spark-submit --master local computepi.py

Cluster

    $ spark-submit --master spark://<master-host>:7077 computepi.py

where <master-host> stands for the address of the master machine.

Jupyter Notebook

PySpark can also be used in a Jupyter notebook. To do that, edit the $HOME/.jupyter/profile_pyspark/startup/00-pyspark-setup.py file and add the following code[2]

    import os
    import sys

    # SPARK_HOME is the variable exported in .bashrc
    spark_home = os.environ.get('SPARK_HOME', None)
    sys.path.insert(0, spark_home + "/python")
    # <version>: the py4j version bundled with your Spark installation
    sys.path.insert(0, os.path.join(spark_home, 'python/lib/py4j-<version>-src.zip'))
    filename = os.path.join(spark_home, 'python/pyspark/shell.py')
    exec(compile(open(filename, "rb").read(), filename, 'exec'))

    spark_release_file = spark_home + "/RELEASE"
    if os.path.exists(spark_release_file) and "Spark 1.5" in \
            open(spark_release_file).read():

        pyspark_submit_args = os.environ.get("PYSPARK_SUBMIT_ARGS", "")
        if "pyspark-shell" not in pyspark_submit_args:
            pyspark_submit_args += " pyspark-shell"
        os.environ["PYSPARK_SUBMIT_ARGS"] = pyspark_submit_args

and start Jupyter notebook with

    $ jupyter notebook --profile=pyspark

Basic Usage

In Spark, all operations are performed on RDDs (resilient distributed datasets). An RDD supports three kinds of operations

    Creation
    Transformation
    Action

An RDD is not evaluated until an action is called. In interactive mode, sc is available as a global variable. Otherwise we have to import it from pyspark. The routine for a standalone program is

1. Import the Spark module.

    from pyspark import SparkContext, SparkConf

2. Create a SparkContext object.

    # sc = SparkContext(master, appname)
    sc = SparkContext("local", "Page Rank")
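The two-step routine above can be fleshed out into a complete standalone program. The following is a minimal sketch of what a file such as computepi.py could contain. The original does not show that file, so the Monte Carlo method, the helper names inside and estimate_pi, the application name and the sample count are all assumptions made for illustration.

```python
"""computepi.py: Monte Carlo estimate of pi (illustrative sketch)."""
import random


def inside(_):
    """Sample a random point in the unit square; True if it lies in the unit circle."""
    x, y = random.random(), random.random()
    return x * x + y * y < 1.0


def estimate_pi(hits, samples):
    """The fraction of points inside the quarter circle approximates pi / 4."""
    return 4.0 * hits / samples


def main(samples=100000, partitions=4):
    # 1. Import the Spark module (done here so the helpers above can be
    #    used even where Spark is not installed).
    from pyspark import SparkConf, SparkContext

    # 2. Create a SparkContext. The master is left unset on purpose so that
    #    the --master option of spark-submit decides where the job runs.
    conf = SparkConf().setAppName("ComputePi")
    sc = SparkContext(conf=conf)

    rdd = sc.parallelize(range(samples), partitions)  # creation
    in_circle = rdd.filter(inside)                    # transformation (lazy)
    hits = in_circle.count()                          # action (triggers the job)

    print("Pi is roughly %f" % estimate_pi(hits, samples))
    sc.stop()


if __name__ == "__main__":
    try:
        main()
    except ImportError:
        print("pyspark not found: run this file through spark-submit")
```

Nothing is computed when filter is called; the work only starts at count, which is the action. Leaving the master out of the code is a design choice of this sketch, so that the same file can be submitted with --master local or with the cluster URL without editing it.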

Here are some useful functions.

    Function                      Description
    map                           apply a function to all elements of the RDD
    flatMap                       same as map, but flatten the result to create a new RDD
    filter                        keep the RDD elements for which the function value is True
    reduceByKey                   reduce RDD values according to keys
    groupByKey                    group the RDD elements by keys
    groupBy                       group the RDD elements by the result of a function (for example, by their values)
    collect, take, takeSample,    functions to peek at the data
      first, count
    saveAsTextFile                save data
    sc.textFile                   read data
    repartition                   redistribute the data into a given number of partitions; this affects the number of tasks

References

[1] (Accessed on 05/03/2017).
[2] Pyspark: How to install and integrate with the jupyter notebook. (Accessed on 05/03/2017).
