Real-time Data Engineering in the Cloud
Exercise Guide

Jesse Anderson

2017 SMOKING HAND LLC ALL RIGHTS RESERVED
Version 1.12.a9779239

Contents

1 Lab Notes
2 Kafka HelloWorld
3 Streaming ETL
4 Advanced Streaming
5 Spark Data Analysis
6 Real-time Dashboard

EXERCISE 1: Lab Notes

These notes will help you work through and understand the labs for this course.

1.1 General Notes

Copying and pasting from this document may not work correctly in all PDF readers. We suggest you use Adobe Reader.

1.2 Command Line Examples

Most labs contain commands that must be run from the command line. These commands will look like:

$ cat /etc/hosts
127.0.0.1       localhost localhost.localdomain localhost4 localhost4.localdomain4
::1             localhost localhost.localdomain localhost6 localhost6.localdomain6
2001:4800:7810:0512:e2aa:bc1f:ff04:badc cdh5-cm-vm01
166.78.10.206   cdh5-cm-vm01
10.181.7.208    cdh5-cm-vm01

When running this command, you will not type in everything shown. You only type in the portion after the $ prompt; in this example, you would type in only cat /etc/hosts. The rest of the listing is the output of the command.

Sometimes a listing will contain multiple commands:

$ chkconfig --list iptables
iptables 0:off 1:off 2:on 3:on 4:on 5:on 6:off
$ service iptables stop
iptables: Flushing firewall rules: [ OK ]
iptables: Setting chains to policy ACCEPT: filter [ OK ]
iptables: Unloading modules: [ OK ]

This listing contains two different commands to run. To run them all, find every $ prompt. In this example, the two commands are chkconfig --list iptables and service iptables stop.

Other times a command will span multiple lines:

$ hadoop fs -put \
    movies.dat /user/root/movielens/movies/

This command is too long to fit on one line in the lab manual, so it is split across two lines. In this example, you would type in hadoop fs -put \, press <enter>, and finish the command with movies.dat /user/root/movielens/movies/.

1.3 VirtualBox Notes

If your class is using a VirtualBox virtual machine, you can make certain changes to make it run faster or to share the host's file system.

If you have enough RAM, you can allocate more RAM to the virtual machine. By default, the VM uses 1 GB of RAM. Adding 2 or more GB will make the virtual machine perform faster.

VirtualBox can share a folder from the host with the guest VM. Once the folder is shared, you can mount it in the guest with the following command:

$ sudo mount -t vboxsf -o rw,uid=1001,gid=1001 \
    shareddirectory ~/guestvmdirectory

To always mount the directory in the guest, place this line in /etc/fstab:

shareddirectory /home/vmuser/guestvmdirectory vboxsf rw,uid=1000,gid=1000 0 0

Then run the command:

$ sudo mount /home/vmuser/guestvmdirectory/

VirtualBox has other advanced integrations, such as a shared clipboard. This allows you to copy and paste between the host and guest operating systems' clipboards. See this documentation for more information.

1.4 Maven Offline Mode

Maven is configured to be in offline mode. All dependencies for the class have already been loaded. If you add a new dependency, you may see a message like:

Failed to retrieve org.slf4j:slf4j-api-1.7.14
Caused by: Cannot access confluent-repository (http://packages.confluent.io/maven/) in offline
mode and the artifact org.slf4j:slf4j-api:jar:1.7.14 has not been downloaded from it before.

To take Maven out of offline mode, run the maven_online.sh script that is on the path. Once you're done, you can put Maven back into offline mode by running the maven_offline.sh script that is on the path. You can learn more about Maven offline mode here.

EXERCISE 2: Kafka HelloWorld

2.1 Objective

This 45 minute lab uses Kafka to ingest data. We will:

- Create a producer to import data
- Create a consumer to read the data

Project Directory: helloworld

2.2 Starting Kafka

Kafka is installed on your virtual machine, but the server processes aren't started, to keep memory usage low.

1. Start the ZooKeeper service.

   $ sudo service zookeeper start

2. Start the Kafka Broker (Kafka Server) service.

   $ sudo service kafka-server start

3. Optionally, start the Kafka REST service. Start this service if you are going to use the REST interface for Kafka.

   $ sudo service kafka-rest start

4. Optionally, start the Schema Registry service. Start this service if you are going to use Avro for messages.

   $ sudo service schema-registry start

Shutdown Services

Once you are done with Kafka, you will need to shut down the services to regain memory.

$ sudo service schema-registry stop
$ sudo service kafka-rest stop
$ sudo service kafka-server stop
$ sudo service zookeeper stop

2.3 Kafka HelloWorld

Create a KafkaProducer with the following characteristics:

- Reads and sends the playing_cards_datetime.tsv dataset
- Connects to localhost:9092
- Sends messages on the hello_topic topic
- Sends all messages as Strings

Create a Consumer Group with the following characteristics:

- Consumes messages sent on the hello_topic topic
- Connects to ZooKeeper on localhost
- Consumes all data as Strings
- Outputs the contents of the messages to the screen

When running, start your consumer first and then start the producer. (A sketch of one possible producer and consumer appears at the end of this exercise.)

2.4 Advanced Optional Steps

- Add a command line producer/consumer
- Use the REST API with a scripting language to send out the playing_cards_datetime.tsv dataset
- Use Avro with Kafka to send binary objects between the producer and consumer
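For reference, here is a minimal sketch of a producer and consumer matching the characteristics in section 2.3. It is not the course solution: the consumer group name and dataset path are assumptions, and it uses the bootstrap-server based KafkaConsumer API rather than the ZooKeeper-based consumer mentioned above, which accomplishes the same thing for this exercise.

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class HelloWorldKafka {

    // Read the dataset and send each line as a String message on hello_topic.
    public static void produce() throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Dataset path as used in the later exercises (an assumption for this lab).
            for (String line : Files.readAllLines(
                    Paths.get("/home/vmuser/training/datasets/playing_cards_datetime.tsv"))) {
                producer.send(new ProducerRecord<>("hello_topic", line));
            }
        }
    }

    // Consume the Strings from hello_topic and print them to the screen.
    public static void consume() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "helloworld-group"); // hypothetical consumer group name
        props.put("auto.offset.reset", "earliest");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Arrays.asList("hello_topic"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(1000);
                for (ConsumerRecord<String, String> record : records) {
                    System.out.println(record.value());
                }
            }
        }
    }
}

As noted above, run the consumer first, then the producer.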

EXERCISE 3: Streaming ETL

3.1 Objective

This 60 minute lab uses Spark Streaming to ETL data. We will:

- Create an RDD from a socket
- ETL the data
- Do a simple real-time count on the data

Project Directory: sparkstreamingetl

3.2 Cards Dataset

For your Spark Streaming program, you will be working with the playing card dataset. The file is on the local filesystem at:

/home/vmuser/training/datasets/playing_cards.tsv

The data in the playing_cards.tsv file is made up of a card number, a tab separator, and a card suit:

6   Diamond
3   Diamond
4   Club

For this exercise, we won't be reading the file directly. We'll be using a pre-written Python script that writes out the file to a socket.

3.3 Streaming Program

Create a Spark Streaming program with the following characteristics (a sketch appears at the end of this exercise):

- Sets the master to local[2] or more threads
- Microbatches every 10 seconds
- Binds to localhost and port 9998
- ETLs the incoming data into a Tuple2 of the suit and the card number
- Sums the cards by the suit
- Saves the sums to a realtimeoutput directory
- Prints out the first 10 elements

3.4 Starting the Socket Input

Before starting to test your program, you will need to start the program that provides the data. You can start it with:

$ ./streamfile.py ~/training/datasets/playing_cards.tsv

Once the program is started, run your Spark program.

Log4J Output Levels

log4j.properties is set to WARN. Change it to INFO for more output and debugging.

3.5 Advanced Optional Steps

1. Save the ETL'd RDD out to disk
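One possible shape for the streaming program is sketched below, using the Java Spark Streaming API used elsewhere in the course. The output directory naming and the parsing of the tab-separated line are assumptions; treat it as a starting point rather than the official solution.

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

import scala.Tuple2;

public class StreamingETL {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setMaster("local[2]").setAppName("StreamingETL");
        // 10 second microbatches.
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(10));

        // Lines such as "6<tab>Diamond" arrive from the streamfile.py socket.
        JavaDStream<String> lines = jssc.socketTextStream("localhost", 9998);

        // ETL each line into a Tuple2 of (suit, card number).
        JavaPairDStream<String, Integer> cards = lines.mapToPair(line -> {
            String[] fields = line.split("\t");
            return new Tuple2<>(fields[1], Integer.parseInt(fields[0]));
        });

        // Sum the card numbers by suit within each microbatch.
        JavaPairDStream<String, Integer> sums = cards.reduceByKey((a, b) -> a + b);

        // Print the first 10 elements of each batch and save each batch under realtimeoutput.
        sums.print();
        sums.foreachRDD((rdd, time) ->
                rdd.saveAsTextFile("realtimeoutput/sums-" + time.milliseconds()));

        jssc.start();
        jssc.awaitTermination();
    }
}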

EXERCISE 4: Advanced Streaming

4.1 Objective

This 60 minute lab uses Spark to process data in Kafka. We will:

- Consume data from Kafka
- ETL the incoming data
- Count the cards per game ID

Project Directory: sparkstreamingadvanced

4.2 Starting Services

To save memory, the services needed by Kafka are not started.

1. Start the ZooKeeper service.

   $ sudo service zookeeper start

2. After letting the ZooKeeper service start, start the Kafka service.

   $ sudo service kafka-server start

If your programs report an error connecting to Kafka, you can check the status of the services with:

$ sudo service zookeeper status

or:

$ sudo service kafka-server status

If the processes crash consistently, your laptop may not have enough memory to run the various processes. You can view Kafka's log by running:

$ tail /var/log/kafka/kafka-server.out

4.3 Dataset

This exercise uses a more complex playing card dataset. The file is on the local filesystem at:

/home/vmuser/training/datasets/playing_cards_datetime.tsv

The data in the playing_cards_datetime.tsv file is made up of a timestamp, a GUID to identify a game, the type of game, the suit, and the card. Each piece of data is tab separated. The cards are no longer solely numeric and include Jacks, Queens, and Kings. Here is an example of the data:

2015-01-10 00:00:00   1ea7fc17-7cf0-486d-8b8b-ad905e0d7a7a   PaiGow   Club    Queen
2015-01-10 00:00:00   1ea7fc17-7cf0-486d-8b8b-ad905e0d7a7a   PaiGow   Club    5
2015-01-10 00:00:00   1ea7fc17-7cf0-486d-8b8b-ad905e0d7a7a   PaiGow   Heart   7

This dataset will not be read from the local filesystem. It will be read from a Kafka topic. The Kafka topic is cardsdatetime. Each message will be an individual line from the file. The key will be playing_cards_datetime and the value will be the line.

4.4 Starting the Producer

Start the CardProducer class in the common package. That is the program that will read the file and produce it into Kafka.

4.5 Reading from Kafka

Create a Spark Streaming program with the following characteristics (a sketch covering these steps and the advanced steps appears at the end of this exercise):

- Uses Spark Streaming with Kafka with a 2 second batch interval
- Creates a Kafka consumer on the cardsdatetime topic
- ETLs the data by sending the GUID (game id) as the key and the card number as the value
- If the card number is non-numeric, does not process that event
- Sums the card numbers for a game
- Prints out the first 10 elements

4.6 Advanced Optional Steps

Spark Streaming lacks a built-in way of producing into Kafka. Use the foreachRDD and foreachPartition methods to manually produce the data in an RDD to Kafka. Produce both the ETL'd RDD and the counts RDD to Kafka: produce the ETL'd RDD to the cardsetl topic and the counts RDD to the cardscounts topic.

You can use the built-in Kafka command line utilities to view the output. To view the ETL:

$ kafka-console-consumer --bootstrap-server localhost:9092 --new-consumer \
    --property print.key=true --topic cardsetl

To view the counts:

$ kafka-console-consumer --bootstrap-server localhost:9092 --new-consumer \
    --property print.key=true --topic cardscounts
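The sketch below shows how the consumer, the ETL, and the advanced producing step might fit together. It assumes the spark-streaming-kafka-0-10 integration (the VM may ship the older 0-8 integration, whose KafkaUtils signature differs); the consumer group id, field indexes, and producer-per-partition approach are assumptions to adapt to the project skeleton.

import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka010.ConsumerStrategies;
import org.apache.spark.streaming.kafka010.KafkaUtils;
import org.apache.spark.streaming.kafka010.LocationStrategies;

import scala.Tuple2;

public class AdvancedStreaming {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setMaster("local[2]").setAppName("AdvancedStreaming");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(2));

        Map<String, Object> kafkaParams = new HashMap<>();
        kafkaParams.put("bootstrap.servers", "localhost:9092");
        kafkaParams.put("key.deserializer", StringDeserializer.class);
        kafkaParams.put("value.deserializer", StringDeserializer.class);
        kafkaParams.put("group.id", "sparkstreamingadvanced"); // hypothetical group id

        JavaInputDStream<ConsumerRecord<String, String>> stream = KafkaUtils.createDirectStream(
                jssc,
                LocationStrategies.PreferConsistent(),
                ConsumerStrategies.<String, String>Subscribe(
                        Arrays.asList("cardsdatetime"), kafkaParams));

        // ETL: fields are timestamp, game GUID, game type, suit, card.
        // Key by the game GUID, keep only numeric cards, and parse the card number.
        JavaPairDStream<String, Integer> etl = stream
                .mapToPair(record -> {
                    String[] fields = record.value().split("\t");
                    return new Tuple2<>(fields[1], fields[4]);
                })
                .filter(t -> t._2().matches("\\d+"))
                .mapToPair(t -> new Tuple2<>(t._1(), Integer.parseInt(t._2())));

        // Sum the card numbers per game and print the first 10 elements of each batch.
        JavaPairDStream<String, Integer> sums = etl.reduceByKey((a, b) -> a + b);
        sums.print();

        // Advanced step: produce the counts back into Kafka with foreachRDD/foreachPartition.
        // The ETL'd stream can be produced to the cardsetl topic the same way.
        sums.foreachRDD(rdd -> rdd.foreachPartition(partition -> {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            // A producer per partition per batch keeps the sketch simple; it is not efficient.
            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                while (partition.hasNext()) {
                    Tuple2<String, Integer> t = partition.next();
                    producer.send(new ProducerRecord<>("cardscounts", t._1(), t._2().toString()));
                }
            }
        }));

        jssc.start();
        jssc.awaitTermination();
    }
}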

EXERCISE 5: Spark Data Analysis

5.1 Objective

This 60 minute lab uses Spark, Spark SQL, or Apache Hive to analyze data. We will:

- Move the data from Kafka to the file system
- Prepare the data to be queried
- Query the data using our analytics tool of choice

Project Directory: sparkanalysis

Memory Limits

This exercise will push the memory limits of the VM. We highly suggest you increase the VM's memory limit. If you still don't have enough memory, you may need to use a cloud resource with more memory.

5.2 Cards Dataset

This exercise uses a more complex playing card dataset. The file is on the local filesystem at:

/home/vmuser/training/datasets/playing_cards_datetime.tsv

The data in the playing_cards_datetime.tsv file is made up of a timestamp, a GUID to identify a game, the type of game, the suit, and the card. Each piece of data is tab separated. The cards are no longer solely numeric and include Jacks, Queens, and Kings. Here is an example of the data:

2015-01-10 00:00:00   1ea7fc17-7cf0-486d-8b8b-ad905e0d7a7a   PaiGow   Club    Queen
2015-01-10 00:00:00   1ea7fc17-7cf0-486d-8b8b-ad905e0d7a7a   PaiGow   Club    5
2015-01-10 00:00:00   1ea7fc17-7cf0-486d-8b8b-ad905e0d7a7a   PaiGow   Heart   7

This dataset will be in Kafka in the cardsdatetime topic. If you did the advanced steps for streaming, you will also have an ETL'd topic named cardsetl.

5.3 Moving Data From Kafka

You will need to move your data from the Kafka topic and place it into your local file system. To do this, you can use Kafka Connect. Kafka Connect allows you to move data from a Kafka topic into another system. This course doesn't focus on Kafka Connect; you can learn more about it in the Kafka Connect Documentation. (A sample sink configuration is sketched just after section 5.4 below.)

1. Change directories to the sparkanalysis directory.

2. Run:

   $ connect-standalone /etc/kafka/connect-standalone.properties \
       file-sink.properties

3. Let the connect-standalone process run for a few minutes.

4. Press Ctrl+C to stop the process.

5. Verify there is a file named cardsdatetime.txt and check that its contents look like the example data above.

5.4 Choosing an Analytics Framework

Now that you've moved the data to the file system, you'll need to choose a technology for querying the data. You have access to technologies like Apache Spark, Hadoop MapReduce, Spark SQL, Apache Hive, and Apache Impala on the VM to perform these analytics. Choose a framework that you are familiar with.
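For reference, the file-sink.properties file used in section 5.3 is already provided in the project directory. A minimal standalone file sink configuration generally looks something like the following; the connector name and output path here are assumptions, and the key/value converter settings come from /etc/kafka/connect-standalone.properties:

name=cards-file-sink
connector.class=org.apache.kafka.connect.file.FileStreamSinkConnector
tasks.max=1
topics=cardsdatetime
file=cardsdatetime.txt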

5.5 Analyzing the Data

Once you've chosen your analytics framework, you can start querying the data. When querying and analyzing data, you're looking for interesting patterns or information that will make a dashboard useful. As you're writing these queries, ask yourself:

- How will this data be consumed by others?
- What will people need to know every day?
- Is there anything anomalous in the data? (hint: there is)

As you find interesting queries or realizations, make notes about what you've found. We're going to be using these ideas in the next exercise while creating the dashboard. A starting-point query is sketched below.

Note: You may need to turn off some services you aren't using to do this analysis.
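If you pick Spark SQL, a starting point might look like the sketch below. It assumes a Spark 2.x SparkSession is available on the VM and that the Kafka Connect output file sits in the current working directory; the column names are assumptions that simply mirror the dataset description.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class CardsAnalysis {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .master("local[*]")
                .appName("CardsAnalysis")
                .getOrCreate();

        // Columns mirror the dataset description: timestamp, game GUID, game type, suit, card.
        Dataset<Row> cards = spark.read()
                .option("sep", "\t")
                .csv("cardsdatetime.txt")
                .toDF("ts", "game_id", "game_type", "suit", "card");

        cards.createOrReplaceTempView("cards");

        // Example question: how many cards of each suit were dealt per game type?
        spark.sql("SELECT game_type, suit, COUNT(*) AS cnt "
                + "FROM cards GROUP BY game_type, suit ORDER BY game_type, cnt DESC")
             .show();

        spark.stop();
    }
}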

EXERCISE 6: Real-time Dashboard

6.1 Objective

This 120 minute lab uses Spark Streaming, Kafka, and D3.js to create a real-time dashboard. We will:

- Create real-time analytics
- Consume the analytics
- Display the analytics on a web page with a chart

Project Directory: realtimedashboard

Memory Limits

This exercise will push the memory limits of the VM. We highly suggest you increase the VM's memory limit. If you still don't have enough memory, you may need to use a cloud resource with more memory.

6.2 Cards Dataset

This exercise uses a more complex playing card dataset. The file is on the local filesystem at:

/home/vmuser/training/datasets/playing_cards_datetime.tsv

The data in the playing_cards_datetime.tsv file is made up of a timestamp, a GUID to identify a game, the type of game, the suit, and the card. Each piece of data is tab separated. The cards are no longer solely numeric and include Jacks, Queens, and Kings. Here is an example of the data:

2015-01-10 00:00:00   1ea7fc17-7cf0-486d-8b8b-ad905e0d7a7a   PaiGow   Club    Queen
2015-01-10 00:00:00   1ea7fc17-7cf0-486d-8b8b-ad905e0d7a7a   PaiGow   Club    5
2015-01-10 00:00:00   1ea7fc17-7cf0-486d-8b8b-ad905e0d7a7a   PaiGow   Heart   7

This dataset will not be read from the local filesystem. It will be read from a Kafka topic. The Kafka topic is cardsdatetime. Each message will be an individual line from the file. The key will be playing_cards_datetime and the value will be the line.

6.3 Writing a Real-time Analysis

Write your analytics using the framework of your choice. These analytics should be a real-time representation of the ad-hoc analysis you did in the previous exercise. Publish the results of your analytics back into Kafka.

For ease of ETL and moving data between RDDs, the common package has a Card class that can represent the data coming in. If you are using Spark, use the RDDProducer.produceValues helper method in the Common package to produce an RDD to Kafka. The parameter type for the RDD should be JavaPairDStream<String, String>.

When converting the analytics to a string, we suggest you output JSON. This will make it easier for the web page's AJAX calls and chart rendering. The output of the JSON string will vary depending on the analytics, but should look something like:

[{"gametype":"paigow","count":3, "sum":10}]

6.4 Starting the CardProducer

When you are running the analytics and dashboard code, make sure that you have the CardProducer running to add new data to Kafka. The CardProducer class is located in the Common package of the sparkstreamingadvanced project directory.

6.5 Running the Spark Analysis and CardProducer

To keep resource usage down, you can run the CardProducer from the command line. You can run it with Maven with:

$ mvn exec:java -Dexec.mainClass="path.to.MainClass"

You can pass in arguments to the program with:

$ mvn exec:java -Dexec.mainClass="path.to.MainClass" -Dexec.args="myargs"

6.6 Writing the Dashboard

The dashboard will be written using HTML and JavaScript. Depending on your familiarity with these technologies, you may or may not write this yourself.

6.6.1 Unfamiliar with HTML and JavaScript

If you aren't familiar with HTML and JavaScript, you may just write the Spark side of things and use the solution's code to visualize the data. Please note that the output of your JSON will need to match the solution's exactly.

6.6.2 Familiar with HTML and JavaScript

If you are familiar with both, we have written some helper functions to make it easier to interact with Kafka's REST interface. Start off by importing the helper JavaScript module:

<script src="kafkaresthelper.js"></script>

In your code, you will need to instantiate the helper. After that, you can call the createconsumerinstance method and pass in the correct information. The last parameter is a number corresponding to your time interval. This interval is the amount of time between calls of the callback function.

var kafkaresthelper = new KafkaRESTHelper();
kafkaresthelper.createconsumerinstance("mygroupname", "mytopicname",
    mycallbackfunction, 10000);

The callback function has a parameter for the data that was retrieved from Kafka over the REST interface. The data object will be an array containing all of the events received between the last callback and the current time.

function bygametype(data) {
    // Do something with the data
}

As shown in the Spark section, this code expects the data to be passed as JSON. All data is automatically coalesced and base 64 decoded for you. The JSON written out by the Spark analysis program should look like:

[{"gametype":"paigow","count":3, "sum":10}]

6.7 Running the Dashboard

When running the dashboard, you will need several services running.

1. Start the Kafka REST service.

   $ sudo service kafka-rest start

2. Start the web server. This should be started from the root of the realtimedashboard directory. This web server serves up the files and, more importantly, is a proxy for the Kafka REST service. To learn more about why a proxy is needed, read this article on CORS.

   $ ws --rewrite '/kafkarest/* -> http://localhost:8082/$1'

3. Finally, start your browser and go to http://localhost:8000/dashboard/.

Unexpected value NaN Message

If you see this message in the console:

Unexpected value NaN parsing x attribute.

you can usually ignore it. This happens when a count is 0.

6.8 Deploying to the Cloud

Once you have tested everything locally, you will need to deploy to the Cloud. Before you do this, take the following steps:

1. Make sure that two people aren't using the same topic names. To do this:

   - Prefix all topics with your name.
   - Make the topic names a parameter that is passed in, instead of hard coded. This includes the CardProducer program.

2. Change the broker DNS name to be a parameter that is passed in, in all programs (see the sketch below).

3. Use SCP to transfer your code (but not the binaries in the target directory!).

4. Build your code using Maven.

5. Start the programs with the correct topic names and broker DNS name.

6. Start your browser and go to the instance's DNS name and port.

7. Optionally, increase the volume of data from the CardProducer program to get more data going through the system. Do this by:

   - Changing the Thread.sleep(500); to be a parameter.
   - Decreasing the sleep amount to something in the 50 to 100 ms range.
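A minimal sketch of reading these deployment parameters from the command line, rather than hard coding them, is shown below; the argument order, the topic suffix, and the broker host are illustrative assumptions:

public class DeploymentArgs {
    public static void main(String[] args) {
        if (args.length < 3) {
            System.err.println("Usage: <brokerList> <topicPrefix> <sleepMs>");
            System.exit(1);
        }
        String brokerList = args[0];                 // e.g. your-broker-dns:9092
        String topic = args[1] + "_cardsdatetime";   // prefix every topic with your name
        long sleepMs = Long.parseLong(args[2]);      // replaces the hard-coded Thread.sleep(500)

        System.out.printf("brokers=%s topic=%s sleepMs=%d%n", brokerList, topic, sleepMs);
    }
}

You could then launch a parameterized program through Maven as shown in section 6.5, for example:

$ mvn exec:java -Dexec.mainClass="path.to.MainClass" \
    -Dexec.args="your-broker-dns:9092 yourname 100"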