Real-time Data Engineering in the Cloud Exercise Guide
Jesse Anderson
2017 SMOKING HAND LLC, ALL RIGHTS RESERVED
Version 1.12.a
Contents

1 Lab Notes
2 Kafka HelloWorld
3 Streaming ETL
4 Advanced Streaming
5 Spark Data Analysis
6 Real-time Dashboard
EXERCISE 1
Lab Notes

These notes will help you work through and understand the labs for this course.

1.1 General Notes

Copying and pasting from this document may not work correctly in all PDF readers. We suggest you use Adobe Reader.

1.2 Command Line Examples

Most labs contain commands that must be run from the command line. These commands will look like:

$ cat /etc/hosts
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
2001:4800:7810:0512:e2aa:bc1f:ff04:badc cdh5-cm-vm cdh5-cm-vm cdh5-cm-vm01

When running this command, you will not type in everything. You will only type in the portion after the $ prompt. In this example, you would only type in cat /etc/hosts. The rest of the listing contains the output of the command.

Sometimes a listing will contain multiple commands:

$ chkconfig --list iptables
iptables 0:off 1:off 2:on 3:on 4:on 5:on 6:off
$ service iptables stop
iptables: Flushing firewall rules: [ OK ]
iptables: Setting chains to policy ACCEPT: filter [ OK ]
iptables: Unloading modules: [ OK ]

There are two different commands to run in this listing. Find every $ prompt to identify all of the commands. In this example, the two commands are chkconfig --list iptables and service iptables stop.

Other times commands will be on multiple lines:

$ hadoop fs -put \
movies.dat /user/root/movielens/movies/

This command is too long to fit on one line in the lab manual and needs to be on two lines. In this example, you would type in hadoop fs -put \, then hit <enter>, and finish off the command with movies.dat /user/root/movielens/movies/.

1.3 VirtualBox Notes

If your class is using a VirtualBox virtual machine, you can make certain changes to make it run faster or share the host's file system.

If you have enough RAM, you can allocate more RAM to the virtual machine. By default, the VM uses 1 GB of RAM. Adding 2 or more GB will make the virtual machine perform faster.

VirtualBox can share a folder with the guest VM. Once the folder is shared, you can mount the directory with the following command:

$ sudo mount -t vboxsf -o rw,uid=1001,gid=1001 \
shareddirectory ~/guestvmdirectory

To always mount the directory in the guest, place this line in /etc/fstab:

shareddirectory /home/vmuser/guestvmdirectory vboxsf rw,uid=1000,gid=

Then run the command:

$ sudo mount /home/vmuser/guestvmdirectory/
VirtualBox has other advanced integrations, such as a shared clipboard. This allows you to copy and paste information between the host and guest operating systems' clipboards. See the VirtualBox documentation for more information.

1.4 Maven Offline Mode

Maven is configured to be in offline mode. All dependencies for the class have already been loaded. If you add a new dependency, you may see a message like:

Failed to retrieve org.slf4j:slf4j-api
Caused by: Cannot access confluent-repository in offline mode and the artifact org.slf4j:slf4j-api:jar: has not been downloaded from it before.

To take Maven out of offline mode, run the maven_online.sh script that is on the path. Once you're done, you can put Maven back into offline mode by running the maven_offline.sh script that is on the path. You can learn more about offline mode in the Maven documentation.
EXERCISE 2
Kafka HelloWorld

2.1 Objective

This 45 minute lab uses Kafka to ingest data. We will:

- Create a producer to import data
- Create a consumer to read the data

Project Directory: helloworld

2.2 Starting Kafka

Kafka is installed on your virtual machine, but the server processes aren't started, to keep memory usage low.

1. Start the ZooKeeper service.

$ sudo service zookeeper start

2. Start the Kafka Broker (Kafka Server) service.

$ sudo service kafka-server start

3. Optionally, start the Kafka REST service. Start this service if you are going to use the REST interface for Kafka.

$ sudo service kafka-rest start

4. Optionally, start the Schema Registry service. Start this service if you are going to use Avro for messages.

$ sudo service schema-registry start
Shutdown Services

Once you are done with Kafka, you will need to shut down the services to regain memory.

$ sudo service schema-registry stop
$ sudo service kafka-rest stop
$ sudo service kafka-server stop
$ sudo service zookeeper stop

2.3 Kafka HelloWorld

Create a KafkaProducer with the following characteristics:

- Reads and sends the playing_cards_datetime.tsv dataset
- Connects to localhost:9092
- Sends messages on the hello_topic
- Sends all messages as Strings

Create a Consumer Group with the following characteristics:

- Consumes messages sent on the hello_topic topic
- Connects to ZooKeeper on localhost
- Consumes all data as Strings
- Outputs the contents of the messages to the screen

When running, start your consumer first and then start the producer.

2.4 Advanced Optional Steps

- Add a command line producer/consumer
- Use the REST API with a scripting language to send out the playing_cards_datetime.tsv dataset
- Use Avro with Kafka to send binary objects between the producer and consumer
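The producer half can be sketched in a scripting language as well. Below is a minimal sketch in Python using the third-party kafka-python client rather than the lab's Java KafkaProducer; the broker address, topic, and dataset file come from the spec above, while read_messages and run_producer are illustrative names, not part of the lab code.

```python
# Sketch of the HelloWorld producer in Python, using the third-party
# kafka-python package instead of the lab's Java client. Broker address,
# topic, and dataset come from the spec above; names are illustrative.
def read_messages(path):
    """Yield each line of the TSV dataset as one String message."""
    with open(path) as f:
        for line in f:
            yield line.rstrip("\n")

def run_producer(path="playing_cards_datetime.tsv"):
    # Imported here so the pure helper above works without a broker.
    from kafka import KafkaProducer
    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",
        value_serializer=lambda s: s.encode("utf-8"))
    for message in read_messages(path):
        producer.send("hello_topic", message)
    producer.flush()  # block until everything has been delivered
```

The consumer half is symmetric: subscribe to hello_topic, deserialize each value as a String, and print it.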
EXERCISE 3
Streaming ETL

3.1 Objective

This 60 minute lab uses Spark Streaming to ETL data. We will:

- Create an RDD from a socket
- ETL the data
- Do a simple real-time count on the data

Project Directory: sparkstreamingetl

3.2 Cards Dataset

For your Spark Streaming program, you will be working with the playing card dataset. The file is on the local filesystem at:

/home/vmuser/training/datasets/playing_cards.tsv

The data in the playing_cards.tsv file is made up of a card number, a tab separator, and a card suit:

6	Diamond
3	Diamond
4	Club

For this exercise, we won't be reading the file directly. We'll be using a pre-written Python script that writes out the file to a socket.

3.3 Streaming Program

Create a Spark Streaming program with the following characteristics:
- Sets the master to local[2] or more threads
- Microbatches for 10 seconds
- Binds to localhost and port 9998
- ETLs the incoming data into a Tuple2 of the suit and the card number
- Sums the cards by the suit
- Saves the sums to a realtimeoutput directory
- Prints out the first 10 elements

3.4 Starting the Socket Input

Before starting to test your program, you will need to start the program that provides the data. You can start it with:

$ ./streamfile.py ~/training/datasets/playing_cards.tsv

Once the program is started, run your Spark program.

Log4J Output Levels: log4j.properties is set to WARN. Change it to INFO for more output and debugging.

3.5 Advanced Optional Steps

1. Save the ETL'd RDD out to disk
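Outside of Spark, the ETL and summing steps from section 3.3 boil down to a map and a keyed sum. Here is a minimal sketch of that per-microbatch logic in plain Python; the function names are illustrative, and in the lab these would be Tuple2 map and reduceByKey-style operations on the DStream.

```python
# Per-microbatch logic of the streaming ETL, sketched as plain Python.
# In the lab this is a Tuple2 map followed by a reduceByKey-style sum
# on the DStream; the function names here are illustrative.
def etl(line):
    """'6\tDiamond' -> ('Diamond', 6): suit as key, card number as value."""
    number, suit = line.split("\t")
    return (suit, int(number))

def sum_by_suit(lines):
    """Sum the card numbers for each suit in one microbatch."""
    totals = {}
    for suit, number in map(etl, lines):
        totals[suit] = totals.get(suit, 0) + number
    return totals
```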
EXERCISE 4
Advanced Streaming

4.1 Objective

This 60 minute lab uses Spark to process data in Kafka. We will:

- Consume data from Kafka
- ETL the incoming data
- Count the cards per game ID

Project Directory: sparkstreamingadvanced

4.2 Starting Services

To save memory, the services needed by Kafka are not started.

1. You will need to start the ZooKeeper service.

$ sudo service zookeeper start

2. After letting the ZooKeeper service start, you will need to start the Kafka service.

$ sudo service kafka-server start

If your programs report an error connecting to Kafka, you can check the status of the services with:

$ sudo service zookeeper status

or:

$ sudo service kafka-server status
If the processes crash consistently, your laptop may not have enough memory to run the various processes. You can view Kafka's log by running:

$ tail /var/log/kafka/kafka-server.out

4.3 Dataset

This exercise uses a more complex playing card dataset. The file is on the local filesystem at:

/home/vmuser/training/datasets/playing_cards_datetime.tsv

The data in the playing_cards_datetime.tsv file is made up of a timestamp, a GUID to identify a game, the type of game, the suit, and the card. Each piece of data is tab separated. The cards are no longer solely numeric and include Jacks, Queens and Kings. Here is an example of the data:

:00:00	1ea7fc17-7cf0-486d-8b8b-ad905e0d7a7a	PaiGow	Club	Queen
:00:00	1ea7fc17-7cf0-486d-8b8b-ad905e0d7a7a	PaiGow	Club
:00:00	1ea7fc17-7cf0-486d-8b8b-ad905e0d7a7a	PaiGow	Heart	7

This dataset will not be read from the local filesystem. It will be read from a Kafka topic. The Kafka topic is cardsdatetime. Each message will be an individual line from the file. The key will be playing_cards_datetime and the value will be the line.

4.4 Starting the Producer

Start the CardProducer class in the common package. That is the program that will read the file and produce it into Kafka.

4.5 Reading from Kafka

Create a Spark Streaming program with the following characteristics:

- Uses Spark Streaming with Kafka with a batch of 2 seconds
- Creates a Kafka consumer on the cardsdatetime topic
- ETLs the data by sending the GUID or game ID as the key and the number as the value
- If the number is non-numeric, doesn't process that event
- Sums the card numbers for a game
- Prints out the first 10 elements

4.6 Advanced Optional Steps

Spark Streaming lacks a built-in way of producing into Kafka. Use the foreachRDD and foreachPartition methods to manually produce the data in an RDD to Kafka. Produce both the ETL'd RDD and the counts RDD to Kafka. Produce the ETL RDD to the cardsetl topic and the counts RDD to the cardscounts topic.

You can use the built-in Kafka command line utilities to view the output. To view the ETL:

$ kafka-console-consumer --bootstrap-server localhost:9092 --new-consumer \
--property print.key=true --topic cardsetl

To view the counts:

$ kafka-console-consumer --bootstrap-server localhost:9092 --new-consumer \
--property print.key=true --topic cardscounts
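The ETL step in section 4.5 amounts to re-keying each line by its game GUID and dropping face cards. A sketch of that per-batch logic in plain Python (names are illustrative; in the lab these are DStream transformations):

```python
# Per-batch logic for the advanced ETL, sketched as plain Python:
# key each event by its game GUID, drop non-numeric cards, sum per game.
# In the lab these are DStream transformations; names are illustrative.
def etl(line):
    """Return (game_guid, card_number), or None for a face card."""
    timestamp, guid, game_type, suit, card = line.split("\t")
    if not card.isdigit():  # Jacks, Queens, and Kings are skipped
        return None
    return (guid, int(card))

def sum_by_game(lines):
    """Sum the numeric card values for each game in one batch."""
    totals = {}
    for guid, number in filter(None, map(etl, lines)):
        totals[guid] = totals.get(guid, 0) + number
    return totals
```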
EXERCISE 5
Spark Data Analysis

5.1 Objective

This 60 minute lab uses Spark, Spark SQL, or Apache Hive to analyze data. We will:

- Move the data from Kafka to the file system
- Prepare the data to be queried
- Query the data using our analytics tool of choice

Project Directory: sparkanalysis

Memory Limits: This exercise will push the memory limits of the VM. We highly suggest you increase the VM's memory limit. If you still don't have enough memory, you may need to use a cloud resource with more memory.

5.2 Cards Dataset

This exercise uses a more complex playing card dataset. The file is on the local filesystem at:

/home/vmuser/training/datasets/playing_cards_datetime.tsv

The data in the playing_cards_datetime.tsv file is made up of a timestamp, a GUID to identify a game, the type of game, the suit, and the card. Each piece of data is tab separated. The cards are no longer solely numeric and include Jacks, Queens and Kings. Here is an example of the data:
:00:00	1ea7fc17-7cf0-486d-8b8b-ad905e0d7a7a	PaiGow	Club	Queen
:00:00	1ea7fc17-7cf0-486d-8b8b-ad905e0d7a7a	PaiGow	Club
:00:00	1ea7fc17-7cf0-486d-8b8b-ad905e0d7a7a	PaiGow	Heart	7

This dataset will be in Kafka in the cardsdatetime topic. If you did the advanced level for streaming, you will have an ETL'd topic named cardsetl.

5.3 Moving Data From Kafka

You will need to move your data from the Kafka topic and place it into your local file system. To do this, you can use Kafka Connect. Kafka Connect allows you to move data from a Kafka topic into another system. This course doesn't focus on Kafka Connect. You can learn more about it in the Kafka Connect documentation.

1. Change directories to the sparkanalysis directory.
2. Run:

$ connect-standalone /etc/kafka/connect-standalone.properties \
file-sink.properties

3. Let the connect-standalone process run for a few minutes.
4. Press Ctrl+C to stop the process.
5. Verify there is a file named cardsdatetime.txt and check that its contents look like the example data above.

5.4 Choosing an Analytics Framework

Now that you've moved the data to the file system, you'll need to choose a technology for querying the data. You have access to technologies like Apache Spark, Hadoop MapReduce, Spark SQL, Apache Hive, and Apache Impala on the VM to perform these analytics. Choose a framework that you are familiar with.
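The file-sink.properties file is provided in the project directory. For reference, a minimal sink configuration might look like the following sketch; the connector class and property names are the standard Kafka Connect FileStreamSink settings, while the connector name is hypothetical and the file and topic names follow the steps above.

```properties
# Hypothetical minimal file-sink.properties: standard Kafka Connect
# FileStreamSink settings; file and topic names follow the lab steps.
name=cards-file-sink
connector.class=org.apache.kafka.connect.file.FileStreamSinkConnector
tasks.max=1
file=cardsdatetime.txt
topics=cardsdatetime
```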
5.5 Analyzing the Data

Once you've chosen your analytics framework, you can start querying the data. When querying and analyzing data, you're looking for interesting patterns or information that will make a dashboard useful. As you're writing these queries, ask yourself:

- How will this data be consumed by others?
- What will people need to know every day?
- Is there anything anomalous in the data? (hint: there is)

As you find interesting queries or realizations, make notes about what you've found. We're going to be using these ideas in the next exercise while creating the dashboard.

Note: You may need to turn off some services you aren't using to do this analysis.
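As a starting point, one simple analysis is a count of events per game type. A sketch of that logic in plain Python over the cardsdatetime.txt sink file (the function name is illustrative; in Spark SQL or Hive this would be a GROUP BY on the game-type column):

```python
# One ad-hoc analysis sketch: count events per game type in the sink
# file. The equivalent Spark SQL or Hive query is a GROUP BY on the
# game-type column; the function name is illustrative.
import csv

def count_by_game_type(path):
    counts = {}
    with open(path, newline="") as f:
        for row in csv.reader(f, delimiter="\t"):
            game_type = row[2]  # timestamp, guid, game type, suit, card
            counts[game_type] = counts.get(game_type, 0) + 1
    return counts
```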
EXERCISE 6
Real-time Dashboard

6.1 Objective

This 120 minute lab uses Spark Streaming, Kafka, and D3.js to create a real-time dashboard. We will:

- Create real-time analytics
- Consume the analytics
- Display the analytics on a web page with a chart

Project Directory: realtimedashboard

Memory Limits: This exercise will push the memory limits of the VM. We highly suggest you increase the VM's memory limit. If you still don't have enough memory, you may need to use a cloud resource with more memory.

6.2 Cards Dataset

This exercise uses a more complex playing card dataset. The file is on the local filesystem at:

/home/vmuser/training/datasets/playing_cards_datetime.tsv

The data in the playing_cards_datetime.tsv file is made up of a timestamp, a GUID to identify a game, the type of game, the suit, and the card. Each piece of data is tab separated. The cards are no longer solely numeric and include Jacks, Queens and Kings. Here is an example of the data:
:00:00	1ea7fc17-7cf0-486d-8b8b-ad905e0d7a7a	PaiGow	Club	Queen
:00:00	1ea7fc17-7cf0-486d-8b8b-ad905e0d7a7a	PaiGow	Club
:00:00	1ea7fc17-7cf0-486d-8b8b-ad905e0d7a7a	PaiGow	Heart	7

This dataset will not be read from the local filesystem. It will be read from a Kafka topic. The Kafka topic is cardsdatetime. Each message will be an individual line from the file. The key will be playing_cards_datetime and the value will be the line.

6.3 Writing a Real-time Analysis

Write your analytics using the framework of your choice. These analytics should be a real-time representation of the ad-hoc analysis you did in the previous exercise. Publish the results of your analytics back into Kafka.

For ease of ETL and moving data between RDDs, the common package has a Card class that can represent the data coming in.

If you are using Spark, use the RDDProducer.produceValues helper method in the common package to produce an RDD to Kafka. The parameter type for the RDD should be JavaPairDStream<String, String>.

When converting the analytics to a string, we suggest you output JSON. This will make it easier for the web page's AJAX calls and chart rendering. The output of the JSON string will vary depending on the analytics, but should look something like:

[{"gametype":"PaiGow","count":3,"sum":10}]

6.4 Starting the CardProducer

When you are running the analytics and dashboard code, make sure that you have the CardProducer running to add new data to Kafka. The CardProducer class is located in the common package of the sparkstreamingadvanced project directory.
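A sketch of producing that JSON string from aggregated values, shown in Python for brevity (in the Java analysis you would build the equivalent string yourself or use a JSON library; the function name is illustrative):

```python
# Building the dashboard's JSON payload from aggregated values. The keys
# match the sample output above; json.dumps guarantees valid JSON.
import json

def to_dashboard_json(rows):
    """rows: iterable of (gametype, count, sum) -> compact JSON array."""
    payload = [{"gametype": g, "count": c, "sum": s} for g, c, s in rows]
    return json.dumps(payload, separators=(",", ":"))
```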
6.5 Running the Spark Analysis and CardProducer

To keep resource usage down, you can run the CardProducer from the command line. You can run it with Maven:

$ mvn exec:java -Dexec.mainClass="path.to.MainClass"

You can pass in arguments to the program with:

$ mvn exec:java -Dexec.mainClass="path.to.MainClass" -Dexec.args="myargs"

6.6 Writing the Dashboard

The dashboard will be written using HTML and JavaScript. Depending on your familiarity with these technologies, you may or may not write this yourself.

Unfamiliar with HTML and JavaScript

If you aren't familiar with HTML and JavaScript, you may just write the Spark side of things and use the solution's code to visualize the data. Please note that the output of your JSON will need to match the solution's exactly.

Familiar with HTML and JavaScript

If you are familiar with both, we have written some helper functions to make it easier to interact with Kafka's REST interface. Start off by importing the helper JavaScript module:

<script src="kafkaresthelper.js"></script>

In your code, you will need to instantiate the helper. After that, you can call the createconsumerinstance method and pass in the correct information. The last parameter is a number corresponding to your time interval. This interval will serve as the amount of time between calls of the callback function.
var kafkaresthelper = new KafkaRESTHelper();
kafkaresthelper.createconsumerinstance("mygroupname",
    "mytopicname", mycallbackfunction, 10000);

The callback function has a parameter for the data that was retrieved from Kafka over the REST interface. The data object will be an array containing all of the events in the time between the last callback and the current time.

function bygametype(data) {
    // Do something with the data
}

As shown in the Spark section, this code is expecting data to be passed as JSON. All data is automatically coalesced and Base64 decoded for you. The JSON written out by the Spark analysis program should look like:

[{"gametype":"PaiGow","count":3,"sum":10}]

6.7 Running the Dashboard

When running the dashboard, you will need several services running.

1. Start the Kafka REST service.

$ sudo service kafka-rest start

2. Start the web server. This should be started from the root of the realtimedashboard directory. This web server serves up the files and, more importantly, is a proxy for the Kafka REST service. To learn more about why a proxy is needed, read about CORS.

$ ws --rewrite '/kafkarest/* ->

3. Finally, start your browser and go to the web server's address.
Unexpected value NaN Message

If you see this message in the console, you can usually ignore it:

Unexpected value NaN parsing x attribute.

This happens when a count is missing from the data.

6.8 Deploying to the Cloud

Once you have tested everything locally, you will need to deploy to the cloud. Before you do this, take the following steps:

1. Make sure that two people aren't using the same topic name. Please do the following things:
   - Prefix all topics with your name.
   - Make topic name prefixes a parameter that is passed in, instead of hard coded. This includes the CardProducer program.
2. Change the broker DNS name to be a parameter that is passed in, in all programs.
3. Use SCP to transfer your code (but not your binaries in the target directory!).
4. Build your code using Maven.
5. Start the programs with the correct topic names and broker DNS name.
6. Start your browser and go to the instance's DNS name and port.
7. Optionally, increase the volume of data for the CardProducer program to get more data going through the system. Do this by:
   - Changing the Thread.sleep(500); to be a parameter.
   - Decreasing the sleep amount to something in the 50 to 100 ms range.
More informationPolarion Enterprise Setup 17.2
SIEMENS Polarion Enterprise Setup 17.2 POL005 17.2 Contents Terminology......................................................... 1-1 Overview...........................................................
More informationHortonworks Data Platform
Hortonworks Data Platform Workflow Management (August 31, 2017) docs.hortonworks.com Hortonworks Data Platform: Workflow Management Copyright 2012-2017 Hortonworks, Inc. Some rights reserved. The Hortonworks
More informationTuning Enterprise Information Catalog Performance
Tuning Enterprise Information Catalog Performance Copyright Informatica LLC 2015, 2018. Informatica and the Informatica logo are trademarks or registered trademarks of Informatica LLC in the United States
More informationOverview. : Cloudera Data Analyst Training. Course Outline :: Cloudera Data Analyst Training::
Module Title Duration : Cloudera Data Analyst Training : 4 days Overview Take your knowledge to the next level Cloudera University s four-day data analyst training course will teach you to apply traditional
More informationHadoop Development Introduction
Hadoop Development Introduction What is Bigdata? Evolution of Bigdata Types of Data and their Significance Need for Bigdata Analytics Why Bigdata with Hadoop? History of Hadoop Why Hadoop is in demand
More informationProcessing of big data with Apache Spark
Processing of big data with Apache Spark JavaSkop 18 Aleksandar Donevski AGENDA What is Apache Spark? Spark vs Hadoop MapReduce Application Requirements Example Architecture Application Challenges 2 WHAT
More informationModern Data Warehouse The New Approach to Azure BI
Modern Data Warehouse The New Approach to Azure BI History On-Premise SQL Server Big Data Solutions Technical Barriers Modern Analytics Platform On-Premise SQL Server Big Data Solutions Modern Analytics
More informationBig Data. Big Data Analyst. Big Data Engineer. Big Data Architect
Big Data Big Data Analyst INTRODUCTION TO BIG DATA ANALYTICS ANALYTICS PROCESSING TECHNIQUES DATA TRANSFORMATION & BATCH PROCESSING REAL TIME (STREAM) DATA PROCESSING Big Data Engineer BIG DATA FOUNDATION
More informationScalable Tools - Part I Introduction to Scalable Tools
Scalable Tools - Part I Introduction to Scalable Tools Adisak Sukul, Ph.D., Lecturer, Department of Computer Science, adisak@iastate.edu http://web.cs.iastate.edu/~adisak/mbds2018/ Scalable Tools session
More informationKafka Connect FileSystem Connector Documentation
Kafka Connect FileSystem Connector Documentation Release 0.1 Mario Molina Dec 25, 2017 Contents 1 Contents 3 1.1 Connector................................................ 3 1.2 Configuration Options..........................................
More informationDelving Deep into Hadoop Course Contents Introduction to Hadoop and Architecture
Delving Deep into Hadoop Course Contents Introduction to Hadoop and Architecture Hadoop 1.0 Architecture Introduction to Hadoop & Big Data Hadoop Evolution Hadoop Architecture Networking Concepts Use cases
More informationDeployment Guide. 3.1 For Windows For Linux Docker image Windows Installation Installation...
TABLE OF CONTENTS 1 About Guide...1 2 System Requirements...2 3 Package...3 3.1 For Windows... 3 3.2 For Linux... 3 3.3 Docker image... 4 4 Windows Installation...5 4.1 Installation... 5 4.1.1 Install
More informationCloudera Manager Quick Start Guide
Cloudera Manager Guide Important Notice (c) 2010-2015 Cloudera, Inc. All rights reserved. Cloudera, the Cloudera logo, Cloudera Impala, and any other product or service names or slogans contained in this
More informationHOMEWORK 8. M. Neumann. Due: THU 1 NOV PM. Getting Started SUBMISSION INSTRUCTIONS
CSE427S HOMEWORK 8 M. Neumann Due: THU 1 NOV 2018 4PM Getting Started Update your SVN repository. When needed, you will find additional materials for homework x in the folder hwx. So, for the current assignment
More informationBuilding LinkedIn s Real-time Data Pipeline. Jay Kreps
Building LinkedIn s Real-time Data Pipeline Jay Kreps What is a data pipeline? What data is there? Database data Activity data Page Views, Ad Impressions, etc Messaging JMS, AMQP, etc Application and
More informationexam. Microsoft Perform Data Engineering on Microsoft Azure HDInsight. Version 1.0
70-775.exam Number: 70-775 Passing Score: 800 Time Limit: 120 min File Version: 1.0 Microsoft 70-775 Perform Data Engineering on Microsoft Azure HDInsight Version 1.0 Exam A QUESTION 1 You use YARN to
More informationHadoop Tutorial. General Instructions
CS246H: Mining Massive Datasets Hadoop Lab Winter 2018 Hadoop Tutorial General Instructions The purpose of this tutorial is to get you started with Hadoop. Completing the tutorial is optional. Here you
More informationInstructor : Dr. Sunnie Chung. Independent Study Spring Pentaho. 1 P a g e
ABSTRACT Pentaho Business Analytics from different data source, Analytics from csv/sql,create Star Schema Fact & Dimension Tables, kettle transformation for big data integration, MongoDB kettle Transformation,
More informationActivator Library. Focus on maximizing the value of your data, gain business insights, increase your team s productivity, and achieve success.
Focus on maximizing the value of your data, gain business insights, increase your team s productivity, and achieve success. ACTIVATORS Designed to give your team assistance when you need it most without
More informationHue Application for Big Data Ingestion
Hue Application for Big Data Ingestion August 2016 Author: Medina Bandić Supervisor(s): Antonio Romero Marin Manuel Martin Marquez CERN openlab Summer Student Report 2016 1 Abstract The purpose of project
More informationInformatica PowerExchange for Microsoft Azure Blob Storage 10.2 HotFix 1. User Guide
Informatica PowerExchange for Microsoft Azure Blob Storage 10.2 HotFix 1 User Guide Informatica PowerExchange for Microsoft Azure Blob Storage User Guide 10.2 HotFix 1 July 2018 Copyright Informatica LLC
More informationDeveloping a Web Server Platform with SAPI support for AJAX RPC using JSON
94 Developing a Web Server Platform with SAPI support for AJAX RPC using JSON Assist. Iulian ILIE-NEMEDI Informatics in Economy Department, Academy of Economic Studies, Bucharest Writing a custom web server
More informationLearning vrealize Orchestrator in action V M U G L A B
Learning vrealize Orchestrator in action V M U G L A B Lab Learning vrealize Orchestrator in action Code examples If you don t feel like typing the code you can download it from the webserver running on
More informationThe Hadoop Ecosystem. EECS 4415 Big Data Systems. Tilemachos Pechlivanoglou
The Hadoop Ecosystem EECS 4415 Big Data Systems Tilemachos Pechlivanoglou tipech@eecs.yorku.ca A lot of tools designed to work with Hadoop 2 HDFS, MapReduce Hadoop Distributed File System Core Hadoop component
More informationAzure Data Factory. Data Integration in the Cloud
Azure Data Factory Data Integration in the Cloud 2018 Microsoft Corporation. All rights reserved. This document is provided "as-is." Information and views expressed in this document, including URL and
More informationBlueMix Hands-On Workshop
BlueMix Hands-On Workshop Lab E - Using the Blu Big SQL application uemix MapReduce Service to build an IBM Version : 3.00 Last modification date : 05/ /11/2014 Owner : IBM Ecosystem Development Table
More informationFIRST STEPS WITH SOFIA2
FIRST STEPS WITH SOFIA2 DECEMBER 2014 Version 5 1 INDEX 1 INDEX... 2 2 INTRODUCTION... 3 2.1 REQUIREMENTS... 3 2.2 CURRENT DOCUMENT GOALS AND SCOPE... 3 3 STEPS TO FOLLOW... ERROR! MARCADOR NO DEFINIDO.
More informationOracle Cloud Using Oracle Big Data Manager. Release
Oracle Cloud Using Oracle Big Data Manager Release 18.2.5 E91848-08 June 2018 Oracle Cloud Using Oracle Big Data Manager, Release 18.2.5 E91848-08 Copyright 2018, 2018, Oracle and/or its affiliates. All
More informationOracle Cloud Using Oracle Big Data Manager. Release
Oracle Cloud Using Oracle Big Data Manager Release 18.2.1 E91848-07 April 2018 Oracle Cloud Using Oracle Big Data Manager, Release 18.2.1 E91848-07 Copyright 2018, 2018, Oracle and/or its affiliates. All
More informationInstalling and Configuring VMware Identity Manager Connector (Windows) OCT 2018 VMware Identity Manager VMware Identity Manager 3.
Installing and Configuring VMware Identity Manager Connector 2018.8.1.0 (Windows) OCT 2018 VMware Identity Manager VMware Identity Manager 3.3 You can find the most up-to-date technical documentation on
More informationData Access 3. Managing Apache Hive. Date of Publish:
3 Managing Apache Hive Date of Publish: 2018-07-12 http://docs.hortonworks.com Contents ACID operations... 3 Configure partitions for transactions...3 View transactions...3 View transaction locks... 4
More informationCannot Create Index On View 'test' Because
Cannot Create Index On View 'test' Because The View Is Not Schema Bound Cannot create index on view AdventureWorks2012.dbo.viewTestIndexedView because it uses a LEFT, RIGHT, or FULL OUTER join, and no
More informationApache Spark is a fast and general-purpose engine for large-scale data processing Spark aims at achieving the following goals in the Big data context
1 Apache Spark is a fast and general-purpose engine for large-scale data processing Spark aims at achieving the following goals in the Big data context Generality: diverse workloads, operators, job sizes
More informationDAITSS Demo Virtual Machine Quick Start Guide
DAITSS Demo Virtual Machine Quick Start Guide The following topics are covered in this document: A brief Glossary Downloading the DAITSS Demo Virtual Machine Starting up the DAITSS Demo Virtual Machine
More informationStorageTapper. Real-time MySQL Change Data Uber. Ovais Tariq, Shriniket Kale & Yevgeniy Firsov. October 03, 2017
StorageTapper Real-time MySQL Change Data Streaming @ Uber Ovais Tariq, Shriniket Kale & Yevgeniy Firsov October 03, 2017 Overview What we will cover today Background & Motivation High Level Features System
More informationSizing Guidelines and Performance Tuning for Intelligent Streaming
Sizing Guidelines and Performance Tuning for Intelligent Streaming Copyright Informatica LLC 2017. Informatica and the Informatica logo are trademarks or registered trademarks of Informatica LLC in the
More informationBig Data Technology Ecosystem. Mark Burnette Pentaho Director Sales Engineering, Hitachi Vantara
Big Data Technology Ecosystem Mark Burnette Pentaho Director Sales Engineering, Hitachi Vantara Agenda End-to-End Data Delivery Platform Ecosystem of Data Technologies Mapping an End-to-End Solution Case
More information<Partner Name> <Partner Product> RSA Ready Implementation Guide for. MapR Converged Data Platform 3.1
RSA Ready Implementation Guide for MapR Jeffrey Carlson, RSA Partner Engineering Last Modified: 02/25/2016 Solution Summary RSA Analytics Warehouse provides the capacity
More informationControl for CloudFlare - Installation and Preparations
Control for CloudFlare - Installation and Preparations Installation Backup your web directory and Magento 2 store database; Download Control for CloudFlare installation package; Copy files to /app/firebear/cloudflare/
More informationPolarion 18 Enterprise Setup
SIEMENS Polarion 18 Enterprise Setup POL005 18 Contents Terminology......................................................... 1-1 Overview........................................................... 2-1
More informationMicrosoft Azure Stream Analytics
Microsoft Azure Stream Analytics Marcos Roriz and Markus Endler Laboratory for Advanced Collaboration (LAC) Departamento de Informática (DI) Pontifícia Universidade Católica do Rio de Janeiro (PUC-Rio)
More informationMicrosoft. Exam Questions Perform Data Engineering on Microsoft Azure HDInsight (beta) Version:Demo
Microsoft Exam Questions 70-775 Perform Data Engineering on Microsoft Azure HDInsight (beta) Version:Demo NEW QUESTION 1 HOTSPOT You install the Microsoft Hive ODBC Driver on a computer that runs Windows
More informationOpenStack Havana All-in-One lab on VMware Workstation
OpenStack Havana All-in-One lab on VMware Workstation With all of the popularity of OpenStack in general, and specifically with my other posts on deploying the Rackspace Private Cloud lab on VMware Workstation,
More informationIBM Data Science Experience White paper. SparkR. Transforming R into a tool for big data analytics
IBM Data Science Experience White paper R Transforming R into a tool for big data analytics 2 R Executive summary This white paper introduces R, a package for the R statistical programming language that
More informationBig Data Hadoop Course Content
Big Data Hadoop Course Content Topics covered in the training Introduction to Linux and Big Data Virtual Machine ( VM) Introduction/ Installation of VirtualBox and the Big Data VM Introduction to Linux
More informationIBM Image-Analysis Node.js
IBM Image-Analysis Node.js Cognitive Solutions Application Development IBM Global Business Partners Duration: 90 minutes Updated: Feb 14, 2018 Klaus-Peter Schlotter kps@de.ibm.com Version 1 Overview The
More informationIBM Fluid Query for PureData Analytics. - Sanjit Chakraborty
IBM Fluid Query for PureData Analytics - Sanjit Chakraborty Introduction IBM Fluid Query is the capability that unifies data access across the Logical Data Warehouse. Users and analytic applications need
More information