/ Cloud Computing. Recitation 15 December 6 th 2016

Size: px

Start display at page:

Download "/ Cloud Computing. Recitation 15 December 6 th 2016"

Homer O’Neal’
5 years ago
Views:

1 / Cloud Computing Recitation 15 December 6 th 2016

2 Overview Last week s reflection Team project phase 3 Quiz 12 This week s schedule Phase3 report Deadline TODAY 12/6 Project 4.3 Deadline FRIDAY 12/9 Course survey (2% bonus!) Deadline Saturday 12/10

3 Module to Read Completed! Module 21: Distributed Analytics Engines for the Cloud: GraphLab Module 22: Message Queues and Stream Processing You can use Module 22 for P4.3: Module 22: Message Queues and Stream Processing

4 Project 4 Project 4.1 MapReduce Programming Using YARN Project 4.2 Iterative Programming Using Apache Spark Project 4.3 Stream Processing using Kafka & Samza

5 Stream vs Batch Processing Batch processing Data parallel, graph parallel Iterative, non-iterative Runs once in few hours/days Historical data analysis Unsuited for real time events streams Stream processing Process events as they come Real time decision making Sensor streams/ web event data

6 Typical batch processing job Input is collected into batches and processing is performed on the input data Output is consumed later at any point of time - the data does not lose much of its value with time Input HDFS Hadoop/Hive HDFS Output

7 Typical stream processing job Data is processed immediately (few seconds) The processed data is used by downstream consumers for real time decision/analytics immediately Stream 1 Stream 2 Stream processing job Stream consumer Stream 3...

8 Typical stream processing components An event producer - Sensors, web logs, web events A messaging service - Kafka, RabbitMQ, ActiveMQ A stream processing framework - Samza, Spark Streaming, Storm Sensor data Web logs Web events Messaging service (Kafka, RabbitMQ, ActiveMQ) Stream processing framework (Samza, Spark Streaming, Storm) Messaging service (Kafka, RabbitMQ, ActiveMQ)

9 Apache Kafka Developed at LinkedIn as a distributed messaging system.

10 Semantic partitioning in Kafka Each topic (stream) is partitioned for scalability across all nodes in the Kafka cluster Default partitioning attempts to load balance Streams can also be partitioned semantically by user - key of the message All messages with the same key come to the same partition

11 Apache Samza Stream processing framework developed at LinkedIn Consists of 3 layers: streaming, execution and processing (Samza) layer Most common use: Kafka for streaming, YARN for execution

12 Apache Samza Programmer uses Samza API to perform stream processing Semantic partitioning in Kafka => streaming MapReduce Each partition in Kafka is assigned to a single Samza task instance

13 Stateful stream processing in Apache Samza Calculate sum, avg, count etc. State in remote data store? - slow State in local memory? - machine might crash Solution - persistent KV store provided by Samza Changes to KV store persisted to a different stream (usually Kafka) - replay on failure RocksDB currently supported as a persistent KV store You MUST use a persistent KV store for P4.3!

14 Putting Kafka and Samza Together

15 Project 4.3-Three Tasks Use Kafka to produce streams and use Samza to join the streams and output client-driver match like Uber. Task1 Tracefile Kafka API Task2 Samza API Client-driver match Bonus Task Surge price

16 Project 4.3-Three Tasks Simulate the scenario that the drivers update their locations on a regular basis as they move in the city and the clients request rides at some time. Data Tracefile -> Two streams Type: DRIVER_LOCATION -> driver_locations stream LEAVING_BLOCK, ENTERING_BLOCK, RIDE_REQUEST, RIDE_COMPLETE -> events stream

17 Task1 Project Three Tasks

18 Project Three Tasks Task2 match_score = distance_score * gender_score * rating_score * salary_score * 0.2 Kafka Consumer Visualization

19 Project Three Tasks Debugging (IMPORTANT!) Use the YARN UI Yarn application commands yarn application -list YARN container logs on the machine where the YARN container is running. Output a kafka stream for debugging Read the debugging section in the writeup carefully! Include error message when you post on Piazza!

20 Project 4.3 Bonus task - implement dynamic (aka surge) pricing Same streams but different state and different calculation required Please make sure to use a persistent KV store in the bonus task as well Be careful when you move drivers around blocks! The bonus grader is more sensitive to sloppy state management

21 P4.3 Grading Skeleton code also provides the submitters Follow the instructions in the submitter Prompts for starting the Kafka Producer and Samza job We will look for usage of KV stores and reasonably efficient code Do not iterate through ALL drivers to find the best match!

22 twitter DATA ANALYTICS: TEAM PROJECT Please come to Thursday s Recitation to learn more! 22

23 Upcoming Deadlines Phase 3 report Due: TODAY 12/06/ :59 PM Pittsburgh Project 4.3 : Stream Processing with Kafka/Samza Due: FRIDAY 12/09/ :59 PM Pittsburgh Apply for S17 TA job, there is still time Complete the course survey (announced on Piazza) 2% bonus for the overall course grade (Don t miss it!!!) Cupcake Party (GHC 4307 Pittsburgh and SV 211) Thursday 12/08/2016 4:30 PM ET Pittsburgh, 1:30 PM PT SV

24 Questions? 24

Blended Learning Outline: Developer Training for Apache Spark and Hadoop (180404a)

Blended Learning Outline: Developer Training for Apache Spark and Hadoop (180404a) Cloudera s Developer Training for Apache Spark and Hadoop delivers the key concepts and expertise need to develop high-performance