10/24/2017 Sangmi Lee Pallickara Week 10- A. CS535 Big Data Fall 2017 Colorado State University

Size: px

Start display at page:

Download "10/24/2017 Sangmi Lee Pallickara Week 10- A. CS535 Big Data Fall 2017 Colorado State University"

Cory Roberts
5 years ago
Views:

CS535 Big Data - Fall 2017 Week 10-A-1 CS535 BIG DATA FAQs Term project proposal Feedback for the most of submissions are available PA2 has been posted (11/6) PART 2.

1 CS535 Big Data - Fall 2017 Week 10-A-1 CS535 BIG DATA FAQs Term project proposal Feedback for the most of submissions are available PA2 has been posted (11/6) PART 2. SCALABLE FRAMEWORKS FOR REAL-TIME BIG DATA ANALYTICS 1. SPEED LAYER: APACHE STORM Computer Science, Colorado State University CS535 Big Data - Fall 2017 Week 10-A-2 CS535 Big Data - Fall 2017 Week 10-A-3 Today s topics Speed Layer Apache Storm Word count example Parallelism Storm model Architecture CS535 Big Data - Fall 2017 Week 10-A-4 CS535 Big Data - Fall 2017 Week 10-A-5 This material is built based on: Nathan Marz and James Warren, Big Data, Principles and Best Practices of Scalable Real- Time Data System, 2015, Manning Publications, ISBN Toshniwal, Ankit and Taneja, Siddarth and Shukla, Amit and Ramasamy, Karthik and Patel, Jignesh M. and Kulkarni, Sanjeev and Jackson, Jason and Gade, Krishna and Fu, Maosong and Donham, Jake and Bhagat, Nikunj and Mittal, Sailesh and Ryaboy, Dmitriy, Storm@twitter, Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, SIGMOD June 22-27, 2014, Snowbird, Utah Speed Layer P. Taylor Goetz, and Brian O Neill, Storm Blueprints: Patterns for Distributed Real-time Computation Packt Publishing (March 26, 2014) Apache s Storm Open source project 1

2 CS535 Big Data - Fall 2017 Week 10-A-6 CS535 Big Data - Fall 2017 Week 10-A-7 Where are we in the Lambda Architecture? We have focused on batch computing in the Lambda Architecture Computing framework Scalable algorithms Low-latency update Jobs of the speed layer Incremental computation Lambda architecture Realtime Realtime Realtime New Data Batch Batch layer Master dataset Serving layer Batch Batch Query: How many? CS535 Big Data - Fall 2017 Week 10-A-8 Communication between speed layer and batch layer Assume that the first batch layer will run with empty dataset (or, Wait) First 10 minutes Batch layer Data is processed in the speed layer 10 minutes Second run of the batch layer immediately commences to process 10 minutes of data that accumulated during the first run Assume that the second run takes 15 minutes After this run, the serving layer will represent the first 10 minutes of data The first 10 minutes can now be removed from the speed layer CS535 Big Data - Fall 2017 Week 10-A-9 Communication between speed layer and batch layer Previous slide 18 minutes Real-time data 30 next 0-10 mins 10 minutes Deleted Batch layer/ generation of serving 15 minutes Real-time data 10 minutes Serving layer Batch layer The third run of the batch layer takes 18 minutes for data that arrived between 10 ~ 20 minutes next 15 minutes Real-time data 10 minutes Batch layer/ generation of serving CS535 Big Data - Fall 2017 Week 10-A-10 CS535 Big Data - Fall 2017 Week 10-A-11 Queuing A system without persistent queuing Events would be handed directly to workers s processes each event independently Fire-and-forget Queuing and Stream processing Client Hand off and continue Can this scheme guarantee that all the data is successfully processed? No Why? 2

CS535 Big Data - Fall 2017 Week 10-A-12 Simple with poll() and peek() If a worker dies before completing its assigned task Is there any mechanism to detect or correct the error?

No Interface { void add(object item); Object poll(); Object peek(); Remove the item from the head of the queue Adds new item to the queue Inspects the item at the head of the queue without removing

an identifier and a binary payload Acknowledges successful processing Reports a failure When you read an event from the queue The event is not immediately removed The item is returned by the get()

3 CS535 Big Data - Fall 2017 Week 10-A-12 Simple with poll() and peek() If a worker dies before completing its assigned task Is there any mechanism to detect or correct the error? No If there is bursty traffic that exceeds the capacity of the processing cluster Is there any mechanism to process all of the events? No Interface { void add(object item); Object poll(); Object peek(); Remove the item from the head of the queue Adds new item to the queue Inspects the item at the head of the queue without removing it CS535 Big Data - Fall 2017 Week 10-A-13 Single consumer queue servers struct Item{ long id; byte[] item; Interface { Item get(); void ack(long id); void fail(long id); A generic Item consists of an identifier and a binary payload Acknowledges successful processing Reports a failure When you read an event from the queue The event is not immediately removed The item is returned by the get() function Only when an event is acked, will it be removed from the queue For failed retrieval, another client can retrieve via separate get() function The data is processed at least once CS535 Big Data - Fall 2017 Week 10-A-14 CS535 Big Data - Fall 2017 Week 10-A-15 Multiple application with a single queue (1/2) What if multiple applications want to consume the same stream? Approach 1. Wrap all the applications within the same consumer Consumer A Multiple application with a single queue (2/2) Approach 2: Maintaining a separate queue for each consumer application If you have three applications There are three separate copies of the queue on the queue server This increases the load on the queue server B Data cannot be processed independently C CS535 Big Data - Fall 2017 Week 10-A-16 Multi-consumer queues The applications track the consumed/unconsumed status of events from the queue Servers Guarantee that a certain amount of the stream is available E.g. all events from the past 12 hours or the last 50GB of events Multi-consumer queue Send 3 items starting from position 4 Send 2 items starting from position 2 A B CS535 Big Data - Fall 2017 Week 10-A-17 Stream processing Stream Processor One-at-a-time Processes streams with lower latency than micro-batched s-and-workers model Micro-batched Small batches of tuples are processed at one time One-at-a-time Lower latency Yes No Higher throughout No Yes At-lease-once semantics Yes Yes Exactly-once semantics In some cases Yes Simpler programming model Yes No Micro-batched Realtime s 3

4 CS535 Big Data - Fall 2017 Week 10-A-18 s-and-workers model (1/2) CS535 Big Data - Fall 2017 Week 10-A-19 s-and-workers model (2/2) Common approach to achieve one-at-a-time stream processing Divides processing pipeline into worker processes Places queues between them If a worker fails (or restarts) it can continue where it left off by reading from its queue If you count number of clicks for a set of Web pages s for the Click streams Filters Valid Click info queues Cassandra workers Storage CS535 Big Data - Fall 2017 Week 10-A-20 CS535 Big Data - Fall 2017 Week 10-A-21 s-and-workers pitfalls Race conditions Multiple workers should not attempt to update the count of page click of the same page at the same time To avoid this, partitioning over the entire set of URLs should be spread among the queues but any event(click) for the same URL should be delivered to the same queue Operational burden If you need to change the topology of your processing, intermediate queues should be cleared before you redeploy s add latency s decrease the throughput Managing overhead Storm model Complex to implement CS535 Big Data - Fall 2017 Week 10-A-22 Storm Model One-at-a-time stream processing Represents the entire stream processing pipeline as a graph of computation called a topology A single program is deployed across a cluster CS535 Big Data - Fall 2017 Week 10-A-23 Spout in the Storm model Spout A source of streams in a topology A spout can read from a Kestrel or Kafka queue Turns the data into a tuple stream Timer spout could emit a tuple into its output stream every 10 seconds A stream is represented an infinite sequence of tuples A tuple: a named list of values Tuple Tuple Tuple Tuple Tuple Tuple Data stream 4

CS535 Big Data - Fall 2017 Week 10-A-24 CS535 Big Data - Fall 2017 Week 10-A-25 Bolt in the Storm model Bolt Performs actions on streams Takes any number of streams as input and produces any number

5 CS535 Big Data - Fall 2017 Week 10-A-24 CS535 Big Data - Fall 2017 Week 10-A-25 Bolt in the Storm model Bolt Performs actions on streams Takes any number of streams as input and produces any number of streams as output Runs functions, filters data, computes aggregations, does streaming joins, updates database, etc. Topology in the Storm model Topology A network of spouts and bolts with each edge representing a bolt that processes the output stream of another spout or bolt Task Each instance of a spout or bolt Tuple Tuple Data stream CS535 Big Data - Fall 2017 Week 10-A-26 CS535 Big Data - Fall 2017 Week 10-A-27 Storm Scalability Nodes should be added or removed from the Storm cluster without disrupting existing data flows (standing query) Resiliency During hardware failures, existing topologies must continue processing with minimal performance impact Extensibility External functions should be compatible Efficiency Good performance characteristics must be provided for realtime applications Easy to Administer Failure or performance issues should be addressed immediately Word Count Example CS535 Big Data - Fall 2017 Week 10-A-28 CS535 Big Data - Fall 2017 Week 10-A-29 Word count topology: Sentence Spout Word count topology: Split Sentence Sentence Spout Split Sentence Bolt Word Count Bolt Report Bolt Sentence Spout Split Sentence Bolt Word Count Bolt Report Bolt Sentence spout Emits a stream of single-value tuples continuously with the key name sentence and a string value { sentence : my dog has fleas Split Sentence Bolt Subscribes to the sentence spout s tuple stream { word : my { word : dog { word : has { word : fleas 5

CS535 Big Data - Fall 2017 Week 10-A-30 Word count topology: Word Count CS535 Big Data - Fall 2017 Week 10-A-31 Word count topology: Report Sentence Spout Split Sentence Bolt Word Count Bolt Report

6 CS535 Big Data - Fall 2017 Week 10-A-30 Word count topology: Word Count CS535 Big Data - Fall 2017 Week 10-A-31 Word count topology: Report Sentence Spout Split Sentence Bolt Word Count Bolt Report Bolt Sentence Spout Split Sentence Bolt Word Count Bolt Report Bolt Word count bolt Subscribes to the output of the SplitSentenceBolt class Keeps a count of how many times it has seen a particular word Whenever it receives a tuple, it will increment the counter and emit { word : dog, count :5 Report bolt Subscribes to the output of the WordCountBolt class Keeps a count of how many times it has seen a particular word Whenever it receives a tuple, it will update the table and print the contents to the console CS535 Big Data - Fall 2017 Week 10-A-32 SentenceSpout.java CS535 Big Data - Fall 2017 Week 10-A-33 SentenceSpout.java: Continued public class SentenceSpout extends BaseRichSpout { private SpoutOutputCollector collector; private String[] sentences = { "my dog has fleas", "i like cold beverages", "the dog ate my homework", "don't have a truck", "i don't think i like fleas" ; private int index = 0; public void declareoutputfields(outputfieldsdeclarer declarer) { declarer.declare(new Fields( sentence")); public void open(map config, TopologyContext context, SpoutOutputCollector collector) { this.collector = collector; public void nexttuple() { this.collector.emit(new Values(sentences[index])); index + +; if (index > = sentences.length) { index = 0; Utils.waitForMillis(1); CS535 Big Data - Fall 2017 Week 10-A-34 SplitSentenceBolt.java public class SplitSentenceBolt extends BaseRichBolt{ private OutputCollector collector; public void prepare( Map config, TopologyContext context, OutputCollector collector) { this.collector = collector; public void execute(tuple tuple) { String sentence = tuple.getstringbyfield( sentence"); String[] words = sentence.split(" "); for(string word : words) { this.collector.emit(new Values(word)); public void declareoutputfields(outputfieldsdeclarer declarer) { declarer.declare(new Fields( word")); CS535 Big Data - Fall 2017 Week 10-A-35 WordCountBolt.java public class WordCountBolt extends BaseRichBolt{ private OutputCollector collector; private HashMap < String, Long > counts = null; public void prepare( Map config, TopologyContext context, OutputCollector collector) { this.collector = collector; this.counts = new HashMap < String, Long >(); public void execute( Tuple tuple) { String word = tuple.getstringbyfield( word"); Long count = this.counts.get(word); if(count = = null){ count = 0; count + +; this.counts.put( word, count); this.collector.emit(new Values(word, count)); public void declareoutputfields( OutputFieldsDeclarer declarer) { declarer.declare(new Fields( word", "count")); 6

7 CS535 Big Data - Fall 2017 Week 10-A-36 ReportBolt.java public class ReportBolt extends BaseRichBolt { private HashMap < String, Long > counts = null; public void prepare( Map config, TopologyContext context, OutputCollector collector) { this.counts = new HashMap < String, Long >(); public void execute( Tuple tuple) { String word = tuple.getstringbyfield( word"); Long count = tuple.getlongbyfield( count"); this.counts.put(word, count); public void declareoutputfields( OutputFieldsDeclarer declarer) { // this bolt does not emit anything public void cleanup() { System.out.println("--- FINAL COUNTS ---"); List < String > keys = new ArrayList < String >(); keys.addall( this.counts.keyset()); Collections.sort( keys); for (String key : keys) { System.out.println( key + " : " + this.counts.get( key)); System.out.println(" "); CS535 Big Data - Fall 2017 Week 10-A-37 WordCountTopology.java public class WordCountTopology { private static final String SENTENCE_SPOUT_ID = "sentencespout"; private static final String SPLIT_BOLT_ID = "split-bolt"; private static final String COUNT_BOLT_ID = "count-bolt"; private static final String REPORT_BOLT_ID = "report-bolt"; private static final String TOPOLOGY_NAME = "word-counttopology"; public static void main( String[] args) throws Exception { SentenceSpout spout = new SentenceSpout(); SplitSentenceBolt splitbolt = new SplitSentenceBolt(); WordCountBolt countbolt = new WordCountBolt(); ReportBolt reportbolt = new ReportBolt(); TopologyBuilder builder = new TopologyBuilder(); builder.setspout(sentence_spout_id, ; // SentenceSpout --> SplitSentenceBolt builder.setbolt(split_bolt_id, splitbolt).shufflegrouping(sentence_spout_id); CS535 Big Data - Fall 2017 Week 10-A-38 CS535 Big Data - Fall 2017 Week 10-A-39 WordCountTopology.java Continued Results // SplitSentenceBolt--> WordCountBolt builder.setbolt(count_bolt_id, countbolt).fieldsgrouping(split_bolt_id, new Fields( word")); // WordCountBolt --> ReportBolt builder.setbolt(report_bolt_id, reportbolt).globalgrouping(count_bolt_id); Config config = new Config(); LocalCluster cluster = new LocalCluster(); cluster.submittopology(topology_name, config, builder.createtopology()); waitforseconds(10); cluster.killtopology(topology_name); cluster.shutdown(); --- FINAL COUNTS --- a : 1426 ate : 1426 beverages : 1426 cold : 1426 cow : 1426 dog : 2852 don't : 2851 fleas : 2851 has : 1426 have : 1426 homework : 1426 i : 4276 like : 2851 man : 1426 my : 2852 the : 1426 think : CS535 Big Data - Fall 2017 Week 10-A-40 CS535 Big Data - Fall 2017 Week 10-A-41 Components of the Storm cluster Parallelism in Storm Nodes (machines) Executes portions of a topology s (JVMs) Independent JVM processes running on a node Each node is configured to run one or more workers A topology may request one or more workers to be assigned to it s (threads) Java threads running within a worker JVM process Multiple tasks can be assigned to a single executor Unless explicitly overridden, Storm will assign one task to each executor Tasks (bolt/ spout instances) Instances of spouts and bolts whose nexttuple() and execute() methods are called by executor threads 7

CS535 Big Data - Fall 2017 Week 10-A-42 Parallelism in the WordCount topology In our example, we have NOT used any of Storm s parallelism Default setting is a factor of one Topology execution flow

8 CS535 Big Data - Fall 2017 Week 10-A-42 Parallelism in the WordCount topology In our example, we have NOT used any of Storm s parallelism Default setting is a factor of one Topology execution flow Node (JVM) CS535 Big Data - Fall 2017 Week 10-A-43 Adding workers to a topology Through configuration Through APIs Passing Config object to the submittopology() method Bolts and spouts do not have to change Config config = new Config(); Config.setNums(2); Task (Report Bolt) CS535 Big Data - Fall 2017 Week 10-A-44 CS535 Big Data - Fall 2017 Week 10-A-45 Adding executors and tasks Specify the number of executors when defining a stream grouping builder.setspout(sentence_spout_id, spout, 2); Assigns two tasks and each task is assigned its own executor thread Two spout tasks (if we are using one worker) Node (JVM) Task (Report Bolt) CS535 Big Data - Fall 2017 Week 10-A-46 In SplitSentenceBolt and WordCountBolt, Set up the split sentence bolt to execute as 4 tasks and 2 executors Each executor thread will be assigned two tasks to execute builder.setbolt(split_bolt_id, splitbolt, 2).setNumTasks(4).shuffleGrouping(SENTENCE_SPOUT_ID); Set up the Word count bolt to execute as 4 tasks each with its own executor thread builder.setbolt(count_bolt_id, countbolt, 4).fieldsGrouping(SPLIT_BOLT_ID, new Fields( word )); CS535 Big Data - Fall 2017 Week 10-A-47 With two workers Node (JVM) Task (Report Bolt) (JVM) 8

9 CS535 Big Data - Fall 2017 Week 10-A-48 CS535 Big Data - Fall 2017 Week 10-A-49 What will be the results with given parallelism? --- FINAL COUNTS --- a : 1426 ate : 1426 beverages : 1426 cold : 1426 cow : 1426 dog : 2852 don't : 2851 fleas : 2851 has : 1426 have : 1426 homework : 1426 i : 4276 like : 2851 man : 1426 my : 2852 the : 1426 think : Increased counts --- FINAL COUNTS --- a : 2726 ate : 2722 beverages : 2723 cold : 2723 cow : 2726 dog : 5445 don't : 5444 fleas : 5451 has : 2723 have : 2722 homework : 2722 i : 8175 like : 5449 man : 2722 my : 5445 the : 2727 think : Spout emits data indefinitely Stops when the topology is killed Having multiple workers has no effect when running a topology in local mode Only task and executor parallelism settings have effect A topology running in local mode always runs within a single JVM process Use your application in a cluster for true parallelism 9

10/26/2017 Sangmi Lee Pallickara Week 10- B. CS535 Big Data Fall 2017 Colorado State University

10/26/2017 Sangmi Lee Pallickara Week 10- B. CS535 Big Data Fall 2017 Colorado State University CS535 Big Data - Fall 2017 Week 10-A-1 CS535 BIG DATA FAQs Term project proposal Feedback for the most of submissions are available PA2 has been posted (11/6) PART 2. SCALABLE FRAMEWORKS FOR REAL-TIME