Streaming vs. batch processing


COSC 6339 Big Data Analytics
Introduction to Spark (III)
2nd homework assignment
Edgar Gabriel
Fall 2018

Streaming vs. batch processing
- Batch processing:
  - Execution of a compute job without manual intervention (non-interactive)
  - Best suited for solving problems that are based on large, but static data
- Streaming:
  - Continuous execution of an application to analyze an incoming stream of data
  - Best suited for instant analysis of data, often with real-time constraints (low-latency processing)
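The difference between the two models can be illustrated with a small plain-Python sketch (no Spark involved). The function names and the sample readings are hypothetical and only serve the illustration: the batch version needs the complete data set before it produces one result, while the streaming version emits an updated result for every arriving item.

```python
from typing import Iterable, Iterator, List


def batch_mean(values: List[float]) -> float:
    """Batch style: the full, static data set is available up front."""
    return sum(values) / len(values)


def streaming_mean(stream: Iterable[float]) -> Iterator[float]:
    """Streaming style: emit an updated result after every incoming item."""
    count = 0
    total = 0.0
    for value in stream:
        count += 1
        total += value
        yield total / count


readings = [4.0, 8.0, 6.0, 2.0]        # hypothetical sample data
print(batch_mean(readings))            # one result after all data is read
print(list(streaming_mean(readings)))  # one result per arriving item
```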

Streaming reliability models
Every data item can be analyzed:
- At most once
  - Messages may be lost and never delivered
- At least once
  - Messages will never be lost but could be redelivered
- Exactly once
  - Messages are never lost
  - Messages are never redelivered
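A minimal plain-Python sketch of why redelivery matters, assuming every message carries a unique id (the ids, function names, and sample payloads are hypothetical and not part of the slides). With at-least-once delivery a naive consumer counts a redelivered message twice; tracking the ids that were already processed makes redelivery harmless, which is one common way to obtain exactly-once results on top of an at-least-once source.

```python
from typing import Dict, List, Set, Tuple

Message = Tuple[int, str]  # (message id, payload); the id is an assumption


def count_at_least_once(deliveries: List[Message]) -> Dict[str, int]:
    """Count payloads naively; redelivered messages are counted again."""
    counts: Dict[str, int] = {}
    for _msg_id, payload in deliveries:
        counts[payload] = counts.get(payload, 0) + 1
    return counts


def count_deduplicated(deliveries: List[Message]) -> Dict[str, int]:
    """Skip ids that were already processed, so redelivery has no effect."""
    seen: Set[int] = set()
    counts: Dict[str, int] = {}
    for msg_id, payload in deliveries:
        if msg_id in seen:
            continue
        seen.add(msg_id)
        counts[payload] = counts.get(payload, 0) + 1
    return counts


# Message 2 arrives twice, as an at-least-once source may do after a retry.
deliveries = [(1, "spark"), (2, "flume"), (2, "flume"), (3, "spark")]
print(count_at_least_once(deliveries))
print(count_deduplicated(deliveries))
```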


Spark streaming
- Extension of the core Spark API to enable streaming applications
- Runs a streaming computation as a series of very small batches (micro-batches)
- Supports exactly-once semantics

Spark Streaming
- Discretized Streams (DStream)
  - Core Spark streaming abstraction
  - Based on micro-batches of RDDs
- Input DStreams
  - Represent the stream of raw data received from the streaming source
  - Data can come from many sources (e.g. TCP sockets, Twitter, Flume, ...)
- Operations
  - Same as on regular Spark RDDs
  - Some additional transformations (e.g. window-based transformations)

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="NetworkWordCount")

    # Create a local StreamingContext with a batch interval of 2 seconds
    ssc = StreamingContext(sc, 2)

    # Create a DStream that will connect to hostname:port
    lines = ssc.socketTextStream("localhost", 9999)

    # Split each line into words, then count each word within the batch
    words = lines.flatMap(lambda line: line.split(" "))
    pairs = words.map(lambda word: (word, 1))
    wordCounts = pairs.reduceByKey(lambda x, y: x + y)

    # Print the first elements of each RDD generated in this DStream
    wordCounts.pprint()

    ssc.start()             # Start the computation
    ssc.awaitTermination()  # Wait for the computation to terminate

Spark Streaming Windows
- Windowed computations: apply transformations over a sliding window of data
- Any window operation needs to specify two parameters:
  - window length: duration of the window
  - sliding interval: interval at which the window operation is performed
- reduceByWindow(func, windowLength, slideInterval)
  - Returns a new single-element stream, created by aggregating elements in the stream over a sliding interval using func
- reduceByKeyAndWindow(func, windowLength, slideInterval, [numTasks])
  - When called on a DStream of (K, V) pairs, returns a new DStream of (K, V) pairs where the values for each key are aggregated using the given reduce function func over batches in a sliding window
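The interplay of window length and sliding interval can be sketched in plain Python without a Spark installation. This is only a simulation under stated assumptions: both parameters are expressed as a number of micro-batches (Spark expects durations that are multiples of the batch interval), and the function name and sample batches are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def windowed_counts(batches: List[List[str]], window_length: int,
                    slide_interval: int) -> List[Dict[str, int]]:
    """Aggregate word counts over a sliding window of micro-batches.

    window_length and slide_interval are given in number of batches.
    """
    results = []
    # A window is evaluated once every slide_interval batches.
    for end in range(slide_interval, len(batches) + 1, slide_interval):
        start = max(0, end - window_length)
        window = Counter()
        for batch in batches[start:end]:
            window.update(batch)
        results.append(dict(window))
    return results


# Four micro-batches of words (hypothetical data).
batches = [["a", "b"], ["a"], ["b", "b"], ["c"]]
print(windowed_counts(batches, window_length=3, slide_interval=2))
```

With a window length of 3 and a sliding interval of 2, the second window overlaps the first by one batch, so the words of that batch contribute to both results.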

Streaming and Checkpointing
- A streaming application must be resilient to failures unrelated to the application logic (e.g., system failures, JVM crashes, etc.)
- Need to checkpoint enough information to a reliable storage system to recover from failures
- Metadata checkpointing: saving of the information defining the streaming computation. Metadata includes:
  - Configuration: the configuration that was used to create the streaming application
  - DStream operations: the set of DStream operations that define the streaming application
  - Incomplete batches: batches whose jobs are queued but have not completed yet
- Data checkpointing: saving of the generated RDDs to reliable storage. This is necessary for stateful transformations that combine data across multiple batches.

Spark Streaming
Other considerations:
- Executors must be configured with sufficient memory to hold the received data; the amount depends on the time interval!
- Configure automatic restart of the application driver to recover from a driver failure
- Write-ahead logs: all data received from a receiver can be written into a write-ahead log in the configuration checkpoint directory. This prevents data loss on driver recovery
- Setting the max receiving rate: receivers can be rate limited by setting a maximum rate limit in terms of records / sec
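Why stateful transformations need data checkpointing can be shown with a small plain-Python simulation. It is a sketch under assumptions, not Spark's mechanism: a local JSON file stands in for the reliable storage system, and the function names and sample batches are hypothetical. The running word counts combine data across batches, so they are written out after every batch and reloaded after a simulated crash.

```python
import json
import os
import tempfile
from typing import Dict, List


def load_state(path: str) -> Dict[str, int]:
    """Restore the running counts from the checkpoint, if one exists."""
    if os.path.exists(path):
        with open(path) as handle:
            return json.load(handle)
    return {}


def process_batch(state: Dict[str, int], batch: List[str], path: str) -> None:
    """Fold one micro-batch into the state, then checkpoint the new state."""
    for word in batch:
        state[word] = state.get(word, 0) + 1
    with open(path, "w") as handle:
        json.dump(state, handle)


# A temporary local file stands in for reliable storage such as HDFS.
checkpoint_path = os.path.join(tempfile.mkdtemp(), "state.json")

state = load_state(checkpoint_path)  # empty on the first start
process_batch(state, ["a", "b"], checkpoint_path)
process_batch(state, ["a"], checkpoint_path)

# Simulated crash: the in-memory state is lost, the checkpoint is not.
recovered = load_state(checkpoint_path)
process_batch(recovered, ["b", "c"], checkpoint_path)
print(recovered)
```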

Data Source Reliability
- A data source is considered unreliable if there is no means to replay a previously received message
- A data source is considered reliable if it can somehow replay a message if processing fails at any point in time
- A data source is considered durable if it can replay any message or set of messages given a selection criterion

2nd Homework
Rules
- Each student should deliver
  - Source code (.py files), compressed to a zip or tar.gz file
  - Source code has to use Python and Spark
  - Documentation (.pdf, .doc, or .txt file)
    - explanations of the code
    - answers to questions
- Deliver electronically on Blackboard
- Expected by Sunday, October 21, 11:59 pm
- In case of questions: ask, ask, ask!


Duplicate detection
- Discovery of multiple representations of the same real-world object
- Problems:
  - Representations are not identical (fuzzy duplicates)
  - Data sets are large: quadratic complexity if comparing every pair of records
- Similarity measures: domain-dependent vs. domain-independent solutions
- Avoid comparisons by partitioning
- Parallel computing
Slide based on a lecture by Felix Naumann (University of Potsdam)

Origins of duplicates
Slide based on a lecture by Felix Naumann (University of Potsdam)


Ironically, Duplicate Detection has many Duplicates
Slide based on a lecture by Felix Naumann (University of Potsdam)

Part 1:
a. Write PySpark code to determine the 1,000 most popular words in the document collection provided. Please ensure that your code removes any special symbols and converts everything to lower case (and, if possible, removes stop words, applies stemming, etc.). Stop word removal can be done either by using NLTK or by creating your own list of words to be removed (e.g. and, or, the, it, ...)
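The cleaning and counting steps of Part 1 can be sketched in plain Python before moving them to PySpark. This is only an illustration under assumptions: the stop list contains just the four example words from the assignment text, the regular expression treats everything except the letters a to z as a special symbol (so "it's" is split into "it" and "s"), and the function names and sample lines are hypothetical.

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple

# Hand-written stop list, taken from the examples in the assignment text.
STOP_WORDS = {"and", "or", "the", "it"}


def tokenize(line: str) -> List[str]:
    """Lower-case a line, drop non-letter symbols, and remove stop words."""
    words = re.findall(r"[a-z]+", line.lower())
    return [w for w in words if w not in STOP_WORDS]


def top_words(lines: Iterable[str], n: int) -> List[Tuple[str, int]]:
    """Return the n most frequent words together with their counts."""
    counts = Counter()
    for line in lines:
        counts.update(tokenize(line))
    return counts.most_common(n)


sample = ["The cat and the hat!", "It's a CAT, or is it?"]  # hypothetical input
print(top_words(sample, 2))
```

In a PySpark version, a function like tokenize could be passed to flatMap, followed by the map and reduceByKey steps from the word count example.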

Part 2:
a. Write PySpark code to create an inverted index for the 1,000 words determined in Part 1. The inverted index is supposed to be of the form
     term1: doc1:weight1_1, doc2:weight2_1, doc3:weight3_1, ...
     term2: doc1:weight1_2, doc2:weight2_2, doc3:weight3_2, ...
   where weightx_y is:
     no. of occurrences of termx in document y / total number of words in document y
b. Measure the execution time of the code for the large data set for 5, 10, and 15 executors
Notes: revisit the Advanced MapReduce lecture for the inverted index.

Part 3:
a. Write PySpark code to calculate the similarity matrix S, with each entry of S being
     \[ S(doc_x, doc_y) = \sum_{t \in V} \left( weight_{t,doc_x} \cdot weight_{t,doc_y} \right) \]
   with V being the vocabulary (determined in Part 1) and the weights having been determined in Part 2
b. Measure the execution time for the large data set for 5, 10, and 15 executors
Notes: See the following paper for the full algorithm: Tamer Elsayed, Jimmy Lin, and Douglas W. Oard: Pairwise document similarity in large collections with MapReduce
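The weight and similarity definitions of Parts 2 and 3 can be checked on a toy example in plain Python. This is a sketch under assumptions rather than the algorithm from the cited paper: documents are given as already cleaned word lists, the document length counts all words of the document (including those outside the vocabulary), and all names and sample data are hypothetical. Each posting list contributes a weight product for every pair of documents that share the term, and the products are summed per pair.

```python
from collections import Counter, defaultdict
from itertools import combinations
from typing import Dict, List, Tuple


def build_index(docs: Dict[str, List[str]],
                vocabulary: List[str]) -> Dict[str, Dict[str, float]]:
    """Map each term to {document: weight}; weight = term count / doc length."""
    index: Dict[str, Dict[str, float]] = defaultdict(dict)
    vocab = set(vocabulary)
    for name, words in docs.items():
        for term, count in Counter(words).items():
            if term in vocab:
                index[term][name] = count / len(words)
    return dict(index)


def similarity(index: Dict[str, Dict[str, float]]) -> Dict[Tuple[str, str], float]:
    """Sum the weight products over every term that two documents share."""
    scores: Dict[Tuple[str, str], float] = defaultdict(float)
    for postings in index.values():
        for doc_x, doc_y in combinations(sorted(postings), 2):
            scores[(doc_x, doc_y)] += postings[doc_x] * postings[doc_y]
    return dict(scores)


docs = {  # hypothetical toy collection
    "doc1": ["spark", "stream", "spark", "data"],
    "doc2": ["spark", "data"],
    "doc3": ["stream", "batch"],
}
index = build_index(docs, vocabulary=["spark", "stream", "data"])
scores = similarity(index)
print(index)
print(scores)
```

Pairs of documents that share no vocabulary term never appear in the result, which is what keeps the number of computed entries below the full quadratic matrix.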

Part 4:
Provide a list of the 10 most similar (or identical) pairs of documents from the large data set

Input files
- Small input (22 short documents) available in HDFS in /cosc6339_hw2/gutenberg-22/
  - The small data set contains 2 pairs of documents that are identical, and 2 pairs of documents that only have very minor differences. Your code should be able to identify them!
- Large data set available in HDFS in /cosc6339_hw2/large-dataset/
  - Only use the large input file after you have confirmed that your code runs correctly with the small input file
  - Data set not yet there, but will be within the next 48 hours.
- Output: remember to write your results in the /bigdxy directory, not directly to /
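Selecting the most similar pairs in Part 4 does not require sorting all scores. A minimal plain-Python sketch, assuming the similarity scores are available as a dictionary keyed by document pair (the function name and the sample scores are hypothetical):

```python
import heapq
from typing import Dict, List, Tuple

Pair = Tuple[str, str]


def most_similar(scores: Dict[Pair, float], k: int) -> List[Tuple[Pair, float]]:
    """Return the k document pairs with the highest similarity score."""
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])


sample_scores = {("a", "b"): 0.9, ("a", "c"): 0.2, ("b", "c"): 0.5}
print(most_similar(sample_scores, 2))
```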

Documentation
The documentation should contain
- (Brief) problem description
- Solution strategy
- Description of how to run your code
- Results section
  - Description of resources used
  - Description of measurements performed
  - Results (graphs/tables + findings)
The document should not contain
- Replication of the entire source code (that's why you have to deliver the sources)
- Screen shots of every single measurement you made
  - Actually, no screen shots at all.
- The output files

Additional resources
- Python:
- Spark:
- Whale cluster:
