Storm. Distributed and fault-tolerant realtime computation. Nathan Marz Twitter

Similar documents
Storm. Distributed and fault-tolerant realtime computation. Nathan Marz Twitter

Before proceeding with this tutorial, you must have a good understanding of Core Java and any of the Linux flavors.

Apache Storm. Hortonworks Inc Page 1

STORM AND LOW-LATENCY PROCESSING.

Streaming & Apache Storm

Tutorial: Apache Storm

Data Analytics with HPC. Data Streaming

Scalable Streaming Analytics

8/24/2017 Week 1-B Instructor: Sangmi Lee Pallickara

10/24/2017 Sangmi Lee Pallickara Week 10- A. CS535 Big Data Fall 2017 Colorado State University

Big Data Infrastructures & Technologies

REAL-TIME ANALYTICS WITH APACHE STORM

10/26/2017 Sangmi Lee Pallickara Week 10- B. CS535 Big Data Fall 2017 Colorado State University

Flying Faster with Heron

Putting it together. Data-Parallel Computation. Ex: Word count using partial aggregation. Big Data Processing. COS 418: Distributed Systems Lecture 21

Twitter Heron: Stream Processing at Scale

Real-time data processing with Apache Flink

Hadoop ecosystem. Nikos Parlavantzas

Apache Storm: Hands-on Session A.A. 2016/17

Batch Processing Basic architecture

Hadoop. copyright 2011 Trainologic LTD

Webinar Series TMIP VISION

MapReduce Design Patterns

FROM LEGACY, TO BATCH, TO NEAR REAL-TIME. Marc Sturlese, Dani Solà

An Efficient Execution Scheme for Designated Event-based Stream Processing

Over the last few years, we have seen a disruption in the data management

Real-time Scheduling of Skewed MapReduce Jobs in Heterogeneous Environments

Large-Scale Data Engineering. Data streams and low latency processing

Spark Streaming. Guido Salvaneschi

Lecture 11 Hadoop & Spark

Lambda Architecture for Batch and Real- Time Processing on AWS with Spark Streaming and Spark SQL. May 2015

Big Data Analytics. Izabela Moise, Evangelos Pournaras, Dirk Helbing

Typhoon: An SDN Enhanced Real-Time Big Data Streaming Framework

Spark, Shark and Spark Streaming Introduction

Apache Storm. A framework for Parallel Data Stream Processing

TITLE: PRE-REQUISITE THEORY. 1. Introduction to Hadoop. 2. Cluster. Implement sort algorithm and run it using HADOOP

Big Data Hadoop Course Content

Voldemort. Smruti R. Sarangi. Department of Computer Science Indian Institute of Technology New Delhi, India. Overview Design Evaluation

BIG DATA. Using the Lambda Architecture on a Big Data Platform to Improve Mobile Campaign Management. Author: Sandesh Deshmane

Apache Flink Big Data Stream Processing

MapReduce-II. September 2013 Alberto Abelló & Oscar Romero 1

CS 398 ACC Streaming. Prof. Robert J. Brunner. Ben Congdon Tyler Kim

1 Big Data Hadoop. 1. Introduction About this Course About Big Data Course Logistics Introductions

Overview. Prerequisites. Course Outline. Course Outline :: Apache Spark Development::

Hadoop. Course Duration: 25 days (60 hours duration). Bigdata Fundamentals. Day1: (2hours)

@unterstein #bedcon. Operating microservices with Apache Mesos and DC/OS

Apache Spark is a fast and general-purpose engine for large-scale data processing Spark aims at achieving the following goals in the Big data context

A Decision Support System for Automated Customer Assistance in E-Commerce Websites

Research challenges in data-intensive computing The Stratosphere Project Apache Flink

MI-PDB, MIE-PDB: Advanced Database Systems

Huge market -- essentially all high performance databases work this way

TSAR A TimeSeries AggregatoR. Anirudh Todi TSAR

StormCrawler. Low Latency Web Crawling on Apache Storm.

Announcements. Parallel Data Processing in the 20 th Century. Parallel Join Illustration. Introduction to Database Systems CSE 414

Topics. Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples

The Stream Processor as a Database. Ufuk

Distributed Systems CS6421

HADOOP FRAMEWORK FOR BIG DATA

A Whirlwind Tour of Apache Mesos

Functional Comparison and Performance Evaluation. Huafeng Wang Tianlun Zhang Wei Mao 2016/11/14

Real-time Data Stream Processing Challenges and Perspectives

HDFS: Hadoop Distributed File System. CIS 612 Sunnie Chung

MapReduce Spark. Some slides are adapted from those of Jeff Dean and Matei Zaharia

A New Architecture for Real Time Data Stream Processing

Distributed computing: index building and use

Today s content. Resilient Distributed Datasets(RDDs) Spark and its data model

2/26/2017. Originally developed at the University of California - Berkeley's AMPLab

Data-Intensive Distributed Computing

Big Data. Introduction. What is Big Data? Volume, Variety, Velocity, Veracity Subjective? Beyond capability of typical commodity machines

Streaming data Model is opposite Queries are usually fixed and data are flows through the system.

Kafka Streams: Hands-on Session A.A. 2017/18

An Introduction to Apache Spark

ASTRI Distributed Stream Computing Platform

Building a Data-Friendly Platform for a Data- Driven Future

Cloud Computing CS

MapReduce, Hadoop and Spark. Bompotas Agorakis

Chapter 5. The MapReduce Programming Model and Implementation

Parallel Programming Principle and Practice. Lecture 10 Big Data Processing with MapReduce

DISTRIBUTED COMPUTER SYSTEMS ARCHITECTURES

Self Regulating Stream Processing in Heron

deconstructing LAMBDA Philly ETE Darach Ennis

Big Streaming Data Processing. How to Process Big Streaming Data 2016/10/11. Fraud detection in bank transactions. Anomalies in sensor data

Stream Processing on IoT Devices using Calvin Framework

Processing of big data with Apache Spark

CS 347 Parallel and Distributed Data Processing

CS 347 Parallel and Distributed Data Processing

Resilient Distributed Datasets

Data Partitioning and MapReduce

HadoopDB: An open source hybrid of MapReduce

Data Informatics. Seon Ho Kim, Ph.D.

Hadoop 2.x Core: YARN, Tez, and Spark. Hortonworks Inc All Rights Reserved

Map-Reduce. Marco Mura 2010 March, 31th

Introduction to Data Management CSE 344

Data Acquisition. The reference Big Data stack

Working with Storm Topologies

Microsoft. Exam Questions Perform Data Engineering on Microsoft Azure HDInsight (beta) Version:Demo

MapReduce Algorithm Design

Introduction to MapReduce

Informa)on Retrieval and Map- Reduce Implementa)ons. Mohammad Amir Sharif PhD Student Center for Advanced Computer Studies

Parallel Computing: MapReduce Jin, Hai

Transcription:

Storm Distributed and fault-tolerant realtime computation Nathan Marz Twitter

Storm at Twitter Twitter Web Analytics

Before Storm Queues Workers

Example (simplified)

Example Workers schemify tweets and append to Hadoop

Example Workers update statistics on URLs by incrementing counters in Cassandra

Example Distribute tweets randomly on multiple queues

Example Workers share the load of schemifying tweets

Example Desire all updates for same URL go to same worker

Message locality Because: No transactions in Cassandra (and no atomic increments at the time) More effective batching of updates

Implementing message locality Have a queue for each consuming worker Choose queue for a URL using consistent hashing

Example Workers choose queue to enqueue to using hash/mod of URL

Example All updates for same URL guaranteed to go to same worker

Adding a worker

Adding a worker Deploy Reconfigure/redeploy

Problems Scaling is painful Poor fault-tolerance Coding is tedious

What we want Guaranteed data processing Horizontal scalability Fault-tolerance No intermediate message brokers! Higher level abstraction than message passing Just works

Storm Guaranteed data processing Horizontal scalability Fault-tolerance No intermediate message brokers! Higher level abstraction than message passing Just works

Use cases Stream processing Distributed RPC Continuous computation

Storm Cluster

Storm Cluster Master node (similar to Hadoop JobTracker)

Storm Cluster Used for cluster coordination

Storm Cluster Run worker processes

Starting a topology

Killing a topology

Concepts Streams Spouts Bolts Topologies

Streams Tuple Tuple Tuple Tuple Tuple Tuple Tuple Unbounded sequence of tuples

Spouts Source of streams

Spout examples Read from Kestrel queue Read from Twitter streaming API

Bolts Processes input streams and produces new streams

Bolts Functions Filters Aggregation Joins Talk to databases

Topology Network of spouts and bolts

Tasks Spouts and bolts execute as many tasks across the cluster

Stream grouping When a tuple is emitted, which task does it go to?

Stream grouping Shuffle grouping: pick a random task Fields grouping: consistent hashing on a subset of tuple fields All grouping: send to all tasks Global grouping: pick task with lowest id

Topology shuffle [ id1, id2 ] shuffle [ url ] shuffle all

Streaming word count TopologyBuilder is used to construct topologies in Java

Streaming word count Define a spout in the topology with parallelism of 5 tasks

Streaming word count Split sentences into words with parallelism of 8 tasks

Streaming word count Consumer decides what data it receives and how it gets grouped Split sentences into words with parallelism of 8 tasks

Streaming word count Create a word count stream

Streaming word count splitsentence.py

Streaming word count

Streaming word count Submitting topology to a cluster

Streaming word count Running topology in local mode

Demo

Traditional data processing

Traditional data processing Intense processing (Hadoop, databases, etc.)

Traditional data processing Light processing on a single machine to resolve queries

Distributed RPC Distributed RPC lets you do intense processing at query-time

Game changer

Distributed RPC Data flow for Distributed RPC

DRPC Example Computing reach of a URL on the fly

Reach Reach is the number of unique people exposed to a URL on Twitter

Computing reach Tweeter Follower Follower Distinct follower URL Tweeter Follower Follower Distinct follower Count Reach Tweeter Follower Follower Distinct follower

Reach topology

Guaranteeing message processing Tuple tree

Guaranteeing message processing A spout tuple is not fully processed until all tuples in the tree have been completed

Guaranteeing message processing If the tuple tree is not completed within a specified timeout, the spout tuple is replayed

Guaranteeing message processing Reliability API

Guaranteeing message processing Anchoring creates a new edge in the tuple tree

Guaranteeing message processing Marks a single node in the tree as complete

Guaranteeing message processing Storm tracks tuple trees for you in an extremely efficient way

Storm UI

Storm UI

Storm UI

Storm on EC2 https://github.com/nathanmarz/storm-deploy One-click deploy tool

Documentation

State spout (almost done) Synchronize a large amount of frequently changing state into a topology

State spout (almost done) Optimizing reach topology by eliminating the database calls

State spout (almost done) Each GetFollowers task keeps a synchronous cache of a subset of the social graph

State spout (almost done) This works because GetFollowers repartitions the social graph the same way it partitions GetTweeter s stream

Future work Storm on Mesos Swapping Auto-scaling Higher level abstractions

Questions? http://github.com/nathanmarz/storm

What Storm does Distributes code and configurations Robust process management Provides reliability by tracking tuple trees Routing and partitioning of streams Serialization Fine-grained performance stats of topologies Monitors topologies and reassigns failed tasks