Apache Storm. Hortonworks Inc Page 1


Neil Rice
5 years ago

1 Apache Storm Page 1

2 What is Storm?

- Real-time stream processing framework
- Scalable
  - Up to 1 million tuples per second per node
- Fault tolerant
  - Tasks reassigned on failure
- Guaranteed processing
  - At-least-once processing
  - Exactly-once processing with some more work
- Relatively language agnostic
  - Primarily JVM based
  - Thrift API for defining and submitting topologies
  - JSON-based protocol for defining components in other languages

3 Motivation

- Process large amounts of incoming data in real time
- Classic use case is processing streams of tweets
  - Calculate trending users
  - Calculate reach of a tweet
- Data cleansing and normalization
- Personalization and recommendation
- Log processing

4 Lambda Architecture

- Most useful when batch and speed layers do essentially the same computation
  - Sample use case: KPI dashboard
- Less useful when batch and speed layers do different computations
  - Sample use case: real-time model scoring

5 Basic Concepts

- Tuple: The most fundamental data structure; a named list of values that can be of any data type
- Stream: An unbounded sequence of tuples
- Spouts: Generate streams
- Bolts: Contain data processing, persistence, and alerting logic; can also emit tuples for downstream bolts
- Tuple tree: The first tuple and all the tuples emitted by the bolts that processed it
- Topology: A group of spouts and bolts wired together into a workflow

6 Architecture

- Nimbus (management server)
  - Similar to job tracker
  - Distributes code around cluster
  - Assigns tasks
  - Handles failures
- Supervisor (worker nodes)
  - Similar to task tracker
  - Runs bolts and spouts as tasks
- ZooKeeper
  - Cluster coordination
  - Nimbus HA
  - Stores cluster metrics
  - Consumption-related metadata for Trident topologies

7 Relationship Between Supervisors, Workers, Executors & Tasks

- Each supervisor machine in Storm has specific predefined ports to which a worker process is assigned

8 Tuple Routing

Stream groupings provide various ways to control tuple routing to bolts.

Shuffle grouping
- What it does: Sends tuples to the instances of a bolt in a random, evenly distributed sequence.
- When to use: Doing atomic operations, e.g. math operations.

Fields grouping
- What it does: Sends tuples to a bolt instance based on one or more fields in the tuple.
- When to use: Segmentation of the incoming stream; counting tuples of a certain type.

All grouping
- What it does: Sends a single copy of each tuple to all instances of a receiving bolt.
- When to use: Sending a signal to all bolts, such as clear cache or refresh state; sending a tick tuple to signal bolts to save state.

Custom grouping
- What it does: Implement your own grouping so tuples are routed based on custom logic.
- When to use: For maximum flexibility to change processing sequence, logic, etc. based on factors such as data types, load, and seasonality.

Direct grouping
- What it does: The source decides which bolt instance will receive the tuple.
- When to use: Depends.

Global grouping
- What it does: Sends tuples generated by all instances of the source to a single target instance (specifically, the task with the lowest ID).
- When to use: Global counts.
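The routing behaviour of shuffle, fields, and global grouping can be sketched in plain Java without any Storm dependency. This is a toy model only: the class `GroupingSketch` and its methods are hypothetical names, and hashing the grouping values modulo the task count is an assumption used for illustration, not a statement of how Storm implements it.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Random;

/** Toy model of how a grouping picks a target task index; not Storm's real code. */
final class GroupingSketch {
    private final int numTasks;
    private final Random random = new Random();

    GroupingSketch(int numTasks) {
        if (numTasks <= 0) {
            throw new IllegalArgumentException("numTasks must be positive");
        }
        this.numTasks = numTasks;
    }

    /** Shuffle grouping: any task of the bolt may receive the tuple. */
    int shuffle() {
        return random.nextInt(numTasks);
    }

    /** Fields grouping: equal grouping values always map to the same task. */
    int fields(List<Object> groupingValues) {
        // Assumption for illustration: hash the values, then wrap into [0, numTasks).
        return Math.floorMod(groupingValues.hashCode(), numTasks);
    }

    /** Global grouping: every tuple goes to the task with the lowest index. */
    int global() {
        return 0;
    }

    public static void main(String[] args) {
        GroupingSketch grouping = new GroupingSketch(4);
        for (String tag : Arrays.asList("#storm", "#kafka", "#storm")) {
            int task = grouping.fields(Arrays.<Object>asList(tag));
            System.out.println(tag + " -> task " + task);
        }
    }
}
```

Because the same hashtag always lands on the same task, a counting bolt behind a fields grouping can keep its counts in local memory; behind a shuffle grouping the counts for one hashtag would be scattered across tasks.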

9 Topology creation example

Get Tweet -> Find Hashtags -> Count Hashtags -> Report Findings

- Kafka Spout "reader"
- Bolt "normalizer": Removes non-alphanumeric characters, extracts hashtag values, and emits them.
- Bolt "enumerator": Keeps track of how many instances of each hashtag have occurred.
- Bolt "reporter": Regularly creates a report and uploads it to Amazon S3.

    TopologyBuilder builder = new TopologyBuilder();
    builder.setSpout("spout", kafkaSpout);
    builder.setBolt("normalizer", new HashTagNormalizer(), 2)
           .shuffleGrouping("spout");
    builder.setBolt("enumerator", new HashTagEnumerator(), 2)
           .fieldsGrouping("normalizer", new Fields("hashtag"));
    builder.setBolt("reporter", new ResultsReporter(), 1)
           .globalGrouping("enumerator");

10 What happens on failure?

- Run everything with monitoring
  - E.g. daemontools or monit
  - Restarts Nimbus and Supervisors on failure
- Nimbus
  - Stateless (state is kept in either ZooKeeper or on disk)
  - Single point of failure, sort of
    - Workers still function, but can't be reassigned when a node fails
    - Supervisors continue as normal
- Supervisor
  - Stateless
- Entire node
  - Nimbus reassigns tasks on that machine after a timeout

11 Guaranteed Processing

- Tuples from the spout are tagged with a message ID
- Each of these tuples can result in a tuple tree
- Once every tuple in the tuple tree is processed, the original tuple is considered processed
- Requires two pieces from the user
  - Explicitly anchoring an emitted tuple to the input tuple(s)
  - Acking or failing every tuple
- If a tuple isn't processed quickly enough, a timeout will cause a failure
- Spouts like the Kafka spout can replay tuples on failure, whether the failure is explicitly indicated by a bolt or caused by a timeout
- At-least-once processing!
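The bookkeeping behind a tuple tree can be illustrated with a small, self-contained Java model. Everything here is hypothetical (`TupleTreeSketch`, its method names, the use of a set of pending tuple IDs); it is a simplified sketch of the idea of anchoring and acking, not a description of how Storm tracks tuple trees internally.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Toy tracker for tuple trees; illustrates the bookkeeping, not Storm's acker. */
final class TupleTreeSketch {
    enum Status { PENDING, PROCESSED, FAILED }

    // Spout message ID -> IDs of tuples in that tree that have not been acked yet.
    private final Map<String, Set<Long>> pending = new HashMap<>();
    private final Map<String, Status> status = new HashMap<>();

    /** The spout emits a root tuple tagged with a message ID. */
    void emitRoot(String messageId, long tupleId) {
        Set<Long> ids = new HashSet<>();
        ids.add(tupleId);
        pending.put(messageId, ids);
        status.put(messageId, Status.PENDING);
    }

    /** A bolt emits a new tuple anchored to an input tuple of the same tree. */
    void emitAnchored(String messageId, long newTupleId) {
        Set<Long> ids = pending.get(messageId);
        if (ids != null) {
            ids.add(newTupleId);
        }
    }

    /** Acking the last outstanding tuple completes the whole tree. */
    void ack(String messageId, long tupleId) {
        Set<Long> ids = pending.get(messageId);
        if (ids == null) {
            return; // tree already completed or failed
        }
        ids.remove(tupleId);
        if (ids.isEmpty()) {
            pending.remove(messageId);
            status.put(messageId, Status.PROCESSED);
        }
    }

    /** An explicit fail or a timeout fails the whole tree, so the spout can replay it. */
    void fail(String messageId) {
        if (pending.remove(messageId) != null) {
            status.put(messageId, Status.FAILED);
        }
    }

    Status statusOf(String messageId) {
        return status.get(messageId);
    }

    public static void main(String[] args) {
        TupleTreeSketch tracker = new TupleTreeSketch();
        tracker.emitRoot("msg-1", 1L);
        tracker.emitAnchored("msg-1", 2L); // anchor the child before acking the parent
        tracker.ack("msg-1", 1L);
        System.out.println("after first ack: " + tracker.statusOf("msg-1"));
        tracker.ack("msg-1", 2L);
        System.out.println("after second ack: " + tracker.statusOf("msg-1"));
    }
}
```

In this model the order matters: a bolt registers its anchored output before acking its input, otherwise the tree would look complete while work is still outstanding.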

12 What is Trident?

- Provides exactly-once processing semantics in Storm
- Core concept is to process a group of tuples as a batch rather than one tuple at a time, as core Storm does
- Higher-level API for defining topologies
- All Trident topologies are automatically converted into spouts and bolts under the covers

13 Parallelism

- Three basic variables: # slots, # workers, # tasks
- No general way to answer beyond profiling and adjusting
- Can set the number of executors (threads)
- Can set the number of tasks
  - Tasks are NOT parallel within an executor
  - More than one task per executor is useful for rebalancing while the topology is running
- Number of workers
  - Increase when bottlenecked on CPU and each worker has many tuples to process

14 Patterns: Streaming Joins

- Combine two or more data streams
- Unlike a database join, a streaming join has infinite input and unclear semantics
- Different types of joins for different use cases
- Partition input streams the same way
  - Fields grouping

    builder.setBolt("join", new MyJoiner(), parallelism)
           .fieldsGrouping("1", new Fields("joinfield1", "joinfield2"))
           .fieldsGrouping("2", new Fields("joinfield1", "joinfield2"))
           .fieldsGrouping("3", new Fields("joinfield1", "joinfield2"));
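What a join task might do with the tuples it receives can be sketched as a simple in-memory join on a shared key. This is one possible semantics among many (an inner join over two streams that never expires old entries); `StreamJoinSketch` and its methods are hypothetical and are not the `MyJoiner` bolt from the slide.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy in-memory inner join of two streams on a shared key; old entries never expire. */
final class StreamJoinSketch {
    private final Map<String, List<String>> left = new HashMap<>();
    private final Map<String, List<String>> right = new HashMap<>();

    /** Feed a tuple from the first stream; returns the joined rows it completes. */
    List<String> onLeft(String key, String value) {
        left.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
        List<String> joined = new ArrayList<>();
        for (String r : right.getOrDefault(key, Collections.<String>emptyList())) {
            joined.add(key + ":" + value + "," + r);
        }
        return joined;
    }

    /** Feed a tuple from the second stream; returns the joined rows it completes. */
    List<String> onRight(String key, String value) {
        right.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
        List<String> joined = new ArrayList<>();
        for (String l : left.getOrDefault(key, Collections.<String>emptyList())) {
            joined.add(key + ":" + l + "," + value);
        }
        return joined;
    }

    public static void main(String[] args) {
        StreamJoinSketch join = new StreamJoinSketch();
        System.out.println(join.onLeft("user-1", "click"));
        System.out.println(join.onRight("user-1", "purchase"));
    }
}
```

The sketch only works per task because both streams are partitioned on the same join fields, so matching keys meet in the same task. Since the input is infinite, a real join would also need some rule for discarding old entries, such as a time window.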

15 Patterns: Batching

- For efficiency
  - E.g. Elasticsearch bulk API
- Hold on to tuples in an instance variable
- Process tuples
- Ack all the held tuples
- When emitting, consider a multi-anchored tuple to ensure reliability
  - Anchor to the batched tuples to ensure all batched tuples are replayed
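The hold, process, then ack sequence can be sketched in plain Java. The class `BatchingSketch` is hypothetical, and the two callbacks stand in for a bulk write (for example a bulk API call) and for acking a tuple; neither is a real Storm or Elasticsearch call.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Toy batching buffer: hold tuples, process them in bulk, then ack each one. */
final class BatchingSketch<T> {
    private final int batchSize;
    private final Consumer<List<T>> bulkWriter; // stand-in for a bulk API call
    private final Consumer<T> acker;            // stand-in for acking one tuple
    private final List<T> held = new ArrayList<>();

    BatchingSketch(int batchSize, Consumer<List<T>> bulkWriter, Consumer<T> acker) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        this.batchSize = batchSize;
        this.bulkWriter = bulkWriter;
        this.acker = acker;
    }

    /** Hold the tuple; flush once the batch is full. */
    void onTuple(T tuple) {
        held.add(tuple);
        if (held.size() >= batchSize) {
            flush();
        }
    }

    /** Process the held tuples in one call, then ack them. */
    void flush() {
        if (held.isEmpty()) {
            return;
        }
        List<T> batch = new ArrayList<>(held);
        bulkWriter.accept(batch); // if this throws, nothing is acked and tuples stay held
        for (T tuple : batch) {
            acker.accept(tuple);
        }
        held.clear();
    }

    int heldCount() {
        return held.size();
    }

    public static void main(String[] args) {
        BatchingSketch<String> batcher = new BatchingSketch<String>(
                2,
                batch -> System.out.println("bulk write of " + batch.size() + " tuples"),
                tuple -> System.out.println("ack " + tuple));
        batcher.onTuple("a");
        batcher.onTuple("b");
    }
}
```

Acking only after the bulk write returns means a failed write leaves the tuples unacked, so they can be replayed. A batch that is held for too long risks hitting the tuple timeout, so a periodic call to `flush()` is a reasonable addition.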

16 Patterns: Streaming Top N

- Simplest way is to have a bolt that does a global grouping on the stream and maintains a list of the top N items in memory
  - Doesn't scale because the whole stream goes through one task
- Alternative: Do many top Ns across partitions of the stream
  - Merge each partition's top N to get the global top N
  - Use fields grouping to get the partitioning

    builder.setBolt("rank", new RankObjects(), parallelism)
           .fieldsGrouping("objects", new Fields("value"));
    builder.setBolt("merge", new MergeObjects())
           .globalGrouping("rank");
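The rank-then-merge idea can be sketched in plain Java. `TopNSketch` and its methods are hypothetical helpers and are not the `RankObjects` or `MergeObjects` bolts from the slide; counts per key are assumed to be kept in a map by each partition.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy version of "top N per partition, then merge"; not the slide's bolts. */
final class TopNSketch {
    /** Returns the n entries with the highest counts, ties broken by key. */
    static List<Map.Entry<String, Long>> topN(Map<String, Long> counts, int n) {
        List<Map.Entry<String, Long>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Long.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        return new ArrayList<>(entries.subList(0, Math.min(n, entries.size())));
    }

    /** Merges per-partition top-N lists into a global top N. */
    static List<Map.Entry<String, Long>> merge(List<List<Map.Entry<String, Long>>> partials, int n) {
        Map<String, Long> combined = new HashMap<>();
        for (List<Map.Entry<String, Long>> partial : partials) {
            for (Map.Entry<String, Long> entry : partial) {
                // With fields grouping each key lives in one partition, so this is usually a plain put.
                combined.merge(entry.getKey(), entry.getValue(), Long::sum);
            }
        }
        return topN(combined, n);
    }

    public static void main(String[] args) {
        Map<String, Long> partitionA = new HashMap<>();
        partitionA.put("#storm", 5L);
        partitionA.put("#kafka", 3L);
        Map<String, Long> partitionB = new HashMap<>();
        partitionB.put("#hadoop", 4L);

        List<List<Map.Entry<String, Long>>> partials = new ArrayList<>();
        partials.add(topN(partitionA, 2));
        partials.add(topN(partitionB, 2));
        System.out.println(merge(partials, 2));
    }
}
```

Only N entries per partition reach the merge step, so the single merging task sees a small, bounded input instead of the whole stream.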

Before proceeding with this tutorial, you must have a good understanding of Core Java and any of the Linux flavors.

About the Tutorial

Storm was originally created by Nathan Marz and team at BackType, a social analytics company. Later, BackType was acquired by Twitter, which open-sourced Storm. In a short time, Apache