Apache Storm. A framework for Parallel Data Stream Processing

Bryce Hicks

1 Apache Storm A framework for Parallel Data Stream Processing

2 Storm
- Storm is a distributed real-time computation platform.
- Provides abstractions for implementing event-based computations on a cluster of physical nodes.
- Performs parallel computations on data streams.
- Manages high-throughput data streams.
- Can be used to design complex event-driven applications on intense streams of data.

3 Introduction
- Began as a project of BackType, a marketing intelligence company bought by Twitter in 2011.
- Twitter open-sourced the project, which became an Apache project in 2014.
- Storm = the Hadoop for real-time processing: "Storm makes it easy to reliably process unbounded streams of data, doing for realtime processing what Hadoop did for batch processing."
- Designed for massive scalability; supports fault tolerance with a fail-fast, auto-restart approach to processes; and guarantees that every tuple of the stream will be processed.
- The default is at-least-once processing semantics, but exactly-once processing semantics (transactional) can also be implemented.

4 Design Goals
- Guaranteed data processing: no data is lost
- Imperative description of a streaming workflow (through stream manipulation classes)
- Horizontal scalability
- Fault tolerance
- Programmable in different languages

5 Main Concepts: Spouts and Bolts
Any Storm computation is defined as a Directed Acyclic Graph (DAG) of spouts and bolts, which is called a topology. In the topology, spouts and bolts produce and consume streams of tuples.
- Tuples: generic objects without any schema, but which can have named fields.
- Spouts: the tuple input modules; can be unreliable (fire-and-forget) or reliable (replay failed tuples).
- Bolts: the tuple processing or output modules; they consume streams and potentially produce new streams.
- Stream: a potentially infinite sequence of Tuple objects that Storm serializes and passes to the next bolts in the topology.
Complex stream transformations often require multiple steps (a chain of multiple bolts).
Storm topologies run on clusters, and the Storm scheduler distributes work to nodes around the cluster based on the topology configuration.

6 Application Represented as a Topology
Source: Heinze, Aniello, Querzoni, Jerzak, Cloud-based Data Stream Processing, DEBS 2014

7 Unlike MapReduce jobs, topologies run forever or until manually terminated.
- Spouts: bring data into the system and hand the data off to bolts (which may in turn hand data to subsequent bolts).
- Bolts: do the processing on the stream. They may write data out to a database or file system, send a message to another external system, or make the results of the computation available to users.

8 Typical Bolt Functions
- Tuple transformations
- Filters
- Aggregation
- Joins
- Storage/retrieval from persistent stores

9 Application Represented as a Topology
The Storm developer may set parallelism hints on elements of the topology.
Source: Heinze, Aniello, Querzoni, Jerzak, Cloud-based Data Stream Processing, DEBS 2014

10 Storm Strengths
- A rich array of available spouts specialized for receiving data from all types of sources (e.g. the Twitter streaming API, Apache Kafka, JMS brokers, etc.).
- It is straightforward to integrate with HDFS file systems, meaning Storm can easily interoperate with Hadoop if needed.
- Storm supports multi-language programming: spouts and bolts can be written in almost any language.
- Storm is a very scalable, fast, fault-tolerant open source system for distributed computation, with a special focus on calculating rolling metrics in real time over streams of data.

11 Data Partitioning Schemes
- When a tuple is emitted, to which task does it go?
- Storm offers some flexibility to define the data partitioning/shuffling method.
- Stream groupings define the data flow in the topology.
- A grouping is set for every connection between a spout or bolt and a consuming bolt, through the grouping method chosen when defining the topology.
[Figure: topology view and task view]

12 Types of Stream Grouping
- Shuffle grouping: random distribution of tuples to the tasks of the next downstream bolt.
- Fields grouping: uses one or more named elements of the tuples to determine the destination task (by mod hashing).
- All grouping: sends all tuples to all tasks.
- Global grouping: all tuples go to the bolt task with the lowest id.
- Direct grouping: the emitter explicitly chooses the target task of the consuming bolt.
- Custom grouping: define a custom grouping method by implementing the CustomStreamGrouping interface.
- LocalOrShuffle grouping: if the target bolt has one or more tasks in the same worker process, tuples will be shuffled to just those in-process tasks. Otherwise, it behaves like a normal shuffle grouping.
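
The "mod hashing" used by fields grouping can be illustrated with a minimal plain-Java sketch. It does not use the Storm API: the class FieldsGroupingSketch and its method taskFor are hypothetical names, and Storm's real hash function may differ. The point is only that equal field values always map to the same task index.

```java
import java.util.Arrays;

// Hypothetical helper, not part of the Storm API.
class FieldsGroupingSketch {
    // Picks a target task by hashing the values of the grouping fields
    // and taking the result modulo the number of tasks.
    static int taskFor(int numTasks, Object... groupingValues) {
        // floorMod keeps the index non-negative even for negative hash codes
        return Math.floorMod(Arrays.hashCode(groupingValues), numTasks);
    }

    public static void main(String[] args) {
        int numTasks = 4;
        for (String word : new String[] {"storm", "bolt", "storm", "spout"}) {
            // Both occurrences of "storm" are routed to the same task
            System.out.println(word + " -> task " + taskFor(numTasks, word));
        }
    }
}
```

This property is what makes per-key state (such as a word counter) safe to keep inside a single bolt task.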

13 Topology with Grouping Options
[Figure: a spout and several bolts connected by streams labeled shuffle, [id1, id2], global, [url], and all]

14 A Practical Example: Word Count
- Word count: the "Hello World" of stream processing
- Input: a stream of text (e.g. from documents)
- Output: the number of occurrences of each word
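
Independently of Storm, the computation itself can be sketched in plain Java as two stages, one that splits lines into words and one that counts them. The class WordCountSketch and its methods are hypothetical; lowercasing and splitting on whitespace are assumptions made for this illustration, not something the slides prescribe.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-alone sketch; the real example distributes these
// two stages over separate bolts.
class WordCountSketch {
    // Splitting stage: cuts one line of text into lowercase words.
    static List<String> split(String line) {
        List<String> words = new ArrayList<>();
        for (String w : line.toLowerCase().split("\\s+")) {
            if (!w.isEmpty()) {
                words.add(w); // skip the empty token produced by leading whitespace
            }
        }
        return words;
    }

    // Counting stage: keeps a running count per word.
    static Map<String, Integer> count(List<String> lines) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String line : lines) {
            for (String word : split(line)) {
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(count(Arrays.asList("the cat", "the dog")));
    }
}
```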

15 A Practical Example: Hello Storm
A simple word count
[Figure: the Storm topology]

16 Topology Description
- Using the TopologyBuilder class and its methods setSpout() and setBolt(), the spouts and bolts are declared and instantiated.
- setBolt() returns an InputDeclarer object that is used to define the inputs to the bolt. With it, a bolt explicitly subscribes to a specific stream of another component (spout or bolt) and chooses the data shuffling/partitioning option.
- The parallelization hint for spouts and bolts is optional.
- The cluster class (e.g. LocalCluster, through its submitTopology method) is then used to map the topology to a cluster.

17 HelloStorm: contains the topology definition

18 IRichSpout
IRichSpout is the interface that any spout must implement.
- open method: allows the spout to configure any connections to the outside world (e.g. connections to queue servers) and to receive the SpoutOutputCollector.
- nextTuple method: emits (sends) the next tuple downstream in the topology; it is called repeatedly by the Storm infrastructure.
- declareOutputFields method: defines the fields of the tuples of the output streams.
- ack and fail methods: called when Storm detects that a tuple emitted from the spout either successfully completed the topology or failed to be completed.

19 LineReaderSpout: reads docs and creates tuples

20 BaseRichBolt
Extend the abstract class BaseRichBolt or implement the IRichBolt interface.
- prepare method: passes information about the topology to the bolt. The OutputCollector object manages the interaction between the bolt and the topology (e.g. transmitting and acknowledging tuples).
- execute method: does the processing of incoming tuples. The collector.emit() method is used to send the transformed/new tuple to the next bolt. Through collector.ack() and collector.fail() the bolt can notify Storm whether the processing of the tuple succeeded or failed, and for which reason (collector.reportError()).
- declareOutputFields method: used to declare the fields of the output tuples or to define new named output streams.

21 BaseRichBolt
Bolts can emit more than one stream. To make use of this, declare multiple named streams using the declareStream method of the OutputFieldsDeclarer interface:

    public void declareOutputFields(OutputFieldsDeclarer d) {
        // Default stream: only the names of the fields are given
        d.declare(new Fields("first", "second", "third"));
        // Named streams: name of the stream, then names of the fields
        d.declareStream("car", new Fields("first"));
        d.declareStream("cdr", new Fields("second", "third"));
    }

Then specify the named output stream when calling the emit method on the OutputCollector:

    public void execute(Tuple input) {
        // Access to the tuple fields
        List<Object> objs = input.select(new Fields("first", "second", "third"));
        collector.emit(objs);
        collector.emit("car", new Values(objs.get(0)));
        collector.emit("cdr", new Values(objs.get(1), objs.get(2)));
        collector.ack(input);
    }

22 WordSplitterBolt: cuts lines into words

23 WordCounterBolt: counts word occurrences

24 Topology Execution
- A topology processes tuples forever (until you kill it).
- It consists of many worker processes spread across many machines (each managed by a supervisor).
- A machine in a cluster may run one or more worker processes. Each worker process is either idle or being used by a single topology.
- Each worker process may run one or more tasks of the same component.
- Storm's default scheduler applies a simple round-robin strategy to assign tasks to worker processes.
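
The round-robin idea can be sketched in a few lines of plain Java. This is only an illustration of the strategy, not Storm's scheduler code; RoundRobinSketch and assign are hypothetical names, and tasks and workers are reduced to integer indices.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of round-robin task placement.
class RoundRobinSketch {
    // Returns, for each worker index, the list of task ids assigned to it.
    static List<List<Integer>> assign(int numTasks, int numWorkers) {
        List<List<Integer>> workers = new ArrayList<>();
        for (int w = 0; w < numWorkers; w++) {
            workers.add(new ArrayList<>());
        }
        for (int task = 0; task < numTasks; task++) {
            // Task i goes to worker i mod numWorkers
            workers.get(task % numWorkers).add(task);
        }
        return workers;
    }

    public static void main(String[] args) {
        System.out.println(assign(5, 2));
    }
}
```

With this strategy the number of tasks per worker differs by at most one, regardless of which component each task belongs to.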

25 Architecture of a Storm Cluster
Nimbus:
- Distributes code around the cluster
- Assigns tasks to machines/supervisors, i.e. allocates the execution of components (spouts and bolts) to the worker processes
- Monitors failures
- Is fail-fast and stateless
ZooKeeper:
- Keeps track of which supervisor machines are running (for discovery and coordination purposes) and whether the Nimbus machine is up
Supervisor:
- Listens for work assigned to its machine
- Starts and stops worker processes based on Nimbus commands
- Is fail-fast and stateless

26 Tuple Tree
- Storm considers a tuple coming off a spout "fully processed" when the tuple tree has been exhausted and every message in the tree has been processed.
- A tuple is considered failed when its tree of messages fails to be fully processed within a specified timeout. This timeout can be configured (default is 30 seconds).
[Figure: a tuple emitted by a spout, and the tuple tree generated by the processing of a sentence]

27 Anchoring
A tuple tree is defined by specifying the input tuple as the first argument of emit. If the new tuple fails to be processed downstream, the root tuple can be identified.
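
To make the notion of an "exhausted" tuple tree concrete, here is a minimal plain-Java sketch that counts the unacknowledged tuples per root. It is a simplification made for illustration: TupleTreeSketch and its methods are hypothetical, and Storm's own acker tasks track trees with a compact XOR checksum of tuple ids rather than a counter.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified tracker: one counter per root tuple.
class TupleTreeSketch {
    // Root message id -> number of tuples in its tree not yet acked
    private final Map<String, Integer> pending = new HashMap<>();

    // A spout emits a root tuple: the tree starts with one pending tuple.
    void emitRoot(String rootId) {
        pending.put(rootId, 1);
    }

    // A bolt emits a new tuple anchored to an input that belongs to rootId.
    void emitAnchored(String rootId) {
        if (!pending.containsKey(rootId)) {
            throw new IllegalStateException("unknown root " + rootId);
        }
        pending.merge(rootId, 1, Integer::sum);
    }

    // A bolt acks one tuple of the tree.
    // Returns true when the whole tree has been processed.
    boolean ack(String rootId) {
        if (!pending.containsKey(rootId)) {
            throw new IllegalStateException("unknown root " + rootId);
        }
        int left = pending.merge(rootId, -1, Integer::sum);
        if (left == 0) {
            pending.remove(rootId);
            return true;
        }
        return false;
    }
}
```

For a sentence with two words, the sequence is emitRoot, two calls to emitAnchored, and three calls to ack (the sentence plus each word); only the last ack reports the root as fully processed.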

28 At-least-once Processing Guarantee
- With anchoring, Storm can guarantee at-least-once semantics (in the presence of failures reported by bolts) without using intermediate queues.
- Instead of retrying from the point where a failure was reported, retries happen from the root of the tuple tree: the spout simply re-emits the root tuple.
- Intermediate stages of bolt processing that had completed successfully will be redone. This wastes some processing, but has the advantage that there is no need to synchronize the processing of the tuples by the parallel tasks.
- If the operation of the bolts is idempotent (repeating it has no additional effect), re-processing yields the same result as an exactly-once processing guarantee.

29 Transactional Exactly-once Processing Guarantee
- Bolts may not be idempotent, and the processing may require exactly-once semantics: e.g. if a bolt holds some state that is updated as tuples are processed (such as a counter) and that is sensitive to repeated processing, or if state must be restored from a failed bolt.
- Exactly-once semantics requires that data sources be fault-tolerant and able to re-emit tuples (a.k.a. tuple replay).

30 Transactional Exactly-once Processing Guarantee
Storm handles this by using the following processing protocol:
- Tuples are grouped into micro-batches, and each batch is associated with a transaction ID.
- A transaction ID is a monotonically increasing numerical value (e.g. the first batch has ID 1, the second ID 2, etc.).
- If the topology fails to process a batch, the batch is re-emitted with the same transaction ID.
- Before sending the batch through the pipeline, Storm announces to the nodes (bolts) that a new transaction is being attempted. If it is successful, all nodes can commit their state.
- Storm guarantees that commit phases are globally ordered across all transactions, i.e. transaction n+1 can never be committed before transaction n.

31 Each processing node executes the following logic for state updates:
- The latest transaction ID is persisted along with the state.
- If the framework requests to commit the current transaction with an ID that differs from the persisted ID, the state can be updated, e.g. a counter can be incremented. (Assuming a strong ordering of transactions, such an update will happen exactly once for each batch.)
- If the current transaction ID equals the persisted value, the node skips the commit because this is a batch replay. The node must have processed the batch earlier and updated the state accordingly, but the transaction failed due to an error somewhere else in the pipeline.
- The strict order of commits is important to achieve exactly-once processing semantics.
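
The state-update logic above can be sketched in plain Java as follows. TransactionalCounterSketch is a hypothetical class, the state is kept in memory rather than persisted, and -1 as the initial transaction ID is an assumption; a real bolt would write the count and the transaction ID to its store atomically.

```java
// Hypothetical sketch of a node that updates a counter exactly once per batch.
class TransactionalCounterSketch {
    private long count = 0;
    // Latest committed transaction ID, stored together with the state
    private long lastTxId = -1;

    // Commits a batch of batchSize tuples under the given transaction ID.
    // Returns true if the state was updated, false if the batch is a replay.
    boolean commit(long txId, long batchSize) {
        if (txId == lastTxId) {
            // Same ID as the persisted one: the batch was already applied
            // before the transaction failed elsewhere, so skip the update.
            return false;
        }
        count += batchSize;
        lastTxId = txId;
        return true;
    }

    long count() {
        return count;
    }
}
```

The comparison against a single stored ID is sufficient only because commits are strictly ordered: a replayed batch can never be older than the last committed one.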

32 Storm's Transaction Processing
[Figure: a topology]
Note: transactional processing can cause serious performance degradation even if large batches are used.

33 Spouts Re-emitting Tuples
- When emitting a tuple, the spout provides a "message id" that will be used to identify the tuple later.
- The tuple gets sent to consuming bolts, and Storm takes care of tracking the tree of messages that is created.
- If a failure (or timeout) is detected, Storm calls the fail method only on the specific spout task that emitted the failed tuple, passing its message id. Other parallel spout tasks are not affected.
- The need to re-emit root tuples in case of failure requires a persistent queue: the message is not dequeued but placed in a pending state, waiting for the acknowledgement that its processing has been completed by the topology.
- Therefore, spouts are often connected to Kafka clusters.
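
The pending-state behaviour can be sketched with a small in-memory model in plain Java. ReliableSpoutSketch is hypothetical and does not implement Storm's spout interface; the in-memory deque stands in for the persistent queue, and re-queueing a failed message at the front is an assumption made for this illustration.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory model of a reliable spout.
class ReliableSpoutSketch {
    private final Deque<String> queue = new ArrayDeque<>();
    // Message id -> message emitted but not yet acknowledged
    private final Map<Long, String> pending = new HashMap<>();
    private long nextId = 0;

    void offer(String message) {
        queue.addLast(message);
    }

    // Emits the next message and returns its message id, or -1 if none is available.
    // The message is kept in the pending map until it is acked or failed.
    long nextTuple() {
        String message = queue.pollFirst();
        if (message == null) {
            return -1;
        }
        long id = nextId++;
        pending.put(id, message);
        return id;
    }

    // The tuple tree completed: the message can finally be dropped.
    void ack(long id) {
        pending.remove(id);
    }

    // The tuple tree failed or timed out: make the message available for re-emission.
    void fail(long id) {
        String message = pending.remove(id);
        if (message != null) {
            queue.addFirst(message);
        }
    }

    int pendingCount() {
        return pending.size();
    }

    int queuedCount() {
        return queue.size();
    }
}
```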

34 Storm Operation Modes
- Local mode: simulates the execution of a Storm cluster in a single process (useful for debugging).
- Distributed mode: execution in a cluster of machines. Submitting a topology to the master (Nimbus) also submits the code necessary to run it. Nimbus takes care of distributing your code and allocating workers to run your topology. If workers go down, it reassigns them somewhere else.

35 Exercise
Write a first Storm program (in local mode) that consumes a data stream and performs some transformation, counting, and/or classification of the tuples according to some pre-established criterion. Your topology must have at least 1 spout and 2 bolts.
