Storm. Distributed and fault-tolerant realtime computation. Nathan Marz Twitter


1 Storm Distributed and fault-tolerant realtime computation Nathan Marz Twitter

2 Storm at Twitter Twitter Web Analytics

3 Before Storm Queues Workers

4 Example (simplified)

5 Example Workers schemify tweets and append to Hadoop

6 Example Workers update statistics on URLs by incrementing counters in Cassandra

7 Example Distribute tweets randomly on multiple queues

8 Example Workers share the load of schemifying tweets

9 Example Desire all updates for same URL go to same worker

10 Message locality. Because: No transactions in Cassandra (and no atomic increments at the time); More effective batching of updates

11 Implementing message locality: Have a queue for each consuming worker; Choose queue for a URL using consistent hashing

12 Example Workers choose queue to enqueue to using hash/mod of URL

13 Example All updates for same URL guaranteed to go to same worker
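
The queue selection described on slides 11 to 13 can be sketched as below. This is a minimal illustration, not the code used at Twitter: the function name `queue_for`, the choice of MD5 as the hash, the four queues, and the sample URLs are all assumptions made for the example.

```python
import hashlib


def queue_for(url: str, num_queues: int) -> int:
    """Pick a queue index from a stable hash of the URL (hash/mod)."""
    # Python's built-in hash() is salted per process, so a stable digest
    # is used to make every producer agree on the same queue.
    digest = hashlib.md5(url.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_queues


# One queue per consuming worker; 4 is an arbitrary number for the sketch.
queues = [[] for _ in range(4)]
updates = ["http://a.example/x", "http://b.example/y", "http://a.example/x"]
for url in updates:
    queues[queue_for(url, len(queues))].append(url)
```

Because the index depends only on the URL and the queue count, both updates for `http://a.example/x` land in the same queue and are therefore seen by the same worker.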

14 Adding a worker

15 Adding a worker: Deploy; Reconfigure/redeploy
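
One way to see why adding a worker forces a reconfigure/redeploy of the producers: with plain hash/mod, changing the queue count changes the queue chosen for most URLs. The sketch below assumes MD5 as the hash and uses made-up URLs; `queue_for` and `moved_urls` are hypothetical helper names.

```python
import hashlib


def queue_for(url: str, num_queues: int) -> int:
    """Hash/mod queue selection using a stable digest."""
    digest = hashlib.md5(url.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_queues


def moved_urls(urls, old_count, new_count):
    """Return the URLs whose queue changes when the queue count changes."""
    return [u for u in urls
            if queue_for(u, old_count) != queue_for(u, new_count)]


urls = [f"http://example.com/page/{i}" for i in range(1000)]
moved = moved_urls(urls, 4, 5)
print(f"{len(moved)} of {len(urls)} URLs change queues going from 4 to 5 workers")
```

With uniformly distributed hashes, roughly four in five URLs would be expected to move in this 4-to-5 case, so every producer has to switch to the new mapping at the same time to keep all updates for a URL on one worker.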

16 Problems: Scaling is painful; Poor fault-tolerance; Coding is tedious

17 What we want: Guaranteed data processing; Horizontal scalability; Fault-tolerance; No intermediate message brokers!; Higher level abstraction than message passing; Just works

18 Storm: Guaranteed data processing; Horizontal scalability; Fault-tolerance; No intermediate message brokers!; Higher level abstraction than message passing; Just works

19 Use cases: Stream processing; Distributed RPC; Continuous computation

20 Storm Cluster

21 Storm Cluster Master node (similar to Hadoop JobTracker)

22 Storm Cluster Used for cluster coordination

23 Storm Cluster Run worker processes

24 Starting a topology

25 Killing a topology

26 Concepts: Streams; Spouts; Bolts; Topologies

27 Streams: Unbounded sequence of tuples (diagram: a row of tuples)

28 Spouts Source of streams

29 Spout examples: Read from Kestrel queue; Read from Twitter streaming API

30 Bolts Processes input streams and produces new streams

31 Bolts: Functions; Filters; Aggregation; Joins; Talk to databases

32 Topology Network of spouts and bolts

33 Tasks Spouts and bolts execute as many tasks across the cluster

34 Stream grouping When a tuple is emitted, which task does it go to?

35 Stream grouping: Shuffle grouping: pick a random task; Fields grouping: consistent hashing on a subset of tuple fields; All grouping: send to all tasks; Global grouping: pick task with lowest id
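
The four groupings can be sketched as functions that map a tuple to a list of target task ids. This is an illustrative model only, not Storm's API: the function names are hypothetical, tuples are modeled as dicts, task ids are assumed to run from 0 to `num_tasks - 1`, and the fields grouping uses a simple CRC32 hash/mod as a stand-in for the hashing named on the slide.

```python
import random
import zlib


def shuffle_grouping(tup, num_tasks, rng=random):
    """Shuffle grouping: pick a random task."""
    return [rng.randrange(num_tasks)]


def fields_grouping(tup, num_tasks, fields):
    """Fields grouping: equal values of the chosen fields reach the same task."""
    key = "|".join(str(tup[f]) for f in fields).encode("utf-8")
    return [zlib.crc32(key) % num_tasks]


def all_grouping(tup, num_tasks):
    """All grouping: send the tuple to every task."""
    return list(range(num_tasks))


def global_grouping(tup, num_tasks):
    """Global grouping: send the tuple to the task with the lowest id."""
    return [0]


tweet = {"url": "http://a.example", "user": "bob"}
print(fields_grouping(tweet, 4, ["url"]))  # same task for every tuple with this url
```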

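36 Topology (diagram: a network of spouts and bolts whose edges are labeled with stream groupings; labels in extraction order: shuffle, [id1, id2], shuffle, [url], shuffle, all)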

37 Streaming word count TopologyBuilder is used to construct topologies in Java

38 Streaming word count Define a spout in the topology with parallelism of 5 tasks

39 Streaming word count Split sentences into words with parallelism of 8 tasks

40 Streaming word count: Consumer decides what data it receives and how it gets grouped; Split sentences into words with parallelism of 8 tasks

41 Streaming word count Create a word count stream
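
The slides' code was not captured in this transcript. As a rough stand-in, the data flow of the word count topology (sentences, then a split step, then a count step grouped by word) can be simulated in plain Python. Nothing here is Storm's API: `run_word_count` and `split_sentence` are hypothetical names, and the number of count tasks is an assumption, since the slides give parallelism only for the spout and the split step.

```python
from collections import Counter
import zlib

COUNT_TASKS = 12  # assumed; the slides do not state the count step's parallelism


def split_sentence(sentence):
    """Split step: stateless, so any task could process any sentence."""
    return sentence.split()


def run_word_count(sentences):
    """Yield (word, running_count) pairs, i.e. the word count stream."""
    # One private counter per count task, as each task only sees its own words.
    counters = [Counter() for _ in range(COUNT_TASKS)]
    for sentence in sentences:
        for word in split_sentence(sentence):
            # Group by the word field so a given word always hits the same task.
            task = zlib.crc32(word.encode("utf-8")) % COUNT_TASKS
            counters[task][word] += 1
            yield word, counters[task][word]


for word, count in run_word_count(["the cow jumped", "the moon"]):
    print(word, count)
```

Grouping by word is what makes the per-task counters correct: if the same word could reach two different count tasks, each would hold only a partial count.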

42 Streaming word count splitsentence.py

43 Streaming word count

44 Streaming word count Submitting topology to a cluster

45 Streaming word count Running topology in local mode

46 Demo

47 Traditional data processing

48 Traditional data processing Intense processing (Hadoop, databases, etc.)

49 Traditional data processing Light processing on a single machine to resolve queries

50 Distributed RPC Distributed RPC lets you do intense processing at query-time

51 Game changer

52 Distributed RPC Data flow for Distributed RPC

53 DRPC Example Computing reach of a URL on the fly

54 Reach Reach is the number of unique people exposed to a URL on Twitter

55 Computing reach (diagram: URL, then its tweeters, then each tweeter's followers, then distinct followers, then count, giving reach)
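
The steps in the diagram can be sketched on a single machine as follows. The in-memory dictionaries and the names `reach`, `tweeters_of`, and `followers_of` are assumptions for illustration; in the talk these lookups are spread across a topology and backed by databases.

```python
def reach(url, tweeters_of, followers_of):
    """Count the distinct followers of everyone who tweeted the URL."""
    distinct_followers = set()
    for tweeter in tweeters_of.get(url, []):
        # A follower shared by several tweeters is only counted once.
        distinct_followers.update(followers_of.get(tweeter, []))
    return len(distinct_followers)


# Made-up sample data.
tweeters_of = {"http://a.example": ["alice", "bob"]}
followers_of = {"alice": ["carol", "dave"], "bob": ["dave", "erin"]}
print(reach("http://a.example", tweeters_of, followers_of))  # 3
```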

56 Reach topology

57 Guaranteeing message processing Tuple tree

58 Guaranteeing message processing A spout tuple is not fully processed until all tuples in the tree have been completed

59 Guaranteeing message processing If the tuple tree is not completed within a specified timeout, the spout tuple is replayed

60 Guaranteeing message processing Reliability API

61 Guaranteeing message processing Anchoring creates a new edge in the tuple tree

62 Guaranteeing message processing Marks a single node in the tree as complete
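
The semantics on slides 57 to 62 (anchoring adds a node to the tree, acking completes one node, an unfinished tree is replayed after a timeout) can be modeled with a small tracker. This is a naive sketch for illustration, not Storm's reliability API: the class `TupleTreeTracker` and its method names are hypothetical, and time is passed in explicitly to keep the example deterministic.

```python
class TupleTreeTracker:
    """Naive model: remember every unacked tuple id per spout tuple."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.pending = {}  # spout tuple id -> (emit time, set of unacked ids)

    def spout_emit(self, root_id, now):
        self.pending[root_id] = (now, {root_id})

    def anchor(self, root_id, child_id):
        """Anchoring: a new tuple joins the tree of the given spout tuple."""
        self.pending[root_id][1].add(child_id)

    def ack(self, root_id, tuple_id):
        """Mark one node complete; return True once the whole tree is done."""
        _, outstanding = self.pending[root_id]
        outstanding.discard(tuple_id)
        if not outstanding:
            del self.pending[root_id]
            return True
        return False

    def expired(self, now):
        """Spout tuples whose tree did not complete in time (to be replayed)."""
        return [root for root, (started, _) in self.pending.items()
                if now - started > self.timeout]
```

A child must be anchored before its parent is acked; otherwise the tree could look complete while work is still in flight.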

63 Guaranteeing message processing Storm tracks tuple trees for you in an extremely efficient way
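
The slide does not say how this is done. Storm's public documentation describes an XOR-based scheme: each spout tuple has one 64-bit value, every tuple id is XORed into it once when the tuple is created and once when it is acked, and the tree is complete when the value returns to zero. The sketch below illustrates that idea only; the class `AckVal` and its methods are hypothetical names, not Storm code.

```python
import random


class AckVal:
    """Constant-size tracker for one spout tuple's tree."""

    def __init__(self):
        self.value = 0

    def update(self, tuple_id):
        # Called once when a tuple is created and once when it is acked.
        # XORing the same id twice cancels it out.
        self.value ^= tuple_id

    def complete(self):
        return self.value == 0


tracker = AckVal()
ids = [random.getrandbits(64) for _ in range(3)]
for tuple_id in ids:   # three tuples are created in the tree
    tracker.update(tuple_id)
for tuple_id in ids:   # each one is later acked
    tracker.update(tuple_id)
print(tracker.complete())
```

The memory used per spout tuple stays fixed no matter how large the tree grows, at the cost of a very small chance that random ids cancel out before the tree is really finished.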

64 Storm UI

65 Storm UI

66 Storm UI

67 Storm on EC2 One-click deploy tool

68 Documentation

69 State spout (almost done) Synchronize a large amount of frequently changing state into a topology

70 State spout (almost done) Optimizing reach topology by eliminating the database calls

71 State spout (almost done) Each GetFollowers task keeps a synchronous cache of a subset of the social graph

72 State spout (almost done) This works because GetFollowers repartitions the social graph the same way it partitions GetTweeter's stream

73 Future work: Storm on Mesos; Swapping; Auto-scaling; Higher level abstractions

74 Questions?

75 What Storm does: Distributes code and configurations; Robust process management; Provides reliability by tracking tuple trees; Routing and partitioning of streams; Serialization; Fine-grained performance stats of topologies; Monitors topologies and reassigns failed tasks


Basic info: Open sourced September 19th; Implementation is 15,000 lines of code; Used by over 25 companies; >2700 watchers on GitHub