Coroutines & Data Stream Processing

1 Coroutines & Data Stream Processing
An application of an almost forgotten concept in distributed computing
Zbyněk

2 Agenda
- "Strange" Iterator Example
- Coroutines
- MapReduce & Coroutines
- Clockwork
- Conclusion
- Q&A

3 Joining Multiple Iterators To One

4 Joining Iterators - A Traditional Approach

5 Joining Iterators - A Less-Traditional Approach

6 Joining Iterators To One - A Pipe

7 Joining Iterators To One - Coroutines de.matthiasmann.continuations.coiterator
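The code for the joining-iterator slides did not survive extraction. As a rough illustration of the coroutine-style approach, here is a minimal Python sketch, assuming "joining" means concatenating the sources one after another. The helper name `joined` is hypothetical and is not part of the Continuations library.

```python
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def joined(*sources: Iterable[T]) -> Iterator[T]:
    """Yield every item of each source in turn (hypothetical helper)."""
    for source in sources:
        for item in source:
            # Execution suspends here and resumes on the next pull,
            # so no buffer and no helper thread is needed.
            yield item


result = list(joined([1, 2], iter([3]), range(4, 6)))
print(result)
```

The two nested loops are the whole algorithm; the position inside them is kept by the suspended generator rather than by hand-written iterator state.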

8 Coroutines

9 Coroutines - Introduction
- A generalization of subroutines
- AKA green threads, co-expressions, fibers, generators
- Some may remember the old Windows event loop
- Behavior similar to that of subroutines
- Coroutines can call other coroutines
- Execution may or may not later return to the point of invocation
- Often demonstrated with the Producer/Consumer scenario

10 Coroutines Conceptual Example - Producer/Consumer
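The original conceptual example is not preserved in this transcript. A minimal sketch of the Producer/Consumer idea using Python generators might look like this; the function names are illustrative only.

```python
def consumer(received):
    """Coroutine: waits for items pushed in with send()."""
    while True:
        item = yield          # suspend until the producer sends something
        received.append(item)


def producer(items, target):
    """Plain function: hands each item over to the consumer coroutine."""
    next(target)              # prime: run the coroutine to its first yield
    for item in items:
        target.send(item)     # control transfers to the consumer and back
    target.close()            # stop the consumer at its current yield


received = []
producer(["a", "b", "c"], consumer(received))
print(received)
```

Each `send` resumes the consumer exactly where it left off, which is the "return to the point of invocation" behavior described on the introduction slide.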

11 Programming Language Support
- Supported by none of the top TIOBE languages (Java, C, C++, PHP, Basic)
- Supported by Go, Icon, Lua, Perl, Prolog, Ruby, Tcl, Simula, Python, Modula-2

12 Coroutines in Java
- Buffers, phases (BufferedJoinedIterator)
- Emulation by threads (PipeJoinedIterator)
- Byte-code manipulation (CoJoinedIterator)
- A need for JVM support; some future JSR?
- For some implementations, see the references
- In this presentation I use the Continuations library developed by Matthias Mann

13 Coroutines Use Case Scenarios
- Iterators
- Producer/Consumer chains
- State machines
- Visitors with loops instead of callbacks
- Pull parsers
- Loggers
- Observers, listeners, notifications
- Generally capable of converting PUSH algorithms to PULL
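To make the state-machine and PUSH-to-PULL items concrete, here is a small Python sketch (all names are hypothetical). The coroutine is driven by pushed tokens, yet its body reads like a pull loop, and the current state is simply the point where it is suspended.

```python
def pair_collector(pairs):
    """Two-state machine written as a loop: expects a key, then a value."""
    while True:
        key = yield      # state 1: waiting for a key
        value = yield    # state 2: waiting for the matching value
        pairs.append((key, value))


pairs = []
machine = pair_collector(pairs)
next(machine)                      # advance to the first yield
for token in ["x", 1, "y", 2, "z"]:
    machine.send(token)            # the caller pushes, the coroutine "pulls"
print(pairs)
```

A callback-based version would need an explicit state variable; here the trailing "z" simply leaves the machine waiting in state 2.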

14 MapReduce & Coroutines

15 Map-Reduce Overview
- PULL, batch processing, offline
- Two-phase computational model: Map and Reduce
- Map - filters, cleans or parses the input records
- Reduce - aggregates the records obtained from Map
- Easily distributable
- Inspired by functional programming
- Benefits - scalable, thread-safe (no race conditions), simple computational model
- Rich libraries of algorithms - e.g. machine learning (Mahout)

16 Map-Reduce Overview - Word Count example
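The slide's own listing is missing from this transcript. A minimal single-process sketch of batch word count in Python, with hypothetical function names and no Hadoop API involved, shows the two phases:

```python
from collections import defaultdict


def map_words(lines):
    """Map phase: parse each input record into (word, 1) pairs."""
    for line in lines:
        for word in line.split():
            yield word.lower(), 1


def reduce_counts(pairs):
    """Reduce phase: group the pairs by key, then aggregate each group."""
    grouped = defaultdict(list)
    for word, one in pairs:
        grouped[word].append(one)
    return {word: sum(ones) for word, ones in grouped.items()}


counts = reduce_counts(map_words(["to be or", "not to be"]))
print(counts)
```

Both phases are plain loops that pull their input, which only works because the whole data set is available up front.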

17 Map-Reduce and Data Streams
- Could we write the same code for data streams?
- We could keep thinking in the MR paradigm
- Perhaps it would inherit the nice MR properties
- Sadly, streaming algorithms are inherently PUSH, incremental or event-driven, i.e. callbacks instead of loops
- The WordCount MR solution has 2 simple loops
- It is a typical Producer/Consumer problem
- Coroutines should help us!
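As a sketch of how coroutines reconcile the two styles, the word count can keep its loops while the input arrives by push. This uses Python generators with illustrative names and is not Clockwork code.

```python
def start(coroutine):
    """Advance a new coroutine to its first yield so it can accept send()."""
    next(coroutine)
    return coroutine


def word_counter(table):
    """Reduce stage: still a simple loop, but fed one word at a time."""
    while True:
        word = yield
        table[word] = table.get(word, 0) + 1


def word_splitter(target):
    """Map stage: splits each pushed line and forwards the words."""
    while True:
        line = yield
        for word in line.split():
            target.send(word.lower())


table = {}
pipeline = start(word_splitter(start(word_counter(table))))
pipeline.send("to be or")
snapshot = dict(table)        # the result is usable after every event
pipeline.send("not to be")
print(snapshot, table)
```

The counts are up to date after each event instead of only at the end of a batch job.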

18 Clockwork: Adoption of MR for stream data processing

19 Clockwork - Adoption of MR for data stream processing
- Built on top of coroutines (Producer/Consumer)
- Goes further than the original MR concepts
- Easy for anyone familiar with Hadoop (or a similar framework)
- Open-sourced recently by AVAST
- The core is practically production-ready (RC)
- A lot of work still to be done (networking, RPC)

20 Clockwork Execution Model
Execution: Input -> Transformer 1 -> Transformer 2 -> Transformer 3 -> ... -> Transformer N -> Accumulator

21 Clockwork Execution Model - Transformers and Accumulators
- Transformer: Mapper (function), Reducer (loop)
- Accumulator: Tank, Partitioner
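The execution model can be imitated with Python generators: a chain of transformers, where a mapper wraps a function and a reducer is a loop, ending in an accumulator. Every name below is a stand-in chosen for illustration and does not reflect the actual Clockwork API.

```python
def start(coroutine):
    next(coroutine)          # advance to the first yield
    return coroutine


def mapper(function, downstream):
    """Transformer wrapping a function that returns zero or more outputs."""
    while True:
        item = yield
        for output in function(item):
            downstream.send(output)


def pairing_reducer(downstream):
    """Transformer written as a loop: sums two consecutive inputs."""
    while True:
        first = yield
        second = yield
        downstream.send(first + second)


def accumulator(store):
    """End of the chain: keeps whatever reaches it."""
    while True:
        store.append((yield))


def build_execution(stage_factories, sink):
    """Wire stages back to front so that each one knows its downstream."""
    downstream = start(sink)
    for factory in reversed(stage_factories):
        downstream = start(factory(downstream))
    return downstream


results = []
execution = build_execution(
    [
        lambda down: mapper(lambda x: [x * 10], down),                 # transform
        lambda down: mapper(lambda x: [] if x == 20 else [x], down),  # filter
        pairing_reducer,
    ],
    accumulator(results),
)
for value in [1, 2, 3, 4, 5]:
    execution.send(value)
print(results)
```

Feeding the execution is just a series of `send` calls on its first stage; 20 is filtered out, so the reducer pairs 10 with 30 and 40 with 50.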

22 Clockwork Distributed Execution
[Diagram: Execution A, Execution A, Execution B]

23 Clockwork Components
- Reducer: aggregations; a coroutine component (loop)
- Mapper: filtering, transforming, expanding
- Tank: key-value storage, key-values storage, flushing buffer
- Partitioner: routing output to other nodes, partitioning, broadcasting

24 Word Counter in Clockwork
[Diagram: WordSplitter, WordCounter, Table Accum; code panels "Construction:" and "Feeding:"]
Note: The program must run with -javaagent:continuations.jar or be transformed by means of the Continuation ANT task.

25 Word Counter Mapper And Reducer

26 Reducer Per-Key Instances
[Diagram: the input "Counting words cannot be easier" passes through WordSplitter to WordCounter reducer instances (one coroutine instance per key: Counting, words, cannot, be, easier) and on to Table Accum]
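The per-key idea can be sketched in Python as a dispatcher that lazily creates one coroutine per key. The class and function names are hypothetical; how Clockwork manages its instances internally is not shown in the slides.

```python
def count_one_key(key, table):
    """Reducer loop for a single key; its local state is the running count."""
    count = 0
    while True:
        yield                 # wait for the next occurrence of this key
        count += 1
        table[key] = count


class PerKeyReducer:
    """Routes each key to its own coroutine instance, creating it on demand."""

    def __init__(self, table):
        self.table = table
        self.instances = {}

    def send(self, key):
        instance = self.instances.get(key)
        if instance is None:
            instance = count_one_key(key, self.table)
            next(instance)    # advance to the first yield
            self.instances[key] = instance
        instance.send(None)


table = {}
reducer = PerKeyReducer(table)
for word in "Counting words cannot be easier".split():
    reducer.send(word)
reducer.send("words")
print(table)
```

Because every instance sees only its own key, the reducer body needs no map lookups and no shared mutable counter.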

27 Word Counter on One Node
[Diagram: WordSplitter, WordCounter, Table Accum]

28 Distributed Word Counter - Many Mappers, One Reducer
[Diagram: two inputs labelled f, each with its own WordSplitter and Router; a link labelled 2*f leads to a single WordCounter and Table Accum]

29 Distributed Word Counter - Many Mappers with Combiners, One Reducer
[Diagram: two inputs labelled f, each with its own WordSplitter, WordCounter and Router; a link labelled a*f, 0 < a <= 2, leads to a single WordCounter and Table Accum]
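One reading of the a*f label is that local pre-aggregation changes how many messages cross the link to the reducer. The Python sketch below illustrates that reading with a hypothetical `Combiner` that buffers counts and flushes partial sums every few words; the flushing policy is an assumption, not something the slides specify.

```python
class Combiner:
    """Local word counter that emits partial counts in batches."""

    def __init__(self, flush_every, link):
        self.flush_every = flush_every
        self.link = link          # a list standing in for the network link
        self.buffer = {}
        self.seen = 0

    def send(self, word):
        self.buffer[word] = self.buffer.get(word, 0) + 1
        self.seen += 1
        if self.seen % self.flush_every == 0:
            self.flush()

    def flush(self):
        self.link.extend(self.buffer.items())   # one message per distinct word
        self.buffer = {}


link = []
combiner = Combiner(flush_every=3, link=link)
for word in "a b a a b a".split():
    combiner.send(word)

totals = {}
for word, partial in link:        # the final reducer sums the partial counts
    totals[word] = totals.get(word, 0) + partial
print(len(link), totals)
```

Six input words cross the link as four messages here; with few repeated words per batch the saving shrinks accordingly.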

30 Distributed Word Counter - Many Mappers with Combiners, Many Reducers
[Diagram: two inputs labelled f, each with its own WordSplitter, WordCounter and Partitioner; links labelled a*f, 0 < a <= 2, lead to two WordCounter reducers, each with its own Table Accum]
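With several reducers, every occurrence of a key has to reach the same reducer. A common way to do this is hash partitioning, sketched here in Python; whether Clockwork's Partitioner uses this exact scheme is an assumption, and the function names are illustrative.

```python
import zlib


def partition(key, reducer_count):
    """Pick a reducer index from a hash that is stable across processes."""
    return zlib.crc32(key.encode("utf-8")) % reducer_count


def route(pairs, reducer_count):
    """Send each (word, partial count) pair to the reducer owning the word."""
    tables = [dict() for _ in range(reducer_count)]
    for word, count in pairs:
        table = tables[partition(word, reducer_count)]
        table[word] = table.get(word, 0) + count
    return tables


# Partial counts as two combiners might emit them (made-up values).
combined = [("be", 2), ("to", 1), ("be", 1), ("or", 1), ("to", 3)]
tables = route(combined, reducer_count=2)
print(tables)
```

`zlib.crc32` is used instead of the built-in `hash` because Python randomizes string hashes per process, which would send the same word to different reducers on different nodes.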

31 Reduce-only Setup - The "Nerdiest" clock in the world
[Diagram: 1-msec ticks, Sec Wheel, Minute Wheel, Hour Wheel, Day Wheel, Dummy Accum; code panels "Construction:" and "Feeding:"]

32 Reduce-only Setup - Reducer as a Clock Wheel
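The slide's listing is not preserved. As an approximation of the idea, a wheel can be written in Python as a reducer loop that absorbs a fixed number of input ticks and then emits one tick to the next wheel. The names and the logging are illustrative only.

```python
def start(coroutine):
    next(coroutine)          # advance to the first yield
    return coroutine


def wheel(size, downstream, log, name):
    """Reducer loop: absorbs `size` input ticks, then emits one tick."""
    while True:
        for _ in range(size):
            yield                      # wait for one incoming tick
        log.append(name)               # a full turn of this wheel
        if downstream is not None:
            downstream.send(None)      # tick the next, slower wheel


log = []
minutes = start(wheel(60, None, log, "minute"))
seconds = start(wheel(1000, minutes, log, "second"))

for _ in range(120_000):               # two minutes of 1-msec ticks
    seconds.send(None)

print(log.count("second"), log.count("minute"))
```

Each wheel is the same loop with a different size, so hour and day wheels would just be two more links in the chain.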

33 Map-only Setup - HTTP pipeline
[Diagram: HttpRequest Decoder, MyHandler, HttpResponse Encoder, Channel Writer; code panels "Construction:", "Feeding:" and "Type-safe execution construction:"]

34 Many-Maps-One-Reduce Setup - HTTP pipeline
[Diagram: HttpRequest Decoder, Http Chunk Aggregator, MyHandler, HttpResponse Encoder, Channel Writer; code panels "Construction:" and "Feeding:"]

35 Naive Bayes Classification Map Reduce Job

36 Naive Bayes Classification - Mapper

37 Naive Bayes Classification - Reducer

38 Naive Bayes Classification - Deployment
- learning: (weight, height, foot) -> sex
- guessing: (weight, height, foot) -> ?
[Diagram: InstanceMapper, InstanceReducer, StatAccumulator]
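The mapper and reducer listings are missing, so the exact model is unknown. Assuming numeric features are summarized by a per-class mean and variance (as in Gaussian Naive Bayes), the learning side can be made incremental with Welford's running-statistics update, sketched here in Python. The class name and the sample values are made up for illustration.

```python
class RunningStats:
    """Mean and sample variance updated one observation at a time."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0        # sum of squared deviations from the current mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self):
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0


# One statistic per (class, feature); only a single pair is shown here.
height_stats = RunningStats()
for height in [1.0, 2.0, 3.0, 4.0]:    # illustrative values, not real data
    height_stats.update(height)
print(height_stats.mean, height_stats.variance)
```

No observation is stored, so the statistics can be read at any point in the stream and used for guessing while learning continues.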

39 Conclusion

40 Conclusion
- Distributed stream processing made easier
- Some techniques known from offline MR can be adopted more or less directly
- Requires incremental algorithms and models
- Many deployment options and flushing strategies
- A lot of work still to be done: communication protocol, machine learning and statistics algorithms, management tools, documentation...

41 Thanks for your attention! Q&A

42 Abstract This presentation deals with the concept of coroutines and its applicability to stream data processing. Although coroutines are rarely used in today's applications, they have been around since the early days of digital computing. Surprisingly, coroutines combine nicely with the map-reduce paradigm, which is used frequently in cloud computing and big data processing. In contrast to the traditional map-reduce concept, which is designed for offline job processing, the coroutine/map-reduce hybrid is primarily targeted at real-time event processing. Clockwork, an open-source library developed at Avast, combines these two concepts and allows a programmer to write a real-time stream analysis as if they were writing a traditional map-reduce job for, say, Hadoop. The presentation focuses mainly on coding and samples and shows how to program applications ranging from simple real-time statistics to more advanced tasks.

43 References
