Challenges in Data Stream Processing


Università degli Studi di Roma Tor Vergata
Dipartimento di Ingegneria Civile e Ingegneria Informatica
Challenges in Data Stream Processing
Course: Systems and Architectures for Big Data (Corso di Sistemi e Architetture per Big Data), academic year 2016/17
Valeria Cardellini

Challenge 1: Optimize the DSP application
- Apply some transformation to the streaming graph, at design time or at run-time
- Operator reordering: to avoid unnecessary data transfers
  [Figure: A -> B reordered into B -> A]
- Redundancy elimination
  [Figure: operator graph before and after redundancy elimination]
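Operator reordering can be illustrated with a minimal sketch, assuming A is a transformation that only touches a tuple's value and B is a filter that only inspects its key, so the two commute. The names `op_a`, `op_b`, `a_then_b`, and `b_then_a` are hypothetical and do not belong to any DSP framework.

```python
from typing import List, Tuple

Record = Tuple[str, int]

def op_a(r: Record) -> Record:
    """Hypothetical transformation A: changes only the value."""
    return (r[0], r[1] * 10)

def op_b(r: Record) -> bool:
    """Hypothetical filter B: inspects only the key."""
    return r[0] == "keep"

def a_then_b(stream: List[Record]):
    transferred = [op_a(r) for r in stream]        # every tuple crosses the A -> B edge
    result = [r for r in transferred if op_b(r)]
    return result, len(transferred)

def b_then_a(stream: List[Record]):
    transferred = [r for r in stream if op_b(r)]   # only surviving tuples cross the B -> A edge
    result = [op_a(r) for r in transferred]
    return result, len(transferred)

stream = [("keep", 1), ("drop", 2), ("drop", 3), ("keep", 4)]
res1, sent1 = a_then_b(stream)
res2, sent2 = b_then_a(stream)
print(res1 == res2, sent1, sent2)
```

Both orders produce the same result, but running the filter first halves the number of tuples sent between the two operators in this example.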

Challenge 1: Optimize the DSP application
- Operator separation
  [Figure: A split into A1 and A2]
- Fusion
  [Figure: A and B merged into AB]
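Fusion can be sketched as function composition, assuming operators are plain Python functions applied to one tuple at a time. The helper names `fuse` and `run_pipeline` are illustrative only.

```python
from typing import Callable, Iterable, List

Operator = Callable[[int], int]

def fuse(a: Operator, b: Operator) -> Operator:
    """Return a single operator AB that applies a, then b, in one call."""
    def ab(x: int) -> int:
        return b(a(x))
    return ab

def run_pipeline(ops: List[Operator], stream: Iterable[int]) -> List[int]:
    """Apply each operator in order to every tuple of the stream."""
    out = []
    for x in stream:
        for op in ops:
            x = op(x)
        out.append(x)
    return out

def op_a(x: int) -> int:   # hypothetical operator A
    return x + 1

def op_b(x: int) -> int:   # hypothetical operator B
    return x * 2

unfused = run_pipeline([op_a, op_b], [1, 2, 3])
fused = run_pipeline([fuse(op_a, op_b)], [1, 2, 3])
print(unfused, fused)
```

The fused pipeline has one operator instead of two, so in a real deployment there is no queue or network hop between A and B; the output is unchanged.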

Challenge 2: Place the operators
- Operator placement decision: a complex problem
  - Trade communication cost against resource utilization
- When?
  - Initial (static) operator placement
    - Can be more expensive and comprehensive
  - Can also be at run-time
    - Move only relocatable operators
    - Requires operator migration
- We will focus on this issue in a later lesson

Challenge 3: Manage load variations
- Typical stream processing workloads are:
  - with high volume and high rates
  - bursty, with workload spikes not known in advance
- Twitter in 2013: rate of 5,700 tweets per second, but with a significant peak of 144,000 tweets per second
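The gap between the two Twitter figures can be quantified with a short calculation. The variable names are illustrative, and the idle-capacity figure assumes a system statically sized for the peak.

```python
# Figures quoted above for Twitter in 2013 (tweets per second).
typical_rate = 5_700
peak_rate = 144_000

# How many times larger the peak is than the typical rate.
peak_to_typical = peak_rate / typical_rate

# Fraction of capacity left idle at the typical rate if the system
# is statically provisioned for the peak.
idle_fraction = 1 - typical_rate / peak_rate

print(f"peak/typical ratio: {peak_to_typical:.1f}")
print(f"idle capacity at typical rate: {idle_fraction:.0%}")
```

The peak is roughly 25 times the typical rate, so provisioning for the peak would leave about 96% of the capacity unused most of the time.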

Challenge 3: Manage load variations
- Possible approaches:
  - Admission control
  - Static reservation
    - Reserve specific resources in advance
    - Cons: over-provisioning and cost increase
  - Apply dynamic techniques such as load shedding
    - Selectively drop tuples at strategic points (e.g., when CPU usage exceeds a specific limit)
    - Cons: sacrifices accuracy and completeness
  [Figure: a shedder placed in front of operator A]
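A load shedder can be sketched as a filter placed in front of an operator. This is a minimal sketch, assuming each tuple arrives together with a CPU usage reading and that one tuple out of `keep_every` is kept under overload; the function name, the parameters, and the readings are all made up for illustration.

```python
from typing import Iterable, Iterator, Tuple

def shed(stream: Iterable[Tuple[int, float]],
         cpu_limit: float,
         keep_every: int) -> Iterator[int]:
    """Forward every tuple while CPU usage is within the limit;
    above the limit, forward only one tuple out of `keep_every`."""
    overloaded_seen = 0
    for value, cpu_usage in stream:
        if cpu_usage <= cpu_limit:
            yield value
        else:
            if overloaded_seen % keep_every == 0:
                yield value
            overloaded_seen += 1

# (tuple value, CPU usage observed on arrival); hypothetical readings
readings = [(0, 0.5), (1, 0.95), (2, 0.97), (3, 0.99), (4, 0.96), (5, 0.6)]
kept = list(shed(readings, cpu_limit=0.9, keep_every=2))
print(kept)
```

Tuples 2 and 4 are dropped while the CPU is above the limit, which is exactly the loss of completeness listed as the drawback.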

Challenge 3: Manage load variations
- Possible approaches (continued):
  - Use adaptive rate allocation
    - E.g., backpressure: the upstream operator that precedes the bottleneck stores data in an internal buffer to reduce the pressure; backpressure recursively propagates up to the source operators
  - Redistribute load, e.g., determine a new operator placement and relocate operators on computing nodes
    - Cons: available resources could be insufficient
- What else?
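One common way to obtain backpressure is a bounded blocking buffer between two operators. The sketch below assumes that design, with a buffer of two slots, and uses Python threads to stand in for operators. It does not reproduce how any specific DSP system implements backpressure.

```python
import queue
import threading
import time

BUFFER_SIZE = 2                        # assumed capacity of the internal buffer
buffer = queue.Queue(maxsize=BUFFER_SIZE)
consumed = []
stats = {"max_backlog": 0}
DONE = object()                        # sentinel marking the end of the stream

def upstream(n):
    """Upstream operator: put() blocks while the buffer is full,
    so the producer is slowed down to the consumer's pace."""
    for i in range(n):
        buffer.put(i)
    buffer.put(DONE)

def bottleneck():
    """Slow downstream operator (the bottleneck)."""
    while True:
        stats["max_backlog"] = max(stats["max_backlog"], buffer.qsize())
        item = buffer.get()
        if item is DONE:
            break
        time.sleep(0.001)              # simulate slow processing
        consumed.append(item)

producer = threading.Thread(target=upstream, args=(20,))
consumer = threading.Thread(target=bottleneck)
producer.start()
consumer.start()
producer.join()
consumer.join()
print(len(consumed), stats["max_backlog"])
```

No tuple is lost and the backlog never exceeds the buffer capacity. In a chain of operators, each full buffer blocks the operator before it, which is how the pressure propagates back to the sources.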

Exploit data parallelism
- Alternative solution:
  - Detect the bottleneck
  - Use data parallelism (aka operator fission)
    - Apply the SIMD paradigm: concurrent execution of multiple replicas of the same operator on different data portions
- By hand: possible, but cumbersome
[Figure: operator A replaced by Split, parallel replicas of A, and Merge]
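The split / replicas / merge pattern can be sketched with a thread pool. This is a minimal sketch, assuming a stateless operator, round-robin splitting, and a merge step that restores the input order; all function names are illustrative.

```python
from concurrent.futures import ThreadPoolExecutor

def operator_a(x):
    """Hypothetical stateless operator A."""
    return x * x

def split(stream, n):
    """Round-robin split of the stream into n partitions."""
    return [stream[i::n] for i in range(n)]

def replica(partition):
    """One replica of A processing its own data portion."""
    return [operator_a(x) for x in partition]

def merge(partitions, total):
    """Interleave the partitions back into the original order."""
    out = [None] * total
    n = len(partitions)
    for i, part in enumerate(partitions):
        out[i::n] = part
    return out

stream = list(range(10))
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(replica, split(stream, 3)))
merged = merge(results, len(stream))
print(merged)
```

Round-robin splitting only works because A keeps no state; a stateful operator needs key-based partitioning so that all tuples with the same key reach the same replica.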

Elastic stream processing
- Exploit elasticity: acquire and release resources when needed
- Where?
  - At the application layer (i.e., data parallelism)
    - Scale out (or scale in) operators by adding (removing) operator replicas
    - Activate (or deactivate) already replicated operators
  - At the infrastructure layer
    - Scale out (or scale in) computing nodes

Elastic stream processing
- When and how to scale? Open issues that deserve investigation
- A simple example:
  - When: threshold-based
  - How: add or remove one replica at a time; but where to place it?
- Be careful: elasticity overhead is not zero!
  - In most streaming systems: run a new placement decision to take the new replicas into account
  - Dynamic scaling impacts stateful operators
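The threshold-based policy can be sketched as a pure function of the current utilization. The thresholds, the replica bounds, and the utilization trace below are assumptions chosen for illustration, not values from the lecture.

```python
def scale_decision(utilization, replicas,
                   upper=0.8, lower=0.3,
                   min_replicas=1, max_replicas=8):
    """Add one replica above `upper`, remove one below `lower`,
    otherwise keep the current number of replicas."""
    if utilization > upper and replicas < max_replicas:
        return replicas + 1
    if utilization < lower and replicas > min_replicas:
        return replicas - 1
    return replicas

# Hypothetical utilization samples, one per monitoring interval.
trace = [0.5, 0.85, 0.9, 0.6, 0.2, 0.1, 0.1]
replicas = 1
history = []
for u in trace:
    replicas = scale_decision(u, replicas)
    history.append(replicas)
print(history)
```

The policy says nothing about where to place a new replica or how much each scaling action costs, which are the open issues raised above.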

Challenge 4: Self-adapt at run-time
- To cope with a highly dynamic operating environment:
  - Unpredictable workload
  - Computational characteristics of operators not known a priori
  - Need to sustain the load during long provisioning times
  - Node availability, network congestion, ...
- Exploit run-time adaptation capabilities of DSP systems
- What adaptation actions?
  - Scale the number of operator instances, relocate the operators, ...

Self-adaptive deployment
- MAPE (Monitor, Analyze, Plan and Execute)
- Plan phase: how to reconfigure the application deployment
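One iteration of a MAPE loop can be sketched as four small functions. This is a minimal sketch, assuming latency samples per operator as the monitored metric and scale-out as the only adaptation action. The operator names, the latency limit, and the data are hypothetical.

```python
from typing import Dict, List, Tuple

Action = Tuple[str, str]  # (action name, operator name)

def monitor(samples: Dict[str, List[float]]) -> Dict[str, float]:
    """Monitor: aggregate raw latency samples (ms) into a mean per operator."""
    return {op: sum(vals) / len(vals) for op, vals in samples.items()}

def analyze(metrics: Dict[str, float], latency_limit: float) -> List[str]:
    """Analyze: find operators whose mean latency violates the limit."""
    return sorted(op for op, latency in metrics.items() if latency > latency_limit)

def plan(overloaded: List[str]) -> List[Action]:
    """Plan: decide how to reconfigure; here, one extra replica per violator."""
    return [("scale_out", op) for op in overloaded]

def execute(replicas: Dict[str, int], actions: List[Action]) -> None:
    """Execute: apply the planned actions to the deployment."""
    for action, op in actions:
        if action == "scale_out":
            replicas[op] += 1

replicas = {"parse": 1, "count": 1}
samples = {"parse": [10.0, 12.0], "count": [80.0, 120.0]}

actions = plan(analyze(monitor(samples), latency_limit=50.0))
execute(replicas, actions)
print(actions, replicas)
```

Keeping the Plan step separate makes it easy to swap the trivial policy used here for a placement or scaling algorithm without touching monitoring or execution.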

Distributed Storm
- We developed an extension of Storm
- Goals: to provide distributed monitoring, and distributed placement and adaptation capabilities
- Where: large-scale environment
- Code available on GitHub: matnar.github.io/uniroma2-storm/
- V. Cardellini, V. Grassi, F. Lo Presti, M. Nardelli, "Distributed QoS-aware scheduling in Storm", ACM DEBS

Distributed Storm architecture
[Figure: Distributed Storm architecture]


Distributed Storm: monitoring
- QoSMonitor (for each worker node)
  - Estimate network latencies
    - Use a network coordinate system
    - Vivaldi's algorithm: decentralized and gossip-based
  - Monitor QoS attributes
    - Node utilization and availability
- Worker Monitor (for each worker process)
  - Monitor exchanged data rate among the operators
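The idea behind a Vivaldi-style network coordinate system can be sketched with its basic update rule: after measuring the round-trip time to a peer, a node moves its own coordinate along the line joining the two so that their distance gets closer to the measurement. The sketch below assumes 2-D Euclidean coordinates and a constant timestep `delta`. The function name and parameter values are illustrative and are not taken from the Distributed Storm code.

```python
import math
import random

def vivaldi_update(xi, xj, rtt, delta=0.25):
    """Return node i's new coordinate after measuring `rtt` to node j."""
    diff = [a - b for a, b in zip(xi, xj)]
    dist = math.sqrt(sum(d * d for d in diff))
    if dist == 0.0:
        # Coincident nodes: pick an arbitrary direction to separate them.
        diff = [random.uniform(-1.0, 1.0) for _ in xi]
        norm = math.sqrt(sum(d * d for d in diff)) or 1.0
        unit = [d / norm for d in diff]
    else:
        unit = [d / dist for d in diff]
    error = rtt - dist                  # positive: nodes are too close
    return [a + delta * error * u for a, u in zip(xi, unit)]

# Two nodes gossiping a measured RTT of 10 ms (made-up value).
a, b = [0.0, 0.0], [1.0, 0.0]
for _ in range(100):
    a = vivaldi_update(a, b, rtt=10.0)
    b = vivaldi_update(b, a, rtt=10.0)
distance = math.dist(a, b)
print(round(distance, 3))
```

Each node needs only its own coordinate and the peer's, so the estimate is built without a central coordinator; the latency between two nodes that never exchanged a message can then be predicted from the distance between their coordinates.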

Distributed Storm: performance
[Figure: performance under a load spike on a subset of nodes; annotation: ~50%]

Reconfiguration challenges
- Reconfiguring the deployment has a non-negligible cost!
  - Can negatively affect application performance in the short term
  - Application freezing times caused by operator migration and scaling, especially for stateful operators
- Perform reconfiguration only when needed
- Take into account the overhead of migrating and scaling the operators

Challenge 5: Stateful operators
- State complicates things:
  1. Dynamic scaling
  2. Operator re-placement
  3. Recovery from failure
- Each of these impacts the operator state: risk of loss of state!

Approaches for stateful migration
- Most streaming systems do not support stateful processing and migration (e.g., Storm)
  - Developers manage state
  - Typically combined with an external system to store state
  - Design complexity
- Requirements for stateful operator migration:
  - Safety (i.e., preserve the consistency of the operations)
  - Application transparency
  - Minimal footprint

Stateful operator migration
- Two approaches:
  - Pause-and-resume
  - Parallel-track
- Pause-and-resume approach: terminate the migrating task and start it on the new node
  1. Stop the migrating task
  2. Save state
  3. Restore state
  4. Resume stream processing
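The four pause-and-resume steps can be sketched on a toy word-count task. This is a minimal sketch, assuming a dictionary stands in for the external state store and that tuples arriving during the pause are buffered in `pending`. The class, function, and node names are hypothetical.

```python
class CountTask:
    """Toy stateful task: counts occurrences of each word."""
    def __init__(self, node):
        self.node = node
        self.state = {}
        self.running = True

    def process(self, word):
        if not self.running:
            raise RuntimeError("task is stopped")
        self.state[word] = self.state.get(word, 0) + 1

def pause_and_resume(old, new_node, store, pending):
    old.running = False                      # 1. stop the migrating task
    store["snapshot"] = dict(old.state)      # 2. save state
    new = CountTask(new_node)                #    start the task on the new node
    new.state = dict(store["snapshot"])      # 3. restore state
    for word in pending:                     # 4. resume: process tuples buffered
        new.process(word)                    #    while the task was paused
    return new

old_task = CountTask("node-1")
for w in ["a", "b", "a"]:
    old_task.process(w)

store = {}
new_task = pause_and_resume(old_task, "node-2", store, pending=["a", "c"])
print(new_task.node, new_task.state)
```

Between steps 1 and 4 no tuple is processed, so incoming tuples wait in the buffer for the whole duration of the migration.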

Stateful operator migration
- Pause-and-resume drawback: peak in the application latency during the migration
- Parallel-track approach: old and new operator instances run concurrently until the state of both is synchronized and the new instance can safely take over
  - Drawback: requires enhanced mechanisms
- No clear winner

Issues for stateful migration
- How to identify the portion of state to migrate?
  - Expose an API to let the user manually manage the state
  - Support only partitioned stateful operators
    - Partitioned stateful operators store independent state for each sub-stream identified by a partitioning key
  - Automatically determine, on the basis of a partitioning key, the optimal number of state partitions to be used and migrated
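Partitioned state makes the unit of migration explicit. The sketch below assumes a fixed number of partitions and a CRC32 hash of the key to choose the partition. The class and its `export_partition` / `import_partition` methods are hypothetical names, not an API of Storm or of any other system.

```python
import zlib

NUM_PARTITIONS = 4   # assumed fixed number of state partitions

def partition_of(key):
    """Stable mapping from partitioning key to partition id."""
    return zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS

class PartitionedCounter:
    """Keeps an independent counter per key, grouped by partition."""
    def __init__(self):
        self.partitions = {p: {} for p in range(NUM_PARTITIONS)}

    def process(self, key):
        part = self.partitions[partition_of(key)]
        part[key] = part.get(key, 0) + 1

    def export_partition(self, p):
        """Detach one partition so it can be shipped to another replica."""
        data = self.partitions[p]
        self.partitions[p] = {}
        return data

    def import_partition(self, p, data):
        self.partitions[p].update(data)

src, dst = PartitionedCounter(), PartitionedCounter()
for key in ["storm", "flink", "storm", "spark"]:
    src.process(key)

p = partition_of("storm")
dst.import_partition(p, src.export_partition(p))
print(dst.partitions[p])
```

Only the keys of the migrated partition move, and routing for that partition can be switched to the new replica without touching the others.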

Issues for stateful migration
- How to balance the load among multiple stateful replicas?
  - Can use consistent hashing
  - Can use partial key grouping
    - Uses two hash functions, so that a key can be sent to two different replicas instead of one
    - Only available in research prototypes
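Partial key grouping can be sketched as a "two choices" router: each key has two candidate replicas, and every tuple goes to the one that has received less load so far. The sketch assumes salted MD5 digests as the two hash functions and a local load counter; both are illustrative choices.

```python
import hashlib
from collections import defaultdict

def h(key, salt, n):
    """Hash `key` with a salt to one of n replicas."""
    digest = hashlib.md5((salt + key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n

def partial_key_grouping(stream, n):
    load = [0] * n
    placement = defaultdict(set)        # key -> replicas that received it
    for key in stream:
        c1, c2 = h(key, "h1:", n), h(key, "h2:", n)
        target = c1 if load[c1] <= load[c2] else c2
        load[target] += 1
        placement[key].add(target)
    return load, placement

# Skewed stream: one hot key plus a few rare ones (made-up data).
stream = ["hot"] * 100 + ["a", "b", "c", "d"]
load, placement = partial_key_grouping(stream, n=4)
print(load, dict(placement))
```

Because the state of a key may now live on two replicas, a downstream step has to combine the two partial results, which is the price paid for the better balance.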

Elastic stateful migration in Storm
- We developed mechanisms for elastic stateful migration in Storm
[Figure: architecture diagram with the components Nimbus, ElasticityManager, scheduler, MigrationNotifier, ZooKeeper, Supervisors, worker slots, worker processes, DDS, and the network]
- V. Cardellini, M. Nardelli, D. Luzi, "Elastic stateful stream processing in Storm", HPCS

Elastic stateful migration in Storm
- Scaling decisions at the framework level
  - Adapt the number of parallel instances for each application operator
  - Simple threshold-based scaling policy
- Relocate the operator internal state on a different node and enable Storm to change the application deployment at run-time
[Figure: migration timeline with the labels MIGRATION NOTIFIED, MIGRATION MODE, SAVE STATE; new task: MIGRATION MODE, RESTORE STATE (if any), OPERATIONAL MODE; first and second DDS synchronization barriers; "the migrating task can be terminated"; "streams are resumed"]

Performance results
- DSP app: frequent pattern detection
- Elastic scaling and stateful migration improve the application latency
[Figure: two plots over time (s), as the data rate moves between 120, 350, 900, and 250 tweets/s: application latency (ms) with and without E+SM, and number of executors with E+SM; scaling and scheduling events are marked]

Challenge 6: Guarantee fault tolerance
- DSP applications run for long time intervals: failures are unavoidable
- Possible solutions:
  - Active replication
  - Checkpointing
  - Replay logs
  - Hybrid solutions
- These solutions have different trade-offs between runtime cost in the absence of failures and recovery cost
- Large scale complicates things
  - Network partitions and the CAP theorem
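A hybrid of checkpointing and log replay can be sketched on a toy counter. This is a minimal sketch, assuming the input log is durable and that a checkpoint is taken every `checkpoint_every` tuples. The class name and its methods are hypothetical.

```python
import copy

class CheckpointedCounter:
    def __init__(self, checkpoint_every):
        self.checkpoint_every = checkpoint_every
        self.state = {}
        self.offset = 0                  # number of log entries applied so far
        self.checkpoint = ({}, 0)        # (state snapshot, log offset)

    def process(self, key):
        self.state[key] = self.state.get(key, 0) + 1
        self.offset += 1
        if self.offset % self.checkpoint_every == 0:
            self.checkpoint = (copy.deepcopy(self.state), self.offset)

    def recover(self, log):
        """Restore the last checkpoint, then replay the log suffix."""
        state, offset = self.checkpoint
        self.state, self.offset = copy.deepcopy(state), offset
        for key in log[offset:]:
            self.process(key)

log = ["a", "b", "a", "c", "a"]
op = CheckpointedCounter(checkpoint_every=2)
for key in log:
    op.process(key)
expected = dict(op.state)

op.state = {}                            # simulate a crash that wipes the state
op.recover(log)
print(op.state == expected, op.checkpoint[1])
```

A shorter checkpoint interval raises the cost paid when nothing fails but shortens the log suffix to replay after a failure, which is the trade-off between runtime cost and recovery cost.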


References
- M. Hirzel, R. Soulé, S. Schneider, B. Gedik, R. Grimm, "A catalog of stream processing optimizations", ACM Comput. Surv., 2014. http://bit.ly/2rtlljf
- T. Heinze, L. Aniello, L. Querzoni, Z. Jerzak, "Cloud-based data stream processing", Proc. ACM DEBS
- V. Cardellini, V. Grassi, F. Lo Presti, M. Nardelli, "Distributed QoS-aware scheduling in Storm", Proc. ACM DEBS
- V. Cardellini, M. Nardelli, D. Luzi, "Elastic stateful stream processing in Storm", Proc. HPCS
- B. Gedik, S. Schneider, M. Hirzel, K.-L. Wu, "Elastic scaling for data stream processing", IEEE Trans. Parallel Distrib. Syst. 25, 6
