Building Interactive Distributed Processing Applications at a Global Scale

Size: px

Start display at page:

Download "Building Interactive Distributed Processing Applications at a Global Scale"

Jesse Hensley
5 years ago
Views:

1 Building Interactive Distributed Processing Applications at a Global Scale Shadi A. Noghabi Prelim Committee: Prof. Roy Campbell (advisor), Prof. Indy Gupta (advisor), Prof. Klara Nahrstedt, Dr. Victor Bahl 1

2 Many Interactive Applications < few seconds 100s of PBs data < few minutes Trillions of events < few milliseconds Millions - billions of devices 2

3 Low Latency at Large Scale Latency-sensitive, reaching low latency is an important for interactivity 100ms latency (Amazon) 1% loss in revenue $4M per millisec! * Large scale processing, at large and global scales My focus: Building Interactive systems at a production scale * 3

4 At Scale, Low Latency is Challenging Heterogeneity Data Machines Operations Dynamism Workload Environment Distribution State & Replication Imbalance 4

5 Thesis Statement Latency-driven designs can be used to build production infrastructures for interactive applications at a global scale while addressing myriad challenges on heterogeneity, dynamism, state, and imbalance. 5

6 Latency-driven Design Achieving latency has the highest priority PACELC 1,2 Latency Latency Latency Consistency Availability Throughput Isolation 1. Abadi, Consistency Tradeoffs in Modern Distributed Database System Design, Computer Brewer, Gilbert, and Lynch Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services, Sigact News

7 Related Work Latency-Driven in other frameworks Trend of NoSQL systems: Cassandra, Amazon s DynamoDB, Pileus. Latency-Driven does not fit all Consistency driven: Consistent Storage: BigTable, Spanner, Yahoo s PNUTS, Espresso Exactly-once Stream processing: Flink, Millwheel, Spark Streaming, Trident Throughput driven: Batch processing: Hadoop, Spark, Tez, Pig and Hive Micro-batch streaming: Spark Streaming, media processing High-throughput Storage: HDFS and GFS 7

8 Latency-driven Techniques Background processing Locality Partitioning & parallelism Tiered data access Load balancing Latency Driven Adaptive Scaling Opportunistic processing 8

9 Latency-driven at All Layers Edge Edge DC 1 DC 2 Storage DC n Processing Networking Edge Ambry [SIGMOD 16] Samza [VLDB 17] Steel Edge [HotCloud 18] FreeFlow [HotNets 17] General Used Main Integrated container production ~15 Companies, networking with system production with >4 applicable years 100s Azure & of 500M applications to and Docker, users Windows at Kubernetes, at LinkedIn IoT etc. 9

10 Networking Processing Storage Latency-driven Techniques Background processing Locality Partitioning & parallelism Ambry [SIGMOD 16] Samza [VLDB 17] Tiered data access Latency Driven Adaptive Scaling Steel [HotCloud 18] FreeFlow Load balancing Opportunistic processing [HotNets 17] 10

11 Networking Processing Storage Latency-driven Techniques Background processing Locality Partitioning & parallelism Ambry [SIGMOD 16] Samza [VLDB 17] Tiered data access Latency Driven Adaptive Scaling Steel [HotCloud 18] FreeFlow Load balancing Opportunistic processing [HotNets 17] 11

12 Samza: Stateful Scalable Stream Processing at LinkedIn VLDB 17 Shadi A. Noghabi*, Kartik Paramasivam^, Yi Pan^, Navina Ramesh^, Jon Bringhurst^, Indranil Gupta*, Roy Campbell* * University of Illinois at Urbana-Champaign ^ LinkedIn Corp. +

13 Stream Processing* Interactive feeds or ads Security & Monitoring Internet of Things *Stream processing is the processing of data in motion, i.e., computing on data immediately as it is produced 13

14 Stream Applications need State Access Stream (IP, access time) DoS Identifier IP Count DoS Stream (Attacker IP) State: durable data read/written along with the processing Many cases need large state Aggregation: aggregating counts over a window Join: two (or more) streams over a window Other: machine learning model 14

15 Scale of Processing Large scale of input, in Kafka alone: 2.1 Trillion msg/day, 16 Million msg/sec peaks 0.5 PB in, 2 PB out per day Many applications: 100s of production applications Over 10,000 containers Large scale of state Several 100s of TBs for a single application 15

Stateful Stream Processing Need to handle state with low latency, at large-scale, and with correct (consistent) results http://samza.apache.

16 Stateful Stream Processing Need to handle state with low latency, at large-scale, and with correct (consistent) results Opensource Apache project Powers hundreds of apps in LinkedIn s production In use at LinkedIn, Uber, Metamarkets, Netflix, Intuit, TripAdvisor, VMware, Optimizely, Redfin, etc. 16

17 Stateful Stream Processing Processing Model Input partitioning Parallel and independent tasks Locality aware data passing Stateful Local state Incremental checkpointing 3-Tier caching Parallel recovery Host Stickiness Compaction 17

18 Stateful Stream Processing Processing Model Input partitioning Parallel and independent tasks Locality aware data passing Stateful Local state Incremental checkpointing 3-Tier caching Parallel recovery Host Stickiness Compaction 18

19 Processing from a Partitioned Source Stream of events Ex: page access, ad views, etc. 19

20 Processing from a Partitioned Source Stream of events Partition 1 Partition 2 Ex: page access, ad views, etc. Partition k 20

21 Processing from a Partitioned Source Stream of events Partition 1 Partition 2 Ex: page access, ad views, etc. Partition k 21

22 Processing from a Partitioned Source Stream of events Partition 1 Task 1 Ex: page access, ad views, etc. Partition 2 Partition k Task 2 Partitioning & parallelism Task k 22

23 Multi-Stage Dataflow Locality Application logic: Count number of Page accesses for each ip domain in a 5 minute window Page Access in stream Application DAG in tasks DoS Stream out stream Window count Filter SendTo Data Parallelism: each task performs the entire pipeline on its own chunk of input data 23

24 Stateful Stream Processing Processing Model Input partitioning Parallel and independent tasks Locality aware data passing Stateful Local state Incremental checkpointing 3-Tier caching Parallel recovery Host Stickiness Compaction 24

25 Stateful Stream Processing Processing Model Input partitioning Parallel and independent tasks Locality aware data passing Stateful Local state Incremental checkpointing 3-Tier caching Parallel recovery Host Stickiness Compaction 25

26 Stateful Processing Samza Application Task 1 Task 2 Task k Store count of access per ip domain State: Ip access count But, using a remote store is slow 26

27 Stateful Processing Samza Application Task 1 Task 2 Task k State: Ip access count 27

28 Stateful Processing Samza Application Task 1 Task 2 Task k [1-100] [ ] [ ] Local State: use the local storage of tasks State: Ip access count Each task, state corresponding to their partition. Ex: task 1, processing input [1-100], state [1-100] Locality 28

29 Stateful Processing Samza Application Task 1 Task 2 Task k [1-100] [ ] [ ] Local Storage: in-memory: fast, but low capacity (GBs) on-disk: high capacity (TBs) and persistent, but slow Disk-mem: Disk (persistent store) + memory (cache) Fast, high capacity, and persistent Tiered data access 29

30 Stateful Processing Samza Application Task 1 Task 2 Task k [1-100] [ ] [ ] What about failures? How to not lose state? 30

31 Fault-Tolerance Task 1 Task 2 Task k Changes saved to a durable change log Recovery by replay change-log Periodically batch & flush changes Changelog e.g., Kafka log compacted topic... partition k partition 2 partition 1

32 Fault-Tolerance X Task 1 Task 2 Task k Changes saved to a durable change log Recovery by replay change-log Periodically batch & flush changes Changelog e.g., Kafka log compacted topic... partition k partition 2 partition 1

33 Fault-Tolerance Task 1 Task 2 Task k X = 10 Changes saved to a durable change log Recovery by replay change-log Periodically batch & flush changes Changelog e.g., Kafka log compacted topic... partition k partition 2 partition 1

34 Fault-Tolerance Background processing Task 1 Task 2 Task k Changes saved to a durable change log Recovery by replay change-log Periodically batch & flush changes Changelog e.g., Kafka log compacted topic... X = 10 partition k partition 2 partition 1

35 But, what is the cost? Latency Consistency Availability 35

36 But, what is the cost? Latency Consistency Replayable input + Consistent snapshot Availability Longer failure recovery 36

37 Consistency Guarantees Task 1 Offset 1005 Task 2 Task k Offsets e.g., Kafka topic Offset 1005 Changelog e.g., Kafka topic... X = 10 partition k partition 2 partition 1

38 Availability Optimizations Availability is compromised. To improve we use: Parallel recovery Background compaction Partitioning & parallelism Background processing Host stickiness Locality 38

39 Fast Restarts with Host Stickiness Input Stream Job Change-log Task-1 Task-4 Task-2 Task-3 Host-A Host-B Host-C Durable : Task-Container-Host Mapping Task 1, Task 4 -> Host-A Task 2 -> Host-B Task 3 -> Host-C Host Stickiness in YARN : - Try to place task on same host after restart - Minimize state rebuilding Overhead

40 Related Work Stateless: do not support state Apache Storm [SIGMOD 14], Heron [SIGMOD 15], S4 [ICDM 10] Remote Store: relying on external storage Trident, MillWheel [VLDB 13], Dataflow [VLDB 15] High latency overhead per input message but, fast recovery (better availability) Local state, but, full state checkpointing: Flink [TCDE 15], Spark Streaming [HotCloud 12], IBM Streams [VLDB 16], Streamscope [NDSI 16], System S [DEBS 11] Does not scale for large application state but, easier repeatable results at failure 40

41 Evaluation 41

42 Evaluation Setup Production Cluster 500 node YARN cluster real world applications Small Cluster 8 node cluster: 64GB RAM, 24 core CPUs, a 1.6 TB SSD micro-benchmarks Read-only workload ~ join with table Read-write workload ~ aggregation over time 42

43 Local State -- Latency > 2 orders of magnitude slower compared to local state Stores: in-mem on-disk disk-mem remote changelog adds minimal overhead on disk w/ caching comparable with in memory Fault-tolerance: None changelog (Clog) remote 43

44 Failure Recovery Sticky Sticky Growing linearly with size of state Almost constant overhead with host stickiness Overhead independent of % of failures (parallel recovery) 44

45 Summary of State in Samza Pros 100x better latency Better resource utilization no need for remote DB resources Parallel and independent processing Cons Dependent to input partitioning Does NOT work when state is large and not co-partitionable in input stream Auto-scaling becomes harder Recovery is slower (vs remote)

46 Conclusion Need to handle state with low latency, at large-scale, and with correct (consistent) results Processing Model Input partitioning Parallel and independent tasks Locality aware data passing Stateful Local state Incremental checkpointing 3-Tier caching Parallel recovery Host Stickiness Compaction Locality Background processing Partitioning & parallelism 46

47 Networking Processing Storage Latency-driven Techniques Background processing Locality Partitioning & parallelism Ambry [SIGMOD 16] Samza [VLDB 17] Tiered data access Latency Driven Adaptive Scaling Steel [HotCloud 18] FreeFlow Load balancing Opportunistic processing [HotNets 17] 47

48 Networking Processing Storage Latency-driven Techniques Background processing Locality Partitioning & parallelism Ambry [SIGMOD 16] Samza [VLDB 17] Tiered data access Latency Driven Adaptive Scaling Steel [HotCloud 18] FreeFlow Load balancing Opportunistic processing [HotNets 17] 48

49 Steel: Unified and Optimized Edge- Cloud Environment HotCloud 18 Shadi A. Noghabi*, John Kolb^, Peter Bodik**, Eduardo Cuervo**, Indranil Gupta*, Roy Campbell* * University of Illinois at Urbana-Champaign ^ University of California, Berkeley ** Microsoft Research 49

50 Cloud is Not an One-Size-Fits-All Latency < 80ms < 20ms The Cloud is simply too far! > 70 ms round trip time < 10 ms 50

51 Cloud is Not an One-Size-Fits-All Latency Resources < 80ms < 10 ms < 20ms Bandwidth 10s 100s GB/s Y Battery Energy Heating Not enough resources to use the Cloud 51

52 Cloud is Not an One-Size-Fits-All < 80ms Latency Resources < 20ms Bandwidth 10s 100s GB/s Y Privacy & Security < 10 ms Battery Energy Heating 52

53 Edge Computing Edge 3 Cloud Edge 1 Edge 2 Satyanarayanan, M., Bahl, P., et. al. The Case for VM-based Cloudlets in Mobile Computing, Pervasive Computing

54 However Industry is at its infancy of building Edge-Cloud applications No proper unification of Edge & Cloud Hard to configure, deploy, and monitor Manual and redundant optimizations Integrated with production Azure services Steel: A unified edge cloud framework with modular and automated optimizations 54

55 Applications Steel Abstraction (Logical Spec) A B C D Fabric Compile Deploy Monitor & Analyze Optimization Modules Load Placement Communication Balancing Edge-Cloud Ecosystem 55

56 Optimizer Modules Placement Where (edge/cloud) to place? Adapt to long-term changes Communication configure edge-cloud links Adapt to short-term spikes Location and vicinity aware Background profiling & what-if analysis Background batching & compression Partitioned and parallel optimizations Change strategy opportunistically (available resources) Locality Background processing Partitioning & parallelism Opportunistic processing 56

57 Networking Processing Storage Latency-driven Techniques Background processing Locality Partitioning & parallelism Ambry [SIGMOD 16] Samza [VLDB 17] Tiered data access Latency Driven Adaptive Scaling Steel [HotCloud 18] FreeFlow Load balancing Opportunistic processing [HotNets 17] 57

58 Networking Processing Storage Latency-driven Techniques Background processing Locality Partitioning & parallelism Ambry [SIGMOD 16] Samza [VLDB 17] Tiered data access Latency Driven Adaptive Scaling Steel [HotCloud 18] FreeFlow Load balancing Opportunistic processing [HotNets 17] 58

59 Ambry: LinkedIn s Scalable Geo- Distributed Object Store SIGMOD 16 Shadi A. Noghabi *, Sriram Subramanian +, Priyesh Narayanan +, Sivabalan Narayanan +, Gopalakrishna Holla +, Mammad Zadeh +, Tianwei Li +, Indranil Gupta*, Roy H. Campbell* * University of Illinois at Urbana-Champaign SRG: and DPRG: + LinkedIn Corporation Data Infrastructure team 59

60 Massive Media Objects are Everywhere 100s of Millions of users 100s of TBs to PBs From all around the world 60

retrieves objects in an low-latency and scalable manner Ambry

61 How to handle all these objects? We need a geographically distributed system that stores and retrieves objects in an low-latency and scalable manner Ambry In production for >4 years and >500 M users! Open source: 61

62 Geo-Distribution Locality Count success Alice Asynchronous writes Write synchronously only to local datacenter Asynchronously replicate others Reduce user perceived latency Minimize cross-dc traffic Datacenter 1 Datacenter 2 Datacenter 3 item_i 62

63 Geo-Distribution Locality Background processing Alice Asynchronous writes Write synchronously only to local datacenter Asynchronously replicate others Count success Background Replication Reduce user perceived latency Minimize cross-dc traffic Datacenter 1 Datacenter 2 Datacenter 3 item_i item_i item_i 63

64 Other Techniques Ambry is a large production system with many other techniques: Partitioning & Logical data partitioning parallelism Caching, indexing, bloom filters Background load balancing. Tiered data access Load balancing 64

65 Networking Processing Storage Latency-driven Techniques Background processing Locality Partitioning & parallelism Ambry [SIGMOD 16] Samza [VLDB 17] Tiered data access Latency Driven Adaptive Scaling Steel [HotCloud 18] FreeFlow Load balancing Opportunistic processing [HotNets 17] 65

66 Networking Processing Storage Latency-driven Techniques Background processing Locality Partitioning & parallelism Ambry [SIGMOD 16] Samza [VLDB 17] Tiered data access Latency Driven Adaptive Scaling Steel [HotCloud 18] FreeFlow Load balancing Opportunistic processing [HotNets 17] 66

67 FreeFlow: High performance container networking HotNets 16 Tianlong Yu ^, Shadi A. Noghabi *, Shachar Raindel, Hongqiang Liu, Jitu Padhye, and Vyas Sekar ^ * University of Illinois at Urbana-Champaign Microsoft Research ^ Carnegie Mellon University 67

68 Containers are popular Containers provide good portability and isolation 68

69 FreeFlow Opportunistic processing Node 1 Node 2 Locality Container 1 Container 2 Container 3 Shared memory RDMA FreeFlow Network 69

70 Networking Processing Storage Latency-driven Techniques Background processing Locality Partitioning & parallelism Ambry [SIGMOD 16] Samza [VLDB 17] Tiered data access Latency Driven Adaptive Scaling Steel [HotCloud 18] FreeFlow Load balancing Opportunistic processing [HotNets 17] 70

71 Future Directions One global system seamlessly across the globe Holistic and cross-layer approach Latency sensitivity-aware approaches (not treating all objects/apps equally) E.g., dynamic use of compression, erasure coding, compaction in storage for old objects E.g., multi-tenant stream processing environments Auto-pilot systems coping with changes automatically and dynamically 71

72 Thanks to Many Collaborators Advisors: Indranil Gupta and Roy Campbell LinkedIn: Sriram Subramanian, Priyesh Narayanan, Kartik Paramasivam, Yi Pan, Navina Ramesh, et al. Microsoft Research: Victor Bahl, Peter Bodik, Eduardo Cuervo, Hongqiang Liu, Jitu Padhye, et. al. 72

73 Publications 1. Steel: Unified Management and Optimization of Edge-Cloud Applications, SA Noghabi, J Kolb, P Bodik, E Cuervo, I Gupta, RH Campbell, (in progress) 2. Steel: Simplified Development and Deployment of Edge-Cloud Applications, SA Noghabi, J Kolb, P Bodik, E Cuervo, HotCloud Samza: Stateful Scalable Stream Processing at LinkedIn, SA Noghabi, K Paramasivam, Y Pan, N Ramesh, J Bringhurst, I Gupta, RH Campbell, VLDB Ambry: LinkedIn s Scalable Geo-Distributed Object Store, SA Noghabi, S Subramanian, P Narayanan, S Narayanan, G Holla, M Zadeh, T Li, I Gupta, RH Campbell, SIGMOD Freeflow: High performance container networking, T Yu, SA Noghabi, S Raindel, H Liu, J Padhye, V Sekar, HotNets Steel: Unified Management and Optimization of Edge-Cloud IoT Applications, NSDI poster To edge or not to edge?, F Kalim, SA Noghabi, S Verma, SoCC poster Building a Scalable Distributed Online Media Processing Environment, SA Noghabi, PhD workshop at VLDB Toward Fabric: A Middleware Implementing High-level Description Languages on a Fabric-like Network, SH Hashemi, SA Noghabi, J Bellessa, RH Campbell, ANCS Towards enabling cooperation between scheduler and storage layer to improve job performance, M Pundir, J Bellessa, SA Noghabi, CL Abad, RH Campbell, PDSW

74 Summary of Thesis Latency-driven designs can be used to build production infrastructures for interactive applications at a global scale Background processing Locality Partitioning & parallelism Processing Samza [VLDB 17] Steel [HotCloud 18] Tiered data access Load balancing Latency Driven Adaptive Scaling Opportunisti c processing Storage Ambry [SIGMOD 16] Networking FreeFlow [HotNets 17] 74

75 Back up Slides 75

76 Theoretical Background Latency-driven design can be modeled theoretically with Queueing Theory* System: A hierarchical queuing system Latency: Response time (response_processed - req_enter_time) Techniques: queueing theory models/techniques Parallelism & Partitioning multiple queues, parallel servers Adaptive scaling adding more servers * Gross, Donald. Fundamentals of queueing theory. John Wiley & Sons,

77 Other Techniques Batch + Stream in one System Reprocess parts/all stream or Database bugs, upgrades, logic changes Batch as a finite stream + conflict resolution, scaling & throttling Adaptive Scaling 77

Local State -- Throughput remote state 30-150x worse than local state on

78 Local State -- Throughput remote state x worse than local state on disk w/ caching comparable with in memory changelog adds minimal overhead 78

79 Snapshot vs Changelog 79

Ambry: LinkedIn s Scalable Geo- Distributed Object Store

Ambry: LinkedIn s Scalable Geo- Distributed Object Store Shadi A. Noghabi *, Sriram Subramanian +, Priyesh Narayanan +, Sivabalan Narayanan +, Gopalakrishna Holla +, Mammad Zadeh +, Tianwei Li +, Indranil