Building Interactive Distributed Processing Applications at a Global Scale Shadi A. Noghabi Prelim Committee: Prof. Roy Campbell (advisor), Prof. Indy Gupta (advisor), Prof. Klara Nahrstedt, Dr. Victor Bahl 1
Many Interactive Applications
100s of PBs of data: < a few seconds
Trillions of events: < a few minutes
Millions to billions of devices: < a few milliseconds 2
Low Latency at Large Scale
Latency-sensitive: achieving low latency is essential for interactivity. 100ms of added latency cost Amazon 1% in sales* (roughly $4M per millisecond!)
Large scale: processing at large and global scales
My focus: building interactive systems at a production scale
* https://blog.gigaspaces.com/amazon-found-every-100ms-of-latency-cost-them-1-in-sales/ 3
At Scale, Low Latency is Challenging
Heterogeneity: data, machines, operations
Dynamism: workload, environment
Distribution: state & replication
Imbalance 4
Thesis Statement Latency-driven designs can be used to build production infrastructures for interactive applications at a global scale while addressing myriad challenges of heterogeneity, dynamism, state, and imbalance. 5
Latency-driven Design
Achieving low latency has the highest priority (PACELC 1,2): latency is favored over consistency, availability, throughput, and isolation
1. Abadi, Consistency Tradeoffs in Modern Distributed Database System Design, IEEE Computer 2012
2. Gilbert and Lynch, Brewer's Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services, SIGACT News 2002 6
Related Work
Latency-driven in other frameworks: the trend of NoSQL systems: Cassandra, Amazon's DynamoDB, Pileus
Latency-driven does not fit all:
Consistency-driven: consistent storage (BigTable, Spanner, Yahoo's PNUTS, Espresso); exactly-once stream processing (Flink, MillWheel, Spark Streaming, Trident)
Throughput-driven: batch processing (Hadoop, Spark, Tez, Pig, and Hive); micro-batch streaming (Spark Streaming, media processing); high-throughput storage (HDFS and GFS) 7
Latency-driven Techniques
Background processing, locality, partitioning & parallelism, tiered data access, load balancing, adaptive scaling, opportunistic processing 8
Latency-driven at All Layers (edge, DC 1 ... DC n)
Storage: Ambry [SIGMOD 16]: main production system at LinkedIn for >4 years, 500M users
Processing: Samza [VLDB 17]: used in production at ~15 companies, 100s of applications at LinkedIn
Edge: Steel [HotCloud 18]: integrated with production Azure & Windows IoT
Networking: FreeFlow [HotNets 16]: general container networking, applicable to Docker, Kubernetes, etc. 9
Latency-driven Techniques: background processing, locality, partitioning & parallelism, tiered data access, load balancing, adaptive scaling, opportunistic processing
Processing: Samza [VLDB 17] | Edge: Steel [HotCloud 18] | Storage: Ambry [SIGMOD 16] | Networking: FreeFlow [HotNets 16]
Up next: Samza 10
Samza: Stateful Scalable Stream Processing at LinkedIn VLDB 17 Shadi A. Noghabi*, Kartik Paramasivam^, Yi Pan^, Navina Ramesh^, Jon Bringhurst^, Indranil Gupta*, Roy Campbell* * University of Illinois at Urbana-Champaign ^ LinkedIn Corp.
Stream Processing* Interactive feeds or ads Security & Monitoring Internet of Things *Stream processing is the processing of data in motion, i.e., computing on data immediately as it is produced 13
Stream Applications need State Access Stream (IP, access time) DoS Identifier IP Count 1.1.1.1 2 2.3.2.1 30 DoS Stream (Attacker IP) State: durable data read/written along with the processing Many cases need large state Aggregation: aggregating counts over a window Join: two (or more) streams over a window Other: machine learning model 14
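The DoS identifier above can be sketched as a per-IP aggregation over task state. This is an illustrative sketch, not Samza's API: `detect_dos` and `DOS_THRESHOLD` are hypothetical names, and a plain dict stands in for the durable local store.

```python
from collections import defaultdict

DOS_THRESHOLD = 25  # hypothetical cutoff for flagging an IP


def detect_dos(access_stream, threshold=DOS_THRESHOLD):
    """Consume an access stream of (ip, access_time) pairs, keep a
    per-IP count as task state, and emit IPs crossing the threshold."""
    state = defaultdict(int)  # in Samza this state is durable (changelog-backed)
    attackers = []
    for ip, _access_time in access_stream:
        state[ip] += 1
        if state[ip] == threshold:  # fire exactly once per attacker
            attackers.append(ip)
    return dict(state), attackers
```

On the slide's example input (2 accesses from 1.1.1.1, 30 from 2.3.2.1), only 2.3.2.1 would appear on the output DoS stream.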
Scale of Processing Large scale of input, in Kafka alone: 2.1 Trillion msg/day, 16 Million msg/sec peaks 0.5 PB in, 2 PB out per day Many applications: 100s of production applications Over 10,000 containers Large scale of state Several 100s of TBs for a single application 15
Stateful Stream Processing Need to handle state with low latency, at large scale, and with correct (consistent) results http://samza.apache.org/ Open-source Apache project Powers hundreds of apps in LinkedIn's production In use at LinkedIn, Uber, Metamarkets, Netflix, Intuit, TripAdvisor, VMware, Optimizely, Redfin, etc. 16
Stateful Stream Processing Processing Model Input partitioning Parallel and independent tasks Locality aware data passing Stateful Local state Incremental checkpointing 3-Tier caching Parallel recovery Host Stickiness Compaction 17
Processing from a Partitioned Source Stream of events Ex: page access, ad views, etc. 19
Processing from a Partitioned Source Stream of events Partition 1 Task 1 Ex: page access, ad views, etc. Partition 2 Partition k Task 2 Partitioning & parallelism Task k 22
Multi-Stage Dataflow (locality)
Application logic: count the number of page accesses for each ip domain in a 5-minute window
Application DAG, in tasks: Page Access (in stream) -> Window count -> Filter -> SendTo -> DoS Stream (out stream)
Data parallelism: each task performs the entire pipeline on its own chunk of input data 23
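One task of the dataflow above can be sketched as a single function over its partition; the window length follows the slide, while `run_task` and the filter threshold are illustrative assumptions:

```python
WINDOW_SEC = 300  # 5-minute window, per the application logic above


def run_task(partition_events, threshold=10):
    """One task runs the entire pipeline (window count -> filter -> sendTo)
    on its own chunk of the input: data parallelism.
    partition_events: iterable of (timestamp_sec, ip_domain)."""
    # Window + count: page accesses per ip domain per 5-minute bucket
    counts = {}
    for ts, domain in partition_events:
        key = (ts // WINDOW_SEC, domain)
        counts[key] = counts.get(key, 0) + 1
    # Filter: keep only suspicious domains (threshold is a made-up value)
    # SendTo: here we simply return what would go to the DoS output stream
    return {k: c for k, c in counts.items() if c > threshold}
```

Because each task sees only its own partition, k such tasks can run fully in parallel with no coordination.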
Stateful Stream Processing Processing Model Input partitioning Parallel and independent tasks Locality aware data passing Stateful Local state Incremental checkpointing 3-Tier caching Parallel recovery Host Stickiness Compaction 24
Stateful Processing Samza Application Task 1 Task 2 Task k Store count of access per ip domain State: Ip access count But, using a remote store is slow 26
Stateful Processing Samza Application Task 1 Task 2 Task k [1-100] [100-200] [900-1000]
Local state: use the local storage of tasks (state: ip access count)
Each task keeps the state corresponding to its own partition. Ex: task 1, processing input [1-100], keeps state [1-100] (locality) 28
Stateful Processing Samza Application Task 1 Task 2 Task k [1-100] [100-200] [900-1000]
Local storage options (tiered data access):
In-memory: fast, but low capacity (GBs)
On-disk: high capacity (TBs) and persistent, but slow
Disk-mem: disk (persistent store) + memory (cache): fast, high capacity, and persistent 29
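The disk-mem tier can be sketched as a persistent store fronted by a small LRU cache. This is a toy model: a dict stands in for the disk tier (Samza uses RocksDB in practice), and `TieredStore` and its cache size are illustrative.

```python
from collections import OrderedDict


class TieredStore:
    """Toy disk-mem store: a persistent tier (simulated by a dict here,
    RocksDB in real Samza) fronted by a small in-memory LRU cache."""

    def __init__(self, cache_size=2):
        self.disk = {}              # high capacity, persistent, slow
        self.cache = OrderedDict()  # low capacity, fast
        self.cache_size = cache_size

    def put(self, key, value):
        self.disk[key] = value
        self._cache_put(key, value)

    def get(self, key):
        if key in self.cache:       # fast path: memory tier
            self.cache.move_to_end(key)
            return self.cache[key]
        value = self.disk[key]      # slow path: disk tier
        self._cache_put(key, value)
        return value

    def _cache_put(self, key, value):
        self.cache[key] = value
        self.cache.move_to_end(key)
        if len(self.cache) > self.cache_size:
            self.cache.popitem(last=False)  # evict least-recently-used entry
```

With a skewed access pattern, most reads hit the memory tier, which is why disk-with-caching performance is comparable to in-memory in the evaluation.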
Stateful Processing Samza Application Task 1 Task 2 Task k [1-100] [100-200] [900-1000] What about failures? How to not lose state? 30
Fault-Tolerance Task 1 Task 2 Task k Changes saved to a durable change log Recovery by replay change-log Periodically batch & flush changes Changelog e.g., Kafka log compacted topic... partition k partition 2 partition 1
Fault-Tolerance Background processing Task 1 Task 2 Task k Changes saved to a durable change log Recovery by replay change-log Periodically batch & flush changes Changelog e.g., Kafka log compacted topic... X = 10 partition k partition 2 partition 1
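The changelog mechanism can be sketched as follows. A plain list stands in for the Kafka log-compacted topic, and the batch-and-flush cadence is simplified to an explicit `flush()` call; class and method names are illustrative.

```python
class ChangelogStore:
    """Local state whose writes are also appended to a durable changelog
    (e.g., a Kafka log-compacted topic), so a failed task can rebuild
    its state by replaying the log."""

    def __init__(self, changelog):
        self.state = {}
        self.changelog = changelog  # durable, append-only
        self.pending = []           # batched changes, flushed periodically

    def put(self, key, value):
        self.state[key] = value     # local write stays on the fast path
        self.pending.append((key, value))

    def flush(self):
        # Background processing: batch & flush changes off the critical path
        self.changelog.extend(self.pending)
        self.pending = []

    @classmethod
    def recover(cls, changelog):
        # Recovery: replay the changelog to rebuild local state
        # (log compaction keeps roughly one entry per key)
        store = cls(changelog)
        for key, value in changelog:
            store.state[key] = value
        return store
```

Unflushed changes are lost on failure, which is exactly why the checkpointed input offsets (next slides) must line up with the changelog to give a consistent snapshot.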
But, what is the cost?
Consistency: preserved via replayable input + consistent snapshots
Availability: longer failure recovery 36
Consistency Guarantees Task 1 Offset 1005 Task 2 Task k Offsets e.g., Kafka topic Offset 1005 Changelog e.g., Kafka topic... X = 10 partition k partition 2 partition 1
Availability Optimizations
Availability is compromised. To improve it we use:
Parallel recovery (partitioning & parallelism)
Background compaction (background processing)
Host stickiness (locality) 38
Fast Restarts with Host Stickiness
Input stream, job, change-log; Task-1 and Task-4 on Host-A, Task-2 on Host-B, Task-3 on Host-C
Durable task-container-host mapping: Task 1, Task 4 -> Host-A; Task 2 -> Host-B; Task 3 -> Host-C
Host stickiness in YARN: try to place each task on the same host after a restart, minimizing state-rebuilding overhead
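The sticky placement policy can be sketched as below. This is a simplified model of the idea, not YARN's allocation API: `place_tasks` is a hypothetical helper, and the least-loaded fallback is one reasonable choice among several.

```python
def place_tasks(tasks, prev_mapping, alive_hosts):
    """Sticky placement sketch: after a restart, try to put each task
    back on its previous host (its local state is already there), and
    fall back to the least-loaded live host otherwise."""
    load = {host: 0 for host in alive_hosts}
    placement = {}
    for task in tasks:
        prev = prev_mapping.get(task)
        host = prev if prev in load else min(load, key=load.get)
        placement[task] = host
        load[host] += 1
    return placement
```

Tasks whose host survived skip state rebuilding entirely; only tasks from failed hosts replay their changelog, which is what makes restart overhead nearly constant.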
Related Work
Stateless (do not support state): Apache Storm [SIGMOD 14], Heron [SIGMOD 15], S4 [ICDM 10]
Remote store (relying on external storage): Trident, MillWheel [VLDB 13], Dataflow [VLDB 15]. High latency overhead per input message, but fast recovery (better availability)
Local state, but full-state checkpointing: Flink [TCDE 15], Spark Streaming [HotCloud 12], IBM Streams [VLDB 16], StreamScope [NSDI 16], System S [DEBS 11]. Does not scale to large application state, but gives easier repeatable results on failure 40
Evaluation 41
Evaluation Setup
Production cluster: 500-node YARN cluster, real-world applications
Small cluster: 8 nodes, each with 64 GB RAM, 24-core CPUs, and a 1.6 TB SSD; micro-benchmarks:
Read-only workload (~ join with a table)
Read-write workload (~ aggregation over time) 42
Local State -- Latency
[Figure: request latency across store types (in-mem, on-disk, disk-mem, remote) and fault-tolerance modes (none, changelog (Clog), remote).]
Remote state is >2 orders of magnitude slower than local state; the changelog adds minimal overhead; on-disk with caching is comparable with in-memory 43
Failure Recovery
[Figure: recovery time vs. state size, with and without host stickiness, and vs. % of failures.]
Recovery time grows linearly with the size of state, but is almost constant with host stickiness; overhead is independent of the % of failures (parallel recovery) 44
Summary of State in Samza
Pros: 100x better latency; better resource utilization (no need for remote DB resources); parallel and independent processing
Cons: tied to input partitioning; does NOT work when state is large and cannot be co-partitioned with the input stream; auto-scaling becomes harder; recovery is slower (vs. remote state)
Conclusion Need to handle state with low latency, at large-scale, and with correct (consistent) results Processing Model Input partitioning Parallel and independent tasks Locality aware data passing Stateful Local state Incremental checkpointing 3-Tier caching Parallel recovery Host Stickiness Compaction Locality Background processing Partitioning & parallelism 46
Latency-driven Techniques: background processing, locality, partitioning & parallelism, tiered data access, load balancing, adaptive scaling, opportunistic processing
Processing: Samza [VLDB 17] | Edge: Steel [HotCloud 18] | Storage: Ambry [SIGMOD 16] | Networking: FreeFlow [HotNets 16]
Up next: Steel 47
Steel: Unified and Optimized Edge-Cloud Environment HotCloud 18 Shadi A. Noghabi*, John Kolb^, Peter Bodik**, Eduardo Cuervo**, Indranil Gupta*, Roy Campbell* * University of Illinois at Urbana-Champaign ^ University of California, Berkeley ** Microsoft Research 49
The Cloud is Not a One-Size-Fits-All
Latency: the Cloud is simply too far! >70 ms round-trip time, while applications need <80 ms, <20 ms, or even <10 ms 50
The Cloud is Not a One-Size-Fits-All
Resources: bandwidth (10s-100s GB/s), battery, energy, heating. Not enough resources to use the Cloud 51
The Cloud is Not a One-Size-Fits-All
Latency, resources (bandwidth, battery, energy, heating), and privacy & security 52
Edge Computing Edge 3 Cloud Edge 1 Edge 2 Satyanarayanan, M., Bahl, P., et. al. The Case for VM-based Cloudlets in Mobile Computing, Pervasive Computing 2009 53
However... Industry is in its infancy of building edge-cloud applications: no proper unification of edge & cloud; hard to configure, deploy, and monitor; manual and redundant optimizations
Steel: a unified edge-cloud framework with modular and automated optimizations, integrated with production Azure services 54
Steel Architecture
Applications are written against Steel's abstraction (a logical spec: A, B, C, D)
The Fabric compiles, deploys, and monitors & analyzes them over the edge-cloud ecosystem
Optimization modules: placement, communication, load balancing 55
Optimizer Modules
Placement: where (edge/cloud) to place? Adapts to long-term changes; location- and vicinity-aware (locality); background profiling & what-if analysis (background processing); partitioned and parallel optimizations (partitioning & parallelism)
Communication: configures edge-cloud links; adapts to short-term spikes; background batching & compression; changes strategy opportunistically based on available resources (opportunistic processing) 56
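A toy what-if check in the spirit of the placement module might look as follows. All names, thresholds, and inputs are illustrative assumptions; the real module relies on background profiling rather than fixed numbers.

```python
def choose_placement(cloud_rtt_ms, edge_rtt_ms, latency_budget_ms,
                     edge_free_cores, demand_cores):
    """What-if placement sketch: prefer the cloud when it meets the
    component's latency budget (elastic resources); use the edge only
    when it both meets the budget and has spare capacity."""
    if cloud_rtt_ms <= latency_budget_ms:
        return "cloud"
    if edge_rtt_ms <= latency_budget_ms and demand_cores <= edge_free_cores:
        return "edge"
    return "infeasible"
```

With the numbers from the motivation slides (>70 ms cloud round trip, a <20 ms budget), such a check pushes the latency-critical component to the edge.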
Latency-driven Techniques: background processing, locality, partitioning & parallelism, tiered data access, load balancing, adaptive scaling, opportunistic processing
Processing: Samza [VLDB 17] | Edge: Steel [HotCloud 18] | Storage: Ambry [SIGMOD 16] | Networking: FreeFlow [HotNets 16]
Up next: Ambry 57
Ambry: LinkedIn's Scalable Geo-Distributed Object Store SIGMOD 16 Shadi A. Noghabi *, Sriram Subramanian +, Priyesh Narayanan +, Sivabalan Narayanan +, Gopalakrishna Holla +, Mammad Zadeh +, Tianwei Li +, Indranil Gupta*, Roy H. Campbell* * University of Illinois at Urbana-Champaign SRG: http://srg.cs.illinois.edu and DPRG: http://dprg.cs.uiuc.edu + LinkedIn Corporation Data Infrastructure team 59
Massive Media Objects are Everywhere 100s of Millions of users 100s of TBs to PBs From all around the world 60
How to handle all these objects? We need a geographically distributed system that stores and retrieves objects in a low-latency and scalable manner
Ambry: in production for >4 years serving >500 M users! Open source: https://github.com/linkedin/ambry 61
Geo-Distribution (locality, background processing)
Asynchronous writes: write synchronously only to the local datacenter (count success there), then replicate to the other datacenters in the background (background replication)
Reduces user-perceived latency and minimizes cross-DC traffic (Alice's item_i: Datacenter 1 -> Datacenter 2, Datacenter 3) 63
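The write path can be sketched as follows. Queues stand in for Ambry's background replication threads, and running the replication step explicitly keeps this sketch deterministic; `GeoStore` and its methods are illustrative names, not Ambry's API.

```python
from collections import deque


class GeoStore:
    """Latency-driven geo-replication sketch: a put succeeds after the
    synchronous local-DC write; remote DCs catch up asynchronously."""

    def __init__(self, local_dc, remote_dcs):
        self.dcs = {dc: {} for dc in [local_dc] + list(remote_dcs)}
        self.local_dc = local_dc
        self.pending = {dc: deque() for dc in remote_dcs}

    def put(self, key, blob):
        self.dcs[self.local_dc][key] = blob  # user-perceived latency ends here
        for queue in self.pending.values():  # replication deferred
            queue.append((key, blob))
        return "success"

    def run_replication(self):
        # In Ambry this runs continuously in background threads
        for dc, queue in self.pending.items():
            while queue:
                key, blob = queue.popleft()
                self.dcs[dc][key] = blob
```

Between a put and the replication pass, a remote datacenter briefly lacks the object: the cost of trading cross-DC synchrony for low user-perceived latency.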
Other Techniques
Ambry is a large production system with many other techniques:
Logical data partitioning (partitioning & parallelism)
Caching, indexing, bloom filters (tiered data access)
Background load balancing (load balancing) 64
Latency-driven Techniques: background processing, locality, partitioning & parallelism, tiered data access, load balancing, adaptive scaling, opportunistic processing
Processing: Samza [VLDB 17] | Edge: Steel [HotCloud 18] | Storage: Ambry [SIGMOD 16] | Networking: FreeFlow [HotNets 16]
Up next: FreeFlow 65
FreeFlow: High Performance Container Networking HotNets 16 Tianlong Yu^, Shadi A. Noghabi*, Shachar Raindel+, Hongqiang Liu+, Jitu Padhye+, and Vyas Sekar^ * University of Illinois at Urbana-Champaign + Microsoft Research ^ Carnegie Mellon University 67
Containers are popular Containers provide good portability and isolation https://clusterhq.com/assets/pdfs/state-of-container-usage-june-2016.pdf 68
FreeFlow Opportunistic processing Node 1 Node 2 Locality Container 1 Container 2 Container 3 Shared memory RDMA FreeFlow Network 69
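The opportunistic path selection the slide depicts can be sketched as a simple decision rule. This is an illustration of the idea, not FreeFlow's actual control logic; `pick_transport` is a hypothetical helper.

```python
def pick_transport(src_node, dst_node, rdma_capable):
    """Opportunistic transport selection in the spirit of FreeFlow:
    co-located containers talk over shared memory, remote pairs use
    RDMA when the fabric supports it, and TCP otherwise."""
    if src_node == dst_node:
        return "shared-memory"   # e.g., Container 1 <-> Container 2 on Node 1
    if rdma_capable:
        return "rdma"            # e.g., Node 1 <-> Node 2 over the fabric
    return "tcp"                 # fallback network path
```

The point is locality awareness: the overlay network stays transparent to containers while the fastest available path is used underneath.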
Latency-driven Techniques: background processing, locality, partitioning & parallelism, tiered data access, load balancing, adaptive scaling, opportunistic processing
Processing: Samza [VLDB 17] | Edge: Steel [HotCloud 18] | Storage: Ambry [SIGMOD 16] | Networking: FreeFlow [HotNets 16] 70
Future Directions One global system seamlessly across the globe Holistic and cross-layer approach Latency sensitivity-aware approaches (not treating all objects/apps equally) E.g., dynamic use of compression, erasure coding, compaction in storage for old objects E.g., multi-tenant stream processing environments Auto-pilot systems coping with changes automatically and dynamically 71
Thanks to Many Collaborators
Advisors: Indranil Gupta and Roy Campbell
LinkedIn: Sriram Subramanian, Priyesh Narayanan, Kartik Paramasivam, Yi Pan, Navina Ramesh, et al.
Microsoft Research: Victor Bahl, Peter Bodik, Eduardo Cuervo, Hongqiang Liu, Jitu Padhye, et al. 72
Publications 1. Steel: Unified Management and Optimization of Edge-Cloud Applications, SA Noghabi, J Kolb, P Bodik, E Cuervo, I Gupta, RH Campbell, (in progress) 2. Steel: Simplified Development and Deployment of Edge-Cloud Applications, SA Noghabi, J Kolb, P Bodik, E Cuervo, HotCloud 2018 3. Samza: Stateful Scalable Stream Processing at LinkedIn, SA Noghabi, K Paramasivam, Y Pan, N Ramesh, J Bringhurst, I Gupta, RH Campbell, VLDB 2017 4. Ambry: LinkedIn's Scalable Geo-Distributed Object Store, SA Noghabi, S Subramanian, P Narayanan, S Narayanan, G Holla, M Zadeh, T Li, I Gupta, RH Campbell, SIGMOD 2016 5. FreeFlow: High Performance Container Networking, T Yu, SA Noghabi, S Raindel, H Liu, J Padhye, V Sekar, HotNets 2016 6. Steel: Unified Management and Optimization of Edge-Cloud IoT Applications, NSDI poster 2018 7. To edge or not to edge?, F Kalim, SA Noghabi, S Verma, SoCC poster 2017 8. Building a Scalable Distributed Online Media Processing Environment, SA Noghabi, PhD workshop at VLDB 2016 9. Toward Fabric: A Middleware Implementing High-level Description Languages on a Fabric-like Network, SH Hashemi, SA Noghabi, J Bellessa, RH Campbell, ANCS 2016 10. Towards Enabling Cooperation Between Scheduler and Storage Layer to Improve Job Performance, M Pundir, J Bellessa, SA Noghabi, CL Abad, RH Campbell, PDSW 2015 73
Summary of Thesis
Latency-driven designs can be used to build production infrastructures for interactive applications at a global scale
Techniques: background processing, locality, partitioning & parallelism, tiered data access, load balancing, adaptive scaling, opportunistic processing
Processing: Samza [VLDB 17], Steel [HotCloud 18] | Storage: Ambry [SIGMOD 16] | Networking: FreeFlow [HotNets 16] 74
Back up Slides 75
Theoretical Background
Latency-driven design can be modeled theoretically with queueing theory*
System: a hierarchical queueing system
Latency: response time (time the response is produced minus the time the request entered)
Techniques map to queueing-theory models: partitioning & parallelism -> multiple queues and parallel servers; adaptive scaling -> adding more servers
* Gross, Donald. Fundamentals of Queueing Theory. John Wiley & Sons, 2008. 76
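For instance, the effect of adding parallel servers on response time can be computed with the standard Erlang-C formula for an M/M/c queue; the arrival and service rates below are illustrative, not measurements from the systems.

```python
from math import factorial


def mmc_response_time(lam, mu, c):
    """Mean response time W of an M/M/c queue via the Erlang-C formula:
    arrival rate lam, per-server service rate mu, c parallel servers
    (requires lam < c * mu for stability)."""
    a = lam / mu                                    # offered load
    erlang_c = ((a**c / factorial(c)) * (c / (c - a))) / (
        sum(a**k / factorial(k) for k in range(c))
        + (a**c / factorial(c)) * (c / (c - a)))    # P(request must wait)
    return erlang_c / (c * mu - lam) + 1 / mu       # mean wait + mean service


# Partitioning & parallelism as "more servers": latency drops sharply
w1 = mmc_response_time(lam=8, mu=10, c=1)  # single server (M/M/1): 0.5 s
w2 = mmc_response_time(lam=8, mu=10, c=2)  # two servers: ~0.12 s
```

At the same total load, doubling the servers cuts mean response time by roughly 4x here, which is the queueing-theory intuition behind the partitioning and adaptive-scaling techniques.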
Other Techniques
Batch + stream in one system: reprocess part or all of a stream or database (for bugs, upgrades, and logic changes); treat batch as a finite stream, plus conflict resolution, scaling & throttling
Adaptive scaling 77
Local State -- Throughput
[Figure: throughput across store types.]
Remote state is 30-150x worse than local state; on-disk with caching is comparable with in-memory; the changelog adds minimal overhead 78
Snapshot vs Changelog 79