Building Interactive Distributed Processing Applications at a Global Scale Shadi A. Noghabi Prelim Committee: Prof. Roy Campbell (advisor), Prof. Indy Gupta (advisor), Prof. Klara Nahrstedt, Dr. Victor Bahl 1
Many Interactive Applications
100s of PBs of data: < a few seconds
Trillions of events: < a few minutes
Millions to billions of devices: < a few milliseconds 2
Low Latency at Large Scale
Latency-sensitive: achieving low latency is essential for interactivity. 100ms of added latency cost Amazon 1% in sales* (roughly $4M per millisecond!)
Large scale: processing at large and global scales
My focus: building interactive systems at a production scale
* https://blog.gigaspaces.com/amazon-found-every-100ms-of-latency-cost-them-1-in-sales/ 3
At Scale, Low Latency is Challenging
Heterogeneity: data, machines, operations
Dynamism: workload, environment
Distribution: state & replication
Imbalance 4
Thesis Statement Latency-driven designs can be used to build production infrastructures for interactive applications at a global scale while addressing myriad challenges of heterogeneity, dynamism, state, and imbalance. 5
Latency-driven Design
Achieving low latency has the highest priority (PACELC 1,2): latency is favored over consistency, availability, throughput, and isolation
1. Abadi, Consistency Tradeoffs in Modern Distributed Database System Design, IEEE Computer 2012
2. Gilbert and Lynch, Brewer's Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services, SIGACT News 2002 6
Related Work
Latency-driven in other frameworks: the trend of NoSQL systems: Cassandra, Amazon's DynamoDB, Pileus
Latency-driven does not fit all:
Consistency-driven: consistent storage (BigTable, Spanner, Yahoo's PNUTS, Espresso); exactly-once stream processing (Flink, MillWheel, Spark Streaming, Trident)
Throughput-driven: batch processing (Hadoop, Spark, Tez, Pig, and Hive); micro-batch streaming (Spark Streaming, media processing); high-throughput storage (HDFS and GFS) 7
Latency-driven Techniques
Background processing, locality, partitioning & parallelism, tiered data access, load balancing, adaptive scaling, opportunistic processing 8
Latency-driven at All Layers (edge, DC 1 ... DC n)
Storage: Ambry [SIGMOD 16]: main production system at LinkedIn for >4 years, 500M users
Processing: Samza [VLDB 17]: used in production at ~15 companies, 100s of applications at LinkedIn
Edge: Steel [HotCloud 18]: integrated with production Azure & Windows IoT
Networking: FreeFlow [HotNets 16]: general container networking, applicable to Docker, Kubernetes, etc. 9
Latency-driven Techniques: background processing, locality, partitioning & parallelism, tiered data access, load balancing, adaptive scaling, opportunistic processing
Processing: Samza [VLDB 17] | Edge: Steel [HotCloud 18] | Storage: Ambry [SIGMOD 16] | Networking: FreeFlow [HotNets 16]
Up next: Samza 10
Samza: Stateful Scalable Stream Processing at LinkedIn VLDB 17 Shadi A. Noghabi*, Kartik Paramasivam^, Yi Pan^, Navina Ramesh^, Jon Bringhurst^, Indranil Gupta*, Roy Campbell* * University of Illinois at Urbana-Champaign ^ LinkedIn Corp.
Stream Processing* Interactive feeds or ads Security & Monitoring Internet of Things *Stream processing is the processing of data in motion, i.e., computing on data immediately as it is produced 13
Stream Applications need State Access Stream (IP, access time) DoS Identifier IP Count 1.1.1.1 2 2.3.2.1 30 DoS Stream (Attacker IP) State: durable data read/written along with the processing Many cases need large state Aggregation: aggregating counts over a window Join: two (or more) streams over a window Other: machine learning model 14
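The DoS identifier above can be sketched as a per-IP aggregation over task state. This is an illustrative sketch, not Samza's API: `detect_dos` and `DOS_THRESHOLD` are hypothetical names, and a plain dict stands in for the durable local store.

```python
from collections import defaultdict

DOS_THRESHOLD = 25  # hypothetical cutoff for flagging an IP


def detect_dos(access_stream, threshold=DOS_THRESHOLD):
    """Consume an access stream of (ip, access_time) pairs, keep a
    per-IP count as task state, and emit IPs crossing the threshold."""
    state = defaultdict(int)  # in Samza this state is durable (changelog-backed)
    attackers = []
    for ip, _access_time in access_stream:
        state[ip] += 1
        if state[ip] == threshold:  # fire exactly once per attacker
            attackers.append(ip)
    return dict(state), attackers
```

On the slide's example input (2 accesses from 1.1.1.1, 30 from 2.3.2.1), only 2.3.2.1 would appear on the output DoS stream.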
Scale of Processing Large scale of input, in Kafka alone: 2.1 Trillion msg/day, 16 Million msg/sec peaks 0.5 PB in, 2 PB out per day Many applications: 100s of production applications Over 10,000 containers Large scale of state Several 100s of TBs for a single application 15
Stateful Stream Processing Need to handle state with low latency, at large scale, and with correct (consistent) results http://samza.apache.org/ Open-source Apache project Powers hundreds of apps in LinkedIn's production In use at LinkedIn, Uber, Metamarkets, Netflix, Intuit, TripAdvisor, VMware, Optimizely, Redfin, etc. 16
Stateful Stream Processing Processing Model Input partitioning Parallel and independent tasks Locality aware data passing Stateful Local state Incremental checkpointing 3-Tier caching Parallel recovery Host Stickiness Compaction 17
Processing from a Partitioned Source Stream of events Ex: page access, ad views, etc. 19
Processing from a Partitioned Source Stream of events Partition 1 Task 1 Ex: page access, ad views, etc. Partition 2 Partition k Task 2 Partitioning & parallelism Task k 22
Multi-Stage Dataflow (locality)
Application logic: count the number of page accesses for each ip domain in a 5-minute window
Application DAG, in tasks: Page Access (in stream) -> Window count -> Filter -> SendTo -> DoS Stream (out stream)
Data parallelism: each task performs the entire pipeline on its own chunk of input data 23
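One task of the dataflow above can be sketched as a single function over its partition; the window length follows the slide, while `run_task` and the filter threshold are illustrative assumptions:

```python
WINDOW_SEC = 300  # 5-minute window, per the application logic above


def run_task(partition_events, threshold=10):
    """One task runs the entire pipeline (window count -> filter -> sendTo)
    on its own chunk of the input: data parallelism.
    partition_events: iterable of (timestamp_sec, ip_domain)."""
    # Window + count: page accesses per ip domain per 5-minute bucket
    counts = {}
    for ts, domain in partition_events:
        key = (ts // WINDOW_SEC, domain)
        counts[key] = counts.get(key, 0) + 1
    # Filter: keep only suspicious domains (threshold is a made-up value)
    # SendTo: here we simply return what would go to the DoS output stream
    return {k: c for k, c in counts.items() if c > threshold}
```

Because each task sees only its own partition, k such tasks can run fully in parallel with no coordination.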
Stateful Stream Processing Processing Model Input partitioning Parallel and independent tasks Locality aware data passing Stateful Local state Incremental checkpointing 3-Tier caching Parallel recovery Host Stickiness Compaction 24
Stateful Processing Samza Application Task 1 Task 2 Task k Store count of access per ip domain State: Ip access count But, using a remote store is slow 26
Stateful Processing Samza Application Task 1 Task 2 Task k [1-100] [100-200] [900-1000]
Local state: use the local storage of tasks (state: ip access count)
Each task keeps the state corresponding to its own partition. Ex: task 1, processing input [1-100], keeps state [1-100] (locality) 28
Stateful Processing Samza Application Task 1 Task 2 Task k [1-100] [100-200] [900-1000]
Local storage options (tiered data access):
In-memory: fast, but low capacity (GBs)
On-disk: high capacity (TBs) and persistent, but slow
Disk-mem: disk (persistent store) + memory (cache): fast, high capacity, and persistent 29
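The disk-mem tier can be sketched as a persistent store fronted by a small LRU cache. This is a toy model: a dict stands in for the disk tier (Samza uses RocksDB in practice), and `TieredStore` and its cache size are illustrative.

```python
from collections import OrderedDict


class TieredStore:
    """Toy disk-mem store: a persistent tier (simulated by a dict here,
    RocksDB in real Samza) fronted by a small in-memory LRU cache."""

    def __init__(self, cache_size=2):
        self.disk = {}              # high capacity, persistent, slow
        self.cache = OrderedDict()  # low capacity, fast
        self.cache_size = cache_size

    def put(self, key, value):
        self.disk[key] = value
        self._cache_put(key, value)

    def get(self, key):
        if key in self.cache:       # fast path: memory tier
            self.cache.move_to_end(key)
            return self.cache[key]
        value = self.disk[key]      # slow path: disk tier
        self._cache_put(key, value)
        return value

    def _cache_put(self, key, value):
        self.cache[key] = value
        self.cache.move_to_end(key)
        if len(self.cache) > self.cache_size:
            self.cache.popitem(last=False)  # evict least-recently-used entry
```

With a skewed access pattern, most reads hit the memory tier, which is why disk-with-caching performance is comparable to in-memory in the evaluation.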
Stateful Processing Samza Application Task 1 Task 2 Task k [1-100] [100-200] [900-1000] What about failures? How to not lose state? 30
Fault-Tolerance Task 1 Task 2 Task k Changes saved to a durable change log Recovery by replay change-log Periodically batch & flush changes Changelog e.g., Kafka log compacted topic... partition k partition 2 partition 1
Fault-Tolerance Background processing Task 1 Task 2 Task k Changes saved to a durable change log Recovery by replay change-log Periodically batch & flush changes Changelog e.g., Kafka log compacted topic... X = 10 partition k partition 2 partition 1
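The changelog mechanism can be sketched as follows. A plain list stands in for the Kafka log-compacted topic, and the batch-and-flush cadence is simplified to an explicit `flush()` call; class and method names are illustrative.

```python
class ChangelogStore:
    """Local state whose writes are also appended to a durable changelog
    (e.g., a Kafka log-compacted topic), so a failed task can rebuild
    its state by replaying the log."""

    def __init__(self, changelog):
        self.state = {}
        self.changelog = changelog  # durable, append-only
        self.pending = []           # batched changes, flushed periodically

    def put(self, key, value):
        self.state[key] = value     # local write stays on the fast path
        self.pending.append((key, value))

    def flush(self):
        # Background processing: batch & flush changes off the critical path
        self.changelog.extend(self.pending)
        self.pending = []

    @classmethod
    def recover(cls, changelog):
        # Recovery: replay the changelog to rebuild local state
        # (log compaction keeps roughly one entry per key)
        store = cls(changelog)
        for key, value in changelog:
            store.state[key] = value
        return store
```

Unflushed changes are lost on failure, which is exactly why the checkpointed input offsets (next slides) must line up with the changelog to give a consistent snapshot.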
But, what is the cost?
Consistency: preserved via replayable input + consistent snapshots
Availability: longer failure recovery 36
Consistency Guarantees Task 1 Offset 1005 Task 2 Task k Offsets e.g., Kafka topic Offset 1005 Changelog e.g., Kafka topic... X = 10 partition k partition 2 partition 1
Availability Optimizations
Availability is compromised. To improve it we use:
Parallel recovery (partitioning & parallelism)
Background compaction (background processing)
Host stickiness (locality) 38
Fast Restarts with Host Stickiness
Input stream, job, change-log; Task-1 and Task-4 on Host-A, Task-2 on Host-B, Task-3 on Host-C
Durable task-container-host mapping: Task 1, Task 4 -> Host-A; Task 2 -> Host-B; Task 3 -> Host-C
Host stickiness in YARN: try to place each task on the same host after a restart, minimizing state-rebuilding overhead
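The sticky placement policy can be sketched as below. This is a simplified model of the idea, not YARN's allocation API: `place_tasks` is a hypothetical helper, and the least-loaded fallback is one reasonable choice among several.

```python
def place_tasks(tasks, prev_mapping, alive_hosts):
    """Sticky placement sketch: after a restart, try to put each task
    back on its previous host (its local state is already there), and
    fall back to the least-loaded live host otherwise."""
    load = {host: 0 for host in alive_hosts}
    placement = {}
    for task in tasks:
        prev = prev_mapping.get(task)
        host = prev if prev in load else min(load, key=load.get)
        placement[task] = host
        load[host] += 1
    return placement
```

Tasks whose host survived skip state rebuilding entirely; only tasks from failed hosts replay their changelog, which is what makes restart overhead nearly constant.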
Related Work
Stateless (do not support state): Apache Storm [SIGMOD 14], Heron [SIGMOD 15], S4 [ICDM 10]
Remote store (relying on external storage): Trident, MillWheel [VLDB 13], Dataflow [VLDB 15]. High latency overhead per input message, but fast recovery (better availability)
Local state, but full-state checkpointing: Flink [TCDE 15], Spark Streaming [HotCloud 12], IBM Streams [VLDB 16], StreamScope [NSDI 16], System S [DEBS 11]. Does not scale to large application state, but gives easier repeatable results on failure 40
Evaluation 41
Evaluation Setup
Production cluster: 500-node YARN cluster, real-world applications
Small cluster: 8 nodes, each with 64 GB RAM, 24-core CPUs, and a 1.6 TB SSD; micro-benchmarks:
Read-only workload (~ join with a table)
Read-write workload (~ aggregation over time) 42
Local State -- Latency
[Figure: request latency across store types (in-mem, on-disk, disk-mem, remote) and fault-tolerance modes (none, changelog (Clog), remote).]
Remote state is >2 orders of magnitude slower than local state; the changelog adds minimal overhead; on-disk with caching is comparable with in-memory 43
Failure Recovery
[Figure: recovery time vs. state size, with and without host stickiness, and vs. % of failures.]
Recovery time grows linearly with the size of state, but is almost constant with host stickiness; overhead is independent of the % of failures (parallel recovery) 44
Summary of State in Samza
Pros: 100x better latency; better resource utilization (no need for remote DB resources); parallel and independent processing
Cons: tied to input partitioning; does NOT work when state is large and cannot be co-partitioned with the input stream; auto-scaling becomes harder; recovery is slower (vs. remote state)
Conclusion Need to handle state with low latency, at large-scale, and with correct (consistent) results Processing Model Input partitioning Parallel and independent tasks Locality aware data passing Stateful Local state Incremental checkpointing 3-Tier caching Parallel recovery Host Stickiness Compaction Locality Background processing Partitioning & parallelism 46
Latency-driven Techniques: background processing, locality, partitioning & parallelism, tiered data access, load balancing, adaptive scaling, opportunistic processing
Processing: Samza [VLDB 17] | Edge: Steel [HotCloud 18] | Storage: Ambry [SIGMOD 16] | Networking: FreeFlow [HotNets 16]
Up next: Steel 47
Steel: Unified and Optimized Edge-Cloud Environment HotCloud 18 Shadi A. Noghabi*, John Kolb^, Peter Bodik**, Eduardo Cuervo**, Indranil Gupta*, Roy Campbell* * University of Illinois at Urbana-Champaign ^ University of California, Berkeley ** Microsoft Research 49
The Cloud is Not a One-Size-Fits-All
Latency: the Cloud is simply too far! >70 ms round-trip time, while applications need <80 ms, <20 ms, or even <10 ms 50
The Cloud is Not a One-Size-Fits-All
Resources: bandwidth (10s-100s GB/s), battery, energy, heating. Not enough resources to use the Cloud 51
The Cloud is Not a One-Size-Fits-All
Latency, resources (bandwidth, battery, energy, heating), and privacy & security 52
Edge Computing Edge 3 Cloud Edge 1 Edge 2 Satyanarayanan, M., Bahl, P., et. al. The Case for VM-based Cloudlets in Mobile Computing, Pervasive Computing 2009 53
However... Industry is in its infancy of building edge-cloud applications: no proper unification of edge & cloud; hard to configure, deploy, and monitor; manual and redundant optimizations
Steel: a unified edge-cloud framework with modular and automated optimizations, integrated with production Azure services 54
Steel Architecture
Applications are written against Steel's abstraction (a logical spec: A, B, C, D)
The Fabric compiles, deploys, and monitors & analyzes them over the edge-cloud ecosystem
Optimization modules: placement, communication, load balancing 55
Optimizer Modules
Placement: where (edge/cloud) to place? Adapts to long-term changes; location- and vicinity-aware (locality); background profiling & what-if analysis (background processing); partitioned and parallel optimizations (partitioning & parallelism)
Communication: configures edge-cloud links; adapts to short-term spikes; background batching & compression; changes strategy opportunistically based on available resources (opportunistic processing) 56
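A toy what-if check in the spirit of the placement module might look as follows. All names, thresholds, and inputs are illustrative assumptions; the real module relies on background profiling rather than fixed numbers.

```python
def choose_placement(cloud_rtt_ms, edge_rtt_ms, latency_budget_ms,
                     edge_free_cores, demand_cores):
    """What-if placement sketch: prefer the cloud when it meets the
    component's latency budget (elastic resources); use the edge only
    when it both meets the budget and has spare capacity."""
    if cloud_rtt_ms <= latency_budget_ms:
        return "cloud"
    if edge_rtt_ms <= latency_budget_ms and demand_cores <= edge_free_cores:
        return "edge"
    return "infeasible"
```

With the numbers from the motivation slides (>70 ms cloud round trip, a <20 ms budget), such a check pushes the latency-critical component to the edge.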
Latency-driven Techniques: background processing, locality, partitioning & parallelism, tiered data access, load balancing, adaptive scaling, opportunistic processing
Processing: Samza [VLDB 17] | Edge: Steel [HotCloud 18] | Storage: Ambry [SIGMOD 16] | Networking: FreeFlow [HotNets 16]
Up next: Ambry 57
Ambry: LinkedIn's Scalable Geo-Distributed Object Store SIGMOD 16 Shadi A. Noghabi *, Sriram Subramanian +, Priyesh Narayanan +, Sivabalan Narayanan +, Gopalakrishna Holla +, Mammad Zadeh +, Tianwei Li +, Indranil Gupta*, Roy H. Campbell* * University of Illinois at Urbana-Champaign SRG: http://srg.cs.illinois.edu and DPRG: http://dprg.cs.uiuc.edu + LinkedIn Corporation Data Infrastructure team 59
Massive Media Objects are Everywhere 100s of Millions of users 100s of TBs to PBs From all around the world 60
How to handle all these objects? We need a geographically distributed system that stores and retrieves objects in a low-latency and scalable manner
Ambry: in production for >4 years serving >500 M users! Open source: https://github.com/linkedin/ambry 61
Geo-Distribution (locality, background processing)
Asynchronous writes: write synchronously only to the local datacenter (count success there), then replicate to the other datacenters in the background (background replication)
Reduces user-perceived latency and minimizes cross-DC traffic (Alice's item_i: Datacenter 1 -> Datacenter 2, Datacenter 3) 63
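The write path can be sketched as follows. Queues stand in for Ambry's background replication threads, and running the replication step explicitly keeps this sketch deterministic; `GeoStore` and its methods are illustrative names, not Ambry's API.

```python
from collections import deque


class GeoStore:
    """Latency-driven geo-replication sketch: a put succeeds after the
    synchronous local-DC write; remote DCs catch up asynchronously."""

    def __init__(self, local_dc, remote_dcs):
        self.dcs = {dc: {} for dc in [local_dc] + list(remote_dcs)}
        self.local_dc = local_dc
        self.pending = {dc: deque() for dc in remote_dcs}

    def put(self, key, blob):
        self.dcs[self.local_dc][key] = blob  # user-perceived latency ends here
        for queue in self.pending.values():  # replication deferred
            queue.append((key, blob))
        return "success"

    def run_replication(self):
        # In Ambry this runs continuously in background threads
        for dc, queue in self.pending.items():
            while queue:
                key, blob = queue.popleft()
                self.dcs[dc][key] = blob
```

Between a put and the replication pass, a remote datacenter briefly lacks the object: the cost of trading cross-DC synchrony for low user-perceived latency.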
Other Techniques
Ambry is a large production system with many other techniques:
Logical data partitioning (partitioning & parallelism)
Caching, indexing, bloom filters (tiered data access)
Background load balancing (load balancing) 64
Latency-driven Techniques: background processing, locality, partitioning & parallelism, tiered data access, load balancing, adaptive scaling, opportunistic processing
Processing: Samza [VLDB 17] | Edge: Steel [HotCloud 18] | Storage: Ambry [SIGMOD 16] | Networking: FreeFlow [HotNets 16]
Up next: FreeFlow 65
FreeFlow: High Performance Container Networking HotNets 16 Tianlong Yu^, Shadi A. Noghabi*, Shachar Raindel+, Hongqiang Liu+, Jitu Padhye+, and Vyas Sekar^ * University of Illinois at Urbana-Champaign + Microsoft Research ^ Carnegie Mellon University 67
Containers are popular Containers provide good portability and isolation https://clusterhq.com/assets/pdfs/state-of-container-usage-june-2016.pdf 68
FreeFlow Opportunistic processing Node 1 Node 2 Locality Container 1 Container 2 Container 3 Shared memory RDMA FreeFlow Network 69
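The opportunistic path selection the slide depicts can be sketched as a simple decision rule. This is an illustration of the idea, not FreeFlow's actual control logic; `pick_transport` is a hypothetical helper.

```python
def pick_transport(src_node, dst_node, rdma_capable):
    """Opportunistic transport selection in the spirit of FreeFlow:
    co-located containers talk over shared memory, remote pairs use
    RDMA when the fabric supports it, and TCP otherwise."""
    if src_node == dst_node:
        return "shared-memory"   # e.g., Container 1 <-> Container 2 on Node 1
    if rdma_capable:
        return "rdma"            # e.g., Node 1 <-> Node 2 over the fabric
    return "tcp"                 # fallback network path
```

The point is locality awareness: the overlay network stays transparent to containers while the fastest available path is used underneath.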
Latency-driven Techniques: background processing, locality, partitioning & parallelism, tiered data access, load balancing, adaptive scaling, opportunistic processing
Processing: Samza [VLDB 17] | Edge: Steel [HotCloud 18] | Storage: Ambry [SIGMOD 16] | Networking: FreeFlow [HotNets 16] 70
Future Directions One global system seamlessly across the globe Holistic and cross-layer approach Latency sensitivity-aware approaches (not treating all objects/apps equally) E.g., dynamic use of compression, erasure coding, compaction in storage for old objects E.g., multi-tenant stream processing environments Auto-pilot systems coping with changes automatically and dynamically 71
Thanks to Many Collaborators
Advisors: Indranil Gupta and Roy Campbell
LinkedIn: Sriram Subramanian, Priyesh Narayanan, Kartik Paramasivam, Yi Pan, Navina Ramesh, et al.
Microsoft Research: Victor Bahl, Peter Bodik, Eduardo Cuervo, Hongqiang Liu, Jitu Padhye, et al. 72
Publications 1. Steel: Unified Management and Optimization of Edge-Cloud Applications, SA Noghabi, J Kolb, P Bodik, E Cuervo, I Gupta, RH Campbell, (in progress) 2. Steel: Simplified Development and Deployment of Edge-Cloud Applications, SA Noghabi, J Kolb, P Bodik, E Cuervo, HotCloud 2018 3. Samza: Stateful Scalable Stream Processing at LinkedIn, SA Noghabi, K Paramasivam, Y Pan, N Ramesh, J Bringhurst, I Gupta, RH Campbell, VLDB 2017 4. Ambry: LinkedIn's Scalable Geo-Distributed Object Store, SA Noghabi, S Subramanian, P Narayanan, S Narayanan, G Holla, M Zadeh, T Li, I Gupta, RH Campbell, SIGMOD 2016 5. FreeFlow: High Performance Container Networking, T Yu, SA Noghabi, S Raindel, H Liu, J Padhye, V Sekar, HotNets 2016 6. Steel: Unified Management and Optimization of Edge-Cloud IoT Applications, NSDI poster 2018 7. To edge or not to edge?, F Kalim, SA Noghabi, S Verma, SoCC poster 2017 8. Building a Scalable Distributed Online Media Processing Environment, SA Noghabi, PhD workshop at VLDB 2016 9. Toward Fabric: A Middleware Implementing High-level Description Languages on a Fabric-like Network, SH Hashemi, SA Noghabi, J Bellessa, RH Campbell, ANCS 2016 10. Towards Enabling Cooperation Between Scheduler and Storage Layer to Improve Job Performance, M Pundir, J Bellessa, SA Noghabi, CL Abad, RH Campbell, PDSW 2015 73
Summary of Thesis
Latency-driven designs can be used to build production infrastructures for interactive applications at a global scale
Techniques: background processing, locality, partitioning & parallelism, tiered data access, load balancing, adaptive scaling, opportunistic processing
Processing: Samza [VLDB 17], Steel [HotCloud 18] | Storage: Ambry [SIGMOD 16] | Networking: FreeFlow [HotNets 16] 74
Back up Slides 75
Theoretical Background
Latency-driven design can be modeled theoretically with queueing theory*
System: a hierarchical queueing system
Latency: response time (time the response is produced minus the time the request entered)
Techniques map to queueing-theory models: partitioning & parallelism -> multiple queues and parallel servers; adaptive scaling -> adding more servers
* Gross, Donald. Fundamentals of Queueing Theory. John Wiley & Sons, 2008. 76
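For instance, the effect of adding parallel servers on response time can be computed with the standard Erlang-C formula for an M/M/c queue; the arrival and service rates below are illustrative, not measurements from the systems.

```python
from math import factorial


def mmc_response_time(lam, mu, c):
    """Mean response time W of an M/M/c queue via the Erlang-C formula:
    arrival rate lam, per-server service rate mu, c parallel servers
    (requires lam < c * mu for stability)."""
    a = lam / mu                                    # offered load
    erlang_c = ((a**c / factorial(c)) * (c / (c - a))) / (
        sum(a**k / factorial(k) for k in range(c))
        + (a**c / factorial(c)) * (c / (c - a)))    # P(request must wait)
    return erlang_c / (c * mu - lam) + 1 / mu       # mean wait + mean service


# Partitioning & parallelism as "more servers": latency drops sharply
w1 = mmc_response_time(lam=8, mu=10, c=1)  # single server (M/M/1): 0.5 s
w2 = mmc_response_time(lam=8, mu=10, c=2)  # two servers: ~0.12 s
```

At the same total load, doubling the servers cuts mean response time by roughly 4x here, which is the queueing-theory intuition behind the partitioning and adaptive-scaling techniques.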
Other Techniques
Batch + stream in one system: reprocess part or all of a stream or database (for bugs, upgrades, and logic changes); treat batch as a finite stream, plus conflict resolution, scaling & throttling
Adaptive scaling 77
Local State -- Throughput
[Figure: throughput across store types.]
Remote state is 30-150x worse than local state; on-disk with caching is comparable with in-memory; the changelog adds minimal overhead 78
Snapshot vs Changelog 79