Data Driven Networks Sachin Katti
Is it possible to learn the control planes of networks and applications? Operators specify what they want, and the system learns how to deliver it.
CAN WE LEARN THE CONTROL PLANE OF THE NETWORK?
[Architecture diagram: applications with access to network control (drone ATC, 4K video, virtual reality) connect through northbound application interfaces; operators supply policies, intents, and SLAs; a control and orchestration plane, automatically synthesized by learning from real-time network telemetry, drives edge cloud and RAN infrastructure control points through southbound infrastructure interfaces.]
You specify what you want, and the network figures out how to deliver it! Result: dynamic per-user/per-flow control implemented to meet the policy and deliver on the SLA.
DATA DRIVEN CONTROL LOOP
[Flowchart] Operator policies, intents, and SLAs define a service KPI. Real-time network telemetry feeds a prediction stage:
1. Will the current controls meet the KPI? If yes, recommend the current control settings.
2. If not, run a what-if analysis: adjust the control knobs and predict again.
3. If the adjusted controls meet the KPI and satisfy operator policies, recommend the adjusted control settings; otherwise keep adjusting. Recommended controls are then applied to the network.
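The loop above can be sketched in a few lines. Everything here is a hypothetical stand-in: the function names, the toy KPI model, and the candidate generator are illustrative, not APIs of any real system.

```python
def control_loop(telemetry, controls, kpi_target, predict_kpi, policy_ok, candidates):
    """Return the control settings to recommend for the next interval."""
    # 1. Will the current controls meet the KPI? If so, keep them.
    if predict_kpi(telemetry, controls) >= kpi_target:
        return controls
    # 2. What-if: try adjusted control knobs against the predictor.
    for adjusted in candidates(controls):
        # 3. Adjusted controls must meet both the KPI and operator policies.
        if predict_kpi(telemetry, adjusted) >= kpi_target and policy_ok(adjusted):
            return adjusted
    # No candidate satisfies both: fall back to the current controls.
    return controls

# Toy example: the KPI model says delivered throughput is bitrate / (1 + load).
predict = lambda t, c: c["bitrate_kbps"] / (1 + t["load"])
policy = lambda c: c["bitrate_kbps"] <= 1800          # operator bitrate cap
cands = lambda c: [{"bitrate_kbps": b} for b in (500, 1000, 1500)]
recommended = control_loop({"load": 0.9}, {"bitrate_kbps": 1000}, 600,
                           predict, policy, cands)
```

With these toy inputs the current setting predicts below the 600 kbps target, so the what-if stage walks the candidates and recommends the first one that clears both the KPI and the policy cap.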
TWO LEARNING APPROACHES
- Classical machine learning (prediction & what-if): requires extensive domain knowledge.
- Deep reinforcement learning (reward function + neural network): largely blind; truly approximates intent-based system design.
- Classical machine learning: build a rules-based algorithm/model, deploy it, evaluate the results, then tune/update the algorithm or model.
- Deep reinforcement learning: a learning agent interacts with the environment through state (application and network state variables), reward (values that encode the desired outcomes), and actions (controllable knobs), running millions of training cycles per hour.
DEEP RL IS A GREAT FIT FOR NETWORK CONTROL
- Learn directly from experience: no models, no brittle assumptions; policies adapt to the actual environment and workload.
- Optimize a variety of objectives end-to-end: provide raw observations of network conditions, express goals through the reward, and the system does the rest.
- Network control decisions are often highly repetitive: lots of training data, so sample complexity is less of a concern.
LEARNING & CONTROL PIPELINE
Inference -> Prediction -> Control
Inference: machine learning, feature extraction. Prediction: deep learning, neural networks.
INFERENCE
Understand how known network input variables affect a network KPI (e.g., throughput). Leverage traditional supervised machine learning and statistical analysis (random forest, etc.) to develop a function f whose output is used as input for the deep learning agent.
X = average throughput per user, per cell
a = current and past collisions per cell
b = current and past number of users per cell
c = current and past signal strength per user, per cell
X_t = f(a, b, c)
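A minimal sketch of this inference step on synthetic data. A linear least-squares fit stands in for the random forest mentioned above, and all feature ranges and coefficients are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
a = rng.uniform(0, 1, n)       # collisions per cell (normalized)
b = rng.integers(1, 50, n)     # number of users per cell
c = rng.uniform(-120, -70, n)  # signal strength (dBm)

# Synthetic ground truth X_t: throughput falls with collisions and users,
# rises with signal strength, plus noise.
X = 50 - 20 * a - 0.5 * b + 0.3 * (c + 120) + rng.normal(0, 1, n)

# Fit X_t = f(a, b, c) with least squares (stand-in for a random forest).
features = np.column_stack([a, b, c, np.ones(n)])
coef, *_ = np.linalg.lstsq(features, X, rcond=None)
pred = features @ coef
r2 = 1 - np.sum((X - pred) ** 2) / np.sum((X - X.mean()) ** 2)
```

On this synthetic data the fit recovers the relationship almost exactly; on real telemetry, a nonlinear model such as a random forest would be used for the same role.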
PREDICTION
Offline: extract features from training data (network state variables a, b, c).
Real time: input features from the real-time data feed into a prediction neural network, which outputs X_{t+1}, the throughput KPI prediction.
CONTROL
Network state variables feed the prediction neural network (real time); its output, together with operator policies, client conditions, and app-specific inputs, drives a deep learning agent (unsupervised, reinforcement learning) that emits the control decision.
USE CASE: VIDEO STREAMING (network-assisted mobile video streaming)
Optimization objective: maximize video quality and minimize stall time in challenging network scenarios.
Policy: maintain a minimum average throughput of 500 kbps per user for non-video traffic.
Data: real-time network state data feed; real-time mobile application state data.
Control: next-chunk adaptive bit rate (ABR).
Result: 50-75% higher video quality; 75-100% lower stall time.
API SYSTEM ARCHITECTURE
[Diagram] The CDN serves video encoded for ABR streaming over LTE. The client's ABR request triggers an ABR guidance request to the network-assisted ABR service, which combines real-time network conditions (via API) and returns an ABR recommendation for the next chunk.
CELL DATA INGESTION IN TWO DEPLOYMENTS
Melbourne business district: ~90 eNodeBs, ~270 cells.
Silicon Valley corridor: ~250 eNodeBs, ~1,300 cells.
INFER
Video QoE depends on user throughput. What does user throughput depend on?
[Scatter plots of modeled vs. measured throughput across ~100 user sessions, for three models with 58%, 40%, and 12% error. The best model decomposes throughput (bits per time) as spectral efficiency (bits per resource element, from CQI, rank, and MCS) times resource allocation rate (resource elements per time, from active users, packet rate and size, cell bandwidth, cell contention, and user & total demand).]
User performance can be modeled fairly accurately using features extracted from real-time network data feeds.
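The decomposition above is easy to state as code: throughput is spectral efficiency times the per-user resource allocation rate. The equal-share scheduler and the LTE resource-grid numbers below are illustrative assumptions, not measured values.

```python
def user_throughput_bps(bits_per_re, cell_res_per_sec, active_users):
    """Throughput = spectral efficiency x resource allocation rate.

    Assumes the scheduler shares resource elements equally among
    active users (a simplification of real proportional-fair scheduling).
    """
    res_per_sec = cell_res_per_sec / max(active_users, 1)
    return bits_per_re * res_per_sec

# Example: a 10 MHz LTE carrier has 50 PRBs x 12 subcarriers x 14 symbols
# per 1 ms subframe = 8400 resource elements per ms = 8.4M REs/s.
thpt = user_throughput_bps(bits_per_re=2.0,        # effective bits per RE
                           cell_res_per_sec=8_400_000,
                           active_users=4)
```

With four contending users and 2 effective bits per resource element, each user sees about 4.2 Mbps, which is why both the spectral-efficiency features (CQI, rank, MCS) and the contention features (active users, demand) matter.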
PREDICT
Predicting the network state variables, and then throughput.
[Plot: number of users on a cell over a five-hour window, actual vs. forecast, 4% error; decomposition into data, seasonal, trend, and remainder components.]
Test for seasonality and trend; learn an ARIMA-style model using a neural network.
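To illustrate why the seasonality test matters, here is a seasonal-naive baseline (repeat the value from one season ago) on a synthetic daily load pattern. The talk's actual predictor is an ARIMA-style model learned by a neural network; this baseline only shows the seasonality being exploited.

```python
import math

season = 24  # hourly samples, daily seasonality
# Synthetic users-on-cell series: a clean daily sinusoid over three days.
history = [50 + 30 * math.sin(2 * math.pi * t / season) for t in range(3 * season)]

def seasonal_naive(series, horizon, period):
    """Forecast each future step as the value one season earlier."""
    return [series[-period + (h % period)] for h in range(horizon)]

forecast = seasonal_naive(history, horizon=24, period=season)
```

On a perfectly periodic series this baseline is exact; real cell-load traces also carry trend and remainder components, which is what the ARIMA-style model adds on top.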
PREDICT + CONTROL
[Network architecture] Inputs: past application throughput time series (b_app), past segment throughput time series (b_seg), RSRP/RSRQ time series, and cell-level telemetry (CELT) time series. Hidden layers: conv1d and ReLU feature extractors feed an LSTM throughput forecaster, whose linear head emits the throughput prediction. The prediction, together with the video buffer states, feeds ReLU layers and a softmax output: the A3C ABR policy.
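A minimal numpy sketch of the control head: a throughput forecast and buffer state feed a small policy network whose softmax output is a distribution over candidate bitrates (the A3C actor's action space). The weights, layer sizes, and input scaling here are random placeholders, not trained parameters.

```python
import numpy as np

rng = np.random.default_rng(42)
bitrates_kbps = np.array([300, 750, 1200, 1850])  # assumed ABR ladder

W1 = rng.normal(0, 0.5, (2, 8))   # input: [predicted thpt, buffer seconds]
b1 = np.zeros(8)
W2 = rng.normal(0, 0.5, (8, 4))   # output: one logit per candidate bitrate
b2 = np.zeros(4)

def abr_policy(thpt_forecast_kbps, buffer_s):
    """Forward pass: features -> ReLU hidden layer -> softmax over bitrates."""
    x = np.array([thpt_forecast_kbps / 1000.0, buffer_s / 60.0])  # crude scaling
    h = np.maximum(0, x @ W1 + b1)
    logits = h @ W2 + b2
    p = np.exp(logits - logits.max())  # numerically stable softmax
    return p / p.sum()

probs = abr_policy(1500.0, 20.0)
```

In the real system this actor is trained with A3C against a QoE reward, and the throughput input comes from the LSTM forecaster rather than being handed in directly.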
WHAT BEHAVIOR DOES THE RL AGENT FOR CONTROLLING VIDEO ABR LEARN?
[Bitrate (kbps) and buffer (seconds) traces over ~5-minute windows:]
- On-demand video (buffer limit up to 4 minutes): build a safe buffer, then safely play the highest quality without stalling.
- Live video (buffer limit less than 30 s): track the throughput predictions to play as high quality as possible without stalling.
Network predictions become more valuable as the buffer limit grows smaller.
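The two learned behaviors can be summarized as a hand-written heuristic. To be clear, this mimics what the RL agent converges to; it is not the agent itself, and the ladder, thresholds, and safety margin are invented for illustration.

```python
BITRATES_KBPS = [300, 750, 1200, 1850]  # assumed ABR ladder

def pick_bitrate(predicted_thpt_kbps, buffer_s, buffer_limit_s):
    if buffer_limit_s >= 240:
        # On-demand: a large buffer absorbs prediction errors.
        if buffer_s < 0.25 * buffer_limit_s:
            return BITRATES_KBPS[0]  # build a safe buffer first
        safe = [b for b in BITRATES_KBPS if b <= predicted_thpt_kbps]
        return max(safe) if safe else BITRATES_KBPS[0]
    # Live: the small buffer forces the policy to track the throughput
    # prediction closely, with a safety margin against forecast error.
    safe = [b for b in BITRATES_KBPS if b <= 0.8 * predicted_thpt_kbps]
    return max(safe) if safe else BITRATES_KBPS[0]
```

Note how the live branch leans entirely on the prediction while the on-demand branch leans on the buffer, which is exactly why the slide says predictions become more valuable as the buffer limit shrinks.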
DEPLOYMENT RESULTS FROM A US MOBILE NETWORK
Compared to the current ABR solution; operator-capped max bit rate (1.8 Mbps).
[Bar chart: improvements in bit rate, stalls, and startup time under four conditions (perfect conditions; cell edge, i.e. low signal quality; high cell congestion; high user mobility); reported values range from -2% to 83%.]
WHY FINE-GRAINED NETWORK PREDICTION MATTERS FOR QOE
[Traces of bitrate (Mbps), throughput, cumulative stall (seconds), and open DRBs over a ~6-minute window, annotated with ERAB, handover, and normal events:]
- No context awareness at the start: huge startup time!
- Cell congestion can spike transiently (tens of seconds): huge stalls!
WHY FINE-GRAINED NETWORK PREDICTION MATTERS FOR QOE
[Traces of bitrate (Mbps), throughput, cumulative stall (seconds), and serving vs. neighbor link quality (dB) over a ~5-minute window:]
- Downlink interference can spike transiently (tens of seconds): huge stalls!
WHY FINE-GRAINED NETWORK PREDICTION MATTERS FOR QOE Peaks correspond directly with red lights on Univ Ave, as users get handed off
USE CASE: CONTENT PRE-POSITIONING
Optimization objective: maximize content downloads while maintaining operator-defined minimum throughput requirements.
Policy: maintain a minimum average throughput of 500 kbps per user.
Data: real-time network state data feed.
Control: when, and how much, data to download.
Result: more than double the data downloaded on existing infrastructure, with deterministic impact on other subscribers.
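A sketch of this control: download opportunistically only when the predicted cell capacity leaves every active user the 500 kbps policy floor. The capacity model and function names are illustrative assumptions.

```python
POLICY_FLOOR_KBPS = 500  # operator policy: minimum per-user throughput

def preposition_budget_kbps(predicted_cell_capacity_kbps, active_users):
    """How much capacity a background pre-positioning download may take
    this interval, after reserving the policy floor for every user."""
    reserved = POLICY_FLOOR_KBPS * active_users
    return max(0, predicted_cell_capacity_kbps - reserved)
```

Because the budget is computed from a prediction with an explicit reservation, the impact on other subscribers is deterministic: they always keep at least the policy floor.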
USE CASE: MOBILE NETWORK LOAD BALANCING
[Diagram: (1) RAN load prediction feeds the RAN controller; (2) the controller issues RAN handoff commands between cells A and B based on signal and load.]
Optimization objective: where signal strength is sufficient on two or more cells, manage connectivity to balance load across the available cells.
Policy: maintain a minimum average throughput of 500 kbps per user.
Data: real-time network state data feed.
Control: when, and which user, to hand off to a neighboring cell.
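A sketch of the handoff decision: among cells whose signal is above a serviceable threshold, place the user on the least-loaded one. The RSRP threshold and the load metric are illustrative assumptions.

```python
MIN_RSRP_DBM = -110  # assumed "signal strength is sufficient" threshold

def pick_cell(candidates):
    """candidates: list of (cell_id, rsrp_dbm, predicted_load) tuples.
    Returns the id of the least-loaded cell with usable signal,
    or None if no cell has sufficient signal (stay on the serving cell)."""
    usable = [c for c in candidates if c[1] >= MIN_RSRP_DBM]
    if not usable:
        return None
    return min(usable, key=lambda c: c[2])[0]
```

Using predicted rather than instantaneous load is what keeps the balancer from chasing transient spikes; the per-user policy floor would be checked on top of this, as in the other use cases.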
PLATFORM ARCHITECTURE: DEEP LEARNING ENGINE
[Diagram] A CTR/CTUM feed provides over-the-air, real-time, per-user fine-grained monitoring of dynamic network conditions. The Uhana managed service (cloud infrastructure) hosts the deep learning engine and app-specific neural networks, which drive application control points via APIs. Operators supply app-specific intents (desired outcomes) and policies (constraints), varied by user, subscription level, device type, content type, etc.
THE INFER-PREDICT-CONTROL PRIMITIVES
- INFER (offline): which network metrics (e.g., link quality, cell congestion) affect app QoE metrics (e.g., average bitrate, stalls, switches)? Output: e.g., link throughput.
- PREDICT (real time): how will the network metrics look in the future? E.g., link throughput in the next 5 s.
- CONTROL (real time): what should the app control be for the best app QoE? E.g., the ABR for the next video chunk, given the buffer and app throughput.
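The three primitives compose naturally as a pipeline. The concrete bodies below are toy placeholders (a persistence forecast and a fixed ABR ladder) just to show the shape of the composition.

```python
def infer(telemetry):
    """Offline-trained mapping: raw network metrics -> KPI drivers."""
    return {"thpt_kbps": telemetry["spec_eff"] * telemetry["alloc_rate"] / 1000}

def predict(state, horizon_s=5):
    """Forecast the KPI drivers a few seconds ahead (naive persistence here)."""
    return {"thpt_kbps": state["thpt_kbps"]}

def control(forecast, app_state):
    """Choose the app control, e.g. the next-chunk ABR, from the forecast."""
    rates = [300, 750, 1200, 1850]
    safe = [r for r in rates if r <= forecast["thpt_kbps"]]
    return max(safe) if safe else rates[0]

decision = control(predict(infer({"spec_eff": 2.0, "alloc_rate": 1_000_000})), {})
```

Because the stages only communicate through small dictionaries, each can be swapped independently: a random forest for infer, an LSTM for predict, an A3C policy for control.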
PLATFORM FEEDBACK: PROCESSING LATENCY DICTATES PREDICTION REQUIREMENTS
INGEST: real-time consumption of network data. PROCESS: real-time computation of network metrics. FORECAST: real-time prediction of network metrics.
The lower the processing latency, the shorter the required prediction horizon, and the easier the prediction.
CONCLUSION & LESSONS LEARNED
- Network control is a big control problem; learning techniques can lead to automated control and intent-based system design.
- Infer + Predict + Control is a general framework that applies to systems beyond mobile networks.
- Streaming analytics pipeline latencies are hard to predict right now, with direct implications for the hardness of prediction.
- Analytics pipeline system performance requires careful manual tuning; it is hard to iterate and keep performance predictable.
- Online learning is still an unsolved problem, both in learning techniques and in system design.