C3: INTERNET-SCALE CONTROL PLANE FOR VIDEO QUALITY OPTIMIZATION Aditya Ganjam, Jibin Zhan, Xi Liu, Faisal Siddiqi, Conviva Junchen Jiang, Vyas Sekar, Carnegie Mellon University Ion Stoica, University of California, Berkeley, Conviva Hui Zhang, Carnegie Mellon University, Conviva
HIGH EXPECTATIONS ON VIDEO QUALITY Television High bitrate/resolution Interruption-free Fast start Need to optimize video quality through the lifetime of a session 2
WHERE TO IMPLEMENT OPTIMIZATION? Multiple choices for actuation Client-side actuation most favorable to incremental deployment CDN 1" Video Origin " ISP 1" Client CDN 2" Backbone Network" 3
MANY CHOICES & NEED QUICK DECISION Optimization parameters Ø Bitrate x CDN (over time) Existing approaches are reactive, apply a probe and update method Ø Too slow to find optimal choice Time CDN Resource Bitrate Bitrate 4
CENTRALIZED PREDICTIVE CONTROL Ideal solution Predict outcome of each parameter choice in real-time Continuously select optimal choice Logically Centralized Controller Prediction model based on global view Per-client decisions in real-time Achieving ideal solution requires Collect global quality information Quality Information Sub-second response Decisions Build a prediction model Make per-client predictions and Client parameter decisions at Internet-scale 5
CHALLENGE: CLIENT HETEROGENEITY Strawman: implement 100 times Heterogeneous software environments and interfaces Slow software update cycle Prediction model based on global view Quality Information Logically Centralized Controller Sub-second response Per-client decisions in real-time Decisions >100 unique client pla/orms Client 6
DESIGN CHOICE: THIN SENSING/ACTUATION LAYER Logically Centralized Controller Minimal client footprint Ø Move computation to controller Define a narrow waist interface Prediction model based on global view Quality Information Sub-second response Per-client decisions in real-time Decisions Client 7
CHALLENGE: REAL-TIME PREDICTIONS AT INTERNET SCALE Internet scale? Geographic scale - all continents Network scale - all network types Client scale 10s to 100s of millions concurrent Prediction model based on global view Quality Information Logically Centralized Controller Sub-second response Client Per-client decisions in real-time Decisions 8
CHALLENGE: REAL-TIME PREDICTIONS AT INTERNET SCALE Achieving real-time response, with the most recent global view at Internet-scale is not feasible based on today s technologies Strawman solutions Centralized global controller not real-time Partitioned controller no global view Prediction model based on global view Quality Information Logically Centralized Controller Sub-second response Client Per-client decisions in real-time Decisions 9
DESIGN CHOICE: SPLIT CONTROL PLANE Create a global prediction model in near real-time (minutes) Make per-client decisions in real-time (sub-second) based on global model and most recent per-client local information Prediction model based on global view Quality Information Logically Centralized Controller Sub-second response Client Per-client decisions in real-time Decisions 10
C3 ARCHITECTURE C3 Controller Modeling Layer near real-time global prediction modeling Decision Layer real-time per-client decisions Narrow waist protocol Session Data Model (SDM) Control Decisions Diverse Client Platforms Adaptor C3 Sensing/Actuation Layer Narrow waist device API ConvivaStreamerProxy (CSP) 11
C3 ARCHITECTURE C3 Controller Modeling Layer near real-time global prediction modeling Decision Layer real-time per-client decisions Narrow waist protocol Session Data Model (SDM) Control Decisions Diverse Client Platforms Adaptor C3 Sensing/Actuation Layer Narrow waist device API ConvivaStreamerProxy (CSP) 12
MOTIVATING EXAMPLE Compute Video Startup Time metric Default Op<on: Compute Video Startup Time in the client, send to controller Session start Ad start Ad end Play start Startup Time 1 = T(play start) T(session start) Startup Time 2 = T(play start) T(session start) (T(Ad end) T(Ad start)) How do we adapt to changes? 13
IDEA: EXPOSE LOW-LEVEL ACTIONS Metric calculation pushed to the controller Session Data Model (SDM) Events, States, Measurements States robustness to message loss Player States Joining Playing Buffering Playing Events Player Monitoring time Network/ stream connection established Video buffer filled up Video buffer empty Video download rate, Available bandwidth, Dropped frames, Frame rendering rate, etc. Buffer replenished sufficiently Stopped/ Exit User action 14
C3 ARCHITECTURE C3 Controller Modeling Layer near real-time global prediction modeling Decision Layer real-time per-client decisions Narrow waist protocol Session Data Model (SDM) Control Decisions Diverse Client Platforms Adaptor C3 Sensing/Actuation Layer Narrow waist device API ConvivaStreamerProxy (CSP) 15
SPLIT CONTROL-PLANE Dual-loop control Client-driven real-time loop (red) Periodic global model learning and dissemination loop (blue) Intelligent Decision Layer Per-client decisions based on global model and client quality info Aggregation across Clients SDM SDM Prediction Model Metric Computation & Per-client Decisions Control Decisions Prediction Model Training Modeling Layer Decision Layer Sensing/ Actuation Layer 16
WHY SPLIT CONTROL CAN ACHIEVE NEAR OPTIMAL PERFORMANCE Domain specific insight globally optimal decisions tend to be persistent on minute-level timescales Persistence results do not hold for individual clients - still need per-client up-todate state CDF 1 0.8 0.6 0.4 0.2 0 CP A CP B CP C 1 10 100 1000 Persistence(min) Figure shows the distribu<on of persistence of the best CDN for three content providers 17
RESPONSIVENESS & SCALE Sub-second response time Geographically distributed decision layer Geo-distribution using cloud Controller scale Horizontally scaled decision layer Burst scaling using cloud Big-data technologies including Spark for modeling layer Physical View of C3 Controller Aggregation/ Compute Cluster Frontend Data Centers Clients 18
FAULT TOLERANCE Physical View of C3 Controller Decision layer gracefully degrades due to stale or missing global model Client-side failover and decision layer state reconstruction to handle instance failure Aggregation/ Compute Cluster Frontend Data Centers Clients 19
REAL-WORLD DEPLOYMENT & EVOLUTION Earlier phases explored alternate architectures Partitioning Heavy client (client offload of computation and decision logic) C3 platform used by premium video publishers over 8 years Over 1 billion unique devices Over 3 million concurrent devices during major events Ø E.g., World Cup, Super Bowl Significant quality improvement results 20
QUALITY IMPROVEMENT: REBUFFERING Premium content publisher One month A/B test Result: Greater then 50% reduction in Buffering Ratio BufRatio (%) 2.5 2 Native control 1.5 C3 control 1 0.5 0 5/1/14 5/8/14 5/15/14 5/22/14 5/29/14 Time (mm/dd/yy) 21
QUALITY IMPROVEMENT: START FAILURES Premium content publisher One month A/B test Result: Greater then 60% reduction in Video Start Failures VSF (%) 4 3 2 Native control 1 C3 control 0 5/1/14 5/8/14 5/15/14 5/22/14 5/29/14 TIme (mm/dd/yy) 22
QUALITY IMPROVEMENT ACROSS PROVIDERS Rebuffering Ratio improvement consistent results across 5 different content providers BufRatio (%) 3 2.5 Native control 2 C3 control 1.5 1 0.5 0 1 2 3 4 5 Content Providers 23
LESSONS Validates centralized control at an unprecedented scale Reinforces the case for centralized control Drivers: client heterogeneity, global policies, real-time monitoring Enablers: big-data, cloud, application-level resilience Unique challenges driving new ideas Split control balances global view and real-time Minimal client critical to handle heterogeneity Broader applicability of C3 architecture Other applications (e.g., gaming, voice, conferencing) Network layer control (e.g., SDN, CDN) 24
CONCLUSIONS Television is coming to the Internet, high expectations on quality Large optimization space for video quality Ø Drives the need for a proactive approach C3 is a centralized predictive control approach Solves key challenges: (a) client heterogeneity, (b) Internet-scale Minimal sensing/actuation layer & move all computation to the controller Split control-plane scale-out solution exploiting application-level resilience C3 used by many content providers with quality improvement results 25