Streaming Video Experimentation at Netflix: Visualizing Practical and Statistical Significance

By Netflix Technology Blog, netflixtechblog.com
Streaming video experimentation at Netflix seeks to optimize the Quality of Experience (QoE for short) of the videos we stream to our 130 million members around the world. To measure QoE, we look at a wide variety of metrics for each playback session, including play delay; the rates of rebuffers (playback interruptions when the video buffer empties), playback errors, and user-initiated aborts; the average bitrate throughout playback; and Video Multimethod Assessment Fusion (VMAF), a measure of perceptual video quality developed here at Netflix.

Many of our experiments are "systems tests": short-running (hours to a week) A/B experiments that seek to improve one QoE metric without harming others. For example, we may test the production configuration of the adaptive streaming algorithm, which selects video quality based on device capabilities, resolution limits based on the Netflix plan tier, and time-varying network conditions, against a new parameter configuration that aims to reduce play delay without degrading other metrics. Although each test that results in the rollout of a new production experience may only incrementally improve one or two QoE metrics, and only for some members, over time the cumulative impact is a steady improvement in our ability to efficiently deliver high-quality streaming video at scale to all of our diverse members.
The treatment effects in these streaming experiments tend to be heterogeneous with respect to network conditions and other factors. As an example, we might aim to reduce play delay via predictive precaching of the first few seconds of the video that our algorithms predict a member is most likely to play. Such an innovation is likely to have only a small impact on the short play delays that we observe for high quality networks, but it may result in a dramatic reduction of the lengthy play delays that are more common on low throughput or unstable networks.

Because treatments in streaming experimentation may have much larger impacts on high (or low) values of a given metric, changes in the mean, the median, or other summary statistics are not generally sufficient to understand if and how the test treatment has changed the behaviour of that metric. In general, we are interested in how the distributions of relevant metrics differ between test experiences (called "cells" here at Netflix). Our goal is to arrive at an inference and visualization solution that simultaneously indicates which parts of the distribution have changed, and both the practical and statistical significance of those changes.
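One simple way to see which parts of a distribution have changed, before any formal inference, is to compare the two cells quantile by quantile. The sketch below, a minimal illustration and not Netflix's actual methodology, computes the treatment-minus-control difference at a grid of quantiles for simulated play-delay data; all function names, parameters, and the simulated data are assumptions for the example. For a "smaller is better" metric like play delay, a negative difference at a given quantile means the treatment shifted that part of the distribution downward.

```python
import numpy as np

def quantile_differences(control, treatment, probs=None):
    """Compare two metric distributions at a grid of quantiles.

    For each probability p, returns Q_treatment(p) - Q_control(p).
    For a metric like play delay, negative values indicate the
    treatment improved that part of the distribution.
    """
    if probs is None:
        probs = np.arange(0.05, 1.0, 0.05)  # 5th through 95th percentile
    q_control = np.quantile(control, probs)
    q_treatment = np.quantile(treatment, probs)
    return probs, q_treatment - q_control

# Simulated play-delay samples in seconds (hypothetical data).
# The treatment mostly trims the slow tail, mimicking the kind of
# heterogeneous effect described above: small changes for fast
# sessions, large improvements for slow ones.
rng = np.random.default_rng(0)
control = rng.lognormal(mean=0.5, sigma=0.9, size=10_000)
treatment = np.minimum(control, rng.lognormal(mean=0.4, sigma=0.7, size=10_000))

probs, diffs = quantile_differences(control, treatment)
for p, d in zip(probs, diffs):
    print(f"p={p:.2f}  treatment - control = {d:+.3f} s")
```

Plotting `diffs` against `probs` gives a first look at where the distribution moved; the harder part, which the rest of this approach addresses, is attaching statistical significance to each of those quantile differences rather than eyeballing them.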