Netflix’s engineering team has published a new blog post on how it uses prioritised load shedding to keep viewers’ streams as smooth as possible.
In the past, the streamer has suffered outages caused by congestion, but it has now implemented a “priority throttling filter” in Zuul, its API gateway, that can shed unnecessary server requests in real time whenever there is a problem on the backend.
Zuul prioritises traffic based on how much a user needs it for playback. It uses three buckets to categorise server requests—non-critical, degraded experience, and critical.
The engineers have classed logs and background requests as non-critical items, which they say make up a large portion of throughput.
Degraded experience is traffic that affects viewers’ experience, but not their ability to play. The traffic in this bucket is used for features such as stop and pause markers, language selection in the player and viewing history, among others.
Traffic that affects viewers’ ability to play is classed as critical, with viewers likely to see an error message if their play request fails.
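As a rough illustration, the three buckets could be modelled along these lines. This is a minimal sketch rather than Netflix’s code: the request kinds, the `BUCKET_BY_KIND` table and the `classify` helper are hypothetical names, and the choice to treat unknown kinds as non-critical is an assumption.

```python
from enum import Enum


class Bucket(Enum):
    NON_CRITICAL = "non-critical"
    DEGRADED_EXPERIENCE = "degraded experience"
    CRITICAL = "critical"


# Hypothetical request kinds, loosely based on the examples the engineers give.
BUCKET_BY_KIND = {
    "log": Bucket.NON_CRITICAL,
    "background": Bucket.NON_CRITICAL,
    "pause_marker": Bucket.DEGRADED_EXPERIENCE,
    "language_selection": Bucket.DEGRADED_EXPERIENCE,
    "viewing_history": Bucket.DEGRADED_EXPERIENCE,
    "play": Bucket.CRITICAL,
}


def classify(kind: str) -> Bucket:
    """Return the bucket for a request kind."""
    # Assumption: anything unrecognised is treated as non-critical.
    return BUCKET_BY_KIND.get(kind, Bucket.NON_CRITICAL)
```

For example, `classify("play")` returns `Bucket.CRITICAL`, while `classify("log")` returns `Bucket.NON_CRITICAL`.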
Zuul categorises each viewer request into one of the buckets and computes a priority score between 1 and 100 for it, based on its individual characteristics. If problems develop on the backend, or even with Zuul itself, the filter sheds the lowest-priority requests first. Playback is given preferential treatment over everything else, which the engineers hope means viewers won’t notice any problems.
According to the engineers, Netflix suffered an incident earlier this year similar to a 2019 outage in which viewers were unable to access content. This time, with the filter in place, viewers didn’t notice any problems.
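The idea of shedding the lowest priorities first can be sketched in a few lines. This is an illustration under stated assumptions, not the filter Netflix runs: it assumes a higher score means a more important request, and it assumes overload is reported as a single number between 0.0 and 1.0. The `Request` class and the `shed_cutoff` and `admit` functions are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Request:
    name: str
    priority: int  # 1 to 100; assumption: higher means more important


def shed_cutoff(overload: float) -> int:
    """Map an overload level in [0.0, 1.0] to a priority cut-off in [0, 100]."""
    overload = min(max(overload, 0.0), 1.0)  # clamp out-of-range readings
    return round(overload * 100)


def admit(requests: list[Request], overload: float) -> list[Request]:
    """Keep only requests whose priority is above the current cut-off."""
    cutoff = shed_cutoff(overload)
    return [r for r in requests if r.priority > cutoff]


traffic = [
    Request("log", 10),
    Request("viewing_history", 50),
    Request("play", 95),
]

# As overload grows, the cut-off rises and lower priorities are shed first.
healthy = [r.name for r in admit(traffic, 0.0)]
strained = [r.name for r in admit(traffic, 0.3)]
failing = [r.name for r in admit(traffic, 0.6)]
```

In this sketch, all three requests are admitted when the system is healthy, the log request is dropped at an overload of 0.3, and only the play request survives at 0.6. A real filter would need its own way of measuring overload and of keeping the very highest priorities alive under total saturation.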
“Zuul’s progressive load shedding kicked in and started shedding traffic until the service was in a healthy state without impacting members’ ability to play at all,” wrote the engineers. “Members were happily watching their favourite show on Netflix while the infrastructure was self-recovering from a system failure.”
The full post is available on the Netflix engineering team’s blog.