I'm Dale Curtis, the engineering lead for media playback in Chromium. My team is responsible for the web-facing APIs for video playback, like MSE and WebCodecs, and for the platform-specific internals involved in demuxing, decoding, and rendering audio and video.
In this article, I'll walk you through Chromium's video rendering architecture. While some details around extensibility are likely Chromium-specific, most of the concepts and designs discussed here apply to other rendering engines and even native playback apps.
Chromium's playback architecture has changed significantly over the years. While we didn't start with the idea of a pyramid of success as described in the first post in this series, we ultimately followed similar steps: reliability, performance, and then extensibility.
In the beginning, video rendering was quite simple: just a for loop choosing which software-decoded video frames to send to the compositor. For years this was reliable enough, but as the complexity of the web increased, the need for more performance and efficiency led to architectural changes. Many improvements required OS-specific primitives; thus, our architecture also had to become more extensible to reach all of Chromium's platforms.
Video rendering can be broken into two steps: choosing what to deliver and delivering that information efficiently. In the interest of readability, I'll cover efficient delivery before diving into how Chromium chooses what to deliver.
Some terms and layout
Since this article focuses on rendering, I'll only briefly touch on the demuxing and decoding aspects of the pipeline.
Demuxing is the process by which the media pipeline turns a byte stream into individual encoded audio and video packets.
Decoding is the process by which those packets are turned into raw audio and video frames. In the context of media playback, rendering is choosing where in time to present those decoded audio and video frames.
Demuxing and decoding in our modern, security-conscious world require a fair bit of care. Binary parsers are target-rich environments, and media playback is full of binary parsing. As such, security issues in media parsers are extremely common.
Chromium practices defense in depth to reduce the risk that security issues pose to our users. In practical terms, this means demuxing and software decoding always happen in a low-privilege process, while hardware decoding occurs in a process with just enough privileges to talk to the system's GPU.
Chromium's cross-process communication mechanism is called Mojo. We won't get into the details of Mojo in this article, but as the abstraction layer between processes, it's a cornerstone of Chromium's extensible media pipeline. It's important to keep this in mind as we walk through the playback pipeline, since it informs the complex orchestration of cross-process components that interact to receive, demux, decode, and finally display media.
So many bits
Understanding today's video rendering pipelines requires knowing why video is special: bandwidth. Playback at 3840x2160 (4K) resolution and 60 frames per second uses between 9 and 12 gigabits per second of memory bandwidth. While modern systems may have a peak bandwidth in the hundreds of gigabits per second, video playback still represents a substantial portion of it. Without care, the total bandwidth can easily multiply due to copies or trips between GPU and CPU memory.
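To make the scale concrete, here is a back-of-the-envelope calculation in Python. The bits-per-pixel values are assumptions chosen for illustration (the figure above does not say which in-memory pixel formats it is based on), and `raw_video_gbps` is a hypothetical helper name.

```python
def raw_video_gbps(width: int, height: int, fps: int, bits_per_pixel: int) -> float:
    """Bandwidth, in gigabits per second, of one pass over uncompressed video."""
    return width * height * fps * bits_per_pixel / 1e9


# Assumed in-memory frame sizes, for illustration only.
for bits_per_pixel in (12, 18, 24):
    gbps = raw_video_gbps(3840, 2160, 60, bits_per_pixel)
    print(f"{bits_per_pixel} bits/pixel: {gbps:.2f} Gbit/s")
```

Under these assumptions, 18 to 24 bits per pixel works out to roughly 9 to 12 Gbit/s for a single pass over the frames. Every extra copy costs at least that much again.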
The goal of any modern video playback engine with efficiency in mind is to minimize bandwidth between the decoder and the final rendering step. For this reason, video rendering is largely decoupled from Chromium's main rendering pipeline. Specifically, from the perspective of our main rendering pipeline, video is just a fixed-size hole with opacity. Chromium achieves this using a concept called surfaces, whereby each video talks directly to Viz, the Chromium service that handles display compositing.
Due to the popularity of mobile computing, power and efficiency have become a significant focus in the current generation of devices. As a result, decoding and rendering are more tightly coupled than ever at the hardware level, so video looks like a hole with opacity even to the OS itself! Platform-level decoders often provide only opaque buffers, which Chromium passes through to the platform-level compositing system in the form of overlays.
Every platform has its own form of overlays that works in concert with its platform decoding APIs. Windows has Direct Composition and Media Foundation Transforms, macOS has CoreAnimation Layers and VideoToolbox, Android has SurfaceView and MediaCodec, and Linux has VASurfaces and VA-API. Chromium's abstractions for these concepts are handled by the OverlayProcessor and mojo::VideoDecoder interfaces, respectively.
In some cases it's possible for these buffers to be mappable into system memory, so they don't even need to be opaque and don't consume any bandwidth until accessed—Chromium calls these GpuMemoryBuffers. On Windows these are backed by DXGI buffers, on macOS IOSurfaces, on Android AHardwareBuffers, and on Linux DMA buffers. While video playback generally doesn't need this access, these buffers are important for video capture to ensure minimal bandwidth between the capture device and eventual encoders.
Since the GPU is often responsible for both decoding and displaying, using these (often opaque) buffers ensures that high-bandwidth video data never actually leaves the GPU. As we discussed earlier, keeping data on the GPU is incredibly important for efficiency, especially at high resolutions and frame rates.
The more we can take advantage of OS primitives like overlays and GPU buffers, the less bandwidth is spent shuffling video bytes around unnecessarily. Keeping everything in one place from decoding all the way to rendering can lead to incredible power efficiency. For example, when Chromium enabled overlays on macOS, power consumption during fullscreen video playback was halved! On other platforms like Windows, Android, and ChromeOS, we can use overlays even in non-fullscreen cases, with power savings of up to 50% nearly everywhere.
Now that we've covered the optimal delivery mechanisms, we can discuss how Chromium chooses what to deliver. Chromium's playback stack uses a "pull"-based architecture, meaning each component in the stack requests its inputs from the one below it in hierarchical order. At the top of the stack is the rendering of audio and video frames, below that is decoding, then demuxing, and finally I/O. Each rendered audio frame advances a clock which, combined with a presentation interval, is used to choose video frames for rendering.
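The pull relationship between stages can be sketched with Python generators. This is a toy model, not Chromium's implementation: the stage names (`read_io`, `demux`, `decode`) and the `log` list are hypothetical, and a real pipeline requests data asynchronously and buffers ahead rather than pulling one item at a time.

```python
from typing import Iterator, List

log: List[str] = []  # records the order in which stages do work


def read_io(chunks: List[bytes]) -> Iterator[bytes]:
    """Bottom of the stack: hands out raw bytes only when asked."""
    for chunk in chunks:
        log.append("io")
        yield chunk


def demux(source: Iterator[bytes]) -> Iterator[bytes]:
    """Toy demuxer: treats each chunk as one encoded packet."""
    for chunk in source:
        log.append("demux")
        yield chunk


def decode(packets: Iterator[bytes]) -> Iterator[str]:
    """Toy decoder: turns each packet into a 'frame' string."""
    for packet in packets:
        log.append("decode")
        yield f"frame<{packet.decode()}>"


# Wiring the stages together performs no work yet.
frames = decode(demux(read_io([b"a", b"b", b"c"])))
assert log == []

# The renderer, at the top, pulls one frame; only then does each lower stage run.
first = next(frames)
print(first, log)  # frame<a> ['io', 'demux', 'decode']
```

Nothing is read, demuxed, or decoded until the top of the stack asks for a frame, which is the property the pull model is meant to provide.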
On each presentation interval (each refresh of the display), a CompositorFrameSink attached to the SurfaceLayer for the video surface described earlier asks the video renderer to provide a video frame. For content with a frame rate lower than the display rate, that means showing the same frame more than once; if the frame rate is higher than the display rate, some frames are never shown.
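Here is a minimal sketch of that selection step. It assumes the media clock advances in lockstep with the display and that the renderer always picks the newest frame whose timestamp has been reached; `frames_shown` is a hypothetical helper, and Chromium's real algorithm is more involved.

```python
from fractions import Fraction
from typing import List


def frames_shown(frame_rate: int, display_rate: int, intervals: int) -> List[int]:
    """Index of the frame chosen at each display refresh."""
    chosen = []
    for tick in range(intervals):
        media_time = Fraction(tick, display_rate)  # seconds since playback start
        # Truncation is a floor here: the newest frame whose timestamp <= now.
        chosen.append(int(media_time * frame_rate))
    return chosen


print(frames_shown(30, 60, 6))   # [0, 0, 1, 1, 2, 2]
print(frames_shown(120, 60, 4))  # [0, 2, 4, 6]
```

At 30 fps on a 60 Hz display every frame is shown twice, while at 120 fps every other frame is skipped.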
There's much more to synchronizing audio and video in ways that are pleasing to viewers. See Project Butter for a longer discussion on how optimal video smoothness is accomplished in Chromium. It explains how video rendering can be broken down into ideal sequences representing how many times each frame should be shown. For example: "1 frame every display interval ([1], 60 fps in 60 Hz)", "1 frame every 2 intervals ([2], 30 fps in 60 Hz)", or more complicated patterns like [2:3:2:3:2] (25 fps in 60 Hz) covering multiple distinct frames and display intervals. The closer a video renderer sticks to this ideal pattern, the more likely a user will perceive playback as being smooth.
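One way to derive such a pattern is to snap each frame boundary to the nearest display refresh and count the refreshes in between. The sketch below does this with integer arithmetic. It is an illustration under that rounding assumption, not the estimator Chromium uses, and `ideal_cadence` is a hypothetical name.

```python
from math import gcd
from typing import List


def ideal_cadence(frame_rate: int, display_rate: int) -> List[int]:
    """Display intervals per frame over one repeating cycle."""
    frames_per_cycle = frame_rate // gcd(frame_rate, display_rate)

    def nearest_refresh(frame_index: int) -> int:
        # Integer form of rounding frame_index * display_rate / frame_rate
        # to the nearest whole refresh (halves round up).
        return (2 * frame_index * display_rate + frame_rate) // (2 * frame_rate)

    return [
        nearest_refresh(i + 1) - nearest_refresh(i)
        for i in range(frames_per_cycle)
    ]


print(ideal_cadence(60, 60))  # [1]
print(ideal_cadence(30, 60))  # [2]
print(ideal_cadence(25, 60))  # [2, 3, 2, 3, 2]
```

A different rounding rule yields the same cycle starting at a different phase. What matters for smoothness is that the renderer keeps repeating the cycle without breaking it.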
While most Chromium platforms render frame by frame, not all do. Our extensible architecture allows for batched rendering as well. Batched rendering is an efficiency technique where the OS-level compositor is told about multiple frames in advance and handles releasing them on an application-provided timing schedule.
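As a rough sketch of what a batch could look like, the snippet below pairs each frame with the time at which it should be released. The `build_batch` helper and the idea of handing the list to a compositor in one call are assumptions for illustration; real platform APIs differ in how they accept frames and timestamps.

```python
from fractions import Fraction
from typing import List, Tuple


def build_batch(first_index: int, count: int, frame_rate: int,
                first_release: Fraction) -> List[Tuple[int, Fraction]]:
    """Pair each frame in the batch with the time it should be released."""
    frame_duration = Fraction(1, frame_rate)
    return [
        (first_index + i, first_release + i * frame_duration)
        for i in range(count)
    ]


# Four frames of 25 fps video, scheduled up front for a hypothetical compositor.
batch = build_batch(first_index=100, count=4, frame_rate=25,
                    first_release=Fraction(4))
for index, release in batch:
    print(index, float(release))
```

The application does the timing work once per batch instead of once per display refresh, which is where the efficiency gain comes from.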
The future is now?
We've focused on how Chromium takes advantage of OS primitives to deliver a best-in-class playback experience. But what about websites that want to go beyond basic video playback? Can we offer them the same powerful primitives that Chromium itself uses to usher in the next generation of web content?
We think the answer is yes! Extensibility is at the heart of how we think about the web platform these days. We've been working with other browsers and developers to create new technologies like WebGPU and WebCodecs so that web developers can use the very same primitives Chromium does when talking to the OS. WebGPU brings support for GPU buffers and WebCodecs brings platform decoding and encoding primitives compatible with the aforementioned overlay and GPU buffer systems.
End of stream
Thanks for reading! I hope you've left with a better understanding of modern playback systems and how Chromium powers several hundred million hours of watch time every day. If you're looking for further reading on codecs and modern web video, I recommend H.264 is magic by Sid Bala, How Modern Video Players Work by Erica Beaves, and Packaging award-winning shows with award-winning technology by Cyril Concolato.
One illustration (the pretty one!) by Una Kravets.