While generative AI might be making headlines in the wider media and entertainment industry, multimodal AI is finding increased adoption in media technology. It is designed to process and connect elements such as visuals, audio, language, tone, and timing, rather than treating them in isolation.
That’s especially important in media and entertainment, where meaning often comes from nuance and emotional context, explains TwelveLabs co-founder Soyoung Lee. “With this approach, companies can understand not only what was said, but how it was said and what was happening on screen at that exact moment,” Lee adds. “It leads to more accurate insights, stronger content discovery, and ultimately, a deeper connection between creators and their audiences.”

For media and entertainment companies with huge video libraries, searching or monetising that content can require significant manual effort. TwelveLabs’ Pegasus foundation model is built on multimodal AI, enabling users to search their content not just by metadata or transcript keywords, but by understanding what’s happening in the video.
A sports network can quickly find every instance of a specific event or commentator reaction, a broadcaster can identify recurring themes across large volumes of footage, and a news team can surface key moments from raw video in near real time.
“By making this kind of deep, contextual search possible, we’re helping teams turn their video archives into usable, indexable assets, unlocking both operational efficiency and new revenue opportunities,” says Lee.
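To make the idea concrete, the sketch below illustrates the general principle behind this kind of contextual search: video segments are represented by multimodal embeddings that capture visuals, audio, and speech together, and a natural-language query is matched against them by semantic similarity rather than keyword overlap. This is a minimal, hypothetical example of the underlying technique; the data structures and function names are assumptions for illustration, not TwelveLabs’ actual API.

```python
# Minimal sketch of semantic (embedding-based) video search, as opposed to
# keyword/metadata search. The VideoSegment structure and the way embeddings
# are produced are illustrative assumptions, not a real product interface.

from dataclasses import dataclass

import numpy as np


@dataclass
class VideoSegment:
    video_id: str
    start_sec: float
    end_sec: float
    embedding: np.ndarray  # multimodal embedding of this clip (visuals + audio + speech)


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Standard cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def search_segments(query_embedding: np.ndarray,
                    segments: list[VideoSegment],
                    top_k: int = 5) -> list[tuple[VideoSegment, float]]:
    """Rank video segments by how closely their embeddings match the query
    embedding, e.g. for a query like 'commentator reacts to a late goal'."""
    scored = [(seg, cosine_similarity(query_embedding, seg.embedding))
              for seg in segments]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]
```

In practice, the query text would be embedded by the same multimodal model that indexed the footage, which is what lets a search for an event, a reaction, or a theme return moments that were never tagged in metadata.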