
An inside look at immersive audio

Immersive audio pioneer Felix Krückels talks to TVBEurope about next-generation audio, how to prepare for it, and how to mix audio for live broadcasts.

Felix Krückels is a certified audio engineer who has been involved in immersive audio mixing since 2012. He was the audio engineer for the international feed at UEFA Euro 2008 and the 2010 FIFA World Cup, both of which were produced in 5.1. The FIFA World Cup 2018 in Russia was another highlight of his career: it was produced in Dolby Atmos. 

Krückels became involved in 3D audio almost by accident when he was asked to support an immersive project at the 2013 Confederations Cup, a test run for producing the 2014 FIFA World Cup final in Brazil in 3D/immersive audio.

While the production of the 2014 final still required post-processing and mixing (the 3D audio rendition was not available for the live telecast), live production arrived at the Women’s World Cup in Vancouver the following year. For the first time, the entire signal chain, from the microphones through to end consumers’ homes, was handled live. The tournament in Russia, for its part, was telecast in quasi-Atmos 3D/immersive audio.

One Step Removed

According to Krückels, the most important mixing challenges stem from the “bonus features” afforded by immersive audio: the A1 (sound supervisor, audio engineer; different monikers for the same person) has to bear in mind the freedom and flexibility granted to end consumers of the audio content.

This is due to the object-based nature of the MPEG-H or Atmos audio material: end consumers can individualise the streams they receive by changing the levels of the ambience, the commentator, and so on. The A1 therefore needs to take a step back and prepare the material in a way that allows for different mixes.
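As a rough illustration of that personalisation, here is a minimal Python sketch assuming a hypothetical receiver-side object model (none of these names come from the MPEG-H or Atmos specifications): the viewer’s gain tweak is applied per object, clamped to a range the producer permits.

```python
from dataclasses import dataclass

@dataclass
class AudioObject:
    name: str             # e.g. "ambience" or "commentary"
    base_gain_db: float   # gain authored by the A1
    user_range_db: tuple  # (min, max) offset the broadcaster permits

def personalised_gain_db(obj: AudioObject, user_offset_db: float) -> float:
    """Apply the viewer's tweak, clamped to the permitted range."""
    lo, hi = obj.user_range_db
    return obj.base_gain_db + max(lo, min(hi, user_offset_db))

# A viewer turns the commentary down by 12 dB, well within the allowed range.
commentary = AudioObject("commentary", base_gain_db=0.0, user_range_db=(-60.0, 6.0))
print(personalised_gain_db(commentary, user_offset_db=-12.0))  # -> -12.0
```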

With next-generation audio (NGA) object-based productions, mixing for a single, fixed listening scenario is no longer possible: sound supervisors create a three-dimensional space based on a given number of objects. This audio content can be consumed using binaural headphones, speakers (2, 4, 6), soundbars, up-firing speakers, etc., and the A1 can no longer completely predict the result.

Enter the rendering principle: the audio objects supplied to end consumers contain coordinates rather than channel or speaker references. These allow the decoder at home to render the immersive audio content as a translation/adaptation of the panning information to the real-life speaker setup in end consumers’ homes.
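A deliberately naive sketch of that rendering step, assuming unit-vector speaker directions (real MPEG-H and Atmos renderers are far more sophisticated): the same object coordinates produce sensible gains for whatever layout the decoder finds at home.

```python
import numpy as np

def render_gains(obj_dir: np.ndarray, speaker_dirs: np.ndarray) -> np.ndarray:
    """Weight each speaker by how closely its direction matches the
    object's, then normalise so total power is layout-independent."""
    obj_dir = obj_dir / np.linalg.norm(obj_dir)
    gains = np.clip(speaker_dirs @ obj_dir, 0.0, None)  # ignore speakers facing away
    norm = np.linalg.norm(gains)
    return gains / norm if norm > 0 else gains

# The same object rendered to a home layout the A1 never saw
# (unit vectors; x = right, y = front, z = up).
obj = np.array([0.3, 0.9, 0.3])  # front-right, slightly elevated
stereo = np.array([[-1.0, 1.0, 0.0], [1.0, 1.0, 0.0]]) / np.sqrt(2)
print(render_gains(obj, stereo))
```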

Sound supervisors still need to check whether their mix works in a variety of listening scenarios, even though none of them may correspond to the rendition at home. Up to four presentations are prepared, and the A1 needs to check each at regular intervals against typical final speaker layouts: 5.1.4, 5.1 and stereo, all with and without commentary.

Among the “presentations” Krückels prepares is a creation he developed with the Dolby team called the “Pub Presentation”, where the crowd at the sports venue is barely audible, leaving ample room for the cheers and boos produced live by the people watching the game at the bar. Field-of-play audio details, like ball kicks, tackles, whistles and groans, are deliberately emphasised instead.
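One way to picture a presentation is as a named set of per-object gain offsets; a hedged sketch with purely illustrative values (the real metadata carries much more than gains):

```python
# A presentation as a per-object gain map in dB; values are illustrative only.
presentations = {
    "default": {"ambience": 0.0,   "fop_details": 0.0, "commentary": 0.0},
    "pub":     {"ambience": -20.0, "fop_details": 6.0, "commentary": 0.0},
}
```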

So Far, So Immersive

Individualisation of 3D/immersive audio can be a mixed blessing, though: the added flexibility can easily lead to situations where tweaks by end users blur the audio content beyond recognition.

This explains why sound supervisors of immersive audio productions often favour a conservative approach, with relatively few bells and whistles. They know that they are unable to control what viewers at home do to the incoming presentations and therefore limit the options at the source.

The production as such is relatively straightforward and very similar to 5.1 scenarios, except for the added dimension (height/elevation), which “merely” requires additional busses on the mixing console.

Immersive Nuts and Bolts

The most important consideration for an A1 is the ease with which they can monitor the various presentations and formats (stereo, surround, 3D) right from their console; in live production, speed is of the essence. Lawo’s mc² mixers have allowed audio engineers to control all relevant parameters in this way for over a decade.

As the console itself is only one element, with an external renderer handling other parts of the chain, the ability to control all relevant devices via one user interface becomes paramount. Thanks to their integration with Dolby and their open Ember+ control protocol, mc² consoles are an important step in the right direction.

As far as 3D microphone placement is concerned, the first step is to look for a venue’s “sweet spot”, i.e. the position where you can hear everything. This position is usually located close to camera 1, i.e. the main camera.

The 3D microphone is suspended from the roof, at a suitable distance from the crowd. It serves the same purpose as a suspended microphone used to capture the overall sound of a symphonic orchestra. For reasons of intelligibility and flexibility, spot microphones are positioned close to all important sound sources. The resulting signals are combined so that what comes out of a 5.1.4 layout’s nine speakers “makes acoustic sense”.

Krückels likes to work with three planes for his mixes: a mono, a stereo and a surround/3D plane. The surround/3D information usually only concerns the ambience (crowd, etc.). He hardly ever uses dynamic “movie panning effects”, even though some broadcasters like to use the occasional “sound whoosh” to announce instant slomo replays.

In broadcast, usually only the ambience is captured in an immersive way. This is Krückels’ “top plane”, to which he adds—usually in mono—the typical field-of-play noises, signals captured in stereo by microphones close to the cameras, and so on. His third plane (mono), finally, only carries the narrator/commentator. Krückels takes pride in painstakingly separating these three planes to leave sufficient room for artistic license and alternatives.
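A hedged sketch of that three-plane separation as a simple routing table (bus names and widths are assumptions for illustration, not Krückels’ actual console layout):

```python
# Keeping the planes strictly separate preserves room for alternative mixes.
planes = {
    "ambience": {"width": "5.1.4 (immersive)", "sources": ["roof 3D microphone", "crowd mics"]},
    "fop":      {"width": "mono/stereo",       "sources": ["ball-kick spot mics", "camera mics"]},
    "speech":   {"width": "mono",              "sources": ["commentator"]},
}
```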

He favours a static placement of the ambience mix, even though switches among cameras might suggest otherwise. Applying audio-follows-video to the ambience signal, he says, would quickly lead to listening fatigue and discomfort.

If the video director does a good job, humans have no difficulty understanding that the action is on the left side of the field, even though it is shown at the centre. This also explains why ball kick noises are always at the centre (mono), irrespective of where they occur on the field. Of course, the FOP microphones, which also pick up crowd noises, will change the localisation of the crowd. Crowdless events in Covid-19 scenarios, on the other hand, have made it easier to distribute FOP noises in a stereo field: the left side of a goal can appear slightly left of centre in the sound image, and the right side slightly to the right.
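That “slightly left of centre” placement can be expressed with a standard constant-power pan law; a minimal sketch, with illustrative pan positions:

```python
import math

def constant_power_pan(pos: float) -> tuple:
    """pos in [-1, 1]: -1 = hard left, 0 = centre, +1 = hard right.
    cos/sin gains keep perceived loudness constant across the arc."""
    theta = (pos + 1.0) * math.pi / 4.0
    return math.cos(theta), math.sin(theta)

print(constant_power_pan(-0.2))  # left goalpost: slightly left of centre
print(constant_power_pan(0.2))   # right goalpost: slightly right of centre
```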

Are there different philosophies regarding how immersive audio should be mixed? There seems to be a European and an American approach: Europeans pay more attention to a convincing crowd sound, while American productions often favour an “in your face” aesthetic. In the latter case, placing players at the centre of the audio image and using heavy compression ratios are deemed more important than providing a stable soundscape, never mind audio artefacts.

Krückels himself subscribes to a compromise between these two—paying attention to details (FOP noises) while maintaining a truly immersive, “at the venue” ambience. 

Dynamics

Dynamics effects are extremely important in a surround/immersive audio scenario. For key signals (speech, music, FOP noises), the human ear prefers to stay within a maximum range of +7 to −10 LU. One solution that puts this principle to clever use, says Krückels, is Lawo’s KICK/PUCK software: it manages to keep kick and spill noises at a constant level, avoiding brutal level jumps and artefacts.
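To illustrate the windowing principle only (this is not how Lawo’s KICK/PUCK works internally), here is a crude sketch of a level rider: the short-term level is measured per block, and a smoothed corrective gain pulls the signal back when it drifts outside a +7/−10 LU window. Block RMS in dB is used as a rough stand-in for LU; real loudness meters apply BS.1770 weighting.

```python
import numpy as np

def ride_level(x: np.ndarray, sr: int, target_db: float = -23.0,
               up_lu: float = 7.0, down_lu: float = -10.0,
               block_s: float = 0.4, smooth: float = 0.9) -> np.ndarray:
    """Nudge the signal back when its short-term level leaves
    [target + down_lu, target + up_lu]; smoothing avoids level jumps."""
    hop = int(sr * block_s)
    out = x.astype(float).copy()
    gain_db = 0.0
    for i in range(0, len(out) - hop + 1, hop):
        seg = out[i:i + hop]
        level_db = 20.0 * np.log10(np.sqrt(np.mean(seg ** 2)) + 1e-12)
        error_db = 0.0
        if level_db > target_db + up_lu:
            error_db = (target_db + up_lu) - level_db
        elif level_db < target_db + down_lu:
            error_db = (target_db + down_lu) - level_db
        gain_db = smooth * gain_db + (1.0 - smooth) * error_db
        out[i:i + hop] = seg * 10.0 ** (gain_db / 20.0)
    return out
```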

Supplying pre-processed stems to end consumers is highly beneficial. Care should be taken, however, to avoid flattening the stems’ dynamic range beyond recognition.

Will it take off?

Most A1s are confident that 3D/immersive audio will establish itself much faster than 5.1 did, not least thanks to VR and AR games, which have applied 3D viewing and listening for years. Most of today’s children are already familiar with binaural listening, while head tracking (which keeps the sound scene stationary when you turn your head while wearing headphones) is available on most gaming consoles. Smartphones are already perfectly able to decode such information.
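The head-tracking idea boils down to a coordinate rotation before binaural rendering: each object is rotated by the inverse of the measured head yaw, so the source stays put in the room. A minimal sketch, yaw only (real trackers also handle pitch and roll):

```python
import numpy as np

def head_compensated_dir(obj_dir_world: np.ndarray, head_yaw_rad: float) -> np.ndarray:
    """Convert a world-fixed object direction into head-relative
    coordinates by rotating with the inverse head yaw (about z = up)."""
    c, s = np.cos(-head_yaw_rad), np.sin(-head_yaw_rad)
    rot = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    return rot @ obj_dir_world

# A source straight ahead stays fixed in the room: when the listener
# turns 30 degrees to the left, it appears 30 degrees to their right.
front = np.array([0.0, 1.0, 0.0])
print(head_compensated_dir(front, np.radians(30.0)))
```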

Audio engineers can easily create binaural mixes that serve as immersive sound renditions—and most people will be hooked almost instantly and never want to return to a stereo mix. Krückels believes that headphones will play an important part in establishing immersive audio. 

He himself is not fond of soundbars, which he considers merely a cool compromise. For a true 3D impression, and the hard work that goes into it from the A1’s perspective, soundbars may simply fall short of leveraging the added value.

What about 3D/immersive audio in cars? 

Unlike living rooms, a car cabin has a fixed, known speaker layout and fixed listening positions. As such, producing a Dolby Atmos presentation for a given car model is easy and would yield maximum listening satisfaction. The future of 3D/immersive audio looks rather bright…