Bug of the day: 2024-09-26

We had a system whose setup was relatively simple: Take a number of live sources, demux them, put them through audiomixer / compositor to combine together their audio/video streams, and mux the result back into another live sink.

This was initially working fine, until some code change broke it. The symptoms were that, after enough time had passed, the memory usage would start increasing. Soon afterwards, the sources would start complaining that their buffers weren’t being consumed fast enough downstream.

A first investigation showed a lot of audio buffers gathered right before the muxer. And I mean A LOT of audio buffers. Gigabytes of them. Raw audio, but still, it’s not easy to gather gigabytes of audio! I had foolishly left an infinite queue at that spot, thinking “it’s just live sources anyway, and with just audio, what could go wrong”. Famous last words, right?

I tried gathering logs around the audio path. It took a while, because, as always, enabling logs made the bug seemingly disappear… until I finally reproduced it, after struggling for many hours. That wasn’t helpful either; everything seemed to be running normally. At that point I thought, one of the changes we had done was to add the audiomixer into the audio path, while there was an input-selector previously – we had decided to mix the audio streams together instead of just selecting one single input. So, I thought about reverting that to get it out of the way, thinking that would surely fix the issue. Alas, it didn’t. The issue persisted, even with the input-selector. How curious.

At that point I decided to start bisecting. That led me to a surprising result: The offending commit was one that enabled alpha compositing. It would force the compositor to output AYUV, making it convert all its sources. So… the problem was in the video path, not the audio path? That would at least explain why nothing came up in my logs earlier! Reverting the offending commit made the issue indeed disappear, but alpha compositing was also a feature that we needed, so I couldn’t just leave it at this, I had to get to the bottom of the issue.

After a little thought, I realised: What if the compositor (configured to a decent n-threads, mind you) still couldn’t keep up with the video conversion? That would mean it’s outputting buffers slower than real time. The audio path is real time, which means audio buffers would slowly start piling up before the muxer. At the same time, video buffers couldn’t be consumed as fast as they are produced, causing the sources to complain about this fact. Everything was checking out. With the detailed logs I had enabled earlier, I had essentially slowed down the audio path and the sources as well, so it was accidentally working again. Slower than real time, but working.

But why would the compositor not keep up with the video conversion? It does process all pads in parallel if you set n-threads to a sensible default. But there’s also a caveat! Each sink pad was still single-threaded! As it turns out, you have to create a VideoConverterConfig, set its threads to a sensible number, and then tell each compositor sink pad to use that. That solved our bug.

Another solution was to use dedicated videoconvert elements before the compositor, with decoupling queues in between, to make sure they’d run in their own thread instead of the same thread as the corresponding compositor sink pads. We ended up doing both at once. The system was running very stable for hours afterwards.

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.