We had a pipeline that would capture 4K video, convert it using a high-quality but low-compression-ratio codec, mux it into a .mov container together with raw audio, and write the result to a file on a networked drive. The drive is fast enough to easily make full use of the 10Gb Ethernet card.
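To give an idea of the shape of the thing, here's a rough sketch in Python/GStreamer. The element choices (decklinkvideosrc/decklinkaudiosrc as the capture source, avenc_prores as the low-compression codec) and the file path are illustrative assumptions, not our actual production code:

```python
#!/usr/bin/env python3
# Rough sketch of the pipeline shape described above.
# Element names and the output path are assumptions for illustration.
import gi
gi.require_version("Gst", "1.0")
from gi.repository import Gst, GLib

Gst.init(None)

pipeline = Gst.parse_launch(
    # muxer -> queue -> file sink on the networked drive
    "qtmux name=mux ! queue name=q_sink ! "
    "filesink name=sink location=/mnt/storage/capture.mov "
    # video branch: capture -> 600 ms queue -> encoder -> queue -> muxer
    "decklinkvideosrc ! videoconvert ! "
    "queue name=q_enc max-size-time=600000000 ! "
    "avenc_prores ! queue name=q_mux ! mux.video_0 "
    # audio branch: raw audio straight into the muxer
    "decklinkaudiosrc ! audioconvert ! queue ! mux.audio_0"
)

pipeline.set_state(Gst.State.PLAYING)
GLib.MainLoop().run()
```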
One 4K session was running fine. Two 4K sessions were freezing. CPU and memory usage were fine – nothing that seemed to cause things to lock up. I took pipeline dumps during the freeze: the queue right before the encoder was full. It was 600 ms, and qtmux shouldn't need more than that for interleaving, so why was the queue getting full? So I examined the timestamps of all incoming buffers. Turns out that audio and video were off by more than 600 ms … but WHY? (Thursday night is here)
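The timestamp inspection boils down to hanging a buffer probe on each of the muxer's sink pads and printing the PTS of everything that arrives. A minimal sketch, continuing from the pipeline object above (pad and element names are the ones assumed there):

```python
# Assumes the `pipeline` object from the sketch above.
def log_pts(pad, info, label):
    # Print the presentation timestamp of every buffer reaching the muxer,
    # so the audio and video branches can be compared side by side.
    buf = info.get_buffer()
    if buf.pts != Gst.CLOCK_TIME_NONE:
        print(f"{label}: pts = {buf.pts / Gst.SECOND:.3f} s")
    return Gst.PadProbeReturn.OK

mux = pipeline.get_by_name("mux")
for pad in mux.sinkpads:
    pad.add_probe(Gst.PadProbeType.BUFFER, log_pts, pad.get_name())

# The pipeline dumps of the frozen state were the usual dot graphs:
# set GST_DEBUG_DUMP_DOT_DIR and call Gst.debug_bin_to_dot_file(...).
```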
Now some information on how qtmux works: raw audio has no timestamps, so qtmux just assumes that it's progressing at the given sample rate. If there's a gap, the data that comes afterwards gets squeezed in and effectively ends up with earlier timestamps, messing up A/V sync in the process.
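A toy illustration of the effect (this is not qtmux's actual code, just the arithmetic): if the muxer derives the audio position purely from the number of samples written so far, every chunk lost in capture silently pulls all later audio earlier by the length of the gap.

```python
# Toy model: raw audio has no timestamps, so the muxer's notion of "now"
# is simply samples_written / rate.
RATE = 48000            # samples per second
CHUNK = 48000           # 1 s of audio per chunk in this toy example

samples_written = 0
for chunk in range(10):
    if chunk in (4, 5):         # two chunks dropped on the capture side
        continue                # the muxer never learns about the gap
    muxed_at = samples_written / RATE
    print(f"chunk {chunk}: captured at {chunk:.0f} s, muxed at {muxed_at:.0f} s")
    samples_written += CHUNK

# After the drop, all later audio sits 2 s earlier than the matching video,
# which is exactly the kind of offset that fills up a 600 ms queue.
```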
So, keeping in mind the above, we noticed that several chunks of audio and/or video were going missing, and that’s what was causing the queue to get full. But why would they go missing? (Friday night is here)
Examining the data as it came from the capture card, we noticed that the callback that fed us the audio/video data sometimes simply wasn't being called. It would get called at very regular intervals, until BOOM, a spike of maybe half a second or even a full second. What's worse, the drivers have no internal buffering: if you take too long to process the incoming data, you lose the next frame. We thought that the scheduler might be messing things up, but setting the capture thread to real-time priority didn't seem to have any effect either. The thread itself was doing very little: it just copied the data into a buffer and returned, so it wasn't spending more than 1 ms in there. No locking or anything else was happening. CPU usage was still OK and no queues were really full at that point, so what could be wrong? We had this wild idea: does it run out of buffers in the pool? YES! But why? (Monday night is here)
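The spikes themselves are easy to spot once you log the time between consecutive callbacks. Something along these lines (the frame rate and the callback signature are assumptions, not the actual driver API):

```python
import time

FRAME_DURATION = 1 / 25.0          # assumed 25 fps capture
_last = None

def on_capture_frame(video_data, audio_data):
    # Log whenever the gap since the previous callback is much longer
    # than one frame, i.e. whenever the driver skipped frames on us.
    global _last
    now = time.monotonic()
    if _last is not None:
        delta = now - _last
        if delta > 1.5 * FRAME_DURATION:
            lost = int(delta / FRAME_DURATION) - 1
            print(f"gap: {delta * 1000:.0f} ms since previous callback "
                  f"(~{lost} frame(s) lost)")
    _last = now
    # real handler: copy the data into a pool buffer and return quickly
```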
On Tuesday we examined the situation with the buffer pool, looking at pipeline graphs and processing log files. Turns out that the queue right before the file sink was getting full first. After that, the queues between the encoder and the muxer would also fill up, and then the queue before the encoder as well. Everything up to that point was zero-copy in my tests, so the back-pressure reached all the way to the capture element, and that's what caused the capture card drivers to run out of buffers in their pool. Eventually the situation downstream of the encoder would slowly stabilize, but by then it was too late, because we already had the gap that made everything explode. But why on Earth would the file sink refuse to write any further buffers? I tried several things, but nothing really seemed to help.
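Watching the back-pressure build up is just a matter of polling the queues' fill levels over time. A sketch, assuming the queue names from the pipeline sketch above:

```python
from gi.repository import GLib

def report_levels(pipeline):
    # Print how full each named queue is; run periodically while recording.
    for name in ("q_enc", "q_mux", "q_sink"):
        q = pipeline.get_by_name(name)
        if q is None:
            continue
        level_time = q.get_property("current-level-time") / Gst.MSECOND
        level_bufs = q.get_property("current-level-buffers")
        print(f"{name}: {level_bufs} buffers / {level_time:.0f} ms")
    return True  # keep the GLib timeout running

# GLib.timeout_add(100, report_levels, pipeline)  # sample every 100 ms
```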
Yesterday we also enlisted a colleague, who examined the system parameters in /proc and the like. He optimized a lot of things – for example, the raw write throughput measured with “dd” has pretty much doubled. However, even with all these changes, the situation became better, but the problem didn't go away.
Today (Thursday) I examined some graphs of how long each writev() call took versus how long it took to be called again. Turns out that writev() could take up to half a second! Then it would get called again almost immediately. After a quick search, I ran into this issue here: https://github.com/netty/netty/issues/5360 We had already tried increasing the buffering in the filesink earlier, but it had no real effect. Turns out we weren't using sensible values: with a low-compression-ratio codec, even the 4MB we tried is probably smaller than a single 4K video frame. So I gave it some thought and then set the filesink buffer size to 128MB (it won't actually write the data out to the file before that much has accumulated). AND IT WORKS!!!
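In GStreamer terms the change itself is tiny. A sketch, assuming the filesink is named "sink" as in the pipeline sketch above:

```python
# Give filesink a userspace buffer much larger than a single frame, so one
# slow writev() on the network mount no longer stalls the whole pipeline.
# 128 MB is the value from the post; the default is only 64 KB.
sink = pipeline.get_by_name("sink")
sink.set_property("buffer-size", 128 * 1024 * 1024)
```

The idea being that while one big writev() is blocked on the network drive, upstream keeps filling the userspace buffer instead of propagating back-pressure all the way to the capture card.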