Bug of the day: 2026-03-16

Or should I say, bug combo hit of the month? This is definitely one of the hardest issues I’ve encountered in my life, if not THE hardest.

So, we have an internal component called “videoengine”. The videoengine is using GStreamer with a non-negligible number of downstream-only patches. At some point, GStreamer 1.28 came out, and we decided to update it. It’s a procedure that takes half a day if nothing goes wrong, but maybe there’s a merge conflict or two, and in some rare cases, maybe there’s even an upstream regression that we have to solve first. Yes, that sounds ominous, I know.

So I solved more than my fair share of merge conflicts, in one of which I had to contact Jan so he could remember his original intention when he wrote that patch (and then he took some time to actually get it merged upstream). And then a couple of automated tests were failing, so I had to cherry-pick some fixes that landed in 1.28.1 a few days later. And just as I thought “whew, this looks ready now”, I noticed that two integration tests were failing. These integration tests were known to be flaky, so we sometimes didn’t pay attention to one or two failing, but this time, tests P3 and P4 were failing consistently, and that sounded fishy.

My first reaction was to try increasing the logging level, but that made all tests start failing once logging was configured to any useful level. Those who have worked with GStreamer know that this is not unusual, because logging slows down a latency-sensitive live workflow. So I changed my approach and tried to reproduce the bug locally. Test P3 in particular was relatively simple: start the videoengine, start the RTMP server, stream to it, start an RTMP source from the videoengine, and remux its output into RTMP and a couple more protocols. So I tried taking one of our pre-existing manual test scripts and adapting it. Nope, that seemed to work fine. I then tried tracing back the exact commands that were sent to the videoengine (which were fortunately logged) and replaying them. Nope, that also worked fine. So we were talking about a case of the Three Most Terrifying Words: Possible Race Condition.

We started brainstorming ideas on what the problem could be, based on the limited amount of information that we had. Except reality kept disproving all of our theories. Every time we thought “yeah, that could be it” or “this should solve it”, it turned out to be irrelevant. The only thing we knew was that GStreamer 1.26.10 was working fine, while 1.28.0 was broken. We could technically bisect, except there were too many commits in between, plus a small collection of local patches that would have to be rebased and would produce different levels of merge conflicts depending on the version. Also, the failure wasn’t reproducible locally, and the CI build runs in two steps (first compile GStreamer, then the videoengine) and takes almost one hour including the integration tests.

What we did have, though, was a pipeline graph. I noticed a failure early on in the pipeline: after the source and demuxer, the H264 parser wasn’t outputting any caps. However, this happened early enough that it could conceivably be that the parser just hadn’t seen a keyframe yet, and it was followed by the test immediately failing. Hmm, could it be that it just needed some more time? We added one 3-second delay to P3 between creating the source and linking it, and another 3-second delay between enabling the sinks and checking them. No luck. Still the same failure near the source. Hmm, that’s fishy.

But, given that the failure was near the source, there was just a small handful of elements involved: RTMP source, FLV demuxer, H264 parser. And those had no local patches that could be critical to this specific pipeline’s success or failure. So I started reverting them one by one to their 1.26 state and checking the (slow) CI result. I found the RTMP source to be irrelevant to the bug, but both the demuxer and the parser had to be reverted to their 1.26 state for P3 to pass. So that’s possibly a combination of factors? To make things worse, P4 still failed, but then I thought it was probably the 3-second delay and that I should tackle problems one by one.

(Hey, at least these plugins could be independently reverted to their 1.26 state and didn’t require any corresponding changes to GStreamer core or some base class!)

Having nothing else left to do, I decided to play git bisect on the CI. Each step still had to take place in two parts and took almost one hour, but at least this time there were no relevant local patches to rebase, and the number of bisect steps in each case was much more manageable. Painstaking, but the only way forward.

For the parser, I found the culprit to be this commit by Seungha, which made the element use the VUI framerate when the upstream framerate is 0/1. That initially sounded a bit weird, because upstream didn’t report 0/1 in our case. However, I noticed there was a follow-up fix that basically told the parser base class to preserve the upstream buffer duration where possible. I restored the h264parse element to its 1.28 state and cherry-picked that commit to baseparse, and the test passed. Okay, it was still only P3, and only with the FLV plugin at its 1.26 state, but still. After several weeks, there was finally some real progress on this bug.

Let’s focus on P4 for a bit, though. My colleague Jan heard me saying it would probably pass if we added another 3-second delay in there. Then I noticed a commit where he removed said delay from P3. Result: P3 still passed, P4 still failed. Frustrating. He also noticed an End-Of-Stream event in the logs. Okay, if we’re getting EOS, the pipeline can’t really recover, but WHY would it go EOS so early on? I tried adding some logging, but of course that made P3, P4, and something like three more tests fail completely. Sigh.

So I decided to focus on the FLV plugin. By that point, my brain was fried after two days of moving commit IDs and test results back and forth, and of watching a slow CI like a hawk to scrape off a couple of minutes here and there, thinking they’d add up to real numbers when done several times over. So I accidentally marked a good commit as bad and ended up losing about two hours. Quite an appropriate mistake to make on, literally, Saint Patience’s day.

Eventually, though, I got there, and found this commit by Tarun to be the culprit. It was the one that added support for multiple audio/video streams in the FLV demuxer, and unfortunately it was a relatively large one. However, it shouldn’t have made any difference in behaviour when there’s only one audio and one video stream, just like before…

… Or should it?

So I realised that the no-more-pads behaviour was different. Before, the demuxer used to emit no-more-pads as soon as it had found one audio and one video stream, or after a 6-second timeout. Now, however, the number of streams is not known in advance; streams might appear or disappear at any moment. In the case of e.g. MPEG-TS, there’s at least a header telling us the number of streams currently available. FLV has no such header. So the approach taken was to emit no-more-pads only after a 6-second timeout. That’s important, because we use no-more-pads to unblock the source. I dug into the integration tests and found a piece of code that gave the source 5 seconds to start up and timed out otherwise. Okay, that would explain A LOT. I restored the FLV plugin to 1.28, changed that hard-coded 6 to a 2, and, what do you know, P3 now passed! I made an upstream merge request to make that hard-coded timeout configurable, which at the time of writing still has a couple more minor review comments pending.

What about P4 though? I had a small idea. I looked at the test code, and realised that Jan had only removed the first of the two 3-second delays. The second one, between enabling the sinks and checking them, was still there. So I added a matching one to P4 and it passed! Turns out, that EOS was a total red herring. It wasn’t the cause of our failure, but the result of it, caused by the test saying “okay, test failed, start teardown”.

However, it’s poor code hygiene to rely on hard-coded delays. Analysing the test a bit more, I found that the timeout logic was too tight and didn’t give the pipeline enough time to start up, which might explain the general flakiness of the integration tests that I mentioned at the beginning. Without getting too much into the internal logic of the videoengine, the correct course of action here would be to wait for… you guessed it… no-more-pads. (Or, even better, to use the not-new-anymore Streams API.)

Following that, I had to clean up my git commit history in the two repositories, because every single step of every single bisect was in there. It looked like a bloody murder scene: extremely untidy, but everything had to be preserved for forensic reasons. Now that the culprits had been caught, there was a lot of cleaning to do.

The difficulty of this issue wasn’t so much the typical “that bug was hard to find”. It was more the fact that there were three different issues making the same two tests fail, that I couldn’t reproduce them locally, and that I had to rely on the CI.

Bug of the day: 2024-09-26

We had a system whose setup was relatively simple: Take a number of live sources, demux them, put them through audiomixer / compositor to combine together their audio/video streams, and mux the result back into another live sink.

This was initially working fine, until some code change broke it. The symptoms were that, after enough time had passed, the memory usage would start increasing. Soon afterwards, the sources would start complaining that their buffers weren’t being consumed fast enough downstream.

A first investigation showed a lot of audio buffers gathered right before the muxer. And I mean A LOT of audio buffers. Gigabytes of them. Raw audio, but still, it’s not easy to gather gigabytes of audio! I had foolishly left an infinite queue at that spot, thinking “it’s just live sources anyway, and with just audio, what could go wrong”. Famous last words, right?

I tried gathering logs around the audio path. It took a while, because, as always, enabling logs made the bug seemingly disappear… until I finally reproduced it, after struggling for many hours. That wasn’t helpful either; everything seemed to be running normally. At that point I remembered that one of the changes we had made was to add the audiomixer into the audio path, where there was an input-selector previously – we had decided to mix the audio streams together instead of just selecting one single input. So I thought about reverting that to get it out of the way, thinking that would surely fix the issue. Alas, it didn’t. The issue persisted, even with the input-selector. How curious.

At that point I decided to start bisecting. That led me to a surprising result: The offending commit was one that enabled alpha compositing. It would force the compositor to output AYUV, making it convert all its sources. So… the problem was in the video path, not the audio path? That would at least explain why nothing came up in my logs earlier! Reverting the offending commit made the issue indeed disappear, but alpha compositing was also a feature that we needed, so I couldn’t just leave it at this, I had to get to the bottom of the issue.

After a little thought, I realised: What if the compositor (configured to a decent n-threads, mind you) still couldn’t keep up with the video conversion? That would mean it’s outputting buffers slower than real time. The audio path is real time, which means audio buffers would slowly start piling up before the muxer. At the same time, video buffers couldn’t be consumed as fast as they are produced, causing the sources to complain about this fact. Everything was checking out. With the detailed logs I had enabled earlier, I had essentially slowed down the audio path and the sources as well, so it was accidentally working again. Slower than real time, but working.

But why would the compositor not keep up with the video conversion? It does process all pads in parallel if you set n-threads to a sensible value. But there’s a caveat: the conversion on each individual sink pad was still single-threaded! As it turns out, you have to create a VideoConverterConfig, set its thread count to a sensible number, and then tell each compositor sink pad to use it. That solved our bug.

Another solution was to use dedicated videoconvert elements before the compositor, with decoupling queues in between, to make sure they’d run in their own threads instead of the same thread as the corresponding compositor sink pads. We ended up doing both at once. The system then ran stably for hours.

Bug of the day: 2024-02-02

I had started chasing this bug already in December. A coworker of mine had reported that, with a specific input file, using a tricky maneuver, that also required a lot of other moving parts that interacted with our code, and that also involved at some point deleting and re-adding all elements from the pipeline (!), the file would stall after showing only a couple of frames.

My first thought was to try and reproduce it locally, without all the moving parts. I tried repeatedly, but failed. My colleague Jan also tried repeatedly, but failed. No matter what we did, it was all working fine. We also asked for log files, but they didn’t show any issues either. I was really stuck for a long time, because I had no idea how to chase that bug.

Eventually, Jan noticed that the videorate element was trying to bridge a huge gap: It had received only one input frame, but had duplicated several frames. However, our logs did not indicate such a gap. The videorate element is what converts the frame rate of the video between different values, and also what fixes up the (non-live) stream in case a buffer has gone missing or appears twice.

The next step was to ask for an additional log file with videorate debug information. Fortunately, my other coworker could still reproduce the issue with all the moving parts. And there I saw it:

BEGINNING prev buf 1000:04:41.104589703 new buf 1000:04:41.137956369 outgoing ts 1000:04:41.104589703
diff with prev 1000:04:41.104589703 diff with new 1000:04:41.137956369 outgoing ts 0:00:00.000000000

There it was. videorate was calculating the differences wrongly!

My first thought was to look at existing related bug reports and merge requests. I found something that looked suspiciously close to our issue:

https://gitlab.freedesktop.org/gstreamer/gstreamer/-/merge_requests/2186

If my theory was correct, it would mean that reverting https://gitlab.freedesktop.org/gstreamer/gst-plugins-base/-/merge_requests/767 would fix our issue. Except it wasn’t possible to revert, because the code had been repeatedly touched since.

I then started looking at the calculations. Something was wrong with the segment-related parts, related to how the segment base was used. I thought I had found it, made a patch that fixed this specific issue, put it up on the CI, and it ended up breaking other things.

For clarity: Imagine that you are playing back a file, playing it faster or slower, seeking back and forth, etc. The time displayed at your player’s clock corresponds to the timestamps inside the file. However, the parts that know when to display each frame, or when to play each sound, have a different time, according to how you manipulate the playback. For simplification, let’s say that the time displayed at your player’s progress bar is the buffer timestamp, and the time when the frames/sounds are displayed is the running time. So, when you move the slider back and forth in your player application, the buffer timestamp will move back and forth correspondingly, but the running time will keep increasing. In order to convert from one to the other, you need information from what we call a segment.

Seeing as my first patch didn’t work, I thought I’d convert all calculations to running time. Some values used were clearly buffer timestamps, but some others were something strange that was neither a buffer timestamp, nor a running time. So I would store running times for reference, and would convert back and forth when necessary. I had that patch almost ready. Almost. It fixed our initial test case and made everything but one integration test pass. It was almost ready to be merged! I was only thinking about how to handle the corner cases where the calculations end up with a negative timestamp.

And then came that fatal evening. I was visiting the office in Cologne, and took the opportunity to have some nice authentic ramen in Düsseldorf. Jan joined us. So, as Jan and I were walking on the street on the way to the ramen, we started discussing that corner case. Then he told me “Why do we even use running time, anyway? We don’t need it, do we?”.

I didn’t want to believe it at first. I went back to work on the next day, looked at the running times I was using… and I was only using them in order to convert to buffer timestamps and back. So it’s indeed not necessary. We really only need buffer timestamps.

I then looked at the part that I should have looked at first. Why do we even need the segment base in those calculations?

guint64 base;
/* the running time (plus elapsed time, see offset) of the segment start */

So the segment base is only needed if we’re converting to running time. Which we don’t have to do. The calculations were just adding and subtracting it back and forth, but doing so wrongly, and that led to our bug. It makes you wonder how it ever worked. Well, there were several reports of videorate spitting out negative timestamps, for instance.

I then started removing all uses of the segment base. And fixing a couple of other bugs along the way. And, in the end, not only did all tests on the CI pass, but that even made the tests Sebastian added in his merge request pass (with a little modification on one test), so I integrated them into my code.

As for why the segment base was used? I looked at git blame, and apparently it dates all the way back to when the element was ported from 0.10 using the old GstSegment API, more than 10 years ago. I assume that it once worked differently, and back then it made sense to do it that way. At some point it stopped making sense, but nobody noticed until now!

Here is the final merge request:

https://gitlab.freedesktop.org/gstreamer/gstreamer/-/merge_requests/6031

Bug of the day: 2022-02-28

There is this deinterlacing algorithm called yadif. It is of very high quality, but also very CPU-expensive. For this reason, it also has some ASM optimisations. GStreamer had support for yadif but not for the ASM optimisations, so I had previously adapted them from FFmpeg, which had both. Now, two years later, I had found a small bug: in some video, which involved some static graphics overlaid upon fast-moving content, the lines were jumping up and down (“bobbing”). However, that happened only in GStreamer’s ASM optimisations. Not in the plain C code. Not in FFmpeg’s ASM optimisations.

Having to look over some ASM code that you wrote two years ago already requires a significant amount of bravery. To make things easier, though, I had an equivalent C implementation, a reference ASM code that worked fine, and a lot of comments in my own code. Or at least I thought that these things would make it easier.

So I started looking at the implementation, remembering what I had done, checking if it does what the comments say it does, checking if the end result corresponds to the C implementation, and also comparing it with FFmpeg’s code. I started double-checking and triple-checking everything. It was all correct.

All my differences were in entry points. FFmpeg has fewer input parameters and then calculates some intermediate values, such as the value of the same pixel on the previous line, or on the previous frame. However, GStreamer calculates those in the deinterlacer base class, so I had more input parameters. I thought that maybe one of those was used improperly, but they were all correct. So I was really at a loss.

What do we do when we have no idea what’s wrong with the code? Change the functionality of random parts and see what breaks. By doing this, I slowly figured out that the value of some variable was too high in some special case (the diff parameter when mode is 0). However, there are many steps involved in that calculation, and nothing makes a lot of sense until the very end. Just to make sure, I quadruple-checked that part of the code. Nope, correct. I thought, maybe I’m accidentally messing with a register that’s needed later. Nope, I’m not.

At this point I decided to take a step back and look at the more inconspicuous parts of the code. While looking around, I noticed this macro:

%macro LOAD 2
    movh      %1, %2
    punpcklbw %1, m7
%endmacro

This loads some value into a register and then interleaves it with zeroes. We are adding pixel values, so we need to make sure that the carry doesn’t accidentally spill over into neighbouring pixels. This makes the assumption that m7 is zero. Indeed, I remember this being set to zero early on. But let’s make sure…

LOAD         m7, [bzeroq]

Ah-HA! The end effect would be that the value of m7 is interleaved with itself instead of with zeroes. That was indeed only in the mode==0 special case, and directly influenced the result of the diff parameter.

bzeroq is one of those entry-point parameters, that FFmpeg calculates in the ASM code but GStreamer gets as input. FFmpeg calculates that value earlier on, uses it once, then puts it into the stack for later. I had decided, no need to go via the stack when I can load it directly. Turns out… I can’t.

Going via the stack, like FFmpeg does, solved my bug.

https://gitlab.freedesktop.org/gstreamer/gstreamer/-/merge_requests/1816/diffs?commit_id=fbeecb9e5567b5822e93ea50fd28f820cf7bbdaf

Bug of the day: 2021-07-05

I was updating the code to the latest version of the GStreamer Rust bindings. I ended up touching A LOT of parts in almost every file, so the diff was huge. The biggest culprit was a change in the timestamps API, and we do a lot of things with timestamps, so everything related needed to be updated.

After finally getting everything to compile, I tried running the automated tests to see what went wrong there. All tests passed, apart from one.

This was a bit tricky. It’s not like half of them are failing, which would mean that I missed something fundamental. It’s also not like they’re all passing, which would mean that everything is fine. It’s just one test, and it’s timing out. Timeout means “I had to do a series of many tasks in the background and they’re still not done”, so it’s not exactly hinting at where exactly the issue is.

Fortunately we have an auto-generated API schema file, which describes all commands, with their parameters, types, and default values. I had a diff in that file and had initially not paid attention to it. So I looked at the diff and saw the problem.

I had missed one single M.

So, instead of having the code buffer for 125 milliseconds by default, it was buffering for 125 seconds.

That’s the kind of bug that the compiler wouldn’t possibly catch. I mean, “buffer for 125 seconds”, looks legitimate at a first glance, doesn’t it?

Bug of the day: 2019-07-25

This was actually Sebastian’s bug. He was having a crash caused by an invalid timecode. Now, timecodes are just hours:minutes:seconds:frames labels for each video frame. His code was ending up with a timecode of something like 45:87:84:31. Yes, that’s 87 minutes and 84 seconds. Also 31 frames at 30 fps.

He wondered where such a very invalid timecode might come from, then he noticed he had LTC input accidentally turned on. LTC gets an audio signal as an input and converts that signal into a timecode. It was turned on accidentally, no microphone connected, so he was picking up the music he was playing as “Monitor of sound card”.

He tried reproducing it but failed. Then I looked at him and suggested that he try the previous song again… And, kaboom! That particular song had the ability to generate crazy timecodes.

The fix is here: https://gitlab.freedesktop.org/gstreamer/gst-plugins-bad/commit/aafda1c76f4089505e16b6128f8b80ab316ab2f0

Bug of the day: 2018-07-12

I was alerted that there was a bug with the new build: Audio/video data didn’t seem to be flowing into the pipeline. Okay, no big deal, let’s enable logging and see where the data goes, right?… Except that enabling logging made the bug go away. The default debug level is *:3, which means “3 everywhere”. If I enabled “*:3,GST_PADS:6” (which means 6 in GST_PADS, 3 everywhere else), the bug disappeared. Same with “*:3,GST_EVENT:6”, which is really a lightweight logging change in our pipeline. Hmm, a tricky race condition. Then I made a typo and enabled something equivalent to “*:3,non_existent_category:6” for logging – which means “have a debug level of 3 everywhere, except in that category which doesn’t exist anyway, where it will be 6”. The bug was not present. We had pretty much the definition of a Heisenbug, where if we so much as sneezed next to the logs, the bug disappeared. How on Earth is one supposed to debug this?

The next thing to try would maybe have been to roll back recent changes and see if they made a difference. Except there were no recent changes in the code. The only thing changed was something in the build system. So, depending on how the code was built, it would or wouldn’t demonstrate the bug. But the bug would show up only without any attempt at enabling any logging.

After some time spent barking up wrong trees, waiting for builds to finish, and trying to figure out the way to reduce the build waiting time, I realised that the problem was essentially the following: The driver was delivering buffers into the GStreamer element, but those buffers were apparently failing to arrive a bit further downstream. A backtrace showed me that the buffers were also not being queued up somewhere. So, if they are produced at the source, not arriving where they were supposed to arrive, and also not piling up anywhere, something in the middle must be dropping them.

I searched the code for pad probes that could be dropping the buffers, but there were none. I also looked at the elements between the source and the one that was missing the buffers, and they were all harmless. Some closer examination showed me that the buffers were even failing to exit the source element. So, something inside the source element must be secretly swallowing up the buffers.

Now, the GStreamer Decklink source element has a property that says “drop this much time when you first start up the element”. A closer look inside the logs revealed that this was indeed the culprit: it was waiting for the first (approximately) 40 hours since start-up until it would start outputting buffers. I looked inside the code that sets this property: it was set to one second. Hard-coded. As in, 1000000000 nsec. I would set that 64-bit property to literally 1000000000, and it would receive a huge number on the other side.

The key word there was “literally”. I tried casting the literal 1000000000 to a 64-bit integer and it worked!

The argument was passed to a vararg function (variable number of arguments, NULL-terminated). Vararg arguments aren’t automatically promoted to 64-bit when the callee expects it. The result was that the machine took 1000000000 as the low 32-bit half, and the high 32-bit half was completed by whatever garbage was in the register at the time. And that’s how you accidentally convert 1 second into 40 hours!

Bug of the day: 2018-04-11

I recently noticed that we had a very big memory leak. My first reaction was to run valgrind. However, valgrind was only showing me some loose change here and there, nothing serious. Just to be sure, I fixed the small things it found, but the memory usage kept increasing. What’s worse: memory usage was pretty much constant when running things under valgrind, but was skyrocketing when running things normally. Pretty much the definition of a Heisenbug.

Some time later, I noticed that the memory usage was also just fine sometimes when running stuff normally. Then I noticed that, in those cases, it would fail to create an output file, but wouldn’t bother informing you about it. That’s pretty much the definition of a Mad Girlfriend Bug.

I investigated the Mad Girlfriend Bug a bit more, and found that some element would sometimes fail to transition to PLAYING fast enough, causing it to return flushing, and therefore preventing all media data from running through the pipeline. It just happened that valgrind changed the timing so that it was always reproducible. That also explains why the memory leak wasn’t appearing on valgrind:

Roll Safe Think meme: You can't leak media data if it's just not flowing in the pipeline

In the meantime, I took a look at git diff and found what was causing the memory leak. I was using an appsink/appsrc combination. If you pull a buffer from the appsink and push it into the appsrc, it gets consumed by the appsrc. However, if you pull a sample from the appsink and push it into the appsrc, you need to unref it later. I ended up having to make it obvious in the documentation:

https://bugzilla.gnome.org/show_bug.cgi?id=795150

Bug of the day: 2017-10-13

I made some changes near the source of the pipeline, essentially processing some metadata. The result: RTSP preview worked, but the files written were 0-sized. Well, that sounds like cleaning up your kitchen and causing your bathroom sink to start leaking, except the bathroom is at the other side of the apartment, and your shower is working just fine.

I traced it back to a few lines of code that were supposed to be a small improvement, but harmless at worst. They were about buffer meta, which is just what it says on the box: metadata that goes together with the buffer. Timecodes in GStreamer are such metadata. Another metadata is “reference timestamp”: they have some caps, which just act as a label, and a timestamp. Our buffers used to have only one reference timestamp in the past: stream time. Recently we added also hardware time. (Don’t try to understand those, they’re just labels for what it’s worth in this context). When getting the reference timestamp for a buffer, I used to just leave the caps (the label) NULL, meaning “give me whatever is first”. But now we also had hardware time, so I had replaced the NULL with a proper caps description. NULL worked, I could create files. With the proper caps description, I couldn’t. That sounds like going to a bar, getting a beer when you ask for a drink, but getting nothing when you ask for a beer. Makes no sense, does it?

I started tracing back the buffers to see where the problem started. The muxer wasn’t outputting any data. It received video, but it didn’t receive audio. Hmm, suspicious. Kind of makes sense though, the RTSP preview is video-only. We were getting notifications about audio levels though, so something must have gotten lost along the way. Weird. I tried seeing where the audio buffers get lost, and it turned out to be at the exact spot where the aforementioned lines of code were. Okay, now it’s at least really starting to make sense. That part was synchronising audio with video though. It wasn’t supposed to eat bread with buffers instead of bread with butter. So what’s wrong here?

A small investigation showed that, while audio buffers had both stream and hardware reference timestamps, video buffers had only hardware reference timestamps at that point. I tried passing NULL to see what would happen, and the log file happily reported “I found the first reference timestamp for both audio and video. It’s a hardware timestamp in both cases!”… Well, okay, at least synchronisation worked in that case, but it was a pure coincidence. I then looked at the code that added the reference timestamps, and, surprise of all surprises, it was identical for both audio and video. I even added an assertion right afterwards to make sure that they both existed for video. They did. But my video buffers genuinely had no stream reference timestamps when I was trying to get that information. So where does the meta go?

I took a look at my pipeline, to see if there were any elements that would delete the meta. There weren’t. The only element that did anything with meta was timecodestamper: it was configured to remove any existing timecode meta and insert its own. It looked a bit far-fetched to imagine that it would remove the reference timestamp meta as well, but everything else made even less sense, so I gave that a try. There was a function that checked all metas and removed the timecode one. I checked that one, but nope, they were both still intact after its call. What I did find, though, made even less sense. The function to add a timecode meta was removing one of the two other metas!

God enlightened me to talk to Sebastian at that point. He said “oh, Mathieu fixed that one a few days ago”:
https://bugzilla.gnome.org/show_bug.cgi?id=788617

It was essentially a bug in a homegrown linked list implementation, and it took him a lot of head scratching and hair pulling to figure out. It manifested in a completely different place, too. That bug had existed since something like 2012 and nobody had ever run into it. Until Mathieu did, a few days ago. And now I did too. Think about the odds of that.

Bug of the day: 2017-03-30

We had a pipeline that would capture 4K video, encode it using a high quality but low compression ratio codec, mux it into a .mov container together with raw audio, and write it to a file on a networked drive. The drive is fast enough to easily make use of the 10 Gb Ethernet card.

One 4K session was running fine. Two 4K sessions were freezing. CPU and memory usage were fine – nothing that seemed to cause things to lock up. I tried taking pipeline dumps during the freeze: the queue right before the encoder was full. It was 600 ms, and qtmux shouldn’t need more than that for interleaving, so why was the queue getting full? So I examined the timestamps of all incoming buffers. Turns out that audio and video were off by more than 600 ms … but WHY? (Thursday night is here)

Now some information on how qtmux works: as far as the muxer is concerned, raw audio has no timestamps. It just assumes that the samples are progressing at the given rate. If there’s a gap, the data that comes afterwards gets squeezed in and effectively assigned earlier timestamps, messing up A/V sync in the process.

So, keeping in mind the above, we noticed that several chunks of audio and/or video were going missing, and that’s what was causing the queue to get full. But why would they go missing? (Friday night is here)

Examining the data as it came from the capture card, we noticed that the callback that fed us the audio/video data was simply not being called in time. It would get called at very regular intervals, until BOOM, a spike of maybe half a second or even a full second. What’s worse, the drivers have no internal buffering: if you take too long to process the incoming data, you lose the next frame. We thought the scheduler might be messing things up, but setting the capture thread to real-time priority didn’t seem to have any effect either. The thread itself wasn’t doing much – it just put the data into a buffer and left, so it wasn’t spending more than 1 ms in there. No locking or anything else was happening. CPU usage was still OK, no queues were really full at that point, so what could be wrong? Then we had a wild idea: does it run out of buffers in the pool? YES! But why? (Monday night is here)

On Tuesday we examined the buffer pool situation, looking at pipeline graphs and processing log files. Turns out that the queue right before the file sink was getting full first. After that, the queues between encoder and muxer would also fill up, and then the queue before the encoder. Everything up to that point was zero-copy in my tests, so that’s what caused the capture card drivers to run out of buffers in their pool. Eventually the situation downstream of the encoder would slowly stabilize, but by then it was too late: we already had the gap that made everything explode. But why on Earth would the file sink refuse to write any further buffers? I tried several things, but nothing really seemed to help.

Yesterday we also enlisted a colleague, who examined the system parameters in /proc etc. He optimized a lot of things – the throughput with “dd”, for example, has pretty much doubled now. But even with all these changes the situation only got better; the problem didn’t go away.

Today (Thursday) I examined some graphs of how long the writev() call took versus how long it took for it to be called again. Turns out that writev() could take up to half a second! Then it would get called again almost immediately. After a quick search, I ran into this issue: https://github.com/netty/netty/issues/5360 We had already tried increasing the buffering in the filesink earlier, but it had no real effect. Turns out we weren’t using sensible values: even the 4 MB we tried is maybe smaller than one video frame. So I gave it some thought and set the filesink buffer size to 128 MB (it won’t actually write data out to the file before the buffer fills up). AND IT WORKS!!!