Bug of the day: 2024-09-26

We had a system whose setup was relatively simple: Take a number of live sources, demux them, put them through audiomixer / compositor to combine together their audio/video streams, and mux the result back into another live sink.

This was initially working fine, until some code change broke it. The symptoms were that, after enough time had passed, the memory usage would start increasing. Soon afterwards, the sources would start complaining that their buffers weren’t being consumed fast enough downstream.

A first investigation showed a lot of audio buffers gathered right before the muxer. And I mean A LOT of audio buffers. Gigabytes of them. Raw audio, but still, it’s not easy to gather gigabytes of audio! I had foolishly left an infinite queue at that spot, thinking “it’s just live sources anyway, and with just audio, what could go wrong”. Famous last words, right?

I tried gathering logs around the audio path. It took a while because, as always, enabling logs made the bug seemingly disappear… until I finally reproduced it after struggling for many hours. That wasn’t helpful either; everything seemed to be running normally. At that point I remembered that one of the changes we had made was to put an audiomixer into the audio path where an input-selector used to be – we had decided to mix the audio streams together instead of just selecting a single input. So I reverted that to get it out of the way, thinking it would surely fix the issue. Alas, it didn’t. The issue persisted even with the input-selector. How curious.

At that point I decided to start bisecting. That led me to a surprising result: the offending commit was one that enabled alpha compositing. It forced the compositor to output AYUV, making it convert all of its sources. So… the problem was in the video path, not the audio path? That would at least explain why nothing came up in my logs earlier! Reverting the offending commit did indeed make the issue disappear, but alpha compositing was a feature we needed, so I couldn’t just leave it at that – I had to get to the bottom of the issue.

After a little thought, I realised: what if the compositor (configured with a decent n-threads, mind you) still couldn’t keep up with the video conversion? That would mean it was outputting buffers slower than real time. The audio path is real time, which means audio buffers would slowly start piling up before the muxer. At the same time, video buffers couldn’t be consumed as fast as they were produced, causing the sources to complain about exactly that. Everything was checking out. With the detailed logs I had enabled earlier, I had essentially slowed down the audio path and the sources as well, so it was accidentally working again. Slower than real time, but working.

But why would the compositor not keep up with the video conversion? It does process all of its pads in parallel if you set n-threads to a sensible value. But there’s a caveat: the conversion on each sink pad was still single-threaded! As it turns out, you have to create a VideoConverterConfig, set its thread count to a sensible number, and then tell each compositor sink pad to use it. That solved our bug.
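Our own code uses the Rust bindings, but through the C API the fix looks roughly like this – a sketch, where the structure name, pad name and thread count are merely illustrative:

#include <gst/gst.h>
#include <gst/video/video.h>

static void
enable_threaded_conversion (GstElement * compositor)
{
  /* Allow the per-pad video converter to use multiple threads. */
  GstStructure *config = gst_structure_new ("GstVideoConverter",
      GST_VIDEO_CONVERTER_OPT_THREADS, G_TYPE_UINT, 4, NULL);

  /* Every sink pad converts its own input, so every pad needs the config;
   * here we assume the pad "sink_0" has already been requested. */
  GstPad *sinkpad = gst_element_get_static_pad (compositor, "sink_0");
  g_object_set (sinkpad, "converter-config", config, NULL);

  gst_object_unref (sinkpad);
  gst_structure_free (config);
}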

Another solution was to use dedicated videoconvert elements before the compositor, with decoupling queues in between, to make sure the conversion runs in its own thread instead of in the same thread as the corresponding compositor sink pad. We ended up doing both at once. The system then ran stably for hours.
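For completeness, the decoupled-videoconvert variant looks roughly like this (again a sketch; element creation checks and the surrounding pipeline are omitted):

#include <gst/gst.h>

static void
add_decoupled_convert (GstElement * pipeline, GstElement * video_src,
    GstElement * compositor)
{
  GstElement *convert = gst_element_factory_make ("videoconvert", NULL);
  GstElement *queue = gst_element_factory_make ("queue", NULL);

  gst_bin_add_many (GST_BIN (pipeline), convert, queue, NULL);

  /* The queue decouples the conversion from the compositor, so it runs in
   * its own streaming thread; linking to the compositor requests a fresh
   * sink pad for this input. */
  gst_element_link_many (video_src, convert, queue, compositor, NULL);
}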

Bug of the day: 2024-02-02

I had started chasing this bug already in December. A coworker of mine had reported that, with a specific input file, using a tricky manoeuvre that also required a lot of other moving parts interacting with our code, and that at some point involved deleting and re-adding all elements of the pipeline (!), playback would stall after showing only a couple of frames.

My first thought was to try and reproduce it locally, without all the moving parts. I tried repeatedly, but failed. My colleague Jan also tried repeatedly, but failed. No matter what we did, it was all working fine. We also asked for log files, but they didn’t show any issues either. I was really stuck for a long time, because I had no idea how to chase that bug.

Eventually, Jan noticed that the videorate element was trying to bridge a huge gap: it had received only one input frame, yet had duplicated it into several output frames. However, our logs did not indicate any such gap. The videorate element is what converts a video stream between different frame rates, and also what fixes up a (non-live) stream when a buffer has gone missing or appears twice.

The next step was to ask for an additional log file with videorate debug information. Fortunately, my other coworker could still reproduce the issue with all the moving parts in place. And there I saw it:

BEGINNING prev buf 1000:04:41.104589703 new buf 1000:04:41.137956369 outgoing ts 1000:04:41.104589703
diff with prev 1000:04:41.104589703 diff with new 1000:04:41.137956369 outgoing ts 0:00:00.000000000

There it was: the reported differences were just the full timestamps themselves – videorate was calculating the differences wrongly!

My first thought was to look at existing related bug reports and merge requests. I found something that looked suspiciously close to our issue:

https://gitlab.freedesktop.org/gstreamer/gstreamer/-/merge_requests/2186

If my theory was correct, it would mean that reverting https://gitlab.freedesktop.org/gstreamer/gst-plugins-base/-/merge_requests/767 would fix our issue. Except it wasn’t possible to revert it cleanly, because the code had been touched repeatedly since then.

I then started looking at the calculations. Something was wrong with the segment-related parts, specifically in how the segment base was used. I thought I had found it, made a patch that fixed this specific issue, put it up on CI… and it ended up breaking other things.

For clarity: Imagine that you are playing back a file, playing it faster or slower, seeking back and forth, etc. The time displayed at your player’s clock corresponds to the timestamps inside the file. However, the parts that know when to display each frame, or when to play each sound, have a different time, according to how you manipulate the playback. For simplification, let’s say that the time displayed at your player’s progress bar is the buffer timestamp, and the time when the frames/sounds are displayed is the running time. So, when you move the slider back and forth in your player application, the buffer timestamp will move back and forth correspondingly, but the running time will keep increasing. In order to convert from one to the other, you need information from what we call a segment.
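A minimal sketch of that conversion through the C API, given a pad’s current segment and a buffer’s timestamp:

#include <gst/gst.h>

static void
convert_times (const GstSegment * segment, GstClockTime buffer_pts)
{
  /* "Where in the stream is this buffer?" -> "when does it get presented?" */
  GstClockTime running_time =
      gst_segment_to_running_time (segment, GST_FORMAT_TIME, buffer_pts);

  /* ...and back again, from presentation time to stream position. For
   * timestamps inside the segment this round-trips to buffer_pts. */
  GstClockTime position =
      gst_segment_position_from_running_time (segment, GST_FORMAT_TIME,
      running_time);

  (void) position;
}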

Seeing as my first patch didn’t work, I thought I’d convert all calculations to running time. Some values used were clearly buffer timestamps, but some others were something strange that was neither a buffer timestamp, nor a running time. So I would store running times for reference, and would convert back and forth when necessary. I had that patch almost ready. Almost. It fixed our initial test case and made everything but one integration test pass. It was almost ready to be merged! I was only thinking about how to handle the corner cases where the calculations end up with a negative timestamp.

And then came that fateful evening. I was visiting the office in Cologne, and took the opportunity to have some nice authentic ramen in Düsseldorf. Jan joined us. So, as Jan and I were walking down the street on our way to the ramen place, we started discussing that corner case. Then he asked me: “Why do we even use running time, anyway? We don’t need it, do we?”

I didn’t want to believe it at first. I went back to work the next day, looked at the running times I was using… and I was only using them to convert to buffer timestamps and back. So running time was indeed not necessary. We really only need buffer timestamps.

I then looked at the part that I should have looked at first: why do we even need the segment base in those calculations? The GstSegment documentation describes the base field as:

guint64 base;    /* the running time (plus elapsed time, see offset)
                    of the segment start */
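In other words, simplifying away rate, offset and clipping:

    running_time     = buffer_timestamp - segment.start + segment.base
    buffer_timestamp = running_time     - segment.base  + segment.start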

So the segment base is only needed if we’re converting to running time – which we don’t have to do. The calculations were just adding and subtracting it back and forth, but doing so wrongly, and that led to our bug. It makes you wonder how it ever worked – then again, there had been several reports of videorate spitting out negative timestamps, for instance.

I then started removing all uses of the segment base, fixing a couple of other bugs along the way. In the end, not only did all tests on the CI pass, but the change even made the tests Sebastian had added in his merge request pass (with a small modification to one of them), so I integrated them into my code.

As for why the segment base was used in the first place? I looked at git blame, and apparently it has been there ever since the element was ported from 0.10 and the old GstSegment API, more than 10 years ago. I assume that the element once worked differently, and back then it made sense to do it that way. At some point it stopped making sense, but nobody had noticed until now!

Here is the final merge request:

https://gitlab.freedesktop.org/gstreamer/gstreamer/-/merge_requests/6031

Bug of the day: 2022-02-28

There is this deinterlacing algorithm called yadif. It is quite fast for what it does, but still very CPU-expensive. For this reason, it also has some ASM optimisations. GStreamer had support for yadif but not for the ASM optimisations, so I had previously adapted them from FFmpeg, which had both. Now, two years later, I had found a small bug: in some video that involved static graphics overlaid on fast-moving content, the lines were jumping up and down (“bobbing”). However, that happened only with GStreamer’s ASM optimisations. Not with the plain C code. Not with FFmpeg’s ASM optimisations.

Having to look over some ASM code that you wrote two years ago already requires a significant amount of bravery. To make things easier, though, I had an equivalent C implementation, a reference ASM code that worked fine, and a lot of comments in my own code. Or at least I thought that these things would make it easier.

So I started looking at the implementation, remembering what I had done, checking if it does what the comments say it does, checking if the end result corresponds to the C implementation, and also comparing it with FFmpeg’s code. I started double-checking and triple-checking everything. It was all correct.

All my differences from FFmpeg were at the entry points. FFmpeg has fewer input parameters and calculates some intermediate values itself, such as the value of the same pixel on the previous line, or on the previous frame. GStreamer, however, calculates those in the deinterlacer base class, so I had more input parameters. I thought that maybe one of those was being used improperly, but they were all correct. So I was really at a loss.

What do we do when we have no idea what’s wrong with the code? Change the functionality of random parts and see what breaks. By doing this, I slowly figured out that the value of some variable was too high in some special case (the diff parameter when mode is 0). However, there are many steps involved in that calculation, and nothing makes a lot of sense until the very end. Just to make sure, I quadruple-checked that part of the code. Nope, correct. I thought, maybe I’m accidentally messing with a register that’s needed later. Nope, I’m not.

At this point I decided to take a step back and look at the more inconspicuous parts of the code. While looking around, I noticed this macro:

%macro LOAD 2
    movh      %1, %2    ; load the pixels into the low half of %1
    punpcklbw %1, m7    ; widen each byte to a word by interleaving with m7,
                        ; which is assumed to contain zeroes
%endmacro

This loads some value into a register and then interleaves it with zeroes. We are adding pixel values, so we need to make sure that the carry doesn’t accidentally spill over into neighbouring pixels. This makes the assumption that m7 is zero. Indeed, I remember this being set to zero early on. But let’s make sure…

LOAD         m7, [bzeroq]

Ah-HA! The end effect would be that the value of m7 is interleaved with itself instead of with zeroes. That was indeed only in the mode==0 special case, and directly influenced the result of the diff parameter.

bzeroq is one of those entry-point parameters that FFmpeg calculates in the ASM code but GStreamer gets as input. FFmpeg calculates that value early on, uses it once, then puts it on the stack for later. I had decided there was no need to go via the stack when I could just load it directly. Turns out… I couldn’t.

Going via the stack, like FFmpeg does, solved my bug.

https://gitlab.freedesktop.org/gstreamer/gstreamer/-/merge_requests/1816/diffs?commit_id=fbeecb9e5567b5822e93ea50fd28f820cf7bbdaf

Bug of the day: 2021-07-05

I was updating the code to the latest version of the GStreamer Rust bindings. I ended up touching A LOT of parts in almost every file, so the diff was huge. The biggest culprit was a change in the timestamps API, and we do a lot of things with timestamps, so everything related needed to be updated.

After finally getting everything to compile, I tried running the automated tests to see what went wrong there. All tests passed, apart from one.

This was a bit tricky. It’s not like half of them were failing, which would mean that I had missed something fundamental. It’s also not like they were all passing, which would mean that everything was fine. It was just one test, and it was timing out. A timeout means “I had to do a series of many tasks in the background and they’re still not done”, so it doesn’t exactly hint at where the issue is.

Fortunately we have an auto-generated API schema file, which describes all commands, with their parameters, types, and default values. I had a diff in that file and had initially not paid attention to it. So I looked at the diff and saw the problem.

I had missed one single M.

So, instead of having the code buffer for 125 milliseconds by default, it was buffering for 125 seconds.

That’s the kind of bug that the compiler couldn’t possibly catch. I mean, “buffer for 125 seconds” looks legitimate at first glance, doesn’t it?
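Our code is in Rust, but the same single-letter difference exists with the C time macros, which shows the shape of the bug well enough:

#include <gst/gst.h>

static void
default_buffering_duration (void)
{
  /* Both are perfectly valid 64-bit nanosecond values, so the type system
   * has nothing to object to. */
  GstClockTime intended = 125 * GST_MSECOND;    /* 125 milliseconds */
  GstClockTime actual = 125 * GST_SECOND;       /* 125 seconds: one missing 'M' */

  (void) intended;
  (void) actual;
}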

Bug of the day: 2019-07-25

This was actually Sebastian’s bug. He was having a crash caused by an invalid timecode. Now, timecodes are just hours:minutes:seconds:frames labels for each video frame. His code was ending up with a timecode of something like 45:87:84:31. Yes, that’s 87 minutes and 84 seconds. Also 31 frames at 30 fps.
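For illustration, GStreamer’s timecode API can confirm that such a timecode is nonsense – a sketch using the values from above:

#include <gst/video/video.h>

static gboolean
timecode_is_sane (void)
{
  GstVideoTimeCode tc;

  /* 45:87:84:31 at 30 fps: minutes and seconds must stay below 60, and
   * the frame count must stay below the frame rate. */
  gst_video_time_code_init (&tc, 30, 1, NULL, GST_VIDEO_TIME_CODE_FLAGS_NONE,
      45, 87, 84, 31, 0);

  return gst_video_time_code_is_valid (&tc);    /* FALSE for this one */
}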

He wondered where such a wildly invalid timecode might come from, then noticed that he had accidentally left LTC input turned on. LTC takes an audio signal as input and converts that signal into a timecode. With the input turned on and no microphone connected, it was picking up the music he was playing, through the “Monitor of sound card” input.

He tried reproducing it but failed. Then I looked at him and suggested that he try the previous song again… And, kaboom! That particular song had the ability to generate crazy timecodes.

The fix is here: https://gitlab.freedesktop.org/gstreamer/gst-plugins-bad/commit/aafda1c76f4089505e16b6128f8b80ab316ab2f0

Bug of the day: 2018-07-12

I was alerted that there was a bug with the new build: audio/video data didn’t seem to be flowing into the pipeline. Okay, no big deal, let’s enable logging and see where the data goes, right?… Except that enabling logging made the bug go away. The default debug level is *:3, which means “3 everywhere”. If I enabled “*:3,GST_PADS:6” (which means 6 for GST_PADS, 3 everywhere else), the bug disappeared. Same with “*:3,GST_EVENT:6”, which is a really lightweight logging change in our pipeline. Hmm, a tricky race condition. Then I made a typo and enabled something equivalent to “*:3,non_existent_category:6” – a debug level of 3 everywhere, except in a category that doesn’t even exist, where it would be 6. Effectively the same logging as the default, yet the bug was still gone. We had pretty much the definition of a Heisenbug: if we so much as sneezed next to the logs, the bug disappeared. How on Earth is one supposed to debug this?

The next thing to try would normally have been to roll back recent changes and see if they made a difference. Except there were no recent changes in the code; the only thing that had changed was something in the build system. So, depending on how the code was built, it would or wouldn’t demonstrate the bug. And the bug would only show up without any attempt at enabling any logging.

After some time spent barking up wrong trees, waiting for builds to finish, and trying to figure out how to reduce the build waiting time, I realised that the problem was essentially the following: the driver was delivering buffers into the GStreamer element, but those buffers were failing to arrive a bit further downstream. A backtrace showed me that the buffers were not being queued up anywhere either. So, if they were produced at the source, not arriving where they were supposed to arrive, and not piling up anywhere, something in the middle had to be dropping them.

I searched the code for pad probes that could be dropping the buffers, but there were none. I also looked at the elements between the source and the one that was missing the buffers, and they were all harmless. Some closer examination showed me that the buffers were even failing to exit the source element. So, something inside the source element must be secretly swallowing up the buffers.

Now, the GStreamer Decklink source element has a property that says “drop this much time when you first start up the element”. A closer look inside the logs revealed that this was indeed the culprit: it was waiting for the first (approximately) 40 hours since start-up until it would start outputting buffers. I looked inside the code that sets this property: it was set to one second. Hard-coded. As in, 1000000000 nsec. I would set that 64-bit property to literally 1000000000, and it would receive a huge number on the other side.

The key word there was “literally”. I tried casting the literal 1000000000 to a 64-bit integer and it worked!

The argument was passed as part of a vararg function call (variable number of arguments, NULL-terminated). Vararg arguments aren’t automatically promoted to 64 bits when the receiver expects it. The result was that the machine took 1000000000 as the low 32 bits, and the high 32 bits were filled with whatever garbage happened to be in the register at the time. And that’s how you accidentally turn 1 second into 40 hours!
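A sketch of the pitfall, with a made-up property name standing in for the real one (what matters is only that the property is a 64-bit unsigned integer):

#include <gst/gst.h>

static void
set_startup_drop_time (GstElement * src)
{
  /* "startup-drop-time" is a made-up name for illustration. g_object_set()
   * is variadic, so the compiler cannot know the property expects a
   * guint64: the bare literal is passed as a 32-bit int, and the upper
   * 32 bits end up being whatever garbage is lying around. */
  g_object_set (src, "startup-drop-time", 1000000000, NULL);    /* broken */

  /* Making the argument explicitly 64 bits wide fixes it. */
  g_object_set (src, "startup-drop-time", G_GUINT64_CONSTANT (1000000000), NULL);
  g_object_set (src, "startup-drop-time", (guint64) 1000000000, NULL);
}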

Bug of the day: 2018-04-11

I recently noticed that we had a very big memory leak. My first reaction was to run valgrind. However, valgrind only showed me some loose change here and there, nothing serious. Just to be sure, I fixed the small things it found, but the memory usage kept increasing. What’s worse, memory usage was pretty much constant when running under valgrind, but skyrocketed when running normally. Pretty much the definition of a Heisenbug.

Some time later, I noticed that memory usage was sometimes just fine when running things normally too. Then I noticed that, in those cases, the system would fail to create an output file, but wouldn’t bother informing anyone about it. That’s pretty much the definition of a Mad Girlfriend Bug.

I investigated the Mad Girlfriend Bug a bit more, and found that some element would sometimes fail to transition to PLAYING fast enough, causing it to return FLUSHING and thereby preventing any media data from flowing through the pipeline. It just happened that valgrind changed the timing so that this was always the case. That also explains why the memory leak wasn’t appearing under valgrind:

Roll Safe Think meme: You can't leak media data if it's just not flowing in the pipeline

In the meantime, I took a look at git diff and found what was causing the memory leak. I was using an appsink/appsrc combination. If you pull a buffer from the appsink and push it into the appsrc, it gets consumed by the appsrc. However, if you pull a sample from the appsink and push it into the appsrc, you need to unref it later. I ended up having to make it obvious in the documentation:

https://bugzilla.gnome.org/show_bug.cgi?id=795150
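In code, the leaking pattern and its fix look roughly like this (a sketch, with error handling omitted):

#include <gst/app/gstappsink.h>
#include <gst/app/gstappsrc.h>

static void
forward_one_sample (GstElement * appsink, GstElement * appsrc)
{
  GstSample *sample = gst_app_sink_pull_sample (GST_APP_SINK (appsink));

  if (sample == NULL)
    return;                     /* EOS or flushing */

  /* gst_app_src_push_sample() does NOT take ownership of the sample (it
   * takes its own reference to the contained buffer), so the sample still
   * has to be unreffed here -- forgetting that was the leak. */
  gst_app_src_push_sample (GST_APP_SRC (appsrc), sample);
  gst_sample_unref (sample);
}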

Bug of the day: 2017-10-13

I made some changes near the source of the pipeline, essentially processing some metadata. The result: the RTSP preview worked, but the files written were 0-sized. Well, that sounds like cleaning up your kitchen and causing your bathroom sink to start leaking, except the bathroom is on the other side of the apartment, and your shower is working just fine.

I traced it back to a few lines of code that were supposed to be a small improvement, harmless at worst. They were about buffer meta, which is just what it says on the box: metadata that travels together with a buffer. Timecodes in GStreamer are such metadata. Another one is the “reference timestamp”: it has some caps, which just act as a label, and a timestamp. Our buffers used to carry only one reference timestamp: stream time. Recently we had also added hardware time. (Don’t try to understand those, they’re just labels for what it’s worth in this context.) When getting the reference timestamp for a buffer, I used to just leave the caps (the label) NULL, meaning “give me whatever is first”. But now we also had hardware time, so I had replaced the NULL with a proper caps description. With NULL it worked, I could create files. With the proper caps description, I couldn’t. That sounds like going to a bar, getting a beer when you ask for a drink, but getting nothing when you ask for a beer. Makes no sense, does it?
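For context, this is roughly what that lookup looks like – a sketch where the caps string is an illustrative label, not the one we actually use:

#include <gst/gst.h>

static GstClockTime
get_stream_reference_time (GstBuffer * buffer)
{
  /* The caps act purely as a label saying which clock the timestamp refers
   * to; passing NULL instead returns whichever reference timestamp meta
   * happens to come first on the buffer. */
  GstCaps *label = gst_caps_new_empty_simple ("timestamp/x-stream-time");
  GstReferenceTimestampMeta *meta =
      gst_buffer_get_reference_timestamp_meta (buffer, label);

  gst_caps_unref (label);
  return meta != NULL ? meta->timestamp : GST_CLOCK_TIME_NONE;
}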

I started tracing the buffers back to see where the problem started. The muxer wasn’t outputting any data. It received video, but it didn’t receive audio. Hmm, suspicious. It kind of made sense though: the RTSP preview is video-only. We were getting notifications about audio levels, so something must have been getting lost along the way. Weird. I tried to see where the audio buffers got lost, and it turned out to be at the exact spot where the aforementioned lines of code were. Okay, now it was at least really starting to make sense. That part was synchronising audio with video, though. It wasn’t supposed to eat bread with buffers instead of bread with butter. So what was wrong here?

A small investigation showed that, while the audio buffers had both stream and hardware reference timestamps, the video buffers only had hardware reference timestamps at that point. I tried passing NULL to see what would happen, and the log file happily reported: “I found the first reference timestamp for both audio and video. It’s a hardware timestamp in both cases!”… Well, okay, at least synchronisation worked in that case, but it was pure coincidence. I then looked at the code that added the reference timestamps and, surprise of all surprises, it was identical for both audio and video. I even added an assertion right after it to make sure that both metas existed on video buffers. They did. But my video buffers genuinely had no stream reference timestamp by the time I was trying to read it. So where did the meta go?

I took a look at my pipeline to see if there were any elements that would delete the meta. There weren’t. The only element that did anything with meta was timecodestamper: it was configured to remove any existing timecode meta and insert its own. It looked a bit far-fetched to imagine that it would remove the reference timestamp metas as well, but everything else made even less sense, so I gave it a try. There was a function that checked all metas and removed the timecode one. I checked it, but nope, both reference timestamps were still intact after its call. What I did find, though, made even less sense: the function that added a timecode meta was removing one of the two other metas!

God enlightened me to talk to Sebastian at that point. He said “oh, Mathieu fixed that one a few days ago”:
https://bugzilla.gnome.org/show_bug.cgi?id=788617

It was essentially a bug in a homegrown linked list implementation, and it had taken him a lot of head scratching and hair pulling to figure out. It manifested itself at a completely different place, too. That bug had existed since something like 2012 and nobody had ever run into it – until Mathieu did, a few days earlier. And now so had I. Think about the odds of that.

Bug of the day: 2017-03-30

We had a pipeline that would capture 4K video, encode it with a high-quality but low-compression-ratio codec, mux it into a .mov container together with raw audio, and write it to a file on a networked drive. The drive is fast enough to easily make use of the 10 Gbit Ethernet card.

One 4K session was running fine. Two 4K sessions were freezing. CPU and memory usage were fine – nothing there seemed likely to cause things to lock up. I took pipeline dumps during the freeze: the queue right before the encoder was full. It was 600 ms, and qtmux shouldn’t need more than that for interleaving, so why was the queue getting full? I then examined the timestamps of all incoming buffers. It turned out that audio and video were off by more than 600 ms… but WHY? (Thursday night is here)

Now some information on how qtmux works: raw audio has no timestamps of its own; qtmux just assumes that it progresses at the given rate. If there’s a gap, the data that comes afterwards gets squeezed up against what came before, essentially shifting it to earlier timestamps and messing up A/V sync along the way.

So, keeping in mind the above, we noticed that several chunks of audio and/or video were going missing, and that’s what was causing the queue to get full. But why would they go missing? (Friday night is here)

Examining the data as it came from the capture card, we noticed that the callback that fed us the audio/video data was occasionally just forgetting to be called. It would get called at very regular intervals, until BOOM, a spike of maybe half a second or even a whole second happened. What’s worse, the drivers have no internal buffering: if you take too long to process the incoming data, you lose the next frame. We thought that the scheduler might be messing things up, but setting the capture thread to real-time priority didn’t seem to have any effect either. The thread itself wasn’t doing much – it just put the data into a buffer and left, so it wasn’t spending more than 1 ms in there. No locking or anything else was happening. CPU usage was still OK, no queues were really full at that point, so what could be wrong? We had this wild idea: does it run out of buffers in the pool? YES! But why? (Monday night is here)

On Tuesday we examined the buffer pool situation, looking at pipeline graphs and processing log files. It turned out that the queue right before the file sink was getting full first. After this, the queues between encoder and muxer would also get full. Then the queue before the encoder would get full too. Everything up to that point was zero-copy in my tests, so that is what caused the capture card’s drivers to run out of buffers in their pool. Eventually the situation downstream of the encoder would slowly stabilise, but by then it was too late, because we already had the gap that made everything explode. But why on Earth would the file sink refuse to write any further buffers? That was Tuesday. I tried several things, but nothing really seemed to help.

Yesterday we also enlisted a colleague, and he examined the system parameters in /proc etc. He optimised a lot of things – for example, the throughput with “dd” has pretty much doubled now. However, even with all these changes, the situation got better but the problem didn’t go away.

Today (Thursday) I examined some graphs of how long the writev() call took versus how long it took until it was called again. It turns out that writev() could take up to half a second! Then it would get called again almost immediately. After a quick search, I ran into this issue: https://github.com/netty/netty/issues/5360 We had already tried increasing the buffering in the filesink earlier, but it had had no real effect. Turns out we weren’t using sensible values: even the 4 MB we tried is maybe smaller than a single video frame. So I gave it some thought and then set the filesink buffer size to 128 MB (it won’t actually write the data out to the file before that much has accumulated). AND IT WORKS!!!
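The tweak itself is tiny, assuming the stock filesink and its buffer-size property (in bytes):

#include <gst/gst.h>

static void
enlarge_filesink_buffer (GstElement * filesink)
{
  /* Accumulate up to 128 MB before actually writing out to the file.
   * The property is a 32-bit guint, so a plain int literal is fine here. */
  g_object_set (filesink, "buffer-size", 128 * 1024 * 1024, NULL);
}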

Bug of the day: 2016-12-22

Somewhere in the code, I was receiving an integer and had to return a filename based on it; the integer was a sequence number: 0, 1, 2, and so on. I needed to use that integer in a callback, so I was passing it as user_data with GUINT_TO_POINTER (a macro that casts an integer to a pointer) and getting it back with GPOINTER_TO_UINT on the other side. The first file was being written normally, but the second one failed. I looked into why and saw this line in the callback:

MyElement *self = MY_ELEMENT (user_data);

where “self” wasn’t even needed in that callback – I had just added it out of habit, because usually user_data is self. The cast macro does have some sanity checks, so the first time it worked: user_data was just 0x0, so self was NULL, alright. The second time, self became (MyElement *) 0x1, the sanity check failed, and kaboom.
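A sketch of the pattern with made-up names – the signal and its signature are illustrative, only the user_data handling matters:

#include <glib-object.h>

static void
on_file_requested (GObject * source, gpointer user_data)
{
  /* Wrong: user_data is not an object at all, it's an integer smuggled
   * through a pointer. With 0 the cast macro happens to pass a NULL
   * through; with 1 its sanity check fails and everything blows up. */
  /* MyElement *self = MY_ELEMENT (user_data); */

  /* Right: undo the GUINT_TO_POINTER() from the connecting side. */
  guint sequence_number = GPOINTER_TO_UINT (user_data);

  g_print ("writing file number %u\n", sequence_number);
}

/* On the other side, the integer goes in as the user_data:
 *
 *   g_signal_connect (element, "file-requested",
 *       G_CALLBACK (on_file_requested), GUINT_TO_POINTER (sequence_number));
 */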