Bug of the day: 2024-02-02

I had started chasing this bug already in December. A coworker of mine had reported that, with a specific input file, using a tricky maneuver, that also required a lot of other moving parts that interacted with our code, and that also involved at some point deleting and re-adding all elements from the pipeline (!), the file would stall after showing only a couple of frames.

My first thought was to try and reproduce it locally, without all the moving parts. I tried repeatedly, but failed. My colleague Jan also tried repeatedly, but failed. No matter what we did, it was all working fine. We also asked for log files, but they didn’t show any issues either. I was really stuck for a long time, because I had no idea how to chase that bug.

Eventually, Jan noticed that the videorate element was trying to bridge a huge gap: It had received only one input frame, but had duplicated several frames. However, our logs did not indicate such a gap. The videorate element is what converts the frame rate of the video between different values, and also what fixes up the (non-live) stream in case a buffer has gone missing or appears twice.

The next step was to ask for additional log file with videorate debug information. Fortunately, my other coworker could still reproduce it with the moving parts. And, there I saw it:

BEGINNING prev buf 1000:04:41.104589703 new buf 1000:04:41.137956369 outgoing ts 1000:04:41.104589703
diff with prev 1000:04:41.104589703 diff with new 1000:04:41.137956369 outgoing ts 0:00:00.000000000

There it was. videorate was calculating the differences wrongly!

My first thought was to look at existing related bug reports and merge requests. I found something that looked suspiciously close to our issue:

https://gitlab.freedesktop.org/gstreamer/gstreamer/-/merge_requests/2186

If my theory was correct, it would mean that reverting https://gitlab.freedesktop.org/gstreamer/gst-plugins-base/-/merge_requests/767 would fix our issue. Except it wasn’t possible to revert, because the code had been repeatedly touched since.

I then started looking at the calculations. Something was wrong with the segment-related parts, related to how the segment base was used. I thought I had found it, made a patch that fixed this specific issue, put it up on the CI, and it ended up breaking other things.

For clarity: Imagine that you are playing back a file, playing it faster or slower, seeking back and forth, etc. The time displayed at your player’s clock corresponds to the timestamps inside the file. However, the parts that know when to display each frame, or when to play each sound, have a different time, according to how you manipulate the playback. For simplification, let’s say that the time displayed at your player’s progress bar is the buffer timestamp, and the time when the frames/sounds are displayed is the running time. So, when you move the slider back and forth in your player application, the buffer timestamp will move back and forth correspondingly, but the running time will keep increasing. In order to convert from one to the other, you need information from what we call a segment.

Seeing as my first patch didn’t work, I thought I’d convert all calculations to running time. Some values used were clearly buffer timestamps, but some others were something strange that was neither a buffer timestamp, nor a running time. So I would store running times for reference, and would convert back and forth when necessary. I had that patch almost ready. Almost. It fixed our initial test case and made everything but one integration test pass. It was almost ready to be merged! I was only thinking about how to handle the corner cases where the calculations end up with a negative timestamp.

And then came that fatal evening. I was visiting the office in Cologne, and took the opportunity to have some nice authentic ramen in Düsseldorf. Jan joined us. So, as Jan and I were walking on the street on the way to the ramen, we started discussing that corner case. Then he told me “Why do we even use running time, anyway? We don’t need it, do we?”.

I didn’t want to believe it at first. I went back to work on the next day, looked at the running times I was using… and I was only using them in order to convert to buffer timestamps and back. So it’s indeed not necessary. We really only need buffer timestamps.

I then looked at the part that I should have looked at first. Why do we even need the segment base in those calculations?

guint64 base;
the running time (plus elapsed time, see offset) of the segment start

So the segment base is only needed if we’re converting to running time. Which we don’t have to. The calculations were just adding and subtracting it back and forth, but doing so wrongly, and that lead to our bug. Makes us wonder how it ever worked. But there were several reports of videorate spitting out negative timestamps, for instance.

I then started removing all uses of the segment base. And fixing a couple of other bugs along the way. And, in the end, not only did all tests on the CI pass, but that even made the tests Sebastian added in his merge request pass (with a little modification on one test), so I integrated them into my code.

As for why the segment base was used? I looked at git blame, and apparently it exists all the way back since the element was ported from 0.10 using the old GstSegment API, more than 10 years ago. I assume that it once worked differently, and back then it made sense to do it that way. One day it didn’t make much sense anymore, but nobody noticed before!

Here is the final merge request:

https://gitlab.freedesktop.org/gstreamer/gstreamer/-/merge_requests/6031