Or should I say, bug combo hit of the month? This is definitely one of the hardest issues I’ve encountered in my life, if not THE hardest.
So, we have an internal component called “videoengine”. The videoengine uses GStreamer with a non-negligible number of downstream-only patches. At some point, GStreamer 1.28 came out, and we decided to update to it. It’s a procedure that takes half a day if nothing goes wrong; perhaps there’s a merge conflict or two, and in some rare cases there’s even an upstream regression that we have to solve first. Yes, that sounds ominous, I know.
So I solved more than my fair share of merge conflicts, for one of which I had to contact Jan, and he had to remember his original intention when he wrote that patch (and then he took some time to actually get it merged upstream). Then a couple of automated tests were failing, so I had to cherry-pick some fixes that landed in 1.28.1 a few days later. And just as I thought “whew, this looks ready now”, I noticed that two integration tests were failing. These integration tests were known to be flaky, so we sometimes didn’t pay attention to one or two failing, but this time tests P3 and P4 were failing consistently, and that sounded fishy.
My first reaction was to try increasing the logging level, but that made all tests start failing whenever logging was configured to any useful level. Those who have worked with GStreamer know that this is not unusual, because logging slows down a latency-sensitive live workflow. So I changed my approach and tried to reproduce the bug locally. Test P3, in particular, was relatively simple: start the videoengine, start the RTMP server, stream to it, start an RTMP source from the videoengine, and remux its output into RTMP and a couple more protocols. So I tried taking one of our pre-existing manual test scripts and adapting it. Nope, that seemed to work fine. I then tried tracing back the exact commands that were sent to the videoengine (which were fortunately logged) and replaying them. Nope, that also worked fine. So we’re talking about a case of the Three Most Terrifying Words: Possible Race Condition.
We started brainstorming ideas about what the problem could be, based on the limited amount of information we had. Except reality kept disproving all of our theories. Every time we thought “yeah, that could be it” or “this should solve it”, it turned out to be irrelevant. The only thing we knew was that GStreamer 1.26.10 was working fine, while 1.28.0 was broken. We could technically bisect, except there were too many commits in between, plus a small collection of local patches that would have to be rebased and would produce different levels of merge conflicts depending on the version. Also, the failure wasn’t reproducible locally, and the CI runs in two steps (first compile GStreamer, then the videoengine), a process that takes almost one hour including the integration tests.
What we did have, though, was a pipeline graph. I noticed a failure early on in the pipeline: after the source and demuxer, the H264 parser wasn’t outputting any caps. However, this happened early enough that it could conceivably mean the parser just hadn’t seen a keyframe yet, and the test failed immediately afterwards. Hmm, could it be that it just needs some more time? We added one 3-second delay to P3 between creating the source and linking it, and another 3-second delay between enabling the sinks and checking them. No luck. Still the same failure near the source. Hmm, that’s fishy.
But, given that the failure was near the source, there was just a small handful of elements involved: RTMP source, FLV demuxer, H264 parser. And those had no local patches that could be critical to this specific pipeline’s success or failure. So I started reverting them one by one to their 1.26 state and checking the CI’s (slow) result. I found the RTMP source to be irrelevant to the bug, but both the demuxer and the parser had to be reverted to their 1.26 state for P3 to pass. So that’s possibly a combination of factors? To make things worse, P4 still failed, but I figured that was probably down to the 3-second delay and that I should tackle problems one at a time.
(Hey, at least these plugins could be independently reverted to their 1.26 state and didn’t require any corresponding changes to GStreamer core or some base class!)
Having nothing else left to try, I decided to play git bisect on the CI. Each step still had to run in two parts and took almost one hour, but at least this time there were no relevant local patches to rebase, and the number of bisect steps in each case was much more manageable. Painstaking, but the only way forward.
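For a sense of scale: bisect cost grows only logarithmically with the number of commits, which is what made this approach survivable at one hour per step. A quick back-of-the-envelope sketch in Python (the commit counts are illustrative, not the actual ones from this bisect):

```python
import math

def bisect_steps(num_commits: int) -> int:
    """Worst-case number of git-bisect steps needed to isolate one
    bad commit among num_commits candidates (binary search)."""
    return math.ceil(math.log2(num_commits)) if num_commits > 1 else 0

# With a ~1-hour CI turnaround per step, even a few hundred commits
# stay manageable: e.g. 300 commits need at most 9 steps.
```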
For the parser, I found the culprit to be this commit by Seungha, which made the element use the VUI framerate when the upstream framerate is 0/1. That initially sounded a bit weird, because upstream didn’t report 0/1 in our case. However, I noticed there was a follow-up fix that basically told the parser base class to preserve the upstream buffer duration where possible. I restored the h264parse element to its 1.28 state, cherry-picked that commit to baseparse, and the test passed. Okay, it was still only P3 and only with the FLV plugin at its 1.26 state, but still: after several weeks, there was finally some real progress on this bug.
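In pseudocode terms, my reading of those two commits boils down to something like the following. This is a hedged Python sketch of the logic, not the actual C code in h264parse/baseparse; the function names and signatures are mine:

```python
from fractions import Fraction

def effective_framerate(upstream_fps, vui_fps):
    """Roughly the first commit: if upstream signals an unknown
    framerate (0/1), fall back to the framerate parsed from the
    H.264 VUI, when one is present."""
    if upstream_fps == Fraction(0, 1) and vui_fps:
        return vui_fps
    return upstream_fps

def buffer_duration(upstream_duration, fps):
    """Roughly the follow-up fix: preserve the duration upstream
    already stamped on the buffer, and only derive one from the
    framerate when upstream provided none."""
    if upstream_duration is not None:
        return upstream_duration
    return 1 / fps if fps else None
```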
Let’s focus on P4 for a bit, though. My colleague Jan heard me saying it would probably pass if we added another 3-second delay in there. Then I noticed a commit where he removed said delay from P3. Result: P3 still passed, P4 still failed. Frustrating. He also noticed an End-Of-Stream event in the logs. Okay, if we’re getting EOS, the pipeline can’t really recover, but WHY would it go EOS so early on? I tried adding some logging, but of course that made P3, P4, and about three more tests fail completely. Sigh.
So I decided to focus on the FLV plugin. By that point, my brain was fried after two days of moving commit IDs and test results back and forth, and watching a slow CI like a hawk in order to scrape off a couple of minutes here and there, thinking they add up to real numbers when done several times. So I accidentally marked a good commit as bad and ended up losing about two hours. Quite an appropriate mistake to make on literally Saint Patience’s day.
Eventually, though, I got there, and found this commit by Tarun to be the culprit. It was the one that added support for multiple audio/video streams in the FLV demuxer, and unfortunately it was a relatively large one. However, it shouldn’t cause any difference in behaviour when there’s only one audio and one video stream, just like before…
… Or should it?
So I realised that the no-more-pads behaviour was different. Before, the demuxer used to emit no-more-pads as soon as one audio and one video stream had been found, or after a 6-second timeout. Now, however, the number of streams is not known in advance: streams might appear or disappear at any moment. In the case of e.g. MPEG-TS, there’s at least a header telling us the number of streams currently available; FLV has no such header. So the approach taken was to emit no-more-pads only after a 6-second timeout. That’s important, because we use no-more-pads to unblock the source. I dug into the code of the integration tests and found a part that gave the source 5 seconds to start up and timed out otherwise. Okay, that would explain A LOT. I restored the FLV plugin to 1.28, changed that hard-coded 6 to a 2, and, what do you know, P3 now passes! I made an upstream merge request to make that hard-coded timeout configurable, which at the time of writing still has a couple more minor review comments pending.
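The interaction is easiest to see as a toy model. Here is a minimal Python sketch of the race, purely illustrative (the timeout values are the ones from the story, but the function is obviously not videoengine code):

```python
def source_starts_in_time(no_more_pads_timeout, source_startup_timeout):
    """Toy model of the race: the videoengine unblocks the source
    only once the demuxer emits no-more-pads. With 1.28's flvdemux
    that happens after a fixed timeout, so startup can only succeed
    if the test harness waits longer than the demuxer does."""
    return no_more_pads_timeout < source_startup_timeout

# flvdemux waits 6 s while the harness allows 5 s: guaranteed failure.
# Lowering the demuxer timeout to 2 s puts it back under the limit.
```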
What about P4 though? I had a small idea. I looked at the test code, and realised that Jan had only removed the first of the two 3-second delays. The second one, between enabling the sinks and checking them, was still there. So I added a matching one to P4 and it passed! Turns out, that EOS was a total red herring. It wasn’t the cause of our failure, but the result of it, caused by the test saying “okay, test failed, start teardown”.
However, it’s poor code hygiene to rely on hard-coded delays. By analysing the test a bit more, I found that its timeout logic was too tight and didn’t give the pipeline enough time to start up, which might explain the general flakiness of the integration tests that I mentioned at the beginning. Without getting too much into the internal logic of the videoengine, the correct course of action here is to wait for… you guessed it… no-more-pads. (Or, even better, to use the no-longer-new Streams API.)
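To sketch the shape of that fix: instead of sleeping for a fixed number of seconds, the test can block on a condition that the pipeline itself signals. In this illustrative Python snippet, a threading.Event stands in for GStreamer’s no-more-pads signal; the names and default timeout are mine, not the videoengine’s:

```python
import threading

# Stand-in for the pipeline's readiness signal.
no_more_pads = threading.Event()

def on_no_more_pads():
    # In the real test this would be connected to the demuxer's
    # "no-more-pads" signal (or, better, to the Streams API).
    no_more_pads.set()

def wait_for_pipeline(timeout=10.0):
    """Returns True the moment the pipeline reports readiness, so
    the happy path pays no fixed delay; the timeout only bounds
    the failure case."""
    return no_more_pads.wait(timeout)
```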
Following that, I had to clean up my git commit history from the two repositories, because every single step of every single bisect was in there. It looked like a bloody murder scene: Extremely untidy, but everything had to be preserved for forensic reasons. Now that the culprits had been successfully caught, there was a lot of cleaning to do.
The difficulty of this issue wasn’t so much the typical “that bug was hard to find”. It was that there were three different issues making the same two tests fail, and that I couldn’t reproduce them locally and had to rely on the CI.