Bug of the day: 2018-07-12

I was alerted that there was a bug with the new build: Audio/video data didn’t seem to be flowing into the pipeline. Okay, no big deal, let’s enable logging and see where the data goes, right?… Except that enabling logging made the bug go away. The default debug level is *:3, which means “3 everywhere”. If I enabled “*:3,GST_PADS:6” (which means 6 in GST_PADS, 3 everywhere else), the bug disappeared. Same with “*:3,GST_EVENT:6”, which is really a lightweight logging change in our pipeline. Hmm, a tricky race condition. Then I made a typo and enabled something equivalent to “*:3,non_existent_category:6” for logging – which means “have a debug level of 3 everywhere, except in that category which doesn’t exist anyway, where it will be 6”. The bug was not present. We had pretty much the definition of a Heisenbug, where if we as much as sneezed next to the logs, the bug disappeared. How on Earth is one supposed to debug this?

The next thing to try would maybe have been to roll back recent changes and see if they made a difference. Except there were no recent changes in the code. The only thing changed was something in the build system. So, depending on how the code was built, it would or wouldn’t demonstrate the bug. But the bug would show up only without any attempt at enabling any logging.

After some time spent barking up wrong trees, waiting for builds to finish, and trying to figure out the way to reduce the build waiting time, I realised that the problem was essentially the following: The driver was delivering buffers into the GStreamer element, but those buffers were apparently failing to arrive a bit further downstream. A backtrace showed me that the buffers were also not being queued up somewhere. So, if they are produced at the source, not arriving where they were supposed to arrive, and also not piling up anywhere, something in the middle must be dropping them.

I searched the code for pad probes that could be dropping the buffers, but there were none. I also looked at the elements between the source and the one that was missing the buffers, and they were all harmless. Some closer examination showed me that the buffers were even failing to exit the source element. So, something inside the source element must be secretly swallowing up the buffers.

Now, the GStreamer Decklink source element has a property that says “drop this much time when you first start up the element”. A closer look inside the logs revealed that this was indeed the culprit: it was waiting for the first (approximately) 40 hours since start-up until it would start outputting buffers. I looked inside the code that sets this property: it was set to one second. Hard-coded. As in, 1000000000 nsec. I would set that 64-bit property to literally 1000000000, and it would receive a huge number on the other side.

The key word there was “literally”. I tried casting the literal 1000000000 to a 64-bit integer and it worked!

The argument was passed as part of a vararg function (variable number of arguments, NULL-terminated). Those aren’t automatically cast to 64-bit if needed. The result was that the machine was taking 1000000000 as the low 32-bit part, and the high 32-bit part would be completed by whatever garbage was in the register at the time. And that’s how you accidentally convert 1 second to 40 hours!

Leave a Reply Cancel reply