Bug of the day: 2016-12-14
The previous day, I had been looking at the debug logs and noticed a general teardown of the pipeline (setting state to NULL, removing elements, etc.) when something crashed (e.g. wrong parameters => failed to link). I thought it absolutely shouldn’t happen and went to fix it. I could reproduce the pipeline crashing completely, but not the teardown. In any case, I pushed some fixes to make it fail more gracefully and went to sleep.
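For context, here is a minimal sketch of the kind of "fail gracefully on link failure" behaviour I mean, in plain GStreamer C. The element names and the helper itself are illustrative, not the project's actual code:

```c
#include <gst/gst.h>

/* Hypothetical helper: add elements to the pipeline and link them.
 * If linking fails (e.g. wrong parameters / incompatible caps),
 * tear down cleanly instead of leaving a half-built pipeline:
 * set state to NULL and remove the elements again. */
static gboolean build_chain(GstElement *pipeline,
                            GstElement *src, GstElement *enc,
                            GstElement *sink)
{
    gst_bin_add_many(GST_BIN(pipeline), src, enc, sink, NULL);

    if (!gst_element_link_many(src, enc, sink, NULL)) {
        /* This is the "general teardown" that shows up in the logs. */
        gst_element_set_state(pipeline, GST_STATE_NULL);
        gst_bin_remove_many(GST_BIN(pipeline), src, enc, sink, NULL);
        return FALSE;
    }
    return TRUE;
}
```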
The next day, the pipeline started crashing for no apparent reason while I was trying to debug something else. I opened the log and, sure enough, there was a teardown. I also thought I had seen some “child process killed” debug output on stdout, but that was running under screen, and “how do I scroll under screen?” soon turned into “meh, that must have been just my imagination, I can see the teardown right here – plus I’m not getting a core dump”. I tried to track down where the teardown was coming from, and it turned out to be triggered by a user command. My colleague, however, insisted that the pipeline was simply crashing.
Upon closer inspection, I noticed two processes doing teardown simultaneously. I asked my colleague and he said “yes, I was running four RTSP previews at once, but I was only recording to a file on one channel”. I checked which process had the filesink in it, and it was yet another process, whose last log line was something completely normal and non-error. At that point I scrolled up and, sure enough, the “child process killed” message was there. It was even signal 9 (SIGKILL), which explained the absence of a core dump.
What had really happened: I was already debugging something else, so I had set the debug log level to a high value. My colleague was running four channels at once, which meant four different RTSP sessions, each encoding its stream on a CPU encoder. When one of those four sessions had to start yet another encode in order to write to the file, the machine simply couldn’t handle the load: the child process took longer than the configured timeout to respond and ended up getting killed as “probably deadlocked”.
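For the curious, the parent-side watchdog behaviour boils down to something like the sketch below. The names (CHILD_TIMEOUT_SEC, the ping-over-a-pipe protocol) are assumptions for illustration, not the project's actual code; the key point is that SIGKILL cannot be caught, which is why there was no core dump, only the “child process killed” line:

```c
#include <signal.h>
#include <stdio.h>
#include <sys/select.h>
#include <sys/types.h>
#include <unistd.h>

#define CHILD_TIMEOUT_SEC 10  /* assumed configurable timeout */

/* Hypothetical watchdog check: wait for the child to answer a ping
 * on `fd`; if it stays silent past the timeout, assume it is
 * deadlocked and kill it outright. */
static void watchdog_check(pid_t child_pid, int fd)
{
    fd_set set;
    struct timeval tv = { .tv_sec = CHILD_TIMEOUT_SEC, .tv_usec = 0 };

    FD_ZERO(&set);
    FD_SET(fd, &set);

    if (select(fd + 1, &set, NULL, NULL, &tv) <= 0) {
        /* No reply in time: SIGKILL gives the child no chance to
         * dump core or log anything, hence the quiet disappearance. */
        fprintf(stderr, "child %d probably deadlocked, killing\n",
                (int)child_pid);
        kill(child_pid, SIGKILL);
    }
}
```

An overloaded CPU makes the child miss the deadline exactly as a real deadlock would, which is what turned “too much encoding load” into “mysteriously vanishing process”.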