Bug of the day: 2019-07-25

This was actually Sebastian’s bug. He was having a crash caused by an invalid timecode. Now, timecodes are just hours:minutes:seconds:frames labels for each video frame. His code was ending up with a timecode of something like 45:87:84:31. Yes, that’s 87 minutes and 84 seconds. Also 31 frames at 30 fps.

He wondered where such a very invalid timecode might come from, then he noticed he had LTC input accidentally turned on. LTC gets an audio signal as an input and converts that signal into a timecode. It was turned on accidentally, no microphone connected, so he was picking up the music he was playing as “Monitor of sound card”.

He tried reproducing it but failed. Then I looked at him and suggested that he tries the previous song again… And, kaboom! That particular song had the ability to generate crazy timecodes.

The fix is here: https://gitlab.freedesktop.org/gstreamer/gst-plugins-bad/commit/aafda1c76f4089505e16b6128f8b80ab316ab2f0

Picky eating

You know how sometimes children refuse to eat a specific kind of food for no apparent reason? I’ve been “blessed” with this as an adult. Quite often, there’s a very good reason, they’re just not self-aware enough to express it.

At least, I was not, until I was in my late 20’s. One day, I asked my brother “hey, how do peppers taste for you?”. Knowing me well enough, he figured that I must have a good reason to ask, so he answered honestly: “refreshing, and a bit spicy”. My immediate reply was “wait, so they’re not bitter?”. On that day, I realised that the reason I couldn’t eat peppers was that most other people perceive a completely different taste than I do. For me, they’re like distilled bitterness. Yes, red ones too. Yes, even the rice cooked inside of a stuffed pepper. Yes, I’ve tried it several times.

Since then, I started experimenting by re-adding previously banished foods to my diet. Most people dislike okra because it’s hairy and slimy, though this isn’t the case if it’s prepared correctly. The okra I tried in India was neither hairy nor slimy. Just extremely bitter. Okay, back to not eating it, I guess. Aubergines, on the other hand, were a pleasant surprise. They can be non-bitter sometimes, especially the white, green, or stripy ones. It also depends on the way they’re cooked. I still don’t rely on them for a meal, but I take one bite or two.

When I was a child, I used to pick both mushrooms and peppers out of my pizza. Later on, I started tolerating and even liking mushrooms. My body could still sometimes say “okay that’s it, no more mushrooms for this meal”, and in those cases I’d shrug and finish my meal with something else that was on the table. Except that one day where I was strictly fasting and my only office lunch contained mushrooms. I tried force-feeding myself and my body rejected them quite strongly. Right, no lunch for me today, I guess.

Later on, I found a nice shop that was selling delicious mushrooms of their own production. It was both fasting and mushroom season, so I was happily having mushrooms at least a couple of times per week. Until it happened. I couldn’t finish that meal, and the no-mushrooms-please sensation didn’t just go away. I then remembered that friend of mine who always said she couldn’t have mushrooms, and asked her why. “They make me gag”, she said. Yep, sounds familiar!

It’s been a few years since then, and I’ve come to slightly tolerate them after a long abstinence period. I can have mushrooms on my pizza, if they’re sliced thinly enough, though I might still stop halfway and start removing them. If someone tells me “these mushrooms are delicious”, I might take a bite and say “yes, indeed!”, but not a second one. A quick internet search told me about “sensory processing disorder”, though I haven’t researched it further.

As a conclusion, all I have to say is: Give those children a break. If they’re refusing to eat something, try asking them why, you’re helping them. Just no pressure. Would you eat food sprinkled with the most bitter medicine you’ve ever tasted, or food that triggers your gag reflex? In most cases, they can avoid those foods and still have a balanced diet.

Το Κρασάκι του Τσου στα ελληνικά

(Η παρούσα ανάρτηση είναι μια ευγενική χορηγία του αδερφού μου)

Πριν κάμποσον καιρό, μιλούσα με ένα γνωστό μου για τα χόμπι μας. Του ανέφερα λοιπόν ότι στο παρελθόν είχα μεταφράσει γιαπωνέζικα κόμικς. Μου θύμισε λοιπόν ένα αστείο βίντεο που είχε δει, με τίτλο «το κρασάκι του Τσου», που είχε παρακουσμένους Ελληνικούς στίχους για το τραγούδι «Σούρα νο Χάνα» (=το άνθος της σφαγής) της Κάτζι Μέικο. Οπότε μου είπε «Εγώ θέλω να μεταφράσεις αυτό!» και εγώ δέχθηκα την πρόκληση.

Η προφανής επιλογή θα ήταν να κοιτάξω τις μεταφράσεις στα Αγγλικά και να βασίσω τη δική μου μετάφραση σε αυτές. Προς έκπληξή μου, όμως, όχι μόνο οι Αγγλικές μεταφράσεις ήταν τρία-πουλάκια-κάθονταν, μέχρι και οι μεταγραφές σε romaji είχαν λάθη. Οπότε, λέω, θα το μεταφράσω απ’ την αρχή.

Το μετάφρασμα παρατίθεται.

Ένα νεκρό πρωινό, το χιόνι πέφτει ενταφιάζοντας τα πάντα
Ακούγονται μόνο αδέσποτα σκυλιά που ουρλιάζουν και τα τσόκαρά* μου που τρίζουν.
Περπατώ μελετώντας το βάρος της θείας δίκης
Μια ομπρέλα σα στόχος αγκαλιάζει το σκότος
Περπατώ στο δρόμο της ζωής, ως γυναίκα που έχει εδώ και καιρό αποτινάξει τα δάκρυά της.

Στους μαιάνδρους του ποταμού, ο πυρσός του ταξιδιού σιγά-σιγά αμυδραίνει
Ο παγωμένος γερανός*² δεν κινείται, ο άνεμος κι η βροχή μαίνονται
Η παγωμένη επιφάνειά του αντανακλά μαλλιά ατημέλητα.
Μια ομπρέλα σα στόχος, δε δείχνει ούτε δάκρυα
Περπατώ στο δρόμο της εκδίκησης, ως γυναίκα που έχει εδώ και καιρό αποτινάξει την καρδιά της.

Και τσίπα, και συναισθήματα, και δάκρυα και όνειρα
και το χθες και το αύριο, ανούσιες λέξεις.
Ως γυναίκα που έχει εγκαταλείψει το κορμί της στο ποτάμι της εκδίκησης,
τα έχω αποτινάξει όλα εδώ και καιρό.

* geta
*² (εννοεί το πουλί, όχι το μηχάνημα)

Translation of Shura no Hana

(This post was written by my brother, I’m just posting it here)

A while ago, an acquaintance and I were talking about our hobbies; I mentioned to him that I’ve translated Japanese comics in the past. He recalled a funny video he’d seen, titled “to krasaki tou tsou”, which consisted of misheard Greek lyrics of Kaji Meiko’s “Shura no Hana”. So he said “Dude, why not translate that one then?”. I thought “Challenge accepted!”

My first option, of course, was to take a look at English translations and base my own Greek translation on them. Imagine my surprise, then, when I realised that not only were the translations I found incorrect, even the transcriptions to romaji had mistakes.

With that in mind, I decided to translate it from scratch to both languages. The English version can be found below.

In a dead morning, the snow falls burying everything
All that’s heard is the howls of stray dogs and the creaks of my clogs*.
I walk whilst contemplating the weight of karma
A bull’s-eye-pattern umbrella embraces the darkness
I walk on the way of life, as a woman that has long since thrown her tears away

Atop the river that snakes around, my journey’s light fades away
The frozen crane*² sits still while the wind and rain howl.
The frozen water surface reflects unkempt hair
A bull’s-eye-pattern umbrella hides even my tears
I walk on the way of revenge, as a woman that has long since thrown her heart away

Honour and sentiment, tears and dreams,
yesterday and tomorrow, all empty words.
As a woman that has abandoned her body to the river of revenge,
I’ve long since thrown them all away.

* geta
*² (the bird, not the machine)

Bug of the day: 2018-07-12

I was alerted that there was a bug with the new build: Audio/video data didn’t seem to be flowing into the pipeline. Okay, no big deal, let’s enable logging and see where the data goes, right?… Except that enabling logging made the bug go away. The default debug level is *:3, which means “3 everywhere”. If I enabled “*:3,GST_PADS:6” (which means 6 in GST_PADS, 3 everywhere else), the bug disappeared. Same with “*:3,GST_EVENT:6”, which is really a lightweight logging change in our pipeline. Hmm, a tricky race condition. Then I made a typo and enabled something equivalent to “*:3,non_existent_category:6” for logging – which means “have a debug level of 3 everywhere, except in that category which doesn’t exist anyway, where it will be 6”. The bug was not present. We had pretty much the definition of a Heisenbug, where if we as much as sneezed next to the logs, the bug disappeared. How on Earth is one supposed to debug this?

The next thing to try would maybe have been to roll back recent changes and see if they made a difference. Except there were no recent changes in the code. The only thing changed was something in the build system. So, depending on how the code was built, it would or wouldn’t demonstrate the bug. But the bug would show up only without any attempt at enabling any logging.

After some time spent barking up wrong trees, waiting for builds to finish, and trying to figure out the way to reduce the build waiting time, I realised that the problem was essentially the following: The driver was delivering buffers into the GStreamer element, but those buffers were apparently failing to arrive a bit further downstream. A backtrace showed me that the buffers were also not being queued up somewhere. So, if they are produced at the source, not arriving where they were supposed to arrive, and also not piling up anywhere, something in the middle must be dropping them.

I searched the code for pad probes that could be dropping the buffers, but there were none. I also looked at the elements between the source and the one that was missing the buffers, and they were all harmless. Some closer examination showed me that the buffers were even failing to exit the source element. So, something inside the source element must be secretly swallowing up the buffers.

Now, the GStreamer Decklink source element has a property that says “drop this much time when you first start up the element”. A closer look inside the logs revealed that this was indeed the culprit: it was waiting for the first (approximately) 40 hours since start-up until it would start outputting buffers. I looked inside the code that sets this property: it was set to one second. Hard-coded. As in, 1000000000 nsec. I would set that 64-bit property to literally 1000000000, and it would receive a huge number on the other side.

The key word there was “literally”. I tried casting the literal 1000000000 to a 64-bit integer and it worked!

The argument was passed as part of a vararg function (variable number of arguments, NULL-terminated). Those aren’t automatically cast to 64-bit if needed. The result was that the machine was taking 1000000000 as the low 32-bit part, and the high 32-bit part would be completed by whatever garbage was in the register at the time. And that’s how you accidentally convert 1 second to 40 hours!

How to make a cat want to leave a place

TL;DR: Burn a lot of incense. The cat won’t like the smoke and will leave voluntarily.

What actually happened: One of the cats (a particularly antisocial and uncooperative one) ripped apart the mosquito net at one of the basement windows and jumped inside to give birth to her kittens. She chose a particularly fluffy place as her new nest: on top of some carpets we had stored there. It was also right next to the boiler for the petrol heating that we have.

Leaving the cat there would probably mean that the carpets would eventually need to be thrown away, not to mention that the whole basement would smell badly. What’s worse: my parents had to leave for a few days. If they left the window open, the cat would be able to jump in and out, but so would the thieves. If they closed it, she’d be so upset that she might as well rip everything apart. Furthermore, she was already quite ill-tempered before, and she became particularly aggressive when she had to protect her babies, so grabbing her to take her out by force was out of the question. We tried sprinkling some vinegar and even some bleach on safe places, in case the smell became too unpleasant, but it didn’t seem to work.

A friend suggested to light a fire and let the cat leave because of the smoke. However, a room full of carpets and a petrol boiler wasn’t exactly the safest place for a fire. I told my parents and they thought about burning some incense. That made the cat bring the babies into another room, but still inside the basement. Some progress here: we had saved the carpets from the cats, but we’d still have to leave a way in and out for her – and the thieves.

However, that other room made it much easier to burn A LOT of incense, without having to put it on places that would risk being accidentally knocked over by an upset cat, and also without risking to burn carpets below. At the end it was successful: she was successfully seen outside the basement! The kittens weren’t seen or heard outside the basement, but they also weren’t seen or heard inside, so it’s safe to assume that she finally brought them outside.

Bug of the day: 2018-04-11

I recently noticed that we had a very big memory leak. My first reaction was to run valgrind. However, valgrind was only showing me some loose change here and there, nothing serious. Just to be sure, I fixed the small things it found, but the memory usage kept increasing. What’s worse; memory usage was pretty much constant when running things on valgrind, but was skyrocketing when running things normally. Pretty much the definition of a Heisenbug.

Some time later, I noticed that the memory usage was also just fine sometimes when running stuff normally. Then I noticed that, in those cases, it would fail to create an output file, but wouldn’t bother informing you about it. That’s pretty much the definition of a Mad Girlfriend Bug.

I investigated the Mad Girlfriend Bug a bit more, and found that some element would sometimes fail to transition to PLAYING fast enough, causing it to return flushing, and therefore preventing all media data from running through the pipeline. It just happened that valgrind changed the timing so that it was always reproducible. That also explains why the memory leak wasn’t appearing on valgrind:

Roll Safe Think meme: You can't leak media data if it's just not flowing in the pipeline

In the meantime, I took a look at git diff and found what was causing the memory leak. I was using an appsink/appsrc combination. If you pull a buffer from the appsink and push it into the appsrc, it gets consumed by the appsrc. However, if you pull a sample from the appsink and push it into the appsrc, you need to unref it later. I ended up having to make it obvious in the documentation:

https://bugzilla.gnome.org/show_bug.cgi?id=795150

Categories of cyclists in Thessaloniki (and not only)

The Pirate Ship of Captain Jim

Look, the pirate ship of Captain Jim
has its side-lights backwards when it’s shipping
it will take a thousand people tripping
’cause it’s filled with hashish to the brim
(N. Kavvadias, “Armida”)

These cyclists take special care in having the red light at the front and the white one at the back. So, you see them on your lane and wonder who’s this distracted person on the wrong lane, or (even worse) you see them on the opposite lane and you think you still have time, until they suddenly appear in front of you. I won’t give any further information at the moment, because they always show up when you least expect them so you don’t have time to see much.

The Ninja

These are the cyclists who have taken up a secret mission from Splinter or something, under the condition that they absolutely must not be seen; if by any chance someone sees them, they have to pay with their lives. So, the Ninjas always cycle during the night without any lights on, they wear black clothes, and they even paint their face and hands black. If you see them, you’ll see them on the last moment, because they forgot to cover the metallic parts of their bike with a cloth and they shone a bit under the light. The advantage is that, if they indeed need to sacrifice their life, it will look like an accident.

“Look, ma, no teeth!”

Unlike the Ninja, the “No teeth” can even be detected from a satellite, due to their sigmoid trajectory. They have their hands off the handle bar and their brain busy. They might be driving in the middle lane of the avenue in the middle of the day while playing Pokémon Go, they might be taking romantic selfies while cycling in the sunset, or they might just be eating peanuts, waiting for a chance to scatter them on the asphalt when they suddenly need to brake. [Disclaimer: The aforementioned examples are not a product of my imagination, any resemblance with real events and situations is unfortunately not a coincidence]

The clown car

Last time I looked at the regulations, the maximum amount of people allowed in an average car was 5. As we very well know, however, this doesn’t prevent much larger groups of circus clowns from squeezing inside a car. Similarly, we can see groups of two or even three cyclists riding the same bike, without hesitating to go out on the avenue. Of course, they can’t go fast at all. If you need to overtake them, you should keep three times the normal distance, because it’s never a good idea to be close to a moving object of mass-ive stupidity.

The Usain Bolt

These are the cyclists who believe that the crowded promenade bike lane during peak hours is the most suitable place for them to train for Tour de France. They’re always riding a racing bike, wearing colourful lycra clothes and sunglasses, carrying a water bottle, and obtaining a speed that might even exceed the 50km/h limit for inhabited areas. The slaloms they perform in order to overtake are something that even professional skiers would admire deeply. According to eyewitnesses, when they want to overtake, they’re even capable of cycling on the water of any adjacent river or sea.

The White Rabbit

This is a variation of the Usain Bolt, but the difference is that the White Rabbits aren’t running fast for sport, but because they’re in a hurry. You’ll see them with an average bike and everyday clothing, sweating to break the sound barrier in order to make it on time for a lesson, a shop, or a train. A necessary part of their equipment is their backpack, where they’re carrying their books, purchases, or weekend clothes respectively. If by any chance you find a White Rabbit without a backpack, it means that they’re rushing for some appointment or date, without minding that their significant other will see them drenched in sweat.

The Paganist

This category thinks that, just like Hercules obtained the strength of the lion of Nemea by wearing its skin, they’ll obtain the speed of the Usain Bolt by wearing the same clothes and riding the same bike. The speed they obtain partially justify the expense, because they can go twice as fast as a five-year-old with a trike outside the bike lane. The only thing that Paganists like more than cycling is discussions about cycling. So, they gather together with like-minded people on benches, trying to figure out who managed to find the smallest quality-to-price ratio for their bike and equipment. However, they only seem to care about the denominator of the fraction; the numerator is usually not interesting to them.

The religious procession

In this case, there are at least two bikes leading the rest of their group. These bikes are always going side by side, in order to block the whole width of the bike lane, and their speed is painfully low. You see their mouths moving, and it only makes sense that what can be heard is religious chants. The rest of the group follows them, slowly following the procession and praying, except they can’t hold candles because the wind would blow them off. At the end of the procession we can see some annoyed people following at the same speed, while looking into their calendars and trying to figure out if maybe it’s St Patrick’s day today.

Honourable mention: The taxi driver’s brother

In a well-known joke (at least in Greece), the taxi driver was proudly declaring that “my brother and I always run the red lights”. At the very least, he was considerate enough to stop at the green light, in order to not crash against his brother, in case he was running the red light on the other side. A simple glance at the city streets will reveal that the family in question has numerous members, and they’re not all as prudent as the taxi driver and his brother, which means that they don’t even stop at the green light. Unfortunately, one will notice that these family members are not restricting themselves to cycling, but they’ll often be found driving a motorcycle or a car. Therefore, they won’t be included into the official categories, but we’ll limit them to an honourable mention.

Bug of the day: 2017-10-13

I made some changes near the source of the pipeline, essentially processing some metadata. The result: RTSP preview worked, but the files written were 0-sized. Well, that sounds like cleaning up your kitchen and causing your bathroom sink to start leaking, except the bathroom is at the other side of the apartment, and your shower is working just fine.

I traced it back to a few lines of code that were supposed to be a small improvement, but harmless at worst. They were about buffer meta, which is just what it says on the box: metadata that goes together with the buffer. Timecodes in GStreamer are such metadata. Another metadata is “reference timestamp”: they have some caps, which just act as a label, and a timestamp. Our buffers used to have only one reference timestamp in the past: stream time. Recently we added also hardware time. (Don’t try to understand those, they’re just labels for what it’s worth in this context). When getting the reference timestamp for a buffer, I used to just leave the caps (the label) NULL, meaning “give me whatever is first”. But now we also had hardware time, so I had replaced the NULL with a proper caps description. NULL worked, I could create files. With the proper caps description, I couldn’t. That sounds like going to a bar, getting a beer when you ask for a drink, but getting nothing when you ask for a beer. Makes no sense, does it?

I started tracing back the buffers to see where the problem started. The muxer wasn’t outputting any data. It received video, but it didn’t receive audio. Hmm, suspicious. Kind of makes sense though, the RTSP preview is video-only. We were getting notifications about audio levels though, so something must have gotten lost along the way. Weird. I tried seeing where the audio buffers get lost, and it turned out to be at the exact spot where the aforementioned lines of code were. Okay, now it’s at least really starting to make sense. That part was synchronising audio with video though. It wasn’t supposed to eat bread with buffers instead of bread with butter. So what’s wrong here?

A small investigation showed that, while audio buffers had both stream and hardware reference timestamps, video buffers had only hardware reference timestamps at that point. I tried passing NULL to see what would happen, and the log file happily reported “I found the first reference timestamp for both audio and video. It’s a hardware timestamp in both cases!”… Well, okay, at least synchronisation worked in that case, but it was a pure coincidence. I then looked at the code that added the reference timestamps, and, surprise of all surprises, it was identical for both audio and video. I even added an assertion right afterwards to make sure that they both existed for video. They did. But my video buffers genuinely had no stream reference timestamps when I was trying to get that information. So where does the meta go?

I took a look at my pipeline, to see if there were any elements that would delete the meta. There weren’t. The only element that did anything with meta was timecodestamper: it was configured to remove any existing timecode meta and insert its own. It looked a bit far-fetched to imagine that it would remove the reference timestamp meta as well, but everything else made even less sense, so I gave that a try. There was a function that checked all metas and removed the timecode one. I checked that one, but nope, they were both still intact after its call. What I did find, though, made even less sense. The function to add a timecode meta was removing one of the two other metas!

God enlightened me to talk to Sebastian at that point. He said “oh, Mathieu fixed that one a few days ago”:
https://bugzilla.gnome.org/show_bug.cgi?id=788617

It was essentially a bug in a homegrown linked list implementation, and took him a lot of head scratching and hair pulling to figure it out. It manifested at a completely different place, too. That bug had existed since something like 2012 and nobody had ever run into it in the past. Until Mathieu did, a few days ago. And now I also did. Think about the odds of that.

Bug of the day: 2017-03-30

We had a pipeline that would capture 4K video, convert it using a high quality but low compression ratio codec, mux it into a .mov container together with raw audio, and write it into a file at a networked drive. The drive is fast enough to easily make use of the 10GB Ethernet card.

One 4K session was running fine. Two 4K sessions were freezing. CPU and memory usage were fine – nothing that seemed to cause things to lock up I tried taking pipeline dumps of the freeze: the queue right before the encoder was full. It was 600ms and qtmux shouldn’t need more than that for interleaving, so why is the queue getting full? So I examined the timestamps of all incoming buffers. Turns out that audio and video were off by more than 600ms … but WHY? (Thursday night is here)

Now some information on how qtmux works: Raw audio has no timestamps. It just assumes that it’s progressing at the given rate. If there’s a gap, the data that comes afterwards just gets squeezed and essentially gets converted to earlier timestamps, messing up A/V sync in the meantime.

So, keeping in mind the above, we noticed that several chunks of audio and/or video were going missing, and that’s what was causing the queue to get full. But why would they go missing? (Friday night is here)

Examining the data as it came from the capture card, we noticed that the callback that fed us the audio/video data was just forgetting to be called. It would get called at very regular intervals, until BOOM, a spike of maybe half a second or even one second happens. What’s worse – the drivers have no internal buffering. If you take too long to process the incoming data, you lose the next frame. We thought that the scheduler might be messing things up, but setting the capture thread to real-time priority didn’t seem to have any effect either. The thread itself was doing nothing much, it was just putting the data into a buffer and leaving, so it wasn’t spending more than 1ms in there. No locking or anything else was happening. CPU usage was still OK, no queues were really full at that point, so what could be wrong? We had this wild idea: does it run out of buffers in the pool? YES! But why? (Monday night is here)

On Tuesday we examined the situation with the buffer pool, looking at pipeline graphs and processing log files. Turns out that the queue right before the file sink was getting full first. After this, the queues between encoder and muxer would also get full. Then the queue before the encoder would also get full. Everything was zero-copy until there in my tests, so that’s what caused the capture card drivers to run out of buffers in their pool. Eventually the situation downstream of the encoder would slowly stabilize, but that was too late because then we had the gap that made everything explode. But why on Earth would the file sink refuse to write any further buffers? That was on Tuesday. I tried several things but nothing really seemed to help.

Yesterday we also enlisted a colleague, and he examined the system parameters in /proc etc. He optimized a lot of things – for example, the throughput with “dd” has pretty much doubled now. However, even with all these changes, the situation became better, but didn’t really solve itself.

Today (Thursday) I examined some graphs of how long the writev() call needed versus how long it took for it to be called again. Turns out that writev() could take up to half a second! Then it would get called again almost immediately. After a quick search, I ran into this issue here: https://github.com/netty/netty/issues/5360 We had already tried increasing the buffering in the filesink earlier, but it had no real effect. Turns out that we weren’t using proper values: even the 4MB we tried is maybe smaller than one video frame. So I gave it some thought and then set the filesink buffer size to 128MB (it won’t actually write the data into the file before that). AND IT WORKS!!!