Bug of the day: 2026-03-16

Or should I say, bug combo hit of the month? This is definitely one of the hardest issues I’ve encountered in my life, if not THE hardest.

So, we have an internal component called “videoengine”. The videoengine uses GStreamer with a non-negligible number of downstream-only patches. At some point, GStreamer 1.28 came out, and we decided to update to it. It’s a procedure that takes half a day if nothing goes wrong, but there may be a merge conflict or two, and in some rare cases there may even be an upstream regression that we have to solve first. Yes, that sounds ominous, I know.

So I solved more than my fair share of merge conflicts, for one of which I had to contact Jan, who had to remember his original intention when he wrote that patch (and then took some time to actually get it merged upstream). Then a couple of automated tests were failing, so I had to cherry-pick some fixes that landed in 1.28.1 a few days later. And just as I thought “whew, this looks ready now”, I noticed that two integration tests were failing. These integration tests were known to be flaky, so we sometimes didn’t pay attention to one or two failing, but this time, tests P3 and P4 were failing consistently, and that sounded fishy.

My first reaction was to try and increase the logging level, but that made all tests start failing whenever logging was configured to any useful level. Those who have worked with GStreamer know that this is not unusual, because logging slows down a latency-sensitive live workflow. So I changed my approach and tried to reproduce the bug locally. Test P3, especially, was relatively simple: Start the videoengine, start the RTMP server, stream to it, start an RTMP source from the videoengine, and remux its output into RTMP and a couple more protocols. So I tried taking one of our pre-existing manual test scripts and adapting it. Nope, that seemed to work fine. I then tried tracing back the exact commands that were sent to the videoengine (which were fortunately logged) and replaying them. Nope, that also worked fine. So we’re talking about a case of the Three Most Terrifying Words: Possible Race Condition.

We started brainstorming ideas on what the problem could be, based on the limited amount of information we had. Except reality kept disproving all of our theories. Every time we thought “yeah, that could be it” or “this should solve it”, it turned out to be irrelevant. The only thing we knew was that GStreamer 1.26.10 was working fine, while 1.28.0 was broken. We could technically bisect, except there were too many commits in between, plus a small collection of local patches that would have to be rebased, with varying levels of merge conflicts depending on the version. Also, the failure wasn’t reproducible locally, and each CI run happens in two steps (first compile GStreamer, then the videoengine) and takes almost one hour including the integration tests.

What we did have, though, was a pipeline graph. I noticed a failure early on in the pipeline: After the source and demuxer, the H264 parser wasn’t outputting any caps. However, this happened early enough that it could conceivably be that the parser just hadn’t seen a keyframe yet, and it was followed by the test immediately failing. Hmm, could it be that it just needed some more time? We added one 3-second delay to P3 between creating the source and linking it, and another 3-second delay between enabling the sinks and checking them. No luck. Still the same failure near the source. Hmm, that’s fishy.

But, given that the failure was near the source, only a small handful of elements was involved: the RTMP source, the FLV demuxer, and the H264 parser. And those had no local patches that could be critical to this specific pipeline’s success or failure. So I started reverting them one by one to their 1.26 state and checking the CI’s (slow) result. I found the RTMP source to be irrelevant to the bug, but both the demuxer and the parser had to be reverted to their 1.26 state in order for P3 to pass. So that’s possibly a combination of factors? To make things worse, P4 still failed, but then I figured it was probably the 3-second delay and that I should tackle problems one by one.

(Hey, at least these plugins could be independently reverted to their 1.26 state and didn’t require any corresponding changes to GStreamer core or some base class!)

Having nothing else left to do, I decided to play git bisect on the CI. Each step still had to take place in two parts and took almost one hour, but at least this time there were no relevant local patches to rebase, and the number of bisect steps in each case was much more manageable. Painstaking, but the only way forward.
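Why is bisecting tractable even across a big commit range? Because the number of steps grows logarithmically with the number of commits. A quick back-of-the-envelope sketch (the range sizes here are made up for illustration, not the actual GStreamer commit counts):

```python
import math

def bisect_steps(num_commits: int) -> int:
    """Worst-case number of git-bisect test runs needed to pinpoint
    one bad commit in a range of num_commits commits."""
    return math.ceil(math.log2(num_commits)) if num_commits > 1 else 0

# With each step costing ~1 hour of CI (two-stage build plus the
# integration tests), even a few thousand commits stays feasible:
for n in (16, 300, 5000):  # hypothetical range sizes
    print(n, bisect_steps(n))
```

So a range of a few hundred commits is roughly a working day of CI time, which matches how painstaking (but doable) this felt.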

For the parser, I found the culprit to be this commit by Seungha, that made the element use the VUI framerate when the upstream framerate is 0/1. That initially sounded a bit weird, because upstream didn’t report 0/1 in our case. However, I noticed there’s a follow-up fix that basically told the parser base class to preserve the upstream buffer duration where possible. I restored the h264parse element to its 1.28 state and cherry-picked that commit to baseparse, and the test passed. Okay, it’s still only P3 and only with the FLV plugin at its 1.26 state, but still. After several weeks, finally there was some real progress on this bug.
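As I understand Seungha’s change, the framerate selection boils down to something like the following sketch. This is a simplified Python model of the behaviour described above, not the actual C code, and the function name is mine:

```python
def pick_framerate(upstream_fps, vui_fps):
    """Use the framerate signalled in the H264 VUI only when upstream
    reports the 'unknown' framerate 0/1; otherwise trust upstream.
    (Simplified model of the h264parse change described above.)"""
    return vui_fps if upstream_fps == (0, 1) else upstream_fps

print(pick_framerate((0, 1), (30, 1)))   # -> (30, 1): VUI fallback kicks in
print(pick_framerate((25, 1), (30, 1)))  # -> (25, 1): upstream framerate wins
```

In our case upstream did not report 0/1, which is why the follow-up fix (preserving the upstream buffer duration in baseparse) was the piece that actually mattered.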

Let’s focus on P4 though for a bit. My colleague Jan heard me saying it would probably pass if we added another 3-second delay in there. Then I noticed a commit where he removed said delay from P3. Result: P3 still passes, P4 still fails. Frustrating. He also noticed an End-Of-Stream event in the logs. Okay, if we’re getting EOS, the pipeline can’t really recover, but WHY would it go EOS so early on? I tried adding some logging, but of course that made P3, P4, and something like 3 more tests fail completely. Sigh.

So I decided to focus on the FLV plugin. Admittedly, my brain was fried after two days of moving commit IDs and test results back and forth, and watching a slow CI like a hawk in order to scrape off a couple of minutes here and there, thinking they add up to real numbers when done several times. So I accidentally marked a good commit as bad and ended up losing about two hours. Quite an appropriate mistake to make on literally Saint Patience’s day.

Eventually, though, I got there, and found this commit by Tarun to be the culprit. It was the one that added support for multiple audio/video streams in the FLV demuxer, and unfortunately it was a relatively large one. However, it shouldn’t have made any difference in behaviour when there’s only one audio and one video stream, just like before…

… Or should it?

So I realised that the no-more-pads behaviour was different. Before, we used to emit no-more-pads as soon as one audio and one video stream were found, or after a 6-second timeout. Now, however, the number of streams is not known in advance. Streams might appear or disappear at any moment. In the case of e.g. MPEG-TS, there’s at least a header telling us the number of streams currently available; FLV has no such header. So the approach taken was to emit no-more-pads only after a 6-second timeout. That’s important, because we use no-more-pads to unblock the source. I dug into the code of the integration tests and found a part that gave the source 5 seconds to start up and timed out otherwise. Okay, that would explain A LOT. I restored the FLV plugin to 1.28, changed that hard-coded 6 to a 2, and, what do you know, P3 now passes! I made an upstream merge request to make that hard-coded timeout configurable, which at the time of writing still has a couple more minor review comments pending.
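The interplay of those two timeouts can be boiled down to a trivial model: the demuxer only announces no-more-pads after its timeout, and the test only gives the source a fixed window to come up. The numbers are the ones from this story; the helper itself is just an illustration:

```python
def source_starts_in_time(no_more_pads_timeout: float,
                          source_startup_window: float = 5.0) -> bool:
    """The source is only unblocked once flvdemux emits no-more-pads,
    so it meets the test's startup deadline only if the demuxer's
    timeout fires first."""
    return no_more_pads_timeout < source_startup_window

print(source_starts_in_time(6.0))  # 1.28 behaviour: 6 s > 5 s, the test times out
print(source_starts_in_time(2.0))  # patched: 2 s < 5 s, the test passes
```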

What about P4 though? I had a small idea. I looked at the test code, and realised that Jan had only removed the first of the two 3-second delays. The second one, between enabling the sinks and checking them, was still there. So I added a matching one to P4 and it passed! Turns out, that EOS was a total red herring. It wasn’t the cause of our failure, but the result of it, caused by the test saying “okay, test failed, start teardown”.

However, it’s poor code hygiene to rely on hard-coded delays. By analysing the test a bit more, I found out that the timeout logic was a bit too tight and didn’t give the pipeline enough time to start up, which might explain the general flakiness of the integration tests that I mentioned at the beginning. Without getting too much into the internal logic of the videoengine, the correct course of action here was to wait for… you guessed it… no-more-pads. (Or, even better, to use the not-new-anymore Streams API.)
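The general pattern, waiting for a pipeline signal such as no-more-pads instead of sleeping for a hard-coded interval, looks roughly like this. The real videoengine code is internal, so the signal is modelled here with a plain threading.Event:

```python
import threading
import time

# Stand-in for GStreamer's no-more-pads signal; in real code this event
# would be set from the demuxer's "no-more-pads" callback.
no_more_pads = threading.Event()

def on_no_more_pads():
    no_more_pads.set()

# Simulate the demuxer announcing its pads 100 ms into startup.
threading.Timer(0.1, on_no_more_pads).start()

start = time.monotonic()
ok = no_more_pads.wait(timeout=10.0)  # the timeout is only a safety net
elapsed = time.monotonic() - start
print(ok)  # True, and we returned as soon as the signal arrived,
           # instead of always burning a fixed 3 seconds
```

The point: the wait returns immediately when the pipeline is ready, and the timeout only exists to fail fast when something is genuinely broken.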

Following that, I had to clean up my git commit history from the two repositories, because every single step of every single bisect was in there. It looked like a bloody murder scene: Extremely untidy, but everything had to be preserved for forensic reasons. Now that the culprits had been successfully caught, there was a lot of cleaning to do.

The difficulty of this issue wasn’t so much the typical “that bug was hard to find”. It was more the fact that there were three different issues making the same two tests fail, and that I couldn’t reproduce them locally and had to rely on the CI.

Bug of the day: 2024-09-26

We had a system whose setup was relatively simple: Take a number of live sources, demux them, put them through audiomixer / compositor to combine together their audio/video streams, and mux the result back into another live sink.

This was initially working fine, until some code change broke it. The symptoms were that, after enough time had passed, the memory usage would start increasing. Soon afterwards, the sources would start complaining that their buffers weren’t being consumed fast enough downstream.

A first investigation showed a lot of audio buffers gathered right before the muxer. And I mean A LOT of audio buffers. Gigabytes of them. Raw audio, but still, it’s not easy to gather gigabytes of audio! I had foolishly left an infinite queue at that spot, thinking “it’s just live sources anyway, and with just audio, what could go wrong”. Famous last words, right?
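“Not easy” deserves some numbers. Assuming a typical raw stream of 48 kHz stereo 16-bit audio (my assumption; the post doesn’t state the exact format), a gigabyte takes well over an hour to accumulate:

```python
rate_hz = 48_000        # assumed sample rate
channels = 2            # assumed stereo
bytes_per_sample = 2    # assumed 16-bit samples

bytes_per_second = rate_hz * channels * bytes_per_sample
seconds_per_gib = (1024 ** 3) / bytes_per_second
print(bytes_per_second)               # 192000 bytes/s
print(round(seconds_per_gib / 60))    # ~93 minutes per GiB
```

So “gigabytes of audio” in the queue means the system had been quietly falling behind for hours before anyone noticed.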

I tried gathering logs around the audio path. It took a while, because, as always, enabling logs made the bug seemingly disappear… until I finally reproduced it, after struggling for many hours. That wasn’t helpful either; everything seemed to be running normally. At that point I thought, one of the changes we had done was to add the audiomixer into the audio path, while there was an input-selector previously – we had decided to mix the audio streams together instead of just selecting one single input. So, I thought about reverting that to get it out of the way, thinking that would surely fix the issue. Alas, it didn’t. The issue persisted, even with the input-selector. How curious.

At that point I decided to start bisecting. That led me to a surprising result: The offending commit was one that enabled alpha compositing. It would force the compositor to output AYUV, making it convert all its sources. So… the problem was in the video path, not the audio path? That would at least explain why nothing came up in my logs earlier! Reverting the offending commit made the issue indeed disappear, but alpha compositing was also a feature that we needed, so I couldn’t just leave it at this, I had to get to the bottom of the issue.

After a little thought, I realised: What if the compositor (configured to a decent n-threads, mind you) still couldn’t keep up with the video conversion? That would mean it’s outputting buffers slower than real time. The audio path is real time, which means audio buffers would slowly start piling up before the muxer. At the same time, video buffers couldn’t be consumed as fast as they are produced, causing the sources to complain about this fact. Everything was checking out. With the detailed logs I had enabled earlier, I had essentially slowed down the audio path and the sources as well, so it was accidentally working again. Slower than real time, but working.
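The failure mode is easy to reproduce with a toy model: a real-time audio producer feeding a muxer that can only consume as fast as the slower-than-real-time video path allows. Pure illustration, no GStreamer involved; the 192 kB/s default assumes 48 kHz stereo 16-bit audio:

```python
def audio_backlog(video_speed: float, wall_seconds: float,
                  audio_rate_bps: float = 192_000.0) -> float:
    """Bytes queued up before the muxer after wall_seconds, if the
    video path runs at video_speed x real time (1.0 = keeps up)."""
    produced = wall_seconds * audio_rate_bps              # real-time source
    consumed = min(video_speed, 1.0) * wall_seconds * audio_rate_bps
    return produced - consumed

print(audio_backlog(1.0, 3600))  # compositor keeps up: no backlog
print(audio_backlog(0.9, 3600))  # just 10% too slow: ~69 MB pile up per hour
```

It also shows why the detailed logging “fixed” things: slowing the sources down effectively moved video_speed back to 1.0 relative to the (now also slower) audio path.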

But why would the compositor not keep up with the video conversion? It does process all its pads in parallel if you set n-threads to a sensible value. But there’s a caveat! Each sink pad was still single-threaded! As it turns out, you have to create a VideoConverterConfig, set its threads to a sensible number, and then tell each compositor sink pad to use that. That solved our bug.

Another solution was to use dedicated videoconvert elements before the compositor, with decoupling queues in between, to make sure they’d run in their own threads instead of the same thread as the corresponding compositor sink pads. We ended up doing both at once. The system ran stably for hours afterwards.

Creating (for real) a smart thermostat using HomeAssistant and ESPHome

So, in my previous attempt, I wanted to create a smart thermostat for my home. I got to the point where I was ready to program the ESP32… but then life caught up with me and it went onto my pile of unfinished projects. Recently, however, I found a way to do it much more easily, using the wonderful ESPHome project.

Architecture and Home Assistant installation

For starters, I wanted to plug this into a Home Assistant installation, which I placed on a Raspberry Pi. You can either flash Home Assistant OS, or start it using Docker. I really recommend Home Assistant OS if you can use it, because that way you can seamlessly install other integrations, such as LetsEncrypt or MQTT. However, I reused a Raspberry Pi that was already running Pi-hole, so I ran it using Docker. The Home Assistant website has very good instructions about that, so I’m not going to cover them. Make sure you enable Bluetooth after installing it.

I also decided to use ESPHome. This allows us to describe the code that the device will run using a YAML file. ESPHome takes care of WiFi connectivity, an access point in case it cannot connect to the configured WiFi, Home Assistant connectivity, and even encryption. There are even several templates that one can use.

ESPHome can be installed via a Home Assistant integration. However, I do NOT recommend running ESPHome on a Raspberry Pi. It will need to compile the code that it will upload to the microcontroller, so you want a decently fast computer. It also does not have to be on the same machine anyway. You will use ESPHome to flash the microcontroller, then when it gets online, Home Assistant will automatically discover it, and you don’t need ESPHome anymore… until it’s time to flash the next thing.

I decided to use a Sonoff Basic switch instead of a home-grown ESP32. A Sonoff switch has an ESP8266, together with a 220V power supply, a status LED, and a relay switch, all in one package, and it is cheaper than the materials themselves. ESPHome supports the Sonoff switch and allows us to reflash it. Note that there is even thermostat support inside ESPHome, but I decided against using it for technical reasons: The place where my thermostat has to go has suboptimal thermal insulation, so I had to use a BLE thermometer instead. The Sonoff switch does not support Bluetooth, so I had to implement the thermostat logic inside Home Assistant.

However, the Sonoff switch simply powers its output on or off. It takes Line and Neutral cabling, always lets Neutral through, and routes Line through its relay switch. My gas boiler, on the other hand, has two wires that need to be either connected or disconnected. As a result, I got an additional relay switch and connected that to the Sonoff’s output.

To recap, my Home Assistant host talks to my thermometer using Bluetooth, and to my ESPHome-powered Sonoff switch using WiFi. The Sonoff’s output is connected to a 230V relay switch, whose output goes to my gas boiler. The ESPHome host runs on my computer.

Components list

Hardware:

  • A Raspberry Pi 4 or 5, to run Home Assistant
  • A Micro SD card for the Raspberry Pi
  • A Sonoff Basic switch
  • A mains-powered relay switch
  • A 3.3V USB to TTL serial UART adapter (like CP2102 or PL2303)
  • Some jumper cables
  • A single row of header pins
  • A soldering iron
  • A BLE thermometer with a Home Assistant integration (such as RuuviTag)
  • A computer with a USB port that can run ESPHome
  • A Wi-Fi access point that can have all devices in its range
  • Two short mains cables
  • Screwdrivers, cable cutters, pliers
  • Ideally, a multimeter to check your connections

Software:

  • Home Assistant
  • ESPHome
  • Docker (if not using Home Assistant OS)
  • Chrome or Chromium, for the initial flashing

Installing ESPHome

The docker command to run ESPHome is:

docker run --network=host -v /run/dbus:/run/dbus:ro --privileged --restart=unless-stopped -e TZ=Europe/Athens -d -v $HOME/esphome_config:/config --device=/dev/ttyUSB0 --name=esphome ghcr.io/esphome/esphome

Note that you need to create a directory called esphome_config in your home directory, in order to store your config files.

After installing ESPHome, you navigate to http://localhost:6052 to see its front-end. You have to use Chrome for that (as of early March 2024), because Firefox does not yet implement the Web Serial API, which you will need, in order to flash the device for the first time.

Flashing the microcontroller

In order to connect the Sonoff switch via USB, you need to solder a pin header onto the board.

In order to connect it to the computer, you will also need a 3.3V USB to TTL Serial UART Adapter and some jumper cables. Make sure you cross Tx and Rx between the Sonoff switch and the adapter, hold down the Sonoff’s button, connect it to the computer’s USB, keep holding down the button, and release it after 5 seconds.

With the Sonoff plugged into the computer and ESPHome loaded into Chromium, click on the green “New Device” button on the bottom right of the page, and select the appropriate port (should be ttyUSB0 on most Linux distributions). I named mine ThermostatSwitch. ESPHome will then autodetect the device and install the first version of the firmware into it. For this first version, you must select “Plug into this computer”. All subsequent updates can be done over-the-air!

Reboot the board by unplugging it from USB and replugging it. Now, copy the appropriate sections from the YAML template on ESPHome’s website and paste them into the file that you get when clicking “EDIT” in the “ThermostatSwitch” section that should have appeared on the ESPHome webpage. Install it again, over WiFi this time.

Integrating the devices

The Sonoff is now ready to be deployed. Connect its input to the mains power, its output to the relay switch, and the relay switch’s output to your heater. After connecting the device and verifying that ESPHome shows it as Online, you can close the Chrome tab with ESPHome and navigate to your HomeAssistant installation using the browser of your choice. If all went well, HomeAssistant will show a notification on the bottom left that a new device has been detected. Alternatively, you can manually add the ESPHome integration into HomeAssistant and detect your device from there.

The next thing you need is a thermometer. I used RuuviTag, which is also open source. However, I do not recommend its phone app – despite being open source, it contains several trackers. The installation is trivial: Pull out the little plastic that protects the battery, wait a bit, click on the HomeAssistant notification on the bottom left. You can also select which Area (e.g. Living Room, Bedroom) it is in.

Implementing the thermostat logic

Now we have our hardware ready, and it’s time to implement the thermostat logic inside Home Assistant. From the Settings → Integrations page, click Add Integration, and select Generic Thermostat. It should navigate you to the instructions page.

Now, you need to open Home Assistant’s configuration.yaml file. If you are running it via Docker, it should be in the config directory that it asked you to create. Copy the YAML from the Generic Thermostat instructions page and paste it at the end of configuration.yaml. You will need to adapt the heater to point to your Sonoff switch’s entity ID (switch.thermostatswitch_sonoff_basic_relay in my case) and the target_sensor to point to your thermometer (sensor.ruuvitag_XXXX_temperature in my case). You can find the relevant entity IDs under Settings → Devices and Services → Entities.

After this, you need to restart Home Assistant. With Docker, you just SSH into the Home Assistant host, run docker stop homeassistant, wait for it to complete, then docker start homeassistant (or simply docker restart homeassistant). When it starts back up, your default home view will have a nice thermostat included.

Time to add some automations. We want a temperature setting for daytime, and another one for nighttime. Those can be added from Settings → Devices and Services → Helpers. Add a “Day temperature” and a “Night temperature” helper, of type Number, measured in °C.

In my case, I want day to start at 07:00 and end at 22:00. Keep this in mind.

Another thing that I want to implement is the “One hour party” mode that I have seen in another thermostat. Let’s start from this, because we will need to check for it later. From Settings → Automations and Scenes, go to Scripts. Add a script. Then add the following actions:

  • From Notifications, Send a persistent notification, that the party is starting.
  • From Climate, Set target temperature. I set this to a value slightly higher than Day temperature.
  • Type “Delay” in the action search box, and enter a delay of one hour.
  • Type “if” in the action search box, and select If-Then. If the time is after 22:00 or before 07:00: Then: From Climate, Set target temperature. This one will require the value of Night temperature. Edit it in YAML:
service: climate.set_temperature
metadata: {}
data:
  temperature: "{{ states('input_number.night_temperature') }}"
target:
  entity_id: climate.thermostat
  • Do the “else” accordingly, for the Day temperature.
  • Send a notification that the party has ended.

Double check your entity IDs and save.

Next, we should make the “Day temperature” and “Night temperature” values take immediate effect when changing them. From Settings → Automations and Scenes, select Create automation from the bottom right. Add a trigger for Entity → State, and select Day temperature. Add a condition for Time and Location → Time, with the time being between 07:00 and 22:00 (you can also make more complex schedules depending on day of week). Add another condition for Entity → State, select the Thermostat entity and the Preset state, and check that it is None. The Away preset is for when you are leaving and want to maintain the home at a relatively low temperature until you are back. Another condition to add is Entity → State. Select the party entity that you created earlier, and select its state should be Off. Then you add an action for Climate → Set target temperature. This one will require the value of Day temperature. Edit this one in YAML like before.
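The conditions of that automation boil down to a small predicate. Here is a sketch of the logic in Python (the function name and simplified states are mine, purely for illustration; Home Assistant evaluates the equivalent conditions itself):

```python
from datetime import time

def apply_day_temperature(now, preset, party_on):
    """Mirror of the automation's conditions: only apply the Day
    temperature during the day window, with no preset active and no
    party script running."""
    in_day_window = time(7, 0) <= now < time(22, 0)
    return in_day_window and preset is None and not party_on

print(apply_day_temperature(time(12, 0), None, False))    # True
print(apply_day_temperature(time(23, 0), None, False))    # False: night time
print(apply_day_temperature(time(12, 0), "away", False))  # False: preset active
print(apply_day_temperature(time(12, 0), None, True))     # False: party running
```

The Night temperature automation is the same predicate with the time window inverted.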

Double check your entity IDs and save. Create another similar automation for Night temperature.

Now let’s create our schedule. This should be simple, based on the things we did before. Add an automation with a start time of 07:00. Add a condition for the Preset of Thermostat being None, and for the Party state being Off. For the action, add the same YAML as earlier. Add a second automation for the night temperature.

If you want, you can also create more helpers for the switching times. I decided to just leave them hardcoded.

Now, you’ll need a way to run your party script. For that, you go to the Overview page of Home Assistant. Edit it from the top right, and choose Take control. Add a Button. The Entity is your party script. The Tap action should be Toggle. If you want, you can take this opportunity to further customise your home screen. Click Done when you’re ready.

Done!

Don’t forget to set up a backup system, or at least to take an image of the SD card. If you suffer a hardware failure (and the SD card is the most common culprit), you won’t have heating until you have everything restored!

I took the opportunity to do some more things with Home Assistant. I got another Sonoff switch and configured it to turn a light on at sunset and off in the evening. I installed a few more thermometers, as well as a carbon dioxide monitor. I got a robot vacuum and installed Valetudo on it, to prevent it from connecting to the cloud. Here is the final result.

Bug of the day: 2024-02-02

I had started chasing this bug already in December. A coworker of mine had reported that, with a specific input file, using a tricky maneuver, that also required a lot of other moving parts that interacted with our code, and that also involved at some point deleting and re-adding all elements from the pipeline (!), the file would stall after showing only a couple of frames.

My first thought was to try and reproduce it locally, without all the moving parts. I tried repeatedly, but failed. My colleague Jan also tried repeatedly, but failed. No matter what we did, it was all working fine. We also asked for log files, but they didn’t show any issues either. I was really stuck for a long time, because I had no idea how to chase that bug.

Eventually, Jan noticed that the videorate element was trying to bridge a huge gap: It had received only one input frame, but had duplicated several frames. However, our logs did not indicate such a gap. The videorate element is what converts the frame rate of the video between different values, and also what fixes up the (non-live) stream in case a buffer has gone missing or appears twice.

The next step was to ask for an additional log file with videorate debug information. Fortunately, my other coworker could still reproduce the issue with all the moving parts. And there I saw it:

BEGINNING prev buf 1000:04:41.104589703 new buf 1000:04:41.137956369 outgoing ts 1000:04:41.104589703
diff with prev 1000:04:41.104589703 diff with new 1000:04:41.137956369 outgoing ts 0:00:00.000000000

There it was. videorate was calculating the differences wrongly!

My first thought was to look at existing related bug reports and merge requests. I found something that looked suspiciously close to our issue:

https://gitlab.freedesktop.org/gstreamer/gstreamer/-/merge_requests/2186

If my theory was correct, it would mean that reverting https://gitlab.freedesktop.org/gstreamer/gst-plugins-base/-/merge_requests/767 would fix our issue. Except it wasn’t possible to revert, because the code had been repeatedly touched since.

I then started looking at the calculations. Something was wrong with the segment-related parts, related to how the segment base was used. I thought I had found it, made a patch that fixed this specific issue, put it up on the CI, and it ended up breaking other things.

For clarity: Imagine that you are playing back a file, playing it faster or slower, seeking back and forth, etc. The time displayed at your player’s clock corresponds to the timestamps inside the file. However, the parts that know when to display each frame, or when to play each sound, have a different time, according to how you manipulate the playback. For simplification, let’s say that the time displayed at your player’s progress bar is the buffer timestamp, and the time when the frames/sounds are displayed is the running time. So, when you move the slider back and forth in your player application, the buffer timestamp will move back and forth correspondingly, but the running time will keep increasing. In order to convert from one to the other, you need information from what we call a segment.
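In GStreamer terms, the conversion that the segment enables looks roughly like this. It is a deliberate simplification of gst_segment_to_running_time for forward playback (the real function also handles the segment stop, offset, negative rates, and clipping):

```python
def to_running_time(buffer_ts, segment_start, segment_base, rate=1.0):
    """Simplified: running time = (buffer timestamp - segment start)
    scaled by the playback rate, plus the base, i.e. the running time
    already accumulated before this segment started."""
    return (buffer_ts - segment_start) / abs(rate) + segment_base

# Seek back from t=10s to t=2s: the buffer timestamps jump backwards,
# but the new segment's base carries the 10s already played, so the
# running time keeps increasing monotonically.
print(to_running_time(10.0, 0.0, 0.0))  # before the seek  -> 10.0
print(to_running_time(2.0, 2.0, 10.0))  # first buffer after -> 10.0
print(to_running_time(3.0, 2.0, 10.0))  # keeps increasing -> 11.0
```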

Seeing as my first patch didn’t work, I thought I’d convert all calculations to running time. Some values used were clearly buffer timestamps, but some others were something strange that was neither a buffer timestamp, nor a running time. So I would store running times for reference, and would convert back and forth when necessary. I had that patch almost ready. Almost. It fixed our initial test case and made everything but one integration test pass. It was almost ready to be merged! I was only thinking about how to handle the corner cases where the calculations end up with a negative timestamp.

And then came that fatal evening. I was visiting the office in Cologne, and took the opportunity to have some nice authentic ramen in Düsseldorf. Jan joined us. So, as Jan and I were walking on the street on the way to the ramen, we started discussing that corner case. Then he told me “Why do we even use running time, anyway? We don’t need it, do we?”.

I didn’t want to believe it at first. I went back to work on the next day, looked at the running times I was using… and I was only using them in order to convert to buffer timestamps and back. So it’s indeed not necessary. We really only need buffer timestamps.

I then looked at the part that I should have looked at first. Why do we even need the segment base in those calculations?

guint64 base;  /* the running time (plus elapsed time, see offset) of the segment start */

So the segment base is only needed if we’re converting to running time. Which we don’t have to do. The calculations were just adding and subtracting it back and forth, but doing so wrongly, and that led to our bug. It makes you wonder how it ever worked. But then again, there had been several reports of videorate spitting out negative timestamps, for instance.
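The reason dropping the base is safe is simple arithmetic: videorate only ever needs differences between timestamps, and a constant base cancels out of a difference, as long as you stay in one domain consistently. A toy demonstration (illustrative numbers, not real pipeline values):

```python
base = 42.0  # an arbitrary segment base

ts_prev, ts_new = 1.0, 1.033

# Difference computed on buffer timestamps:
d_buf = ts_new - ts_prev
# Difference computed on running times (same base added to both):
d_run = (ts_new + base) - (ts_prev + base)
print(abs(d_buf - d_run) < 1e-9)  # True: the base contributes nothing

# The bug class: mixing domains, e.g. converting one operand but not
# the other, which is effectively what the broken calculations did.
d_mixed = (ts_new + base) - ts_prev
print(abs(d_mixed - d_buf) < 1e-9)  # False: off by the whole base
```

Once every value is a plain buffer timestamp, the base has nowhere left to sneak in.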

I then started removing all uses of the segment base. And fixing a couple of other bugs along the way. And, in the end, not only did all tests on the CI pass, but that even made the tests Sebastian added in his merge request pass (with a little modification on one test), so I integrated them into my code.

As for why the segment base was used? I looked at git blame, and apparently it has been there all the way back since the element was ported from 0.10, using the old GstSegment API, more than 10 years ago. I assume that it once worked differently, and back then it made sense to do it that way. At some point it stopped making sense, but nobody noticed before!

Here is the final merge request:

https://gitlab.freedesktop.org/gstreamer/gstreamer/-/merge_requests/6031

Bug of the day: 2022-02-28

There is this deinterlacing algorithm called yadif. It produces very good results, but is also very CPU-expensive. For this reason, it also has some ASM optimisations. GStreamer had support for yadif but not for the ASM optimisations, so I had previously adapted them from FFmpeg, which had both. Now, two years later, I had found a small bug: In some video, which involved static graphics overlaid upon fast-moving content, the lines were jumping up and down (“bobbing”). However, that happened only with GStreamer’s ASM optimisations. Not with the plain C code. Not with FFmpeg’s ASM optimisations.

Having to look over some ASM code that you wrote two years ago already requires a significant amount of bravery. To make things easier, though, I had an equivalent C implementation, a reference ASM code that worked fine, and a lot of comments in my own code. Or at least I thought that these things would make it easier.

So I started looking at the implementation, remembering what I had done, checking if it does what the comments say it does, checking if the end result corresponds to the C implementation, and also comparing it with FFmpeg’s code. I started double-checking and triple-checking everything. It was all correct.

All my differences were in entry points. FFmpeg has fewer input parameters and then calculates some intermediate values, such as the value of the same pixel on the previous line, or on the previous frame. However, GStreamer calculates those in the deinterlacer base class, so I had more input parameters. I thought that maybe one of those was used improperly, but they were all correct. So I was really at a loss.

What do we do when we have no idea what’s wrong with the code? Change the functionality of random parts and see what breaks. By doing this, I slowly figured out that the value of some variable was too high in some special case (the diff parameter when mode is 0). However, there are many steps involved in that calculation, and nothing makes a lot of sense until the very end. Just to make sure, I quadruple-checked that part of the code. Nope, correct. I thought, maybe I’m accidentally messing with a register that’s needed later. Nope, I’m not.

At this point I decided to take a step back and look at the more inconspicuous parts of the code. While looking around, I noticed this macro:

%macro LOAD 2
    movh      %1, %2   ; load 8 bytes into the low half of %1
    punpcklbw %1, m7   ; interleave with m7 to widen each byte to a word
%endmacro

This loads a value into a register and then interleaves it with zeroes, widening each 8-bit pixel into a 16-bit word. We are adding pixel values, so we need to make sure that a carry doesn’t accidentally spill over into neighbouring pixels. This makes the assumption that m7 is zero. Indeed, I remember it being set to zero early on. But let’s make sure…

LOAD         m7, [bzeroq]

Ah-HA! The end effect would be that the value of m7 is interleaved with itself instead of with zeroes. That was indeed only in the mode==0 special case, and directly influenced the result of the diff parameter.
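This is not SIMD code, but a quick back-of-the-envelope sketch (plain shell arithmetic on a single byte) of how big the difference is between interleaving with zero and interleaving with the value itself:

```shell
# punpcklbw widens each byte to 16 bits by interleaving it with a second
# register. Interleaved with zero, a byte keeps its value; interleaved
# with itself, the byte also lands in the high half of the word.
b=0x80                          # an example pixel value
with_zero=$(( b ))              # 0x0080 = 128, the intended result
with_self=$(( (b << 8) | b ))   # 0x8080 = 32896, what the bug produced
echo "$with_zero $with_self"
```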

bzeroq is one of those entry-point parameters that FFmpeg calculates in the ASM code but GStreamer receives as input. FFmpeg calculates that value early on, uses it once, then puts it on the stack for later. I had decided there was no need to go via the stack when I could load it directly. Turns out… I can’t.

Going via the stack, like FFmpeg does, solved my bug.

https://gitlab.freedesktop.org/gstreamer/gstreamer/-/merge_requests/1816/diffs?commit_id=fbeecb9e5567b5822e93ea50fd28f820cf7bbdaf

Creating a smart thermostat using ESP32+openHAB+Mosquitto+Apache+letsencrypt

I wanted a smart thermostat for my village home, so that we can turn off the heating when we leave and turn it on a few hours before we’re due to arrive. Unfortunately, I had a number of requirements, which between them excluded almost every product currently on the market:

  1. Encryption
  2. No admin/admin vulnerabilities
  3. Something that respects the GDPR
  4. Something ideally open source, or at least that respects the GPL
  5. Open protocol, so I don’t need to pollute my phone with yet another fishy app
  6. Something that doesn’t depend on third-party servers; otherwise I risk ending up with an expensive paperweight at a random company’s whim

The excellent folks at https://hestiapi.com/ have a product that looks like it checks all my boxes. Plus, it’s apparently a company from Athens! However, I eventually decided on a DIY solution based on one of my pre-existing servers. I’d install openHAB and MQTT on my server, and have an ESP32 on-site as the controller. The advantage of using openHAB and MQTT on a pre-existing server, as opposed to a Raspberry Pi on-site, is that I don’t need to try to speak with a real person on my ISP’s tech support in order to convince them to give me a real IP address.

This blog post will cover the installation of openHAB and MQTT on an existing Apache web server using letsencrypt.

For the following instructions, assume a root shell on the server.

First of all, I installed mosquitto on my Debian server:

apt install mosquitto mosquitto-clients

Then I edited /etc/mosquitto/mosquitto.conf to make it work with a username/password and also my existing letsencrypt certificates, by adding these lines at the bottom:

tls_version tlsv1.2
listener 8883
allow_anonymous false
password_file /etc/mosquitto/users
certfile /etc/mosquitto/certs/fullchain.pem
keyfile /etc/mosquitto/certs/privkey.pem
cafile /etc/ssl/certs/DST_Root_CA_X3.pem

Now it’s referencing some files that don’t exist yet. First of all, we need to remove the existing /etc/mosquitto/certs directory and symlink it to our /etc/letsencrypt/live/example.com directory. We also need to give the mosquitto user access to the certificates by adding it to the ssl-cert group. Feel free to ignore the README that says that the directory must be readable only by the mosquitto user – having it part of the ssl-cert group works just fine.
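Assuming the letsencrypt certificates live under /etc/letsencrypt/live/example.com, as in the rest of this post, the steps above look roughly like this (a sketch, run as root):

```shell
# Replace the stock certs directory with a symlink to letsencrypt's live
# directory, and let mosquitto read the key via the ssl-cert group.
rm -r /etc/mosquitto/certs
ln -s /etc/letsencrypt/live/example.com /etc/mosquitto/certs
adduser mosquitto ssl-cert
```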

We also need to create the /etc/mosquitto/users file. Initially, we fill it with a plain-text list of usernames and passwords, one pair per line, with a colon between username and password. Example:

jimmy:password
admin:letmein

We then replace the plain-text passwords with hashes using this command:

mosquitto_passwd -U /etc/mosquitto/users

Restart the mosquitto service:

/etc/init.d/mosquitto restart

And this part is ready. Next, we install openHAB. I installed the testing distribution:

wget -qO - 'https://openhab.jfrog.io/artifactory/api/gpg/key/public' | apt-key add -
echo 'deb https://openhab.jfrog.io/artifactory/openhab-linuxpkg testing main' | tee /etc/apt/sources.list.d/openhab.list
apt update
apt install openhab openhab-addons openjdk-11-jre

Now, openHAB runs its own web server on port 8080, and 8443 for SSL using self-signed certificates. We do not want to expose port 8080 to the public. Also, for SSL it’s using certificates in a different format than letsencrypt’s default, so we would theoretically need to convert the certificates every two months and restart the openHAB server. It’s easier to configure Apache to do a reverse proxy on a different port than the default 443, which we use for our own stuff. The example that I had found online uses port 444 instead, but Firefox complains that this address is restricted. So let’s use port 1443 instead:

<VirtualHost *:1443>
        ServerName example.com
        SSLEngine on
        SSLCertificateFile /etc/letsencrypt/live/example.com/fullchain.pem
        SSLCertificateKeyFile /etc/letsencrypt/live/example.com/privkey.pem
        Header set Set-Cookie "X-OPENHAB-AUTH-HEADER=1"
        ProxyPreserveHost On
        ProxyPass / http://127.0.0.1:8080/
        ProxyPassReverse / http://127.0.0.1:8080/
        RequestHeader set X-Forwarded-Proto "https" env=HTTPS
        Header add Authorization ""
        RequestHeader unset Authorization
        ErrorLog ${APACHE_LOG_DIR}/openhab_error.log
        CustomLog ${APACHE_LOG_DIR}/openhab_access.log combined
        <Location />
                AuthType Basic
                AuthName "example.com 1443 "
                AuthUserFile /etc/openhab/.passwd
                Require valid-user
        </Location>
</VirtualHost>

Add the necessary Listen 1443 to /etc/apache2/ports.conf (NameVirtualHost is obsolete on Apache 2.4 and can be omitted) and you’re ready.
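The vhost above also relies on a handful of Apache modules that aren’t enabled by default on Debian. The exact list is my guess based on the directives used, but something along these lines should cover it:

```shell
# Enable SSL, the reverse proxy, and header manipulation, then restart.
a2enmod ssl proxy proxy_http headers
systemctl restart apache2
```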

You will also notice that we’re password-protecting the webpage. We’ll explain the reason in a while. For now, just create the file in question:

htpasswd -c /etc/openhab/.passwd jimmy

and enter the password in the prompt.

After this is done, restart Apache, point your browser towards https://example.com:1443 and create the administrator’s username and password. You will also be prompted to install the MQTT module.

After logging in as administrator, go to Settings -> (under System Services) API Security, click “Show advanced”, and enable “Allow Basic Authentication” in order for the Android app to work. (I’m not 100% sure that this step is necessary, in fact)

Note: DO NOT disable “Implicit User Role”, as the Android app will break. It does ask for a username and password, but I think those are used for Apache’s authentication instead. I had initially tried to disable Apache’s authentication and also disable “Implicit User Role”, thinking that already gives me proper access control. The Android app failed spectacularly.

Now, let’s add a dummy thermostat. Go to Settings -> Things and click the Plus button to create a new Thing. From MQTT Binding, select MQTT Broker. Enter your example.com hostname (ideally not 127.0.0.1, otherwise certificate verification will fail), set the port to 8883 explicitly even though it’s the default, provide the username and password you configured for mosquitto, and enable Secure Connection. Your broker should show up as Online. To prevent it from breaking at every letsencrypt renewal, disable Certificate Pinning and Public Key Pinning, and clear their hashes.

For now, let’s add a dummy On/Off switch. Go back to Things and add a new Generic MQTT Thing. Give it a name, select the MQTT Broker you added earlier, and then go to Channels. Add a Channel of the On/Off Switch type. Give it a name and set an MQTT State Topic, for instance thermostat/status. Leave the Command Topic empty for now; it can be a read-only switch. Its Custom On value can be 1 and its Custom Off value can be 0. It should also show up as Online.

Go back to Settings and click Items. Add an item for the switch you just added and select its channel. Let’s send a dummy command to turn it on:

mosquitto_pub --insecure -u jimmy -d -h example.com -p 8883 -t thermostat/status -m 1 -P password

It should show up as ON on the openHAB GUI. Change 1 to 0 to turn it off.

If you want to check whether the command arrived at the Mosquitto server itself, you can run a listener:

mosquitto_sub --insecure -u jimmy -d -h example.com -p 8883 -t thermostat/status -P password

While it’s running, it should show you any updates that it catches.

Note that I used the --insecure switch in both commands. I couldn’t get certificate verification to work here, but it doesn’t matter because it’s running on the host itself.

You can also install the openHAB Android client and configure it with the https://example.com:1443 remote server and your configured username and password. It will show an empty layout, since we haven’t configured our smart home’s layout yet. That will be covered in Part 2, together with the actual thermostat’s ESP32 implementation.

Bug of the day: 2021-07-05

I was updating the code to the latest version of the GStreamer Rust bindings. I ended up touching A LOT of parts in almost every file, so the diff was huge. The biggest culprit was a change in the timestamps API, and we do a lot of things with timestamps, so everything related needed to be updated.

After finally getting everything to compile, I tried running the automated tests to see what went wrong there. All tests passed, apart from one.

This was a bit tricky. It’s not like half of them were failing, which would mean that I had missed something fundamental. It’s also not like they were all passing, which would mean that everything was fine. It was just one test, and it was timing out. A timeout means “I had to do a series of tasks in the background and they’re still not done”, so it doesn’t exactly hint at where the issue lies.

Fortunately we have an auto-generated API schema file, which describes all commands, with their parameters, types, and default values. I had a diff in that file and had initially not paid attention to it. So I looked at the diff and saw the problem.

I had missed one single M.

So, instead of having the code buffer for 125 milliseconds by default, it was buffering for 125 seconds.

That’s the kind of bug that the compiler wouldn’t possibly catch. I mean, “buffer for 125 seconds”, looks legitimate at a first glance, doesn’t it?
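Converted to nanoseconds (which is what GStreamer clock times use internally), the missing M is a factor-of-1000 mistake. A trivial sketch in shell arithmetic:

```shell
# "125" interpreted with the wrong unit is three orders of magnitude off.
ms_to_ns=$(( 125 * 1000 * 1000 ))        # intended: 125 ms in ns
s_to_ns=$(( 125 * 1000 * 1000 * 1000 ))  # what the code did: 125 s in ns
echo "$ms_to_ns $s_to_ns"
```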

Debian Linux on Chuwi Hi10 X

TL;DR: Hardware-wise, everything works nicely apart from the accelerometer (fixed, see below). GNOME turned out to be the most touch-friendly desktop environment. Installation process was annoying. A few UI papercuts, but nothing major.

Introduction

I bought the Chuwi Hi10 X tablet the other day. It’s a nice, affordable tablet with a detachable keyboard, and it feels sturdy enough. I booted Windows exactly once, to make sure all peripherals worked, in case I had to return it. After that was done, I decided to install Linux on it.

My only complaint about the hardware is that the charger has a USB-C interface but doesn’t charge my phone. There are two USB-C ports on the device and two USB-A ports on the keyboard, which is already more than most modern laptops offer. I love this.

Installation

So, I downloaded the Debian testing installer, the version enhanced with non-free firmware, and started it.

First problem: everything was rotated. I had the tablet docked into the keyboard, looking at it in landscape mode. Everything in the installer was in portrait mode, and I couldn’t find a way to rotate it. That’s annoying, for sure, but I could put up with it until the end of the installation.

Second problem: The WiFi card worked too unreliably to be able to connect to my network. Fortunately I had a USB to Ethernet adaptor lying next to me.

Third problem: The touch screen didn’t work, so I had to make do with tilting my head to look at the screen and using the touchpad for clicks. I ended up forgetting about the mouse and using only the keyboard.

Fortunately, the second and third problem disappeared after I booted into my installed system, and all desktop environments allow you to rotate the screen. EDIT: I realised that the accelerometer works with the mxc4005 kernel module, which isn’t built by default on Debian. Should work out of the box on Ubuntu, but I also reported a Debian bug for it to be built next time.

Desktop environments: LxQt

My first choice was LxQt – I wanted something lightweight.

The first thing I noticed was that I could rotate the screen, but the touchscreen input wasn’t rotated accordingly. I worked around it by modifying the udev trick found here and adding it to my Autostart, so it would automatically rotate the screen on each login.

Next thing: it was time to get rid of my temporary cable connection and see if the WiFi works. There was no front-end for that. Most online tutorials will tell you to just install wicd, which I wouldn’t have really minded, except that it’s unavailable in both testing and unstable. In the end, I solved this by manually installing nm-tray on top. I did report a Debian bug to make nm-tray a dependency of the metapackage.

My next target was to lock the device using the power button (Android much?). No such luck. LxQt instantly shuts down the device when the power button is pressed, no confirmation, no way to override this. I tried many things and couldn’t get it to work. This post suggests using some GNOME tricks, but I didn’t even have the GNOME dependencies installed at the time.

I then installed an on-screen keyboard (Florence). It worked well enough, but… not on the screen saver. XScreensaver didn’t seem to support using an on-screen keyboard, so I had no way to unlock my tablet without the physical keyboard.

At this point, I was pretty much done with LxQt and tried XFCE instead.

Desktop environments: XFCE

First, the power button. It doesn’t instantly shut down the tablet, at least. It doesn’t do anything useful either. It’s mapped to doing exactly nothing. This is an improvement, in the sense that an accidental press of the button is harmless, but still not exactly what I needed.

Next, the on-screen keyboard on the screensaver. I somehow ended up using XScreensaver again, which apparently shouldn’t have been the case – XFCE has its own screensaver with support for on-screen keyboards. But I only found out about this when it was too late.

Another major annoyance with XFCE is that you can’t even navigate a menu, such as the start menu, with the touchscreen. In order to go to a submenu, you have to keep your finger over the menu item. The moment you let go of your finger, the submenu disappears. That makes it impossible to select anything on the submenu.

I decided to not bother with XFCE anymore and went to my usual preferred desktop environment, KDE.

Desktop environments: KDE

Ahh, a breath of fresh air! I had seen a screenshot of its on-screen keyboard on the lock screen before I installed it. I then proceeded to remap the power button to “Lock screen”. This is wonderful.

However… how do we actually enable the on-screen keyboard? I went through a couple of options, didn’t find it, asked the internet… and found out that Wayland has what I believe to be a killer feature: Keyboard auto pop-up!

However, Wayland support for KDE is still unfinished, so I finally decided to switch to GNOME.

Desktop environments: GNOME (the winner!)

I installed GNOME and it brought Wayland with it. I was impressed to see how touch-optimised everything was. The on-screen keyboard worked nicely, out of the box, including on the lock screen. All buttons were big enough for me to not need to aim like a hawk. Wonderful!

Now, Wayland meant that I couldn’t bring my screen rotation script. I went to GNOME’s settings, rotated the screen, and that worked quite nicely. It even remembered this setting when logging out and back in … but not for the touchscreen. It registers my touches at the rotated coordinates after logging back in, so after each login I have to rotate the screen to Portrait and back to Landscape. This is the biggest issue that I have with GNOME, but it still feels better than the other desktop environments overall. EDIT: It’s already fixed in git, thanks a lot garnacho! It’s also not an issue with the accelerometer module enabled.

Next thing to try: The Power Button. I could remap it, but didn’t find the option to lock the screen there. I clicked the next best thing, which was “Do nothing”. I then went to Keyboard Shortcuts and tried remapping the power button to “Lock screen”. This only works intermittently, but at least I have an easily accessible option to lock the screen without it. I think it has something to do with some tablet auto-detection code, which turns out to be flimsy, and the button defaulting to lock the screen on tablets. In any case, that’s another papercut that needs fixing. I had a short chat with some nice folks on #gnome-hackers about it, it looks like they are aware that their tablet detection code needs to be worked on, so I didn’t annoy them further.

I was then happy enough to start adding input methods. I had set up the system in Greek, because that covers things like keyboard input and timezone in one go. I went to add Japanese input. GNOME comes with ibus integrated, so I just have to install and enable ibus-mozc, right?… Wrong. Somehow it ended up detecting Greek input instead of romaji, which then couldn’t be converted to hiragana because… it’s Greek. The only way around it was to switch my system back to English, which I had meant to do from the beginning, and remove the Greek input from the keyboard. Hmm, still not good enough. I tried anthy instead of mozc, which is clearly inferior, but at least it worked. I then tried reporting the bug, so I brought mozc back to test it and… it works?… WAT. ¯\_(ツ)_/¯ The first law of engineering says “if it works, don’t touch it”. I could theoretically set up another system and try it there, but that would take too much time, and I’m not sure I have enough right now.

I also couldn’t find any on-screen keyboards that support Japanese. As of now, if I want to type in Japanese, I need to either have the external keyboard plugged in, or go to one of those online input systems. I tried inputting Japanese again now, using mozc romaji input and the on-screen keyboard, and that worked fine. Hiragana input would have been more convenient but it was showing the wrong labels, so I reported it.

Lastly, Firefox has to be started with env MOZ_USE_XINPUT2=1 in order for touch-based scrolling to work. I modified the firefox.desktop file and added a launcher in /usr/local/bin.
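For reference, the launcher can be as simple as a one-line wrapper script (a sketch; the path to the real binary assumes a default Debian install):

```shell
#!/bin/sh
# Hypothetical /usr/local/bin/firefox: enable XInput2 so touch-based
# scrolling works, then hand off to the real binary.
MOZ_USE_XINPUT2=1 exec /usr/bin/firefox "$@"
```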

Conclusions

The tablet itself is wonderful. Linux, on the other hand, isn’t quite ready for touch-based devices yet. GNOME has some optimisations in place, but several papercuts still need to be worked on. KDE also worked decently in the short time that I tried it, but it really needs Wayland support in this regard. The respective teams are actively working on these issues in both desktop environments, so I’m optimistic about the future. XFCE and LxQt, on the other hand, are still barely usable with a touch screen, so I wouldn’t recommend them yet.

Bug of the day: 2019-07-25

This was actually Sebastian’s bug. He was having a crash caused by an invalid timecode. Now, timecodes are just hours:minutes:seconds:frames labels for each video frame. His code was ending up with a timecode of something like 45:87:84:31. Yes, that’s 87 minutes and 84 seconds. Also frame 31 at 30 fps, where frame numbers only go up to 29.
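To see why that label can’t exist: minutes and seconds must stay below 60, and the frame number must stay below the frame rate. A quick sketch of that check (plain shell, hypothetical variable names):

```shell
tc="45:87:84:31"; fps=30
IFS=: read -r h m s f <<< "$tc"
# minutes/seconds must be < 60, frame number must be < fps
if [ "$m" -lt 60 ] && [ "$s" -lt 60 ] && [ "$f" -lt "$fps" ]; then
  verdict=valid
else
  verdict=invalid
fi
echo "$verdict"
```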

He wondered where such a wildly invalid timecode might come from, and then noticed that he had accidentally left LTC input turned on. LTC takes an audio signal as input and converts that signal into a timecode. Since it was turned on accidentally, with no microphone connected, it was picking up the music he was playing through the “Monitor of sound card” source.

He tried reproducing it but failed. Then I looked at him and suggested that he try the previous song again… And, kaboom! That particular song had the ability to generate crazy timecodes.

The fix is here: https://gitlab.freedesktop.org/gstreamer/gst-plugins-bad/commit/aafda1c76f4089505e16b6128f8b80ab316ab2f0

Translation of Shura no Hana

(This post was written by my brother, I’m just posting it here)

A while ago, an acquaintance and I were talking about our hobbies; I mentioned to him that I’ve translated Japanese comics in the past. He recalled a funny video he’d seen, titled “to krasaki tou tsou”, which consisted of misheard Greek lyrics of Kaji Meiko’s “Shura no Hana”. So he said “Dude, why not translate that one then?”. I thought “Challenge accepted!”

My first option, of course, was to take a look at English translations and base my own Greek translation on them. Imagine my surprise, then, when I realised that not only were the translations I found incorrect, even the transcriptions to romaji had mistakes.

With that in mind, I decided to translate it from scratch to both languages. The English version can be found below.

In a dead morning, the snow falls burying everything
All that’s heard is the howls of stray dogs and the creaks of my clogs*.
I walk whilst contemplating the weight of karma
A bull’s-eye-pattern umbrella embraces the darkness
I walk on the way of life, as a woman that has long since thrown her tears away

Atop the river that snakes around, my journey’s light fades away
The frozen crane*² sits still while the wind and rain howl.
The frozen water surface reflects unkempt hair
A bull’s-eye-pattern umbrella hides even my tears
I walk on the way of revenge, as a woman that has long since thrown her heart away

Honour and sentiment, tears and dreams,
yesterday and tomorrow, all empty words.
As a woman that has abandoned her body to the river of revenge,
I’ve long since thrown them all away.

* geta
*² (the bird, not the machine)