Bug of the day: 2024-09-26

We had a system whose setup was relatively simple: Take a number of live sources, demux them, put them through audiomixer / compositor to combine their audio/video streams, and mux the result back out into another live sink.

This was initially working fine, until some code change broke it. The symptoms were that, after enough time had passed, the memory usage would start increasing. Soon afterwards, the sources would start complaining that their buffers weren’t being consumed fast enough downstream.

A first investigation showed a lot of audio buffers gathered right before the muxer. And I mean A LOT of audio buffers. Gigabytes of them. Raw audio, but still, it’s not easy to gather gigabytes of audio! I had foolishly left an infinite queue at that spot, thinking “it’s just live sources anyway, and with just audio, what could go wrong”. Famous last words, right?

I tried gathering logs around the audio path. It took a while because, as always, enabling logs made the bug seemingly disappear… until I finally reproduced it, after struggling for many hours. That wasn’t helpful either; everything seemed to be running normally. At that point I remembered that one of the changes we had made was to add an audiomixer into the audio path where there had previously been an input-selector – we had decided to mix the audio streams together instead of just selecting one single input. So I thought about reverting that to get it out of the way, sure that it would fix the issue. Alas, it didn’t. The issue persisted, even with the input-selector. How curious.

At that point I decided to start bisecting. That led me to a surprising result: The offending commit was one that enabled alpha compositing. It forced the compositor to output AYUV, making it convert all its sources. So… the problem was in the video path, not the audio path? That would at least explain why nothing came up in my logs earlier! Reverting the offending commit did indeed make the issue disappear, but alpha compositing was a feature that we needed, so I couldn’t just leave it at that; I had to get to the bottom of the issue.

After a little thought, I realised: What if the compositor (configured with a decent n-threads, mind you) still couldn’t keep up with the video conversion? That would mean it was outputting buffers slower than real time. The audio path is real time, which means audio buffers would slowly start piling up before the muxer. At the same time, video buffers couldn’t be consumed as fast as they were produced, causing the sources to complain about exactly that. Everything was checking out. With the detailed logs I had enabled earlier, I had essentially slowed down the audio path and the sources as well, so it was accidentally working again. Slower than real time, but working.

But why would the compositor not keep up with the video conversion? It does process all pads in parallel if you set n-threads to a sensible value. But there’s a caveat: the conversion on each sink pad is still single-threaded! As it turns out, you have to create a VideoConverterConfig, set its thread count to a sensible number, and then tell each compositor sink pad to use it. That solved our bug.

Another solution was to use dedicated videoconvert elements before the compositor, with decoupling queues in between, to make sure they’d run in their own threads instead of in the same thread as the corresponding compositor sink pads. We ended up doing both at once. The system ran stably for hours afterwards.

Creating (for real) a smart thermostat using HomeAssistant and ESPHome

So, in my previous attempt, I wanted to create a smart thermostat for my home. I got to the point where I was ready to program the ESP32… but then life caught up with me and it went onto my pile of unfinished projects. Recently, however, I found a way to do it much more easily, using the wonderful ESPHome project.

Architecture and Home Assistant installation

For starters, I wanted to plug this into a Home Assistant installation, which I put on a Raspberry Pi. You can either flash Home Assistant OS, or start it using Docker. I really recommend Home Assistant OS if you can, because that way you can seamlessly install other integrations, such as LetsEncrypt or MQTT. However, I reused a Raspberry Pi that was already running pihole, so I ran it using Docker. The Home Assistant website has very good instructions for that, so I’m not going to cover it here. Make sure you enable Bluetooth after installing it.

I also decided to use ESPHome. This lets us describe the code that the device will run in a YAML file. ESPHome takes care of WiFi connectivity, a fallback access point in case it cannot connect to the configured WiFi, Home Assistant connectivity, and even encryption. There are also several templates that one can use.
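
To give an idea of what such a file contains, here is a rough skeleton (a sketch rather than my exact config; the device name, credentials and keys are placeholders):

esphome:
  name: thermostatswitch

esp8266:
  board: esp01_1m  # the Sonoff Basic uses a 1MB ESP8266 module

logger:

# Connection to Home Assistant, with encryption
api:
  encryption:
    key: "GENERATED_BY_ESPHOME"

# Over-the-air updates after the first flash
# (recent ESPHome versions also want "platform: esphome" here)
ota:
  password: "AN_OTA_PASSWORD"

wifi:
  ssid: "MyNetwork"
  password: "MyWifiPassword"
  # Fallback hotspot in case the configured WiFi is unreachable
  ap:
    ssid: "ThermostatSwitch Fallback"
    password: "AnotherPassword"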

ESPHome can be installed via a Home Assistant integration. However, I do NOT recommend running ESPHome on a Raspberry Pi: it needs to compile the code that it will upload to the microcontroller, so you want a decently fast computer. It does not have to be on the same machine anyway. You will use ESPHome to flash the microcontroller; once the device comes online, Home Assistant will automatically discover it, and you don’t need ESPHome anymore… until it’s time to flash the next thing.

I decided to use a Sonoff Basic switch instead of a home-grown ESP32. A Sonoff Basic contains an ESP8266, together with a 220V power supply, a status LED, and a relay switch, all in a single package, and it is cheaper than buying the parts separately. ESPHome supports the Sonoff switch and allows us to reflash it. Note that ESPHome even has built-in thermostat support, but I decided against using it for technical reasons: The place where my thermostat has to go has suboptimal thermal insulation, so I had to put the thermometer somewhere else and use a BLE one. The Sonoff’s ESP8266 does not support Bluetooth, so I had to implement the thermostat logic inside Home Assistant instead.

However, the Sonoff switch simply switches its output on or off. It takes Line and Neutral in, always lets Neutral through, and routes Line through its relay. My gas boiler, on the other hand, just has two wires that need to be either connected or disconnected. As a result, I got an additional relay switch and connected that to the Sonoff’s output.

To recap: my Home Assistant host talks to my thermometer over Bluetooth, and to my ESPHome-powered Sonoff switch over WiFi. The Sonoff’s output is connected to a 230V relay switch, whose output goes to my gas boiler. ESPHome itself runs on my computer.

Components list

Hardware:

  • A Raspberry Pi 4 or 5, to run Home Assistant
  • A Micro SD card for the Raspberry Pi
  • A Sonoff Basic switch
  • A mains-powered relay switch
  • A 3.3V USB to TTL Serial UART Adapter (like CP2102 or PL2303)
  • Some jumper cables
  • A single row of header pins
  • A soldering iron
  • A BLE thermometer with a Home Assistant integration (such as RuuviTag)
  • A computer with a USB port that can run ESPHome
  • A Wi-Fi access point that can have all devices in its range
  • Two short mains cables
  • Screwdrivers, cable cutters, pliers
  • Ideally, a multimeter to check your connections

Software:

  • Home Assistant (Home Assistant OS, or the container image if you run it via Docker)
  • ESPHome (run via Docker on your computer)
  • Google Chrome or Chromium, for the initial flash via the Web Serial API

Installing ESPHome

The docker command to run ESPHome is:

docker run --network=host -v /run/dbus:/run/dbus:ro --privileged --restart=unless-stopped -e TZ=Europe/Athens -d -v $HOME/esphome_config:/config --device=/dev/ttyUSB0 --name=esphome ghcr.io/esphome/esphome

Note that you need to create a directory called esphome_config in your home directory beforehand, in order to store your config files.

After installing ESPHome, navigate to http://localhost:6052 to see its front-end. You have to use Chrome for that (as of early March 2024), because Firefox does not yet implement the Web Serial API, which you will need in order to flash the device for the first time.

Flashing the microcontroller

In order to connect the Sonoff switch via USB, you first need to solder a pin header onto the board.

To connect it to the computer, you will also need the 3.3V USB to TTL Serial UART adapter and some jumper cables. Make sure you cross Tx and Rx between the Sonoff switch and the adapter. Then hold down the Sonoff’s button, connect the adapter to the computer’s USB port while still holding the button down, and release it after about 5 seconds.

With the Sonoff plugged into the computer and ESPHome loaded into Chromium, click on the green “New Device” button on the bottom right of the page, and select the appropriate port (should be ttyUSB0 on most Linux distributions). I named mine ThermostatSwitch. ESPHome will then autodetect the device and install the first version of the firmware into it. For this first version, you must select “Plug into this computer”. All subsequent updates can be done over-the-air!

Reboot the board by unplugging it from USB and replugging it. Now, copy the appropriate sections from the YAML template on ESPHome’s website and paste them into the file that you get when clicking “EDIT” in the “ThermostatSwitch” section that should have appeared on the ESPHome webpage. Install it again, over WiFi this time.
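
For reference, the Sonoff Basic-specific sections end up looking roughly like this. This is a sketch based on the commonly documented Sonoff Basic pinout (relay on GPIO12, push button on GPIO0, inverted status LED on GPIO13); the entity names are my own.

switch:
  - platform: gpio
    pin: GPIO12
    id: relay
    name: "Sonoff Basic Relay"

binary_sensor:
  - platform: gpio
    pin:
      number: GPIO0
      mode:
        input: true
        pullup: true
      inverted: true
    name: "Sonoff Basic Button"
    # Let the physical button toggle the relay as well
    on_press:
      - switch.toggle: relay

status_led:
  pin:
    number: GPIO13
    inverted: true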

Integrating the devices

The Sonoff is now ready to be deployed. Connect its input to the mains power, its output to the relay switch, and the relay switch’s output to your heater. After connecting the device and verifying that ESPHome shows it as Online, you can close the Chrome tab with ESPHome and navigate to your HomeAssistant installation using the browser of your choice. If all went well, HomeAssistant will show a notification on the bottom left that a new device has been detected. Alternatively, you can manually add the ESPHome integration into HomeAssistant and detect your device from there.

The next thing you need is a thermometer. I used a RuuviTag, which is also open source. However, I do not recommend its phone app – despite being open source, it contains several trackers. The installation is trivial: Pull out the little plastic tab that protects the battery, wait a bit, then click on the Home Assistant notification on the bottom left. You can also select which Area (e.g. Living Room, Bedroom) it is in.

Implementing the thermostat logic

Now we have our hardware ready, and it’s time to implement the thermostat logic inside Home Assistant. From the Settings → Integrations page, click Add Integration, and select Generic Thermostat. It should navigate you to the instructions page.

Now you need to open Home Assistant’s configuration.yaml file. If you are running Home Assistant via Docker, it should be in the config directory that it asked you to create. Copy the YAML from the Generic Thermostat instructions page and paste it at the end of configuration.yaml. You will need to adapt the heater to point to your Sonoff switch’s entity ID (switch.thermostatswitch_sonoff_basic_relay in my case) and the target_sensor to point to your thermometer (sensor.ruuvitag_XXXX_temperature in my case). You can find the relevant entity IDs under Settings → Devices and Services → Entities.
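
The end result looks roughly like this (these are my entity IDs; the temperature limits and tolerances are just example values that you should adapt):

climate:
  - platform: generic_thermostat
    name: Thermostat
    heater: switch.thermostatswitch_sonoff_basic_relay
    target_sensor: sensor.ruuvitag_XXXX_temperature
    min_temp: 5
    max_temp: 30
    cold_tolerance: 0.3
    hot_tolerance: 0.3
    # Avoid switching the boiler on and off too frequently
    min_cycle_duration:
      minutes: 5
    away_temp: 15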

After this, you need to restart Home Assistant. For Docker, you just SSH into the Home Assistant host, run docker stop homeassistant, wait for it to complete, then docker start homeassistant. When it starts back up, your default home view will include a nice thermostat card.

Time to add some automations. We want a temperature setting for daytime, and another one for nighttime. Those can be added from Settings → Devices and Services → Helpers. Add a “Day temperature” and a “Night temperature” helper, of type Number, measured in °C.
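
I created the helpers through the UI, but if you prefer YAML, the equivalent would look roughly like this (the ranges and step are just examples):

input_number:
  day_temperature:
    name: Day temperature
    min: 5
    max: 30
    step: 0.5
    unit_of_measurement: "°C"
  night_temperature:
    name: Night temperature
    min: 5
    max: 30
    step: 0.5
    unit_of_measurement: "°C"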

In my case, I want the day to start at 07:00 and end at 22:00. Keep this in mind.

Another thing that I wanted to implement is the “One hour party” mode that I have seen in another thermostat. Let’s start with this one, because we will need to check for it later. From Settings → Automations and Scenes, go to Scripts and add a script. Then add the following actions:

  • From Notifications, Send a persistent notification, that the party is starting.
  • From Climate, Set target temperature. I set this to a value slightly higher than Day temperature.
  • Type “Delay” in the action search box, and enter a delay of one hour.
  • Type “if” in the action search box, and select If-Then. If the time is after 22:00 or before 07:00, then: From Climate, Set target temperature. This one needs the value of Night temperature, so edit it in YAML:
service: climate.set_temperature
metadata: {}
data:
  temperature: "{{ states('input_number.night_temperature') }}"
target:
  entity_id: climate.thermostat
  • Do the “else” accordingly, for the Day temperature.
  • Send a notification that the party has ended.

Double check your entity IDs and save.
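
For reference, the resulting script looks roughly like the following in YAML. This is a sketch: the script name, the notification texts and the one-degree party bump are my own choices, and the entity IDs assume the names used above.

script:
  one_hour_party:
    alias: One hour party
    sequence:
      - service: persistent_notification.create
        data:
          message: "One hour party mode is starting!"
      # Slightly higher than the day temperature
      - service: climate.set_temperature
        target:
          entity_id: climate.thermostat
        data:
          temperature: "{{ states('input_number.day_temperature') | float(20) + 1 }}"
      - delay:
          hours: 1
      # Fall back to the temperature that matches the time of day
      - if:
          - condition: time
            after: "22:00:00"
            before: "07:00:00"
        then:
          - service: climate.set_temperature
            target:
              entity_id: climate.thermostat
            data:
              temperature: "{{ states('input_number.night_temperature') }}"
        else:
          - service: climate.set_temperature
            target:
              entity_id: climate.thermostat
            data:
              temperature: "{{ states('input_number.day_temperature') }}"
      - service: persistent_notification.create
        data:
          message: "The one hour party has ended."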

Next, we should make the “Day temperature” and “Night temperature” values take immediate effect when they are changed. From Settings → Automations and Scenes, select Create automation at the bottom right. Add a trigger for Entity → State, and select Day temperature. Add a condition for Time and Location → Time, with the time being between 07:00 and 22:00 (you can also make more complex schedules depending on the day of the week). Add another condition for Entity → State, select the Thermostat entity and its Preset attribute, and check that it is None. (The Away preset is for when you are leaving and want to keep the home at a relatively low temperature until you are back.) Add one more condition for Entity → State: select the party script entity that you created earlier, and check that its state is Off. Finally, add an action for Climate → Set target temperature. This one needs the value of Day temperature, so edit it in YAML like before.
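
Put together, the automation ends up looking roughly like this in YAML (again a sketch, assuming the entity IDs from above; “none” is the value of the thermostat’s preset_mode attribute when no preset is active):

automation:
  - alias: Apply day temperature immediately
    trigger:
      - platform: state
        entity_id: input_number.day_temperature
    condition:
      - condition: time
        after: "07:00:00"
        before: "22:00:00"
      - condition: state
        entity_id: climate.thermostat
        attribute: preset_mode
        state: "none"
      - condition: state
        entity_id: script.one_hour_party
        state: "off"
    action:
      - service: climate.set_temperature
        target:
          entity_id: climate.thermostat
        data:
          temperature: "{{ states('input_number.day_temperature') }}"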

Double check your entity IDs and save. Create another similar automation for Night temperature.

Now let’s create our schedule. This should be simple, based on the things we did before. Add an automation with a Time trigger at 07:00. Add a condition for the Preset of the Thermostat being None, and for the Party script state being Off. For the action, add the same set-target-temperature YAML as earlier. Then add a second, similar automation at 22:00 for the night temperature.

If you want, you can also create more helpers for the switching times. I decided to just leave them hardcoded.

Now, you’ll need a way to run your party script. For that, you go to the Overview page of Home Assistant. Edit it from the top right, and choose Take control. Add a Button. The Entity is your party script. The Tap action should be Toggle. If you want, you can take this opportunity to further customise your home screen. Click Done when you’re ready.

Done!

Don’t forget to set up a backup system, or at least to take an image of the SD card. If you suffer a hardware failure (and the SD card is the most common culprit), you won’t have heating until everything is restored!

I took the opportunity to do some more things with Home Assistant. I got another Sonoff switch and configured it to turn a light on at sunset and off in the evening. I installed a few more thermometers, as well as a carbon dioxide monitor. I got a robot vacuum and installed Valetudo on it, to prevent it from connecting to the cloud. Here is the final result.

Bug of the day: 2024-02-02

I had started chasing this bug back in December. A coworker of mine had reported that, with a specific input file and a tricky maneuver – one that required a lot of other moving parts interacting with our code, and that at some point involved deleting and re-adding all elements of the pipeline (!) – playback would stall after showing only a couple of frames.

My first thought was to try and reproduce it locally, without all the moving parts. I tried repeatedly, but failed. My colleague Jan also tried repeatedly, but failed. No matter what we did, it was all working fine. We also asked for log files, but they didn’t show any issues either. I was really stuck for a long time, because I had no idea how to chase that bug.

Eventually, Jan noticed that the videorate element was trying to bridge a huge gap: it had received only one input frame, but had already duplicated it into several output frames. However, our logs did not indicate any such gap. The videorate element is what converts video between different frame rates, and also what fixes up a (non-live) stream in case a buffer has gone missing or appears twice.

The next step was to ask for an additional log file with videorate debug information. Fortunately, my other coworker could still reproduce the issue with all the moving parts in place. And there I saw it:

BEGINNING prev buf 1000:04:41.104589703 new buf 1000:04:41.137956369 outgoing ts 1000:04:41.104589703
diff with prev 1000:04:41.104589703 diff with new 1000:04:41.137956369 outgoing ts 0:00:00.000000000

There it was. videorate was calculating the differences wrongly!

My first thought was to look at existing related bug reports and merge requests. I found something that looked suspiciously close to our issue:

https://gitlab.freedesktop.org/gstreamer/gstreamer/-/merge_requests/2186

If my theory was correct, it would mean that reverting https://gitlab.freedesktop.org/gstreamer/gst-plugins-base/-/merge_requests/767 would fix our issue. Except it wasn’t possible to revert it cleanly, because the code had been touched repeatedly since.

I then started looking at the calculations. Something was wrong with the segment-related parts, related to how the segment base was used. I thought I had found it, made a patch that fixed this specific issue, put it up on the CI, and it ended up breaking other things.

For clarity: Imagine that you are playing back a file, playing it faster or slower, seeking back and forth, etc. The time displayed at your player’s clock corresponds to the timestamps inside the file. However, the parts that know when to display each frame, or when to play each sound, have a different time, according to how you manipulate the playback. For simplification, let’s say that the time displayed at your player’s progress bar is the buffer timestamp, and the time when the frames/sounds are displayed is the running time. So, when you move the slider back and forth in your player application, the buffer timestamp will move back and forth correspondingly, but the running time will keep increasing. In order to convert from one to the other, you need information from what we call a segment.
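
For normal forward playback, and ignoring a few extra fields, the conversion is roughly:

running_time = (buffer_timestamp - segment.start) / ABS(segment.rate) + segment.base

Note the segment base term: it’s exactly what this story revolves around.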

Seeing as my first patch didn’t work, I thought I’d convert all calculations to running time. Some values used were clearly buffer timestamps, but some others were something strange that was neither a buffer timestamp, nor a running time. So I would store running times for reference, and would convert back and forth when necessary. I had that patch almost ready. Almost. It fixed our initial test case and made everything but one integration test pass. It was almost ready to be merged! I was only thinking about how to handle the corner cases where the calculations end up with a negative timestamp.

And then came that fatal evening. I was visiting the office in Cologne, and took the opportunity to have some nice authentic ramen in Düsseldorf. Jan joined us. So, as Jan and I were walking on the street on the way to the ramen, we started discussing that corner case. Then he told me “Why do we even use running time, anyway? We don’t need it, do we?”.

I didn’t want to believe it at first. I went back to work on the next day, looked at the running times I was using… and I was only using them in order to convert to buffer timestamps and back. So it’s indeed not necessary. We really only need buffer timestamps.

I then looked at the part that I should have looked at first. Why do we even need the segment base in those calculations?

guint64 base;  /* the running time (plus elapsed time, see offset) of the segment start */

So the segment base is only needed if we’re converting to running time – which we don’t have to do. The calculations were just adding and subtracting it back and forth, but doing so wrongly, and that led to our bug. It makes you wonder how it ever worked. Then again, there had been several reports of videorate spitting out negative timestamps, for instance.

I then started removing all uses of the segment base, fixing a couple of other bugs along the way. In the end, not only did all the tests on the CI pass, but it even made the tests that Sebastian had added in his merge request pass (with a little modification to one test), so I integrated them into my code.

As for why the segment base was used in the first place: I looked at git blame, and apparently it has been there ever since the element was ported from 0.10 and the old GstSegment API, more than 10 years ago. I assume that the element once worked differently, and that back then it made sense to do it that way. At some point it stopped making sense, but nobody noticed until now!

Here is the final merge request:

https://gitlab.freedesktop.org/gstreamer/gstreamer/-/merge_requests/6031

Bug of the day: 2022-02-28

There is this deinterlacing algorithm called yadif. It gives very good results, but it is also very CPU-expensive. For this reason, it also has some ASM optimisations. GStreamer had support for yadif but not for the ASM optimisations, so I had previously adapted them from FFmpeg, which had both. Now, two years later, I had found a small bug: In some video, which involved static graphics overlaid on fast-moving content, the lines were jumping up and down (“bobbing”). However, that happened only with GStreamer’s ASM optimisations. Not with the plain C code. Not with FFmpeg’s ASM optimisations.

Having to look over some ASM code that you wrote two years ago already requires a significant amount of bravery. To make things easier, though, I had an equivalent C implementation, a reference ASM code that worked fine, and a lot of comments in my own code. Or at least I thought that these things would make it easier.

So I started looking at the implementation, remembering what I had done, checking if it does what the comments say it does, checking if the end result corresponds to the C implementation, and also comparing it with FFmpeg’s code. I started double-checking and triple-checking everything. It was all correct.

All my differences from FFmpeg were at the entry points. FFmpeg’s function has fewer input parameters and calculates some intermediate values itself, such as the value of the same pixel on the previous line, or on the previous frame. GStreamer, however, calculates those in the deinterlacer base class, so I had more input parameters. I thought that maybe one of those was being used improperly, but they were all correct. So I was really at a loss.

What do we do when we have no idea what’s wrong with the code? Change the functionality of random parts and see what breaks. By doing this, I slowly figured out that the value of some variable was too high in one special case (the diff parameter when mode is 0). However, there are many steps involved in that calculation, and nothing makes a lot of sense until the very end. Just to make sure, I quadruple-checked that part of the code. Nope, correct. I thought, maybe I’m accidentally messing with a register that’s needed later. Nope, I wasn’t.

At this point I decided to take a step back and look at the more inconspicuous parts of the code. While looking around, I noticed this macro:

%macro LOAD 2
    movh      %1, %2
    punpcklbw %1, m7
%endmacro

This loads some value into a register and then interleaves it with zeroes, widening each 8-bit pixel to 16 bits. We are adding pixel values, so we need to make sure that the carry doesn’t accidentally spill over into neighbouring pixels. This assumes that m7 is zero. Indeed, I remember it being set to zero early on. But let’s make sure…

LOAD         m7, [bzeroq]

Ah-HA! The end effect is that the value loaded into m7 gets interleaved with itself instead of with zeroes. That was indeed only in the mode==0 special case, and it directly influenced the result of the diff parameter.

bzeroq is one of those entry-point parameters that FFmpeg calculates in the ASM code but GStreamer gets as input. FFmpeg calculates that value early on, uses it once, then puts it onto the stack for later. I had decided there was no need to go via the stack when I could load it directly. Turns out… I can’t. Not into m7, at least, because the LOAD macro uses m7 as its zero register.

Going via the stack, like FFmpeg does, solved my bug.

https://gitlab.freedesktop.org/gstreamer/gstreamer/-/merge_requests/1816/diffs?commit_id=fbeecb9e5567b5822e93ea50fd28f820cf7bbdaf

Creating a smart thermostat using ESP32+openHAB+Mosquitto+Apache+letsencrypt

I wanted a smart thermostat for my village home, so that we can turn off the heating when we leave and turn it on a few hours before we’re due to arrive. Unfortunately this came with a lot of restrictions, which basically excluded almost all of the options currently on the market:

  1. Encryption
  2. No admin/admin vulnerabilities
  3. Something that respects the GDPR
  4. Something ideally open source, or at least that respects the GPL
  5. Open protocol, so I don’t need to pollute my phone with yet another fishy app
  6. Something that doesn’t depend on third party servers, otherwise I risk ending up with an expensive paperweight at a random company’s whim

The excellent folks at https://hestiapi.com/ have a product that looks like it checks all my boxes. Plus, it’s apparently a company from Athens! However, I eventually decided on a DIY solution based on one of my pre-existing servers: I’d install openHAB and MQTT on my server, and have an ESP32 on-site as the controller. The advantage of using openHAB and MQTT on a pre-existing server, as opposed to a Raspberry Pi on-site, is that I don’t need to try to reach a real person at my ISP’s tech support in order to convince them to give me a real IP address.

This blog post will cover the installation of openHAB and MQTT on an existing Apache web server using letsencrypt.

For the following instructions, assume a root shell on the server.

First of all, I installed mosquitto on my Debian server:

apt install mosquitto mosquitto-clients

Then I edited /etc/mosquitto/mosquitto.conf to make it work with a username/password and also my existing letsencrypt certificates, by adding these lines at the bottom:

tls_version tlsv1.2
listener 8883
allow_anonymous false
password_file /etc/mosquitto/users
certfile /etc/mosquitto/certs/fullchain.pem
keyfile /etc/mosquitto/certs/privkey.pem
cafile /etc/ssl/certs/DST_Root_CA_X3.pem

Now it’s referencing some files that don’t exist yet. First of all, we need to remove the existing /etc/mosquitto/certs directory and symlink it to our /etc/letsencrypt/live/example.com directory. We also need to give the mosquitto user access to the certificates by adding it to the ssl-cert group. Feel free to ignore the README that says that the directory must be readable only by the mosquitto user – having it part of the ssl-cert group works just fine.

We also need to create the /etc/mosquitto/users file. We initially populate it with a list of usernames and passwords in plain text, one user per line, with a colon between the username and the password. Example:

jimmy:password
admin:letmein

We then hash the passwords in place using this command:

mosquitto_passwd -U /etc/mosquitto/users

Restart the mosquitto service:

/etc/init.d/mosquitto restart

And this part is ready. Next, we install openHAB. I installed the testing distribution:

wget -qO - 'https://openhab.jfrog.io/artifactory/api/gpg/key/public' | apt-key add -
echo 'deb https://openhab.jfrog.io/artifactory/openhab-linuxpkg testing main' | tee /etc/apt/sources.list.d/openhab.list
apt update
apt install openhab openhab-addons openjdk-11-jre

Now, openHAB runs its own web server on port 8080, and on 8443 for SSL using self-signed certificates. We do not want to expose port 8080 to the public. Also, for SSL it uses certificates in a different format than letsencrypt’s default, so we would theoretically need to convert the certificates every two months and restart the openHAB server. It’s easier to configure Apache as a reverse proxy on a port other than the default 443, which we use for our own stuff. The example that I had found online uses port 444 instead, but Firefox complains that this address is restricted. So let’s use port 1443 instead:

<VirtualHost *:1443>
        ServerName example.com
        SSLEngine on
        SSLCertificateFile /etc/letsencrypt/live/example.com/fullchain.pem
        SSLCertificateKeyFile /etc/letsencrypt/live/example.com/privkey.pem
        Header set Set-Cookie "X-OPENHAB-AUTH-HEADER=1"
        ProxyPreserveHost On
        ProxyPass / http://127.0.0.1:8080/
        ProxyPassReverse / http://127.0.0.1:8080/
        RequestHeader set X-Forwarded-Proto "https" env=HTTPS
        Header add Authorization ""
        RequestHeader unset Authorization
        ErrorLog ${APACHE_LOG_DIR}/openhab_error.log
        CustomLog ${APACHE_LOG_DIR}/openhab_access.log combined
        <Location />
                AuthType Basic
                AuthName "example.com 1443 "
                AuthUserFile /etc/openhab/.passwd
                Require valid-user
        </Location>
</VirtualHost>

Add the necessary NameVirtualHost *:1443 and Listen 1443 to /etc/apache2/ports.conf and you’re ready.

You will also notice that we’re password-protecting the webpage. We’ll explain the reason in a while. For now, just create the file in question:

htpasswd -c /etc/openhab/.passwd jimmy

and enter the password in the prompt.

After this is done, restart Apache, point your browser towards https://example.com:1443 and create the administrator’s username and password. You will also be prompted to install the MQTT module.

After logging in as administrator, go to Settings -> (under System Services) API Security, click “Show advanced”, and enable “Allow Basic Authentication” in order for the Android app to work. (I’m not 100% sure that this step is necessary, in fact)

Note: DO NOT disable “Implicit User Role”, as the Android app will break. It does ask for a username and password, but I think those are used for Apache’s authentication instead. I had initially tried to disable Apache’s authentication and also disable “Implicit User Role”, thinking that already gives me proper access control. The Android app failed spectacularly.

Now, let’s add a dummy thermostat. Go to Settings -> Things and click the Plus button to create a new Thing. From MQTT Binding, select MQTT Broker. Add your example.com hostname (ideally not 127.0.0.1 otherwise certificate verification will fail), port 8883 even though it’s the default, provide the username and password you configured for mosquitto, and enable Secure Connection. Your broker should show up as Online. In order to prevent it from breaking at every letsencrypt update, disable Certificate Pinning and Public Key Pinning, and clear their hashes.

For now, let’s add a dummy On/Off switch. Go back to Things and add a new Generic MQTT Thing. Give it a name, select the MQTT Broker you added earlier, and then go to Channels. Add a Channel of On/Off Switch type. Give it a name and select an MQTT State Topic, for instance thermostat/status. Leave the Command Topic empty for now; it can be a read-only switch. Its Custom On value can be 1 and its Custom Off value can be 0. It should also show up as Online.

Go back to Settings and click Items. Add an item for the switch you just added and select its channel. Let’s send a dummy command to turn it on:

mosquitto_pub --insecure -u jimmy -d -h example.com -p 8883 -t thermostat/status -m 1 -P password

It should show up as ON on the openHAB GUI. Change 1 to 0 to turn it off.

If you want to check whether the command arrived at the Mosquitto server itself, you can run a listener:

mosquitto_sub --insecure -u jimmy -d -h example.com -p 8883 -t thermostat/status -P password

While it’s running, it should show you any updates that it catches.

Note that I used the --insecure switch in both commands. I couldn’t get certificate verification to work here, but it doesn’t matter because it’s running on the host itself.

You can also install the openHAB Android client and configure it with the https://example.com:1443 remote server and your configured username and password. It will show an empty layout, because we haven’t configured our smart home’s layout yet. That will be explained in Part 2, together with the actual thermostat’s ESP32 implementation.

Bug of the day: 2021-07-05

I was updating the code to the latest version of the GStreamer Rust bindings. I ended up touching A LOT of parts in almost every file, so the diff was huge. The biggest culprit was a change in the timestamps API, and we do a lot of things with timestamps, so everything related needed to be updated.

After finally getting everything to compile, I tried running the automated tests to see what went wrong there. All tests passed, apart from one.

This was a bit tricky. It’s not like half of them were failing, which would mean that I had missed something fundamental. It’s also not like they were all passing, which would mean that everything was fine. It was just one test, and it was timing out. A timeout means “I had to do a series of many tasks in the background and they’re still not done”, so it’s not exactly hinting at where the issue is.

Fortunately we have an auto-generated API schema file, which describes all commands, with their parameters, types, and default values. I had a diff in that file and had initially not paid attention to it. So I looked at the diff and saw the problem.

I had missed one single M.

So, instead of having the code buffer for 125 milliseconds by default, it was buffering for 125 seconds.

That’s the kind of bug that the compiler wouldn’t possibly catch. I mean, “buffer for 125 seconds”, looks legitimate at a first glance, doesn’t it?

Debian Linux on Chuwi Hi10 X

TL;DR: Hardware-wise, everything works nicely apart from the accelerometer (fixed, see below). GNOME turned out to be the most touch-friendly desktop environment. Installation process was annoying. A few UI papercuts, but nothing major.

Introduction

I bought the Chuwi Hi10 X tablet the other day. It’s a nice affordable tablet with a detachable keyboard, and it feels sturdy enough. I booted Windows exactly once, to make sure all peripherals work, in case I had to return it. After that was done, I decided to go install Linux on it.

My only complaint about the hardware is that the charger has a USB-C interface but doesn’t charge my phone. The tablet has two USB-C ports, plus two USB-A ports on the keyboard, which is already more ports than most modern laptops. I love this.

Installation

So, I downloaded the Debian testing installer, the version enhanced with non-free firmware, and started it.

First problem: everything was rotated. I had the tablet docked into the keyboard, looking at it in landscape mode. Everything in the installer was in portrait mode, and I couldn’t find a way to rotate it. That’s annoying, for sure, but I could put up with it until the end of the installation.

Second problem: The WiFi card worked too unreliably to be able to connect to my network. Fortunately I had a USB to Ethernet adaptor lying next to me.

Third problem: The touch screen didn’t work, so I had to make do with tilting my head to look at the screen and using the touchpad for clicks. I ended up forgetting about the mouse and using only the keyboard.

Fortunately, the second and third problems disappeared after I booted into my installed system, and all desktop environments allow you to rotate the screen. EDIT: I realised that the accelerometer works with the mxc4005 kernel module, which isn’t built by default on Debian. It should work out of the box on Ubuntu, but I also reported a Debian bug for it to be built next time.

Desktop environments: LxQt

My first choice was LxQt – I wanted something lightweight.

The first thing I noticed was that I could rotate the screen, but the touchscreen input wasn’t rotated accordingly. I worked around it by modifying the udev trick found here and adding it to my Autostart, so it would automatically rotate the screen on each login.

Next thing: it was time to get rid of my temporary cable connection and see if the WiFi worked. There was no front-end for that. Most online tutorials will tell you to just install wicd, which I wouldn’t really have minded, except that it’s unavailable in both testing and unstable. In the end I solved this by manually installing nm-tray on top. I did report a Debian bug to make nm-tray a dependency of the metapackage.

My next target was to lock the device using the power button (Android much?). No such luck. LxQt instantly shuts down the device when the power button is pressed, no confirmation, no way to override this. I tried many things and couldn’t get it to work. This post suggests using some GNOME tricks, but I didn’t even have the GNOME dependencies installed at the time.

I then installed an on-screen keyboard (Florence). It worked well enough, but… not on the screen saver. XScreensaver didn’t seem to support using an on-screen keyboard, so I had no way to unlock my tablet without the physical keyboard.

At this point, I was pretty much done with LxQt and tried XFCE instead.

Desktop environments: XFCE

First, the power button. It doesn’t instantly shut down the tablet, at least. It doesn’t do anything useful either. It’s mapped to doing exactly nothing. This is an improvement, in the sense that an accidental press of the button is harmless, but still not exactly what I needed.

Next, the on-screen keyboard on the screensaver. I somehow ended up using XScreensaver again, which apparently shouldn’t have been the case – XFCE has its own screensaver, with support for on-screen keyboards? But I only found out about that when it was too late.

Another major annoyance with XFCE is that you can’t even navigate a menu, such as the start menu, with the touchscreen. In order to go to a submenu, you have to keep your finger over the menu item. The moment you let go of your finger, the submenu disappears. That makes it impossible to select anything on the submenu.

I decided to not bother with XFCE anymore and went to my usual preferred desktop environment, KDE.

Desktop environments: KDE

Ahh, a breath of fresh air! I saw a screenshot of its on-screen keyboard on the lock screen before I installed it. I then proceeded to remap the power button to “Lock screen”. This is wonderful.

However… how do we actually enable the on-screen keyboard? I went through a couple of options, didn’t find it, asked the internet… and found out that Wayland has what I believe to be a killer feature: Keyboard auto pop-up!

However, Wayland support in KDE is still unfinished, so I finally decided to switch to GNOME.

Desktop environments: GNOME (the winner!)

I installed GNOME and it brought Wayland with it. I was impressed to see how touch-optimised everything was. The on-screen keyboard worked nicely, out of the box, including on the lock screen. All buttons were big enough for me to not need to aim like a hawk. Wonderful!

Now, Wayland meant that I couldn’t bring my screen rotation script. I went to GNOME’s settings, rotated the screen, and that worked quite nicely. It even remembered this setting when logging out and back in … but not for the touchscreen. It registers my touches at the rotated coordinates after logging back in, so after each login I have to rotate the screen to Portrait and back to Landscape. This is the biggest issue that I have with GNOME, but it still feels better than the other desktop environments overall. EDIT: It’s already fixed in git, thanks a lot garnacho! It’s also not an issue with the accelerometer module enabled.

Next thing to try: The Power Button. I could remap it, but didn’t find the option to lock the screen there. I clicked the next best thing, which was “Do nothing”. I then went to Keyboard Shortcuts and tried remapping the power button to “Lock screen”. This only works intermittently, but at least I have an easily accessible option to lock the screen without it. I think it has something to do with some tablet auto-detection code, which turns out to be flimsy, and the button defaulting to lock the screen on tablets. In any case, that’s another papercut that needs fixing. I had a short chat with some nice folks on #gnome-hackers about it, it looks like they are aware that their tablet detection code needs to be worked on, so I didn’t annoy them further.

I was then happy enough to start adding input methods. I had set up the system in Greek, because that brings along things like the keyboard layout and the timezone. I went to add Japanese input. GNOME comes with ibus integrated, so I just have to install and enable ibus-mozc, right?… Wrong. Somehow it ended up detecting Greek input instead of romaji, and then it couldn’t be converted to hiragana because… it’s Greek. The only way around it was to switch my system back to English, which I had meant to do from the beginning anyway, and to remove the Greek input from the keyboard. Hmm, still not good enough. I tried anthy instead of mozc, which is clearly inferior, but at least it worked. I then tried reporting the bug, so I brought back mozc to test it and… it works?… WAT. ¯\_(ツ)_/¯ The first law of engineering says “if it works, don’t touch it”. I could theoretically set up another system and try it there, but that would take too much time, and I’m not sure I have enough right now.

I also couldn’t find any on-screen keyboards that support Japanese. As of now, if I want to type in Japanese, I need to either have the external keyboard plugged in, or go to one of those online input systems. I tried inputting Japanese again now, using mozc romaji input and the on-screen keyboard, and that worked fine. Hiragana input would have been more convenient but it was showing the wrong labels, so I reported it.

Lastly, Firefox has to be started with env MOZ_USE_XINPUT2=1 in order for touch-based scrolling to work. I modified the firefox.desktop file and added a launcher in /usr/local/bin.

Conclusions

The tablet itself is wonderful. Linux, on the other hand, isn’t quite ready for touch-based devices yet. GNOME has some optimisations in place, but several papercuts still need to be worked on. KDE also worked decently in the short time that I tried it, but it really needs Wayland support in this regard. The respective teams are actively working on these issues in both desktop environments, so I’m optimistic for the future. XFCE and LxQt, on the other hand, are still barely usable with a touch screen, so I wouldn’t recommend them yet.

Bug of the day: 2019-07-25

This was actually Sebastian’s bug. He was having a crash caused by an invalid timecode. Now, timecodes are just hours:minutes:seconds:frames labels for each video frame. His code was ending up with a timecode of something like 45:87:84:31. Yes, that’s 87 minutes and 84 seconds. Also 31 frames at 30 fps.

He wondered where such a wildly invalid timecode might come from, and then he noticed he had LTC input accidentally turned on. LTC takes an audio signal as input and converts that signal into a timecode. It had been turned on accidentally, with no microphone connected, so it was picking up the music he was playing through the “Monitor of sound card” source.

He tried reproducing it but failed. Then I looked at him and suggested that he try the previous song again… And, kaboom! That particular song had the ability to generate crazy timecodes.

The fix is here: https://gitlab.freedesktop.org/gstreamer/gst-plugins-bad/commit/aafda1c76f4089505e16b6128f8b80ab316ab2f0

Translation of Shura no Hana

(This post was written by my brother, I’m just posting it here)

A while ago, an acquaintance and I were talking about our hobbies; I mentioned to him that I’ve translated Japanese comics in the past. He recalled a funny video he’d seen, titled “to krasaki tou tsou”, which consisted of misheard Greek lyrics of Kaji Meiko’s “Shura no Hana”. So he said “Dude, why not translate that one then?”. I thought “Challenge accepted!”

My first option, of course, was to take a look at English translations and base my own Greek translation on them. Imagine my surprise, then, when I realised that not only were the translations I found incorrect, even the transcriptions to romaji had mistakes.

With that in mind, I decided to translate it from scratch to both languages. The English version can be found below.

In a dead morning, the snow falls burying everything
All that’s heard is the howls of stray dogs and the creaks of my clogs*.
I walk whilst contemplating the weight of karma
A bull’s-eye-pattern umbrella embraces the darkness
I walk on the way of life, as a woman that has long since thrown her tears away

Atop the river that snakes around, my journey’s light fades away
The frozen crane*² sits still while the wind and rain howl.
The frozen water surface reflects unkempt hair
A bull’s-eye-pattern umbrella hides even my tears
I walk on the way of revenge, as a woman that has long since thrown her heart away

Honour and sentiment, tears and dreams,
yesterday and tomorrow, all empty words.
As a woman that has abandoned her body to the river of revenge,
I’ve long since thrown them all away.

* geta
*² (the bird, not the machine)

Bug of the day: 2018-07-12

I was alerted that there was a bug with the new build: Audio/video data didn’t seem to be flowing into the pipeline. Okay, no big deal, let’s enable logging and see where the data goes, right?… Except that enabling logging made the bug go away. The default debug level is *:3, which means “3 everywhere”. If I enabled “*:3,GST_PADS:6” (which means 6 in GST_PADS, 3 everywhere else), the bug disappeared. Same with “*:3,GST_EVENT:6”, which is really a lightweight logging change in our pipeline. Then I made a typo and enabled something equivalent to “*:3,non_existent_category:6” for logging – which means “have a debug level of 3 everywhere, except in that category which doesn’t exist anyway, where it will be 6”. The bug was not present. We had pretty much the definition of a Heisenbug: if we so much as sneezed next to the logs, the bug disappeared. How on Earth is one supposed to debug this?

The next thing to try would maybe have been to roll back recent changes and see if they made a difference. Except there were no recent changes in the code. The only thing that had changed was something in the build system. So, depending on how the code was built, it would or wouldn’t demonstrate the bug. But the bug would only show up when there was no attempt at enabling any logging whatsoever.

After some time spent barking up wrong trees, waiting for builds to finish, and trying to figure out a way to reduce the build waiting time, I realised that the problem was essentially the following: The driver was delivering buffers into the GStreamer element, but those buffers were apparently failing to arrive a bit further downstream. A backtrace showed me that the buffers were also not being queued up anywhere. So, if they were produced at the source, not arriving where they were supposed to arrive, and also not piling up anywhere, something in the middle must have been dropping them.

I searched the code for pad probes that could be dropping the buffers, but there were none. I also looked at the elements between the source and the one that was missing the buffers, and they were all harmless. Some closer examination showed me that the buffers were even failing to exit the source element. So, something inside the source element must be secretly swallowing up the buffers.

Now, the GStreamer Decklink source element has a property that says “drop this much time when you first start up the element”. A closer look inside the logs revealed that this was indeed the culprit: it was waiting for the first (approximately) 40 hours since start-up until it would start outputting buffers. I looked inside the code that sets this property: it was set to one second. Hard-coded. As in, 1000000000 nsec. I would set that 64-bit property to literally 1000000000, and it would receive a huge number on the other side.

The key word there was “literally”. I tried casting the literal 1000000000 to a 64-bit integer and it worked!

The argument was passed as part of a vararg function call (variable number of arguments, NULL-terminated). Vararg arguments aren’t automatically cast to 64-bit when needed. The result was that the machine was taking 1000000000 as the low 32-bit half, while the high 32-bit half was filled with whatever garbage happened to be in the register at the time. And that’s how you accidentally convert 1 second to 40 hours!