About the Matrix incident on July 26 2023

What happened

Our two NVMe storage drives had been well over 200% of their reported lifetime used. That was on purpose, to reduce e-waste. But since they somehow got a little too synchronized and both reported 255%, I got scared that the RAID might not be as effective anymore in case of an incident. I therefore requested the replacement of the drive with the highest RW counter – note, however, that there was no data integrity error and the RAID was healthy.
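
For context, that wear figure is the NVMe “Percentage Used” SMART attribute, which the spec caps at 255 – so two mirrored drives that receive the same writes will eventually both sit at that ceiling. Below is a minimal monitoring sketch in Python; it assumes smartctl with JSON support is installed, the key path shown is how recent smartmontools releases expose the value (verify on your own system), and the device names are only examples.

import json
import subprocess

def nvme_percentage_used(device: str) -> int:
    """Read the NVMe 'Percentage Used' wear indicator from smartctl's JSON output.

    Assumes a smartctl version with --json support; the key path below is how
    recent smartmontools releases report the NVMe health log.
    """
    out = subprocess.run(
        ["smartctl", "--json", "-a", device],
        capture_output=True, text=True, check=True,
    ).stdout
    data = json.loads(out)
    return data["nvme_smart_health_information_log"]["percentage_used"]

# Example: check both members of the mirror so they are not replaced
# at the exact same wear level (device names are placeholders).
for dev in ("/dev/nvme0n1", "/dev/nvme1n1"):
    print(f"{dev}: {nvme_percentage_used(dev)}% of rated endurance used")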

The hoster replaced the requested drive, but then it seemed that both drives had failed due to their condition: after booting the server into the rescue system, they were unable to make the remaining original drive visible to the system, and even replacing the connectors did not help.

Sadly, the hoster could get neither the remaining drive nor the replaced drive recognized by the OS anymore, and left me the server in a state where I basically had only one blank drive.

After multiple unsuccessful attempts with kernel parameters, and after leaving the server powered off for a while in the hope that the rescue OS would initialize the drive again, I had to give in and requested the replacement of the other drive as well. The kernel log showed the drive timing out and being dropped:

[Wed Jul 26 19:19:10 2023] nvme nvme1: I/O 9 QID 0 timeout, disable controller
[Wed Jul 26 19:19:10 2023] nvme nvme1: Device shutdown incomplete; abort shutdown
[Wed Jul 26 19:19:10 2023] nvme nvme1: Removing after probe failure status: -4

As soon as the new drive was in place and I had gathered everything needed to restore from backups, I started setting up the “new” server and restoring. Around 1:45 am, I accepted that it was bedtime.

The next day I realized that the previous Debian installation had still been running an old, now unmaintained PostgreSQL version, so I wanted to take the opportunity to upgrade to a current release. Sadly, this took far too long with the native tools (the linking methods did not work due to incompatibility), and after around 6-7 hours I cancelled the process; I will retry with third-party tooling later.

However, this Synapse setup was different from my customer servers: for some reason the config lived on the encrypted partition and was only symlinked to /etc/synapse. You guessed it – I was backing up a symlink. That meant that, on top of the missing Synapse config, the signing key for messages was lost and the original webserver config was missing as well.
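
For anyone wanting to avoid the same trap: most file-based backup tools store a symlink as just the link itself unless told to dereference it, so a config directory that is really a link into another (here: encrypted) partition ends up as a few bytes of link target in the backup instead of the actual files. Here is a small Python sketch for auditing a backup source for symlinks before the job runs – the paths are only examples:

import os

def find_symlinks(root: str):
    """Yield (link, target) pairs for every symlink below root, so the backup
    job can be configured to dereference them or to include their targets."""
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                yield path, os.path.realpath(path)

# On the affected host, /etc/synapse itself was a symlink into the
# encrypted partition – exactly the kind of entry this would flag.
for link, target in find_symlinks("/etc"):
    print(f"{link} -> {target}")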

The state as of late afternoon on July 27

After the most important parts of the config had been rebuilt, the Synapse server started with its most important workers – deliberately without the media repo, appservices, or presence support – and tried to get back in sync with the network.

I then had a little fight with the media repository, which uses third-party software. Functionality has been restored, though for a yet unclear reason some image resolutions for room and people avatars are still missing. Here, too, the original config was gone.

Later that night

Around midnight, the WhatsApp, Telegram, and Signal bridges returned.

July 28

July 29

