About the Matrix incident on July 26, 2023
What happened
Our two NVMe storage drives were well over 200% of their rated lifetime used. That was on purpose, to reduce e-waste. Since the two drives had worn out in sync and both reported 255%, I got worried that the RAID might no longer protect us in case of an incident. I therefore requested the replacement of the drive with the higher RW counter – note, however, that there was no data integrity error and the RAID was healthy.
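For reference, that wear figure is the NVMe SMART attribute “Percentage Used”, which the spec caps at 255. Here is a minimal sketch of how it can be read on Linux, assuming smartmontools is installed; the JSON field names are my assumption based on recent smartctl versions and may need adjusting:

import json
import subprocess

def percentage_used(device: str) -> int:
    """Read the NVMe 'Percentage Used' SMART attribute via smartctl.
    The JSON field names below are an assumption based on recent
    smartmontools releases; adjust them if your version differs."""
    # smartctl's exit status is a bitmask, so don't rely on a zero exit code
    out = subprocess.run(["smartctl", "-j", "-a", device],
                         capture_output=True, text=True).stdout
    data = json.loads(out)
    return data["nvme_smart_health_information_log"]["percentage_used"]

if __name__ == "__main__":
    for dev in ("/dev/nvme0", "/dev/nvme1"):
        print(dev, f"{percentage_used(dev)} % of rated lifetime used")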
We have now replaced the requested drive, however – it would seem that both of the drives have failed due to their condition. We have replaced the requested drive and booted the server into the Rescue system, however we are unable to make the remaining original drive visible to the system. Even replacement of the connectors does not appear to make the drive visible.
Sadly, the hoster could get neither the remaining drive nor the replaced drive recognized by the OS anymore, and left me the server in a state where I basically had only one blank drive.
After multiple unsuccessful attempts to make the rescue OS initialize the drive again – kernel parameters, leaving the server powered off for a while, and so on – I had to give in and requested the replacement of the other drive as well.
[Wed Jul 26 19:19:10 2023] nvme nvme1: I/O 9 QID 0 timeout, disable controller
[Wed Jul 26 19:19:10 2023] nvme nvme1: Device shutdown incomplete; abort shutdown
[Wed Jul 26 19:19:10 2023] nvme nvme1: Removing after probe failure status: -4
As soon as the new drive was in place and I had gathered everything needed to restore from backups, I started to set the “new” server up and restore the data. Around 1:45 am, I accepted that it was bedtime.
The next day I noticed that the previous Debian installation was still on an old, now unmaintained PostgreSQL version, so I wanted to take the opportunity to upgrade to a current release. Sadly, this took way too long with the native tools (the link-based upgrade method did not work due to an incompatibility), and after around 6-7 hours I cancelled the process; I will retry with third-party tooling later.
However, this Synapse setup was different from my customer servers: for some reason, the config in /etc/synapse was a symlink pointing into the encrypted partition. You guessed it – I was backing up the symlink, not its target. That meant that on top of the missing Synapse config, the signing key for messages was lost and the original webserver config was missing as well.
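The lesson for me: a backup job should dereference symlinks, or at least loudly warn about them. Here is a minimal sketch of such a pre-backup copy step, assuming a plain file-based backup; the destination path is just an example, not my actual setup:

import shutil
from pathlib import Path

def copy_resolved(src: str, dst: str) -> None:
    """Copy a config directory into a backup staging area, dereferencing
    symlinks so the real files end up in the backup instead of a dangling
    link. Paths are illustrative examples only."""
    src_path = Path(src)
    if src_path.is_symlink():
        print(f"warning: {src} is a symlink to {src_path.resolve()}")
    # symlinks=False copies the actual file contents, not the links
    shutil.copytree(src_path.resolve(), dst, symlinks=False, dirs_exist_ok=True)

if __name__ == "__main__":
    copy_resolved("/etc/synapse", "/var/backups/synapse-config")

Most backup tools offer an equivalent option to follow links; the point is simply that the real files, not the link, need to end up in the archive.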
The state as of late afternoon on July 27
After the most important parts of the config were rebuilt, the Synapse server was started with the most important workers – intentionally without the media repo, appservices or presence support – and tried to get back in sync with the network:
I then had a little fight with the media repository, which uses third-party software. Functionality has been restored, though for a yet unclear reason some image resolutions for room/people avatars are still missing. Here, too, the original config was gone.
Later that night
Around midnight, the WhatsApp, Telegram and Signal bridges returned.
July 28
- The current state is that we (the community and I) are waiting for outgoing federation to return to normal. Some servers already work fine, others still don't, and Matrix.org only does from time to time. Please have patience; it is expected to sort itself out within the next hours. Right now we assume the delayed / missing outgoing federation is caused by the new signing key mentioned above.
- Presence has returned (online status of users)
- Our moderation bot has returned
- As a last resort, an issue report for the federation issues has been filed.
July 29
- Around 2 am, it was discovered / reproduced that the server's signing keys are not properly refreshed on remote servers, which then throw errors like
Signature on retrieved event $e4xQAons8TGPgR4iy4RhGRX_0_dfCZmRTrhdL9MoypM was invalid (unable to verify signature for sender domain tchncs.de: 401: Failed to find any key to satisfy
It's a good thing to have at least some certainty. Still hoping for help on GitHub while looking for options (see the sketch after this list).
- External login providers have been added again
- Most media issues (loading small versions of images such as avatars) should be resolved
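For the curious: remote servers validate our events against the signing keys we publish over federation, which is why a stale cached key on their side breaks verification. Here is a minimal sketch for inspecting which keys a homeserver currently advertises via the standard /_matrix/key/v2/server endpoint; the direct port-8448 URL is an assumption, as real key fetching also goes through .well-known/SRV delegation and notary servers:

import json
import urllib.request

def fetch_server_keys(server_name: str) -> dict:
    """Fetch the signing keys a Matrix homeserver advertises over
    federation. For simplicity this assumes the server answers directly
    on port 8448; delegation via .well-known or SRV is not handled."""
    url = f"https://{server_name}:8448/_matrix/key/v2/server"
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)

if __name__ == "__main__":
    keys = fetch_server_keys("tchncs.de")
    print("valid_until_ts:", keys.get("valid_until_ts"))
    for key_id in keys.get("verify_keys", {}):
        print("current key:", key_id)
    for key_id in keys.get("old_verify_keys", {}):
        print("old key:", key_id)

If a remote server still trusts an older cached key and refuses to re-fetch, signature checks against the new key fail with errors like the one quoted above.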
How to contact me:
Follow me on Mastodon / More options on tchncs.de