tchncs

News about the tchncs.de service.

About the Matrix incident on July 26 2023

What happened

Our two NVMe storage drives were well over 200% of their reported lifetime used. That was on purpose, to reduce e-waste. Since they somehow got so synchronized that both reported 255%, I got worried that the RAID might no longer be effective in case of an incident. I then requested the replacement of the drive with the highest read/write counter – note, however, that there was no data integrity error and the RAID was healthy.
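
For context, the wear figure above is the NVMe “Percentage Used” SMART attribute; the spec caps it at 255, which would explain why both drives showed exactly that value. A minimal sketch of reading it, assuming smartmontools (7.0+ for JSON output) is installed and using example device paths – not the exact tooling used here:

# Minimal sketch: read the NVMe wear indicator via smartctl's JSON output.
# Requires smartmontools >= 7.0; device paths below are examples.
import json
import subprocess

def percentage_used(device: str) -> int:
    # smartctl uses non-zero exit codes as a status bitmask, so the return code is not checked
    out = subprocess.run(["smartctl", "--json", "-a", device],
                         capture_output=True, text=True).stdout
    data = json.loads(out)
    return data["nvme_smart_health_information_log"]["percentage_used"]

for dev in ("/dev/nvme0", "/dev/nvme1"):
    print(dev, percentage_used(dev), "% of rated lifetime used")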

The hoster's reply: “We have now replaced the requested drive, however it would seem that both of the drives have failed due to their condition. We have replaced the requested drive and booted the server into the Rescue system, however we are unable to make the remaining original drive visible to the system. Even replacement of the connectors does not appear to make the drive visible.”

Sadly the hoster could neither get the remaining drive nor the replaced drive recognized by the OS anymore, and left me the server in a state where I basically had only one blank drive.

After multiple unsuccessful attempts to get the rescue OS to initialize the drive again – including kernel parameters and leaving the server powered off for a while – I had to give in and requested the other drive to be replaced as well.

[Wed Jul 26 19:19:10 2023] nvme nvme1: I/O 9 QID 0 timeout, disable controller
[Wed Jul 26 19:19:10 2023] nvme nvme1: Device shutdown incomplete; abort shutdown
[Wed Jul 26 19:19:10 2023] nvme nvme1: Removing after probe failure status: -4

As soon as the new drive was in place and I had gathered everything needed to restore from backups, I started to set the “new” server up and to restore the data. Around 1:45am, I accepted that it was bedtime.

The next day I realized that the previous Debian installation had still been running an old, now unmaintained PostgreSQL version, so I wanted to take the opportunity to upgrade to a current one. Sadly this took way too long with the native tools (the link-based upgrade method did not work due to an incompatibility), and after around 6-7 hours I cancelled the process; I will retry with third-party tooling later.

However, this Synapse setup was different from my customer servers: for some reason, /etc/synapse was a symlink into the encrypted partition. You guessed it – I was backing up the symlink, not its target. That meant that on top of the missing Synapse config, the signing key for messages was lost, and the original webserver config was missing as well.
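
The takeaway for me: a backup job has to either follow symlinks or at least complain about them. A rough sketch of that idea – the paths are placeholders, this is not the actual backup tooling:

# Illustrative sketch only; paths are placeholders, not the real backup setup.
import os
import tarfile
from pathlib import Path

CONFIG_PATH = "/etc/synapse"  # on the old host this was a symlink into the encrypted partition

def warn_if_symlink(path: str) -> None:
    # Cheap sanity check before trusting a backup set.
    p = Path(path)
    if p.is_symlink():
        print(f"warning: {p} is a symlink -> {os.readlink(p)}")

def archive_config(dest: str = "synapse-config.tar.gz") -> None:
    # dereference=True stores the files a symlink points to,
    # instead of just the link itself (which is useless on a fresh install).
    with tarfile.open(dest, "w:gz", dereference=True) as tar:
        tar.add(CONFIG_PATH)

warn_if_symlink(CONFIG_PATH)
archive_config()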

The state from late afternoon on July 27

After the most important parts of the config were rebuilt, the Synapse server started with the most important workers – deliberately without the media repo, appservices or presence support – and tried to get back in sync with the network.

I then had a little fight with the media repository, which uses third-party software. Functionality has been restored, though for a yet unclear reason some image resolutions for room/user avatars are still missing. Here, too, the original config was gone.

Later that night

Around midnight, the WhatsApp, Telegram and Signal bridges returned.

July 28

  • The current state is that we (the community and I) are waiting for outgoing federation to return to normal. Some servers already work fine, others still don't, and Matrix.org only works from time to time. Please have patience; this is expected to sort itself out within the next hours. Right now we assume the delays / missing outgoing federation are caused by the new signing key mentioned above.
  • Presence has returned (online status of users)
  • Our moderation bot has returned
  • As a last resort, an issue report for the federation issues has been filed.

July 29

  • around 2am, it was discovered / reproduced that the server signing keys are not properly refreshed on remote servers, and they throw errors like Signature on retrieved event $e4xQAons8TGPgR4iy4RhGRX_0_dfCZmRTrhdL9MoypM was invalid (unable to verify signature for sender domain tchncs.de: 401: Failed to find any key to satisfy). It's a good thing to have at least some certainty. Still hoping for help on GitHub while looking for options. (Some background on these keys follows after this list.)
  • external login providers have been added again
  • most media issues (loading small versions of images such as avatars) should be resolved
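
Some background on the signature errors above: remote servers verify events against the signing keys a homeserver publishes on the standard federation key endpoint, and they may keep old keys cached. A small sketch of looking at what tchncs.de currently advertises – it uses the requests library, and the direct HTTPS request is a simplification of how federation actually resolves a server:

# Sketch: inspect the signing keys a Matrix homeserver publishes via the
# federation key endpoint /_matrix/key/v2/server. Needs the `requests` package.
import requests

def fetch_server_keys(server_name: str) -> dict:
    # Real federation first resolves the server via .well-known / SRV records;
    # querying the bare name over HTTPS is a simplification for this sketch.
    resp = requests.get(f"https://{server_name}/_matrix/key/v2/server", timeout=10)
    resp.raise_for_status()
    return resp.json()

keys = fetch_server_keys("tchncs.de")
print("current keys:", list(keys.get("verify_keys", {})))
print("expired keys:", list(keys.get("old_verify_keys", {})))

If a remote server only knows about events signed with the lost key and can no longer fetch that key ID, verification presumably fails with exactly the kind of 401 shown above.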

How to contact me:
Follow me on Mastodon / More options on tchncs.de

OpenTalk Screenshot

Known issues

  • Speedtest not yet implemented
  • Metrics not yet implemented (relevant for you insofar as I cannot see whether a restart of the service would interrupt anyone right now)
  • Speedtest not supporting latency (not a direct OpenTalk issue, the speedtest software does not support it yet)
  • Protocol (meeting minutes) PDF export only works from within the pad
  • Recordings not yet open-sourced, so the menu entry will not work
  • No proper email support yet
  • Phone calls feature not addressed yet, but planned

Login / signup

OpenTalk's authentication service (Keycloak) is connected to our authentication server (Zitadel). Until recently, you had to make sure to use the button below the login form; that form is now hidden with CSS. Please remember that your account name ends with @tchncs.de.

About this instance

Due to the complexity of OpenTalk, this instance derives from the official “lite” Docker example. It adds a number of services, trying to get as close to a complete experience as possible. As of the time of writing, this is still in progress, and a few more restarts are to be expected in order to apply new settings.

Maintenance window: because it doesn't make sense to restart all the time while you are trying to give it a fair test, I will try to apply new settings only between 5-9pm CET. (see below)

This instance is categorized as “playground mode”. Its purpose is to evaluate whether it is feasible to keep it long-term. This also means that you are still more than welcome to use and test it, because software that is not used by anybody can't be tested/evaluated properly.

Playground mode window: This service will be evaluated until mid-June 2023.

About OpenTalk

tba (irrational happy “Rust” noises)

#tchncs #opentalk #playground


How to contact me:
Follow me on Mastodon / More options on tchncs.de

Update

The evaluation period is complete. The instance will stay.


Say hi to a new and exciting service at tchncs.de – well – for now. :)

Due to previous mistakes, I have decided to declare testing periods for new services before they are added to the portfolio long-term. In this case I went with one month, meaning until April 6th '23.

BookWyrm makes a good first impression and was highly voted for in our new survey. You can track, rate, discuss and share the books you are reading. It is even possible to link sources for books. All this while being part of the fediverse, like Mastodon!

Sounds great? Wonderful:


I have requested an invite but received no email!?

Please give me some time to review requests and send invites. BookWyrm does not send an email until the actual invite goes out. If it takes multiple days, please check your spam folder or contact me directly. 😇

How final is this setup?

Well, it looks fine so far, but it is fairly fresh, and since it is not a simple Docker install, it is possible that there are small mistakes that still need fixing. Note that bookwyrm.social appears to be very slow right now, which causes book imports to fail (if you use that domain as a source – there are more options!).

What if it does not qualify during testing?

In that case, I will give users enough time to look for a new instance and will publish an announcement on the instance as well as in this article.

What if it does qualify?

In that case ... well ... it will just continue running and descriptions will be updated accordingly.

#tchncs #bookwyrm #playground


How to contact me:
Follow me on Mastodon / More options on tchncs.de

Starting December 20 at 8:30pm CET, the videos of the tchncs PeerTube instance are moving to a new, more flexible home. At the time of writing, we are at over 5.5 TB of video storage.

Status updates

  • Dec 31, 4pm: The remaining issue is a compatibility problem with the permissions PeerTube sets on the storage objects. A few videos fail to be moved to remote storage (they fail at the last step, even though the files themselves are usually moved successfully). You can play around with the resolution to work around playback issues, or try to re-upload the video if it's urgent. Here is the bug report for the issue. I am not sure why some videos work and some don't. (A rough sketch of what such a permission check looks like follows after this list.)
  • Dec 29, 5pm: A problem with the media proxy server was identified. As a result, the machine is no longer starved of available bandwidth. This results in smoooother playback and overall better instance snappiness.
  • Dec 28, 5pm: First round done; re-initiated the migration to catch and transfer videos that failed due to the flaky old storage backend.
  • Dec 26, 10am: 4.9 TB of 5.5+ TB migrated
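
On the Dec 31 permission issue: PeerTube's remote storage targets S3-compatible object storage, so the failing last step seems to revolve around the ACLs of the uploaded objects. A rough, hypothetical sketch of how such a check could look with boto3 – endpoint, bucket name and prefix below are made-up placeholders, not the real setup:

# Hypothetical sketch: list migrated video objects on an S3-compatible bucket and
# print their ACL grants. Endpoint, bucket and prefix are placeholders.
import boto3

s3 = boto3.client("s3", endpoint_url="https://objects.example.com")

def check_acls(bucket: str, prefix: str) -> None:
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            acl = s3.get_object_acl(Bucket=bucket, Key=obj["Key"])
            grants = [grant["Permission"] for grant in acl["Grants"]]
            print(obj["Key"], grants)

check_acls("peertube-videos", "videos/")  # placeholder names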

Benefits of the new location

  • higher availability and overall reliability: the old network storage became unavailable from time to time over the years, sometimes due to outages, sometimes due to maintenance.
  • scalability: the old network storage drive has a maximum amount of storage you can rent; the new storage will not have such a restriction.
  • redundancy: the storage can more easily be replicated to a different location

Challenges / known migration issues

  • hidden videos: it appears that PeerTube hides videos that are pending migration
  • videos that failed to move: it appears that the old network storage became even more unreliable during the migration, which in turn appears to cause the video move to fail from time to time. These videos will remain hidden. I will re-trigger the migration once the queue of pending videos is empty.

All it takes is patience

As of right now, there is no reason to worry. Everything is under control, but the process will still take a couple of days. Please be patient. 😇

#tchncs


How to contact me:
Follow me on Mastodon / More options on tchncs.de