I managed to completely stop my computer from booting earlier today. And then, eventually, I managed to fix it again.
This is really a story about how important it is to document your sysadmin tasks, which is a lesson I learned many years ago in my professional life but still haven't applied at home!
A few years ago (in fact shortly after the first Covid-19 lockdown was imposed in the UK) I bought myself a new laptop. It's a 15” Dell XPS with 16GB of RAM and a 12-core i7 processor in it. It's a lovely machine if you don't want to do too much gaming (I don't) and don't want to use Apple kit (I definitely don't).
When I got it I quite enjoyed the Windows 10 experience; I hadn't used Windows for many years at that point and it was fun to see where the OS had gone. But obviously after a short while common sense (or lack of it!) took over and I reformatted the thing and popped Linux on to it.
I've been an Arch Linux user for many years (yes I know the “how do you know if someone uses Arch?” joke) so naturally that was my go-to. Being a typically paranoid FOSS user I encrypted everything up to the nines, followed the Arch installation guide very, very carefully, and off we go.
How awesome. I now have a lovely little laptop running KDE Plasma on Arch, everything's fully encrypted with LUKS including the boot partition, and I'm a happy little user.
Until earlier today.
Breaking the code
You may be aware that a few weeks ago a French anarchist was arrested, his devices seized, and within an unfeasibly short period of time his LUKS-encrypted disk was cracked. This of course sent a wave of shock through many security-conscious people, and the conclusion was reached that a nation-state-level actor, with access to large numbers of GPUs, can brute-force a passphrase protected by PBKDF2, which until recently was the default key derivation function used by LUKS to protect the disk decryption key.
There are workarounds, but the real solution is to move to LUKS2 and use Argon2id rather than PBKDF2 to derive the key from the passphrase. See https://dys2p.com/en/2023-05-luks-security.html for a much more in-depth discussion.
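To see where a given volume stands, `cryptsetup luksDump` prints the header, including the LUKS version and, for LUKS2, a `PBKDF:` line per keyslot. The following is a minimal sketch that summarises such a dump. The function name `luks_kdf_summary` is made up for illustration, and the parsing assumes the dump layout of recent cryptsetup releases.

```shell
#!/bin/sh
# Hypothetical helper: summarise the LUKS version and key derivation
# functions found in `cryptsetup luksDump` output read from stdin.
# Assumes the dump has a "Version:" line and, for LUKS2, one "PBKDF:"
# line per keyslot.
luks_kdf_summary() {
    awk '
        $1 == "Version:" { version = $2 }
        $1 == "PBKDF:"   { kdfs = kdfs (kdfs ? "," : "") $2 }
        END {
            if (version == "") version = "unknown"
            # LUKS1 headers carry no PBKDF line; LUKS1 always uses PBKDF2.
            if (kdfs == "" && version == "1") kdfs = "pbkdf2"
            if (kdfs == "") kdfs = "unknown"
            printf "version=%s kdf=%s\n", version, kdfs
        }
    '
}

# Typical use (needs root; the device path is the one used later in this post):
#   cryptsetup luksDump /dev/nvme0n1p3 | luks_kdf_summary
```

Any keyslot that still reports `pbkdf2` is one that would need converting.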
As mentioned I was aware of this issue but hadn't done much about it for two reasons:
- I'm not a French anarchist, and while there are strong problems with the maxim “If you have nothing to hide you have nothing to fear” I don't have any reason to believe that nation-state-level actors are likely to want to inspect my devices.
- I use GRUB as my boot-loader, and GRUB has limited (read “virtually no”) support for Argon2id as a key derivation function.
Why do I use GRUB? Because it's the only boot loader that supports encryption of the boot partition. While in theory there's no real harm in leaving your boot partition unencrypted (after all the boot loader and the kernel image should be the same for everyone) it appeals to my sense of “doing it right” to have as much of the disk encrypted as is possible.
So when I boot the laptop, it prompts me for the decryption passphrase before it even gets to the GRUB menu. It's all quite satisfying really.
Get on with it!
Browsing /r/Linux on Reddit after lunch today I came across a post which suggested that there was a patched version of GRUB in the AUR which would allow GRUB to unlock volumes whose passphrase is protected with Argon2id.
Nice! All I needed to do was drop in that replacement GRUB package and then follow the instructions for upgrading to LUKS2 and re-encrypting my LUKS encryption key.
pacman -U -- 'grub-improved-luks2-git-2.06.r499.ge67a551a4-1-x86_64.pkg.tar.zst'
loading packages...
resolving dependencies...
looking for conflicting packages...
:: grub-improved-luks2-git and grub are in conflict. Remove grub? [y/N]
What could possibly go wrong?
Package installed, I re-ran grub-mkconfig and rebooted.
Oh.
Oh dear.
Instead of prompting me for my LUKS passphrase like it always has done before, my computer is now loading straight into GRUB. Which panics because it can't find a root device, and drops me into an emergency shell.
So what's actually happened? This took me quite a bit of head-scratching to find out, but essentially before I made the change, the boot process was something like this:
- GRUB starts and recognises that the boot partition is encrypted
- GRUB prompts for the decryption passphrase
- GRUB decrypts the boot partition, reads grub.cfg and from then on booting continues like normal (including decrypting and mounting the other partitions)
A fuller description is at https://cryptsetup-team.pages.debian.net/cryptsetup/encrypted-boot.html
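For reference, the behaviour described above is switched on from /etc/default/grub, which is a shell-style file of variable assignments. A minimal sketch of the relevant lines follows. It is based on the usual Arch encrypted-boot setup with the `encrypt` initramfs hook, which is an assumption, and it is not a copy of the file on this laptop: the UUID is a placeholder, and the `lvm` mapper name and `volgrpit-root` volume are simply the names that appear later in this post.

```shell
# Sketch of /etc/default/grub entries for an encrypted /boot (assumed values).

# Tell GRUB to unlock LUKS volumes itself before it reads grub.cfg.
GRUB_ENABLE_CRYPTODISK=y

# Kernel parameters for the Arch "encrypt" initramfs hook: which LUKS device
# to open, the mapper name to give it, and where the root volume lives.
# The UUID below is a placeholder, not a real device.
GRUB_CMDLINE_LINUX="cryptdevice=UUID=00000000-0000-0000-0000-000000000000:lvm root=/dev/mapper/volgrpit-root"
```

Without the first line, grub-mkconfig and grub-install produce a GRUB that never asks for a passphrase.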
However, I had not realised that the newer patched GRUB package shipped with its own version of /etc/default/grub, which meant that when I subsequently ran grub-mkconfig I clobbered my carefully set up decryption information. So now what was happening was:
- GRUB starts, and nobody has told it that the boot partition is encrypted
- GRUB tries to find the Linux root partition, which as far as it can tell doesn't exist, as it's hiding behind a LUKS cryptmapper
- GRUB throws its toys out of the pram (to be fair it can't do anything other than that) and drops me into a shell, saying “you made this mess, human, you need to sort it out”
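A small guard run before grub-mkconfig would have caught this. The following is a minimal sketch, assuming the setup relies on `GRUB_ENABLE_CRYPTODISK=y` and a `cryptdevice=` kernel parameter (the Arch `encrypt` hook convention). `check_grub_defaults` is a made-up name, not part of GRUB.

```shell
#!/bin/sh
# Hypothetical pre-flight check: report failure if the GRUB defaults file
# no longer contains the encrypted-boot settings.
# Usage: check_grub_defaults [FILE]   (FILE defaults to /etc/default/grub)
check_grub_defaults() {
    file=${1:-/etc/default/grub}
    status=0
    # An uncommented line enabling GRUB's own LUKS unlocking.
    if ! grep -Eq '^[[:space:]]*GRUB_ENABLE_CRYPTODISK="?y' "$file"; then
        echo "missing: GRUB_ENABLE_CRYPTODISK=y in $file" >&2
        status=1
    fi
    # An uncommented kernel command line naming the LUKS device to open.
    if ! grep -Eq '^[[:space:]]*GRUB_CMDLINE_LINUX(_DEFAULT)?=.*cryptdevice=' "$file"; then
        echo "missing: cryptdevice= kernel parameter in $file" >&2
        status=1
    fi
    return "$status"
}

# Example: only regenerate the config when the check passes.
#   check_grub_defaults && grub-mkconfig -o /boot/grub/grub.cfg
```

A stock defaults file, where the cryptodisk line is commented out, fails both checks.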
Now, if I'd made careful, detailed notes when I first set up the machine I'd have known exactly how to fix it.
As it was, a process of trial and error, along with lots of staring at the Arch Wiki pages for initial system installation and GRUB, eventually got things going. After about two hours.
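Those notes don't have to be elaborate. As a rough sketch of the idea, a few lines of shell can copy the files that define the boot setup into a dated directory. `snapshot_boot_notes` and its default file list are illustrative choices, not a standard tool.

```shell
#!/bin/sh
# Hypothetical helper: copy boot-related config files into a dated notes
# directory, so there is a known-good reference to compare against later.
# Usage: snapshot_boot_notes DEST_DIR [FILE...]
snapshot_boot_notes() {
    if [ "$#" -lt 1 ]; then
        echo "usage: snapshot_boot_notes DEST_DIR [FILE...]" >&2
        return 2
    fi
    dest=$1/$(date +%Y-%m-%d)
    shift
    # The default file list is an assumption about a typical Arch + GRUB setup.
    [ "$#" -gt 0 ] || set -- /etc/default/grub /etc/mkinitcpio.conf /etc/fstab
    mkdir -p "$dest" || return 1
    for f in "$@"; do
        if [ -f "$f" ]; then
            # Flatten the path into the file name: /etc/fstab -> etc_fstab
            name=$(printf '%s' "${f#/}" | tr '/' '_')
            cp -p "$f" "$dest/$name" || return 1
        else
            echo "skipped (not found): $f" >&2
        fi
    done
    # Print where the snapshot went.
    printf '%s\n' "$dest"
}

# Example (the destination path is illustrative):
#   snapshot_boot_notes "$HOME/sysadmin-notes"
```

Running something like this before touching the boot loader leaves a copy of the working configuration to diff against.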
For posterity, this is what I needed to do, although getting all these ducks in a row took quite a bit of trial and error to get right:
- Boot into an Arch installation image. I didn't have one lying around, but fortunately I did have a spare computer lying around (who doesn't?!) on which to download the ISO, and more spare USB sticks than I can count to copy it onto.
- Once in the Arch installation shell, decrypt the LUKS-encrypted partitions ...
cryptsetup open /dev/nvme0n1p3 lvm
- ... and mount the resulting LVM volumes into where the Arch setup would expect them to be:
mount /dev/mapper/volgrpit-root /mnt
mount /dev/mapper/volgrpit-home /mnt/home
- Mount the EFI volume:
mount /dev/nvme0n1p1 /mnt/efi
- Start up the wifi or other network connection, cos you're going to need it shortly.
- chroot into the existing system
arch-chroot /mnt
- Install the original GRUB package
pacman -S grub
- Modify /etc/default/grub to have all the options necessary to boot the system as it was before. Could I remember those? Of course not! Eventually I realised that Arch's package manager “pacman” had saved the old file as /etc/default/grub.pacnew but not before I'd made several abortive attempts at modifying the existing file. Rebooting each time, of course.
- Reinstall grub, and reinstall its config
grub-install --target=x86_64-efi --efi-directory=/efi --bootloader-id=GRUB
grub-mkconfig -o /boot/grub/grub.cfg
- Reboot, and breathe a huge sigh of relief as everything's back the way it should be.
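The step that cost the most time was finding the old defaults file. pacman leaves such copies beside the live file with a `.pacnew` or `.pacsave` suffix, so a quick search under /etc before editing anything would have pointed straight at it. A minimal sketch follows; the function name is invented for illustration, and the `pacdiff` tool from pacman-contrib does a more thorough job.

```shell
#!/bin/sh
# Hypothetical helper: list config files that pacman set aside
# (.pacnew / .pacsave) under a directory, /etc by default.
list_pacman_leftovers() {
    dir=${1:-/etc}
    find "$dir" -type f \( -name '*.pacnew' -o -name '*.pacsave' \) 2>/dev/null | sort
}

# Example: find set-aside files, then compare one with the live file
# before editing anything.
#   list_pacman_leftovers
#   diff -u /etc/default/grub.pacnew /etc/default/grub
```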
Conclusion
Written out like that it's not particularly complicated. The problem came of course because I didn't have a guide written out like that. All of the above steps are somewhere in the Arch installation guide, but picking the right steps to run when you're not completely sure of what the problem is, isn't easy.
And as I said at the very top of this post, it's a strong lesson in why you should always make notes of your adventures in sysadmin!