The Tale of a Corrupt Backup

25 Feb, 2023 · 4 min read · #backup #restic #hardware

As I mentioned in the post about my backup system, I run restic check regularly to test the integrity of my backup. It performs a quick, shallow check and does not verify that all the data is intact. It has never reported any error in my repository. I also restore a random file whenever I run restic check, to test recoverability.

Shortly after I wrote the previous post, I decided to run restic check --read-data for the first time ever. This reads every file in the repository and simulates a full restore. To my utter horror, it reported many errors like these:

Pack ID does not match, want 5e66c2ac, got e80051de
pack d64be86d contains 1 errors: [Blob ID does not match, want 8ebf2c10, got 350f6ba1]

The “Pack ID does not match” error occurs when the SHA256 hash of the contents of a file in the “data” directory of the Restic repository does not match its name. This generally indicates that the file is corrupt. I had successfully restored all my data from this repository many times in the past. Clearly the repository was healthy then. When did it go bad?
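The rule behind this error can be sketched in a few lines of Python. This is only an illustration, assuming a local repository in which every file under the “data” directory is named after the SHA256 hex digest of its own contents. The function names are hypothetical and are not part of Restic.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the SHA256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatched_packs(data_dir):
    """List (file name, actual digest) pairs for files whose name is not their hash.

    Assumption: data_dir is the "data" directory of a local repository.
    """
    mismatched = []
    for path in sorted(Path(data_dir).rglob("*")):
        if not path.is_file():
            continue
        actual = sha256_of_file(path)
        if actual != path.name:
            mismatched.append((path.name, actual))
    return mismatched
```

Calling find_mismatched_packs on the “data” directory of a healthy repository should return an empty list. On a machine with faulty hardware, the result itself cannot be trusted.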

Mysteriously, the set of broken pack and blob IDs changed each time I ran the check! Why would a different set of files be corrupt each time? When I ran sha256sum on the flagged files, the hashes matched their names. This made no sense! At this point, I suspected that Restic had a bug, but I could not find anything wrong in the code.

The Culprit

Some online discussions hinted at the possibility of the hardware being at fault, and it struck me. I had built a new desktop PC in December of 2021. Firefox tabs had been crashing occasionally when playing videos, ever since I switched to this PC. I assumed that this had something to do with the drivers or the DRM plugin. The crashes were rare enough to not motivate me to dig deeper. Could a bad piece of hardware have been the issue all along?

I ran memtest86 and sure enough, it reported errors within seconds. The faulty RAM broke the SHA256 computation at random points during the checks, which is why different packs and blobs were flagged on each run. It may also have broken the encryption of any data backed up from the new desktop.
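One rough way to see this kind of non-determinism is to hash the same file several times and compare the results. The sketch below is only an illustration and no substitute for memtest86. The function name and the number of runs are my own assumptions.

```python
import hashlib


def distinct_digests(path, runs=5, chunk_size=1 << 20):
    """Hash the same file several times and return the set of digests seen.

    On healthy hardware the set has exactly one element. More than one
    element means that reading or hashing is not deterministic, which
    points at the machine rather than at the file on disk.
    """
    seen = set()
    for _ in range(runs):
        digest = hashlib.sha256()
        with open(path, "rb") as handle:
            for chunk in iter(lambda: handle.read(chunk_size), b""):
                digest.update(chunk)
        seen.add(digest.hexdigest())
    return seen
```

A single digest across a handful of runs proves little, since a rare fault may simply not have triggered. Several different digests for an unchanged file are a strong hint.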

The Fix

The PC has a dual-channel memory kit. I was able to isolate the fault to one of the sticks, by running memtest86 against each individually. I removed the faulty stick and ran restic check --read-data again a few times. It still flagged a few packs and blobs, but the set was consistent across runs. Checking the hashes manually confirmed that these files were indeed corrupt. I repaired the repository, by following this comment. I ran restic check --read-data again and everything looked good. Subsequently, I raised a warranty claim to replace the G.SKILL memory kit, through Acro Engineering Company.
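The reasoning about consistency across runs can be sketched as a set comparison: IDs flagged in every run are probably corrupt on disk, while IDs flagged only in some runs point at flaky hardware. This sketch assumes the check output has been saved as text, once per run, and that the error lines look like the two samples quoted earlier in this post. The regular expression and the function names are my own.

```python
import re

# Assumption: error lines contain "want <hex id>", as in the samples above.
WANT_ID = re.compile(r"\bwant ([0-9a-f]+)")


def flagged_ids(check_output):
    """Extract the expected pack and blob IDs from one run's saved output."""
    return set(WANT_ID.findall(check_output))


def split_stable_and_flaky(outputs):
    """Split flagged IDs into those seen in every run and those seen in only some."""
    runs = [flagged_ids(text) for text in outputs]
    if not runs:
        return set(), set()
    stable = set.intersection(*runs)
    flaky = set.union(*runs) - stable
    return stable, flaky
```

Anything in the stable set is worth verifying by hand before attempting a repair.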

It is possible that the faulty RAM corrupted not just the backups, but also any original file I created or modified on the desktop. I checked all the files I consider critical and none of them appear to have been affected. I still do not know whether some file that I did not check is corrupted. 🤷

Lessons Learnt

  1. Hardware can have bugs too, and it can fail in undetectable ways.
  2. Always run tests to verify that any new hardware is functioning correctly. If I had run memtest86 immediately after building the PC, I would not have corrupted the repository in the first place. I also could have conveniently requested a replacement from Amazon, without going through the warranty claim process.
  3. Partial restores are not sufficient to test recoverability. Perform full restores or deep integrity checks. If I had not run a deep check by chance, I might have found the corruption only during the next restore. That would have resulted in permanent data loss.
