This finally happened: a filesystem in a RAID array got corrupted due to a faulty SATA expansion card. Please repeat after me “RAID is not a backup!”.
Luckily, the data on the array was regularly backed up with restic – an excellent open-source backup solution that supports a great selection of storage back-ends. Therefore, at most, I might lose the files added or modified in the few days since the last backup.
The only problem is the time to restore. The backup spans more than 6 TiB of data, and pulling that down from the cloud could take weeks on a typical residential connection. On the bright side, most of the files are still accessible on the corrupted filesystem. Therefore, if there is a way to establish trust in the data I have, the amount of data to download from the cloud will not be very large.
The naive approach
If you copy the remaining data over to a clean filesystem and run
rsync -avhW --no-compress --progress /backup-mount-location /new_location
you will update all files whose metadata differs, but files with identical metadata and silent corruption in their data will remain in place.
The next step could be to run
restic -r rest:http://localhost:8080/ -v restore latest --target /new_location --host nas
where http://localhost:8080/ is served by rclone. That would eventually eliminate all silent data corruption, but, in my experience, it also leads to restic rewriting files that have no data corruption at all. Therefore, this is almost equivalent to restoring into an empty location, with the corresponding time penalty.
A more advanced approach
restic-provided metadata
restic provides a list of the files in a snapshot by running
restic -r rest:http://localhost:8080/ ls latest
which can also be produced in JSON format with the --json flag:
restic -r rest:http://localhost:8080/ ls --json latest
with a sample output being:
{
  "name": "weather.png",
  "type": "file",
  "path": "/home/alexandr/temp/weather.png",
  "uid": 1000,
  "gid": 1000,
  "size": 32396,
  "mode": 436,
  "permissions": "-rw-rw-r--",
  "mtime": "2020-04-07T18:03:41.786871542-04:00",
  "atime": "2020-04-07T18:03:41.786871542-04:00",
  "ctime": "2021-08-16T00:12:30.711094917-04:00",
  "struct_type": "node"
}
Unfortunately, no hash information is included in this output, and, what's worse, restic does not store a single whole-file hash for larger files at all. This manual provides a good explanation of restic's internal data storage structure.
One can explore repository objects directly and find that files are represented by nodes that refer to the files' binary data via ordered lists of SHA-256 hashes. That gets us only halfway there: large files are split into multiple blobs, so we also need to know the lengths of those blobs. This information is available in the repo's index files.
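To get a feel for what the index provides, here is a small Python sketch. The structure is simplified from restic's design documentation; the pack ID and offset are made up, while the blob hash and length match the sample output shown below. One assumption worth stating: in this pre-compression repository format, the length recorded in the index appears to be the packed length, i.e. the plaintext chunk length plus restic's 32-byte per-blob encryption overhead (nonce plus MAC), which is why it is 32 bytes larger than the file data it covers.

import json

# Simplified shape of a decoded restic index file (see the design docs).
# The pack ID and offset are made up; the blob entry matches the sample below.
index = json.loads("""
{
  "packs": [
    {
      "id": "feedcafe0123...",
      "blobs": [
        { "id": "844c1c113aea5e0096f6803e14f2493222a3a16975e8e1e4cb5ebc51b5b02fd0",
          "type": "data", "offset": 0, "length": 32428 }
      ]
    }
  ]
}
""")

# Map blob ID -> stored length. Combined with a node's ordered "content" list,
# this tells us how to cut a local file back into the chunks restic hashed.
blob_length = {
    blob["id"]: blob["length"]
    for pack in index["packs"]
    for blob in pack["blobs"]
    if blob["type"] == "data"
}
print(blob_length)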
This GitHub issue describes how to obtain the lengths of those blobs. I compiled a modified version of restic that prints the chunks' hashes and sizes in the JSON output of the ls command. The output now looks like this:
{
  "name": "weather.png",
  "type": "file",
  "path": "/home/alexandr/temp/weather.png",
  "uid": 1000,
  "gid": 1000,
  "size": 32396,
  "mode": 436,
  "permissions": "-rw-rw-r--",
  "mtime": "2020-04-07T18:03:41.786871542-04:00",
  "atime": "2020-04-07T18:03:41.786871542-04:00",
  "ctime": "2021-08-16T00:12:30.711094917-04:00",
  "content": [
    "844c1c113aea5e0096f6803e14f2493222a3a16975e8e1e4cb5ebc51b5b02fd0"
  ],
  "contentsize": [
    32428
  ],
  "struct_type": "node"
}
A Python tool to compare chunk hashes
The next step is to generate a list of all files in the snapshot, in JSON format, together with their chunk hashes and sizes:
restic -r rest:http://localhost:8080/ ls --json latest > snapshot.json
I created the restic-hashdiff tool, which processes this list and prints out every file whose size differs or whose chunk hashes do not match. It is used as follows:
python3 -m restic_hashdiff snapshot.json
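For reference, the gist of the comparison can be sketched in a few lines of Python. This is a simplified illustration, not the actual restic-hashdiff code; it assumes the patched ls --json output is newline-delimited (one JSON object per line), that the copied data lives under a placeholder root such as /new_location, and that each contentsize entry is the packed blob length described earlier, i.e. 32 bytes larger than the plaintext chunk.

import hashlib
import json
import sys

# Assumption: the stored blob length includes a 32-byte nonce + MAC overhead,
# so the plaintext chunk length is the stored length minus 32.
CRYPTO_OVERHEAD = 32

def file_matches(local_path, blob_ids, packed_sizes):
    """Hash the local file chunk by chunk and compare against the expected blob IDs."""
    try:
        with open(local_path, "rb") as f:
            for blob_id, packed_len in zip(blob_ids, packed_sizes):
                chunk = f.read(packed_len - CRYPTO_OVERHEAD)
                if hashlib.sha256(chunk).hexdigest() != blob_id:
                    return False
            return f.read(1) == b""  # no unexpected trailing data
    except OSError:
        return False  # missing or unreadable files need a restore anyway

def main(snapshot_json, root="/new_location"):
    with open(snapshot_json) as f:
        for line in f:
            if not line.strip():
                continue
            node = json.loads(line)
            if node.get("struct_type") != "node" or node.get("type") != "file":
                continue  # skip the snapshot header, directories, symlinks, ...
            if not file_matches(root + node["path"], node.get("content", []),
                                node.get("contentsize", [])):
                print(node["path"])

if __name__ == "__main__":
    main(sys.argv[1])

Only the paths printed by such a comparison then need to be fetched from the backup.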
Outcome
After about 16 hours of reading and comparing data, the resulting list of files with silent data corruption contained far less than 1% of the total number of files in the snapshot. This cut the restore time from weeks to less than a day.