Computing checksums for files

Checksums are a way to generate a fixed-size "fingerprint" of a file. They are used to verify that two copies of a file are the same without comparing the files themselves. For example, if we each have a copy of a large file, we can check that the copies are the same (with very high probability) by each running a command like the following and comparing the output:

$ sha512sum TMY-User-Manual.pdf 
2693e3bd683b3d7283c60b2c83e2e... TMY-User-Manual.pdf

(where we have elided the complete checksum because it is quite long).
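
As a minimal sketch of such a comparison, assuming GNU coreutils is available, the following creates two small demonstration files and compares only the checksum field of the sha512sum output. The names original.dat and copy.dat are made up for illustration; in practice the two copies would usually live on different machines and the checksums would be compared by eye or sent alongside the data.

```shell
# Hypothetical demonstration files in a scratch directory.
workdir=$(mktemp -d)
printf 'example data\n' > "$workdir/original.dat"
cp "$workdir/original.dat" "$workdir/copy.dat"

# sha512sum prints "<checksum>  <filename>"; keep only the checksum field.
sum_original=$(sha512sum "$workdir/original.dat" | cut -d ' ' -f 1)
sum_copy=$(sha512sum "$workdir/copy.dat" | cut -d ' ' -f 1)

# Identical contents give identical checksums.
if [ "$sum_original" = "$sum_copy" ]; then
    result="match"
else
    result="differ"
fi
echo "$result"
```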

There are several checksum algorithms. They vary in how expensive they are to compute and in how likely it is that two different files have the same checksum, which is called a "collision". CRC32 is not appropriate for this use because its 32-bit output gives a fairly high chance of collisions, although it is used to detect errors on some communication links. MD5 and SHA1 are quite old, and practical methods for constructing colliding files have been demonstrated for both. SHA256 and SHA512 are thought to be robust enough that the chance of a collision is essentially zero.

Our convention is to store SHA512 checksums in a file called SHA512.sums in the top-level directory of a dataset, together with a README.txt (or README.nfo or README.md) file. The actual data should be in a subdirectory.

To create a SHA512.sums file, suppose that the data is in a subdirectory called tmy. We would then run:

$ find ./tmy -type f -print0 | xargs -0 sha512sum > SHA512.sums

The first part of that command, with find, descends into the ./tmy directory looking for regular files (-type f). It will skip directories and any special files. It then prints the filenames that it finds to the standard output. The reason for -print0 as opposed to -print is to correctly handle files with spaces, quotes, newlines, or other special characters in their names. It does this by using a NUL character (a zero byte) as the delimiter. This is safe because a NUL byte cannot appear in a filename on Unix-like systems. Names containing newlines or other control characters are possible, but rare, and should be corrected before packaging and distribution of the data.
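
To see why the delimiter matters, here is a small sketch, assuming GNU findutils and coreutils, that builds a scratch tmy directory with three deliberately awkward (and entirely made-up) filenames and counts the entries that find reports under each delimiter.

```shell
# Scratch directory with hypothetical, deliberately awkward filenames.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
mkdir tmy
printf 'x\n' > "tmy/plain.txt"
printf 'x\n' > "tmy/with space.txt"
printf 'x\n' > "tmy/$(printf 'line\nbreak').txt"   # name contains a newline

# Newline-delimited output: the third name is split across two lines,
# so counting lines overstates the number of files.
by_newline=$(find ./tmy -type f -print | grep -c '')

# NUL-delimited output: keep only the NUL terminators and count them,
# which gives exactly one per file.
by_nul=$(find ./tmy -type f -print0 | tr -cd '\0' | wc -c | tr -d ' ')

echo "lines with -print: $by_newline, names with -print0: $by_nul"
```

With -print, a tool reading the list line by line would see four entries for three files; with -print0 the count is correct.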

The second part of that command, with xargs, reads filenames from its standard input, separated by a NUL character (-0), and runs the sha512sum command on them. The output is then redirected and saved in the SHA512.sums file.
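
Once a SHA512.sums file exists, it can be used to check a copy of the dataset. The sketch below assumes GNU coreutils, whose sha512sum accepts -c to re-read every file listed in a checksum file and compare the results, and --quiet to suppress the per-file OK lines. The directory layout and filenames are invented for the demonstration.

```shell
# Build a small hypothetical dataset in a scratch directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
mkdir -p tmy/sub
printf 'a\n' > "tmy/station one.csv"
printf 'b\n' > tmy/sub/notes.txt

# Same pipeline as above: one checksum line per regular file.
find ./tmy -type f -print0 | xargs -0 sha512sum > SHA512.sums
lines=$(grep -c '' SHA512.sums)

# Verify: run from the directory that holds SHA512.sums, because the
# recorded paths (./tmy/...) are relative to it.
if sha512sum -c SHA512.sums; then verified=yes; else verified=no; fi

# Simulate corruption and verify again; the exit status becomes non-zero.
printf 'changed\n' >> tmy/sub/notes.txt
if sha512sum -c --quiet SHA512.sums 2>/dev/null; then
    after_change=ok
else
    after_change=failed
fi
echo "before: $verified, after modification: $after_change"
```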

There are similar commands for computing the other kinds of checksum that are used in exactly the same way: sha256sum, and even sha1sum or md5sum for situations where those have been used by someone else. Checksums made with sha256sum should be stored in a file called SHA256.sums and similarly with the others.
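
The naming convention can be captured in a small helper. This is only a sketch: make_sums is a hypothetical function, not part of any package, and it assumes the GNU coreutils commands named above are installed.

```shell
# make_sums ALGO DIR: hypothetical helper that writes ALGO.sums
# (for example SHA256.sums) for all regular files under DIR.
make_sums() {
    algo=$1       # sha512, sha256, sha1 or md5
    datadir=$2    # for example ./tmy
    # Upper-case the algorithm name to follow the file naming convention.
    outfile="$(printf '%s' "$algo" | tr '[:lower:]' '[:upper:]').sums"
    find "$datadir" -type f -print0 | xargs -0 "${algo}sum" > "$outfile"
}

# Demonstration on a scratch dataset with a made-up file.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
mkdir tmy
printf 'a\n' > tmy/data.csv

make_sums sha256 ./tmy   # writes SHA256.sums
make_sums md5 ./tmy      # writes MD5.sums
ls ./*.sums
```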