Why do we trust Crypographic Hashes for Uniqueness
A cryptographic hash is a way to get a unique digest of an input. As you can see in the example above, the digest is always the same size. In the picture above, the input changes and the input can be as large as it wants up to millions and billions of letters and the digest will always be the same.
To give a bit of a metaphor to it, suppose you want to a hash for a book. A possible hash might be the first 1000 letters in the book. You could probably identify a book by the first 1000 letters.
Let's go with this example and see how useful such a hash would be:
The ideal cryptographic hash function has four main or significant properties:
- it is easy to compute the hash value for any given message
- Grabbing the first 1000 letters is pretty easy
- it is infeasible to generate a message that has a given hash
- It is too easy... just make sure the first 1000 letters are the same
- it is infeasible to modify a message without changing the hash
- This is too easy. Just make the 1000 letters the same as the hash
- it is infeasible to find two different messages with the same hash
- It is unlikely to find two books with the same first 1000 letters
UniquenessI'd like to focus on the uniqueness aspect of a hash. It has applications in many areas.
That is to say using hash functions to identify a file. This can be used for both identification and verification purposes.
For example, let's say you have a file, mylife.doc, that you are sharing on the internet. It is shared in lots of places. Your friend downloads that file. How can you friend tell if the file downloaded properly or someone hasn't put a virus on the file?
You could use a hash function like MD5 of SHA-1. So you tell your friend that mylife.doc has an MD5 checksum of 79054025255fb1a26e4bc422aef54eb4. After he downloads the file, he finds the MD5 checksum of the file he downloaded. If they match, he can be sure the file downloaded properly and didn't have any viruses or anything in the file.
However, cryptographic hashes are not perfect. There's always a possibility that a malicous person could find a way to trick the algorithm. For example, perhaps a malicious person finds a way to insert a virus into mylife.doc and after calculating the MD5, it is the same as the original mylife.doc?
And indeed, many hash functions have been cracked; others are just imperfect and messages will provide the same hash. The term for this is they will collide.
Why Only Rely on the HashA big question is why only rely on the Hash? Indeed, I want to focus not on the actual hash function, but more on the applications used.
MD5 checksums are used all the time to verify the integrity of a file.
The popular GIT source control system used SHA1 to uniquely identify files.
A lot of people become concerned at some point that they will, by random happenstance, have two objects in their repository that hash to the same SHA-1 value. What then?
If you do happen to commit an object that hashes to the same SHA-1 value as a previous object in your repository, Git will see the previous object already in your Git database and assume it was already written. If you try to check out that object again at some point, you’ll always get the data of the first object.
However, you should be aware of how ridiculously unlikely this scenario is. The SHA-1 digest is 20 bytes or 160 bits. The number of randomly hashed objects needed to ensure a 50% probability of a single collision is about 2^80 (the formula for determining collision probability is
p = (n(n-1)/2) * (1/2^160)). 2^80 is 1.2 x 10^24 or 1 million billion billion. That’s 1,200 times the number of grains of sand on the earth.
In all these cases, there are potential collisions; both intentional and unintentional.
What is perculair is just how must trust the designers of such schemes give to the hash function. I'm not saying the hash function is a problem. It is just that as an engineer, you try to expect that certain things can fail even if you don't expect them to, and you have backups in place.
Some Enhancements to the ApplicationsThere are many things GIT or file integrity systems could do instead of just using a hash functions.
They could include the length of the file in addition to the hash. This would ensure nothing could be added/removed from the file and still give the same hash. It eliminates any possibility of something being added/removed from the file (within file length modulus). The only avenue open is to modify the file.
In the case of GIT, accidental collisions are the bigger problem. One could include the file name as part of the hash as well as the size.
So instead of GIT or file integrity systems using a hash that looks like this:
They'd use one that looks like this
first 10 letters of filename|file length|hash
It could end up looking like this:
There are a lot of variations that could be done. It's just a thought.