TIL: Locality Sensitive Hashing
I’ve been planning on writing a longer series of post about hash functions to learn more about how they work and what goes into designing
good hash functions.
During my research I stumbled upon something called Locality Sensitive Hashing. If you’ve used a hash function before or know something about them, you probably know that changing the input to a hash function by just a little bit produces a wildly different result. Take for example the sha256 hash function with the string: “Hash functions”.
The output of this is:
But if we just change one letter so that we have the string “Hash functions” we get the output:
The idea behind a locality sensitive hash function is that data that are very similar would hash to similar outputs. This would mean that the example above would with high probability hash to two close hash digests. There are several applications where using hash functions with this property are useful such as malware fingerprinting and similarity between documents.
To read more about this quite interesting subject I would recommend both the wikipedia page: https://en.wikipedia.org/wiki/Locality-sensitive_hashing and this section from Mining of Massive Datasets: http://infolab.stanford.edu/~ullman/mmds/ch3a.pdf