TIL: Locality Sensitive Hashing

I’ve been planning on writing a longer series of post about hash functions to learn more about how they work and what goes into designing
good hash functions.

During my research I stumbled upon something called Locality Sensitive Hashing. If you’ve used a hash function before or know something about them, you probably know that changing the input to a hash function by just a little bit produces a wildly different result. Take for example the sha256 hash function with the string: “Hash functions”.

The output of this is: 5a2dba974930153b12b69e46023e73bd09dcdc8216fe2960a991791389f4519c

But if we just change one letter so that we have the string “Hash functions” we get the output: 457321e76142b2fcb0147ceb8233f4efc7f1d0154c1b1aa48778f905f7de7bc0

The idea behind a locality sensitive hash function is that data that are very similar would hash to similar outputs. This would mean that the example above would with high probability hash to two close hash digests. There are several applications where using hash functions with this property are useful such as malware fingerprinting and similarity between documents.

To read more about this quite interesting subject I would recommend both the wikipedia page: https://en.wikipedia.org/wiki/Locality-sensitive_hashing and this section from Mining of Massive Datasets: http://infolab.stanford.edu/~ullman/mmds/ch3a.pdf