Welcome to Blockhash.
The 256-bit hashes that Blockhash generates are designed to be nearly unique for images, even after an image has been rescaled. The Hamming distance between two hashes (the number of bits that differ) indicates how far apart two images are, with single-digit values generally giving a good indication that the images are identical, even if they are of different sizes.
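As a minimal sketch of the comparison described above, the Hamming distance between two hashes can be computed by XOR-ing them and counting the set bits. This assumes the hashes are available as hexadecimal strings of equal length; the names hamming_distance, is_probable_match and MATCH_THRESHOLD are made up for this illustration and are not part of Blockhash.

```python
# Hypothetical cut-off: "single-digit" distances are treated as a match.
MATCH_THRESHOLD = 10


def hamming_distance(hash_a: str, hash_b: str) -> int:
    """Count the bits that differ between two hex-encoded hashes."""
    if len(hash_a) != len(hash_b):
        raise ValueError("hashes must have the same length")
    # XOR leaves a 1 bit wherever the two hashes disagree.
    return bin(int(hash_a, 16) ^ int(hash_b, 16)).count("1")


def is_probable_match(hash_a: str, hash_b: str) -> bool:
    """Return True when the distance is a single-digit value."""
    return hamming_distance(hash_a, hash_b) < MATCH_THRESHOLD


if __name__ == "__main__":
    a = "0" * 64                # 256 bits, all zero
    b = "0" * 63 + "7"          # same hash with the lowest three bits set
    print(hamming_distance(a, b), is_probable_match(a, b))
```

Converting to integers keeps the sketch short; for large batches of hashes a bitwise comparison on raw bytes would likely be faster.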
Use case parameters
We've designed Blockhash with the following use case parameters in mind:
- Identifying derivative works is less important than identifying verbatim re-use (i.e., the algorithm doesn't need to match images which have been manipulated beyond resizing and format changes)
- False positives (different images whose blockhashes differ by fewer than 10 bits) should be kept to an absolute minimum and occur in no more than 1 out of 10,000 images in a random test set.
- Blockhashes of the same image at sizes ranging from the original (above 640 pixels wide) down to thumbnail (100 pixels wide) should differ by no more than 10 bits in 95% of cases.
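The last criterion above can be checked with a few lines of code. The sketch below assumes you have already computed the Hamming distance between each original image's hash and its thumbnail's hash; the function name fraction_within and the sample distances are invented for illustration and are not measurements.

```python
def fraction_within(distances, max_bits=10):
    """Return the share of distances that are at most max_bits."""
    if not distances:
        raise ValueError("need at least one distance")
    within = sum(1 for d in distances if d <= max_bits)
    return within / len(distances)


if __name__ == "__main__":
    # Made-up original-vs-thumbnail distances, one per image.
    sample = [0, 2, 4, 12]
    share = fraction_within(sample)
    print(f"{share:.0%} of pairs differ by 10 bits or fewer")
    print("meets the 95% target" if share >= 0.95 else "below the 95% target")
```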
For images in general, the algorithm generates the same blockhash value for two different images in 1% of the cases (data based on a random sampling of 100,000 images).
For photographs, the algorithm generates practically unique blockhashes, but for icons, clipart, maps and other images, it generates less distinctive blockhashes. Large areas of a single color in an image, whether as background or borders, result in hashes that collide more frequently.
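To see why flat areas of color lead to collisions, consider a deliberately simplified block hash. This is a toy, not the algorithm Blockhash actually uses: it assumes a grayscale image given as a list of rows, splits it into a 16 x 16 grid, and sets a bit when a block's brightness sum is above the median of all blocks. The name toy_block_hash is hypothetical.

```python
from statistics import median


def toy_block_hash(pixels, grid=16):
    """Hash a grayscale image (list of equal-length rows) to grid*grid bits.

    Simplified illustration only; not the real Blockhash algorithm.
    """
    height, width = len(pixels), len(pixels[0])
    if height % grid or width % grid:
        raise ValueError("image sides must be multiples of the grid size")
    block_h, block_w = height // grid, width // grid

    # Sum the brightness inside every block, row by row.
    sums = []
    for by in range(grid):
        for bx in range(grid):
            total = 0
            for y in range(by * block_h, (by + 1) * block_h):
                total += sum(pixels[y][bx * block_w:(bx + 1) * block_w])
            sums.append(total)

    # One bit per block: 1 if the block is brighter than the median block.
    mid = median(sums)
    value = 0
    for s in sums:
        value = (value << 1) | (1 if s > mid else 0)
    return f"{value:0{grid * grid // 4}x}"


if __name__ == "__main__":
    white = [[255] * 32 for _ in range(32)]
    gray = [[128] * 32 for _ in range(32)]
    gradient = [[x * 8 for x in range(32)] for _ in range(32)]
    print(toy_block_hash(white) == toy_block_hash(gray))      # flat images collide
    print(toy_block_hash(white) == toy_block_hash(gradient))  # varied image differs
```

In this toy version, every block of a uniformly colored image equals the median, so no bit is set regardless of the color. Images dominated by one flat background lose most of their distinguishing bits in the same way.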
You can try generating a few hashes by fetching the Python version and running it on your images:
$ git clone https://github.com/commonsmachinery/blockhash-python
$ blockhash-python/blockhash.py
We're working on an RFC to describe in detail the modified algorithm we're using. You can follow the progress on GitHub.
File an issue in the respective GitHub repository, find our developers goodoldshep & artfwo on IRC in channel #commonsmachinery on FreeNode, or mail email@example.com