Tool to detect (and get rid of) similar images using perceptual hashing (pHash lib)

mk-fg. Last update: Jun 05, 2022

image-deduplication-tool: simple tool to detect (and get rid of) similar images using perceptual hashing

There's gonna be a lot of duplicates in almost any arbitrary collection of images, and it can actually be surprising how many.

pHash lib, which is the core of the tool, easily detects cropped and retouched images, or the same image in different resolutions and formats.

The tool goes over the specified paths, calculating hashes for all the images there, and pickles them into a db file between runs (to save a lot of time on re-calculating all of them). Then it just compares the hashes, showing the closest results first.
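
A minimal sketch of that caching idea, assuming the db is a single pickled dict that maps path to hash value. The function names and the `hash_func` parameter are hypothetical and are not taken from the script:

```python
import pickle

def load_hashes(db_path):
    """Return the {path: hash} dict stored in db_path, or {} if the file is missing."""
    try:
        with open(db_path, 'rb') as src:
            return pickle.load(src)
    except FileNotFoundError:
        return {}

def update_hashes(db_path, image_paths, hash_func):
    """Hash only the paths missing from the cache, then write the cache back."""
    hashes = load_hashes(db_path)
    for path in image_paths:
        if path not in hashes:
            # hash_func stands in for the real perceptual hash call
            hashes[path] = hash_func(path)
    with open(db_path, 'wb') as dst:
        pickle.dump(hashes, dst)
    return hashes
```

On a second run over the same paths, `hash_func` is not called at all, which is where the time saving comes from.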

pHash lib seems to be able to utilize multiple CPU cores for hashing when built with the openmp flag, but that didn't seem to work for me, so I put a much simpler solution in place to scale the task: just forking a worker pid for each hardware thread.
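
One way to get a similar effect is sketched below. It uses `multiprocessing.Pool` rather than forking pids by hand, so it is a swapped-in technique and not necessarily what the script does. `hash_image` is a hypothetical stand-in that returns a CRC32 of the path instead of calling pHash:

```python
import multiprocessing
import zlib

def default_workers():
    """Mirror the documented default: try cpu_count(), fall back to 1."""
    try:
        return multiprocessing.cpu_count()
    except NotImplementedError:
        return 1

def hash_image(path):
    """Stand-in for the real perceptual hash call (not a perceptual hash)."""
    return path, zlib.crc32(path.encode())

def hash_all(paths, workers=None):
    """Hash paths in worker processes and return a {path: hash} dict."""
    with multiprocessing.Pool(workers or default_workers()) as pool:
        return dict(pool.imap_unordered(hash_image, paths))

# Call hash_all() from under an `if __name__ == '__main__':` guard,
# which is required on platforms that spawn workers instead of forking.
```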

Optionally, the tool can start the handy feh viewer, where a human can decide to remove one image version or the other (with a pre-configured "rm" action) for each duplicate pair, skip to the next pair, or stop the comparison.

Warning

As illustrated in #1 and CImg#49, libpHash/CImg will fall back to using potentially unsafe (exploitable with crafted pathnames) "sh -c" commands for non-image file formats and might not get filename-escaping right there (especially with CImg versions up to 1.5.3).

A simple safeguard against that particular issue is to run the tool only on image paths (where CImg doesn't run "sh"), not on paths that contain mixed-type files, or at least to make sure there's no funky stuff in the filenames. The script doesn't enforce any kind of policy there.
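
Since the script enforces nothing here, a pre-flight check can be run separately. The sketch below is not part of the tool, and the allow-list is an assumption (ASCII letters, digits, space, dot, underscore, dash), so it will also flag harmless non-ASCII names:

```python
import re
from pathlib import Path

# Conservative allow-list, an assumption to be tuned for the collection at hand
SAFE_PART = re.compile(r'[A-Za-z0-9 ._-]+')

def suspicious_paths(root):
    """Yield files under root with any path component outside the allow-list."""
    root = Path(root)
    for path in sorted(root.rglob('*')):
        if not path.is_file():
            continue
        parts = path.relative_to(root).parts
        if any(not SAFE_PART.fullmatch(part) for part in parts):
            yield path
```

Anything it yields is worth renaming or excluding before pointing the tool at that directory.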

Note also that the thing libpHash/CImg (usually) runs is ImageMagick's "convert", which can have all sorts of issues with malicious file contents (see e.g. the ImageTragick bug), so maybe it's not a good idea to run the tool on a bunch of unsanitized images, ever.

One other precaution: with the --feh option, the script will run the "feh" program, and the --feh-args parameter may contain options (e.g. --info) that will be executed in the shell by feh, so either don't use --feh with weird and/or possibly-malicious filenames, or at least remove the --info option from the --feh-args command line.

Requirements

Usage

Just run as e.g. ./image_matcher.py --feh ~/media/images.

% ./image_matcher.py -h
usage: image_matcher.py [-h] [--hash-db PATH] [-d [PATH]] [-p THREADS]
                        [-n COUNT] [--feh] [--feh-args CMDLINE] [--debug]
                        paths [paths ...]

positional arguments:
  paths                 Paths to match images in (can be files or dirs).

optional arguments:
  -h, --help            show this help message and exit
  --hash-db PATH        Path to db to store hashes in (default:
                        ./image_matcher.db).
  -d [PATH], --reported-db [PATH]
                        Record already-displayed pairs in a specified file and
                        dont show these again. Can be specified without
                        parameter to use "reported.db" file in the current dir.
  -p THREADS, --parallel THREADS
                        How many hashing ops can be done in parallel (default:
                        try cpu_count() or 1).
  -n COUNT, --top-n COUNT
                        Limit output to N most similar results (default: print
                        all).
  --feh                 Run feh for each image match with removal actions
                        defined (see --feh-args).
  --feh-args CMDLINE    Feh commandline parameters (space-separated, unless
                        quoted with ") before two image paths (default: -GNFY
                        --info "echo '%f %wx%h (diff: {diff}, {diff_n} /
                        {diff_count})'" --action8 "rm %f" --action1 "kill -INT
                        {pid}", only used with --feh, python-format keywords
                        available: path1, path2, n, pid, diff, diff_n,
                        diff_count)
  --debug               Verbose operation mode.

feh can be customized to do any action or show any kind of info alongside images with the --feh-args parameter. It's also possible to make it show images side-by-side in montage mode or in separate windows in multiwindow mode, see "man feh" for details.

Default feh command line:

feh -GNFY --info "echo '%f %wx%h (diff: {diff}, {diff_n} / {diff_count})'" --action8 "rm %f" --action1 "kill -INT {pid}" {path1} {path2}

makes it show a fullscreen image with some basic info about it (along with the difference between the image hashes and how many images there are with the same level of difference) and an action reference. Pressing "8" there will remove the currently displayed version, "1" will stop the comparison, and quitting feh ("q") will go to the next pair.
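
To illustrate how such a template could be turned into an argument list, here is a rough sketch. It assumes the template is split with `shlex` first and each argument is then passed through `str.format`. That order is a guess and the function name is hypothetical, so the script may do it differently:

```python
import os
import shlex

# Default template, copied from the --feh-args help text
DEFAULT_FEH_ARGS = (
    '-GNFY --info "echo \'%f %wx%h (diff: {diff}, {diff_n} / {diff_count})\'" '
    '--action8 "rm %f" --action1 "kill -INT {pid}"'
)

def feh_command(path1, path2, diff, diff_n, diff_count, n=1,
                feh_args=DEFAULT_FEH_ARGS):
    """Build the feh argv: formatted template args followed by the two paths."""
    keys = dict(path1=path1, path2=path2, n=n, pid=os.getpid(),
                diff=diff, diff_n=diff_n, diff_count=diff_count)
    # Split on spaces (honouring quotes), then fill keywords per argument
    args = [arg.format(**keys) for arg in shlex.split(feh_args)]
    return ['feh', *args, path1, path2]
```

Splitting before formatting means that spaces or quotes in the substituted values cannot change how the template is broken into arguments. Whatever ends up in --info or --action is still run through a shell by feh itself, so the warning above applies either way.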

Without --feh (non-interactive / non-gui mode), the script outputs pairs of images and the integer Hamming distance between their perceptual hash values (basically the degree of difference between the two).

Output is sorted by this "distance", so the most similar images (with the lowest number) come first (see the --top-n parameter).
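
For integer hash values, the Hamming distance is the number of differing bits, i.e. the count of set bits in the XOR of the two values. A small sketch of the comparison and sorting, with hypothetical function names and hashes assumed to be plain Python ints:

```python
from itertools import combinations

def hamming(hash1, hash2):
    """Number of bit positions where the two integer hashes differ."""
    return bin(hash1 ^ hash2).count('1')

def closest_pairs(hashes, top_n=None):
    """Return (distance, path1, path2) tuples, most similar first."""
    pairs = sorted(
        (hamming(h1, h2), p1, p2)
        for (p1, h1), (p2, h2) in combinations(sorted(hashes.items()), 2)
    )
    # top_n=None keeps everything, like the default for --top-n
    return pairs[:top_n]
```

Note that this compares every image with every other one, so the number of pairs grows quadratically with the size of the collection.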

The optional --reported-db (or "-d") parameter allows efficient skipping of already-reported "similar" image pairs by recording these in a dbm file. The intended usage for this option is to avoid repeating the same hash-similar pairs on repeated runs, reporting similarity for new images instead.
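
The skipping logic could look roughly like the sketch below, which uses the standard `dbm` module. The key format (both paths sorted and joined with a NUL byte) and the function names are assumptions, not the script's actual on-disk format:

```python
import dbm

def pair_key(path1, path2):
    """Order-independent key for a pair of paths."""
    return '\0'.join(sorted([path1, path2])).encode()

def unreported(pairs, db_path):
    """Yield only pairs not yet recorded in db_path, recording them as it goes."""
    with dbm.open(db_path, 'c') as db:
        for diff, path1, path2 in pairs:
            key = pair_key(path1, path2)
            if key in db:
                continue  # already shown on an earlier run
            db[key] = b'1'
            yield diff, path1, path2
```

Sorting the two paths before building the key makes (a, b) and (b, a) count as the same pair.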

Operation

Script does these steps, in order:

  • Try to load pre-calculated image hash values from --hash-db file.

  • Calculate missing perceptual hash values (ph_dct_imagehash) for each image found, possibly in multiple subprocesses.

  • Dump (pickle) produced hash values (back) to a --hash-db file.

  • Calculate the difference between hashes of each image pair for all two-image combinations, sorting the results.

  • Print (or run "feh" on) each found image-pair, in most-similar-first order, optionally skipping pairs matching those in --reported-db file.

It's fairly simple, with all the magic and awesomeness in the calculation of those "perceptual hash" values, which is contained in libpHash.

Known Issues

pHash seems to be prone to hanging indefinitely on some non-image files without consuming much resources. Use ./image_matcher.py --debug -p 1 to see which exact file it hangs on in such cases. Might add some check for file magic to see if it's an image before running pHash over it in the future.
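
Such a file-magic check is not in the tool, but a sketch of the idea might look like this. It only knows the leading signatures of a few common formats (JPEG, PNG, GIF, BMP), and the names are hypothetical:

```python
# Leading byte signatures for a few common image formats
IMAGE_MAGIC = {
    b'\xff\xd8\xff': 'jpeg',
    b'\x89PNG\r\n\x1a\n': 'png',
    b'GIF87a': 'gif',
    b'GIF89a': 'gif',
    b'BM': 'bmp',
}

def image_type(path):
    """Return a format name if the file starts with a known signature, else None."""
    with open(path, 'rb') as src:
        head = src.read(16)
    for magic, name in IMAGE_MAGIC.items():
        if head.startswith(magic):
            return name
    return None
```

Files for which `image_type()` returns None could then be skipped before hashing. A matching signature does not prove the rest of the file is a well-formed image, so this narrows the problem rather than ruling out hangs.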

pHash also gives zero as a hash value for some images. No idea why it does that atm, but these "0" values obviously can't be meaningfully compared to anything, so the tool skips them, issuing a log message (seen only with --debug).
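
The skip itself is simple. A sketch, with a hypothetical logger name and message wording:

```python
import logging

log = logging.getLogger('image_matcher')

def drop_zero_hashes(hashes):
    """Return a copy of {path: hash} without the entries whose hash is 0."""
    usable = {}
    for path, value in hashes.items():
        if value == 0:
            # Only visible when the logger level is set to DEBUG
            log.debug('Skipping zero hash value for path: %s', path)
            continue
        usable[path] = value
    return usable
```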
