Option to identify duplicate images in files that are not Binary Duplicates

Started by Tveloso, September 28, 2020, 04:46:00 AM


Tveloso

IMatch allows us to automatically add files to categories at Ingest, including when files are found to be duplicates of files already in the database.  However, when the existing files have had Metadata updates, the newly indexed files, that are their duplicates, will not be detected as such, and will not be added to the configured "duplicates" category (because the files are no longer Binary Duplicates).

This FR requests the ability to still identify these "duplicate files", at Indexing time, perhaps by calculating and storing a separate checksum over the pixel data only, or by incorporating some aspect of the existing Visually Similar Search functionality, during Indexing, to identify "Potential Duplicates".
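
To illustrate the idea of a separate checksum over the pixel data only, here is a minimal sketch. The `ImageFile` class is a hypothetical stand-in for a real file (in practice the pixel bytes would come from an image decoder), and SHA-256 is an assumption, not something IMatch is known to use.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class ImageFile:
    """Hypothetical stand-in for a file on disk: a metadata block plus pixel data."""
    metadata: bytes
    pixels: bytes


def binary_checksum(f: ImageFile) -> str:
    # Covers every byte of the file, so any metadata write-back changes it.
    return hashlib.sha256(f.metadata + f.pixels).hexdigest()


def pixel_checksum(f: ImageFile) -> str:
    # Covers only the pixel data, so metadata edits leave it unchanged.
    return hashlib.sha256(f.pixels).hexdigest()


# Same pixels, different metadata (e.g. a title was written back to the file).
original = ImageFile(metadata=b"title=", pixels=bytes([10, 20, 30, 40]))
tagged = ImageFile(metadata=b"title=Beach", pixels=bytes([10, 20, 30, 40]))

print(binary_checksum(original) == binary_checksum(tagged))  # False
print(pixel_checksum(original) == pixel_checksum(tagged))    # True
```

The two files are no longer Binary Duplicates, but the pixel-only checksum still matches.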
--Tony

sinus

If this is easily possible, it would of course be a good idea, and I would give it a +1.
Sometimes I have binary duplicates ... although my wife does not produce them  ;) ... but me.  8)
Best wishes from Switzerland! :-)
Markus

Tveloso

Thank you Markus

I suppose I shouldn't blame all the duplicates on my wife.  :)

Tony
--Tony

Mario

I have given this some (new) thoughts.
This feature was on my long-term to-do list for a long time.

Re-reading all images to build something like a special index is out of the question.

My first experiment was to use a hash (a value derived mathematically from a data set) on the thumbnail.
This did not work, because sometimes thumbnails for the same image differ, due to slight differences in the version of the image processor used to generate the thumbnail (e.g., WIC vs. JPG, different versions of the image libraries used).
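
A small sketch of why a plain hash over the thumbnail is fragile. The two byte strings are hypothetical 2x2 grayscale thumbnails, and SHA-256 is an assumed hash function (the post does not say which hash was tried). A difference of one brightness step in one pixel is enough to produce a different hash.

```python
import hashlib

# Two hypothetical 2x2 grayscale thumbnails of the same image, produced by
# different image processors; one pixel differs by a single brightness step.
thumb_a = bytes([120, 121, 200, 201])
thumb_b = bytes([120, 121, 200, 202])

hash_a = hashlib.sha256(thumb_a).hexdigest()
hash_b = hashlib.sha256(thumb_b).hexdigest()

# Largest per-pixel difference between the two thumbnails.
max_delta = max(abs(a - b) for a, b in zip(thumb_a, thumb_b))

print(max_delta)         # 1
print(hash_a == hash_b)  # False
```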

Then I re-evaluated the visual query data IMatch produces to find visually similar images, and gave this some thought.
And with a handful of new algorithms, some additions to the database (automatic migration when the database is opened for the first time) etc. I was able to implement this.

The Search menu now has two commands to find "dupes": "Search Copies" (binary identical files) and "Search Duplicates" (which finds 100% copies based on the image data alone).
The Duplicate search is insensitive to metadata changes or other changes in the file; it looks only at the image itself and finds all copies.
Note: It does not find copies where the image has been resized or otherwise modified. For that, the "Find Visually Similar Images" search feature is the way to go.

I will also update the indexing configuration so users can choose between "copies" and "identical images" when assigning dupes during indexing to one or more categories.
I don't want to add yet another category selector, because this would require a complete re-design of the dialog. The user must choose one of the options, depending on what he/she considers a duplicate.
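
As a rough sketch of what choosing one of the two options could mean at indexing time: the class and names below (`DuplicateMode`, `DuplicateIndex`, the "Duplicates" category, the hash strings) are all hypothetical and are not IMatch's actual configuration or API. The point is only that a single setting decides which fingerprint is used to look up existing files.

```python
from collections import defaultdict
from enum import Enum


class DuplicateMode(Enum):
    COPIES = "copies"                      # binary identical files
    IDENTICAL_IMAGES = "identical images"  # same image data, metadata may differ


class DuplicateIndex:
    """Hypothetical in-memory index from fingerprint to known file names."""

    def __init__(self, mode: DuplicateMode):
        self.mode = mode
        self._seen = defaultdict(list)

    def _key(self, file_hash: str, image_hash: str) -> str:
        # The configured mode decides which fingerprint counts as "the same".
        return file_hash if self.mode is DuplicateMode.COPIES else image_hash

    def ingest(self, name: str, file_hash: str, image_hash: str) -> list:
        """Return the categories to assign to the new file, then record it."""
        key = self._key(file_hash, image_hash)
        categories = ["Duplicates"] if self._seen[key] else []
        self._seen[key].append(name)
        return categories


# The second file has the same image data but a different file hash
# (for example because metadata was written back to it).
loose = DuplicateIndex(DuplicateMode.IDENTICAL_IMAGES)
print(loose.ingest("a.jpg", "file-1", "image-1"))         # []
print(loose.ingest("a_tagged.jpg", "file-2", "image-1"))  # ['Duplicates']

strict = DuplicateIndex(DuplicateMode.COPIES)
print(strict.ingest("a.jpg", "file-1", "image-1"))         # []
print(strict.ingest("a_tagged.jpg", "file-2", "image-1"))  # []
```

A dictionary lookup per file keeps the check cheap at ingest time, which is one plausible way to avoid re-reading existing files.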
-- Mario
IMatch Developer
Forum Administrator
http://www.photools.com  -  Contact & Support - Follow me on 𝕏 - Like photools.com on Facebook

thrinn

Quote from: Mario on September 30, 2020, 09:40:13 AM
The Search menu now has two commands to find "dupes": "Search Copies" (binary identical files) and "Search Duplicates" (which finds 100% copies based on the image data alone).
The Duplicate search is insensitive to metadata changes or other changes in the file; it looks only at the image itself and finds all copies.
Note: It does not find copies where the image has been resized or otherwise modified. For that, the "Find Visually Similar Images" search feature is the way to go.

I will also update the indexing configuration so users can choose between "copies" and "identical images" when assigning dupes during indexing to one or more categories.
I don't want to add yet another category selector, because this would require a complete re-design of the dialog. The user must choose one of the options, depending on what he/she considers a duplicate.

This sounds like a valuable new feature. As of now, I am not aware of duplicates in my database, but being not aware of them is of course not the same as not having them...
Thorsten
Win 10 / 64, IMatch 2018, IMA

Mario

Quote from: thrinn on September 30, 2020, 10:00:31 AM
This sounds like a valuable new feature. As of now, I am not aware of duplicates in my database, but being not aware of them is of course not the same as not having them...

Different users have different workflows and face different challenges.
One way to produce duplicates was mentioned in the FR: accidentally downloading and indexing the same files (e.g. from phones or cards).
Another typical scenario for having to deal with tons of duplicates is somebody starting with DAM (especially companies & libraries), having accumulated multiple copies of files due to, eh,  organizational issues... ;D

The existing duplicate search can find 100% copies (binary identical). This works well in many scenarios.
But if a file has been added previously and metadata was then changed and written back, the existing duplicate checker can no longer find the copy. The file is no longer binary identical.
The new feature solves this by looking only at the image data itself.

But both search modes are needed. Sometimes you need to find only binary duplicates (real 1:1 copies) and sometimes you need to find visually identical files.
And sometimes you need to find visually similar files. Or files with similar colors. Or files with specific properties, e.g. "mostly blue" or "green at the top". IMatch supports all of this  :)

The hard part was to find a way to make the new dupe checker fast enough to be used during ingesting without bringing performance down to a crawl.
-- Mario

Tveloso

Quote from: Mario on September 30, 2020, 09:40:13 AM
Then I re-evaluated the visual query data IMatch produces to find visually similar images, and gave this some thought.
And with a handful of new algorithms, some additions to the database (automatic migration when the database is opened for the first time) etc. I was able to implement this.

Quote from: Mario on September 30, 2020, 10:43:58 AM
The hard part was to find a way to make the new dupe checker fast enough to be used during ingesting without bringing performance down to a crawl.

Thank you so much Mario!

I know that we sometimes don't realize just how much effort it can be to add a new feature to an already immensely feature-rich product like IMatch.

I'm very much looking forward to using the new search to see just how many dupes I have (and for IMatch to keep me - and my wife - in check with creating new ones).  :)

Thank you again!
--Tony

Darius1968

I'm backing this request, as I too have occasionally struggled with visual duplicates entering my DB. They are a challenge to round up, for lack of a feature that explicitly says such images look identical to others already in my DB. I have observed that the visually similar feature is sometimes inaccurate: results marked as less similar, percentage-wise, can turn out to be the actual visual duplicates!  (This, while images with similarity values closer to 100% do not even remotely resemble the original used to generate the search!)

+1

jch2103

John

Mario

Quote from: Darius1968 on September 30, 2020, 06:09:35 PM
I have observed that the visually similar feature is sometimes inaccurate: results marked as less similar, percentage-wise, can turn out to be the actual visual duplicates!  (This, while images with similarity values closer to 100% do not even remotely resemble the original used to generate the search!)
+1

This is very unlikely if the image data is identical. Changing the size, cropping, or any change to color or contrast will cause images to vary widely.
Also, remember to always use a sort preset which sorts by similarity. Sometimes users sort by things like Capture Time and then wonder why the search results seem inaccurate...
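
A tiny sketch of the sorting point, using made-up result data (the file names, similarity values and dates are purely illustrative). Sorted by similarity, the best match comes first; sorted by capture time, the same results appear in an order unrelated to how well they match.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Result:
    name: str
    similarity: float  # percent, as shown in the search results
    captured: date


# Hypothetical search results for one reference image.
results = [
    Result("similar.jpg", 82.5, date(2019, 5, 3)),
    Result("exact_copy.jpg", 100.0, date(2020, 9, 1)),
    Result("vaguely_alike.jpg", 61.0, date(2018, 1, 7)),
]

by_similarity = sorted(results, key=lambda r: r.similarity, reverse=True)
by_capture_time = sorted(results, key=lambda r: r.captured)

print([r.name for r in by_similarity])
# ['exact_copy.jpg', 'similar.jpg', 'vaguely_alike.jpg']
print([r.name for r in by_capture_time])
# ['vaguely_alike.jpg', 'similar.jpg', 'exact_copy.jpg']
```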
-- Mario