How to reject a duplicate match

Started by Damit, March 12, 2024, 06:23:31 PM

Damit

I have read the help and done a search here and I am surprised there is no information on how to tell IMatch that a detected duplicate is not a duplicate. I know I can set it to binary to get exact matches, but I like to use the visual similarity setting.  Unfortunately, this leads to a lot of false positives, and I would like to eliminate them.  Is this possible?

Mario

Quote: and I am surprised there is no information on how to tell
This is not surprising at all. There is no such feature, nor was it ever requested or needed.

If you search for duplicates, IMatch only considers the "image information", i.e. the fingerprint it has created for the image using an algorithm. This can never be perfect and the very rare false duplicate is normal.
You can send me the original and the false duplicate and I will check them out here when there is time.

If you search for the same images over and over (unlikely, since this is usually done only once to weed out dupes when building up your DAM) and thus find the same "false" positive over and over, just put them into a collection or category and then filter out files in that category or collection. Don't filter the files out of course when you search for other originals.

The Dupe search is usually run only once, when new images come into the database. Then you weed out the dupes and are done. If there is a wrong dupe, just ignore it and don't delete it. Rinse and repeat when you add more files to the database.


Lazlo Nibble

Quote from: Mario on March 12, 2024, 06:32:17 PM: "This can never be perfect and the very rare false duplicate is normal."
I think the documentation needs to be updated, then. The Searching page states that "IMatch considers files as duplicates when the image data is identical" (your emphasis). To me, identical means bit-for-bit identical, not "very very close", but I often see "very very close" images flagged as Duplicates on ingestion. Think iPhone burst mode, for example.

Right now I'm consolidating/deduping about 25 years' worth of images that haven't been managed in any consistent way. I've eliminated the files that are bit-for-bit duplicates, but now I need to weed out images that are true pixel-for-pixel duplicates regardless of the metadata, filename, etc. That could be done with a hash on the rendered pixel data, which IMatch presumably has already, as it's needed to generate the cache thumbnail and the fingerprint.
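The idea of hashing rendered pixel data can be sketched in Python as follows. This is only an illustration, not something IMatch exposes: `pixel_digest` is a hypothetical helper, and the decoded pixel buffer is assumed to come from an image library of your choice (for example Pillow's `Image.tobytes()`). Because only the dimensions, channel layout and pixel bytes are hashed, files that differ only in metadata or filename would produce the same digest.

```python
import hashlib


def pixel_digest(width: int, height: int, mode: str, pixels: bytes) -> str:
    """Return a SHA-256 digest of decoded pixel data plus its geometry.

    Hypothetical helper: the caller is assumed to have decoded the image
    already; metadata and file names never enter the hash.
    """
    h = hashlib.sha256()
    # Include dimensions and channel layout so that two buffers with the
    # same bytes but a different shape do not produce the same digest.
    h.update(f"{width}x{height}:{mode}:".encode("ascii"))
    h.update(pixels)
    return h.hexdigest()


# Made-up 2x1 RGB buffers standing in for decoded image files.
a = pixel_digest(2, 1, "RGB", bytes([255, 0, 0, 0, 255, 0]))
b = pixel_digest(2, 1, "RGB", bytes([255, 0, 0, 0, 255, 0]))
c = pixel_digest(2, 1, "RGB", bytes([255, 0, 0, 0, 254, 0]))  # one value differs

print(a == b)  # True: identical pixels
print(a == c)  # False: a single channel value changed
```

Such a digest is all-or-nothing: a single changed channel value yields a completely different hash, which is the opposite of the tolerant fingerprint used for the Duplicates search.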

Mario

To which help section do you refer? Please use the feedback link available at the bottom of each help page to send me information about typos, unclear or wrong content. I can then update the page.

On the Finding Duplicate Files page, I explain what the difference between a Duplicate and a Copy is. If you are only interested in finding true copies (bit-for-bit identical) use the Find Copies, not the Find Duplicates mode.

Jean-Maetso

Mario,
In the Help Page you mentioned, there is at least one element that I find confusing: "The Best File Window layout for Duplicates" may indicate "dimensions difference". I cannot understand how "pixel bit-to-bit identical" images can have different dimensions.
I agree they can have different dates, different file sizes and so on. But please can you explain how they can have different dimensions while being, by definition of "duplicates", identical in terms of image data?
Thanks

Mario

IMatch allows you to search for duplicates (identical or nearly identical picture data, not considering metadata), copies (binary identical files), visually similar images and images matching a sketch you draw.

If you compare via Copy, only 100% binary identical files will be found. That's that. Use this to find multiple copies of images you have on your disks.

The Duplicates algorithm is tolerant regarding slight changes in the pixel dimensions of the image, small variations in color, even slight crops.
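To illustrate why a fingerprint-based comparison tolerates small pixel changes, here is a minimal sketch of an average hash in Python. This is a generic, well-known technique chosen purely for illustration; the thread does not say which algorithm IMatch actually uses, and `average_hash` and `hamming` are hypothetical names. The input is assumed to be an image already reduced to a tiny grayscale grid.

```python
def average_hash(gray: list[list[int]]) -> int:
    """Build a bit fingerprint: one bit per pixel, 1 if brighter than the mean."""
    flat = [value for row in gray for value in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for value in flat:
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


def hamming(a: int, b: int) -> int:
    """Number of fingerprint bits that differ."""
    return bin(a ^ b).count("1")


# Made-up 2x2 grayscale grids (0-255).
original = [[10, 200], [220, 30]]
slightly_changed = [[12, 198], [221, 29]]  # small color variations
different = [[200, 10], [30, 220]]         # a different picture

print(hamming(average_hash(original), average_hash(slightly_changed)))  # 0
print(hamming(average_hash(original), average_hash(different)))         # 4
```

Small variations leave the fingerprint unchanged, so the two files are reported as duplicates even though their pixels are not identical.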

If you search for visually similar images (a search mode different from Copy/Duplicates, mind), the images can be considered similar even if they have different dimensions, different file formats, different crops etc.

In such cases, and in the workflows of many of the collector / science users of IMatch, it makes sense to display the pixel dimensions and/or file format and/or selected metadata tags in a custom File Window layout to help you decide which files to keep and which files to delete.

That's all up to you of course.
Showing dimensions when you only search for Copies indeed makes no sense.

Jean-Maetso

Quote from: Mario on April 02, 2024, 11:04:11 AM: "IMatch allows you to search for duplicates (identical or nearly identical picture data, not considering metadata), copies (binary identical files), visually similar images and images matching a sketch you draw."
What you are writing here does not seem consistent with the definition of "duplicates" as found in the documentation:
"Files are duplicates when the image data is identical. [...] This search operation [...] does not deal with similar files".

Thus my remark (and the suggestion by Lazlo to update the documentation).
Thanks

Mario

Send me feedback via the link, including your suggested changes. This is the recommended approach if you find missing or invalid info in the help. It automatically enqueues your feedback into the appropriate ticket system on my end.
I will probably replace identical with a fuzzier word, since a difference of a few pixels will not change the image fingerprint used for matching. So identical should probably be changed to "identical or nearly identical".

I have fleshed out the explanation a bit. Check it out. You probably need to clear the cache in your browser or reload the help page a few times with <Ctrl>+<F5>.

Lazlo Nibble

Quote from: Mario on April 02, 2024, 10:03:23 AM: "If you are only interested in finding true copies (bit-for-bit identical) use the Find Copies, not the Find Duplicates mode."

I'm interested in finding true, bit-for-bit identical copies based on image data alone. Is there a way to do this in IMatch?

Scenario: I have an existing image library in IMatch. I then find an older backup CD in a drawer somewhere. How do I load the images from that CD into my library and be 100% sure that in doing so I've neither lost any images nor duplicated any unnecessarily? I can't compare the images with Copy because the versions of the images in my existing library will inevitably have additional metadata not in the pre-IMatch versions from the CD (just ingesting the images and allowing a writeback will update the metadata most of the time). But if I compare with Duplicate, there's no way to be sure that any "Duplicates" IMatch finds are actual bit-for-bit matches.

A true bit-for-bit image comparison tells me with 100% certainty that I can safely delete the bit-identical images from the backup CD without losing any data, and any images from the CD that don't match images in the existing library can be ingested without creating duplicates.
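The decision described in this scenario can be sketched in Python as follows. It assumes that some external tool (not IMatch) has already computed a digest of the decoded pixel data for every file; `plan_import`, the file names and the digest strings are all hypothetical. Backup files whose digest is already known are reported as redundant, and each remaining digest is ingested only once.

```python
from typing import Iterable, Mapping


def plan_import(
    library_digests: Iterable[str],
    backup: Mapping[str, str],
) -> tuple[list[str], list[str]]:
    """Split backup files into (redundant, to_ingest).

    `backup` maps a file name to the digest of its decoded pixel data.
    """
    seen = set(library_digests)
    redundant: list[str] = []
    to_ingest: list[str] = []
    for name in sorted(backup):
        digest = backup[name]
        if digest in seen:
            # Same pixels already in the library or earlier on the CD.
            redundant.append(name)
        else:
            seen.add(digest)
            to_ingest.append(name)
    return redundant, to_ingest


# Made-up digests for illustration only.
library = {"aa11", "bb22"}
cd = {"cd/001.jpg": "aa11", "cd/002.jpg": "cc33", "cd/003.jpg": "cc33"}

redundant, to_ingest = plan_import(library, cd)
print(redundant)  # ['cd/001.jpg', 'cd/003.jpg']
print(to_ingest)  # ['cd/002.jpg']
```

The result is only as trustworthy as the digests themselves, so it relies on the same decoder producing the same pixels for both the library and the backup files.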

Mario

Quote: I'm interested in finding true, bit-for-bit identical copies based on image data alone. Is there a way to do this in IMatch?
No.
Only the four image search modes in the Search menu are available. For your purpose, you'll need specialized software. That's a very unusual requirement.

Damit

It would be great if just image data and not metadata could be compared, but frankly, this program does so much that I can understand the recommendation for more specialized software.  I have a few such tools and I am not sure if they can accomplish that either. A lot of the software I previously used has been replaced by the apps in IMatch. 8)

Mario

Quote: A true bit-for-bit image comparison tells me with 100% certainty
This depends.

1. If the file format uses lossy compression (e.g. JPG) and the two files were compressed with different implementations of that lossy algorithm (same source image, saved at different times or with different software versions), comparing the resulting files pixel by pixel won't give you the right result, although the images are actually identical.

2. If you work with RAW files, different versions of a WIC codec or LibRaw may produce different "pixels" for the same RAW file, depending on when you load the file. For comparing the JPEG preview image in the RAW, see point 1 above.

If you have a very narrow use case, e.g. a lossless image format and always the same software used to create the files, creating a tool that actually extracts and compares pixel for pixel will be doable. But that's such a narrow use case that this is not something I ever see in IMatch.
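A pixel-for-pixel comparison of this kind could look like the following Python sketch. The function names are hypothetical and the byte values are made up; the buffers are assumed to be decoded pixel data of equal dimensions. The example also shows the problem from point 1: two encodings of the same picture that differ by one step in a few channels fail an exact comparison and only match once a tolerance is allowed.

```python
def max_channel_difference(a: bytes, b: bytes) -> int:
    """Largest absolute difference between corresponding channel values."""
    if len(a) != len(b):
        raise ValueError("buffers must have the same length")
    return max((abs(x - y) for x, y in zip(a, b)), default=0)


def same_pixels(a: bytes, b: bytes, tolerance: int = 0) -> bool:
    """True if both buffers have equal length and differ by at most `tolerance`."""
    return len(a) == len(b) and max_channel_difference(a, b) <= tolerance


# Made-up values: the same picture as decoded from two different lossy saves.
first_save = bytes([120, 64, 200, 121, 65, 199])
second_save = bytes([121, 64, 199, 121, 66, 199])

print(same_pixels(first_save, first_save))                # True
print(same_pixels(first_save, second_save))               # False: exact match fails
print(same_pixels(first_save, second_save, tolerance=2))  # True
```

Any tolerance above zero turns the check back into a "nearly identical" test, which is why an exact comparison is only meaningful for the narrow lossless case described above.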

I'm sure there are open source or commercial tools which perform this particular type of bit-by-bit comparison of image data alone.