[BBD] Images incorrectly detected as Duplicates

Started by PaulS, July 13, 2023, 02:15:44 AM

PaulS

Hi Mario.

Searching this forum did not identify any other people having this issue. The following post from early 2022 suggests that it is a problem unique to me: https://www.photools.com/community/index.php/topic,12178.msg86468.html#msg86468

Quote from: Mario on January 07, 2022, 12:30:16 PM
Ps.: If you have two images which are wrongly identified as visually or binary identical, please send them to me.
I have never found such files and as far as I recall, no user ever reported this as a problem.

Nevertheless, I have 64 .JPG files that are incorrectly detected as duplicates (29 pairs and 2 groups of three) when I run a Duplicates Search.  They are different images with clear and sometimes significant visual differences.

I have run database diagnostics with no errors or warnings and the problem still occurs.

To see if it was database specific, I copied the problem files to a separate folder, created a new database, and indexed the new folder.  But the new database also incorrectly identifies them as duplicates, so it seems to be something related to the files.

I've uploaded a ZIP containing two groups of two files that are detected as duplicates.

I have no idea what could possibly be wrong and appreciate your help.

Paul

axel.hennig

I can reproduce this partially.

Steps:
1. Create a new database
2. Set under "Preferences -> Indexing" the option "Assign copied (binary identical)"
3. Load the images from Paul (included in his attachment "Duplicates.zip")

The file "2010-10-26 20.29.58 P110.JPG" is assigned to the category "DUPLICATES".

Mario

Quote from: axel.hennig on July 13, 2023, 09:11:53 AM
I can reproduce this partially.

Steps:
1. Create a new database
2. Set under "Preferences -> Indexing" the option "Assign copied (binary identical)"
3. Load the images from Paul (included in his attachment "Duplicates.zip")

The file "2010-10-26 20.29.58 P110.JPG" is assigned to the category "DUPLICATES".

@Axel
This is a glitch. Changing this option without closing and re-opening the database had no effect.
I have fixed this for the next release.

I assume the initial setting in your case was "Assign copies same format" and the switch to "binary identical" was not applied, so your test used the wrong setting.
If you close and re-open the database and then remove the files from the DUPLICATES category and then do a forced update of the folder, none of the files will be flagged as a dupe.

@Paul
The "binary identical" algorithm considers none of the files as duplicates / copies.
The visually similar algorithm does, because the files are very similar and are therefore identified as such. This is just how the algorithm works.

If this happens often to you, switch the mode to "assign copies (binary identical)" under Edit menu > Preferences > Indexing or disable this feature.

This behavior is by design.


axel.hennig

Quote from: Mario on July 13, 2023, 09:34:22 AM
If you close and re-open the database and then remove the files from the DUPLICATES category and then do a forced update of the folder, none of the files will be flagged as a dupe.

I can confirm.

PaulS

@Mario.  I made the changes you suggested in Preferences and removed all files from the DUPLICATES category.  I closed and restarted IMatch.  I then did a forced rescan.  None of the files are placed in the DUPLICATES category.

However, when I then select all files and use the menu command Search - Duplicates, the results window still shows the files as Duplicates.

@Axel.  Thanks for your help.  Are you able to confirm the above?

@Mario  From the IMatch help: "IMatch considers files as duplicates when the image data is identical. This mode does not consider metadata or other data in the file. Use this to find identical files even after changing and writing-back metadata. Note: If the image data has changed (resized or modified in any form), the image will no longer be considered a duplicate."

In my case the image data is not identical and so according to the help should not be detected as a Duplicate.  Visually similar images... is a different command which I am not currently using.

Does IMatch use the actual image data or a proxy such as the thumbnail or CRC to detect duplicates?  If so, perhaps my files have somehow been corrupted and could be fixed?  I've checked all the files using the Metadata Analyst app but did not see any obvious clues.

Mario

Quote
However, when I then select all files and use the menu command Search - Duplicates, the results windows still show the files as Duplicates.

This is how the algorithm works. It uses the visual fingerprint of the image to compare them.
There is a slight bit of fuzziness, which is intended because it allows the algorithm to overlook small changes in pixels.
Compared to the "Visually Similar" search available in the Search menu, the duplicates search uses a very small threshold. This algorithm is supposed to ignore small changes in a small percentage of pixels in an image.
Since your images are very similar (just zoom them to thumbnail size or less), they are detected as duplicate candidates.
The same will happen if you do exposure series or similar. That's just how the algorithm works and this cannot be changed.
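
To illustrate why shrinking an image hides small pixel changes, here is a minimal sketch of a block-average fingerprint. It is not IMatch's actual fingerprint, which is not published in this thread; the function names, the 2x2 grid, and the plain list-of-rows grayscale input are all assumptions made for the example.

```python
def block_average_fingerprint(pixels, grid=2):
    """Reduce a grayscale image (list of rows, values 0-255) to grid*grid block means.

    Hypothetical stand-in for a visual fingerprint; real fingerprints use
    far more information than block averages.
    """
    height, width = len(pixels), len(pixels[0])
    fingerprint = []
    for gy in range(grid):
        for gx in range(grid):
            rows = range(gy * height // grid, (gy + 1) * height // grid)
            cols = range(gx * width // grid, (gx + 1) * width // grid)
            block = [pixels[y][x] for y in rows for x in cols]
            fingerprint.append(sum(block) / len(block))
    return fingerprint


def fingerprint_distance(a, b):
    """Mean absolute difference between two fingerprints of equal length."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


# Two 8x8 images that differ in a single pixel.
base = [[100] * 8 for _ in range(8)]
tweaked = [row[:] for row in base]
tweaked[0][0] = 132

distance = fingerprint_distance(
    block_average_fingerprint(base),
    block_average_fingerprint(tweaked),
)
```

The pixel data of `base` and `tweaked` differs, so a bit-for-bit comparison would treat them as different files, but the averaged fingerprints are almost the same. Any comparison with a small tolerance on that distance would report them as duplicate candidates.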

I will add an extra sentence to the help topic to explain this better.

IMatch does not load every image into memory, extract the bitmap data, and compare it bit for bit with every other image in the scope. That would take ages.

IMatch uses a checksum (CRC) and the file size when you search for copies. Copies must be bit-by-bit identical, duplicates may have some small differences.

If you are only interested in real binary duplicates, use the "copy" mode.
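
For readers who want to see what a checksum-plus-size check for copies can look like, here is a rough sketch. It only assumes what is stated above (a CRC and the file size); the helper names are made up for the example and this is not IMatch code.

```python
import zlib
from collections import defaultdict
from pathlib import Path


def crc32_of_file(path, chunk_size=1 << 20):
    """Return the CRC-32 of a file, read in chunks to keep memory use low."""
    crc = 0
    with open(path, "rb") as handle:
        while chunk := handle.read(chunk_size):
            crc = zlib.crc32(chunk, crc)
    return crc


def find_binary_copies(paths):
    """Group files that share both size and CRC-32 (hypothetical helper)."""
    groups = defaultdict(list)
    for path in map(Path, paths):
        key = (path.stat().st_size, crc32_of_file(path))
        groups[key].append(path)
    # Only keys shared by more than one file are copy candidates.
    return [sorted(group) for group in groups.values() if len(group) > 1]
```

Since a CRC-32 can collide, a matching size and checksum is a very strong hint rather than proof. A full byte comparison of the candidates would confirm that they really are identical.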

PaulS

Thanks Mario.

With your explanation I understand that it is not due to some file corruption and everything makes sense.

Overall the algorithm is highly effective - nicely done.

(I noticed that all of the Duplicates in question were taken within one minute of each other and had exactly the same dimensions, or at least the same aspect ratio, so I imagine this might also be part of the algorithm.)

Mario

Dimensions and aspect ratio (to some extent) play a role in the fingerprinting, as do colors, noise, luminance, etc. It's all folded into an n-dimensional vector in the end.

The general purpose of the algorithm is to find visually similar images (Search menu > Visually Similar Images). 

Using a tight threshold, the same algorithm helps to detect what, under normal conditions, would be considered visually identical (with a few pixels difference). This can be very helpful for users who have large collections and an inflow of images from various sources and devices.
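
The idea of one fingerprint vector serving two searches with different thresholds can be sketched as follows. The threshold values, the Euclidean distance, and the three-component vectors are purely illustrative assumptions; the real vector layout and distance measure are not described in this thread.

```python
import math

# Hypothetical thresholds, chosen only for this example.
DUPLICATE_THRESHOLD = 0.05  # tight: "visually identical"
SIMILAR_THRESHOLD = 0.30    # looser: "visually similar"


def euclidean_distance(a, b):
    """Distance between two fingerprint vectors of the same length."""
    if len(a) != len(b):
        raise ValueError("fingerprints must have the same number of dimensions")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def classify(a, b):
    """Label a pair of fingerprints by comparing one distance to two thresholds."""
    distance = euclidean_distance(a, b)
    if distance <= DUPLICATE_THRESHOLD:
        return "duplicate"
    if distance <= SIMILAR_THRESHOLD:
        return "visually similar"
    return "different"


reference = [0.5, 0.5, 0.5]
burst_frame = [0.5, 0.5, 0.52]   # nearly the same scene
related_shot = [0.5, 0.5, 0.7]   # same subject, larger change
unrelated = [0.1, 0.9, 0.5]

labels = [classify(reference, other) for other in (burst_frame, related_shot, unrelated)]
```

In this toy setup the burst frame lands inside the tight threshold, which is the situation Paul ran into: frames shot seconds apart produce vectors that are closer together than the duplicate tolerance.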

For users who frequently shoot bursts or bracketed shots, this mode is not that helpful, since most images in a burst or bracket will be considered duplicates.

In that case, the "copies - binary identical" mode is the better choice.