Original vs. Duplicates

Started by sybersitizen, March 23, 2025, 05:16:32 PM


sybersitizen

I'd like to confirm how IMatch determines and indicates which of a group of duplicates is the original. Is it normally the one that was ingested first, or is it based on something else?

And what exactly does the red line under the thumbnail tell me? Sometimes there's only one red-lined file in a group of several being displayed in the Result Window for Duplicates, and sometimes there are more than one.

Mario

When IMatch indexes new files, and it finds that a file is a duplicate per your configuration, it adds the file to the Duplicates category. If you have file A and B considered as duplicates and A is in the database when B is added, B will be added to the category.

If you later rescan A for some reason, it will be considered a dupe (of B). But since B is already in the Duplicates category, A will not be added.
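The order dependence described above can be sketched in a few lines of Python. This is only a toy model, not IMatch code: the `Database` class, the `fingerprint` strings, and the rule "flag the incoming file only if it matches a file that is not itself flagged" are assumptions made to reproduce the A/B behavior from this post.

```python
from dataclasses import dataclass, field


@dataclass
class Database:
    """Toy stand-in for an image database (hypothetical, not IMatch's data model)."""
    fingerprints: dict = field(default_factory=dict)       # file name -> fingerprint
    duplicates_category: set = field(default_factory=set)  # names of flagged files

    def index(self, name: str, fingerprint: str) -> None:
        # Find other files already in the database with the same fingerprint.
        others = [n for n, fp in self.fingerprints.items()
                  if n != name and fp == fingerprint]
        # Assumed rule: flag the incoming file only if it matches at least one
        # file that is not already in the Duplicates category.
        if any(n not in self.duplicates_category for n in others):
            self.duplicates_category.add(name)
        self.fingerprints[name] = fingerprint


db = Database()
db.index("A", "same-image")   # first file in: nothing to match against
db.index("B", "same-image")   # matches A, so B is flagged
db.index("A", "same-image")   # rescan of A: its only match (B) is already flagged
print(sorted(db.duplicates_category))
```

In this model the flag records which file arrived second, not which file is the "real" original.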

Usually the files in Duplicates are short-lived. You process them regularly, dealing with the duplicates in a sensible way. And then empty the category.

Quote
And what exactly does the red line under the thumbnail tell me?
By default, the DUPLICATES category uses a red color-coding. When you see this color in the categories bar in File Window panels, you know that the file is in that category. Same as for any other color-coded category:

Category Color Bar

sybersitizen

I do understand those general concepts, which I have read in the Help pages, but my questions are still unanswered.

Let me try again ...

1. How does IMatch determine and indicate which of a group of duplicates is the original? Is it normally the one that was ingested first, or is it based on something else? Or maybe it doesn't try at all, and that is for me to decide.

2. What exactly does the red line under the thumbnail tell me when I'm viewing a group of duplicates in the Result Window for Duplicates? Sometimes there's only one red-lined file in the group, and sometimes there are more than one. What distinguishes those from other files in the group that have no red line?

Tveloso

Quote from: sybersitizen on March 23, 2025, 06:10:30 PM
1. How does IMatch determine and indicate which of a group of duplicates is the original? Is it normally the one that was ingested first, or is it based on something else? Or maybe it doesn't try at all, and that is for me to decide.
In terms of "marking" a given file as a duplicate, it is strictly the order of indexing. As Mario described, if you index two files that are dupes, the first one (File A) will be added to the database without incident. When the second one (File B) is added later (either in the same indexing session or in a subsequent one), it's File B that will be "flagged" as a duplicate (i.e. File B is the only one added to the Duplicates category).

When you run a search for duplicates and a Result Window opens showing "the originals" and one or more matching duplicates, the originals are simply the files that you selected before starting the search. So by performing the search, you have designated which files are the originals.

Quote from: sybersitizen on March 23, 2025, 06:10:30 PM
2. What exactly does the red line under the thumbnail tell me when I'm viewing a group of duplicates in the Result Window for Duplicates? Sometimes there's only one red-lined file in the group, and sometimes there are more than one. What distinguishes those from other files in the group that have no red line?
As Mario described, the red line indicates that the file is in a Category that's configured with color coding (which the DUPLICATES Category is configured to do).

So in the scenario above, File B would have a red line, and File A would not.  This is not really directly related to any indication of which is the duplicate.  It simply indicates that the file has been added to a Category that does color coding (which File B has been).  As described in the Help Topic that Mario linked-to, you can set up your own categories to do color coding, and they would contribute their color to the thumbnail's color bar (just as the DUPLICATES category is doing).
--Tony

sybersitizen

Quote from: Tveloso on March 23, 2025, 07:09:52 PM
In terms of doing a search for duplicates, and a Result Window opens showing "the originals" and one or more matching duplicates, the originals are simply the files that you have selected to perform the search.  So in performing the search, you have designated which files are the originals.
Thanks for the comments.

So IMatch doesn't try to distinguish 'originals' from 'duplicates' on its own - the 'original' is simply the photo I started the search with. If that was explained in the Help pages, I just failed to see it.

Quote
As Mario described, the red line indicates that the file is in a Category that's configured with color coding (which the DUPLICATES Category is configured to do).

So in the scenario above, File B would have a red line, and File A would not.  This is not really directly related to any indication of which is the duplicate.  It simply indicates that the file has been added to a Category that does color coding (which File B has been).  As described in the Help Topic that Mario linked-to, you can set up your own categories to do color coding, and they would contribute their color to the thumbnail's color bar (just as the DUPLICATES category is doing).
All I use or need is the DUPLICATES category.

As I said, when I look at a group of duplicates, sometimes more than one are red-lined. Sometimes more than one are not red-lined. Sometimes the 'original' that I selected is red-lined and sometimes it is not.

I understand why I can be shown more than one file that is red-lined because those are considered duplicates, of which there can be many. But how can I be shown more than one file that is not red-lined? Shouldn't there only be one non-red-lined file in the group - the one that was indexed first? In an example I'm looking at right now, there are two non-red-lined files shown below the red-lined file that I selected for the search. They can't both be originals - but they are two different file types (JPEG and PSD) that were indexed in the same session. Does that explain it?

Mario

"Originals" are the images you select for any search, including duplicates, visually similar images, sketch match, GPS search etc.
See Finding Duplicate Files for an explanation of originals in this context.

Quote
But how can I be shown more than one file that is not red-lined?
Show where?

There are no "originals" in dupe search. If IMatch considers a file a dupe when ingesting it, it will be added to the category. What you consider the "original" is up to you.

If you don't want IMatch to indicate files it considers duplicates, change the settings under Edit > Preferences > Indexing. Disable the dupe search, switch it to 100% copies only, whatever works best for you.

If you find the red category color bar confusing, just disable the color-coding for the Duplicates category.

Files should normally not stay long in Duplicates anyway.

IMatch reports some dupes after ingesting new files, and you deal with them, e.g. by selecting them, running Search > Duplicates or Search > Copies, and then deciding which of the files (the original you selected for the search or the matches found) to keep. Then the files can be removed from the Duplicates category, leaving it clean for the next import.

sybersitizen

Quote from: Mario on March 23, 2025, 08:52:02 PM
"Originals" are the images you select for any search, including duplicates, visually similar images, sketch match, GPS search etc.
See Finding Duplicate Files for explanation of originals in this context.

Indeed, I had repeatedly missed a key sentence on that page, despite having visited it numerous times. My fault, no one else's.

Quote
Quote
But how can I be shown more than one file that is not red-lined?
Show where?

In the Result Window for Duplicates when I perform a search for duplicates. That's something I am still not understanding.

I like having IMatch help me with duplicates.

I like the red color coding for duplicates.

IMatch has identified a rather large number of duplicates because I originally had it ingest a collection of about 80,000 images that were a mix of things collected by three family members over decades. Many are binary duplicates and many of them are only slight variations. This will not be an issue going forward with new images.

I am not complaining about anything. I simply want to understand what exactly I'm being shown - and why - when I use the Result Window for Duplicates to consider cleaning up the duplicates. It is becoming a bit clearer.

Mario

When you use the visually similar search, many similar images may show in the matches which are not actually duplicates. Which options do you use for dupe search in the Indexing options?

sybersitizen

Quote from: Mario on March 23, 2025, 10:50:00 PM
When you use the visually similar search, many similar images may show in the matches which are not actually duplicates.

Yes, though I'm not searching that way. I'm searching for Duplicates.

Quote
Which options do you use for dupe search in the Indexing options?

Assign copies (visually identical, same format)

I don't know if this is the default or if I chose it at some point. I don't recall changing it.

I have actually only found ONE example so far where I'm seeing more than one non-red-lined file in the Result Window for Duplicates, and in this case one of them is a PSD file and one is a JPEG. A third one that will always be red-lined is also a JPEG. All three are legitimate duplicates of the same scene (with slight differences in editing), but if I understand the documentation correctly, I shouldn't be seeing the PSD because it would not be considered.

Is there any condition where I should expect to see more than one non-red-lined file?

I'm going to do more searches to see if I find more examples like this. If it's just this one unexpected result, I won't have to bother you anymore!  8)

PandDLong

Quote from: sybersitizen on March 24, 2025, 12:37:17 AM
Assign copies (visually identical, same format)

I don't know if this is the default or if I chose it at some point. I don't recall changing it.

I have actually only found ONE example so far where I'm seeing more than one non-red-lined file in the Result Window for Duplicates, and in this case one of them is a PSD file and one is a JPEG. A third one that will always be red-lined is also a JPEG. All three are legitimate duplicates of the same scene (with slight differences in editing), but if I understand the documentation correctly, I shouldn't be seeing the PSD because it would not be considered.

Is there any condition where I should expect to see more than one non-red-lined file?

'Assign copies (visually identical, same format)' is the logic applied when a file is first indexed into IMatch.  That is why a PSD and a JPEG will not be considered duplicates of each other and neither will be assigned to the Duplicates category (hence no red line) - they have different formats.

Your third file is a JPEG and is red-lined because at the time of indexing there was a visually identical JPEG already in the database.

When you perform a Search for duplicates from within IMatch, it is not using the indexing rules and that is why it found your two files that were not red-lined (not assigned to the Duplicates category).
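A small Python sketch may make the difference between the two rules concrete. It is a toy model under stated assumptions, not IMatch's implementation: `Image`, `index_all`, `search_duplicates` and the `image_data` strings (standing in for "the same picture") are all hypothetical, and the three files are treated as having identical image data for simplicity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Image:
    name: str
    fmt: str         # file format, e.g. "JPEG" or "PSD"
    image_data: str  # hypothetical stand-in for a fingerprint of the picture


def index_time_match(new: Image, existing: Image) -> bool:
    """Indexing rule modeled on 'visually identical, same format'."""
    return new.image_data == existing.image_data and new.fmt == existing.fmt


def search_match(original: Image, candidate: Image) -> bool:
    """Manual duplicate search: compares image data only, ignoring the format."""
    return original.image_data == candidate.image_data


def index_all(files):
    """Index files in order; return the database and the flagged file names."""
    database, duplicates_category = [], set()
    for f in files:
        if any(index_time_match(f, e) for e in database):
            duplicates_category.add(f.name)  # this file would get the red line
        database.append(f)
    return database, duplicates_category


def search_duplicates(original, database):
    """Return the names of all other files whose image data matches the original."""
    return [c.name for c in database
            if c.name != original.name and search_match(original, c)]


files = [
    Image("scene.psd", "PSD", "scene-1"),
    Image("scene.jpg", "JPEG", "scene-1"),
    Image("scene_copy.jpg", "JPEG", "scene-1"),
]
database, flagged = index_all(files)
print(flagged)                                # only the second JPEG is flagged
print(search_duplicates(files[2], database))  # the search still finds the other two
```

Here the search started from the flagged JPEG returns two unflagged files, which is the pattern described in the question.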


I hope that is helpful.

Michael

sybersitizen

Quote from: PandDLong on March 24, 2025, 01:24:25 AM
When you perform a Search for duplicates from within IMatch, it is not using the indexing rules and that is why it found your two files that were not red-lined (not assigned to the Duplicates category).

Thank you.

So it will find and display what it considers to be 'matching' images even if they are not in the 'Duplicates' category. I never realized that. And so far, I am only seeing that happen when the 'matches' involve more than one file type, though apparently it could happen even if all 'matches' are the same file type.

If I'm stating that correctly, my questions have been answered.

Mario


Quote
So it will find and display what it considers to be 'matching' images even if they are not in the 'Duplicates' category.
Yes. The dupe searches take the selected files (originals) and search the entire database, as explained in Finding Duplicate Files.
The Search > Duplicates does not have any options. It finds all files with image data matching the original.
The Search > Copies does not have any options. It finds all binary copies.
The Search > Visually Similar has many options for you to control.

The ability to let IMatch search for duplicates / duplicates with the same format / copies during indexing and assign the matches found to the DUPLICATES category is meant as a convenience feature.
If IMatch reports dupes after adding new files, you go to this category to see them and deal with them in whatever way you want. For example, select them and run Search > Duplicates to find all duplicates for these files. Then decide which one to keep.
Then you unassign the files (if any are remaining) from the category to clear it for the next import.

Note: The fact that a file is not in DUPLICATES does not mean that it has no duplicates!
The user can delete DUPLICATES at any time, or unassign all the files.

If you let indexing search for copies, there might still be visual duplicates in the database that have the same image data, but different metadata.

If you let indexing search for duplicates, but only with the same format, it may miss visually duplicate files in other file formats. Search > Duplicates does not limit by extension.

The reason for adding the "same file format" variant for dupe search during indexing is that many users actually have visually duplicate images - images they produced in their RAW processor or image editor from an original RAW file and saved as a DNG, PSD, TIFF or whatever. These users would get lots of "false" positives if they used the visual duplicates setting in the Indexing options. This may not be an issue for you, but it will be an issue for many others.

The Search > Copies option only searches for actual copies, 100% binary identical files. The same option is also available for the dupe search in indexing. What to choose depends on your requirements. You can also turn it off and delete the DUPLICATES category and run a dupe search manually when needed.
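The three notions discussed in this post (binary copies, duplicates with matching image data, and duplicates restricted to the same format) can be told apart with a toy model. How IMatch actually compares files is not described in this thread; the sketch below simply assumes a file is made of format, metadata and pixel bytes and compares SHA-256 hashes. `ToyFile` and all function names are hypothetical.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyFile:
    name: str
    fmt: str
    metadata: bytes  # e.g. keywords, ratings
    pixels: bytes    # stand-in for the decoded image data

    def file_bytes(self) -> bytes:
        # Hypothetical on-disk content: the file name is not part of it.
        return self.fmt.encode() + b"|" + self.metadata + b"|" + self.pixels


def _digest(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def is_binary_copy(a: ToyFile, b: ToyFile) -> bool:
    """Copies: the complete file content is identical."""
    return _digest(a.file_bytes()) == _digest(b.file_bytes())


def is_duplicate(a: ToyFile, b: ToyFile) -> bool:
    """Duplicates: same image data; metadata and format may differ."""
    return _digest(a.pixels) == _digest(b.pixels)


def is_duplicate_same_format(a: ToyFile, b: ToyFile) -> bool:
    """Duplicates, but only counted when both files share a format."""
    return is_duplicate(a, b) and a.fmt == b.fmt


original = ToyFile("a.jpg", "JPEG", b"keywords=", b"PIXELS")
tagged = ToyFile("b.jpg", "JPEG", b"keywords=beach", b"PIXELS")
export = ToyFile("c.tif", "TIFF", b"keywords=", b"PIXELS")

for other in (tagged, export):
    print(other.name,
          is_binary_copy(original, other),
          is_duplicate(original, other),
          is_duplicate_same_format(original, other))
```

In this model the re-tagged JPEG is a duplicate but not a binary copy, and the TIFF export is a duplicate only when the format restriction is dropped.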

All details here:

Searching for Duplicate Files while Indexing

and here:

Finding Duplicate Files

and here:

Cleaning Up Duplicate Files Quickly

sybersitizen

Okay, one more thing. As discussed here ...

https://www.photools.com/community/index.php/topic,14757.0.html

... it appears that it's not possible to remove files from the Duplicates category when using the Result Window for Duplicates. To do so, I instead switch to the Categories view, locate those files in the Duplicates category, and remove them from there. Is that the only way?

Mario

To remove files from any category:

a) switch to the category view, select the files and press <U>
b) use the Categories Panel
c) use a Favorite

Options b) and c) work in any File Window, including Result Windows.

Un-assigning (Removing) Files from a Category
Un-assigning with the Category Panel
Category Favorites

sybersitizen

I'm probably using the wrong terminology. This is a typical process:

Observe a marked Duplicate (red-lined file) using any panel.
Select it and Search > Duplicates.
Look at the results in the Results Window for Duplicates in order to most easily compare the selected file to any duplicates that were found.

If I want to remove any file in that window from the Duplicates category, I find that I must then switch to the Categories panel.

If I began my search from within the Duplicates category, the reference ('Original') file is still selected, but no others are. That's okay if I only want to remove that one file from the Duplicates category.

If I began my search in some other panel (which is often the case), then none of the files that I saw in the Results Window for Duplicates are selected, so I have to search for them a second time in the Duplicates category before I can remove them.

I might be missing a much easier method.

Mario

As I wrote, use one of the three standard methods.
The Result Window has no idea which files you have selected in other File Windows. You can have selected thousands of files in File Windows at any moment in time.

Tveloso

Quote from: sybersitizen on March 24, 2025, 08:29:26 PM
I'm probably using the wrong terminology.
I think you may be right here.

It appears that when you say Categories panel here:
Quote from: sybersitizen on March 24, 2025, 08:29:26 PM
Look at the results in the Results Window for Duplicates in order to most easily compare the selected file to any duplicates that were found.

If I want to remove any file in that window from the Duplicates category, I find that I must then switch to the Categories panel.

...you're actually referring to the Categories View:

    [Image: Screenshot 2025-03-25 092346.png]

There's no need to switch back to the Categories View in order to unassign (from the Duplicates - or any other - Category) a file you're currently seeing in a Result Window.  Doing it that way is the first option Mario presented in his post above:

Un-assigning (Removing) Files from a Category

The second option he mentioned:

Un-assigning with the Category Panel

...allows you to stay in that Result Window, and use the Categories Panel to unassign the files:

    [Image: Screenshot 2025-03-25 092421.png]

If you switch to that Panel's Current Tab, you will see only the Categories the selected Files are actually currently in (optionally including Data Driven Categories, and @Keywords), and you can unassign from there.

And using a Favorite (the third option) would make it even faster.

But again, unassigning a file from the Duplicates Category has no bearing on whether or not the file is a duplicate.  That's just an ordinary category at that point (IMatch automatically assigned the file to it at indexing time, per the "duplicate detection method" you selected, but once assigned there, it's just another category assignment - which could have been done manually).

Usually in a Duplicates Search Result Window, you would actually be (selectively) deleting files (and in order to delete "the originals", you will need to include them in the results, as Michael explained).

Or, you might actually want to keep a newly added duplicate, and instead delete the result file(s) that were already in the database (perhaps using the Metadata Panel to "transfer some data" to the newly added dupe)...

There are also techniques for doing a second Duplicates Search, from within the Result Window of the first (after having applied Filters to consider only certain files, based on the Folders they are in, or Categories they are assigned, and ensuring that the "the originals" of the first search are not included with the results), to do an initial "mass delete" of the files that started out in the Duplicates Category, based on some criteria the filter has provided...
--Tony

sybersitizen

Quote from: Tveloso on Today at 03:04:46 PM
Un-assigning with the Category Panel

...allows you to stay in that Result Window, and use the Categories Panel to unassign the files:

    [Image: Screenshot 2025-03-25 092421.png]

If you switch to that Panel's Current Tab, you will see only the Categories the selected Files are actually currently in (optionally including Data Driven Categories, and @Keywords), and you can unassign from there.

That's the easier way I was hoping for, and works perfectly.

Thanks very much.

Quote
But again, unassigning a file from the Duplicates Category has no bearing on whether or not the file is a duplicate.

Understood. As I said, my Duplicates category is populated with a large number of files that were identified as such during the mass indexing of tens of thousands of files. That was fine and helpful, but I now have two separate goals: to gradually reduce the number of files in the Duplicates category that are not really duplicates, and to gradually delete any files that are true duplicates. Often I just encounter the red-lined files while doing other unrelated things in IMatch, so I want a quick way to deal with them one way or the other as they turn up.

The main problem was that I didn't realize until yesterday that in IMatch the term 'duplicate' has two different meanings and the word 'original' has two different meanings. Now I get it.