Why would deleting unused (sample) thesaurus elements be slooooow?

Started by MrPete, December 21, 2024, 08:21:18 PM


MrPete

My sweetie is coming up to speed on IMatch.

Her first serious attempt has been quite frustrating, in what I would have thought is a simple task:

* Open Thesaurus Manager (for the first time ever)
* See a lot of elements she doesn't need
* Delete several top-level items and save, intending to build her own structure.

What we observed: this took almost 15 minutes!
* In the log, it shows many ~10 second keyword deletes:
Quote:
PTThesaurusDatabaseUpdater::DeleteKeyword  'V:\develop\IMatch5\src\IMEngine\PTThesaurusDatabaseUpdater.cpp(414)'
12.21 10:56:50+ 9375 [6A54] 05  M>  <  0 [9375ms #sl]
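
One way to see how much of the 15 minutes went into these calls is to total the bracketed durations in the log. The following is a minimal sketch, assuming every slow entry carries a trailing `[<n>ms #sl]` marker like the line quoted above; the function name is made up for illustration and is not part of IMatch.

```python
import re

# Assumption: slow entries end with "[<milliseconds>ms #sl]", as in the
# single log line quoted above. Other log layouts are not handled.
SLOW_MARKER = re.compile(r"\[(\d+)ms #sl\]")

def slow_durations_ms(lines):
    """Return the duration in milliseconds of every slow-flagged log line."""
    durations = []
    for line in lines:
        match = SLOW_MARKER.search(line)
        if match:
            durations.append(int(match.group(1)))
    return durations

# The sample is the log line from this post; a real run would read the log file.
sample = ["12.21 10:56:50+ 9375 [6A54] 05  M>  <  0 [9375ms #sl]"]
durations = slow_durations_ms(sample)
print(len(durations), "slow entries,", sum(durations) / 1000, "seconds total")
```

Feeding the full log file to `slow_durations_ms` line by line would show whether the keyword deletes alone account for the total wait.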

Context
* Reasonably fast system: i7-8700 (6 cores, 12 threads), 48 GB RAM, database on a 3 GB/s M.2 drive
* Before deleting the elements, we checked several and verified none were in use (which makes sense: she's done nothing to use any existing Thesaurus elements to date.)
* She DOES have some existing keywords in imported photos
* The photo database IS reasonably big: 320k files
* There are a huge number of files pending metadata writeback (310k -- basically the entire imported collection)

I assume there must be something else about open panels or ?? that's causing this to be slow.

Potentially Related

* After waiting through the above, she added a hierarchical keyword to her 10k+ insect photos and saved
* At this point, IMatch is not doing anything at all, and nothing is being logged.

YET: depending on which folder is selected in Media & Folders, the keyword pane shows "Updating..." in the center!

* I've tried opening sub-folders and seen that there is NO "Updating" for any of those, yet the parent folder still says "updating". The system is completely idle - yet still "Updating"

Anything I should look for in the log? I've turned on debug logging if that might help.

THANKS!


Mario

Have you enabled the option to apply thesaurus changes to the database?
In that case, IMatch would have to check each deleted keyword in each file in the database to see if it has to be removed.

I ask because I've just checked, and removing a hierarchy of about 3,000 keywords from the thesaurus with that option off takes maybe one second. It's an all in-memory operation, super-fast. Saving a huge thesaurus takes maybe 3 seconds.
That changes only when you tell it to apply changes to the database, which takes a lot more work.

See Updating Keywords in the Database from Thesaurus Changes


Quote:
* After waiting through the above, she added a hierarchical keyword to her 10k+ insect photos and saved
Added the keyword where? Selecting 10K files and then adding the keyword in the Keywords Panel?
Adding keywords in the thesaurus does not impact your files and takes 0.1 seconds.

Quote:
* There are a huge number of files pending metadata writeback (310k -- basically the entire imported collection)
Normal. Point the mouse cursor at the pen to see the first 10 tags to write.

See Metadata for Beginners for background info and why the rich and complete metadata record IMatch produces when importing files requires a write-back, almost always.

MrPete

Wait a sec.

SO:
* If I look at a thesaurus keyword and search for any photos w/ that keyword, it "instantaneously" knows there are none.
* Yet, if I delete the keyword (or 100 of them), it can't do the same thing to verify whether / which files need updating?

Something seems fishy about that.

What about the ongoing "Updating..." indicator, that seems unrelated to any current or pending activity?


Mario

Do you have the option to apply changes done to the thesaurus to the database enabled?

Quote:
...any photos w/ that keyword, it "instantaneously" knows there are none.
Searching the entire database for a keyword should be fast ;)

I've just made some tests with a 920,000 files database (database on SSD).

Deleting a keyword (3 levels deep) from the thesaurus without applying changes to the database is instant.
Closing the Thesaurus dialog (which saves the thesaurus) takes maybe a second (big thesaurus based on the default IMatch thesaurus + maybe 2,000 extra keywords).

Deleting a 3-level deep keyword from the thesaurus with enabled "apply to database" option and then selecting the "Apply to entire database" in the prompt takes 44 seconds. Which is OK for a database with almost one million files. It's not a super-frequent operation, not many users have databases with one million managed assets etc.

I've also tested with a 100,000 files database and applying the changes to the entire database takes 2.5 seconds.

For your database, I would expect maybe 15 or 20 seconds?
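
As a rough cross-check of that estimate, one can interpolate linearly between the two measurements above (100,000 files in 2.5 seconds, 920,000 files in 44 seconds). This is only a sketch: it assumes the cost grows linearly with the number of files, which nothing in this thread confirms, and the function name is invented for illustration.

```python
def interpolate_seconds(files, small=(100_000, 2.5), large=(920_000, 44.0)):
    """Linear interpolation between two (file count, seconds) measurements."""
    (x0, y0), (x1, y1) = small, large
    return y0 + (files - x0) * (y1 - y0) / (x1 - x0)

# 320,000 is the database size reported in the first post.
estimate = interpolate_seconds(320_000)
print(f"{estimate:.1f} seconds for 320,000 files")
```

Under that assumption the result lands somewhat below the 15 to 20 seconds guessed above, so the two are in the same ballpark.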

Do this:

- Switch IMatch to debug logging: Help menu > Support
- Repeat the deletion of a keyword from the thesaurus
- Afterwards, Help menu > Support > Copy Log file...
- ZIP the created file and attach.

I can then see what takes how long and maybe provide advice.

Always make sure that your virus checker has an exclusion for the folder (!) containing the database. If an on-access virus checker gets bonkers and scans the IMatch database on every write access, performance will be ruined.

MrPete

Quote:
Deleting a keyword (3 levels deep) from the thesaurus without applying changes to the database is instant.
Closing the Thesaurus dialog (which saves the thesaurus) takes maybe a second (big thesaurus based on the default IMatch thesaurus + maybe 2,000 extra keywords).

Deleting a 3-level deep keyword from the thesaurus with enabled "apply to database" option and then selecting the "Apply to entire database" in the prompt takes 44 seconds. Which is OK for a database with almost one million files. It's not a super-frequent operation, not many users have databases with one million managed assets etc.

So... wouldn't it be much faster to modify this a bit? To delete a keyword:
  • Test if the keyword is in the database (very quick)
  • If in database: apply deletion to database. (In fact, since we know which files have the keyword, we don't need to apply to the entire database.)
  • If NOT in database: delete w/o touching DB.

The extra time for the test will be negligible. And it will save 44 seconds per keyword in a million file database. 10 seconds per keyword in our 300k file database.
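
The three steps above can be sketched as follows. This is a toy model of the suggestion only, not how IMatch works internally: `KeywordIndex` and its methods are hypothetical, keywords are treated as plain strings, and the "database" is an in-memory dictionary mapping each keyword to the files that carry it.

```python
from collections import defaultdict

class KeywordIndex:
    """Toy in-memory stand-in for a database keyword index (hypothetical)."""

    def __init__(self):
        # keyword -> set of file ids that carry it
        self.files_by_keyword = defaultdict(set)

    def assign(self, file_id, keyword):
        self.files_by_keyword[keyword].add(file_id)

    def delete_keyword(self, keyword):
        """Remove a keyword; return the file ids that actually needed an update."""
        # Step 1: a single lookup tells us whether the keyword is in use.
        affected = self.files_by_keyword.pop(keyword, set())
        # Step 2: if it was in use, only the files in 'affected' are touched.
        # Step 3: if 'affected' is empty, nothing else changes.
        return sorted(affected)

index = KeywordIndex()
index.assign(101, "insects|beetle")
index.assign(102, "insects|beetle")
print(index.delete_keyword("places|unused"))   # keyword never assigned
print(index.delete_keyword("insects|beetle"))  # keyword assigned to two files
```

The point of the sketch is that the cost depends on how many files carry the keyword rather than on the total number of files.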

:)

Mario

You also have to consider features like keyword links, keyword group levels, keyword exclusion levels and suchlike. In other words, the keyword assigned to a file (searchable) is not necessarily the keyword as it appears in the thesaurus.
This process may not be as easy as you think it is. I remember having a really hard time dealing with all the special and edge cases when implementing this a year or two ago.

If you only want to delete a keyword and your use-case is simple:

- Select the corresponding @Keywords category in the Category View
- Select all files with <Ctrl>+<A>
- Press <U> to un-assign them from the keyword.

You can also select multiple categories (same level) and do this to remove multiple keywords at once.

Or, just select files in a File Window and then <Ctrl>+click the keyword in the Keywords Panel you want to remove from all selected files.

The feature in the thesaurus is aimed more at complex scenarios: moving branches, changing links or level attributes, etc.

MrPete

Mario, is the following statement reasonably accurate?

(Probably goes in the help file thes_basics.htm under "Updating Keywords in the Database from Thesaurus Changes"?)

Quote:
If you are deleting keywords or hierarchies from the original IMatch Default Thesaurus, or other elements that you are confident have never been put to use, you can safely choose "Don't Apply Changes", as the changes you are making have no impact on existing files.


MrPete

Quote from: Mario on December 22, 2024, 04:23:42 PM
If you only want to delete a keyword and your use-case is simple...
You've provided instructions for the "simple" use-case where a keyword is assigned to one or more files.

Yet even then, the instructions don't mention actually removing the keyword from the thesaurus.

My use-case is even simpler: removing keyword(s) from the thesaurus that are not and have never been used, such as those in the default thesaurus. That's the reason for my suggested wording in the previous comment. ;)

Mario

I don't follow. Maybe a native speaker thing.

If you only want to remove keywords from the thesaurus, disable the option to apply your changes to the database again. It is off by default. This makes the operation perform in less than one second.

MrPete

Quote from: Mario on December 22, 2024, 11:19:48 PM
If you only want to remove keywords from the thesaurus, disable the option to apply your changes to the database again.

I'm just suggesting that this is an important thing to highlight for those who are new to the software and want to remove part or all of the built-in default thesaurus items.

To a newbie, these are subtle aspects. Up front, I would not have considered the importance of knowing whether thesaurus items are in use, when deleting them. SO much of the software has excellent performance, it comes as a bit of a shock when I run into something that's quite slow!  ;D

Mario

The option to apply changes to the database is off by default. This means that deleting keywords, even entire hierarchies, has sub-second performance.

When you enable the option to apply your changes to the database, you still get a prompt and can decide to apply them to the entire database, the current scope or to skip it entirely.

If a user, as in your case, lets IMatch apply the changes to the database and already has 300K files in the database, the wait time will be, say, 10 to 15 seconds, with a progress bar and estimate. I can live with that.

If reports similar to yours pile up, I will consider spending time on this again to maybe make it faster.

graham1

Quote from: Mario on December 22, 2024, 10:32:50 AM
Always make sure that your virus checker has an exclusion for the folder (!) containing the database. If an on-access virus checker gets bonkers and scans the IMatch database on every write access, performance will be ruined.


It is a somewhat tedious process to exclude folders, processes and file types using the Windows Defender anti-virus interface.  I would recommend using the Sordum Defender Exclusion tool (https://www.sordum.org/10636/defender-exclusion-tool-v1-4/). It enables exclusions to be managed by a drag and drop interface, which also makes it easy to delete exclusions which are no longer needed.  It is completely free to use (unless you choose to make a donation).

Graham