Updating Index ... must it be automatic?

Mario · January 05, 2014, 08:17:01 PM

Quote from: DigPeter on January 05, 2014, 07:18:27 PM
QUESTION I have a list of some 4700 assigned categories (mostly botanical taxa). Is this likely to make processing very prolonged?

5,000 categories is already a large set, but IMatch handles that without difficulty. 20,000 or more categories is normal when users work with data-driven categories a lot. Or even 30,000 to 50,000 categories.

@Keyword categories are the slowest type of category because when you manipulate them, IMatch has to mirror all changes in the metadata cached for your files in the database. IMatch can assign 5000 files to a category in 0.2 seconds. But when you do that for a @Keyword category, IMatch also has to write the keyword into the metadata records of 5000 files. And this can take a few seconds.

Ferdinand · January 06, 2014, 10:31:02 AM

Quote from: DigPeter on December 31, 2013, 03:03:08 PM
I am most sad to have to agree with the above. Despite its superb facilities, IM5 is for me unworkable. I have recently created, for the first time (in build 1.30), a full database of some 35000 images. I am having the same problems as described in other posts. Almost permanent "background" metadata reading/writing and index updating, but in fact not entirely "background", as IM5 is often completely unresponsive for minutes at a time. This is even when I have deselected all the automatic activities I can find. I do not think that is equipment related. I have a reasonably powerful computer with 6GB memory and 500GB HDD. The database and files are in separate folders on the internal drive.

Quote from: DigPeter on January 01, 2014, 11:48:37 PM
Quote from: Ferdinand on January 01, 2014, 11:31:58 AM
When the 2014 lockout bug is fixed, post a screen grab of your metadata preferences plus a typical file with pre-existing keywords.
The attached zip contains 2 images. They both have DC subject flat keywords and LR hierarchical keywords. The older file also has supplemental categories. I used your script to write the hierarchical keywords and to produce the thesaurus.

The zip also contains a screen shot of my metadata prefs.

Metadata 2 is set to default - should I set Allow create IPTC/EXIF/GPS to 'Yes'?

In Background processing I set Background indexing and Writeback metadata to 'Off'

Thanks for your interest, again.

Peter

I've had the chance to look at your files. I've also played with the database converter.

In my experience so far, the converter doesn't create the search engine index - the first diagnostics run tells you this. As I understand it, if you don't create it yourself from Database > Database Tools > Rebuild Search Engine Index, then IMatch will do it for you in the background. I wonder if this is part of the issue that several people are seeing with IMatch being busy for no apparent reason, if it occurs after a conversion.

Now to your images. I looked mainly at one of them - Tangley30613.jpg. This has the following keyword metadata:

~~~~~~~~~~~~~~~~~~~~~~~~
IPTC Keywords: Christian building, Hants, NW Hants, Peter Photo, UK
XMP Subject: Christian building, Hants, NW Hants, UK
XMP Hierarchical Subject: Location|UK|Hants|NW Hants, Source|Peter Photo, Subject|Building|Christian building
~~~~~~~~~~~~~~~~~~~~~~~~

So this file already has hierarchical keywords.
You've chosen the option "Don't replace existing hierarchical keywords", so the flat keywords won't be read in as hierarchical ones - which is good.
You've chosen the option to "Write hierarchical keywords" and also "Write path elements".
This means that IMatch will want to write *all* the following flat keywords based on the hierarchical ones:

~~~~~~~~~~~~~~~~~~~~~~~~
Location, UK, Hants, NW Hants, Source, Peter Photo, Subject, Building, Christian building
~~~~~~~~~~~~~~~~~~~~~~~~

But some of them are already in the file. I assume that these are the only ones you want in the file. There is a way to get IMatch to understand this.
You've chosen the option to "Lookup keywords via the thesaurus". What you need to do is to mark those elements you don't want written as flat keywords as "Exclude", i.e.

~~~~~~~~~~~~~~~~~~~~~~~~
Location, UK, Source, Subject, Building
~~~~~~~~~~~~~~~~~~~~~~~~

If you do this, then IMatch will see that all the keywords that it needs to write are already there, and it won't want to write any more. I created the thesaurus and marked these nodes as Exclude and tested this and it worked - IMatch didn't want to write anything.
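The effect of the Exclude flags can be sketched in a few lines of Python. This is only an illustration of the reasoning above, not IMatch's actual code: the function name `flat_keywords` is made up, and the "already in the file" set is taken from the IPTC Keywords line of Tangley30613.jpg.

```python
def flat_keywords(hierarchical, excluded, sep="|"):
    """Collect every path element of each hierarchical keyword, skipping excluded ones."""
    result = []
    for path in hierarchical:
        for element in path.split(sep):
            if element not in excluded and element not in result:
                result.append(element)
    return result


# Keyword data of Tangley30613.jpg as listed above.
hierarchical = [
    "Location|UK|Hants|NW Hants",
    "Source|Peter Photo",
    "Subject|Building|Christian building",
]
existing_flat = {"Christian building", "Hants", "NW Hants", "Peter Photo", "UK"}

# "Write path elements" with nothing excluded: every level becomes a flat keyword.
all_elements = flat_keywords(hierarchical, excluded=set())
missing_without_exclude = [k for k in all_elements if k not in existing_flat]

# Same file, with the five thesaurus nodes marked as Exclude.
excluded = {"Location", "UK", "Source", "Subject", "Building"}
wanted = flat_keywords(hierarchical, excluded)
missing_with_exclude = [k for k in wanted if k not in existing_flat]

print(missing_without_exclude)  # keywords that would trigger a write-back
print(missing_with_exclude)     # nothing left to write
```

Without the Exclude flags four keywords are "missing" and the file is queued for write-back; with them the list is empty.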

So it's possible that all the activity you're seeing is because IMatch is trying to write out what it thinks are missing keywords, because you haven't marked them as Exclude.

If you're using my script and you've created a thesaurus file, what you need to do before using a new DB, either for conversion or importing of images, is to import the thesaurus file and mark the Exclude nodes, i.e. before any images are added. You have to do this manually - the script can't do it for you like it can for Group (by marking them in [ ]).

There is one other thing here that puzzles me. I can see that you've Excluded the first two levels for Location and Subject, but only one level for Source. I didn't think my script could do that?

Mario · January 06, 2014, 10:56:03 AM

The converter does a search engine index rebuild.
If one or more files are found off-line, the search engine index will not be complete (because there is no metadata for these files) and this is what can cause the diagnostics warning. Can this be the case?

Ferdinand · January 06, 2014, 01:46:31 PM

Quote from: Mario on January 06, 2014, 10:56:03 AM
The converter does a search engine index rebuild.
If one or more files are found off-line, the search engine index will not be complete (because there is no metadata for these files) and this is what can cause the diagnostics warning. Can this be the case?

Not my experience from several conversions, and all files were online, AFAIK.

Mario · January 06, 2014, 02:48:25 PM

The test compares the number of files in the database (minus the number of files currently in the processing queue) with the number of files for which there is data in the search engine index.
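As a rough sketch, the comparison described here amounts to something like the following Python. The function name, the parameter names and the sample numbers are all hypothetical, and treating any mismatch as "incomplete" is an assumption; the real diagnosis may be more tolerant.

```python
def index_is_complete(files_in_db, files_in_queue, files_in_index):
    """Return True when every file that is not still queued has search index data."""
    expected = files_in_db - files_in_queue
    return files_in_index == expected


# Hypothetical numbers: a converted database whose index was never built.
print(index_is_complete(files_in_db=35000, files_in_queue=0, files_in_index=0))

# Hypothetical numbers: 200 files still waiting in the processing queue.
print(index_is_complete(files_in_db=35000, files_in_queue=200, files_in_index=34800))
```

The first call reports an incomplete index, the second a complete one, because queued files are not expected to be indexed yet.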

When you manually rebuild the search engine index and then run the diagnosis again, is the warning gone?
Rebuilding the search engine index is one of the last steps in the converter so the result should be the same...

DigPeter · January 06, 2014, 02:57:42 PM

Very many thanks Ferdinand for this comprehensive reply.

Quote from: Ferdinand on January 06, 2014, 10:31:02 AM
In my experience so far, the converter doesn't create the search engine index - the first diagnostics run tells you this. As I understand it, if you don't create it yourself from Database > Database Tools > Rebuild Search Engine Index, then IMatch will do it for you in the background. I wonder if this is part of the issue that several people are seeing with IMatch being busy for no apparent reason, if it occurs after a conversion.

See Mario's response.

Quote:
Now to your images. I looked mainly at one of them - Tangley30613.jpg. This has the following keyword metadata:

~~~~~~~~~~~~~~~~~~~~~~~~
IPTC Keywords: Christian building, Hants, NW Hants, Peter Photo, UK
XMP Subject: Christian building, Hants, NW Hants, UK
XMP Hierarchical Subject: Location|UK|Hants|NW Hants, Source|Peter Photo, Subject|Building|Christian building
~~~~~~~~~~~~~~~~~~~~~~~~

So this file already has hierarchical keywords.
You've chosen the option "Don't replace existing hierarchical keywords", so the flat keywords won't be read in as hierarchical ones - which is good.
You've chosen the option to "Write hierarchical keywords" and also "Write path elements".
This means that IMatch will want to write *all* the following flat keywords based on the hierarchical ones:

~~~~~~~~~~~~~~~~~~~~~~~~
Location, UK, Hants, NW Hants, Source, Peter Photo, Subject, Building, Christian building
~~~~~~~~~~~~~~~~~~~~~~~~

But some of them are already in the file. I assume that these are the only ones you want in the file. There is a way to get IMatch to understand this.

This is only a small selection of my keywords of course. In general I have excluded the top level in Thesaurus (Location, Source, Subject etc) and certain 2nd level keywords (e.g. Building) - this also answers your next point

Quote:
You've chosen the option to "Lookup keywords via the thesaurus". What you need to do is to mark those elements you don't want written as flat keywords as "Exclude", i.e.

~~~~~~~~~~~~~~~~~~~~~~~~
Location, UK, Source, Subject, Building
~~~~~~~~~~~~~~~~~~~~~~~~

Done - see above.

Quote:
If you do this, then IMatch will see that all the keywords that it needs to write are already there, and it won't want to write any more. I created the thesaurus and marked these nodes as Exclude and tested this and it worked - IMatch didn't want to write anything.

So it's possible that all the activity you're seeing is because IMatch is trying to write out what it thinks are missing keywords, because you haven't marked them as Exclude.

But these levels are excluded (which I do manually in IM5), so the over-activity must be caused by something else. Perhaps Mario can enlighten.

Quote:
There is one other thing here that puzzles me. I can see that you've Excluded the first two levels for Location and Subject, but only one level for Source. I didn't think my script could do that?

Only Subject has 2nd level exclusions, but not all of them. "Building" is redundant when the next level is "Bridge", "Secular building" for instance, but "Pets" stands on its own when accompanied by the name, so is not excluded.

Your comment has highlighted a small anomaly in the example of my image. I see that IPTC KWs do not include "Peter Photo", but DC\subject does. I assume the reason for this is that in IM3 I do not write this to IPTC KWs, but IM5 writes it to DC\subject. But I would not think that this is significant.

Ferdinand · January 07, 2014, 12:02:10 AM

Quote from: Mario on January 06, 2014, 02:48:25 PM
When you manually rebuild the search engine index and then run the diagnosis again, is the warning gone?
Rebuilding the search engine index is one of the last steps in the converter so the result should be the same...

Yes, if I manually rebuild the search engine index, then at the next diagnosis (immediately after) the warning is gone. In each of the conversions that I've done, the diagnostic run has indicated a need to rebuild. I was surprised by this, because I thought your aim was to leave the DB in a ready to be used condition. Perhaps you need to see the conversion logs from such a run?

@Peter - sounds like you're on top of this Exclude thing. I still wonder if there wasn't some write-back that IMatch wanted to do. There weren't any yellow pencils on thumbnails indicating a need to write-back?

I have a hunch that quite a few people will find that IMatch will have a lot of write-back to do after a conversion. This will occur because of all the keywords they've written to IPTC, and how IMatch imports them with default settings. The point of my script was to prepare the keywords in advance to avoid this, and it sounds like you're on top of it.

Mario · January 07, 2014, 10:17:52 AM

Quote from: Ferdinand on January 07, 2014, 12:02:10 AM
Yes, if I manually rebuild the search engine index, then at the next diagnosis (immediately after) the warning is gone. In each of the conversions that I've done, the diagnostic run has indicated a need to rebuild. I was surprised by this, because I thought your aim was to leave the DB in a ready to be used condition. Perhaps you need to see the conversion logs from such a run?

No, found it. The diagnosis was right of course.
The call to the search engine rebuild routine in the Database Converter was disabled. A leftover from a debugging session...

Ferdinand · January 07, 2014, 01:20:17 PM

So this might be the source of complaints about excess post-conversion activity for no apparent reason?

Mario · January 07, 2014, 02:49:08 PM

No, not really. IMatch does not run full index rebuilds automatically. It only updates the index after metadata of a file has changed.

Do you have a specific report in mind? The log file of that report may tell us what IMatch 5 is doing after opening the converted database.

DigPeter · January 07, 2014, 03:06:23 PM

Quote from: DigPeter on January 05, 2014, 04:58:28 PM
Quote from: Mario on January 04, 2014, 01:12:18 PM
Quote:
With 30000+ files, the system is inoperable, even after the ingest process has finished.

30,000 files is about the size of my smallest test databases. I work with that daily.
There are IMatch 5 users with databases of 100,000 or more files. And apparently they can work with the system just fine.

Can you attach a log file from a session? It contains performance data which may tell me something.
Did you disable your virus checker for the folder holding the database?
Is your database on a slow external disk, or even a network drive?

@Mario
Virus checker is disabled for IM.
The database and files are in separate folders on a 500GB internal HDD. I have 6GB of memory in a reasonably powerful computer.

I have created a new database with some 32000 images. Almost all the image files had LR hierarchical subject KWs. I used your converting script to create the IM5 database from the IM3 version. This took about 3.5 hours. Two things needed attention:

- Some 30 files which did not have LR hierarchical subject KWs, had flat @keywords categories outside the hierarchical structure. I corrected this by creating hierarchical subject KWs and deleting the flat @keywords categories. While doing this there were frequent interruptions while metadata was being read and the index updated, causing periods of minutes when there was no response from IM. I closed the database and copied the logfile. This is 0401 in a 4MB zip file which I am sending to you by email.

- A full set of regular categories had been created, but the @keyword categories were incomplete. The reason for this is probably that after the conversion, there were still some 10000 files that needed metadata write back. Automatic Background processing is not selected in preferences. I set the write back in motion. This took about 3 hours. After this there was a lengthy period of metadata reading followed by index updating. During this time there were again frequent unresponsive periods, including a forced closing of IM. During the index updating, IM was almost continually unresponsive and unworkable. I closed IM when the progress message showed that there was still over 15 hours remaining. The two log files 0501_12 and 05_15 refer.

@Mario
Did you get the email containing the log files - sent at 1609 on 5 Jan?

Mario · January 07, 2014, 03:08:49 PM

I get between 40 and 50 emails per day. I have a backlog of several days...

DigPeter · January 07, 2014, 03:11:14 PM

Quote from: Mario on January 07, 2014, 03:08:49 PM
I get between 40 and 50 emails per day. I have a backlog of several days...

OK - all in good time. I will be off line from Thursday for a week.

Mario · January 07, 2014, 04:01:44 PM

These log files tell me that IMatch was writing back data to files.

The _12 file is only for two minutes. The last info I see is that IMatch was beginning writing back data to about 10,000 files.

The _15 file is for the same database. Also IMatch starts to write back data for about 10,000 files.
After about 20 seconds or so of collecting and preparing, IMatch starts to write data to each of the 10,000 files (12:24:34)
The time it needs to write to each file varies between 0.1 and about 1.5 seconds.
The last write-back is at 14:32:42, about 2 hours and 10 minutes later.
IMatch writes about 75 files per minute on your machine, including propagation and MWG compliance.
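
For anyone who wants to check the arithmetic, here is a small Python calculation using only the two timestamps from the log and the rounded figure of 10,000 files (the exact file count is not given here, so the result is approximate):

```python
from datetime import datetime

# Timestamps of the first and last write-back in the _15 log.
start = datetime.strptime("12:24:34", "%H:%M:%S")
end = datetime.strptime("14:32:42", "%H:%M:%S")

files = 10_000  # "about 10,000 files", rounded
elapsed_seconds = (end - start).total_seconds()

files_per_minute = files / (elapsed_seconds / 60)
seconds_per_file = elapsed_seconds / files

print(f"elapsed: {elapsed_seconds / 60:.1f} minutes")
print(f"throughput: {files_per_minute:.0f} files per minute")
print(f"average: {seconds_per_file:.2f} seconds per file")
```

The average per file falls inside the 0.1 to 1.5 second range seen for individual writes, and the throughput lands in the high 70s per minute.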

After the write has been completed, IMatch needs to reload the metadata of all files in order to synchronize what's in the database with what's now in the file after ExifTool has completed (ExifTool may update a lot more tags than IMatch writes, e.g. timestamps, digest data, mapping between XMP and other formats etc.).

On your system and for your formats, ExifTool needs between 0.7 and 5 seconds to extract metadata for batches of 10 files. I see that the system load and the amount of tags varies big time. Your files have between 30 and 300 metadata tags each.

The last import is completed at 14:45:43, after about 13 minutes.

IMatch finds some new/updated files and imports them. Re-calculates all collections and a number of data-driven categories. 14:45:59

Then the fun part begins. IMatch starts to update the search engine index (disk-intensive operation).

Deleting the data for 100 files takes 0.6 seconds.
But inserting the data for 100 files requires 159 (!) seconds, more than 2 minutes.
The next batch of 50 files (IMatch adapts the batch size to computer performance) takes 0.07 seconds for the delete, but 27 seconds for the insert.
IMatch further reduces the size of the batches it uses for the search engine update to 25 files. 0.0 seconds to delete the old data, 27 seconds for the insert.
IMatch reduces the batch size down to 2 files (yuck!) but still the insert takes 27 seconds.

The idea behind reducing the batch size automatically is to keep the impact of the search engine index update minimal (don't block the UI). But on your system and for this database, even re-indexing the data for 2 files takes 27 seconds...
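
A minimal sketch of that kind of adaptive batching, in Python. This is not the IMatch implementation: the halving rule, the 5 second target and the minimum of 2 files are assumptions chosen only so that the sequence passes through the sizes seen in the log (100, 50, 25 and finally 2).

```python
def next_batch_size(current, seconds_taken, target_seconds=5.0, minimum=2):
    """Halve the batch when the last insert overran the target; never go below minimum."""
    if seconds_taken > target_seconds:
        return max(minimum, current // 2)
    return current


# Insert times from the log; smaller batches are assumed to take 27 s as well,
# which matches what the log shows for the final batch of 2 files.
observed_seconds = {100: 159.0, 50: 27.0, 25: 27.0}

size = 100
sizes = [size]
while True:
    seconds = observed_seconds.get(size, 27.0)
    new_size = next_batch_size(size, seconds)
    if new_size == size:
        break  # already at the minimum, cannot shrink further
    size = new_size
    sizes.append(size)

print(sizes)
```

Shrinking the batch only helps when the cost scales with the number of files. Here the insert time stays at 27 seconds regardless of batch size, so the loop bottoms out at the minimum without ever reaching the target.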

Very very strange. Inserting metadata for 50 files takes as long as inserting metadata for only 2 files.
And I assume the disk was busy all the time (if you can recall it)?

This looks like a bad case of fragmentation to me. I'm sure that a Database > Tools > Compact after adding 30,000 files and writing back 10,000 files would have sped up things considerably. Can you try that and post the results?

DigPeter · January 07, 2014, 04:13:20 PM

Quote from: Mario on January 07, 2014, 04:01:44 PM
This looks like a bad case of fragmentation to me. I'm sure that an Database > Tools > Compact after adding 30,000 files and writing back 10,000 files would have sped up things considerably. Can you try that and post the results?

Mario - thanks.
This is virtually a new database, so it must have got fragmented very quickly. I will do as you ask when I return.

Mario · January 07, 2014, 06:21:22 PM

Yeah. Adding 30,000 files in a row, writing back 10,000 files, re-importing 10,000 files etc. may cause fragmentation, depending on the file system in use and other factors.

DigPeter · January 18, 2014, 06:53:09 PM

Quote from: Mario on January 07, 2014, 04:01:44 PM
This looks like a bad case of fragmentation to me. I'm sure that an Database > Tools > Compact after adding 30,000 files and writing back 10,000 files would have sped up things considerably. Can you try that and post the results?

@Mario
Database checked, optimised and compacted. Results of diagnostics are attached. Note that there were some errors. I am delighted to report that the database is now more manageable. Delays and lockouts are minimal. Perhaps the problems were teething trouble with a newly converted database.

[attachment deleted by admin]