database file size increased exceptionally

Started by miner1, September 07, 2020, 01:41:18 PM

miner1

Hi all,
our database has more than 2,000,000 images and its size is 9 GB. Today I refreshed a folder and IMatch told me that it was going to add approx. 1,000 new files. It took quite a while, but the number of images only increased from 5,901 to 6,157???
But at the same time the overall file size of the database increased to 20 GB  :-*  (even after "diagnosis" and "optimization").

In the attached file I see some errors saying that some files couldn't be read, but what happened to the database???
Looking forward to getting some help
Glückauf
Gero

Mario

The database in the log file has only 223,149 images, not 2 million.
The database has 8946.39 MB (~ 9 GB).

ExifTool reports some problems with metadata. It reports that one file consists only of binary zeros. I don't recall seeing such an error before.
This is also the file IMatch fails to load. Probably a damaged TIFF image. You can find the name by searching the log file for W>.

After adding said TIFF file (....ynan_1993_Madsus_C.tif), the database is closed and opened (actually, it is closed and re-opened several times in this log) and then the size is reported as 19781.02 MB. Probably the database system has allocated some extra room? Not sure.

Try to remove the above TIFF image from the database and do a Compact & Optimize again. Maybe some freak side-effect or something.

miner1

Hey Mario, thanks for the really quick response!
Oh yes, of course it's only 200,000 images, sorry, my fault..
I removed the three files with the warning, rescanned the folder and did the optimization twice. The db size is still more than 20 GB :-(.
Attached is the new report, hope you have an idea.
Glückauf
Gero

JohnZeman

Just for a point of reference, my database has 126,000 files and the database size is 14 GB.

So your 20 GB database size seems about right to me.

sinus

Quote from: JohnZeman on September 08, 2020, 12:04:17 AM
Just for a point of reference, my database has 126,000 files and the database size is 14 GB.

So your 20 GB database size seems about right to me.

Depends on the images, seems also OK to me, why not?
I have 298,000 images and the database is 17 GB.

And yes, John, reference points often help to judge whether one's own setup (in IMatch) is more or less normal.  :)
That is why comparisons on the web are so popular.
Best wishes from Switzerland! :-)
Markus

Mario

20 GB is not that much. Maybe the database system just relocated stuff to make it more performant.
If you have removed all the files added in that session and then did a compact & optimize run and the database did not shrink, it is as it is. The size of the database does not impact performance.

miner1

Hmm, the absolute size might be OK, depending on the images it contains of course, but I am wondering why it increased that much after just adding 200 images...???

What I did today:
I went back to my backed-up database (9.1 GB), in which I hadn't rescanned that specific folder yet.
I rescanned another folder to which 1200!!! new images had been added (mostly nef and jpg).
After "compact & optimize" the new database has a size of 9.2 GB - this is what I expected and what matches my experience so far.
Attached is today's log.

Thanks very much so far!!!
Glückauf
Gero

sinus

Quote from: miner1 on September 08, 2020, 11:38:31 AM
Hmm, the absolute size might be OK, depending on the images it contains of course, but I am wondering why it increased that much after just adding 200 images...???


Have you tried what Mario suggested? "Delete" the corrupted file and try again?
Maybe it has to do with this file, who knows, I have never seen such a warning.
Best wishes from Switzerland! :-)
Markus

miner1

Yep
Quote: I removed the three files with the warning, rescanned it and did the optimisation twice. The db size is still more than 20GB :-(.

Mario

If you cannot reproduce the problem by adding the same files, I don't think it's worth spending more time on it. Especially given that this is the first case ever, in all the IMatch years.

miner1

Oh yes, I can reproduce it, maybe I was misunderstood due to my German-style English...
- rescanning a folder where new images have been added: OK (db size grows a little bit)
- rescanning the specific folder (also after removing the files which were marked with "W>" in the log file): db size doubles
I found some Adobe InDesign files (*.indd) in a subfolder (also Word, Excel and PDF files). Could this be a reason???

Thanks in advance
Glückauf
GERO

sinus

Quote from: miner1 on September 10, 2020, 07:33:14 AM
- rescanning the specific folder (also after removing the files which were marked with "W>" in the log file): db size doubles
I found some Adobe InDesign files (*.indd) in a subfolder (also Word, Excel and PDF files). Could this be a reason???

I have a lot of indd, doc, docx, xls, pdf ... no problems.
It seems to have to do with this specific folder.

If I were in your shoes, it would be easy. I would create a new database with only a few images, then add 20 images from this folder, and then another 20 ... until I had found the files that create the problem.

Or do the same with the actual DB: simply add only a few images and look at the size, then the next images and so on.
If the problem really comes from some files, I would catch them that way.
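
The batch-by-batch search described here can be sped up by halving the candidate list each round instead of adding a fixed number of files. The following is only a sketch of that idea in Python, not something IMatch provides: `causes_growth` is a hypothetical check you would perform yourself (index the batch into a test database and see whether the database balloons), and the sketch assumes that at least one file triggers the problem on its own.

```python
def find_culprit(files, causes_growth):
    """Narrow a list of files down to one that triggers the problem.

    causes_growth(batch) -> bool is a hypothetical, caller-supplied check,
    e.g. "index this batch into a test database and look at its size".
    Assumes at least one file in `files` triggers the problem by itself.
    """
    candidates = list(files)
    if not candidates:
        raise ValueError("no files to test")
    while len(candidates) > 1:
        half = len(candidates) // 2
        first, second = candidates[:half], candidates[half:]
        # Keep whichever half still shows the problem.
        candidates = first if causes_growth(first) else second
    return candidates[0]
```

With a few thousand files this needs roughly a dozen test runs rather than hundreds. If several files are affected, it finds only one of them per pass.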
Best wishes from Switzerland! :-)
Markus

miner1

Hey Markus,
thank you for your idea!
Unfortunately this "folder" has more than 300 subfolders, some up to 9 levels deep....
Not sure if this is going to work.

Glückauf
GERO

sinus

Hi Gero

This makes it a bit more complicated.
I may be wrong, but with a new db IMatch can index folder after folder.

I would, I guess, divide these folders (the indexing, not the real folders) into 3-6 pieces and see whether a folder or subfolder causes a problem.

But of course, this is a shot in the dark. I don't know the real cause, but I think there could be some files that cause the trouble.

Glückauf back.
Best wishes from Switzerland! :-)
Markus

Mario

A database grows in bigger steps, allocating room for a few hundred images at a time when the database capacity is exceeded. Maybe 10 MB or so per step.

The amount of data added to the database is basically the same, independent from the file format. IMatch stores the thumbnail, information about the file, the extracted metadata etc. This varies a bit per file, but only by a few hundred bytes. IMatch does not store a copy of the image in the database or anything.

Maybe this is caused by some badly corrupted metadata which causes ExifTool to pump literally gigabytes of data out (?, never experienced this before, though) and this is what grows the database so much...?

If this is really some freak side-effect (I don't recall having heard about such an effect, ever) of a specific file or files in this folder, the only way to narrow this down would indeed be to add one folder at a time (temporarily disabling Bearbeiten > Einstellungen > Indizierung: Unterverzeichnisse einschließen, roughly "Edit > Preferences > Indexing: Include subfolders" in the English UI) and then to look at the database size in Windows Explorer after each folder. Once the folder containing the problem files is identified, we should know more.
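
Instead of reading the size in Windows Explorer after every folder, the readings can be noted down and compared with a few lines of Python. This is a minimal sketch under assumptions: the function names are made up for illustration, and the 100 MB threshold is an arbitrary choice, picked only because normal growth steps are described above as being around 10 MB.

```python
import os


def size_mb(path):
    """Return the size of the file at `path` in MB (1 MB = 1024 * 1024 bytes)."""
    return os.path.getsize(path) / (1024 * 1024)


def growth_report(readings, threshold_mb=100.0):
    """Compare consecutive size readings and flag suspicious jumps.

    readings: list of (label, size_in_mb); the first entry is the baseline.
    Returns a list of (label, growth_in_mb, is_suspicious).
    """
    report = []
    for (_, before), (label, after) in zip(readings, readings[1:]):
        growth = after - before
        report.append((label, growth, growth > threshold_mb))
    return report
```

After indexing each folder, append `(folder_name, size_mb(path_to_database_file))` to the list of readings. Any step flagged as suspicious points to the folder worth uploading.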

Unfortunately, without having these 300 folders here in my lab, there is nothing I can do from here. Maybe you can upload the files somewhere (DropBox, OneDrive, GoogleDrive, ...) and send a link to #support

Maybe you can first limit it to the one folder and upload only that.
Frankly, I have no idea where even to start looking.

miner1

Thanks again!
now I am going to do some homework - folder by folder...  :-\

Glückauf
GERO

miner1

Got it!!!

The problem occurs when there are *.indd files. The db size rose from 9.2 GB to 11.6 GB with a single indd file.
Of course I don't know whether this only happens with my indd files...???

After deleting the file, rescanning that folder and running the optimization, the db size is again nearly the same as before adding that single *.indd file  ;).
I'll send you these files via WeTransfer for testing.

It was not intended to have these indd files in the database, I didn't even know they were there.
This all started when I moved a folder (with some subfolders, and including the indd files) within IMatch. At some point IMatch stopped responding, I restarted it and everything seemed to be all right. But some of the files I had moved had gone to "nirvana".
So I copied some files from a backup and rescanned the folder in IMatch, and it started to "pump up" the database.....
I am curious what you will find...  :)
Glückauf
GERO

Mario

I will look at the files over the next days.
I have many indd (InDesign, usually) in my test library and I have never experienced this.
I'll know more when I have looked at your files. My input queue is full as always, hence a couple of days...

miner1

;D Thanks for looking Mario ;D

For me the most important thing is that I could locate the files causing the trouble, and that I now know it is not a good sign when a DB suddenly grows...

Glückauf
GERO

Mario

As expected, the metadata in the files is the problem.

The "MediaManagement" XMP section contains about 800 "Ingredients" entries. This is a structured tag which produces about 1 MB (!) of data per entry in the database.
IMatch manages such structured tags as XML in order to process them (the same format is used e.g. for face regions) and this is a rather chatty format.

The combination of hundreds of these "ingredients" and their size is what makes the database grow so much.
I don't really have an idea what these ingredients are. The XMP spec from Adobe defines them as:

"Array of References to resources that were incorporated, by inclusion or reference, into this resource."

Either something went really bad in the production of these INDD files or there is some other problem with the data. I don't know.
ExifTool just extracts all the data dutifully and IMatch imports it and places it into the database.

What makes matters worse is that the xmpMM namespace (Media Management) is protected as an important namespace by IMatch and hence it cannot be excluded or modified in the Tag Manager.

I have removed the protection for the XMP::xmpMM namespace for the next version of IMatch.
This allows you (and potentially other users) to configure IMatch to ignore the "Ingredients" tag (or remove the data if it has already been imported).
The next database diagnosis followed by an optimize run will then reduce the database size again.

miner1

Hello everybody and thank you very much for the support!
Here is what I found out:
The InDesign file contains a layout of hundreds of sketches and images of archaeological finds, for printing a publication. These objects are not in the file itself, they are referenced.
To give you an impression I will send you a PDF version of the file, which my colleague gave me, via WeTransfer.

For us the indd file is "as usual". The only difference is our workflow: this time we didn't copy these hundreds and hundreds of files to a specific folder outside IMatch as we did before.
Formerly we always found some mistakes in e.g. the labeling, the white balance etc. of these finds (files). These changes had to be copied back to the original files/folders later... in theory  :-\ . So this time we wanted to do it the straightforward way.

I am happy you discovered the reason for this  ;D. Not sure if I really understood it completely ???.  I'm not sure if your approach is a final solution - nevertheless it is a solution!!!

In our database there are so many folders and sub-subfolders. I, and surely others too, don't check all the folders before scanning. So this can easily happen when you hit refresh ("aktualisieren") and something has changed deep down in the folders.
Could it make sense to handle these files as "unknown" within IMatch? There are so many file formats which IMatch can display but does not process...

Again thank you so much for the (fast) support  ;D!
Glückauf
GERO



Mario

What would handling these files as "unknown" achieve?
The problem is not the INDD format but the metadata you place in your files. The 700 references to external documents (in the sample files provided) produce approximately 1 MB of metadata each in the IMatch database => 700 MB of data added for each of your INDD files. This is just how this structured data is transferred, not a bug.
Having so many external references in the chatty Ingredients metadata format is rather uncommon.
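
As a back-of-the-envelope check of the numbers above, here is a tiny Python sketch. The function name and the `files` parameter are made up for illustration, and the figure of about 1 MB per entry is the rough estimate from this post, not a measured constant.

```python
def estimated_growth_mb(entries, mb_per_entry=1.0, files=1):
    """Rough database growth caused by structured-tag entries, in MB."""
    return entries * mb_per_entry * files


print(estimated_growth_mb(700))           # 700.0  (one INDD file with 700 references)
print(estimated_growth_mb(700, files=3))  # 2100.0 (three such files)
```

A handful of such files is therefore enough to add several GB to a database.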

I recommend disabling the XMP::xmpMM:Ingredients tag in the Tag Manager. IMatch then does not need to import the tons of ingredients data and the database will not grow.


Mario

Very good.

There are also some other large chunks of metadata in your files you may want to consider for exclusion.

Manifest Link Form, History Instance ID, Manifest Reference Instance ID, Manifest Reference Last URL etc. Each of these tags contains hundreds of copies of the same value, GUIDs or other data not useful for humans (unless you really need this in IMatch for sorting, searching, filtering, data-driven categories etc.).

Look at one of your files in the ExifTool Commands Processor to see the data it contains (especially the XMP-xmpMM namespace) and exclude the tags you don't need in the Tag Manager.
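
For those who prefer to check this outside IMatch, the same inspection can be sketched in Python, assuming the `exiftool` command-line program is installed and on the PATH. The ExifTool options used are `-j` (JSON output), `-struct` (keep structured tags nested) and `-G1` (prefix tag names with their group); the function names are made up for illustration. The sizes computed here are those of ExifTool's JSON output, which only hints at which tags are large and is not the size IMatch stores in the database.

```python
import json
import subprocess


def read_xmpmm(path, exiftool="exiftool"):
    """Run ExifTool on one file and return its JSON output for XMP-xmpMM."""
    result = subprocess.run(
        [exiftool, "-j", "-struct", "-G1", "-XMP-xmpMM:all", path],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def rank_tags_by_size(exiftool_json):
    """Return (tag, entry_count, serialized_chars) tuples, largest first."""
    record = json.loads(exiftool_json)[0]  # ExifTool emits one object per file
    rows = []
    for tag, value in record.items():
        if tag == "SourceFile":
            continue
        # List-valued tags (e.g. Ingredients) have one element per entry.
        count = len(value) if isinstance(value, list) else 1
        rows.append((tag, count, len(json.dumps(value))))
    return sorted(rows, key=lambda row: row[2], reverse=True)
```

Calling `rank_tags_by_size(read_xmpmm("your_file.indd"))` (the file name is a placeholder) lists the bulkiest tags first, which are the candidates for exclusion in the Tag Manager.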

miner1

Thank you very much Mario!
No, I don't need these files for searching etc. These *.indd files are in the db only because we wanted to keep the InDesign file near the location where all the referenced files are.
Hopefully this was helpful for other users too...  :o

Glückauf
Gero

Mario

I have disabled these structured tags by default now (affects only new installations).
These tags are not uncommon, but 700 of them in a single file, each delivering about 1 MB of metadata, is truly unusual. Otherwise this would have come up earlier.