Indexing textlayers

Started by rienvanham, December 30, 2021, 07:15:22 PM


rienvanham

Hi Mario,

As discussed, here is a feature request to add indexing of the text layers in PDF files (and, if possible, in DOC(x) and XLS(x) files). In my work I have a huge number of files (> 100,000 PDFs) that I have to search very often. Until recently I used Copernic Desktop Search, but I have now switched to Everything 1.5 (which is still in alpha but works quite well).

It would be great if IMatch could also index these files, but I'm aware that not many people would like to pay for this extra feature.

Thanks in advance!

Regards,

Rien.

Ger

+1

With the increasing number of data files, indexing of file contents is becoming more important, mainly for Office (PowerPoint, Excel, Word) and PDF files.

If IMatch offers this, I certainly will have my data indexed (in a separate database) for content. If this functionality is up to IMatch standards :P, I am willing to pay extra/separate.

ger

Mario

Just to shed some light on this, from a developer's perspective...

What such a feature would require is:

1. Buy/implement code that can extract text from PDF and Office documents. I'm quite sure that this will prove to be challenging and/or expensive (for commercial use).
2. Implement a full-text indexing system / storage facility in the IMatch database. This is required to store the extracted text and make it searchable.
3. Deal with all the language-specific issues, like per-language stopword and stemming lists (which words to ignore when indexing, how to fold singulars/plurals into common forms) etc.
4. Deal with things I have not thought about yet.
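
To make points 2 and 3 a bit more concrete, here is a minimal Python sketch of a full-text index with stopword removal and very naive plural folding. It is an illustration only, not how IMatch works or would implement this; the stopword list, the `stem` rules and all function names are assumptions made up for this example, and text extraction (point 1) is assumed to have happened already.

```python
import re
from collections import defaultdict

# Tiny illustrative stopword list; a real system needs one list per language.
STOPWORDS = {"a", "an", "and", "in", "of", "the", "to"}


def stem(word: str) -> str:
    """Very naive English plural folding, for illustration only."""
    if len(word) > 3 and word.endswith("ies"):
        return word[:-3] + "y"
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def tokenize(text: str) -> list[str]:
    """Lowercase, split on non-letters, drop stopwords, fold plurals."""
    words = re.findall(r"[a-z]+", text.lower())
    return [stem(w) for w in words if w not in STOPWORDS]


def build_index(documents: dict[str, str]) -> dict[str, set[str]]:
    """Map each normalized term to the set of file names containing it."""
    index: dict[str, set[str]] = defaultdict(set)
    for name, text in documents.items():
        for term in tokenize(text):
            index[term].add(name)
    return index


def search(index: dict[str, set[str]], query: str) -> set[str]:
    """Return the files that contain every term of the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical extracted text of two documents.
docs = {
    "invoice.pdf": "Invoices for the delivery of spare parts",
    "report.docx": "Annual report and invoice summary",
}
index = build_index(docs)
print(sorted(search(index, "the invoices")))
```

Even this toy version shows where the effort goes: the stopword list and the stemming rules are language-specific, and the index has to be stored and kept in sync with the files.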

I would guesstimate that this means weeks or, more likely, months of development, tuning, testing, documenting, help writing etc. Expensive.
I have no general problem with this, if it benefits a large share of the existing IMatch user base or generates many new users for IMatch.

Ger

Fully agree with that. It's certainly not "just another app". Actually - I think it's much more a different application than an add-on to IMatch.

But organizing data is important. We all have thousands of files scattered in many directories and drives. I have an enormous archive with 10+ years of emails, presentations, spreadsheets and other documents. Finding something you know you did some years ago is not easy.

When speaking about "up to IMatch standards", I think about an application that is versatile, flexible and powerful. Not just a plain search for individual words in these files, but one that applies intelligent mechanisms: searching in context, related expressions etc.


Mario

There is specialized text retrieval / search software available for this purpose, which can search local Office/Outlook files as well as SharePoint and Office 365 in the cloud, plus PDF files and many other file formats.
Many companies with many employees make a living from developing and supporting such applications.

The Windows indexing service also does a decent job when you enable the file formats you want to search in the indexing options.

sinus

Rien,
you wrote in another post "I'm not talking about OCR-ing, just pulling out the textlayer and index this...."
What exactly is a "textlayer"? I really do not know, but it seems to be something other than the whole text, I guess; only a fraction of it.
Best wishes from Switzerland! :-)
Markus

digedag

-1

IMatch serves me exclusively for the management of image files for photos and graphics (mostly in pixel, rarely in vector format).
The support of the very (!) many other formats is a nice-to-have for me, but not really necessary.

As Mario already wrote, there is specialized software for everything. I prefer to have an app for a specific purpose, but one that does it REALLY WELL.
A universal app can possibly do a lot, but then usually only in a mediocre way.

For managing (and partly editing) documents like pdf, doc(x), xls(x), ppt(x) etc. I use PaperPort (https://www.kofax.com/products/paperport).
I don't need to have all these multiple functions in IMatch.

However, as I have come to know Mario over the years, he is one who likes to do it perfectly though ....
That would drive effort (and price) for things that might (?) be used by only a few.

Just my personal opinion.

Jingo

I too only use IMatch for photo management and tend to be a firm believer in choosing specific software designed for specific tasks, like music management apps (Mediamonkey) and document tracking (OpenKM). I also use XYPlorer as a Windows Explorer replacement and it too offers some tagging and preview abilities for files. Just my 2 cents as well!

Stefanjan

I also only use IMatch for photos. I did initially add other folders to IMatch, but that did not work for me.

My computer folders, email client, OneNote etc. and the sync between devices are organised well enough that I can usually easily find what I'm looking for.

sinus

I use IMatch for all my files.
Word, InDesign, Blender, Excel, txt, pdf and what else ... (ah, yes, coffee  ;D)

I know IMatch quite well, hence I like to do all DAM with IMatch.
I have a good (for me) file name specification, hence with the name alone I can find a lot of pics.
If I want to use more text, I simply work with metadata or keywords, no problem to find all kinds of files.

Further I can use IM also with the timeline, and so on.
Ahem, I do even manage my tax-files (.a20, a21 ...) for the income-taxes  :o ;D

IMatch manages all ...  :)

Mario

A 'trick' often used is to copy the abstract/intro/summary of PDF files and Office files into the XMP description.
This makes the relevant contents immediately searchable in IMatch.

If your PDF files have proper metadata, you can use the PDF tags IMatch manages directly.
If you have a so-so mix, you can use a Metadata Template to copy the PDF description into the XMP description, and fill the XMP description for other files by hand.
Same for PDF keywords.

A similar workflow is applicable for Office documents of course.

With a filter or a data-driven category you can quickly identify which PDF/Office documents have descriptions / titles / keywords in their native metadata.

Doing this may take a few seconds for each file when they come into the IMatch database, but once the metadata has been standardized in XMP, it will live and be useful forever.
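
The "copy the PDF description into the XMP description if it is empty" rule can be sketched in a few lines of Python. This is only an illustration of the logic, not IMatch's Metadata Template feature or any IMatch API; the record layout and the key names `pdf_description` and `xmp_description` are hypothetical.

```python
def standardize_description(record: dict[str, str]) -> dict[str, str]:
    """Return a copy in which an empty XMP description is filled from the PDF one."""
    result = dict(record)
    xmp = result.get("xmp_description", "").strip()
    pdf = result.get("pdf_description", "").strip()
    if not xmp and pdf:
        result["xmp_description"] = pdf
    return result


def needs_manual_description(record: dict[str, str]) -> bool:
    """True if neither the XMP nor the native PDF description has any text."""
    xmp = record.get("xmp_description", "").strip()
    pdf = record.get("pdf_description", "").strip()
    return not xmp and not pdf


# Hypothetical metadata records for three files.
files = [
    {"name": "manual.pdf", "pdf_description": "User manual for scanner X", "xmp_description": ""},
    {"name": "scan001.pdf", "pdf_description": "", "xmp_description": ""},
    {"name": "offer.pdf", "pdf_description": "Old text", "xmp_description": "Offer 2021, final"},
]

standardized = [standardize_description(f) for f in files]
todo = [f["name"] for f in standardized if needs_manual_description(f)]
print(todo)
```

An existing XMP description is never overwritten, and files with no description at all end up in a list to be filled by hand.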

sinus

Quote from: Mario on January 03, 2022, 05:16:36 PM
A 'trick' often used is to copy the abstract/intro/summary of PDF files and Office files into the XMP description.
...

Thanks, Mario

This is a very good trick!
I will check/try this  :D

Mario

Some PDF files have title, description and keywords in the native PDF data. Many don't. Depends on the source of the PDF.
In project work, I've found it worthwhile to standardize PDF files to always use XMP title, description, keywords. Easy to do with MD templates and a great plus when it comes to searching, organization using data-driven categories etc. There are also usable timestamps in PDF metadata which can be used to fill XMP create date and date subject created.

All very easy to handle with simple IMatch features like Metadata Templates, Variables, Favorites etc.

Aubrey

I primarily use IMatch for photos.

Given finite time for development, I would prefer that the time was allocated to enhancement of features for management of photos.

Aubrey

-1. !

lbo

Quote from: Mario on January 02, 2022, 11:19:53 AM
Windows indexing services also does a decent job when you enable the file formats you want to search in the indexing options.

Agreed. As I wrote before, Windows Search is (was?) one of the best and most underrated functions in Windows (i.e. better than Copernic). What a pity that Microsoft began to hide it in Windows 10 (reassigned Win+F shortcut) and crippled the UI so badly in 2019.

You might check how difficult it would be to use the Windows Search index in IMatch.

The question is still how many IMatch users want to search document text in IMatch. I don't, therefore no vote from me to invest work time.

Mario

Quote: You might check how difficult it would be to use the Windows Search index in IMatch.

I know there is an API (programming interface) of sorts.
But how would that integrate into IMatch? The search index depends on the settings the user has made, and does not necessarily correlate with anything in IMatch (it may not index the same folders IMatch has indexed, it may find results in folders IMatch does not know about etc.). Figuring all that out (and assuming that the indexing service really does its job) will be quite a project. And it adds a ton of external dependencies...
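
The mismatch problem can be illustrated with a small Python sketch: hits coming back from an external index have to be split into files below folders the database manages and files it has never seen. Nothing here calls the real Windows Search API or IMatch; the hit list, the folder list and the function names are hypothetical.

```python
from pathlib import PureWindowsPath


def is_managed(path: str, managed_folders: list[str]) -> bool:
    """True if the file lies below one of the managed folders."""
    parents = PureWindowsPath(path).parents
    return any(PureWindowsPath(folder) in parents for folder in managed_folders)


def split_hits(hits: list[str], managed_folders: list[str]) -> tuple[list[str], list[str]]:
    """Split external search hits into (known to the database, unknown)."""
    known = [h for h in hits if is_managed(h, managed_folders)]
    unknown = [h for h in hits if not is_managed(h, managed_folders)]
    return known, unknown


# Hypothetical hits from an external index and hypothetical managed folders.
hits = [r"D:\Photos\2021\manual.pdf", r"C:\Users\me\Downloads\offer.pdf"]
known, unknown = split_hits(hits, [r"D:\Photos"])
print(known)
print(unknown)
```

This only covers one direction. Files in managed folders that the external index has not indexed at all simply never show up as hits, which is harder to detect.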

rienvanham

Hi Sinus,

Many PDFs consist only of a picture that you look at. E.g., if you scan a document and save it as a PDF, you have "only" a picture, not the text. You can't copy the text from this layer. But if you "OCR" the picture (which means that the software tries to "read" the picture and works out which characters are printed), it becomes text. This text (in a separate layer) is (or can be) put behind the image layer (depending on what you want).