Proper Syntax For Filename Search

Started by Darius1968, January 14, 2014, 01:48:55 AM


Darius1968

Hi, I have files whose filenames may contain the words straps, pumps, jacket OR straps, pumps, sweater.  In a prior topic I got a reply that in this version of IMatch there's no Boolean equivalent with which I can say: find the filenames that have straps and pumps, and exclude just those that also have jacket.  My question is: given this scenario and the limitation of not having Boolean functions, what is the equivalent workaround, using either the search bar or the filter?

Mario

IMatch can search file names via the File Name Filter. It also searches file names when you use the search bar in the File Window toolbar.

But IMatch does not split file names into individual words for filtering or searching, which would be required in order to allow you to perform Boolean searches on individual parts of your file names. It looks like you have used your file names to store keywords instead of using the standard keywords offered by IPTC and XMP metadata.

Is the file name the only place where you have stored this information? If you have it in keywords or categories, you have many more features to work with.

sinus

Quote from: Darius1968 on January 14, 2014, 01:48:55 AM
Hi, I have files whose filenames may contain the words straps, pumps, jacket OR straps, pumps, sweater.  In a prior topic I got a reply that in this version of IMatch there's no Boolean equivalent with which I can say: find the filenames that have straps and pumps, and exclude just those that also have jacket.  My question is: given this scenario and the limitation of not having Boolean functions, what is the equivalent workaround, using either the search bar or the filter?

Darius, another workaround could be to use a script that splits your filenames and puts the words into keywords. I could not write such a script myself, but I am sure it would be quite an easy script for our scripters.

It depends on the number of files, how often you must search, and how your filenames look.
Best wishes from Switzerland! :-)
Markus

Darius1968

Mario, unfortunately the files acquired filenames that serve as keyword storage before I started using IMatch. And sinus, I've thought about the prospect of a script that would, as you put it, fill the files' metadata with the keywords, but like you, I'm no good at scripting.
Mario, is that what IMatch 3.6 did, split the filename into separate words?  What I don't get is why IMatch would have to actually split the name up into words.  Can't it do a search to find one string, and then a search to find another?

sinus

Quote from: Darius1968 on January 14, 2014, 12:03:50 PM
Mario, unfortunately the files acquired filenames that serve as keyword storage before I started using IMatch. And sinus, I've thought about the prospect of a script that would, as you put it, fill the files' metadata with the keywords, but like you, I'm no good at scripting.
Mario, is that what IMatch 3.6 did, split the filename into separate words?  What I don't get is why IMatch would have to actually split the name up into words.  Can't it do a search to find one string, and then a search to find another?

Are your filenames consistent (always the same system)? Maybe you could show us some filenames here?

I do not know the answer to your last question. I GUESS that an index for quick searching has to be built from single words like

horse
apple
woman

and that a name like

horse-and woman-in Kansas

would be difficult to build an index from.

I guess you mean that it does not matter how long a search runs, as long as you can simply search for a fragment of a filename, regardless of how the filename is constructed. Well, I simply do not know.
Best wishes from Switzerland! :-)
Markus

BenAW

Quote from: Darius1968 on January 14, 2014, 01:48:55 AM
My question is that given this scenario and the limitation of not having Boolean functions, what is the equivalent workaround, using either the search bar or the filter?
I would try a multi-stage approach  8)
First set a filter on e.g. "straps" and bookmark the result (or assign a pin or a dot).
Next set a filter on e.g. "pumps" and use the collection you assigned in the first step as the source.
Again assign a pin or a dot.
You now have a collection of images with straps AND pumps in the filename to work with.
Another filter on this collection should give you the result you need.
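
The staged narrowing described above can be sketched outside IMatch as follows. This is only an illustration in Python, not IMatch scripting: the file names and the helper name_filter are made up, the filter is assumed to behave like a case-insensitive substring match, and the last step is shown as excluding "jacket", which is one reading of the original question.

```python
# Hypothetical file names, invented for this illustration.
files = [
    "straps_pumps_jacket_01.jpg",
    "straps_pumps_sweater_02.jpg",
    "pumps_sweater_03.jpg",
    "straps_jacket_04.jpg",
]

def name_filter(names, text):
    """Keep the names that contain `text` (case-insensitive substring match)."""
    needle = text.lower()
    return [n for n in names if needle in n.lower()]

# Stage 1: filter on "straps" (the bookmarked / pinned result).
stage1 = name_filter(files, "straps")

# Stage 2: filter on "pumps", using only the stage 1 result as the source.
stage2 = name_filter(stage1, "pumps")

# Stage 3 (assumed): drop the files that also contain "jacket".
with_jacket = set(name_filter(stage2, "jacket"))
result = [n for n in stage2 if n not in with_jacket]

print(result)
```

Each stage only ever sees the output of the previous one, which is what makes the chain behave like an AND without needing a Boolean operator.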

Darius1968

Actually, I just found that if you do a search using the search bar in the File Window, the default behavior of IMatch is to require all of the words to be found.  The only thing is that the search bar searches all metadata, not just the file name.  This is okay, however.  Am I wrong?  Is there a way to specify that the search bar should consider just the file name?

Mario

All features in the search bar are explained in the help. Click into the search bar and press <F1>. Too much to reiterate here.
The help shows how you can specify groups to limit the search, and how to use OR and AND. If your questions are not answered in the help, please just reply here and I'll do my best.

Note: The search bar is based on the IMatch search engine, which is based on the built-in search engine in the database system I use.

This allows it to search very fast, e.g. 100,000 files in about one second. But this also implies some limitations. The search engine creates a "word index" of the data I feed into it. For example, a metadata value like

"A quick brown fox"

results in three entries in the search engine: "quick", "brown" and "fox".
This allows the user to perform Boolean queries like "quick" AND "fox".

A file name like "strapspumpsjacket.jpg" results in only one index entry: "strapspumpsjacket". And since it's all one word, the search engine cannot perform Boolean queries on it. It cannot use AND or OR within the same word. Whether Boolean queries work on your file names therefore depends on whether the tokenizer in the search engine considers the separators you have used in the file names as word breaks or not, e.g. "straps pumps jacket", "straps-pumps-jacket", "straps_pumps_jacket".

The tokenizer uses the following rules to break a value into terms (words):

A term is a contiguous sequence of eligible characters, where eligible characters are all alphanumeric characters and all characters with Unicode codepoint values greater than or equal to 128. All other characters are discarded when splitting a document into terms. Their only contribution is to separate adjacent terms.

All uppercase characters within the ASCII range (Unicode codepoints less than 128), are transformed to their lowercase equivalents as part of the tokenization process. Thus, full-text queries are case-insensitive when using the simple tokenizer.
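
To make the two rules above concrete, here is a minimal Python sketch that follows them as quoted. It is not the search engine's actual code; the function name simple_tokenize is made up, and any handling the real engine applies beyond these two rules is not modelled.

```python
def simple_tokenize(value):
    """Split a value into terms following the two rules quoted above.

    Eligible characters are ASCII letters and digits plus every character
    with a codepoint >= 128. Everything else only separates terms.
    ASCII uppercase letters are folded to lowercase.
    """
    terms = []
    current = []
    for ch in value:
        if ord(ch) >= 128:
            # Non-ASCII characters are eligible and kept unchanged.
            current.append(ch)
        elif ch.isalnum():
            # ASCII letter or digit: eligible, folded to lowercase.
            current.append(ch.lower())
        elif current:
            # Any other character ends the current term.
            terms.append("".join(current))
            current = []
    if current:
        terms.append("".join(current))
    return terms

print(simple_tokenize("straps-pumps-jacket.jpg"))
print(simple_tokenize("Straps_Pumps Jacket"))
print(simple_tokenize("strapspumpsjacket.jpg"))
```

Under these rules a hyphen, underscore, space or dot all act as separators, so only the name written without separators stays a single long term.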

Implementing a "better" and more intelligent tokenizer is possible but not easy, and only required for certain fringe cases. This is on my long-term to-do list, for a later revision.

I've made a few tests and when I use file names like the ones above, I can use queries like

straps AND jacket

and even

(straps AND pumps AND jacket) OR (straps AND pumps AND sweater)

in the File Window search bar, although this is stretching the purpose a bit.
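
What the second query means in terms of whole-word index entries can be sketched in Python as set tests. This is an illustration only, not how IMatch evaluates queries: the helper names and file names are hypothetical, and the tokenization is simplified to ASCII letters and digits.

```python
import re

def index_terms(file_name):
    """Simplified, ASCII-only stand-in for the word index of one file name."""
    return set(re.findall(r"[a-z0-9]+", file_name.lower()))

def matches(file_name):
    """(straps AND pumps AND jacket) OR (straps AND pumps AND sweater)"""
    terms = index_terms(file_name)
    return ({"straps", "pumps", "jacket"} <= terms
            or {"straps", "pumps", "sweater"} <= terms)

# Hypothetical file names, invented for this illustration.
names = [
    "straps-pumps-jacket.jpg",
    "straps_pumps_sweater.jpg",
    "strapspumpsjacket.jpg",
    "straps pumps.jpg",
]

hits = [n for n in names if matches(n)]
print(hits)
```

The name without separators is not a hit because its only terms are "strapspumpsjacket" and "jpg", so none of the individual words exist in the index for it.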