Ignore diacritics (accents, tildes etc.) in category search and filter

Started by akirot, May 22, 2019, 08:01:13 AM


akirot

I mainly use categories to organize my images. Also I use native/correct spelling in foreign languages.

The search and filter features (at the bottom of the category view and category panel) are already case-insensitive - which is great and fast.

The feature request is to ignore diacritics (accents, tildes etc.) too.

Thus e.g. searching/filtering for
- "montana" finds "Montaña" (Spanish) and
- "iles" finds "Îles" (French)

This is similar to searching on the web.


sinus

Best wishes from Switzerland! :-)
Markus


Mario

OK, this is one of the feature requests which immediately attracted a number of +1s. Which means it would most likely be helpful for many.

Unfortunately, I have no real idea how to implement this at this time.
Many search functions in IMatch are based on built-in search functions in 3rd party toolkits or features provided by the Windows API (programming interfaces).
Some searches are performed by the database engine (for speed), which has no support for this kind of advanced search functions.

Then there are regular expressions, which would also need to be able to handle this? That would require either pre-processing anything that's fed into the regexp, or the user taking care of these special characters when defining the regexp.
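To illustrate the pre-processing variant, here is a minimal Python sketch (not IMatch code). It assumes the user writes the pattern with base characters only, and it strips combining marks from the text before the regexp sees it. The helper names `strip_marks` and `loose_regex_search` are made up for this example.

```python
import re
import unicodedata

def strip_marks(text: str) -> str:
    """Decompose characters (NFD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def loose_regex_search(pattern: str, text: str):
    """Run the pattern against the mark-free version of the text."""
    return re.search(pattern, strip_marks(text), flags=re.IGNORECASE)

# The same pattern, with and without pre-processing of the text.
strict = re.search(r"\bgarcons?\b", "Les garçons", flags=re.IGNORECASE)
loose = loose_regex_search(r"\bgarcons?\b", "Les garçons")
```

Note that the match positions refer to the pre-processed text, not the original one.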

And for implementing my own search functions, I would have to find a way to identify these "special" characters and somehow "transform/fold" them to their base form - which of course depends on the language the user is working in.

Just ignoring ñ would not help, because it would possibly lead to wrong matches. The ñ would have to be transformed to n explicitly, in both the search term and everything that's fed into it. That could be millions of text fragments (not for the category search of course, but for the File Window search, which has the same 'problem').
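As a rough illustration of transforming both sides, here is a small Python sketch (not IMatch code). It assumes Unicode decomposition (NFD) plus removal of combining marks is an acceptable way to reach the base form, which will not be right for every language. The names `fold` and `matches` are hypothetical.

```python
import unicodedata

def fold(text: str) -> str:
    """Return a case- and diacritic-insensitive form of the text."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()

def matches(term: str, candidate: str) -> bool:
    """Fold the search term and the candidate before comparing."""
    return fold(term) in fold(candidate)

categories = ["Montaña", "Îles", "Montana", "Mountain"]
hits = [name for name in categories if matches("montana", name)]
```

Every candidate is folded for every search here, which is the cost mentioned above when millions of text fragments are involved.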

In addition to diacritics used in Latin languages there are similar characters in Greek, Hebrew, Arabic etc.


Wikipedia lists a few dozen potential diacritics and related characters: https://en.wikipedia.org/wiki/Diacritic

A quick StackOverflow (programmer board) search showed that there are about 50 or so characters which need to be mapped to their base form.
I think this is a problem that could be solved, at least for the most frequent cases.
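A mapping table for the most frequent cases could look roughly like the Python sketch below. The table is an illustrative subset chosen for this example, not the list from StackOverflow, and the name `to_base_form` is hypothetical. It also shows the language problem: German users might expect "ü" to become "ue" instead of "u".

```python
import unicodedata

# Illustrative subset only; a real table would need more entries
# and possibly different mappings per language.
BASE_FORMS = str.maketrans({
    "à": "a", "á": "a", "â": "a", "ã": "a", "ä": "a", "å": "a",
    "ç": "c",
    "è": "e", "é": "e", "ê": "e", "ë": "e",
    "ì": "i", "í": "i", "î": "i", "ï": "i",
    "ñ": "n",
    "ò": "o", "ó": "o", "ô": "o", "õ": "o", "ö": "o",
    "ù": "u", "ú": "u", "û": "u", "ü": "u",
    "ý": "y", "ÿ": "y",
})

def to_base_form(text: str) -> str:
    """Lower-case the text and map known characters to their base form."""
    # NFC makes sure accented letters arrive as single characters,
    # because the table only knows the precomposed forms.
    composed = unicodedata.normalize("NFC", text)
    return composed.lower().translate(BASE_FORMS)
```

Characters missing from the table pass through unchanged, so this only covers the cases somebody has thought of in advance.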

But for which IMatch features would you need this most?
There are about two dozen or so "search" features in IMatch, including the File Window search and Filter panel...

And of course we would need a way to turn this on/off, depending on what the user wants for that particular feature in that particular situation.
Which requires changes to many UI features as well, plus documentation updates for all these features.

mastodon

+1 (Hungarian), although I can imagine that this is quite hard.

ubacher

I could imagine a button beside the field where the user inputs a search string, which converts (the most common?) non-Latin characters in the search string.

Would that not be a relatively simple fix?

Mario

But then the characters would still be in the category names, folder names etc.
The button would need to trigger a totally new search routine, which processes the search string and each category name before searching.

akirot

Yes, it's a challenge - and would be a real benefit.

A possible implementation could be:

- Add additional fields (not user visible) to the database which contain the respective search value. E.g. for category "Montaña" this field contains "montana" (or better "mntn", there are various approaches or algorithms). They must be (re)calculated when the user adds or changes a field (e.g. a category).

- As soon as the user searches for a specific value the entered value is transformed by the same algorithm used to provide the search value introduced above.
This transformed value then is internally used for the search on the additional fields.

- I can imagine a toggle at the UI beside the field where the search value is entered to switch between the current search and this abstract speed search.
(If you want to do some marketing this could be could "AI powered search" - just kidding.)

This approach does not need additional 3rd party toolkit specifics, it's controlled just by code you provide.
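The steps above can be sketched in Python roughly like this. The sketch uses SQLite as a stand-in database and a hypothetical `category` table with a hidden `search_key` column. It only strips diacritics and case and does not implement the "mntn" variant. None of the names are taken from IMatch.

```python
import sqlite3
import unicodedata

def search_key(text: str) -> str:
    """Compute the hidden search value, e.g. 'Montaña' -> 'montana'."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE category (name TEXT, search_key TEXT)")

def add_category(name: str) -> None:
    """Store the visible name together with its computed search value."""
    conn.execute("INSERT INTO category VALUES (?, ?)", (name, search_key(name)))

def rename_category(old: str, new: str) -> None:
    """Recalculate the search value whenever the name changes."""
    conn.execute(
        "UPDATE category SET name = ?, search_key = ? WHERE name = ?",
        (new, search_key(new), old),
    )

def find(term: str) -> list[str]:
    """Transform the entered value the same way and search the hidden column."""
    rows = conn.execute(
        "SELECT name FROM category WHERE instr(search_key, ?) > 0 ORDER BY name",
        (search_key(term),),
    )
    return [row[0] for row in rows]

for name in ["Montaña", "Îles", "Berge"]:
    add_category(name)
```

The search itself stays a plain substring lookup; the price is that every code path which changes a name has to keep the hidden column up to date.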

Mario

Thank you for your comments and suggestions.

Quote: This approach does not need additional 3rd party toolkit specifics, it's controlled just by code you provide.

IMatch uses features to search the trees which are implemented in the 3rd party tree control.

Quote: Add additional fields (not user visible) to the database which contain the respective search value. E.g. for category "Montaña" this field contains "montana" (or better "mntn",

Adding "extra" data for each folder / category and metadata tag that could possibly contain diacritics to aid a later search would be a maintenance nightmare.
This can only work with on-the-fly mapping during the actual search.

Mario

@All


Please provide some sample words you use in your database for folders, categories etc. so I have some test cases.
In any case, implementing this will be a massive change and my roadmap is full. So don't get your hopes up too high.

akirot

@Mario:

instead of providing single example words: all Spanish and French diacritics (for my mainly German/English/Latin categories)

akirot

Slightly off topic:
Am I correct, we are not allowed to edit posts (what I could understand with all the spammers around)?
I just tried to correct a typo in my second post in this thread, a "could" should read "called" (apparently an autocorrection).

Mario

Quote from: akirot on May 24, 2019, 06:53:21 AM
Slightly off topic:
Am I correct, we are not allowed to edit posts (what I could understand with all the spammers around)?
I just tried to correct a typo in my second post in this thread, a "could" should read "called" (apparently an autocorrection).

You can edit your posts only for a certain time - this is an Anti-SPAM measure.

I think that users here don't care much about typos, spelling or wording. As long as we understand what the other user means, we're good.
Just look at all the tpoes in my psts.

Mario

As anticipated/feared this is an RBCW (real big can of worms) to open...
I mean, just look at the corresponding section of the Unicode Standard.

It's not sufficient to do simple replacements (like replacing the "ç" in "Les garçons" with "c").
Similar issues need to be solved when transforming text to upper-case or lower-case. In some languages this actually replaces characters.
Some languages require actual folding of strings, combining multiple characters into one. Which in turn may cause issues with algorithms used by IMatch which do a find on a string and then use the resulting position to perform a task. The position may be different for folded and non-folded strings and that has to be respected everywhere.
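The position problem can be shown with a short Python sketch (not IMatch code). It assumes the text is stored with precomposed characters and keeps, for every folded character, the index of the original character it came from. The names `fold_with_offsets` and `find_original_span` are invented for this example.

```python
import unicodedata

def fold_with_offsets(text: str) -> tuple[str, list[int]]:
    """Fold the text and remember where each folded character came from."""
    folded: list[str] = []
    offsets: list[int] = []  # offsets[i] = index in text that produced folded[i]
    for index, ch in enumerate(text):
        # casefold() may expand one character into several ("ß" -> "ss").
        for out in unicodedata.normalize("NFD", ch.casefold()):
            if not unicodedata.combining(out):
                folded.append(out)
                offsets.append(index)
    return "".join(folded), offsets

def find_original_span(term: str, text: str):
    """Search in the folded text, report the span in the original text."""
    folded_text, offsets = fold_with_offsets(text)
    folded_term, _ = fold_with_offsets(term)
    pos = folded_text.find(folded_term)
    if not folded_term or pos < 0:
        return None
    start = offsets[pos]
    end = offsets[pos + len(folded_term) - 1] + 1
    return start, end

text = "Straße nach Montaña"
folded_text, _ = fold_with_offsets(text)
folded_pos = folded_text.find("montana")      # position in the folded string
span = find_original_span("montana", text)    # position in the original string
```

In this example the folded string is one character longer than the original because "ß" becomes "ss", so the position found in the folded string cannot be used directly on the original.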

And the database system handles Unicode, but does not deal with folding or diacritics. So if we want 'non diacritic-sensitive' searching, this also needs to be solved.
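One possible way around this, sketched in Python with SQLite as a stand-in (the thread does not say which database engine IMatch uses), is to register a folding function with the database so the mapping happens on the fly during the query. The table and function names are assumptions made for this example.

```python
import sqlite3
import unicodedata

def fold(text):
    """Case- and diacritic-insensitive form of a value (None stays None)."""
    if text is None:
        return None
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()

conn = sqlite3.connect(":memory:")
# Make the Python function callable from SQL as fold(x).
conn.create_function("fold", 1, fold)
conn.execute("CREATE TABLE category (name TEXT)")
conn.executemany(
    "INSERT INTO category VALUES (?)",
    [("Les garçons",), ("Garcons",), ("Montaña",)],
)

def find_loose(term: str) -> list[str]:
    """Fold both the stored name and the search term inside the query."""
    rows = conn.execute(
        "SELECT name FROM category "
        "WHERE instr(fold(name), fold(?)) > 0 ORDER BY rowid",
        (term,),
    )
    return [row[0] for row in rows]
```

The function is called for every row that is examined, so this trades the speed of a native database search for flexibility.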

A truly complex issue.

When I had a bit of free time over the past weeks ;D I've read about all this and what kind of support Windows offers for it. And, frankly, even Windows has a hard time handling all this. But I've figured out some ways to do it and implemented some new classes I can use in my code where this may be useful. Not sure yet in which features I will be able to support this.

For some features both a "precise" and a "loose" match may be needed. And some UI features to enable/disable it.
For other features, using a "loose" match may cause performance problems. For example, the File Window search bar or all filters based on metadata. So we need to make this optional here as well.
This is a time-consuming process and there are lots of things to do for IMatch 2020...

Anyway, I did a quick test today, using the Category View (search/filter) as a testbed. Results look promising. If I activate the folding option and run a filter on the word garcons, IMatch finds categories with and without the ç:



or, for some Spanish words:




I think this is what the OP akirot had in mind?

akirot

Hi Mario,
yes, exactly this is what I have in mind.
Your approach looks very very promising!

Mario

IMatch 2020 ignores diacritics in the Category View / Panel and Media & Folder View search features.

A global option (Edit > Preferences > Application: Search Engine) controls whether the File Window search bar and the Metadata Search in the Filter Panel ignore diacritics. This new option is enabled by default.

A new option to ignore (fold) diacritics has been added to data-driven categories. This allows you to fold words like "hôtel" and "hotel" into one category.
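The effect of folding on data-driven categories can be pictured with this small Python sketch. It is only an illustration of the idea and not how IMatch implements the option. The helper names are hypothetical, and it assumes that only diacritics are folded, not upper and lower case.

```python
from collections import defaultdict
import unicodedata

def strip_diacritics(text: str) -> str:
    """Decompose characters (NFD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def group_values(values: list[str], fold_diacritics: bool) -> dict[str, list[str]]:
    """Group values into categories, optionally ignoring diacritics."""
    groups: dict[str, list[str]] = defaultdict(list)
    for value in values:
        key = strip_diacritics(value) if fold_diacritics else value
        groups[key].append(value)
    return dict(groups)

values = ["hôtel", "hotel", "café"]
separate = group_values(values, fold_diacritics=False)
folded = group_values(values, fold_diacritics=True)
```

With the option off, "hôtel" and "hotel" end up in two categories. With it on, both land in the same one.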