RegEx Question

Started by Damit, March 13, 2023, 07:39:39 PM


Damit

_[+|,|!|;|\-|\(|\)|\{|\}|\[|\]|[0-9|a-z ]*

I have been trying hard to get a grasp on regular expressions. I have been using the above regex and it works, but in trying to understand it, I don't understand how it is valid. Should there not be a ] closing the set that begins with +|,|!

I understand an underscore must start the phrase for there to be a match, but doesn't the open bracket following it signify that I am going to provide a set of values, any of which will cause a match, and the set must be closed with a closed bracket?
 
I see that I have provided that + or , or ! or ; or - or ( or ) or { or } or [ or ] or 0-9 or a-z or space will cause a match but I do not see where the open bracket before the + is closed. 

I would think it should look like this:
_[+|,|!|;|\-|\(|\)|\{|\}|\[|\]|[0-9|a-z ]]*

What is especially confusing is that, using the regexp tool, _[+|,|!|;|\-|\(|\)|\{|\}|\[|\]] also works, but now requires an open bracket in the file name.
What am I missing?



Mario

I can only recommend the many sources and tutorials about (PERL) regexp available on the internet.
My use of regexp is only basic, my file naming convention is stringent, and I have never needed a regular expression as complex as the one you have created. I doubt many IMatch users have.
-- Mario
IMatch Developer
Forum Administrator
http://www.photools.com  -  Contact & Support - Follow me on 𝕏 - Like photools.com on Facebook

sinus

Quote from: Damit on March 13, 2023, 07:39:39 PM
_[+|,|!|;|\-|\(|\)|\{|\}|\[|\]|[0-9|a-z ]*

I have been trying hard to get a grasp on regular expressions. I have been using the above regex and it works, but in trying to understand it, I don't understand how it is valid. ...

From my point of view:
If it works, all is fine.  :)

Of course, it is more satisfying to know why something is correct.
But in my experience, it is sometimes not worth digging deeper into something, unless it is fun and you have a lot of time.

I do not know regex, but who knows, maybe with such complex creations there is also an error in the regex code.
How often must Microsoft or other companies release new updates because of errors?

And worse, sometimes an error goes undetected for a long time, affecting maybe only very few people, and they do not know why it does not work and think they are doing something wrong.
But in the end, it is the underlying source that has the error.

Personally, I work quite a lot with variables in IMatch. To be honest, sometimes they are very, very complex and I have trouble understanding them again if I look at them a month later. But they work, and if they work, phew, then I save them and take my fingers off the keyboard and I am happy for a moment.  ;D

If your regex works, I would use it ... it is sometimes hard to understand everything, which reminds me that I often do not understand women. 8)
Best wishes from Switzerland! :-)
Markus

Tveloso

Quote from: Damit on March 13, 2023, 07:39:39 PM
_[+|,|!|;|\-|\(|\)|\{|\}|\[|\]|[0-9|a-z ]*

I have been trying hard to get a grasp on regular expressions. I have been using the above regex and it works, but in trying to understand it, I don't understand how it is valid. Should there not be a ] closing the set that begins with +|,|!
It appears that you have confused one aspect of the Group (groups of characters in parentheses, such as (abc|xyz)), with the character list ("individual characters" in square brackets, such as [abcxyz]), with regard to the use of the pipe (|) as an or operator.

You don't need (and probably shouldn't use) the pipe in the list...the or is implied there.

So this regular expression:

    (abc|xyz)

...matches only the first part of this search string:

    "abcxyz"

While this one:

    (abc|xyz)+

...or this one:

    (abc|xyz)*

...will match the entire string:

    "abcxyz"

But with the list expression:

    [abcxyz]

...only the first character of the search string is matched:

    "abcxyz"

...unless you add a quantifier (zero or more, or one or more):

    [abcxyz]+

...and then the entire string is matched:

    "abcxyz"

No pipes are needed there...each character in the search string is "checked against" every character in the list, with an implicit or between them.
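
The difference between the group and the list can be checked quickly outside IMatch. This is a minimal sketch that assumes Python's re module as a stand-in (IMatch uses a Perl-style engine, and the two should agree on these basics); the variable names are only for illustration.

```python
import re

text = "abcxyz"

# Group with alternation: each alternative is a whole sequence of characters.
group_once = re.search(r"(abc|xyz)", text).group(0)
group_many = re.search(r"(abc|xyz)+", text).group(0)

# Character list: every listed character is an alternative on its own.
list_once = re.search(r"[abcxyz]", text).group(0)
list_many = re.search(r"[abcxyz]+", text).group(0)

print(group_once, group_many, list_once, list_many)

# Order matters for the group, but not for the list.
reversed_text = "zyxcba"
group_reversed = re.search(r"(abc|xyz)+", reversed_text)
list_reversed = re.search(r"[abcxyz]+", reversed_text).group(0)
print(group_reversed, list_reversed)

# Inside a list, the pipe is not an "or": it is just one more literal character.
pipe_is_literal = re.fullmatch(r"[a|b]", "|") is not None
print(pipe_is_literal)
```

So a pipe placed inside square brackets does no harm to the match, but it quietly adds | to the set of accepted characters.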

And unlike the group structure, which will not match this string:

    "zyxcba"

...the list structure will still match it:

    "zyxcba"

Your expression:

    _[+|,|!|;|\-|\(|\)|\{|\}|\[|\]|[0-9|a-z ]*

...does appear to contain unbalanced square brackets, but I believe that many of the characters that must normally be escaped in other parts of an expression, need not be, inside the square brackets...so the apparent opening bracket to the two ranges, along with the escaped one, are probably both being treated as literals:

    _[+|,|!|;|\-|\(|\)|\{|\}|\[|\]|[0-9|a-z ]*

...and are redundant (as are all the pipes therein), so the brackets are probably balanced for that reason.
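
One way to test that reading is to compile the expression and compare it with a version that drops the pipes and the extra bracket. This sketch assumes Python's re module as a stand-in for the Perl-style engine; the name cleaned and the sample strings are made up for illustration.

```python
import re

# The expression from the first post, unchanged. It compiles, which shows the
# brackets are balanced: the second "[" is just a literal inside the list.
original = re.compile(r"_[+|,|!|;|\-|\(|\)|\{|\}|\[|\]|[0-9|a-z ]*")

# The same list without the pipes and without the stray inner bracket.
cleaned = re.compile(r"_[0-9a-z ,!;+\-(){}\[\]]*")

samples = ["_abc 123", "_[x](y){z}", "_a+b,c!d;e-f", "_a|b", "_a#b"]

# fullmatch is used so the whole sample must consist of allowed characters.
results = {
    s: (original.fullmatch(s) is not None, cleaned.fullmatch(s) is not None)
    for s in samples
}
for sample, (in_original, in_cleaned) in results.items():
    print(f"{sample!r}: original={in_original} cleaned={in_cleaned}")
```

The only difference between the two is the sample containing a pipe: the original accepts a literal | because the pipes are members of the list, while the cleaned version rejects it.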
--Tony

Damit

Quote from: Tveloso on March 14, 2023, 04:02:35 AM
Your expression:

    _[+|,|!|;|\-|\(|\)|\{|\}|\[|\]|[0-9|a-z ]*

...does appear to contain unbalanced square brackets, but I believe that many of the characters that must normally be escaped in other parts of an expression, need not be, inside the square brackets...so the apparent opening bracket to the two ranges, along with the escaped one, are probably both being treated as literals:

    _[+|,|!|;|\-|\(|\)|\{|\}|\[|\]|[0-9|a-z ]*

...and are redundant (as are all the pipes therein), so the brackets are probably balanced for that reason.
Thanks Tony! 
I appreciate you taking the time to try to help me understand this. You are right, I am confused!
I think I get it. Basically you are saying that, for some reason, regex is treating the escaped special characters as literals. I am curious as to why, as they should not, but I guess that is the rabbit hole Sinus is warning me about. It seems you are saying that I could have written it like:

_[+,!;\-\(\)\{\}\{\}[0-9|a-z ]]

I wonder if _[* ] would work just as well. Basically stating that there must be an underscore followed by any amount of any other character or spaces.

Regexp is a hard thing for me to grasp and when I begin to understand it, I usually stop using it for a month or two and end up losing any grasp I had.  It seems it is something you have to do repeatedly for a long period of time to have it fully sink in.

Quote from: sinus on March 13, 2023, 09:45:15 PM
But they work, and if they work, phew, then I save them and take my fingers off the keyboard and I am happy for a moment.  ;D

If your regex works, I would use it ... it is sometimes hard to understand everything, which reminds me that I often do not understand women. 8)
Wise words, my friend.  I have to resist the rabbit hole!
 

Tveloso

Quote from: Damit on March 14, 2023, 01:22:35 PM
Regexp is a hard thing for me to grasp and when I begin to understand it, I usually stop using it for a month or two and end up losing any grasp I had.
Same here.  When I have occasion to work with a new RegEx, and it's not working for me, when I finally realize why, it's often an "oh right, of course!...I used to know that".

I fear that by carrying on with this conversation, I'll be leading you down that rabbit hole Markus is warning about.  Please forgive me, but...

Quote from: Damit on March 14, 2023, 01:22:35 PM
It seems you are saying that I could have written it like:

_[+,!;\-\(\)\{\}\{\}[0-9|a-z ]]
Yes, and even though some (maybe all?) of the characters you have escaped there, need not be escaped, I would still do it...including the square brackets that, based upon your original expression seemed to function equally well as literals, whether they were escaped or not.  So I would write it something like this:

    _[0-9a-z ,!;\+\-\(\)\{\}\[\]]

The two ranges (0-9a-z) listed first...just to get the "standard characters" out of the way...(notice there is no Pipe between them).

Then a few literal special characters, followed by some escaped special characters (to ensure they're treated as literals as well).  Here, I "moved" the plus sign out of the unescaped group, over to the escaped group. 

Again, I'm not sure exactly what needs to be escaped inside square brackets (most of what we're escaping there need not be). As far as I know, though, a hyphen between two characters is read as a range, and an unescaped ] closes the list, so a version with minimal escaping would probably have to keep the square brackets escaped and move the hyphen to the end:

    _[0-9a-z ,!;+(){}\[\]-]

But escaping those special characters makes it clear that we intend for them to be literals, and not something to be interpreted by the RegEx engine.
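
Which characters really need escaping inside square brackets can be tested rather than guessed. The sketch below assumes Python's re module as a stand-in for a Perl-style engine (details may differ slightly between engines). It includes a list with nothing escaped, to show what goes wrong with it.

```python
import re

# A list with nothing escaped: "+-(" reads as a range from '+' down to '(',
# which is out of order, so the pattern does not even compile.
unescaped = r"_[0-9a-z ,!;+-(){}[]]"
try:
    re.compile(unescaped)
    unescaped_error = None
except re.error as exc:
    unescaped_error = str(exc)
print("unescaped:", unescaped_error)

# Everything escaped, the cautious way.
fully_escaped = re.compile(r"_[0-9a-z ,!;\+\-\(\)\{\}\[\]]")

# Minimal escaping: only the square brackets are escaped, and the hyphen is
# placed last so it cannot be read as a range.
minimal = re.compile(r"_[0-9a-z ,!;+(){}\[\]-]")

# Both working variants should accept and reject exactly the same characters.
probe = "a5 ,!;+-(){}[]#"
same = all(
    (fully_escaped.fullmatch("_" + ch) is None) == (minimal.fullmatch("_" + ch) is None)
    for ch in probe
)
hash_accepted = minimal.fullmatch("_#") is not None
print("same behaviour:", same, "| '#' accepted:", hash_accepted)
```

In this test the hyphen and the square brackets are the characters that matter. The plus sign, parentheses and braces behave the same with or without a backslash.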

Quote from: Damit on March 14, 2023, 01:22:35 PM
I wonder if _[* ] would work just as well. Basically stating that there must be an underscore followed by any amount of any other character or spaces.
That would actually be written as:

    _.+

That says an underscore followed by at least one (or any number of) any other character (including spaces).
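
To see why the bracket version does not do that, here is a small sketch (assuming Python's re module as a stand-in for the Perl-style engine, with made-up sample names). Inside square brackets the asterisk is not a wildcard, just a literal character.

```python
import re

star_list = re.compile(r"_[* ]")  # underscore, then ONE literal '*' or one space
any_chars = re.compile(r"_.+")    # underscore, then one or more of any character

# The bracket version only matches when '*' or a space follows the underscore.
star_on_name = star_list.search("_DSC1234")
star_on_space = star_list.search("_ copy").group(0)
star_on_star = star_list.search("_*").group(0)

# The dot version takes everything after the underscore.
any_on_name = any_chars.search("_DSC1234").group(0)

print(star_on_name, repr(star_on_space), repr(star_on_star), any_on_name)
```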

But that expression will match a lot more than the first one, which lists explicit characters to be matched.  So if these expressions are meant to match FileNames, that short "catchall one":

    _.+\.(jpe*g|nef)

...would match file names like these:

    _DSC_1234#1.NEF
    _DSC_1234#1.JPEG
    _DSC_1234#1.JPG


...but the first expression, with the explicit character list:

    _[0-9a-z ,!;\+\-\(\)\{\}\[\]]+\.(jpe*g|nef)

...will not match those fileNames (because the pound sign is not among the allowed characters listed in the expression).
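
The two expressions can be compared on those names with a short sketch. It assumes Python's re module and case-insensitive matching (re.IGNORECASE), since the expressions are lower-case and the file names are not; whether IMatch applies the same setting is an assumption. The name _dsc1234 (2).jpg is made up to show a case the explicit list accepts.

```python
import re

catch_all = re.compile(r"_.+\.(jpe*g|nef)", re.IGNORECASE)
explicit = re.compile(r"_[0-9a-z ,!;\+\-\(\)\{\}\[\]]+\.(jpe*g|nef)", re.IGNORECASE)

names = [
    "_DSC_1234#1.NEF",
    "_DSC_1234#1.JPEG",
    "_DSC_1234#1.JPG",
    "_dsc1234 (2).jpg",  # hypothetical name without '#'
]

# For each name, record whether each expression finds a match.
matches = {
    name: (catch_all.search(name) is not None, explicit.search(name) is not None)
    for name in names
}
for name, (by_catch_all, by_explicit) in matches.items():
    print(f"{name}: catch-all={by_catch_all} explicit={by_explicit}")

# jpe*g allows any number of e's; jpe?g would allow only jpg and jpeg.
loose_extension = re.fullmatch(r"jpe*g", "jpeeg") is not None
strict_extension = re.fullmatch(r"jpe?g", "jpeeg") is not None
print(loose_extension, strict_extension)
```

If only .jpg and .jpeg are wanted, jpe?g is the tighter choice for the extension.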
--Tony

Damit

I am glad I am not alone in losing my grasp of this stuff.  Thank you so much, Tony, for your explanation. It cleared a lot of things up for me.

Now I really have to start thinking about a naming system because it is all willy nilly at this point and my spider sense (Mario has crept into my subconscious) is telling me that I really need to be more disciplined with it. I am worried if I do not, bad things may happen.

It would help to figure out which versions have metadata interchanged between them.

Another rabbit hole, Markus! :o  I believe I read about your naming system somewhere. I am going to look that up!


Mario

A consistent and simple naming schema goes a long way.

Advanced RegExp trickery like what you now have to rely on is not needed that often. Usually the very flexible default rules IMatch uses for versions work very well and users don't need to touch regexp.
RegExp is only needed to solve tricky edge cases.

Of course different users want different things... :)

My naming scheme uses a prefix for client/project and a globally unique sequential number (private photos use the PRV client code). I manage several hundred thousand of my own files and each has a unique file name. Versions use a postfix to indicate their meaning and purpose. This makes them super-easy to find with a simple regexp. And also very fast.
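
As an illustration of why such a scheme keeps the regexp simple, here is a sketch using Python's re module. The exact layout (a three-letter client code, a hyphen, a number, an optional _postfix) and the file names are hypothetical; the real format is not spelled out here.

```python
import re

# Hypothetical layout: <CLIENT>-<sequence>[_<version>].<extension>
NAME = re.compile(
    r"^(?P<client>[A-Z]{3})-(?P<seq>\d+)(?:_(?P<version>[a-z0-9]+))?\.(?P<ext>[a-z0-9]+)$",
    re.IGNORECASE,
)

def parse(name):
    """Return the parts of a file name, or None if it does not fit the scheme."""
    match = NAME.match(name)
    return match.groupdict() if match else None

def is_version_of(candidate, master):
    """True if candidate has a version postfix and shares client and number with master."""
    c, m = parse(candidate), parse(master)
    if not c or not m:
        return False
    return (
        c["version"] is not None
        and m["version"] is None
        and (c["client"], c["seq"]) == (m["client"], m["seq"])
    )

print(parse("PRV-0012345.nef"))
print(is_version_of("PRV-0012345_web.jpg", "PRV-0012345.nef"))
print(is_version_of("PRV-0012346_web.jpg", "PRV-0012345.nef"))
```

Because every name has the same shape, one short expression is enough to link a version to its master.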

Other users want to include date and time information in their file names.
Which was a thing 10 or 20 years ago, when no DAMs were around and Windows Explorer could not display and sort by date and time.

I have also seen users creating really complex file names with camera brand and model, date, even some keywords or a description of sorts.

Or even more complex naming schemes, encoding workflow info, software, versions and whatnot in the file name.

Many users have used different naming schemes over the years.
Or no scheme at all  ;D , just named files as they felt in the moment.

IMatch does not really care and supports whatever users come up with.
But sometimes, dealing with all this requires constructing complex regular expressions. And this, unless you are an expert and do it every day, will require some trial and error.

This is why the version dialog has a Test button. Why the RegExp Tester App exists. And why the IMatch help links to prominent online resources for formulating and testing regular expressions.

In my opinion, it is usually better to sit down, give it a THINK, and come up with a simple file naming schema that works for you. And then use the Renamer in IMatch to rename files to match your schema. And then use it in the future for all new files.

I'm also aware that this cannot work for all users. Or would cause too much of a hassle or would cost too much time. It is how it is.

Refactoring is a big thing in software development. Basically it means reworking existing code when you have learned how to make it better or when things change. I refactor code in IMatch all the time, adding tiny improvements or just making the code easier to maintain. This keeps the code base nice and shiny. Good for me, good for users.

I apply the same principles to my image collection.
A better naming schema? Great. Do it.
A more consistent keywording schema and a better controlled vocabulary? Excellent. Do it.

I don't waste time changing things just for the reason of changing things, though.
-- Mario