Using Globs

From unkrig.de
Revision as of 21:18, 1 May 2024 by Admin

This article documents the "Globs" feature implemented in commons.unkrig.de.

Introduction

Surely you have used globs before (maybe without even knowing that they are called "globs"):

*.txt
dir/*.doc
execut?r.txt

Globs are a widespread concept, used throughout the UNIX, MICROSOFT and other "worlds".

All implementations support at least the following elements:

Construct Matches
? Any character except the file separator ("/" and/or "\")
* Zero or more characters except the file separator
x The character x

Some implementations add more features:

Construct Matches
[abc] Exactly one of the characters "a", "b" and "c"
[^abc] or [!abc] Any character except "a", "b" or "c"
[A-Za-z] Any (latin) letter
** Zero or more characters (including the file separator)
\X "X", even if "X" is a meta character

Since version 1.7 JAVA provides its own implementation of "glob matching", which adds another (quite uncommon) construct:

Construct Matches
{alpha,beta,gamma} Any of "alpha", "beta" or "gamma"
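
As a small illustration of the JDK's built-in glob matching, the following sketch uses java.nio.file.FileSystem.getPathMatcher() with the "glob:" syntax prefix. The class and method names (NioGlobDemo, globMatches) are made up for this example; only the java.nio.file calls are part of the JDK.

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;

class NioGlobDemo {

    /** Matches a relative path against a glob, using the JDK's own glob implementation. */
    static boolean globMatches(String glob, String path) {
        PathMatcher matcher = FileSystems.getDefault().getPathMatcher("glob:" + glob);
        return matcher.matches(Paths.get(path));
    }

    public static void main(String[] args) {
        // "{txt,doc}" accepts either alternative.
        System.out.println(globMatches("*.{txt,doc}", "notes.txt"));  // true
        System.out.println(globMatches("*.{txt,doc}", "notes.pdf"));  // false

        // "*" does not cross the file separator, "**" does.
        System.out.println(globMatches("*.txt", "dir/notes.txt"));    // false
        System.out.println(globMatches("**.txt", "dir/notes.txt"));   // true
    }
}
```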

All in all, globs are very practical, though not very powerful. This is why de.unkrig.commons provides yet another implementation, as follows:

Regular expressions features

First, de.unkrig.commons.text.pattern.Pattern2 combines the simplicity of globs with the full power of JAVA regular expressions. It does so by modifying a few characters in the glob before feeding it to java.util.regex.Pattern.compile(), effectively turning that method into a glob compiler:

Glob construct Regex construct Matches
? [^/\\!] Any character except the file separator and "!"
* [^/\\!]* Zero or more characters except the file separator and "!"
** [^!]* Zero or more characters except "!"
*** .* Zero or more characters
. \. The dot is a literal (not a character class as in a regular expression)
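
To make the rewriting idea concrete, here is a minimal sketch that translates the wildcard constructs into java.util.regex syntax, following the "Matches" column of the table above. It is not the Pattern2 implementation: the class WildcardSketch and its methods are hypothetical, and the sketch deliberately ignores escapes and character classes (a "?" or "." inside "[...]" would be rewritten too). All other characters are passed through verbatim, so regex constructs such as "(...)" and "{0,1}" keep working.

```java
import java.util.regex.Pattern;

class WildcardSketch {

    /** Regex for "one character that is neither a file separator nor '!'": [^/\\!] */
    private static final String NO_SEPARATOR = "[^/\\\\!]";

    /** Rewrites "?", "*", "**", "***" and "."; copies every other character unchanged. */
    static String toRegex(String glob) {
        StringBuilder regex = new StringBuilder();
        int i = 0;
        while (i < glob.length()) {
            char c = glob.charAt(i);
            if (c == '*') {
                // Count up to three consecutive stars.
                int stars = 1;
                while (stars < 3 && i + stars < glob.length() && glob.charAt(i + stars) == '*') {
                    stars++;
                }
                if (stars == 1) {
                    regex.append(NO_SEPARATOR).append('*');
                } else if (stars == 2) {
                    regex.append("[^!]*");
                } else {
                    regex.append(".*");
                }
                i += stars;
            } else {
                if (c == '?') {
                    regex.append(NO_SEPARATOR);
                } else if (c == '.') {
                    regex.append("\\.");   // the dot becomes a literal
                } else {
                    regex.append(c);       // everything else is left to the regex compiler
                }
                i++;
            }
        }
        return regex.toString();
    }

    /** Translates the glob and matches the whole subject against it. */
    static boolean matches(String glob, String subject) {
        return Pattern.matches(toRegex(glob), subject);
    }

    public static void main(String[] args) {
        System.out.println(toRegex("*.txt"));
        System.out.println(matches("*.txt", "dir/notes.txt"));   // "*" stops at "/"
        System.out.println(matches("**.txt", "dir/notes.txt"));  // "**" does not
    }
}
```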

If you have not yet worked with regular expressions: JAVA regular expressions are very powerful and introduce many constructs and meta characters. The complete reference documentation is the Javadoc of java.util.regex.Pattern.

Since "?" and "*" are wildcards here and not quantifiers as in regular expressions ("?" == zero-or-one, "*" == zero-or-more), one has to use "{0,1}" and "{0,}" instead. The alternative notation for "." ("any character") is "[^]".
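
The two replacement quantifiers are ordinary regular expression syntax, so their behaviour can be checked with plain java.util.regex, without any wildcard rewriting. The class QuantifierDemo below is only an illustration and is not part of de.unkrig.commons.

```java
import java.util.regex.Pattern;

class QuantifierDemo {

    /** Full-string match with plain java.util.regex syntax. */
    static boolean matches(String regex, String subject) {
        return Pattern.matches(regex, subject);
    }

    public static void main(String[] args) {
        // "{0,1}" is the long form of the regex quantifier "?".
        System.out.println(matches("report\\.docx{0,1}", "report.doc"));   // true
        System.out.println(matches("report\\.docx{0,1}", "report.docx"));  // true

        // "{0,}" is the long form of the regex quantifier "*".
        System.out.println(matches("ab{0,}", "a"));     // true
        System.out.println(matches("ab{0,}", "abbb"));  // true
    }
}
```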

Find the reference documentation here.

The API of de.unkrig.commons.text.pattern.Pattern2 is fully compatible with that of java.util.regex.Pattern, plus it adds a new compilation flag, Pattern2.WILDCARD, that switches the expression syntax from regex to globs.

Includes-Excludes

Second, it adds includes-excludes. This involves two new meta characters, "," and "~":

Construct Matches
pattern1,pattern2 Any string that matches pattern1 or pattern2
pattern1~pattern2 Any string that matches pattern1, but not pattern2
~pattern1 Any string that does not match pattern1
pattern1,pattern2~pattern3~pattern4,pattern5 Any string that matches pattern1 or pattern2, but not pattern3 nor pattern4; plus any string that matches pattern5

This may sound complicated, but the very simple rule is: The patterns are applied right-to-left, and the first match determines the result.
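
The right-to-left rule can be sketched in a few lines. The class below is a hypothetical illustration, not the Glob implementation: it uses plain java.util.regex patterns instead of wildcards, splits naively at every "," and "~" (so a pattern containing "{0,1}" would break it), and assumes that a specification starting with "~" behaves as if an include-everything pattern preceded it, which is how the "~pattern1" row of the table reads.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class IncludesExcludesSketch {

    /** One element of the specification: a pattern, and whether it includes or excludes. */
    private static final class Rule {
        final boolean include;
        final Pattern pattern;

        Rule(boolean include, String regex) {
            this.include = include;
            this.pattern = Pattern.compile(regex);
        }
    }

    private final List<Rule> rules = new ArrayList<>();

    IncludesExcludesSketch(String spec) {
        // Assumption: a leading "~" excludes from "everything".
        if (spec.startsWith("~")) {
            spec = ".*" + spec;
        }
        boolean include = true;   // the first pattern is always an include
        int start = 0;
        for (int i = 0; i <= spec.length(); i++) {
            boolean atEnd = i == spec.length();
            if (atEnd || spec.charAt(i) == ',' || spec.charAt(i) == '~') {
                rules.add(new Rule(include, spec.substring(start, i)));
                if (!atEnd) {
                    include = spec.charAt(i) == ',';   // "," starts an include, "~" an exclude
                }
                start = i + 1;
            }
        }
    }

    /** Applies the patterns right-to-left; the first one that matches decides. */
    boolean matches(String subject) {
        for (int i = rules.size() - 1; i >= 0; i--) {
            Rule rule = rules.get(i);
            if (rule.pattern.matcher(subject).matches()) {
                return rule.include;
            }
        }
        return false;   // no pattern matched at all
    }

    public static void main(String[] args) {
        IncludesExcludesSketch ie =
            new IncludesExcludesSketch(".*\\.c,.*\\.h~test_.*~.*_gen\\..*,test_main\\.c");
        System.out.println(ie.matches("foo.c"));        // included by the first pattern
        System.out.println(ie.matches("test_foo.c"));   // excluded by "test_.*"
        System.out.println(ie.matches("test_main.c"));  // re-included by the last pattern
    }
}
```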

This comes with de.unkrig.commons.text.pattern.Glob.compile() and the new INCLUDES_EXCLUDES compilation flag.

Replacements

Third, it adds replacements. This involves one new meta character, "=":

Construct Matches Replaces with
*.c=$0.bak Any string ending with ".c" The original string plus ".bak" ("$0" represents the "entire match")
(*).(*)=$1.$2$2 Any string containing a dot The original string, with the file name extension doubled

This comes with the Glob.replace() API.

Notice that "$" is a meta character only in the replacement.

Also notice that when replacements are combined with includes-excludes (see above), each include carries its own replacement; attaching a replacement to an exclude makes no sense.
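
The "pattern=replacement" idea can be approximated with plain java.util.regex, whose replacement strings use the same "$0", "$1", ... notation. The class ReplacementSketch below is hypothetical and is not how Glob.replace() is implemented: it takes regex syntax instead of wildcards and splits naively at the first "=".

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ReplacementSketch {

    /**
     * Splits the rule at the first "=", matches the whole subject against the left-hand
     * side and, on success, expands the right-hand side ("$0" is the entire match, "$1"
     * the first capturing group, and so on). Returns null if the subject does not match.
     */
    static String replace(String rule, String subject) {
        int eq = rule.indexOf('=');
        if (eq < 0) {
            throw new IllegalArgumentException("Rule has no '=': " + rule);
        }
        Pattern pattern = Pattern.compile(rule.substring(0, eq));
        String replacement = rule.substring(eq + 1);

        Matcher matcher = pattern.matcher(subject);
        if (!matcher.matches()) {
            return null;
        }
        // The match spans the whole subject, so the buffer ends up holding
        // exactly the expanded replacement.
        StringBuffer result = new StringBuffer();
        matcher.appendReplacement(result, replacement);
        return result.toString();
    }

    public static void main(String[] args) {
        System.out.println(replace(".*\\.c=$0.bak", "foo.c"));          // foo.c.bak
        System.out.println(replace("(.*)\\.(.*)=$1.$2$2", "foo.txt"));  // foo.txttxt
        System.out.println(replace(".*\\.c=$0.bak", "foo.h"));          // null
    }
}
```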

Conclusion

All these features can be combined mercilessly, e.g.:

*=$0$0~*.bak
(*).docx{0,1}=$1.txt~*blabla*

Using it is simple:

import de.unkrig.commons.text.pattern.Glob;
import de.unkrig.commons.text.pattern.Pattern2;

Glob glob = Glob.compile("(*).c=$1.C,(*).h=$1.H", Pattern2.WILDCARD | Glob.INCLUDES_EXCLUDES | Glob.REPLACEMENT);
glob.match("foo.c");     // returns true
glob.replace("foo.h");   // returns "foo.H"
glob.replace("foo.cpp"); // returns null

Remember: Pattern2.WILDCARD modifies the regex compilation to understand wildcard characters. Glob.INCLUDES_EXCLUDES activates the recognition of "," and "~". Finally, Glob.REPLACEMENT activates the recognition of "=".