Using Globs


This article documents the "Globs" feature implemented in commons.unkrig.de.

Introduction

You have surely used globs before (maybe without even knowing that they are called "globs"):

*.txt
dir/*.doc
execut?r.txt

Globs are a widespread concept, used throughout the UNIX, Microsoft and other "worlds".

All implementations support at least the following elements:

| Construct | Matches |
|---|---|
| ? | Any character except the file separator ("/" and/or "\") |
| * | Zero or more characters except the file separator |
| x | The character x |

Some implementations add more features:

| Construct | Matches |
|---|---|
| [abc] | Exactly one of the characters "a", "b" and "c" |
| [^abc] or [!abc] | Any character except "a", "b" or "c" |
| [A-Za-z] | Any (Latin) letter |
| ** | Zero or more characters (including the file separator) |
| \X | "X", even if "X" is a meta character |

Since version 1.7, Java provides its own implementation of "glob matching" (java.nio.file.FileSystem.getPathMatcher()), which adds another (quite uncommon) construct:

| Construct | Matches |
|---|---|
| {alpha,beta,gamma} | Any of "alpha", "beta" or "gamma" |
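
The constructs listed so far can be tried out with the JDK's built-in glob matcher. The following is a minimal sketch, not part of de.unkrig.commons; the class name JdkGlobDemo and the helper globMatches() are made up for illustration, and the paths are relative paths written with "/" as the separator.

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;

class JdkGlobDemo {

    /** Returns whether the given relative path matches the glob, using the JDK's "glob:" syntax. */
    static boolean globMatches(String glob, String path) {
        PathMatcher matcher = FileSystems.getDefault().getPathMatcher("glob:" + glob);
        return matcher.matches(Paths.get(path));
    }

    public static void main(String[] args) {
        System.out.println(globMatches("*.txt", "notes.txt"));           // true
        System.out.println(globMatches("*.txt", "dir/notes.txt"));       // false: "*" does not cross "/"
        System.out.println(globMatches("**.txt", "dir/notes.txt"));      // true: "**" crosses "/"
        System.out.println(globMatches("execut?r.txt", "executor.txt")); // true
        System.out.println(globMatches("*.{txt,doc}", "report.doc"));    // true: brace alternatives
        System.out.println(globMatches("[a-c]*.txt", "dog.txt"));        // false: "d" is not in [a-c]
    }
}
```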

All in all, globs are very practical, though not very powerful. This is why de.unkrig.commons provides yet another implementation, as follows:

Regular expression features

First, de.unkrig.commons.text.pattern.Pattern2 combines the simplicity of globs with the full power of Java regular expressions. It does so by rewriting a few characters in the glob before feeding it to java.util.regex.Pattern.compile(), which effectively turns the regex compiler into a glob compiler:

| Glob construct | Regex construct | Matches |
|---|---|---|
| ? | [^/\\!] | Any character except the file separator and "!" |
| * | [^/\\!]* | Zero or more characters except the file separator and "!" |
| ** | [^!]* | Zero or more characters except "!" |
| *** | .* | Zero or more characters |
| . | \. | The dot is a literal (not the "any character" construct of regular expressions) |

If you have not yet worked with regular expressions: Java regular expressions are very powerful and introduce many constructs and meta characters. The complete reference documentation is the Javadoc of java.util.regex.Pattern.

Because "?" and "*" are wildcards here, not quantifiers as in regular expressions ("?" == zero-or-one, "*" == zero-or-more), one has to use "{0,1}" and "{0,}" instead. Likewise, because "." is a literal, the alternative notation for "any character" is "[^]".
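
The rewriting described above can be sketched in a few lines on top of java.util.regex. This is a simplified illustration, not the actual Pattern2 implementation: the class WildcardToRegex and its methods are hypothetical, and the sketch handles neither character classes nor the "[^]" notation.

```java
import java.util.regex.Pattern;

class WildcardToRegex {

    /** Rewrites the wildcard constructs into regex syntax; all other characters pass through unchanged. */
    static String toRegex(String glob) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < glob.length(); i++) {
            char c = glob.charAt(i);
            if (c == '\\' && i + 1 < glob.length()) {
                // Escaped character: copy the backslash and the following character verbatim.
                sb.append(c).append(glob.charAt(++i));
            } else if (glob.startsWith("***", i)) {
                sb.append(".*");
                i += 2;
            } else if (glob.startsWith("**", i)) {
                sb.append("[^!]*");
                i += 1;
            } else if (c == '*') {
                sb.append("[^/\\\\!]*");
            } else if (c == '?') {
                sb.append("[^/\\\\!]");
            } else if (c == '.') {
                sb.append("\\.");
            } else {
                // Regex constructs such as "(", "|", ")" and "{0,1}" keep their regex meaning.
                sb.append(c);
            }
        }
        return sb.toString();
    }

    /** Returns whether the entire subject matches the wildcard pattern. */
    static boolean matches(String glob, String subject) {
        return Pattern.matches(toRegex(glob), subject);
    }

    public static void main(String[] args) {
        System.out.println(toRegex("*.txt"));                     // prints [^/\\!]*\.txt
        System.out.println(matches("*.txt", "notes.txt"));        // true
        System.out.println(matches("*.txt", "dir/notes.txt"));    // false
        System.out.println(matches("**.txt", "dir/notes.txt"));   // true
        System.out.println(matches("*.docx{0,1}", "report.doc")); // true: "x" is optional
        System.out.println(matches("(foo|bar).c", "bar.c"));      // true: regex alternation still works
    }
}
```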

Find the reference documentation here.

The API of de.unkrig.commons.text.pattern.Pattern2 is compatible with that of java.util.regex.Pattern; in addition, it provides a new compilation flag, Pattern2.WILDCARD, that switches the expression syntax from regex to globs.

Includes-Excludes

Second, it adds includes-excludes. This involves two new meta characters, "," and "~":

| Construct | Matches |
|---|---|
| pattern1,pattern2 | Any string that matches pattern1 or pattern2 |
| pattern1~pattern2 | Any string that matches pattern1, but not pattern2 |
| ~pattern1 | Any string that does not match pattern1 |
| pattern1,pattern2~pattern3~pattern4,pattern5 | Any string that matches pattern1 or pattern2, but not pattern3 nor pattern4; plus any string that matches pattern5 |

This may sound complicated, but the very simple rule is: The patterns are applied right-to-left, and the first match determines the result.
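
The right-to-left rule can be sketched as follows. This is an illustration, not the code behind Glob.compile(): the class IncludesExcludesSketch is hypothetical, the individual patterns use a tiny dialect in which only "*" is special, and the result for a subject that matches no pattern at all is an assumption derived from the "~pattern1" row of the table.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class IncludesExcludesSketch {

    /** One element of the spec: a compiled pattern plus whether it includes or excludes. */
    private static final class Rule {
        final boolean include;
        final Pattern pattern;

        Rule(boolean include, Pattern pattern) {
            this.include = include;
            this.pattern = pattern;
        }
    }

    /** Tiny glob dialect for this sketch: "*" matches any run of characters, everything else is literal. */
    static Pattern simpleGlob(String glob) {
        String[] literals = glob.split("\\*", -1);
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < literals.length; i++) {
            if (i > 0) sb.append(".*");
            sb.append(Pattern.quote(literals[i]));
        }
        return Pattern.compile(sb.toString());
    }

    static boolean matches(String spec, String subject) {
        // Split the spec at "," (next pattern is an include) and "~" (next pattern is an exclude).
        List<Rule> rules = new ArrayList<>();
        boolean include = true;
        int start = 0;
        for (int i = 0; i <= spec.length(); i++) {
            if (i == spec.length() || spec.charAt(i) == ',' || spec.charAt(i) == '~') {
                if (i > start) rules.add(new Rule(include, simpleGlob(spec.substring(start, i))));
                if (i < spec.length()) include = spec.charAt(i) == ',';
                start = i + 1;
            }
        }
        // Apply the patterns right-to-left; the first one that matches determines the result.
        for (int i = rules.size() - 1; i >= 0; i--) {
            Rule rule = rules.get(i);
            if (rule.pattern.matcher(subject).matches()) return rule.include;
        }
        // Assumption: if nothing matched, a spec starting with "~" means "everything except ...".
        return spec.startsWith("~");
    }

    public static void main(String[] args) {
        String spec = "*.c,*.h~test.*,test.h";
        System.out.println(matches(spec, "foo.c"));     // true: included by "*.c"
        System.out.println(matches(spec, "test.c"));    // false: excluded by "test.*"
        System.out.println(matches(spec, "test.h"));    // true: re-included by the rightmost pattern
        System.out.println(matches(spec, "foo.java"));  // false: no pattern matches
        System.out.println(matches("~*.bak", "a.txt")); // true: anything but "*.bak"
    }
}
```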

This comes with de.unkrig.commons.text.pattern.Glob.compile() and the new INCLUDES_EXCLUDES compilation flag.

Replacements

Third, it adds replacements. This involves one new meta character, "=":

| Construct | Matches | Replaces with |
|---|---|---|
| *.c=$0.bak | Any string ending with ".c" | The original string plus ".bak" ("$0" represents the "entire match") |
| (*).(*)=$1.$2$2 | Any string containing a dot | The original string, with the file name extension doubled |

This comes with the Glob.replace() API.

Notice that "$" is a meta character only in the replacement.

Also notice that when replacements are combined with includes-excludes (see above), each include has its own replacement. Attaching a replacement to an exclude makes no sense.
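
How a single "pattern=replacement" pair behaves can be sketched with java.util.regex alone. This is an illustration, not the implementation of Glob.replace(): the class ReplacementSketch is hypothetical, it supports only one pair without includes-excludes, and its pattern dialect is reduced to "*", "." and parentheses.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ReplacementSketch {

    /** Reduced wildcard dialect: "." becomes a literal dot, "*" a run of non-"/" characters. */
    static String toRegex(String glob) {
        return glob.replace(".", "\\.").replace("*", "[^/]*");
    }

    /** Applies "glob=replacement" to the subject; returns null if the glob does not match the whole subject. */
    static String replace(String spec, String subject) {
        int eq = spec.indexOf('=');
        if (eq < 0) throw new IllegalArgumentException("Missing '=' in: " + spec);

        Matcher matcher = Pattern.compile(toRegex(spec.substring(0, eq))).matcher(subject);
        if (!matcher.matches()) return null;

        // The match spans the entire subject, so the expanded replacement is the complete result.
        StringBuffer result = new StringBuffer();
        matcher.appendReplacement(result, spec.substring(eq + 1)); // expands $0, $1, $2, ...
        return result.toString();
    }

    public static void main(String[] args) {
        System.out.println(replace("*.c=$0.bak", "foo.c"));        // foo.c.bak
        System.out.println(replace("(*).(*)=$1.$2$2", "foo.txt")); // foo.txttxt
        System.out.println(replace("(*).h=$1.H", "foo.h"));        // foo.H
        System.out.println(replace("*.c=$0.bak", "foo.cpp"));      // null
    }
}
```

Note the difference between "$0" (the entire match, including the extension) and "$1" (only the text captured by the first pair of parentheses).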

Conclusion

All these features can be combined freely, e.g.:

*=$0$0~*.bak
(*).docx{0,1}=$1.txt~*blabla*

Using it is simple:

import de.unkrig.commons.text.pattern.Glob;
import de.unkrig.commons.text.pattern.Pattern2;

// Two includes, each with its own replacement; "$0" stands for the entire matched string.
Glob glob = Glob.compile("*.c=$0.C,*.h=$0.H", Pattern2.WILDCARD | Glob.INCLUDES_EXCLUDES | Glob.REPLACEMENT);
glob.matches("foo.c");   // returns true
glob.replace("foo.h");   // returns "foo.h.H" ("$0" is the whole of "foo.h")
glob.replace("foo.cpp"); // returns null (no include matches)

Remember: Pattern2.WILDCARD makes the regex compiler understand wildcard characters. Glob.INCLUDES_EXCLUDES activates the recognition of "," and "~". Finally, Glob.REPLACEMENT activates the recognition of "=".