Using Globs
This article documents the "Globs" feature implemented in commons.unkrig.de.
Introduction
Surely you have used globs before (maybe even without knowing that they are called "globs"):
    *.txt
    dir/*.doc
    execut?r.txt
Globs are a widespread concept, used throughout the UNIX, MICROSOFT and other "worlds".
All implementations support at least the following elements:
Construct | Matches |
---|---|
? | Any character except the file separator ("/" and/or "\") |
* | Zero or more characters except the file separator |
x | The character x |
Some implementations add more features:
Construct | Matches |
---|---|
[abc] | Exactly one of the characters "a", "b" and "c" |
[^abc] or [!abc] | Any character except "a", "b" or "c" |
[A-Za-z] | Any (latin) letter |
** | Zero or more characters (including the file separator) |
\X | "X", even if "X" is a meta character |
Since version 1.7 JAVA provides its own implementation of "glob matching", which adds another (quite uncommon) construct:
Construct | Matches |
---|---|
{alpha,beta,gamma} | Any of "alpha", "beta" or "gamma" |
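
For reference, the JDK's built-in glob matching is reached through `java.nio.file.FileSystem.getPathMatcher()` with the `glob:` syntax prefix. The following is a minimal sketch of that standard API; the class name `JdkGlobDemo` and the helper `globMatches` are made up for illustration.

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;

class JdkGlobDemo {

    /** Matches a relative path against a glob, using the JDK's own glob implementation. */
    static boolean globMatches(String glob, String path) {
        PathMatcher matcher = FileSystems.getDefault().getPathMatcher("glob:" + glob);
        return matcher.matches(Paths.get(path));
    }

    public static void main(String[] args) {
        System.out.println(globMatches("*.{txt,doc}", "notes.txt"));     // true
        System.out.println(globMatches("*.{txt,doc}", "notes.pdf"));     // false
        System.out.println(globMatches("*.{txt,doc}", "dir/notes.txt")); // false: "*" stops at the file separator
        System.out.println(globMatches("**.txt", "dir/notes.txt"));      // true: "**" crosses the file separator
    }
}
```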
All in all, globs are very practical, though not very powerful. This is why de.unkrig.commons provides yet another implementation, as follows:
Regular expression features
First, de.unkrig.commons.text.pattern.Pattern2 combines the simplicity of globs with the full power of JAVA regular expressions.
It does so by modifying a few characters in the glob before feeding it to java.util.regex.Pattern.compile(), effectively turning that method into a glob compiler:
Glob construct | Regex construct | Matches |
---|---|---|
? | [^/\\!] | Any character except the file separator and "!" |
* | [^/\\!]* | Zero or more characters except the file separator and "!" |
** | [^!]* | Zero or more characters except "!" |
*** | .* | Zero or more characters |
. | \. | The dot is a literal (not a character class as in a regular expression) |
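
To make the table concrete, here is a minimal sketch of such a translation built on plain `java.util.regex`. It is not the actual `Pattern2` implementation: the class `WildcardSketch` and its methods are hypothetical, and the sketch assumes that every character not listed in the table is passed to the regex compiler unchanged.

```java
import java.util.regex.Pattern;

class WildcardSketch {

    /** Rewrites the wildcard constructs from the table; all other characters pass through as regex syntax. */
    static String wildcardToRegex(String glob) {
        StringBuilder regex = new StringBuilder();
        for (int i = 0; i < glob.length(); i++) {
            char c = glob.charAt(i);
            if (c == '\\' && i + 1 < glob.length()) {
                // Keep escape sequences such as "\*" untouched.
                regex.append(c).append(glob.charAt(++i));
            } else if (c == '*') {
                // Count up to three consecutive stars.
                int stars = 1;
                while (stars < 3 && i + 1 < glob.length() && glob.charAt(i + 1) == '*') {
                    i++;
                    stars++;
                }
                regex.append(stars == 1 ? "[^/\\\\!]*" : stars == 2 ? "[^!]*" : ".*");
            } else if (c == '?') {
                regex.append("[^/\\\\!]");
            } else if (c == '.') {
                regex.append("\\.");
            } else {
                regex.append(c);
            }
        }
        return regex.toString();
    }

    /** Returns whether the whole subject matches the glob. */
    static boolean matches(String glob, String subject) {
        return Pattern.compile(wildcardToRegex(glob)).matcher(subject).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches("*.txt", "foo.txt"));            // true
        System.out.println(matches("*.txt", "dir/foo.txt"));        // false: "*" stops at "/"
        System.out.println(matches("**.txt", "dir/foo.txt"));       // true
        System.out.println(matches("**.txt", "lib.jar!foo.txt"));   // false: "**" stops at "!"
        System.out.println(matches("***.txt", "lib.jar!foo.txt"));  // true
        System.out.println(matches("(*).docx{0,1}", "a.doc"));      // true: "x{0,1}" is an optional "x"
    }
}
```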
If you have not yet worked with regular expressions: JAVA regular expressions are very powerful and introduce many constructs and meta characters. Find the complete reference documentation here.
Since "?" and "*" are wildcards here, not quantifiers as in regular expressions (where "?" means zero-or-one and "*" means zero-or-more), one has to use "{0,1}" and "{0,}" instead. The alternative notation for "." ("any character") is "[^]".
Find the reference documentation here.
The API of de.unkrig.commons.text.pattern.Pattern2 is fully compatible with that of java.util.regex.Pattern, plus it adds a new compilation flag, Pattern2.WILDCARD, that switches the expression syntax from regex to globs.
Includes-Excludes
Second, it adds includes-excludes. This involves two new meta characters, "," and "~":
Construct | Matches |
---|---|
pattern1,pattern2 | Any string that matches pattern1 or pattern2 |
pattern1~pattern2 | Any string that matches pattern1, but not pattern2 |
~pattern1 | Any string that does not match pattern1 |
pattern1,pattern2~pattern3~pattern4,pattern5 | Any string that matches pattern1 or pattern2, but not pattern3 nor pattern4; plus any string that matches pattern5 |
This may sound complicated, but the very simple rule is: The patterns are applied right-to-left, and the first match determines the result.
This comes with de.unkrig.commons.text.pattern.Glob.compile() and the new INCLUDES_EXCLUDES compilation flag.
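
The right-to-left rule can be illustrated with a small stand-alone sketch. It does not use the library; `IncludesExcludesSketch` and its methods are hypothetical, and the wildcard translation is reduced to "*" and "?". One behavior is an assumption inferred from the "~pattern1" row of the table: a string matched by no sub-pattern is accepted only if the expression starts with "~".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class IncludesExcludesSketch {

    /** Simplified wildcard translation: only "*" and "?" are special, everything else is literal. */
    static Pattern toPattern(String glob) {
        StringBuilder regex = new StringBuilder();
        for (char c : glob.toCharArray()) {
            if (c == '*') {
                regex.append("[^/\\\\!]*");
            } else if (c == '?') {
                regex.append("[^/\\\\!]");
            } else {
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(regex.toString());
    }

    static boolean matches(String expression, String subject) {
        // Split at "," and "~", remembering for each sub-pattern whether it includes or excludes.
        List<String> globs = new ArrayList<>();
        List<Boolean> includes = new ArrayList<>();
        boolean include = true;
        StringBuilder current = new StringBuilder();
        for (int i = 0; i <= expression.length(); i++) {
            // The trailing ',' is a sentinel that flushes the last sub-pattern.
            char c = i < expression.length() ? expression.charAt(i) : ',';
            if (c == ',' || c == '~') {
                if (current.length() > 0) {
                    globs.add(current.toString());
                    includes.add(include);
                    current.setLength(0);
                }
                include = (c == ',');
            } else {
                current.append(c);
            }
        }

        // Apply the sub-patterns right-to-left; the first one that matches decides.
        for (int i = globs.size() - 1; i >= 0; i--) {
            if (toPattern(globs.get(i)).matcher(subject).matches()) {
                return includes.get(i);
            }
        }

        // Assumption: with no match at all, only a leading "~" ("everything except ...") accepts.
        return expression.startsWith("~");
    }

    public static void main(String[] args) {
        System.out.println(matches("*.c,*.h~foo.*", "bar.c"));      // true
        System.out.println(matches("*.c,*.h~foo.*", "foo.h"));      // false: excluded by "foo.*"
        System.out.println(matches("~*.bak", "notes.txt"));         // true
        System.out.println(matches("*.c~foo.c,foo.c", "foo.c"));    // true: the rightmost include wins
    }
}
```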
Replacements
Third, it adds replacements. This involves one new meta character, "=":
Construct | Matches | Replaces with |
---|---|---|
*.c=$0.bak | Any string ending with ".c" | The original string plus ".bak" ("$0" represents the "entire match") |
(*).(*)=$1.$2$2 | Any string containing a dot | The original string, with the file name extension doubled |
This comes with the Glob.replace() API.
Notice that "$" is a meta character only in the replacement.
Also notice that when replacements are combined with includes-excludes (see above), each include has its own replacement; using replacements with excludes does not make sense.
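
Because "$0", "$1" and so on are also the group references of `java.util.regex.Matcher`, the two table rows can be reproduced with a short stand-alone sketch. This is not how `Glob.replace()` is implemented; `ReplacementSketch` is a hypothetical class that handles a single "pattern=replacement" pair and a reduced wildcard translation.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ReplacementSketch {

    /** Reduced wildcard translation: "*", "?" and "." are rewritten, everything else is regex syntax. */
    static String wildcardToRegex(String glob) {
        StringBuilder regex = new StringBuilder();
        for (char c : glob.toCharArray()) {
            if (c == '*') {
                regex.append("[^/\\\\!]*");
            } else if (c == '?') {
                regex.append("[^/\\\\!]");
            } else if (c == '.') {
                regex.append("\\.");
            } else {
                regex.append(c);
            }
        }
        return regex.toString();
    }

    /** Returns the replaced string, or null if the subject does not match the pattern part. */
    static String replace(String expression, String subject) {
        int eq = expression.indexOf('=');
        if (eq < 0) {
            throw new IllegalArgumentException("No '=' in expression: " + expression);
        }
        Pattern pattern = Pattern.compile(wildcardToRegex(expression.substring(0, eq)));
        String replacement = expression.substring(eq + 1);

        Matcher matcher = pattern.matcher(subject);
        if (!matcher.matches()) {
            return null;
        }

        // Expand "$0", "$1", ... against the groups of the successful whole-string match.
        StringBuffer result = new StringBuffer();
        matcher.appendReplacement(result, replacement);
        matcher.appendTail(result);
        return result.toString();
    }

    public static void main(String[] args) {
        System.out.println(replace("*.c=$0.bak", "foo.c"));            // foo.c.bak
        System.out.println(replace("(*).(*)=$1.$2$2", "report.txt"));  // report.txttxt
        System.out.println(replace("*.c=$0.bak", "foo.h"));            // null
    }
}
```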
Conclusion
All these features can be combined mercilessly, e.g.:
    *=$0$0~*.bak
    (*).docx{0,1}=$1.txt~*blabla*
Using it is simple:
    import de.unkrig.commons.text.pattern.Glob;
    import de.unkrig.commons.text.pattern.Pattern2;

    Glob glob = Glob.compile(
        "*.c=$0.C,*.h=$0.H",
        Pattern2.WILDCARD | Glob.INCLUDES_EXCLUDES | Glob.REPLACEMENT
    );

    glob.match("foo.c");     // returns true
    glob.replace("foo.h");   // returns "foo.h.H" ("$0" is the entire match)
    glob.replace("foo.cpp"); // returns null
Remember: Pattern2.WILDCARD makes the regex compiler understand wildcard characters. Glob.INCLUDES_EXCLUDES activates the recognition of "," and "~". Finally, Glob.REPLACEMENT activates the recognition of "=".