Using Globs
This article documents the "Globs" feature implemented in commons.unkrig.de.
Introduction
Surely you have used globs before (maybe even without knowing that they are called "globs"):

    *.txt
    dir/*.doc
    execut?r.txt

Globs are a widespread concept that is used throughout the UNIX, MICROSOFT and other "worlds".
All implementations support at least the following elements:
Construct | Matches |
---|---|
? | Any character except the file separator ("/" and/or "\") |
* | Zero or more characters except the file separator |
x | The character x |
Some implementations add more features:
Construct | Matches |
---|---|
[abc] | Exactly one of the characters "a", "b" and "c" |
[^abc] or [!abc] | Any character except "a", "b" or "c" |
[A-Za-z] | Any (latin) letter |
** | Zero or more characters (including the file separator) |
\X | "X", even if "X" is a meta character |
Since version 1.7 JAVA provides its own implementation of "glob matching", which adds another (quite uncommon) construct:
Construct | Matches |
---|---|
{alpha,beta,gamma} | Any of "alpha", "beta" or "gamma" |
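To illustrate the JAVA glob syntax, here is a small sketch that uses the standard java.nio.file.PathMatcher API. The class name JavaGlobDemo and the helper globMatches() are made up for this example; only FileSystems, PathMatcher and Paths come from the JDK.

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;

class JavaGlobDemo {

    /** Returns whether the given path matches the given JAVA glob (hypothetical helper). */
    public static boolean globMatches(String glob, String path) {
        // The "glob:" prefix selects glob syntax (as opposed to "regex:").
        PathMatcher matcher = FileSystems.getDefault().getPathMatcher("glob:" + glob);
        return matcher.matches(Paths.get(path));
    }

    public static void main(String[] args) {
        // The {a,b} construct matches any of the listed alternatives.
        System.out.println(globMatches("*.{txt,doc}", "notes.txt"));        // true
        System.out.println(globMatches("*.{txt,doc}", "notes.pdf"));        // false
        // "*" does not cross directory boundaries, "**" does.
        System.out.println(globMatches("*.{txt,doc}", "dir/notes.txt"));    // false
        System.out.println(globMatches("**/*.{txt,doc}", "dir/notes.txt")); // true
    }
}
```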
All in all, globs are very practical, though not very powerful. This is why de.unkrig.commons provides yet another implementation, as follows:
Regular expressions features
First, de.unkrig.commons.text.pattern.Pattern2 combines the simplicity of globs with the full power of JAVA regular expressions.
It does so by rewriting a few characters in the glob before feeding it to java.util.regex.Pattern.compile(), which effectively turns that method into a glob compiler:
Glob construct | Regex construct | Matches |
---|---|---|
? | [^/\\!] | Any character except the file separator and "!" |
* | [^/\\!]* | Zero or more characters except the file separator and "!" |
** | [^!]* | Zero or more characters except "!" |
*** | .* | Zero or more characters |
. | \. | The dot is a literal (not a character class as in a regular expression) |
If you have not yet worked with regular expressions: JAVA regular expressions are very powerful and introduce many constructs and meta characters. Find the complete reference documentation here.
Since "?" and "*" are wildcards here and not quantifiers as in regular expressions ("?" == zero-or-one, "*" == zero-or-more), one has to use "{0,1}" and "{0,}" instead. Likewise, since "." is a literal, the alternative notation for "any character" is "[^]".
Find the reference documentation here.
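The translation table above can be approximated with plain java.util.regex. The following is only a minimal sketch of the idea, not the actual Pattern2 implementation: the class WildcardSketch and its methods are hypothetical, "?" is assumed to match exactly one character, and three or more asterisks are treated like "***".

```java
import java.util.regex.Pattern;

class WildcardSketch {

    /** Rewrites the wildcard characters of a glob into their regex equivalents. */
    public static String toRegex(String wildcard) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < wildcard.length(); i++) {
            char c = wildcard.charAt(i);
            if (c == '*') {
                // Count the run of consecutive asterisks.
                int n = 1;
                while (i + 1 < wildcard.length() && wildcard.charAt(i + 1) == '*') {
                    n++;
                    i++;
                }
                if (n == 1) {
                    sb.append("[^/\\\\!]*"); // no file separator, no "!"
                } else if (n == 2) {
                    sb.append("[^!]*");      // anything except "!"
                } else {
                    sb.append(".*");         // anything
                }
            } else if (c == '?') {
                sb.append("[^/\\\\!]");
            } else if (c == '.') {
                sb.append("\\.");            // the dot is a literal
            } else if (c == '\\' && i + 1 < wildcard.length()) {
                sb.append(c).append(wildcard.charAt(++i)); // keep escapes intact
            } else {
                sb.append(c);                // everything else keeps its regex meaning
            }
        }
        return sb.toString();
    }

    /** Returns whether the entire subject matches the wildcard pattern. */
    public static boolean matches(String wildcard, String subject) {
        return Pattern.compile(toRegex(wildcard), Pattern.DOTALL).matcher(subject).matches();
    }

    public static void main(String[] args) {
        System.out.println(toRegex("*.txt"));                       // [^/\\!]*\.txt
        System.out.println(matches("*.txt", "dir/foo.txt"));        // false
        System.out.println(matches("**.txt", "dir/foo.txt"));       // true
        System.out.println(matches("*.docx{0,1}", "report.doc"));   // true
    }
}
```

Because all other characters are passed through unchanged, regex constructs such as groups, character classes and the "{0,1}" quantifier remain available inside the glob.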
The API of de.unkrig.commons.text.pattern.Pattern2 is fully compatible with that of java.util.regex.Pattern; in addition, it provides a new compilation flag, Pattern2.WILDCARD, that switches the expression syntax from regex to globs.
Includes-Excludes
Second, it adds includes-excludes. This involves two new meta characters, "," and "~":
Construct | Matches |
---|---|
pattern1,pattern2 | Any string that matches pattern1 or pattern2 |
pattern1~pattern2 | Any string that matches pattern1, but not pattern2 |
~pattern1 | Any string that does not match pattern1 |
pattern1,pattern2~pattern3~pattern4,pattern5 | Any string that matches pattern1 or pattern2, but not pattern3 nor pattern4; plus any string that matches pattern5 |
This may sound complicated, but the very simple rule is: The patterns are applied right-to-left, and the first match determines the result.
This comes with de.unkrig.commons.text.pattern.Glob.compile() and the new INCLUDES_EXCLUDES compilation flag.
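The right-to-left rule can be sketched in a few lines of plain JAVA. This is an illustration of the evaluation order only, not the library's code: the class IncludesExcludesSketch is hypothetical, it supports just "*" and "?" as wildcards, it does not handle "," inside constructs like "{0,1}", and it assumes that a leading "~" implies an "include everything" rule to its left.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class IncludesExcludesSketch {

    /** One include or exclude rule. */
    private static final class Rule {
        final boolean include;
        final Pattern pattern;

        Rule(boolean include, Pattern pattern) {
            this.include = include;
            this.pattern = pattern;
        }
    }

    private final List<Rule> rules = new ArrayList<>();

    public IncludesExcludesSketch(String spec) {
        boolean include = true; // the first pattern is an include
        StringBuilder current = new StringBuilder();
        for (int i = 0; i <= spec.length(); i++) {
            char c = i < spec.length() ? spec.charAt(i) : ','; // sentinel closes the last pattern
            if (c == ',' || c == '~') {
                if (current.length() > 0) {
                    rules.add(new Rule(include, toRegex(current.toString())));
                } else if (rules.isEmpty() && i < spec.length()) {
                    // Assumption: a leading "~" excludes from "everything".
                    rules.add(new Rule(true, Pattern.compile(".*", Pattern.DOTALL)));
                }
                include = (c == ','); // "," starts an include, "~" an exclude
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
    }

    /** Simplified wildcard translation: only "*" and "?" are special. */
    private static Pattern toRegex(String glob) {
        StringBuilder sb = new StringBuilder();
        for (char c : glob.toCharArray()) {
            if (c == '*') {
                sb.append("[^/\\\\!]*");
            } else if (c == '?') {
                sb.append("[^/\\\\!]");
            } else {
                sb.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(sb.toString());
    }

    /** Applies the rules right-to-left; the first matching rule decides. */
    public boolean matches(String subject) {
        for (int i = rules.size() - 1; i >= 0; i--) {
            Rule rule = rules.get(i);
            if (rule.pattern.matcher(subject).matches()) return rule.include;
        }
        return false; // no rule matched
    }

    public static void main(String[] args) {
        IncludesExcludesSketch ie = new IncludesExcludesSketch("*.c,*.h~test*,test_ok.c");
        System.out.println(ie.matches("foo.c"));     // true:  included by "*.c"
        System.out.println(ie.matches("test.h"));    // false: excluded by "test*"
        System.out.println(ie.matches("test_ok.c")); // true:  the rightmost include wins
    }
}
```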
Replacements
Third, it adds replacements. This involves one new meta character, "=":
Construct | Matches | Replaces with |
---|---|---|
*.c=$0.bak | Any string ending with ".c" | The original string plus ".bak" ("$0" represents the "entire match") |
(*).(*)=$1.$2$2 | Any string containing a dot | The original string, with the file name extension doubled |
This comes with the Glob.replace() API.
Notice that "$" is a meta character only in the replacement.
Also notice that when replacements are combined with includes-excludes (see above), each include has its own replacement; using a replacement with an exclude does not make sense.
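The "=" construct maps quite naturally onto the replacement strings of java.util.regex, which also use "$0", "$1" and so on. The following sketch shows the idea for a single pattern without includes-excludes; the class ReplacementSketch and its replace() method are hypothetical and are not the Glob.replace() implementation.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ReplacementSketch {

    /**
     * Applies a "wildcard=replacement" spec to the subject.
     * Returns the replaced string, or null if the subject does not match.
     */
    public static String replace(String spec, String subject) {
        int eq = spec.indexOf('=');
        if (eq < 0) throw new IllegalArgumentException("No '=' in \"" + spec + "\"");
        String wildcard = spec.substring(0, eq);
        String replacement = spec.substring(eq + 1);

        // Anchor the regex so that only a match of the entire subject counts.
        StringBuilder regex = new StringBuilder("\\A(?:");
        for (char c : wildcard.toCharArray()) {
            switch (c) {
                case '*': regex.append("[^/\\\\!]*"); break;
                case '?': regex.append("[^/\\\\!]");  break;
                case '.': regex.append("\\.");        break;
                default:  regex.append(c);            // "(" and ")" stay capturing groups
            }
        }
        regex.append(")\\z");

        Matcher m = Pattern.compile(regex.toString()).matcher(subject);
        // "$0" is the entire match, "$1", "$2" ... are the capturing groups.
        return m.matches() ? m.replaceFirst(replacement) : null;
    }

    public static void main(String[] args) {
        System.out.println(replace("*.c=$0.bak", "foo.c"));        // foo.c.bak
        System.out.println(replace("(*).(*)=$1.$2$2", "foo.txt")); // foo.txttxt
        System.out.println(replace("*.c=$0.bak", "foo.h"));        // null
    }
}
```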
Conclusion
All these features can be combined mercilessly, e.g.:
    *=$0$0~*.bak
    (*).docx{0,1}=$1.txt~*blabla*
Using it is simple:
    import de.unkrig.commons.text.pattern.Glob;
    import de.unkrig.commons.text.pattern.Pattern2;

    // Compile a glob with wildcards, includes-excludes and replacements enabled.
    Glob glob = Glob.compile(
        "*.c=$0.C,*.h=$0.H",
        Pattern2.WILDCARD | Glob.INCLUDES_EXCLUDES | Glob.REPLACEMENT
    );

    glob.match("foo.c");     // returns true
    glob.replace("foo.h");   // returns "foo.h.H" ("$0" is the entire match)
    glob.replace("foo.cpp"); // returns null
Remember: Pattern2.WILDCARD makes the regex compiler understand wildcard characters. Glob.INCLUDES_EXCLUDES activates the recognition of "," and "~". Finally, Glob.REPLACEMENT activates the recognition of "=".