Interface SMStringRegexp
- All Superinterfaces:
SimilarityMeasure, SMString
- All Known Implementing Classes:
SMStringRegexpImpl, SMStringWildcardImpl
Similarity
A regular expression query can only indicate whether an expression matches or does not match, so the similarity is either 1 or 0.
For example:
- If the query is "t.st" and the case is "test", the similarity is 1. The . operator matches any single character.
- If the query is "^[aet]est" and the case is "test", the similarity is 1. [aet] is a set of characters, and ^ anchors the match at the beginning of the line.
- If the query is "test\D." and the case is "test1", the similarity is 0. \D matches any non-digit character, so the match fails at the digit "1". A case such as "testXy" would match, because \D matches "X" and . matches "y".
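The binary behaviour described above can be sketched in a few lines of Java. This is only an illustration under stated assumptions: it uses java.util.regex rather than the engine behind this interface, the class BinaryRegexSimilaritySketch and its similarity method are hypothetical names, and it assumes that the whole case string has to match the query.

```java
import java.util.regex.Pattern;

/** Hypothetical helper, not part of ProCAKE: maps a regex match to a binary similarity. */
final class BinaryRegexSimilaritySketch {

    /** Returns 1.0 if the whole case string matches the query pattern, otherwise 0.0. */
    static double similarity(String queryPattern, String caseValue) {
        // Assumption: the entire case string has to match, not just a substring.
        return Pattern.compile(queryPattern).matcher(caseValue).matches() ? 1.0 : 0.0;
    }

    public static void main(String[] args) {
        System.out.println(similarity("t.st", "test"));       // 1.0
        System.out.println(similarity("^[aet]est", "test"));  // 1.0
        System.out.println(similarity("test\\D.", "test1"));  // 0.0
    }
}
```

Note that the backslash in \D has to be doubled in Java source so that the regex engine receives it.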
Brief Background
A regular expression consists of a character string where some characters are given special meaning with regard to pattern matching. Regular expressions have been in use from the early days of computing, and provide a powerful and efficient way to parse, interpret, search, and replace text within an application.
Supported Syntax
Within a regular expression, the following characters have special meaning:
Operator | Meaning | Syntax |
Positional Operators | ||
^ | matches at the beginning of a line | EGREP, PERL5, POSIX |
$ | matches at the end of a line | EGREP, PERL5, POSIX |
\A | matches the start of the entire string | PERL5 |
\Z | matches the end of the entire string | PERL5 |
\b | matches at a word break | PERL5 |
\B | matches at a non-word break (opposite of \b) | PERL5 |
\< | matches at the start of a word | EGREP |
\> | matches at the end of a word | EGREP |
One-Character Operators | ||
. | matches any single character | EGREP, PERL5, POSIX |
.* | matches zero or more characters | EGREP, PERL5, POSIX |
\d | matches any decimal digit | PERL5 |
\D | matches any non-digit | PERL5 |
\n | matches a newline character | PERL5 |
\r | matches a return character | PERL5 |
\s | matches any whitespace character | PERL5 |
\S | matches any non-whitespace character | PERL5 |
\t | matches a horizontal tab character | PERL5 |
\w | matches any word (alphanumeric) character | PERL5 |
\W | matches any non-word character | PERL5 |
\x | matches the character x, if x is not one of the above listed escape sequences | PERL5 |
Character Class Operators | ||
[abc] | matches any character in the set a, b or c | EGREP, PERL5, POSIX |
[^abc] | matches any character not in the set a, b or c | EGREP, PERL5, POSIX |
[a-z] | matches any character in the range a to z (both inclusive) | EGREP, PERL5, POSIX |
Special Sequences in Character Classes | ||
[:alnum:] | Any alphanumeric character | EGREP, PERL5, POSIX |
[:alpha:] | Any alphabetic character | EGREP, PERL5, POSIX |
[:blank:] | A space or horizontal tab | EGREP, PERL5, POSIX |
[:cntrl:] | A control character | EGREP, PERL5, POSIX |
[:digit:] | A decimal digit | EGREP, PERL5, POSIX |
[:graph:] | A non-space, non-control character | EGREP, PERL5, POSIX |
[:lower:] | A lowercase character | EGREP, PERL5, POSIX |
[:print:] | Same as graph, but also space and tab | EGREP, PERL5, POSIX |
[:punct:] | A punctuation character | EGREP, PERL5, POSIX |
[:space:] | Any whitespace character, including newline and return | EGREP, PERL5, POSIX |
[:upper:] | An uppercase letter | EGREP, PERL5, POSIX |
[:xdigit:] | A valid hexadecimal digit | EGREP, PERL5, POSIX |
Subexpressions and Backreferences | ||
(abc) | matches whatever the expression abc would match, and saves it as a subexpression, also used for grouping | EGREP, PERL5, POSIX |
(?:...) | pure grouping operator, doesn't save contents | EGREP, PERL5, POSIX |
(?#...) | embedded comment, ignored by engine | EGREP, PERL5, POSIX |
\n | where 0 < n < 10, matches the same thing the nth subexpression matched | EGREP, PERL5, POSIX |
Branching (Alternation) Operator | ||
a|b | matches whatever the expression a would match, or whatever the expression b would match | EGREP, PERL5 |
Repeating Operators | (operate on the previous atomic expression) | |
? | matches the preceding expression or the null string | EGREP, PERL5 |
* | matches the null string or any number of repetitions of the preceding expression | EGREP, PERL5 |
+ | matches one or more repetitions of the preceding expression | EGREP, PERL5 |
{m} | matches exactly m repetitions of the one-character expression | EGREP, PERL5 |
{m,n} | matches between m and n repetitions of the preceding expression (inclusive) | EGREP, PERL5 |
{m,} | matches m or more repetitions of the preceding expression | EGREP, PERL5 |
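A few of the operators from the table can be tried out with a short Java sketch. It is a hedged illustration only: it uses java.util.regex as a stand-in engine, restricts itself to operators that java.util.regex shares with the table (the EGREP word anchors \< and \> and the [:alnum:] style classes are written differently there), and the class OperatorSketch with its helper methods is hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical demo class, not part of ProCAKE or gnu.regexp. */
final class OperatorSketch {

    /** True if the pattern matches somewhere in the input. */
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    /** Returns the text saved by the first subexpression, or null if there is no match. */
    static String firstGroup(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Positional operators, \d and {m,n}: "case" followed by two or three digits.
        System.out.println(found("^case\\d{2,3}$", "case42"));   // true
        // Negated character class: every character of the input is a, b or c.
        System.out.println(found("[^abc]", "abc"));              // false
        // Backreference: \1 must repeat what subexpression 1 matched.
        System.out.println(found("(ab)\\1", "xxababxx"));        // true
        // Alternation.
        System.out.println(found("cat|dog", "hotdog"));          // true
        // Saved subexpression: the word characters before the @ sign.
        System.out.println(firstGroup("(\\w+)@", "user@host"));  // user
    }
}
```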
Lookahead
Lookahead refers to the ability to match part of an expression without consuming any of the input text. There are two variations to this:
(?=foo) | matches at any position where foo would match, but does not consume any characters of the input |
(?!foo) | matches at any position where foo would not match, but does not consume any characters of the input |
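The following sketch shows what "does not consume any characters" means in practice. It assumes java.util.regex as the engine, which supports the same two lookahead forms; the class LookaheadSketch and its helpers are hypothetical names used only for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical demo class, not part of ProCAKE or gnu.regexp. */
final class LookaheadSketch {

    /** Returns the first matched text, or null if the pattern does not match anywhere. */
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Positive lookahead: "bar" must follow, but it is not part of the match.
        System.out.println(firstMatch("foo(?=bar)", "foobar"));  // foo
        // Negative lookahead: "foo" is rejected because "bar" follows it.
        System.out.println(firstMatch("foo(?!bar)", "foobar"));  // null
        // Negative lookahead: "foo" is accepted because "baz" follows it.
        System.out.println(firstMatch("foo(?!bar)", "foobaz"));  // foo
    }
}
```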
Unsupported Syntax
Some flavors of regular expression utilities support additional escape sequences, and this is not meant to be an exhaustive list. In the future, gnu.regexp may support some or all of the following:
(?mods) | inlined compilation/execution modifiers | PERL5 |
\G | end of previous match | PERL5 |
[.symbol.] | collating symbol in class expression | POSIX |
[=class=] | equivalence class in class expression | POSIX |
s/foo/bar/ | style expressions as in sed and awk (note: these can be accomplished through other means in the API) |
Java Integration
In a Java environment, a regular expression operates on a string of Unicode characters, represented either as an instance of java.lang.String or as an array of the primitive char type. This means that the unit of matching is a Unicode character, not a single byte. Generally this will not present problems in a Java program, because Java takes pains to ensure that all textual data uses the Unicode standard.
Because Java string processing takes care of certain escape sequences, they are not implemented in gnu.regexp. You should be aware that the following escape sequences are handled by the Java compiler if found in the Java source:
\b | backspace |
\f | form feed |
\n | newline |
\r | carriage return |
\t | horizontal tab |
\" | double quote |
\' | single quote |
\\ | backslash |
\xxx | character, in octal (000-377) |
\uxxxx | Unicode character, in hexadecimal (0000-FFFF) |
In addition, note that the \u escape sequences are meaningful anywhere in a Java program, not merely within a singly- or doubly-quoted character string, and are converted prior to any of the other escape sequences. For example, the line gnu.regexp.RE exp = new gnu.regexp.RE("\u005cn"); would be converted by first replacing \u005c with a backslash, then converting \n to a newline. By the time the RE constructor is called, it will be passed a String object containing only the Unicode newline character.
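The two-stage conversion can be checked with a small Java sketch. It is an illustration only: the class EscapeSketch and its constants are hypothetical, and java.util.regex is used in place of gnu.regexp to show what the regex engine finally receives.

```java
import java.util.regex.Pattern;

/** Hypothetical demo class, not part of ProCAKE or gnu.regexp. */
final class EscapeSketch {

    // The compiler first turns the Unicode escape into a backslash and then reads
    // the resulting backslash-n pair as a newline, so this string holds one character.
    static final String VIA_UNICODE_ESCAPE = "\u005cn";

    // A doubled backslash survives compilation as a single backslash, so the regex
    // engine receives the two characters backslash and d.
    static final String DIGIT_CLASS = "\\d";

    public static void main(String[] args) {
        System.out.println(VIA_UNICODE_ESCAPE.equals("\n"));     // true
        System.out.println(DIGIT_CLASS.length());                // 2
        System.out.println(Pattern.matches(DIGIT_CLASS, "7"));   // true
    }
}
```

In practice this means that regex escapes such as \d or \w need a doubled backslash in Java source, while \n and \t may be written with a single backslash because the compiler already inserts the actual character.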
The POSIX character classes (above) and the equivalent shorthand escapes (\d, \w and the like) are implemented to use the java.lang.Character static methods whenever possible. For example, \w and [:alnum:] (the latter only from within a class expression) will invoke the Java method Character.isLetterOrDigit() when executing. It is generally better to use the POSIX expressions than a range such as [a-zA-Z0-9], because the latter will not match letters outside the ASCII range (for example, the letter "ü").
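The difference between the Character-based test and an explicit range can be sketched as follows. The class UnicodeLetterSketch is a hypothetical name, and the sketch calls Character.isLetterOrDigit() directly, because in java.util.regex (used here only for the range check) \w is limited to ASCII unless the UNICODE_CHARACTER_CLASS flag is set.

```java
import java.util.regex.Pattern;

/** Hypothetical demo class, not part of ProCAKE or gnu.regexp. */
final class UnicodeLetterSketch {

    // U+00FC is the lowercase letter u with an umlaut, written as an escape
    // so that the source file stays ASCII-only.
    static final char U_UMLAUT = '\u00fc';

    /** True if the character falls into the explicit range a-z, A-Z or 0-9. */
    static boolean inAsciiRange(char c) {
        return Pattern.matches("[a-zA-Z0-9]", String.valueOf(c));
    }

    public static void main(String[] args) {
        System.out.println(Character.isLetterOrDigit(U_UMLAUT));  // true
        System.out.println(inAsciiRange(U_UMLAUT));               // false
        System.out.println(inAsciiRange('u'));                    // true
    }
}
```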
- Author:
Rainer Maximini
Field Summary
Modifier and Type | Field | Description |
static final String | NAME | Name of similarity measure is "StringRegexp". |
static final RegExpSyntax | SYNTAX_DEFAULT | The default syntax is RegExpSyntax.PERL5. |
Fields inherited from interface de.uni_trier.wi2.procake.similarity.SimilarityMeasure
LOG_ORDER_NAME_NOT_FOUND
Method Summary
Methods inherited from interface de.uni_trier.wi2.procake.similarity.SimilarityMeasure
compute, getDataClass, getName, getSystemName, isForceOverride, isReusable, setForceOverride
Field Details
NAME
Name of similarity measure is "StringRegexp".
SYNTAX_DEFAULT
The default syntax is RegExpSyntax.PERL5.
Method Details
getSyntax
RegExpSyntax getSyntax()
setSyntax