SMStringRegexp (ProCAKE Framework 6.0.0 API)

All Superinterfaces: SimilarityMeasure, SMString

All Known Implementing Classes: SMStringRegexpImpl, SMStringWildcardImpl

public interface SMStringRegexp extends SMString

The query is interpreted as a regular expression in the specified syntax. A regular expression is a sequence of characters that defines a pattern. The default syntax is PERL5; the implementation also supports EGREP and POSIX. Each syntax supports a different set of expressions, which are listed below.

Similarity

A regular expression query can only indicate whether an expression matches or does not match, so the similarity is always either 1 or 0.

For example:

If the query is "t.st" and the case is "test", then the similarity is 1. The operator . matches any single character.
If the query is "^[aet]est" and the case is "test", then the similarity is 1. [aet] is a set of characters, and ^ anchors the match at the beginning of a line.
If the query is "test\D." and the case is "test1", then the similarity is 0. The operator \D matches any non-digit character, so "1" is rejected. For example, "testXY" would match.
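The binary behaviour described above can be sketched in plain Java. This is an illustration only: it uses java.util.regex from the JDK as a stand-in for the engine behind SMStringRegexp, it assumes that the whole case value has to match the query, and the class and method names are hypothetical rather than part of ProCAKE.

```java
import java.util.regex.Pattern;

/** Hypothetical sketch; not part of the ProCAKE API. */
class BinaryRegexSimilaritySketch {

    /**
     * Returns 1.0 if the whole case value matches the query pattern, otherwise 0.0.
     * Assumption: full-string matching with java.util.regex (Perl-like syntax).
     */
    static double similarity(String query, String caseValue) {
        return Pattern.matches(query, caseValue) ? 1.0 : 0.0;
    }

    public static void main(String[] args) {
        System.out.println(similarity("t.st", "test"));       // 1.0
        System.out.println(similarity("^[aet]est", "test"));  // 1.0
        System.out.println(similarity("test\\D.", "test1"));  // 0.0, because "1" is a digit
    }
}
```

Note that the backslash in \D has to be doubled inside a Java string literal.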

Brief Background

A regular expression consists of a character string in which some characters are given special meaning with regard to pattern matching. Regular expressions have been in use since the early days of computing and provide a powerful and efficient way to parse, interpret, search and replace text within an application.

Supported Syntax

Within a regular expression, the following characters have special meaning:

Valid regular expression primitives
Operator	Meaning	Syntax
Positional Operators
^	matches at the beginning of a line	EGREP, PERL5, POSIX
$	matches at the end of a line	EGREP, PERL5, POSIX
\A	matches the start of the entire string	PERL5
\Z	matches the end of the entire string	PERL5
\b	matches at a word break	PERL5
\B	matches at a non-word break (opposite of \b)	PERL5
\<	matches at the start of a word	EGREP
\>	matches at the end of a word	EGREP

One-Character Operators
.	matches any single character	EGREP, PERL5, POSIX
.*	matches zero or more characters	EGREP, PERL5, POSIX
\d	matches any decimal digit	PERL5
\D	matches any non-digit	PERL5
\n	matches a newline character	PERL5
\r	matches a return character	PERL5
\s	matches any whitespace character	PERL5
\S	matches any non-whitespace character	PERL5
\t	matches a horizontal tab character	PERL5
\w	matches any word (alphanumeric) character	PERL5
\W	matches any non-word character	PERL5
\x	matches the character x, if x is not one of the above listed escape sequences	PERL5

Character Class Operators
[abc]	matches any character in the set a, b or c	EGREP, PERL5, POSIX
[^abc]	matches any character not in the set a, b or c	EGREP, PERL5, POSIX
[a-z]	matches any character in the range a to z (both inclusive)	EGREP, PERL5, POSIX

Special Sequences in Character Classes
[:alnum:]	Any alphanumeric character	EGREP, PERL5, POSIX
[:alpha:]	Any alphabetic character	EGREP, PERL5, POSIX
[:blank:]	A space or horizontal tab	EGREP, PERL5, POSIX
[:cntrl:]	A control character	EGREP, PERL5, POSIX
[:digit:]	A decimal digit	EGREP, PERL5, POSIX
[:graph:]	A non-space, non-control character	EGREP, PERL5, POSIX
[:lower:]	A lowercase character	EGREP, PERL5, POSIX
[:print:]	Same as graph, but also space and tab	EGREP, PERL5, POSIX
[:punct:]	A punctuation character	EGREP, PERL5, POSIX
[:space:]	Any whitespace character, including newline and return	EGREP, PERL5, POSIX
[:upper:]	An uppercase letter	EGREP, PERL5, POSIX
[:xdigit:]	A valid hexadecimal digit	EGREP, PERL5, POSIX

Subexpressions and Backreferences
(abc)	matches whatever the expression abc would match, and saves it as a subexpression, also used for grouping	EGREP, PERL5, POSIX
(?:...)	pure grouping operator, doesn't save contents	EGREP, PERL5, POSIX
(?#...)	embedded comment, ignored by the engine	EGREP, PERL5, POSIX
\n	where 0 < n < 10, matches the same thing the nth subexpression matched	EGREP, PERL5, POSIX
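The difference between saving and non-saving groups, and the effect of a backreference, can be illustrated with a short sketch. It uses java.util.regex from the JDK as a stand-in engine (an assumption; it is not the engine documented here), and the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

/** Hypothetical sketch using java.util.regex as a stand-in engine. */
class BackreferenceSketch {

    /** True if the input is some run of word characters immediately followed by the same run. */
    static boolean isDoubled(String input) {
        // (\w+) saves the first run as subexpression 1; \1 must match the same text again.
        return Pattern.matches("(\\w+)\\1", input);
    }

    /** Number of subexpressions that the pattern saves. */
    static int savedGroups(String regex) {
        return Pattern.compile(regex).matcher("").groupCount();
    }

    public static void main(String[] args) {
        System.out.println(isDoubled("abab"));          // true
        System.out.println(isDoubled("abcd"));          // false
        System.out.println(savedGroups("(ab)(cd)"));    // 2
        System.out.println(savedGroups("(?:ab)(cd)"));  // 1, (?:...) only groups
    }
}
```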

Branching (Alternation) Operator
a|b	matches whatever the expression a would match, or whatever the expression b would match	EGREP, PERL5

Repeating Operators	(operate on the previous atomic expression)
?	matches the preceding expression or the null string	EGREP, PERL5
*	matches the null string or any number of repetitions of the preceding expression	EGREP, PERL5
+	matches one or more repetitions of the preceding expression	EGREP, PERL5
{m}	matches exactly m repetitions of the one-character expression	EGREP, PERL5
{m,n}	matches between m and n repetitions of the preceding expression (inclusive)	EGREP, PERL5
{m,}	matches m or more repetitions of the preceding expression	EGREP, PERL5
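The repeating operators listed above can be tried out with a small sketch. It assumes java.util.regex from the JDK as a stand-in engine and full-string matching; the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

/** Hypothetical sketch using java.util.regex as a stand-in engine. */
class RepetitionSketch {

    /** True if the whole input matches the pattern. */
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(matches("ab?", "a"));         // true: b is optional
        System.out.println(matches("ab*", "abbb"));      // true: any number of b
        System.out.println(matches("ab+", "a"));         // false: at least one b is required
        System.out.println(matches("ab{2}", "abb"));     // true: exactly two b
        System.out.println(matches("ab{2,3}", "abbbb")); // false: four b exceeds the upper bound
        System.out.println(matches("ab{2,}", "abbbb"));  // true: two or more b
    }
}
```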

Stingy (Minimal) Matching

If a repeating operator (above) is immediately followed by a ?, the repeating operator stops at the smallest number of repetitions that can complete the rest of the match.
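The effect of the trailing ? is easiest to see by comparing the text consumed with and without it. The following sketch assumes java.util.regex from the JDK as a stand-in engine; the class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch using java.util.regex as a stand-in engine. */
class StingyMatchSketch {

    /** Returns the text of the first match found in the input, or null if there is none. */
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Greedy: + takes as many characters as possible.
        System.out.println(firstMatch("<.+>", "<a><b>"));   // <a><b>
        // Stingy: +? stops at the first point where the rest of the pattern can match.
        System.out.println(firstMatch("<.+?>", "<a><b>"));  // <a>
    }
}
```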

Lookahead

Lookahead refers to the ability to match part of an expression without consuming any of the input text. There are two variations of this:

Lookahead variants
(?=foo)	matches at any position where foo would match, but does not consume any characters of the input
(?!foo)	matches at any position where foo would not match, but does not consume any characters of the input
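Both lookahead variants can be demonstrated by checking what the match actually contains. The sketch below assumes java.util.regex from the JDK as a stand-in engine; the class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch using java.util.regex as a stand-in engine. */
class LookaheadSketch {

    /** Returns the text of the first match found in the input, or null if there is none. */
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Positive lookahead: "bar" must follow, but it is not part of the match.
        System.out.println(firstMatch("foo(?=bar)", "foobar"));  // foo
        // Negative lookahead: "bar" must not follow.
        System.out.println(firstMatch("foo(?!bar)", "foobar"));  // null
        System.out.println(firstMatch("foo(?!bar)", "foobaz"));  // foo
    }
}
```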

Unsupported Syntax

Some flavors of regular expression utilities support additional escape sequences, and this is not meant to be an exhaustive list. In the future, gnu.regexp may support some or all of the following:

Unsupported Syntax (might be outdated)
(?mods)	inlined compilation/execution modifiers	PERL5
\G	end of previous match	PERL5
[.symbol.]	collating symbol in class expression	POSIX
[=class=]	equivalence class in class expression	POSIX
s/foo/bar/	style expressions as in sed and awk (note: these can be accomplished through other means in the API)

Java Integration

In a Java environment, a regular expression operates on a string of Unicode characters, represented either as an instance of java.lang.String or as an array of the primitive char type. This means that the unit of matching is a Unicode character, not a single byte. Generally this will not present problems in a Java program, because Java takes pains to ensure that all textual data uses the Unicode standard.

Because Java string processing takes care of certain escape sequences, they are not implemented in gnu.regexp. You should be aware that the following escape sequences are handled by the Java compiler if found in the Java source:

Escape sequences that are handled by the Java compiler
\b	backspace
\f	form feed
\n	newline
\r	carriage return
\t	horizontal tab
\"	double quote
\'	single quote
\\	backslash
\xxx	character, in octal (000-377)
\uxxxx	Unicode character, in hexadecimal (0000-FFFF)

In addition, note that the \u escape sequences are meaningful anywhere in a Java program, not merely within a single- or double-quoted character string, and are converted prior to any of the other escape sequences. For example, the line gnu.regexp.RE exp = new gnu.regexp.RE("\u005cn"); would be converted by first replacing \u005c with a backslash, then converting \n to a newline. By the time the RE constructor is called, it is passed a String object containing only the Unicode newline character.
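The two layers of escaping (Java compiler first, regular expression engine second) can be made visible by looking at string lengths. This sketch assumes java.util.regex from the JDK as a stand-in engine; the compiler behaviour it shows is independent of the regular expression library. The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

/** Hypothetical sketch of how Java string escapes interact with regex escapes. */
class EscapeLayersSketch {

    /** True if the given pattern matches a string consisting of a single newline. */
    static boolean matchesNewline(String regex) {
        return Pattern.matches(regex, "\n");
    }

    /** True if the given pattern matches the single character '7'. */
    static boolean matchesSeven(String regex) {
        return Pattern.matches(regex, "7");
    }

    public static void main(String[] args) {
        // The compiler turns the two source characters into one newline character.
        System.out.println("\n".length());           // 1
        // A doubled backslash survives compilation as backslash + n.
        System.out.println("\\n".length());          // 2
        // Either form ends up matching a newline.
        System.out.println(matchesNewline("\n"));    // true
        System.out.println(matchesNewline("\\n"));   // true
        // Engine-level escapes such as \d need the doubled backslash in Java source.
        System.out.println(matchesSeven("\\d"));     // true
    }
}
```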

The POSIX character classes (above) and the equivalent shorthand escapes (\d, \w and the like) are implemented using the java.lang.Character static functions whenever possible. For example, \w and [:alnum:] (the latter only from within a class expression) invoke the Java function Character.isLetterOrDigit() when executing. It is generally better to use the POSIX expressions than a range such as [a-zA-Z0-9], because the latter does not match letter characters outside the ASCII range (for example, the umlaut character "ü").
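The gap between an explicit ASCII range and Character.isLetterOrDigit() can be checked directly. In this sketch the range test uses java.util.regex from the JDK as a stand-in engine, and the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

/** Hypothetical sketch comparing an ASCII range with Character.isLetterOrDigit(). */
class UnicodeLetterSketch {

    /** True if the character falls into the explicit range [a-zA-Z0-9]. */
    static boolean inAsciiRange(char c) {
        return Pattern.matches("[a-zA-Z0-9]", String.valueOf(c));
    }

    /** True if Java classifies the character as a letter or digit. */
    static boolean isLetterOrDigit(char c) {
        return Character.isLetterOrDigit(c);
    }

    public static void main(String[] args) {
        char umlaut = '\u00fc'; // the umlaut character U+00FC
        System.out.println(inAsciiRange('a'));        // true
        System.out.println(inAsciiRange(umlaut));     // false
        System.out.println(isLetterOrDigit(umlaut));  // true
    }
}
```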

Online References

Syntax and Usage Notes of the Package gnu.regexp
GNU Library:Regular Expressions

Author: Rainer Maximini

Field Summary

Fields

Modifier and Type	Field	Description
static final String	NAME	Name of similarity measure is "StringRegexp".
static final RegExpSyntax	SYNTAX_DEFAULT	The default syntax is RegExpSyntax.PERL5.

Fields inherited from interface de.uni_trier.wi2.procake.similarity.SimilarityMeasure
LOG_ORDER_NAME_NOT_FOUND

Method Summary

Modifier and Type	Method	Description
RegExpSyntax	getSyntax()
void	setSyntax(RegExpSyntax style)

Methods inherited from interface de.uni_trier.wi2.procake.similarity.SimilarityMeasure
compute, getDataClass, getName, getSystemName, isForceOverride, isReusable, setForceOverride

Field Details
- NAME
  
  static final String NAME
  
  Name of similarity measure is "StringRegexp".
  See Also: Constant Field Values
- SYNTAX_DEFAULT
  
  static final RegExpSyntax SYNTAX_DEFAULT
  
  The default syntax is RegExpSyntax.PERL5.

Method Details
- getSyntax
  
  RegExpSyntax getSyntax()
- setSyntax
  
  void setSyntax(RegExpSyntax style)
