A mutable set of Unicode characters
A mutable set of Unicode characters. Objects of this class represent character classes used in regular expressions. Such classes specify a subset of the set of all Unicode characters, which in this implementation is the characters from U+0000 to U+FFFF, ignoring surrogates.

This class supports two APIs. The first is modeled after Java 2's java.util.Set interface, although this class does not implement that interface. All methods of Set are supported, with the modification that they take a character range or single character instead of an Object, and they take a UnicodeSet instead of a Collection.

The second API is the applyPattern()/toPattern() API from the Format-derived classes. Unlike the methods that add characters, add categories, and control the logic of the set, the method applyPattern() sets all attributes of a UnicodeSet at once, based on a string pattern.

In addition, the set complement operation is supported through the complement() method.

Pattern syntax

Patterns are accepted by the constructors and the applyPattern() methods and returned by the toPattern() method. These patterns follow a syntax similar to that employed by version 8 regular expression character classes. Patterns specify individual characters, ranges of characters, and Unicode character categories. When elements are concatenated, they specify their union. To complement a set, place a '^' immediately after the opening '[' or '[:'. In any other location, '^' has no special meaning.

pattern := ('[' '^'? item* ']') | ('[:' '^'? category ':]')
item := char | (char '-' char) | pattern-expr
pattern-expr := pattern | pattern-expr pattern | pattern-expr op pattern
op := '&' | '-'
special := '[' | ']' | '-'
char := any character that is not special | ('\' any character) | ('\u' hex hex hex hex)
hex := any hex digit, as defined by Character.digit(c, 16)
Legend:
a := b    a may be replaced by b
a?        zero or one instance of a
a*        zero or more instances of a
a | b     either a or b
'a'       the literal string between the quotes

Ranges are indicated by placing a '-' between two characters, as in "a-z". This specifies the range of all characters from the left to the right, in Unicode order. If the left and right characters are the same, then the range consists of just that character. If the left character is greater than the right character, it is a syntax error. If a '-' occurs as the first character after the opening '[' or '[^', or if it occurs as the last character before the closing ']', then it is taken as a literal. Thus "[a\-b]", "[-ab]", and "[ab-]" all indicate the same set of three characters, 'a', 'b', and '-'.
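The range and literal '-' rules can be sketched with a small parser. This is only an illustration under stated assumptions: the class FlatPatternSketch and its methods are hypothetical and not part of this class, the set is modeled as a java.util.BitSet over U+0000 to U+FFFF, and only flat patterns are handled (no nested patterns, categories, operators, or \u escapes).

```java
import java.util.BitSet;

/** Hypothetical sketch: parses flat patterns such as "[a-z]", "[^a-z]" or "[-ab]". */
final class FlatPatternSketch {
    private final String p;
    private final int end; // index of the closing ']'
    private int pos = 1;   // index just past the opening '['

    private FlatPatternSketch(String pattern) {
        if (pattern.length() < 2 || pattern.charAt(0) != '['
                || pattern.charAt(pattern.length() - 1) != ']') {
            throw new IllegalArgumentException("pattern must be enclosed in [ ]");
        }
        p = pattern;
        end = pattern.length() - 1;
    }

    /** Reads one character, resolving a backslash escape if present. */
    private char readChar() {
        char c = p.charAt(pos++);
        if (c == '\\') {
            if (pos >= end) throw new IllegalArgumentException("dangling escape");
            c = p.charAt(pos++);
        }
        return c;
    }

    private BitSet run() {
        boolean invert = pos < end && p.charAt(pos) == '^';
        if (invert) pos++;
        int firstItem = pos;
        BitSet set = new BitSet(0x10000);
        while (pos < end) {
            if (p.charAt(pos) == '-') {
                // An unescaped '-' is a literal only right after the opening
                // '[' or '[^', or right before the closing ']'.
                if (pos != firstItem && pos != end - 1) {
                    throw new IllegalArgumentException("misplaced '-'");
                }
                set.set('-');
                pos++;
                continue;
            }
            char lo = readChar();
            char hi = lo;
            // A '-' followed by another character before ']' forms a range.
            if (pos < end - 1 && p.charAt(pos) == '-') {
                pos++;
                hi = readChar();
                if (lo > hi) throw new IllegalArgumentException("range out of order");
            }
            set.set(lo, hi + 1); // inclusive range lo..hi
        }
        if (invert) set.flip(0, 0x10000);
        return set;
    }

    static BitSet parse(String pattern) {
        return new FlatPatternSketch(pattern).run();
    }

    public static void main(String[] args) {
        // All three spellings yield the same three-character set.
        System.out.println(parse("[a\\-b]").equals(parse("[-ab]")));
        System.out.println(parse("[-ab]").equals(parse("[ab-]")));
    }
}
```

In this sketch a reversed range such as "[z-a]" is rejected with an IllegalArgumentException, mirroring the syntax-error rule stated above.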
Sets may be intersected using the '&' operator, or the asymmetric set difference may be taken using the '-' operator. For example, "[[:L:]&[\u0000-\u0FFF]]" indicates the set of all Unicode letters with values less than 4096. Operators ('&' and '-') have equal precedence and bind left-to-right. Thus "[[:L:]-[a-z]-[\u0100-\u01FF]]" is equivalent to "[[[:L:]-[a-z]]-[\u0100-\u01FF]]". This only really matters for difference; intersection is commutative.
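To see why left-to-right binding matters for difference but not for intersection, the following sketch models sets as java.util.BitSet values over U+0000 to U+FFFF. The class SetOpsSketch and its helpers are hypothetical, and [:L:] is approximated with Character.isLetter(), which is an assumption about how the category would map onto the Java platform.

```java
import java.util.BitSet;

/** Hypothetical sketch of '&' and '-' on sets over U+0000..U+FFFF. */
final class SetOpsSketch {
    /** Returns the set of values in [first, last], inclusive. */
    static BitSet range(int first, int last) {
        BitSet s = new BitSet(0x10000);
        s.set(first, last + 1);
        return s;
    }

    /** a & b, leaving both arguments unmodified. */
    static BitSet intersect(BitSet a, BitSet b) {
        BitSet r = (BitSet) a.clone();
        r.and(b);
        return r;
    }

    /** a - b (asymmetric difference), leaving both arguments unmodified. */
    static BitSet minus(BitSet a, BitSet b) {
        BitSet r = (BitSet) a.clone();
        r.andNot(b);
        return r;
    }

    /** Stand-in for [:L:]: every value that Character.isLetter() accepts. */
    static BitSet letters() {
        BitSet s = new BitSet(0x10000);
        for (int c = 0; c <= 0xFFFF; c++) {
            if (Character.isLetter((char) c)) s.set(c);
        }
        return s;
    }

    public static void main(String[] args) {
        BitSet a = range('a', 'z');
        BitSet b = range('a', 'm');
        BitSet c = range('a', 'c');
        // Left-to-right: (a - b) - c leaves n..z.
        System.out.println(minus(minus(a, b), c).cardinality()); // 13
        // Grouping the other way: a - (b - c) leaves a..c and n..z.
        System.out.println(minus(a, minus(b, c)).cardinality()); // 16
        // Letters below 4096, in the spirit of [[:L:]&[U+0000-U+0FFF]].
        System.out.println(intersect(letters(), range(0, 0x0FFF)).get('a')); // true
    }
}
```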
[a]
    The set containing 'a'
[a-z]
    The set containing 'a' through 'z' and all letters in between, in Unicode order
[^a-z]
    The set containing all characters but 'a' through 'z', that is, U+0000 through 'a'-1 and 'z'+1 through U+FFFF
[[pat1][pat2]]
    The union of sets specified by pat1 and pat2
[[pat1]&[pat2]]
    The intersection of sets specified by pat1 and pat2
[[pat1]-[pat2]]
    The asymmetric difference of sets specified by pat1 and pat2
[:Lu:]
    The set of characters belonging to the given Unicode category, as defined by Character.getType(); in this case, Unicode uppercase letters
[:L:]
    The set of characters belonging to all Unicode categories starting with 'L', that is, [[:Lu:][:Ll:][:Lt:][:Lm:][:Lo:]].

Character categories. Character categories are specified using the POSIX-like syntax '[:Lu:]'. The complement of a category is specified by inserting '^' after the opening '[:'. The following category names are recognized. Actual determination of category data uses Unicode::getType(), so it reflects the underlying data used by Unicode.

Normative
    Mn = Mark, Non-Spacing
    Mc = Mark, Spacing Combining
    Me = Mark, Enclosing
    Nd = Number, Decimal Digit
    Nl = Number, Letter
    No = Number, Other
    Zs = Separator, Space
    Zl = Separator, Line
    Zp = Separator, Paragraph
    Cc = Other, Control
    Cf = Other, Format
    Cs = Other, Surrogate
    Co = Other, Private Use
    Cn = Other, Not Assigned
Informative
    Lu = Letter, Uppercase
    Ll = Letter, Lowercase
    Lt = Letter, Titlecase
    Lm = Letter, Modifier
    Lo = Letter, Other
    Pc = Punctuation, Connector
    Pd = Punctuation, Dash
    Ps = Punctuation, Open
    Pe = Punctuation, Close
    Pi = Punctuation, Initial quote
    Pf = Punctuation, Final quote
    Po = Punctuation, Other
    Sm = Symbol, Math
    Sc = Symbol, Currency
    Sk = Symbol, Modifier
    So = Symbol, Other
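As a rough illustration of how a category pattern could be resolved, the sketch below builds category sets from the Java platform's Character.getType(). The class CategorySketch and its methods are hypothetical, and the results reflect whatever Unicode data the running Java platform carries, which may differ from the data behind Unicode::getType().

```java
import java.util.BitSet;

/** Hypothetical sketch: category sets over U+0000..U+FFFF via Character.getType(). */
final class CategorySketch {
    /** All values whose Character.getType() equals the given category constant. */
    static BitSet category(int type) {
        BitSet s = new BitSet(0x10000);
        for (int c = 0; c <= 0xFFFF; c++) {
            if (Character.getType((char) c) == type) s.set(c);
        }
        return s;
    }

    /** Union of the five letter categories, mirroring [[:Lu:][:Ll:][:Lt:][:Lm:][:Lo:]]. */
    static BitSet allLetters() {
        int[] types = {
            Character.UPPERCASE_LETTER,  // Lu
            Character.LOWERCASE_LETTER,  // Ll
            Character.TITLECASE_LETTER,  // Lt
            Character.MODIFIER_LETTER,   // Lm
            Character.OTHER_LETTER       // Lo
        };
        BitSet s = new BitSet(0x10000);
        for (int t : types) s.or(category(t));
        return s;
    }

    /** Complement within U+0000..U+FFFF, as a '^' after '[:' would request. */
    static BitSet complement(BitSet s) {
        BitSet r = (BitSet) s.clone();
        r.flip(0, 0x10000);
        return r;
    }

    public static void main(String[] args) {
        BitSet lu = category(Character.UPPERCASE_LETTER); // in the spirit of [:Lu:]
        System.out.println(lu.get('A'));             // true
        System.out.println(complement(lu).get('A')); // false
        System.out.println(allLetters().get('a'));   // true
    }
}
```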
IllegalArgumentException if the pattern contains a syntax error.
IllegalArgumentException if the pattern contains a syntax error.
If true, all spaces in the pattern are ignored, except those preceded by '\\'. Spaces are those characters for which Character.isSpaceChar() is true.
IllegalArgumentException if the given category is invalid.
Character.getType().
IllegalArgumentException if the pattern contains a syntax error.
If true, all spaces in the pattern are ignored. Spaces are those characters for which Character.isSpaceChar() is true. Characters preceded by '\\' are escaped, losing any special meaning they otherwise have. Spaces may be included by escaping them.
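The space-skipping rule can be sketched as a preprocessing pass over the pattern string. The class SpaceSketch and the method stripSpaces are hypothetical names chosen for illustration, and whether the actual parser works in a separate pass like this is an assumption.

```java
/** Hypothetical sketch: drops unescaped spaces from a pattern string. */
final class SpaceSketch {
    static String stripSpaces(String pattern) {
        StringBuilder out = new StringBuilder(pattern.length());
        for (int i = 0; i < pattern.length(); i++) {
            char c = pattern.charAt(i);
            if (c == '\\' && i + 1 < pattern.length()) {
                // Keep the backslash and the escaped character, even if it is a space.
                out.append(c).append(pattern.charAt(++i));
            } else if (!Character.isSpaceChar(c)) {
                // Spaces are the characters for which Character.isSpaceChar() is true.
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The escaped space survives; the unescaped ones are dropped.
        System.out.println(stripSpaces("[ a - z \\  ]"));
    }
}
```

Note that Character.isSpaceChar() is false for a tab, so under this rule a tab in the pattern would not be skipped.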
IllegalArgumentException if the pattern contains a syntax error.
0 <= n <= 65536.
If last < first then an empty range is added, leaving the set unchanged.
If last < first then an empty range is removed, leaving the set unchanged.
this = new CharSet("[\u0000-\uFFFF]").removeAll(this).
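The range and complement behavior described in these notes can be sketched as follows. The class RangeSetSketch is a hypothetical stand-in backed by java.util.BitSet, not this class's implementation, and it assumes that a range is empty when its last character precedes its first.

```java
import java.util.BitSet;

/** Hypothetical sketch of add/remove/complement over U+0000..U+FFFF. */
final class RangeSetSketch {
    private static final int LIMIT = 0x10000; // one past U+FFFF
    private final BitSet bits = new BitSet(LIMIT);

    /** Adds [first, last]; an empty range (last < first) changes nothing. */
    RangeSetSketch add(char first, char last) {
        if (first <= last) bits.set(first, last + 1);
        return this;
    }

    /** Removes [first, last]; an empty range (last < first) changes nothing. */
    RangeSetSketch remove(char first, char last) {
        if (first <= last) bits.clear(first, last + 1);
        return this;
    }

    /** Removes every element of other from this set. */
    RangeSetSketch removeAll(RangeSetSketch other) {
        bits.andNot(other.bits);
        return this;
    }

    /** Replaces this set with its complement within U+0000..U+FFFF. */
    RangeSetSketch complement() {
        bits.flip(0, LIMIT);
        return this;
    }

    boolean contains(char c) { return bits.get(c); }

    /** Number of elements n, where 0 <= n <= 65536. */
    int size() { return bits.cardinality(); }

    boolean sameAs(RangeSetSketch other) { return bits.equals(other.bits); }

    public static void main(String[] args) {
        RangeSetSketch s = new RangeSetSketch().add('a', 'z');
        // The full set minus s, built before s is complemented in place.
        RangeSetSketch viaRemoveAll =
                new RangeSetSketch().add((char) 0x0000, (char) 0xFFFF).removeAll(s);
        System.out.println(s.complement().sameAs(viaRemoveAll)); // true
    }
}
```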