From fa095a4504cbe668e4244547e2c141597bea4ecf Mon Sep 17 00:00:00 2001 From: Andreas Rottmann Date: Mon, 14 Sep 2009 12:32:44 +0200 Subject: Imported Upstream version 0.9.1 --- doc/libunistring_8.html | 2071 +++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 2071 insertions(+) create mode 100644 doc/libunistring_8.html (limited to 'doc/libunistring_8.html') diff --git a/doc/libunistring_8.html b/doc/libunistring_8.html new file mode 100644 index 00000000..def5e04a --- /dev/null +++ b/doc/libunistring_8.html @@ -0,0 +1,2071 @@ + + + + + +GNU libunistring: 8. Unicode character classification and properties <unictype.h> + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ +
+ + +

8. Unicode character classification and properties <unictype.h>

+ +

This include file declares functions that classify Unicode characters +and that test whether Unicode characters have specific properties. +

+

The classification assigns a “general category” to every Unicode +character. This is similar to the classification provided by ISO C in +<wctype.h>. +

+

Properties are the data that guides various text processing algorithms +in the presence of specific Unicode characters. +

+ +
+ + +

8.1 General category

+ +

Every Unicode character or code point has a general category assigned +to it. This classification is important for most algorithms that work on +Unicode text. +

+

The GNU libunistring library provides two kinds of API for working with +general categories. The object oriented API uses a variable to denote +every predefined general category value or combinations thereof. The +low-level API uses a bit mask instead. The advantage of the object oriented +API is that if only a few predefined general category values are used, +the data tables are relatively small. When you combine general category +values (using uc_general_category_or, uc_general_category_and, +or uc_general_category_and_not), or when you use the low level +bit masks, a big table is used that holds the complete general category +information for all Unicode characters. +

+ +
+ + +

8.1.1 The object oriented API for general category

+ +
+
Type: uc_general_category_t + +
+

This data type denotes a general category value. It is an immediate type that +can be copied by simple assignment, without involving memory allocation. It is +not an array type. +

+ +

The following are the predefined general category values. Additional general +categories may be added in the future. +

+
+
Constant: uc_general_category_t UC_CATEGORY_L + +
+
Constant: uc_general_category_t UC_CATEGORY_Lu + +
+
Constant: uc_general_category_t UC_CATEGORY_Ll + +
+
Constant: uc_general_category_t UC_CATEGORY_Lt + +
+
Constant: uc_general_category_t UC_CATEGORY_Lm + +
+
Constant: uc_general_category_t UC_CATEGORY_Lo + +
+
Constant: uc_general_category_t UC_CATEGORY_M + +
+
Constant: uc_general_category_t UC_CATEGORY_Mn + +
+
Constant: uc_general_category_t UC_CATEGORY_Mc + +
+
Constant: uc_general_category_t UC_CATEGORY_Me + +
+
Constant: uc_general_category_t UC_CATEGORY_N + +
+
Constant: uc_general_category_t UC_CATEGORY_Nd + +
+
Constant: uc_general_category_t UC_CATEGORY_Nl + +
+
Constant: uc_general_category_t UC_CATEGORY_No + +
+
Constant: uc_general_category_t UC_CATEGORY_P + +
+
Constant: uc_general_category_t UC_CATEGORY_Pc + +
+
Constant: uc_general_category_t UC_CATEGORY_Pd + +
+
Constant: uc_general_category_t UC_CATEGORY_Ps + +
+
Constant: uc_general_category_t UC_CATEGORY_Pe + +
+
Constant: uc_general_category_t UC_CATEGORY_Pi + +
+
Constant: uc_general_category_t UC_CATEGORY_Pf + +
+
Constant: uc_general_category_t UC_CATEGORY_Po + +
+
Constant: uc_general_category_t UC_CATEGORY_S + +
+
Constant: uc_general_category_t UC_CATEGORY_Sm + +
+
Constant: uc_general_category_t UC_CATEGORY_Sc + +
+
Constant: uc_general_category_t UC_CATEGORY_Sk + +
+
Constant: uc_general_category_t UC_CATEGORY_So + +
+
Constant: uc_general_category_t UC_CATEGORY_Z + +
+
Constant: uc_general_category_t UC_CATEGORY_Zs + +
+
Constant: uc_general_category_t UC_CATEGORY_Zl + +
+
Constant: uc_general_category_t UC_CATEGORY_Zp + +
+
Constant: uc_general_category_t UC_CATEGORY_C + +
+
Constant: uc_general_category_t UC_CATEGORY_Cc + +
+
Constant: uc_general_category_t UC_CATEGORY_Cf + +
+
Constant: uc_general_category_t UC_CATEGORY_Cs + +
+
Constant: uc_general_category_t UC_CATEGORY_Co + +
+
Constant: uc_general_category_t UC_CATEGORY_Cn + +
+
+ +

The following are alias names for predefined general category values. +

+
+
Macro: uc_general_category_t UC_LETTER + +
+

This is another name for UC_CATEGORY_L. +

+ +
+
Macro: uc_general_category_t UC_UPPERCASE_LETTER + +
+

This is another name for UC_CATEGORY_Lu. +

+ +
+
Macro: uc_general_category_t UC_LOWERCASE_LETTER + +
+

This is another name for UC_CATEGORY_Ll. +

+ +
+
Macro: uc_general_category_t UC_TITLECASE_LETTER + +
+

This is another name for UC_CATEGORY_Lt. +

+ +
+
Macro: uc_general_category_t UC_MODIFIER_LETTER + +
+

This is another name for UC_CATEGORY_Lm. +

+ +
+
Macro: uc_general_category_t UC_OTHER_LETTER + +
+

This is another name for UC_CATEGORY_Lo. +

+ +
+
Macro: uc_general_category_t UC_MARK + +
+

This is another name for UC_CATEGORY_M. +

+ +
+
Macro: uc_general_category_t UC_NON_SPACING_MARK + +
+

This is another name for UC_CATEGORY_Mn. +

+ +
+
Macro: uc_general_category_t UC_COMBINING_SPACING_MARK + +
+

This is another name for UC_CATEGORY_Mc. +

+ +
+
Macro: uc_general_category_t UC_ENCLOSING_MARK + +
+

This is another name for UC_CATEGORY_Me. +

+ +
+
Macro: uc_general_category_t UC_NUMBER + +
+

This is another name for UC_CATEGORY_N. +

+ +
+
Macro: uc_general_category_t UC_DECIMAL_DIGIT_NUMBER + +
+

This is another name for UC_CATEGORY_Nd. +

+ +
+
Macro: uc_general_category_t UC_LETTER_NUMBER + +
+

This is another name for UC_CATEGORY_Nl. +

+ +
+
Macro: uc_general_category_t UC_OTHER_NUMBER + +
+

This is another name for UC_CATEGORY_No. +

+ +
+
Macro: uc_general_category_t UC_PUNCTUATION + +
+

This is another name for UC_CATEGORY_P. +

+ +
+
Macro: uc_general_category_t UC_CONNECTOR_PUNCTUATION + +
+

This is another name for UC_CATEGORY_Pc. +

+ +
+
Macro: uc_general_category_t UC_DASH_PUNCTUATION + +
+

This is another name for UC_CATEGORY_Pd. +

+ +
+
Macro: uc_general_category_t UC_OPEN_PUNCTUATION + +
+

This is another name for UC_CATEGORY_Ps (“start punctuation”). +

+ +
+
Macro: uc_general_category_t UC_CLOSE_PUNCTUATION + +
+

This is another name for UC_CATEGORY_Pe (“end punctuation”). +

+ +
+
Macro: uc_general_category_t UC_INITIAL_QUOTE_PUNCTUATION + +
+

This is another name for UC_CATEGORY_Pi. +

+ +
+
Macro: uc_general_category_t UC_FINAL_QUOTE_PUNCTUATION + +
+

This is another name for UC_CATEGORY_Pf. +

+ +
+
Macro: uc_general_category_t UC_OTHER_PUNCTUATION + +
+

This is another name for UC_CATEGORY_Po. +

+ +
+
Macro: uc_general_category_t UC_SYMBOL + +
+

This is another name for UC_CATEGORY_S. +

+ +
+
Macro: uc_general_category_t UC_MATH_SYMBOL + +
+

This is another name for UC_CATEGORY_Sm. +

+ +
+
Macro: uc_general_category_t UC_CURRENCY_SYMBOL + +
+

This is another name for UC_CATEGORY_Sc. +

+ +
+
Macro: uc_general_category_t UC_MODIFIER_SYMBOL + +
+

This is another name for UC_CATEGORY_Sk. +

+ +
+
Macro: uc_general_category_t UC_OTHER_SYMBOL + +
+

This is another name for UC_CATEGORY_So. +

+ +
+
Macro: uc_general_category_t UC_SEPARATOR + +
+

This is another name for UC_CATEGORY_Z. +

+ +
+
Macro: uc_general_category_t UC_SPACE_SEPARATOR + +
+

This is another name for UC_CATEGORY_Zs. +

+ +
+
Macro: uc_general_category_t UC_LINE_SEPARATOR + +
+

This is another name for UC_CATEGORY_Zl. +

+ +
+
Macro: uc_general_category_t UC_PARAGRAPH_SEPARATOR + +
+

This is another name for UC_CATEGORY_Zp. +

+ +
+
Macro: uc_general_category_t UC_OTHER + +
+

This is another name for UC_CATEGORY_C. +

+ +
+
Macro: uc_general_category_t UC_CONTROL + +
+

This is another name for UC_CATEGORY_Cc. +

+ +
+
Macro: uc_general_category_t UC_FORMAT + +
+

This is another name for UC_CATEGORY_Cf. +

+ +
+
Macro: uc_general_category_t UC_SURROGATE + +
+

This is another name for UC_CATEGORY_Cs. All code points in this +category are invalid characters. +

+ +
+
Macro: uc_general_category_t UC_PRIVATE_USE + +
+

This is another name for UC_CATEGORY_Co. +

+ +
+
Macro: uc_general_category_t UC_UNASSIGNED + +
+

This is another name for UC_CATEGORY_Cn. Some code points in this +category are invalid characters. +

+ +

The following functions combine general categories, like in a boolean algebra, +except that there is no ‘not’ operation. +

+
+
Function: uc_general_category_t uc_general_category_or (uc_general_category_t category1, uc_general_category_t category2) + +
+

Returns the union of two general categories. +This corresponds to the union of the two sets of characters. +

+ +
+
Function: uc_general_category_t uc_general_category_and (uc_general_category_t category1, uc_general_category_t category2) + +
+

Returns the intersection of two general categories as bit masks. +This does not correspond to the intersection of the two sets of +characters. +

+ +
+
Function: uc_general_category_t uc_general_category_and_not (uc_general_category_t category1, uc_general_category_t category2) + +
+

Returns the intersection of a general category with the complement of a +second general category, as bit masks. +This does not correspond to the intersection with complement, when +viewing the categories as sets of characters. +

+ +

The following functions associate general categories with their name. +

+
+
Function: const char * uc_general_category_name (uc_general_category_t category) + +
+

Returns the name of a general category. +Returns NULL if the general category corresponds to a bit mask that does not +have a name. +

+ +
+
Function: uc_general_category_t uc_general_category_byname (const char *category_name) + +
+

Returns the general category given by name, e.g. "Lu". +

+ +

The following functions view general categories as sets of Unicode characters. +

+
+
Function: uc_general_category_t uc_general_category (ucs4_t uc) + +
+

Returns the general category of a Unicode character. +

+

This function uses a big table. +

+ +
+
Function: bool uc_is_general_category (ucs4_t uc, uc_general_category_t category) + +
+

Tests whether a Unicode character belongs to a given category. +The category argument can be a predefined general category or the +combination of several predefined general categories. +

+ +
+ + +

8.1.2 The bit mask API for general category

+ +

The following are the predefined general category values as bit masks. +Additional general categories may be added in the future. +

+
+
Macro: uint32_t UC_CATEGORY_MASK_L + +
+
Macro: uint32_t UC_CATEGORY_MASK_Lu + +
+
Macro: uint32_t UC_CATEGORY_MASK_Ll + +
+
Macro: uint32_t UC_CATEGORY_MASK_Lt + +
+
Macro: uint32_t UC_CATEGORY_MASK_Lm + +
+
Macro: uint32_t UC_CATEGORY_MASK_Lo + +
+
Macro: uint32_t UC_CATEGORY_MASK_M + +
+
Macro: uint32_t UC_CATEGORY_MASK_Mn + +
+
Macro: uint32_t UC_CATEGORY_MASK_Mc + +
+
Macro: uint32_t UC_CATEGORY_MASK_Me + +
+
Macro: uint32_t UC_CATEGORY_MASK_N + +
+
Macro: uint32_t UC_CATEGORY_MASK_Nd + +
+
Macro: uint32_t UC_CATEGORY_MASK_Nl + +
+
Macro: uint32_t UC_CATEGORY_MASK_No + +
+
Macro: uint32_t UC_CATEGORY_MASK_P + +
+
Macro: uint32_t UC_CATEGORY_MASK_Pc + +
+
Macro: uint32_t UC_CATEGORY_MASK_Pd + +
+
Macro: uint32_t UC_CATEGORY_MASK_Ps + +
+
Macro: uint32_t UC_CATEGORY_MASK_Pe + +
+
Macro: uint32_t UC_CATEGORY_MASK_Pi + +
+
Macro: uint32_t UC_CATEGORY_MASK_Pf + +
+
Macro: uint32_t UC_CATEGORY_MASK_Po + +
+
Macro: uint32_t UC_CATEGORY_MASK_S + +
+
Macro: uint32_t UC_CATEGORY_MASK_Sm + +
+
Macro: uint32_t UC_CATEGORY_MASK_Sc + +
+
Macro: uint32_t UC_CATEGORY_MASK_Sk + +
+
Macro: uint32_t UC_CATEGORY_MASK_So + +
+
Macro: uint32_t UC_CATEGORY_MASK_Z + +
+
Macro: uint32_t UC_CATEGORY_MASK_Zs + +
+
Macro: uint32_t UC_CATEGORY_MASK_Zl + +
+
Macro: uint32_t UC_CATEGORY_MASK_Zp + +
+
Macro: uint32_t UC_CATEGORY_MASK_C + +
+
Macro: uint32_t UC_CATEGORY_MASK_Cc + +
+
Macro: uint32_t UC_CATEGORY_MASK_Cf + +
+
Macro: uint32_t UC_CATEGORY_MASK_Cs + +
+
Macro: uint32_t UC_CATEGORY_MASK_Co + +
+
Macro: uint32_t UC_CATEGORY_MASK_Cn + +
+
+ +

The following function views general categories as sets of Unicode characters. +

+
+
Function: bool uc_is_general_category_withtable (ucs4_t uc, uint32_t bitmask) + +
+

Tests whether a Unicode character belongs to a given category. +The bitmask argument can be a predefined general category bitmask or the +combination of several predefined general category bitmasks. +

+

This function uses a big table comprising all general categories. +

+ +
+ + +

8.2 Canonical combining class

+ +

Every Unicode character or code point has a canonical combining class +assigned to it. +

+

What is the meaning of the canonical combining class? Essentially, it +indicates the priority with which a combining character is attached to its +base character. The characters for which the canonical combining class is 0 +are the base characters, and the characters for which it is greater than 0 are +the combining characters. Combining characters are rendered +near, attached to, or around their base character, and combining characters with small +combining classes are attached "first" or "closer" to the base character. +

+

The canonical combining class of a character is a number in the range +0..255. The possible values are described in the Unicode Character Database +http://www.unicode.org/Public/UNIDATA/UCD.html. The list here is +not definitive; more values can be added in future versions. +

+
+
Constant: int UC_CCC_NR + +
+

The canonical combining class value for “Not Reordered” characters. +The value is 0. +

+ +
+
Constant: int UC_CCC_OV + +
+

The canonical combining class value for “Overlay” characters. +

+ +
+
Constant: int UC_CCC_NK + +
+

The canonical combining class value for “Nukta” characters. +

+ +
+
Constant: int UC_CCC_KV + +
+

The canonical combining class value for “Kana Voicing” characters. +

+ +
+
Constant: int UC_CCC_VR + +
+

The canonical combining class value for “Virama” characters. +

+ +
+
Constant: int UC_CCC_ATBL + +
+

The canonical combining class value for “Attached Below Left” characters. +

+ +
+
Constant: int UC_CCC_ATB + +
+

The canonical combining class value for “Attached Below” characters. +

+ +
+
Constant: int UC_CCC_ATAR + +
+

The canonical combining class value for “Attached Above Right” characters. +

+ +
+
Constant: int UC_CCC_BL + +
+

The canonical combining class value for “Below Left” characters. +

+ +
+
Constant: int UC_CCC_B + +
+

The canonical combining class value for “Below” characters. +

+ +
+
Constant: int UC_CCC_BR + +
+

The canonical combining class value for “Below Right” characters. +

+ +
+
Constant: int UC_CCC_L + +
+

The canonical combining class value for “Left” characters. +

+ +
+
Constant: int UC_CCC_R + +
+

The canonical combining class value for “Right” characters. +

+ +
+
Constant: int UC_CCC_AL + +
+

The canonical combining class value for “Above Left” characters. +

+ +
+
Constant: int UC_CCC_A + +
+

The canonical combining class value for “Above” characters. +

+ +
+
Constant: int UC_CCC_AR + +
+

The canonical combining class value for “Above Right” characters. +

+ +
+
Constant: int UC_CCC_DB + +
+

The canonical combining class value for “Double Below” characters. +

+ +
+
Constant: int UC_CCC_DA + +
+

The canonical combining class value for “Double Above” characters. +

+ +
+
Constant: int UC_CCC_IS + +
+

The canonical combining class value for “Iota Subscript” characters. +

+ +

The following function looks up the canonical combining class of a character. +

+
+
Function: int uc_combining_class (ucs4_t uc) + +
+

Returns the canonical combining class of a Unicode character. +

+ +
+ + +

8.3 Bidirectional category

+ +

Every Unicode character or code point has a bidirectional category +assigned to it. +

+

The bidirectional category guides the bidirectional algorithm +(http://www.unicode.org/reports/tr9/). The possible values are +the following. +

+
+
Constant: int UC_BIDI_L + +
+

The bidirectional category for “Left-to-Right” characters. +

+ +
+
Constant: int UC_BIDI_LRE + +
+

The bidirectional category for “Left-to-Right Embedding” characters. +

+ +
+
Constant: int UC_BIDI_LRO + +
+

The bidirectional category for “Left-to-Right Override” characters. +

+ +
+
Constant: int UC_BIDI_R + +
+

The bidirectional category for “Right-to-Left” characters. +

+ +
+
Constant: int UC_BIDI_AL + +
+

The bidirectional category for “Right-to-Left Arabic” characters. +

+ +
+
Constant: int UC_BIDI_RLE + +
+

The bidirectional category for “Right-to-Left Embedding” characters. +

+ +
+
Constant: int UC_BIDI_RLO + +
+

The bidirectional category for “Right-to-Left Override” characters. +

+ +
+
Constant: int UC_BIDI_PDF + +
+

The bidirectional category for “Pop Directional Format” characters. +

+ +
+
Constant: int UC_BIDI_EN + +
+

The bidirectional category for “European Number” characters. +

+ +
+
Constant: int UC_BIDI_ES + +
+

The bidirectional category for “European Number Separator” characters. +

+ +
+
Constant: int UC_BIDI_ET + +
+

The bidirectional category for “European Number Terminator” characters. +

+ +
+
Constant: int UC_BIDI_AN + +
+

The bidirectional category for “Arabic Number” characters. +

+ +
+
Constant: int UC_BIDI_CS + +
+

The bidirectional category for “Common Number Separator” characters. +

+ +
+
Constant: int UC_BIDI_NSM + +
+

The bidirectional category for “Non-Spacing Mark” characters. +

+ +
+
Constant: int UC_BIDI_BN + +
+

The bidirectional category for “Boundary Neutral” characters. +

+ +
+
Constant: int UC_BIDI_B + +
+

The bidirectional category for “Paragraph Separator” characters. +

+ +
+
Constant: int UC_BIDI_S + +
+

The bidirectional category for “Segment Separator” characters. +

+ +
+
Constant: int UC_BIDI_WS + +
+

The bidirectional category for “Whitespace” characters. +

+ +
+
Constant: int UC_BIDI_ON + +
+

The bidirectional category for “Other Neutral” characters. +

+ +

The following functions implement the association between a bidirectional +category and its name. +

+
+
Function: const char * uc_bidi_category_name (int category) + +
+

Returns the name of a bidirectional category. +

+ +
+
Function: int uc_bidi_category_byname (const char *category_name) + +
+

Returns the bidirectional category given by name, e.g. "LRE". +

+ +

The following functions view bidirectional categories as sets of Unicode +characters. +

+
+
Function: int uc_bidi_category (ucs4_t uc) + +
+

Returns the bidirectional category of a Unicode character. +

+ +
+
Function: bool uc_is_bidi_category (ucs4_t uc, int category) + +
+

Tests whether a Unicode character belongs to a given bidirectional category. +

+ +
+ + +

8.4 Decimal digit value

+ +

Decimal digits (like the digits from ‘0’ to ‘9’) exist in many +scripts. The following function converts a decimal digit character to its +numerical value. +

+
+
Function: int uc_decimal_value (ucs4_t uc) + +
+

Returns the decimal digit value of a Unicode character. +The return value is an integer in the range 0..9, or -1 for characters that +do not represent a decimal digit. +

+ +
+ + +

8.5 Digit value

+ +

Digit characters are like decimal digit characters, possibly in special forms, +such as superscript, subscript, or circled. The following function converts a +digit character to its numerical value. +

+
+
Function: int uc_digit_value (ucs4_t uc) + +
+

Returns the digit value of a Unicode character. +The return value is an integer in the range 0..9, or -1 for characters that +do not represent a digit. +

+ +
+ + +

8.6 Numeric value

+ +

There are also characters that represent numbers without a digit system, like +the Roman numerals, and fractional numbers, like 1/4 or 3/4. +

+

The following type represents the numeric value of a Unicode character. +

+
Type: uc_fraction_t + +
+

This is a structure type with the following fields: +

 
int numerator;
+int denominator;
+
+

An integer n is represented by numerator = n, +denominator = 1. +

+ +

The following function converts a number character to its numerical value. +

+
+
Function: uc_fraction_t uc_numeric_value (ucs4_t uc) + +
+

Returns the numeric value of a Unicode character. +The return value is a fraction, or the pseudo-fraction { 0, 0 } for +characters that do not represent a number. +

+ +
+ + +

8.7 Mirrored character

+ +

Character mirroring is used to associate the closing parenthesis character +with the opening parenthesis character, the closing brace character with the +opening brace character, and so on. +

+

The following function looks up the mirrored character of a Unicode character. +

+
+
Function: bool uc_mirror_char (ucs4_t uc, ucs4_t *puc) + +
+

Stores the mirrored character of a Unicode character uc in +*puc and returns true, if it exists. Otherwise it +stores uc unmodified in *puc and returns false. +

+ +
+ + +

8.8 Properties

+ +

This section defines boolean properties of Unicode characters. This +means that a character either has the given property or does not have it. +In other words, the property can be viewed as a subset of the set of +Unicode characters. +

+

The GNU libunistring library provides two kinds of API for working with +properties. The object oriented API uses a type uc_property_t +to designate a property. In the function-based API, which is a bit more +low level, a property is merely a function. +

+ +
+ + +

8.8.1 Properties as objects – the object oriented API

+ +

The following type designates a property on Unicode characters. +

+
+
Type: uc_property_t + +
+

This data type denotes a boolean property on Unicode characters. It is an +immediate type that can be copied by simple assignment, without involving +memory allocation. It is not an array type. +

+ +

Many Unicode properties are predefined. +

+

The following are general properties. +

+
+
Constant: uc_property_t UC_PROPERTY_WHITE_SPACE + +
+
Constant: uc_property_t UC_PROPERTY_ALPHABETIC + +
+
Constant: uc_property_t UC_PROPERTY_OTHER_ALPHABETIC + +
+
Constant: uc_property_t UC_PROPERTY_NOT_A_CHARACTER + +
+
Constant: uc_property_t UC_PROPERTY_DEFAULT_IGNORABLE_CODE_POINT + +
+
Constant: uc_property_t UC_PROPERTY_OTHER_DEFAULT_IGNORABLE_CODE_POINT + +
+
Constant: uc_property_t UC_PROPERTY_DEPRECATED + +
+
Constant: uc_property_t UC_PROPERTY_LOGICAL_ORDER_EXCEPTION + +
+
Constant: uc_property_t UC_PROPERTY_VARIATION_SELECTOR + +
+
Constant: uc_property_t UC_PROPERTY_PRIVATE_USE + +
+
Constant: uc_property_t UC_PROPERTY_UNASSIGNED_CODE_VALUE + +
+
+ +

The following properties are related to case folding. +

+
+
Constant: uc_property_t UC_PROPERTY_UPPERCASE + +
+
Constant: uc_property_t UC_PROPERTY_OTHER_UPPERCASE + +
+
Constant: uc_property_t UC_PROPERTY_LOWERCASE + +
+
Constant: uc_property_t UC_PROPERTY_OTHER_LOWERCASE + +
+
Constant: uc_property_t UC_PROPERTY_TITLECASE + +
+
Constant: uc_property_t UC_PROPERTY_SOFT_DOTTED + +
+
+ +

The following properties are related to identifiers. +

+
+
Constant: uc_property_t UC_PROPERTY_ID_START + +
+
Constant: uc_property_t UC_PROPERTY_OTHER_ID_START + +
+
Constant: uc_property_t UC_PROPERTY_ID_CONTINUE + +
+
Constant: uc_property_t UC_PROPERTY_OTHER_ID_CONTINUE + +
+
Constant: uc_property_t UC_PROPERTY_XID_START + +
+
Constant: uc_property_t UC_PROPERTY_XID_CONTINUE + +
+
Constant: uc_property_t UC_PROPERTY_PATTERN_WHITE_SPACE + +
+
Constant: uc_property_t UC_PROPERTY_PATTERN_SYNTAX + +
+
+ +

The following properties have an influence on shaping and rendering. +

+
+
Constant: uc_property_t UC_PROPERTY_JOIN_CONTROL + +
+
Constant: uc_property_t UC_PROPERTY_GRAPHEME_BASE + +
+
Constant: uc_property_t UC_PROPERTY_GRAPHEME_EXTEND + +
+
Constant: uc_property_t UC_PROPERTY_OTHER_GRAPHEME_EXTEND + +
+
Constant: uc_property_t UC_PROPERTY_GRAPHEME_LINK + +
+
+ +

The following properties relate to bidirectional reordering. +

+
+
Constant: uc_property_t UC_PROPERTY_BIDI_CONTROL + +
+
Constant: uc_property_t UC_PROPERTY_BIDI_LEFT_TO_RIGHT + +
+
Constant: uc_property_t UC_PROPERTY_BIDI_HEBREW_RIGHT_TO_LEFT + +
+
Constant: uc_property_t UC_PROPERTY_BIDI_ARABIC_RIGHT_TO_LEFT + +
+
Constant: uc_property_t UC_PROPERTY_BIDI_EUROPEAN_DIGIT + +
+
Constant: uc_property_t UC_PROPERTY_BIDI_EUR_NUM_SEPARATOR + +
+
Constant: uc_property_t UC_PROPERTY_BIDI_EUR_NUM_TERMINATOR + +
+
Constant: uc_property_t UC_PROPERTY_BIDI_ARABIC_DIGIT + +
+
Constant: uc_property_t UC_PROPERTY_BIDI_COMMON_SEPARATOR + +
+
Constant: uc_property_t UC_PROPERTY_BIDI_BLOCK_SEPARATOR + +
+
Constant: uc_property_t UC_PROPERTY_BIDI_SEGMENT_SEPARATOR + +
+
Constant: uc_property_t UC_PROPERTY_BIDI_WHITESPACE + +
+
Constant: uc_property_t UC_PROPERTY_BIDI_NON_SPACING_MARK + +
+
Constant: uc_property_t UC_PROPERTY_BIDI_BOUNDARY_NEUTRAL + +
+
Constant: uc_property_t UC_PROPERTY_BIDI_PDF + +
+
Constant: uc_property_t UC_PROPERTY_BIDI_EMBEDDING_OR_OVERRIDE + +
+
Constant: uc_property_t UC_PROPERTY_BIDI_OTHER_NEUTRAL + +
+
+ +

The following properties deal with number representations. +

+
+
Constant: uc_property_t UC_PROPERTY_HEX_DIGIT + +
+
Constant: uc_property_t UC_PROPERTY_ASCII_HEX_DIGIT + +
+
+ +

The following properties deal with CJK. +

+
+
Constant: uc_property_t UC_PROPERTY_IDEOGRAPHIC + +
+
Constant: uc_property_t UC_PROPERTY_UNIFIED_IDEOGRAPH + +
+
Constant: uc_property_t UC_PROPERTY_RADICAL + +
+
Constant: uc_property_t UC_PROPERTY_IDS_BINARY_OPERATOR + +
+
Constant: uc_property_t UC_PROPERTY_IDS_TRINARY_OPERATOR + +
+
+ +

Other miscellaneous properties are: +

+
+
Constant: uc_property_t UC_PROPERTY_ZERO_WIDTH + +
+
Constant: uc_property_t UC_PROPERTY_SPACE + +
+
Constant: uc_property_t UC_PROPERTY_NON_BREAK + +
+
Constant: uc_property_t UC_PROPERTY_ISO_CONTROL + +
+
Constant: uc_property_t UC_PROPERTY_FORMAT_CONTROL + +
+
Constant: uc_property_t UC_PROPERTY_DASH + +
+
Constant: uc_property_t UC_PROPERTY_HYPHEN + +
+
Constant: uc_property_t UC_PROPERTY_PUNCTUATION + +
+
Constant: uc_property_t UC_PROPERTY_LINE_SEPARATOR + +
+
Constant: uc_property_t UC_PROPERTY_PARAGRAPH_SEPARATOR + +
+
Constant: uc_property_t UC_PROPERTY_QUOTATION_MARK + +
+
Constant: uc_property_t UC_PROPERTY_SENTENCE_TERMINAL + +
+
Constant: uc_property_t UC_PROPERTY_TERMINAL_PUNCTUATION + +
+
Constant: uc_property_t UC_PROPERTY_CURRENCY_SYMBOL + +
+
Constant: uc_property_t UC_PROPERTY_MATH + +
+
Constant: uc_property_t UC_PROPERTY_OTHER_MATH + +
+
Constant: uc_property_t UC_PROPERTY_PAIRED_PUNCTUATION + +
+
Constant: uc_property_t UC_PROPERTY_LEFT_OF_PAIR + +
+
Constant: uc_property_t UC_PROPERTY_COMBINING + +
+
Constant: uc_property_t UC_PROPERTY_COMPOSITE + +
+
Constant: uc_property_t UC_PROPERTY_DECIMAL_DIGIT + +
+
Constant: uc_property_t UC_PROPERTY_NUMERIC + +
+
Constant: uc_property_t UC_PROPERTY_DIACRITIC + +
+
Constant: uc_property_t UC_PROPERTY_EXTENDER + +
+
Constant: uc_property_t UC_PROPERTY_IGNORABLE_CONTROL + +
+
+ +

The following function looks up a property by its name. +

+
+
Function: uc_property_t uc_property_byname (const char *property_name) + +
+

Returns the property given by name, e.g. "White space". If a property +with the given name exists, the result will satisfy the +uc_property_is_valid predicate. Otherwise the result will not satisfy +this predicate and must not be passed to functions that expect a +uc_property_t argument. +

+

This function references a big table of all predefined properties. Its use +can significantly increase the size of your application. +

+ +
+
Function: bool uc_property_is_valid (uc_property_t property) + +
+

Returns true when the given property is valid, or false +otherwise. +

+ +

The following function views a property as a set of Unicode characters. +

+
+
Function: bool uc_is_property (ucs4_t uc, uc_property_t property) + +
+

Tests whether the Unicode character uc has the given property. +

+ +
+ + +

8.8.2 Properties as functions – the functional API

+ +

The following are general properties. +

+
+
Function: bool uc_is_property_white_space (ucs4_t uc) + +
+
Function: bool uc_is_property_alphabetic (ucs4_t uc) + +
+
Function: bool uc_is_property_other_alphabetic (ucs4_t uc) + +
+
Function: bool uc_is_property_not_a_character (ucs4_t uc) + +
+
Function: bool uc_is_property_default_ignorable_code_point (ucs4_t uc) + +
+
Function: bool uc_is_property_other_default_ignorable_code_point (ucs4_t uc) + +
+
Function: bool uc_is_property_deprecated (ucs4_t uc) + +
+
Function: bool uc_is_property_logical_order_exception (ucs4_t uc) + +
+
Function: bool uc_is_property_variation_selector (ucs4_t uc) + +
+
Function: bool uc_is_property_private_use (ucs4_t uc) + +
+
Function: bool uc_is_property_unassigned_code_value (ucs4_t uc) + +
+
+ +

The following properties are related to case folding. +

+
+
Function: bool uc_is_property_uppercase (ucs4_t uc) + +
+
Function: bool uc_is_property_other_uppercase (ucs4_t uc) + +
+
Function: bool uc_is_property_lowercase (ucs4_t uc) + +
+
Function: bool uc_is_property_other_lowercase (ucs4_t uc) + +
+
Function: bool uc_is_property_titlecase (ucs4_t uc) + +
+
Function: bool uc_is_property_soft_dotted (ucs4_t uc) + +
+
+ +

The following properties are related to identifiers. +

+
+
Function: bool uc_is_property_id_start (ucs4_t uc) + +
+
Function: bool uc_is_property_other_id_start (ucs4_t uc) + +
+
Function: bool uc_is_property_id_continue (ucs4_t uc) + +
+
Function: bool uc_is_property_other_id_continue (ucs4_t uc) + +
+
Function: bool uc_is_property_xid_start (ucs4_t uc) + +
+
Function: bool uc_is_property_xid_continue (ucs4_t uc) + +
+
Function: bool uc_is_property_pattern_white_space (ucs4_t uc) + +
+
Function: bool uc_is_property_pattern_syntax (ucs4_t uc) + +
+
+ +

The following properties have an influence on shaping and rendering. +

+
+
Function: bool uc_is_property_join_control (ucs4_t uc) + +
+
Function: bool uc_is_property_grapheme_base (ucs4_t uc) + +
+
Function: bool uc_is_property_grapheme_extend (ucs4_t uc) + +
+
Function: bool uc_is_property_other_grapheme_extend (ucs4_t uc) + +
+
Function: bool uc_is_property_grapheme_link (ucs4_t uc) + +
+
+ +

The following properties relate to bidirectional reordering. +

+
+
Function: bool uc_is_property_bidi_control (ucs4_t uc) + +
+
Function: bool uc_is_property_bidi_left_to_right (ucs4_t uc) + +
+
Function: bool uc_is_property_bidi_hebrew_right_to_left (ucs4_t uc) + +
+
Function: bool uc_is_property_bidi_arabic_right_to_left (ucs4_t uc) + +
+
Function: bool uc_is_property_bidi_european_digit (ucs4_t uc) + +
+
Function: bool uc_is_property_bidi_eur_num_separator (ucs4_t uc) + +
+
Function: bool uc_is_property_bidi_eur_num_terminator (ucs4_t uc) + +
+
Function: bool uc_is_property_bidi_arabic_digit (ucs4_t uc) + +
+
Function: bool uc_is_property_bidi_common_separator (ucs4_t uc) + +
+
Function: bool uc_is_property_bidi_block_separator (ucs4_t uc) + +
+
Function: bool uc_is_property_bidi_segment_separator (ucs4_t uc) + +
+
Function: bool uc_is_property_bidi_whitespace (ucs4_t uc) + +
+
Function: bool uc_is_property_bidi_non_spacing_mark (ucs4_t uc) + +
+
Function: bool uc_is_property_bidi_boundary_neutral (ucs4_t uc) + +
+
Function: bool uc_is_property_bidi_pdf (ucs4_t uc) + +
+
Function: bool uc_is_property_bidi_embedding_or_override (ucs4_t uc) + +
+
Function: bool uc_is_property_bidi_other_neutral (ucs4_t uc) + +
+
+ +

The following properties deal with number representations. +

+
+
Function: bool uc_is_property_hex_digit (ucs4_t uc) + +
+
Function: bool uc_is_property_ascii_hex_digit (ucs4_t uc) + +
+
+ +

The following properties deal with CJK. +

+
+
Function: bool uc_is_property_ideographic (ucs4_t uc) + +
+
Function: bool uc_is_property_unified_ideograph (ucs4_t uc) + +
+
Function: bool uc_is_property_radical (ucs4_t uc) + +
+
Function: bool uc_is_property_ids_binary_operator (ucs4_t uc) + +
+
Function: bool uc_is_property_ids_trinary_operator (ucs4_t uc) + +
+
+ +

Other miscellaneous properties are: +

+
+
Function: bool uc_is_property_zero_width (ucs4_t uc) + +
+
Function: bool uc_is_property_space (ucs4_t uc) + +
+
Function: bool uc_is_property_non_break (ucs4_t uc) + +
+
Function: bool uc_is_property_iso_control (ucs4_t uc) + +
+
Function: bool uc_is_property_format_control (ucs4_t uc) + +
+
Function: bool uc_is_property_dash (ucs4_t uc) + +
+
Function: bool uc_is_property_hyphen (ucs4_t uc) + +
+
Function: bool uc_is_property_punctuation (ucs4_t uc) + +
+
Function: bool uc_is_property_line_separator (ucs4_t uc) + +
+
Function: bool uc_is_property_paragraph_separator (ucs4_t uc) + +
+
Function: bool uc_is_property_quotation_mark (ucs4_t uc) + +
+
Function: bool uc_is_property_sentence_terminal (ucs4_t uc) + +
+
Function: bool uc_is_property_terminal_punctuation (ucs4_t uc) + +
+
Function: bool uc_is_property_currency_symbol (ucs4_t uc) + +
+
Function: bool uc_is_property_math (ucs4_t uc) + +
+
Function: bool uc_is_property_other_math (ucs4_t uc) + +
+
Function: bool uc_is_property_paired_punctuation (ucs4_t uc) + +
+
Function: bool uc_is_property_left_of_pair (ucs4_t uc) + +
+
Function: bool uc_is_property_combining (ucs4_t uc) + +
+
Function: bool uc_is_property_composite (ucs4_t uc) + +
+
Function: bool uc_is_property_decimal_digit (ucs4_t uc) + +
+
Function: bool uc_is_property_numeric (ucs4_t uc) + +
+
Function: bool uc_is_property_diacritic (ucs4_t uc) + +
+
Function: bool uc_is_property_extender (ucs4_t uc) + +
+
Function: bool uc_is_property_ignorable_control (ucs4_t uc) + +
+
+ +
+ + +

8.9 Scripts

+ +

The Unicode characters are subdivided into scripts. +

+

The following type is used to represent a script: +

+
+
Type: uc_script_t + +
+

This data type is a structure type that refers to statically allocated +read-only data. It contains the following fields: +

 
const char *name;
+
+ +

The name field contains the name of the script. +

+ + +

The following functions look up a script. +

+
+
Function: const uc_script_t * uc_script (ucs4_t uc) + +
+

Returns the script of a Unicode character. Returns NULL if uc does not +belong to any script. +

+ +
+
Function: const uc_script_t * uc_script_byname (const char *script_name) + +
+

Returns the script given by its name, e.g. "HAN". Returns NULL if a +script with the given name does not exist. +

+ +

The following function views a script as a set of Unicode characters. +

+
+
Function: bool uc_is_script (ucs4_t uc, const uc_script_t *script) + +
+

Tests whether a Unicode character belongs to a given script. +

+ +

The following gives a global picture of all scripts. +

+
+
Function: void uc_all_scripts (const uc_script_t **scripts, size_t *count) + +
+

Get the list of all scripts. Stores a pointer to an array of all scripts in +*scripts and the length of this array in *count. +

+ +
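As an illustration of how the NULL return value and the name field can be used together, the following is a minimal sketch that checks whether a string is written in a single script. It does not include <unictype.h>: script_info is a hypothetical local stand-in that mirrors the documented name field of uc_script_t, and toy_script is a hypothetical stub lookup that only knows a few Latin and Greek letters. With the library available, one would presumably pass uc_script and use uc_script_t instead.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in mirroring the documented `name` field of uc_script_t. */
typedef struct { const char *name; } script_info;

/* Same shape as uc_script: returns NULL when the character has no script. */
typedef const script_info *(*script_lookup_fn)(uint32_t uc);

/* Returns the name of the only script used in s[0..n-1]. Returns NULL if the
   characters span several scripts or if none of them belongs to a script. */
const char *single_script_name(const uint32_t *s, size_t n,
                               script_lookup_fn lookup)
{
  const char *found = NULL;
  for (size_t i = 0; i < n; i++) {
    const script_info *sc = lookup(s[i]);
    if (sc == NULL)
      continue;                      /* no script: ignore this character */
    if (found == NULL)
      found = sc->name;              /* first script seen */
    else if (strcmp(found, sc->name) != 0)
      return NULL;                   /* a second, different script */
  }
  return found;
}

/* Stub data for the sketch only; not taken from the library's tables. */
static const script_info latin = { "Latin" };
static const script_info greek = { "Greek" };

const script_info *toy_script(uint32_t uc)
{
  if ((uc >= 'A' && uc <= 'Z') || (uc >= 'a' && uc <= 'z'))
    return &latin;
  if (uc >= 0x03B1 && uc <= 0x03C9)  /* Greek small alpha .. small omega */
    return &greek;
  return NULL;
}
```

The names are compared with strcmp rather than by pointer, because the text above does not state that equal scripts are always represented by the same pointer.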
+ + +

8.10 Blocks

+ +

The Unicode characters are subdivided into blocks. A block is an interval of +Unicode code points. +

+

The following type is used to represent a block. +

+
+
Type: uc_block_t + +
+

This data type is a structure type that refers to statically allocated data. +It contains the following fields: +

 
ucs4_t start;
+ucs4_t end;
+const char *name;
+
+ +

The start field is the first Unicode code point in the block. +

+

The end field is the last Unicode code point in the block. +

+

The name field is the name of the block. +

+ + +

The following function looks up a block. +

+
+
Function: const uc_block_t * uc_block (ucs4_t uc) + +
+

Returns the block a character belongs to. +

+ +

The following function views a block as a set of Unicode characters. +

+
+
Function: bool uc_is_block (ucs4_t uc, const uc_block_t *block) + +
+

Tests whether a Unicode character belongs to a given block. +

+ +

The following gives a global picture of all blocks. +

+
+
Function: void uc_all_blocks (const uc_block_t **blocks, size_t *count) + +
+

Get the list of all blocks. Stores a pointer to an array of all blocks in +*blocks and the length of this array in *count. +

+ +
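Because a block is an interval of code points, a membership test comes down to two comparisons against the start and end fields. The following is a minimal sketch of that idea, not the library's implementation. The type block_range and the functions in_block and find_block are hypothetical local names. block_range only mirrors the three documented fields of uc_block_t, and the two table entries are sample data.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in mirroring the documented fields of uc_block_t. */
typedef struct {
  uint32_t start;     /* first code point in the block */
  uint32_t end;       /* last code point in the block (inclusive) */
  const char *name;   /* name of the block */
} block_range;

/* A block is a closed interval, so membership is two comparisons. */
bool in_block(uint32_t uc, const block_range *block)
{
  return block != NULL && uc >= block->start && uc <= block->end;
}

/* Linear search over a table of blocks. Returns NULL when no entry of the
   table contains uc. */
const block_range *find_block(uint32_t uc, const block_range *blocks,
                              size_t count)
{
  for (size_t i = 0; i < count; i++)
    if (in_block(uc, &blocks[i]))
      return &blocks[i];
  return NULL;
}

/* Two Unicode blocks used as sample data for the sketch. */
const block_range sample_blocks[] = {
  { 0x0000, 0x007F, "Basic Latin" },
  { 0x0080, 0x00FF, "Latin-1 Supplement" },
};
```

With this sample table, find_block (0xE9, sample_blocks, 2) yields the "Latin-1 Supplement" entry, while a code point such as 0x100 lies outside both intervals and yields NULL.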
+ + +

8.11 ISO C and Java syntax

+ +

The following properties are taken from language standards. The supported +language standards are ISO C 99 and Java. +

+
+
Function: bool uc_is_c_whitespace (ucs4_t uc) + +
+

Tests whether a Unicode character is considered whitespace in ISO C 99. +

+ +
+
Function: bool uc_is_java_whitespace (ucs4_t uc) + +
+

Tests whether a Unicode character is considered whitespace in Java. +

+ +

The following enumerated values are the possible return values of the functions +uc_c_ident_category and uc_java_ident_category. +

+
+
Constant: int UC_IDENTIFIER_START + +
+

This return value means that the given character is valid as first or +subsequent character in an identifier. +

+ +
+
Constant: int UC_IDENTIFIER_VALID + +
+

This return value means that the given character is valid as subsequent +character only. +

+ +
+
Constant: int UC_IDENTIFIER_INVALID + +
+

This return value means that the given character is not valid in an identifier. +

+ +
+
Constant: int UC_IDENTIFIER_IGNORABLE + +
+

This return value (only for Java) means that the given character is ignorable. +

+ +

The following functions determine whether a given character can be a constituent +of an identifier in the given programming language. +

+ +
+
Function: int uc_c_ident_category (ucs4_t uc) + +
+

Returns the categorization of a Unicode character with respect to the ISO C 99 +identifier syntax. +

+ + +
+
Function: int uc_java_ident_category (ucs4_t uc) + +
+

Returns the categorization of a Unicode character with respect to the Java +identifier syntax. +

+ +
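The four return values suggest a simple way to validate a whole identifier: the first character must be a start character, and every later character must be a start or subsequent character. The following is a minimal sketch under stated assumptions. The enumeration ident_category and its ID_* names are hypothetical local stand-ins for the UC_IDENTIFIER_* constants, and their numeric values are not taken from the library. The function ascii_ident_category is a hypothetical ASCII-only stub standing in for uc_c_ident_category. Accepting ignorable characters after the first position is also an assumption of this sketch.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the UC_IDENTIFIER_* constants. */
enum ident_category { ID_START, ID_VALID, ID_INVALID, ID_IGNORABLE };

/* Same shape as uc_c_ident_category and uc_java_ident_category. */
typedef int (*ident_category_fn)(uint32_t uc);

/* Tests whether s[0..n-1] forms a valid identifier according to `category`. */
bool is_identifier(const uint32_t *s, size_t n, ident_category_fn category)
{
  if (n == 0)
    return false;                    /* the empty string is not an identifier */
  for (size_t i = 0; i < n; i++) {
    int cat = category(s[i]);
    if (i == 0) {
      if (cat != ID_START)
        return false;                /* first character must be a start */
    } else if (cat != ID_START && cat != ID_VALID && cat != ID_IGNORABLE) {
      return false;                  /* invalid subsequent character */
    }
  }
  return true;
}

/* ASCII-only stub categorizer, used so that the sketch is self-contained. */
int ascii_ident_category(uint32_t uc)
{
  if ((uc >= 'A' && uc <= 'Z') || (uc >= 'a' && uc <= 'z') || uc == '_')
    return ID_START;
  if (uc >= '0' && uc <= '9')
    return ID_VALID;
  return ID_INVALID;
}
```

With the stub categorizer, the sequence x, _, 1 is accepted, while a sequence that begins with a digit or contains a hyphen is rejected.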
+ + +

8.12 Classifications like in ISO C

+ +

The following character classifications mimic those declared in the ISO C +header files <ctype.h> and <wctype.h>. These functions are +deprecated, because this set of functions was designed with ASCII in mind and +cannot reflect the more diverse reality of the Unicode character set. But +they can be a quick-and-dirty porting aid when migrating from wchar_t +APIs to Unicode strings. +

+
+
Function: bool uc_is_alnum (ucs4_t uc) + +
+

Tests for any character for which uc_is_alpha or uc_is_digit is +true. +

+ +
+
Function: bool uc_is_alpha (ucs4_t uc) + +
+

Tests for any character for which uc_is_upper or uc_is_lower is +true, or any character that is one of a locale-specific set of characters for +which none of uc_is_cntrl, uc_is_digit, uc_is_punct, or +uc_is_space is true. +

+ +
+
Function: bool uc_is_cntrl (ucs4_t uc) + +
+

Tests for any control character. +

+ +
+
Function: bool uc_is_digit (ucs4_t uc) + +
+

Tests for any character that corresponds to a decimal-digit character. +

+ +
+
Function: bool uc_is_graph (ucs4_t uc) + +
+

Tests for any character for which uc_is_print is true and +uc_is_space is false. +

+ +
+
Function: bool uc_is_lower (ucs4_t uc) + +
+

Tests for any character that corresponds to a lowercase letter or is one +of a locale-specific set of characters for which none of uc_is_cntrl, +uc_is_digit, uc_is_punct, or uc_is_space is true. +

+ +
+
Function: bool uc_is_print (ucs4_t uc) + +
+

Tests for any printing character. +

+ +
+
Function: bool uc_is_punct (ucs4_t uc) + +
+

Tests for any printing character that is one of a locale-specific set of +characters for which neither uc_is_space nor uc_is_alnum is true. +

+ +
+
Function: bool uc_is_space (ucs4_t uc) + +
+

Tests for any character that corresponds to a locale-specific set of characters +for which none of uc_is_alnum, uc_is_graph, or uc_is_punct +is true. +

+ +
+
Function: bool uc_is_upper (ucs4_t uc) + +
+

Tests for any character that corresponds to an uppercase letter or is one +of a locale-specific set of characters for which none of uc_is_cntrl, +uc_is_digit, uc_is_punct, or uc_is_space is true. +

+ +
+
Function: bool uc_is_xdigit (ucs4_t uc) + +
+

Tests for any character that corresponds to a hexadecimal-digit character. +

+ +
+
Function: bool uc_is_blank (ucs4_t uc) + +
+

Tests for any character that corresponds to a standard blank character or +a locale-specific set of characters for which uc_is_alnum is false. +

+
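As a porting aid, code written around a <wctype.h> predicate can often be restructured so that the predicate is passed in as a parameter. The following is a minimal sketch of a word counter in that style. The names count_words, char_pred_fn and ascii_is_space are hypothetical, and ascii_is_space is an ASCII-only stub. Since the functions in this section all take a single character and return bool, one of them, for example uc_is_space, could presumably be passed instead when the library is available.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Same shape as the predicates in this section: one character in, bool out. */
typedef bool (*char_pred_fn)(uint32_t uc);

/* Counts maximal runs of characters for which is_space is false. */
size_t count_words(const uint32_t *s, size_t n, char_pred_fn is_space)
{
  size_t words = 0;
  bool in_word = false;
  for (size_t i = 0; i < n; i++) {
    if (is_space(s[i])) {
      in_word = false;               /* a space ends the current word */
    } else if (!in_word) {
      in_word = true;                /* first character of a new word */
      words++;
    }
  }
  return words;
}

/* ASCII-only stub: space, and the control characters from tab to carriage
   return (tab, newline, vertical tab, form feed, carriage return). */
bool ascii_is_space(uint32_t uc)
{
  return uc == ' ' || (uc >= '\t' && uc <= '\r');
}
```

Passing the predicate as an argument keeps the loop independent of which classification is used, so the same function works with the stub or with a Unicode-aware predicate.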
+

+ + This document was generated by Bruno Haible on July, 1 2009 using texi2html 1.78a. + +
+ +
