This include file declares functions that classify Unicode characters
and that test whether Unicode characters have specific properties.
The classification assigns a “general category” to every Unicode
character. This is similar to the classification provided by ISO C in
<wctype.h>.
Properties are the data that guides various text processing algorithms
in the presence of specific Unicode characters.
Every Unicode character or code point has a general category assigned
to it. This classification is important for most algorithms that work on
Unicode text.
The GNU libunistring library provides two kinds of API for working with
general categories. The object oriented API uses a variable to denote
every predefined general category value or combinations thereof. The
low-level API uses a bit mask instead. The advantage of the object oriented
API is that if only a few predefined general category values are used,
the data tables are relatively small. When you combine general category
values (using uc_general_category_or, uc_general_category_and, or
uc_general_category_and_not), or when you use the low-level bit masks, a
big table is used that holds the complete general category information for
all Unicode characters.
- Type: uc_general_category_t
This data type denotes a general category value. It is an immediate type that
can be copied by simple assignment, without involving memory allocation. It is
not an array type.
The following are the predefined general category values. Additional general
categories may be added in the future.
The UC_CATEGORY_* constants reflect the systematic general category
values assigned by the Unicode Consortium, whereas the other UC_* macros
are aliases, for use when readable code is preferred.
- Constant: uc_general_category_t UC_CATEGORY_L
- Macro: uc_general_category_t UC_LETTER
This represents the general category “Letter”.
- Constant: uc_general_category_t UC_CATEGORY_LC
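- Macro: uc_general_category_t UC_CASED_LETTER
This represents the general category “Letter, cased”, that is, the union of
“Letter, uppercase”, “Letter, lowercase”, and “Letter, titlecase”.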
- Constant: uc_general_category_t UC_CATEGORY_Lu
- Macro: uc_general_category_t UC_UPPERCASE_LETTER
This represents the general category “Letter, uppercase”.
- Constant: uc_general_category_t UC_CATEGORY_Ll
- Macro: uc_general_category_t UC_LOWERCASE_LETTER
This represents the general category “Letter, lowercase”.
- Constant: uc_general_category_t UC_CATEGORY_Lt
- Macro: uc_general_category_t UC_TITLECASE_LETTER
This represents the general category “Letter, titlecase”.
- Constant: uc_general_category_t UC_CATEGORY_Lm
- Macro: uc_general_category_t UC_MODIFIER_LETTER
This represents the general category “Letter, modifier”.
- Constant: uc_general_category_t UC_CATEGORY_Lo
- Macro: uc_general_category_t UC_OTHER_LETTER
This represents the general category “Letter, other”.
- Constant: uc_general_category_t UC_CATEGORY_M
- Macro: uc_general_category_t UC_MARK
This represents the general category “Mark”.
- Constant: uc_general_category_t UC_CATEGORY_Mn
- Macro: uc_general_category_t UC_NON_SPACING_MARK
This represents the general category “Mark, nonspacing”.
- Constant: uc_general_category_t UC_CATEGORY_Mc
- Macro: uc_general_category_t UC_COMBINING_SPACING_MARK
This represents the general category “Mark, spacing combining”.
- Constant: uc_general_category_t UC_CATEGORY_Me
- Macro: uc_general_category_t UC_ENCLOSING_MARK
This represents the general category “Mark, enclosing”.
- Constant: uc_general_category_t UC_CATEGORY_N
- Macro: uc_general_category_t UC_NUMBER
This represents the general category “Number”.
- Constant: uc_general_category_t UC_CATEGORY_Nd
- Macro: uc_general_category_t UC_DECIMAL_DIGIT_NUMBER
This represents the general category “Number, decimal digit”.
- Constant: uc_general_category_t UC_CATEGORY_Nl
- Macro: uc_general_category_t UC_LETTER_NUMBER
This represents the general category “Number, letter”.
- Constant: uc_general_category_t UC_CATEGORY_No
- Macro: uc_general_category_t UC_OTHER_NUMBER
This represents the general category “Number, other”.
- Constant: uc_general_category_t UC_CATEGORY_P
- Macro: uc_general_category_t UC_PUNCTUATION
This represents the general category “Punctuation”.
- Constant: uc_general_category_t UC_CATEGORY_Pc
- Macro: uc_general_category_t UC_CONNECTOR_PUNCTUATION
This represents the general category “Punctuation, connector”.
- Constant: uc_general_category_t UC_CATEGORY_Pd
- Macro: uc_general_category_t UC_DASH_PUNCTUATION
This represents the general category “Punctuation, dash”.
- Constant: uc_general_category_t UC_CATEGORY_Ps
- Macro: uc_general_category_t UC_OPEN_PUNCTUATION
This represents the general category “Punctuation, open”, a.k.a. “start punctuation”.
- Constant: uc_general_category_t UC_CATEGORY_Pe
- Macro: uc_general_category_t UC_CLOSE_PUNCTUATION
This represents the general category “Punctuation, close”, a.k.a. “end punctuation”.
- Constant: uc_general_category_t UC_CATEGORY_Pi
- Macro: uc_general_category_t UC_INITIAL_QUOTE_PUNCTUATION
This represents the general category “Punctuation, initial quote”.
- Constant: uc_general_category_t UC_CATEGORY_Pf
- Macro: uc_general_category_t UC_FINAL_QUOTE_PUNCTUATION
This represents the general category “Punctuation, final quote”.
- Constant: uc_general_category_t UC_CATEGORY_Po
- Macro: uc_general_category_t UC_OTHER_PUNCTUATION
This represents the general category “Punctuation, other”.
- Constant: uc_general_category_t UC_CATEGORY_S
- Macro: uc_general_category_t UC_SYMBOL
This represents the general category “Symbol”.
- Constant: uc_general_category_t UC_CATEGORY_Sm
- Macro: uc_general_category_t UC_MATH_SYMBOL
This represents the general category “Symbol, math”.
- Constant: uc_general_category_t UC_CATEGORY_Sc
- Macro: uc_general_category_t UC_CURRENCY_SYMBOL
This represents the general category “Symbol, currency”.
- Constant: uc_general_category_t UC_CATEGORY_Sk
- Macro: uc_general_category_t UC_MODIFIER_SYMBOL
This represents the general category “Symbol, modifier”.
- Constant: uc_general_category_t UC_CATEGORY_So
- Macro: uc_general_category_t UC_OTHER_SYMBOL
This represents the general category “Symbol, other”.
- Constant: uc_general_category_t UC_CATEGORY_Z
- Macro: uc_general_category_t UC_SEPARATOR
This represents the general category “Separator”.
- Constant: uc_general_category_t UC_CATEGORY_Zs
- Macro: uc_general_category_t UC_SPACE_SEPARATOR
This represents the general category “Separator, space”.
- Constant: uc_general_category_t UC_CATEGORY_Zl
- Macro: uc_general_category_t UC_LINE_SEPARATOR
This represents the general category “Separator, line”.
- Constant: uc_general_category_t UC_CATEGORY_Zp
- Macro: uc_general_category_t UC_PARAGRAPH_SEPARATOR
This represents the general category “Separator, paragraph”.
- Constant: uc_general_category_t UC_CATEGORY_C
- Macro: uc_general_category_t UC_OTHER
This represents the general category “Other”.
- Constant: uc_general_category_t UC_CATEGORY_Cc
- Macro: uc_general_category_t UC_CONTROL
This represents the general category “Other, control”.
- Constant: uc_general_category_t UC_CATEGORY_Cf
- Macro: uc_general_category_t UC_FORMAT
This represents the general category “Other, format”.
- Constant: uc_general_category_t UC_CATEGORY_Cs
- Macro: uc_general_category_t UC_SURROGATE
This represents the general category “Other, surrogate”.
All code points in this category are invalid characters.
- Constant: uc_general_category_t UC_CATEGORY_Co
- Macro: uc_general_category_t UC_PRIVATE_USE
This represents the general category “Other, private use”.
- Constant: uc_general_category_t UC_CATEGORY_Cn
- Macro: uc_general_category_t UC_UNASSIGNED
This represents the general category “Other, not assigned”.
Some code points in this category are invalid characters.
The following functions combine general categories, like in a boolean algebra,
except that there is no ‘not’ operation.
- Function: uc_general_category_t uc_general_category_or (uc_general_category_t category1, uc_general_category_t category2)
Returns the union of two general categories.
This corresponds to the union of the two sets of characters.
- Function: uc_general_category_t uc_general_category_and (uc_general_category_t category1, uc_general_category_t category2)
Returns the intersection of two general categories as bit masks.
This does not correspond to the intersection of the two sets of
characters.
- Function: uc_general_category_t uc_general_category_and_not (uc_general_category_t category1, uc_general_category_t category2)
Returns the intersection of a general category with the complement of a
second general category, as bit masks.
This does not correspond to the intersection with complement, when
viewing the categories as sets of characters.
The following functions associate general categories with their name.
- Function: const char * uc_general_category_name (uc_general_category_t category)
Returns the name of a general category, more precisely, the abbreviated name.
Returns NULL if the general category corresponds to a bit mask that does not
have a name.
- Function: const char * uc_general_category_long_name (uc_general_category_t category)
Returns the long name of a general category.
Returns NULL if the general category corresponds to a bit mask that does not
have a name.
- Function: uc_general_category_t uc_general_category_byname (const char *category_name)
Returns the general category given by name, e.g. "Lu", or by long name,
e.g. "Uppercase Letter".
This lookup ignores spaces, underscores, or hyphens as word separators and is
case-insensitive.
The following functions view general categories as sets of Unicode characters.
- Function: uc_general_category_t uc_general_category (ucs4_t uc)
Returns the general category of a Unicode character.
This function uses a big table.
- Function: bool uc_is_general_category (ucs4_t uc, uc_general_category_t category)
Tests whether a Unicode character belongs to a given category.
The category argument can be a predefined general category or the
combination of several predefined general categories.
The following are the predefined general category values as bit masks.
Additional general categories may be added in the future.
- Macro: uint32_t UC_CATEGORY_MASK_L
- Macro: uint32_t UC_CATEGORY_MASK_LC
- Macro: uint32_t UC_CATEGORY_MASK_Lu
- Macro: uint32_t UC_CATEGORY_MASK_Ll
- Macro: uint32_t UC_CATEGORY_MASK_Lt
- Macro: uint32_t UC_CATEGORY_MASK_Lm
- Macro: uint32_t UC_CATEGORY_MASK_Lo
- Macro: uint32_t UC_CATEGORY_MASK_M
- Macro: uint32_t UC_CATEGORY_MASK_Mn
- Macro: uint32_t UC_CATEGORY_MASK_Mc
- Macro: uint32_t UC_CATEGORY_MASK_Me
- Macro: uint32_t UC_CATEGORY_MASK_N
- Macro: uint32_t UC_CATEGORY_MASK_Nd
- Macro: uint32_t UC_CATEGORY_MASK_Nl
- Macro: uint32_t UC_CATEGORY_MASK_No
- Macro: uint32_t UC_CATEGORY_MASK_P
- Macro: uint32_t UC_CATEGORY_MASK_Pc
- Macro: uint32_t UC_CATEGORY_MASK_Pd
- Macro: uint32_t UC_CATEGORY_MASK_Ps
- Macro: uint32_t UC_CATEGORY_MASK_Pe
- Macro: uint32_t UC_CATEGORY_MASK_Pi
- Macro: uint32_t UC_CATEGORY_MASK_Pf
- Macro: uint32_t UC_CATEGORY_MASK_Po
- Macro: uint32_t UC_CATEGORY_MASK_S
- Macro: uint32_t UC_CATEGORY_MASK_Sm
- Macro: uint32_t UC_CATEGORY_MASK_Sc
- Macro: uint32_t UC_CATEGORY_MASK_Sk
- Macro: uint32_t UC_CATEGORY_MASK_So
- Macro: uint32_t UC_CATEGORY_MASK_Z
- Macro: uint32_t UC_CATEGORY_MASK_Zs
- Macro: uint32_t UC_CATEGORY_MASK_Zl
- Macro: uint32_t UC_CATEGORY_MASK_Zp
- Macro: uint32_t UC_CATEGORY_MASK_C
- Macro: uint32_t UC_CATEGORY_MASK_Cc
- Macro: uint32_t UC_CATEGORY_MASK_Cf
- Macro: uint32_t UC_CATEGORY_MASK_Cs
- Macro: uint32_t UC_CATEGORY_MASK_Co
- Macro: uint32_t UC_CATEGORY_MASK_Cn
The following function views general categories as sets of Unicode characters.
- Function: bool uc_is_general_category_withtable (ucs4_t uc, uint32_t bitmask)
Tests whether a Unicode character belongs to a given category.
The bitmask argument can be a predefined general category bitmask or the
combination of several predefined general category bitmasks.
This function uses a big table comprising all general categories.
Every Unicode character or code point has a canonical combining class
assigned to it.
What is the meaning of the canonical combining class? Essentially, it
indicates the priority with which a combining character is attached to its
base character. The characters for which the canonical combining class is 0
are the base characters, and the characters for which it is greater than 0 are
the combining characters. Combining characters are rendered
near/attached/around their base character, and combining characters with small
combining classes are attached "first" or "closer" to the base character.
The canonical combining class of a character is a number in the range
0..255. The possible values are described in the Unicode Character Database
(https://www.unicode.org/Public/UNIDATA/UCD.html). The list here is
not definitive; more values can be added in future versions.
- Constant: int UC_CCC_NR
The canonical combining class value for “Not Reordered” characters.
The value is 0.
- Constant: int UC_CCC_OV
The canonical combining class value for “Overlay” characters.
- Constant: int UC_CCC_NK
The canonical combining class value for “Nukta” characters.
- Constant: int UC_CCC_KV
The canonical combining class value for “Kana Voicing” characters.
- Constant: int UC_CCC_VR
The canonical combining class value for “Virama” characters.
- Constant: int UC_CCC_ATBL
The canonical combining class value for “Attached Below Left” characters.
- Constant: int UC_CCC_ATB
The canonical combining class value for “Attached Below” characters.
- Constant: int UC_CCC_ATA
The canonical combining class value for “Attached Above” characters.
- Constant: int UC_CCC_ATAR
The canonical combining class value for “Attached Above Right” characters.
- Constant: int UC_CCC_BL
The canonical combining class value for “Below Left” characters.
- Constant: int UC_CCC_B
The canonical combining class value for “Below” characters.
- Constant: int UC_CCC_BR
The canonical combining class value for “Below Right” characters.
- Constant: int UC_CCC_L
The canonical combining class value for “Left” characters.
- Constant: int UC_CCC_R
The canonical combining class value for “Right” characters.
- Constant: int UC_CCC_AL
The canonical combining class value for “Above Left” characters.
- Constant: int UC_CCC_A
The canonical combining class value for “Above” characters.
- Constant: int UC_CCC_AR
The canonical combining class value for “Above Right” characters.
- Constant: int UC_CCC_DB
The canonical combining class value for “Double Below” characters.
- Constant: int UC_CCC_DA
The canonical combining class value for “Double Above” characters.
- Constant: int UC_CCC_IS
The canonical combining class value for “Iota Subscript” characters.
The following functions associate canonical combining classes with their name.
- Function: const char * uc_combining_class_name (int ccc)
Returns the name of a canonical combining class, more precisely, the
abbreviated name.
Returns NULL if the canonical combining class is a numeric value without a
name.
- Function: const char * uc_combining_class_long_name (int ccc)
Returns the long name of a canonical combining class.
Returns NULL if the canonical combining class is a numeric value without a
name.
- Function: int uc_combining_class_byname (const char *ccc_name)
Returns the canonical combining class given by name, e.g. "BL", or by
long name, e.g. "Below Left".
This lookup ignores spaces, underscores, or hyphens as word separators and is
case-insensitive.
The following function looks up the canonical combining class of a character.
- Function: int uc_combining_class (ucs4_t uc)
Returns the canonical combining class of a Unicode character.
Every Unicode character or code point has a bidi class assigned to it.
Before Unicode 4.0, this concept was known as bidirectional category.
The bidi class guides the bidirectional algorithm
(https://www.unicode.org/reports/tr9/). The possible values are
the following.
- Constant: int UC_BIDI_L
The bidi class for “Left-to-Right” characters.
- Constant: int UC_BIDI_LRE
The bidi class for “Left-to-Right Embedding” characters.
- Constant: int UC_BIDI_LRO
The bidi class for “Left-to-Right Override” characters.
- Constant: int UC_BIDI_R
The bidi class for “Right-to-Left” characters.
- Constant: int UC_BIDI_AL
The bidi class for “Right-to-Left Arabic” characters.
- Constant: int UC_BIDI_RLE
The bidi class for “Right-to-Left Embedding” characters.
- Constant: int UC_BIDI_RLO
The bidi class for “Right-to-Left Override” characters.
- Constant: int UC_BIDI_PDF
The bidi class for “Pop Directional Format” characters.
- Constant: int UC_BIDI_EN
The bidi class for “European Number” characters.
- Constant: int UC_BIDI_ES
The bidi class for “European Number Separator” characters.
- Constant: int UC_BIDI_ET
The bidi class for “European Number Terminator” characters.
- Constant: int UC_BIDI_AN
The bidi class for “Arabic Number” characters.
- Constant: int UC_BIDI_CS
The bidi class for “Common Number Separator” characters.
- Constant: int UC_BIDI_NSM
The bidi class for “Non-Spacing Mark” characters.
- Constant: int UC_BIDI_BN
The bidi class for “Boundary Neutral” characters.
- Constant: int UC_BIDI_B
The bidi class for “Paragraph Separator” characters.
- Constant: int UC_BIDI_S
The bidi class for “Segment Separator” characters.
- Constant: int UC_BIDI_WS
The bidi class for “Whitespace” characters.
- Constant: int UC_BIDI_ON
The bidi class for “Other Neutral” characters.
- Constant: int UC_BIDI_LRI
The bidi class for “Left-to-Right Isolate” characters.
- Constant: int UC_BIDI_RLI
The bidi class for “Right-to-Left Isolate” characters.
- Constant: int UC_BIDI_FSI
The bidi class for “First Strong Isolate” characters.
- Constant: int UC_BIDI_PDI
The bidi class for “Pop Directional Isolate” characters.
The following functions implement the association between a bidirectional
category and its name.
- Function: const char * uc_bidi_class_name (int bidi_class)
- Function: const char * uc_bidi_category_name (int category)
Returns the name of a bidi class, more precisely, the abbreviated name.
- Function: const char * uc_bidi_class_long_name (int bidi_class)
Returns the long name of a bidi class.
- Function: int uc_bidi_class_byname (const char *bidi_class_name)
- Function: int uc_bidi_category_byname (const char *category_name)
Returns the bidi class given by name, e.g. "LRE", or by long name,
e.g. "Left-to-Right Embedding".
This lookup ignores spaces, underscores, or hyphens as word separators and is
case-insensitive.
The following functions view bidirectional categories as sets of Unicode
characters.
- Function: int uc_bidi_class (ucs4_t uc)
- Function: int uc_bidi_category (ucs4_t uc)
Returns the bidi class of a Unicode character.
- Function: bool uc_is_bidi_class (ucs4_t uc, int bidi_class)
- Function: bool uc_is_bidi_category (ucs4_t uc, int category)
Tests whether a Unicode character belongs to a given bidi class.
Decimal digits (like the digits from ‘0’ to ‘9’) exist in many
scripts. The following function converts a decimal digit character to its
numerical value.
- Function: int uc_decimal_value (ucs4_t uc)
Returns the decimal digit value of a Unicode character.
The return value is an integer in the range 0..9, or -1 for characters that
do not represent a decimal digit.
Digit characters are like decimal digit characters, possibly in special forms,
such as superscript, subscript, or circled. The following function converts a
digit character to its numerical value.
- Function: int uc_digit_value (ucs4_t uc)
Returns the digit value of a Unicode character.
The return value is an integer in the range 0..9, or -1 for characters that
do not represent a digit.
There are also characters that represent numbers without a digit system, like
the Roman numerals, and fractional numbers, like 1/4 or 3/4.
The following type represents the numeric value of a Unicode character.
- Type: uc_fraction_t
This is a structure type with the following fields:
    int numerator;
    int denominator;
An integer n is represented by numerator = n, denominator = 1.
The following function converts a number character to its numerical value.
- Function: uc_fraction_t uc_numeric_value (ucs4_t uc)
Returns the numeric value of a Unicode character.
The return value is a fraction, or the pseudo-fraction { 0, 0 } for
characters that do not represent a number.
Character mirroring is used to associate the closing parenthesis character
to the opening parenthesis character, the closing brace character with the
opening brace character, and so on.
The following function looks up the mirrored character of a Unicode character.
- Function: bool uc_mirror_char (ucs4_t uc, ucs4_t *puc)
Stores the mirrored character of a Unicode character uc in *puc and
returns true, if it exists. Otherwise it stores uc unmodified in *puc and
returns false.
Note: It is possible for this function to return true and set *puc to
0xFFFD.
This happens when the character has the bidi mirror property (that is, it
should be displayed through a mirrored glyph) but this mirrored glyph
does not exist as a Unicode character; thus a rendering engine needs to
synthesize it artificially or pick it from an appropriate font.
This affects mostly mathematical operators.
See section “Bidi Mirrored” of the Unicode standard.
When Arabic characters are rendered, after bidi reordering has taken
place, the shapes of the glyphs are modified so that many adjacent glyphs
are joined. Two character properties describe how this “Arabic shaping”
takes place: the joining type and the joining group.
The joining type of a character describes on which of the left and right
neighbour characters the character's shape depends, and which of the two
neighbour characters are rendered depending on this character.
The joining type has the following possible values:
- Constant: int UC_JOINING_TYPE_U
“Non joining”: Characters of this joining type prohibit joining.
- Constant: int UC_JOINING_TYPE_T
“Transparent”: Characters of this joining type are skipped when
considering joining.
- Constant: int UC_JOINING_TYPE_C
“Join causing”: Characters of this joining type cause their neighbour
characters to change their shapes but don't change their own shape.
- Constant: int UC_JOINING_TYPE_L
“Left joining”: Characters of this joining type have two shapes,
isolated and initial. Such characters currently don't exist.
- Constant: int UC_JOINING_TYPE_R
“Right joining”: Characters of this joining type have two shapes,
isolated and final.
- Constant: int UC_JOINING_TYPE_D
“Dual joining”: Characters of this joining type have four shapes,
initial, medial, final, and isolated.
The following functions implement the association between a joining type
and its name.
- Function: const char * uc_joining_type_name (int joining_type)
Returns the name of a joining type.
- Function: const char * uc_joining_type_long_name (int joining_type)
Returns the long name of a joining type.
- Function: int uc_joining_type_byname (const char *joining_type_name)
Returns the joining type given by name, e.g. "D", or by long name,
e.g. "Dual Joining".
This lookup ignores spaces, underscores, or hyphens as word separators and is
case-insensitive.
The following function gives the joining type of every Unicode character.
- Function: int uc_joining_type (ucs4_t uc)
Returns the joining type of a Unicode character.
The joining group of a character describes how the character's shape
is modified in the four contexts of dual-joining characters or in the
two contexts of right-joining characters.
The joining group has the following possible values:
- Constant: int UC_JOINING_GROUP_NONE
- Constant: int UC_JOINING_GROUP_AIN
- Constant: int UC_JOINING_GROUP_ALAPH
- Constant: int UC_JOINING_GROUP_ALEF
- Constant: int UC_JOINING_GROUP_BEH
- Constant: int UC_JOINING_GROUP_BETH
- Constant: int UC_JOINING_GROUP_BURUSHASKI_YEH_BARREE
- Constant: int UC_JOINING_GROUP_DAL
- Constant: int UC_JOINING_GROUP_DALATH_RISH
- Constant: int UC_JOINING_GROUP_E
- Constant: int UC_JOINING_GROUP_FARSI_YEH
- Constant: int UC_JOINING_GROUP_FE
- Constant: int UC_JOINING_GROUP_FEH
- Constant: int UC_JOINING_GROUP_FINAL_SEMKATH
- Constant: int UC_JOINING_GROUP_GAF
- Constant: int UC_JOINING_GROUP_GAMAL
- Constant: int UC_JOINING_GROUP_HAH
- Constant: int UC_JOINING_GROUP_HE
- Constant: int UC_JOINING_GROUP_HEH
- Constant: int UC_JOINING_GROUP_HEH_GOAL
- Constant: int UC_JOINING_GROUP_HETH
- Constant: int UC_JOINING_GROUP_KAF
- Constant: int UC_JOINING_GROUP_KAPH
- Constant: int UC_JOINING_GROUP_KHAPH
- Constant: int UC_JOINING_GROUP_KNOTTED_HEH
- Constant: int UC_JOINING_GROUP_LAM
- Constant: int UC_JOINING_GROUP_LAMADH
- Constant: int UC_JOINING_GROUP_MEEM
- Constant: int UC_JOINING_GROUP_MIM
- Constant: int UC_JOINING_GROUP_NOON
- Constant: int UC_JOINING_GROUP_NUN
- Constant: int UC_JOINING_GROUP_NYA
- Constant: int UC_JOINING_GROUP_PE
- Constant: int UC_JOINING_GROUP_QAF
- Constant: int UC_JOINING_GROUP_QAPH
- Constant: int UC_JOINING_GROUP_REH
- Constant: int UC_JOINING_GROUP_REVERSED_PE
- Constant: int UC_JOINING_GROUP_SAD
- Constant: int UC_JOINING_GROUP_SADHE
- Constant: int UC_JOINING_GROUP_SEEN
- Constant: int UC_JOINING_GROUP_SEMKATH
- Constant: int UC_JOINING_GROUP_SHIN
- Constant: int UC_JOINING_GROUP_SWASH_KAF
- Constant: int UC_JOINING_GROUP_SYRIAC_WAW
- Constant: int UC_JOINING_GROUP_TAH
- Constant: int UC_JOINING_GROUP_TAW
- Constant: int UC_JOINING_GROUP_TEH_MARBUTA
- Constant: int UC_JOINING_GROUP_TEH_MARBUTA_GOAL
- Constant: int UC_JOINING_GROUP_TETH
- Constant: int UC_JOINING_GROUP_WAW
- Constant: int UC_JOINING_GROUP_YEH
- Constant: int UC_JOINING_GROUP_YEH_BARREE
- Constant: int UC_JOINING_GROUP_YEH_WITH_TAIL
- Constant: int UC_JOINING_GROUP_YUDH
- Constant: int UC_JOINING_GROUP_YUDH_HE
- Constant: int UC_JOINING_GROUP_ZAIN
- Constant: int UC_JOINING_GROUP_ZHAIN
- Constant: int UC_JOINING_GROUP_ROHINGYA_YEH
- Constant: int UC_JOINING_GROUP_STRAIGHT_WAW
- Constant: int UC_JOINING_GROUP_MANICHAEAN_ALEPH
- Constant: int UC_JOINING_GROUP_MANICHAEAN_BETH
- Constant: int UC_JOINING_GROUP_MANICHAEAN_GIMEL
- Constant: int UC_JOINING_GROUP_MANICHAEAN_DALETH
- Constant: int UC_JOINING_GROUP_MANICHAEAN_WAW
- Constant: int UC_JOINING_GROUP_MANICHAEAN_ZAYIN
- Constant: int UC_JOINING_GROUP_MANICHAEAN_HETH
- Constant: int UC_JOINING_GROUP_MANICHAEAN_TETH
- Constant: int UC_JOINING_GROUP_MANICHAEAN_YODH
- Constant: int UC_JOINING_GROUP_MANICHAEAN_KAPH
- Constant: int UC_JOINING_GROUP_MANICHAEAN_LAMEDH
- Constant: int UC_JOINING_GROUP_MANICHAEAN_DHAMEDH
- Constant: int UC_JOINING_GROUP_MANICHAEAN_THAMEDH
- Constant: int UC_JOINING_GROUP_MANICHAEAN_MEM
- Constant: int UC_JOINING_GROUP_MANICHAEAN_NUN
- Constant: int UC_JOINING_GROUP_MANICHAEAN_SAMEKH
- Constant: int UC_JOINING_GROUP_MANICHAEAN_AYIN
- Constant: int UC_JOINING_GROUP_MANICHAEAN_PE
- Constant: int UC_JOINING_GROUP_MANICHAEAN_SADHE
- Constant: int UC_JOINING_GROUP_MANICHAEAN_QOPH
- Constant: int UC_JOINING_GROUP_MANICHAEAN_RESH
- Constant: int UC_JOINING_GROUP_MANICHAEAN_TAW
- Constant: int UC_JOINING_GROUP_MANICHAEAN_ONE
- Constant: int UC_JOINING_GROUP_MANICHAEAN_FIVE
- Constant: int UC_JOINING_GROUP_MANICHAEAN_TEN
- Constant: int UC_JOINING_GROUP_MANICHAEAN_TWENTY
- Constant: int UC_JOINING_GROUP_MANICHAEAN_HUNDRED
- Constant: int UC_JOINING_GROUP_AFRICAN_FEH
- Constant: int UC_JOINING_GROUP_AFRICAN_QAF
- Constant: int UC_JOINING_GROUP_AFRICAN_NOON
- Constant: int UC_JOINING_GROUP_MALAYALAM_NGA
- Constant: int UC_JOINING_GROUP_MALAYALAM_JA
- Constant: int UC_JOINING_GROUP_MALAYALAM_NYA
- Constant: int UC_JOINING_GROUP_MALAYALAM_TTA
- Constant: int UC_JOINING_GROUP_MALAYALAM_NNA
- Constant: int UC_JOINING_GROUP_MALAYALAM_NNNA
- Constant: int UC_JOINING_GROUP_MALAYALAM_BHA
- Constant: int UC_JOINING_GROUP_MALAYALAM_RA
- Constant: int UC_JOINING_GROUP_MALAYALAM_LLA
- Constant: int UC_JOINING_GROUP_MALAYALAM_LLLA
- Constant: int UC_JOINING_GROUP_MALAYALAM_SSA
- Constant: int UC_JOINING_GROUP_HANIFI_ROHINGYA_PA
- Constant: int UC_JOINING_GROUP_HANIFI_ROHINGYA_KINNA_YA
- Constant: int UC_JOINING_GROUP_THIN_YEH
- Constant: int UC_JOINING_GROUP_VERTICAL_TAIL
- Constant: int UC_JOINING_GROUP_KASHMIRI_YEH
The following functions implement the association between a joining group
and its name.
- Function: const char * uc_joining_group_name (int joining_group)
Returns the name of a joining group.
- Function: int uc_joining_group_byname (const char *joining_group_name)
Returns the joining group given by name, e.g. "Teh_Marbuta".
This lookup ignores spaces, underscores, or hyphens as word separators and is
case-insensitive.
The following function gives the joining group of every Unicode character.
- Function: int uc_joining_group (ucs4_t uc)
Returns the joining group of a Unicode character.
This section defines boolean properties of Unicode characters. That is,
a character either has the given property or does not have it.
In other words, the property can be viewed as a subset of the set of
Unicode characters.
The GNU libunistring library provides two kinds of API for working with
properties. The object oriented API uses a type uc_property_t
to designate a property. In the function-based API, which is a bit more
low level, a property is merely a function.
The following type designates a property on Unicode characters.
- Type: uc_property_t
This data type denotes a boolean property on Unicode characters. It is an
immediate type that can be copied by simple assignment, without involving
memory allocation. It is not an array type.
Many Unicode properties are predefined.
The following are general properties.
- Constant: uc_property_t UC_PROPERTY_WHITE_SPACE
- Constant: uc_property_t UC_PROPERTY_ALPHABETIC
- Constant: uc_property_t UC_PROPERTY_OTHER_ALPHABETIC
- Constant: uc_property_t UC_PROPERTY_NOT_A_CHARACTER
- Constant: uc_property_t UC_PROPERTY_DEFAULT_IGNORABLE_CODE_POINT
- Constant: uc_property_t UC_PROPERTY_OTHER_DEFAULT_IGNORABLE_CODE_POINT
- Constant: uc_property_t UC_PROPERTY_DEPRECATED
- Constant: uc_property_t UC_PROPERTY_LOGICAL_ORDER_EXCEPTION
- Constant: uc_property_t UC_PROPERTY_VARIATION_SELECTOR
- Constant: uc_property_t UC_PROPERTY_PRIVATE_USE
- Constant: uc_property_t UC_PROPERTY_UNASSIGNED_CODE_VALUE
The following properties are related to case folding.
- Constant: uc_property_t UC_PROPERTY_UPPERCASE
- Constant: uc_property_t UC_PROPERTY_OTHER_UPPERCASE
- Constant: uc_property_t UC_PROPERTY_LOWERCASE
- Constant: uc_property_t UC_PROPERTY_OTHER_LOWERCASE
- Constant: uc_property_t UC_PROPERTY_TITLECASE
- Constant: uc_property_t UC_PROPERTY_CASED
- Constant: uc_property_t UC_PROPERTY_CASE_IGNORABLE
- Constant: uc_property_t UC_PROPERTY_CHANGES_WHEN_LOWERCASED
- Constant: uc_property_t UC_PROPERTY_CHANGES_WHEN_UPPERCASED
- Constant: uc_property_t UC_PROPERTY_CHANGES_WHEN_TITLECASED
- Constant: uc_property_t UC_PROPERTY_CHANGES_WHEN_CASEFOLDED
- Constant: uc_property_t UC_PROPERTY_CHANGES_WHEN_CASEMAPPED
- Constant: uc_property_t UC_PROPERTY_SOFT_DOTTED
The following properties are related to identifiers.
- Constant: uc_property_t UC_PROPERTY_ID_START
- Constant: uc_property_t UC_PROPERTY_OTHER_ID_START
- Constant: uc_property_t UC_PROPERTY_ID_CONTINUE
- Constant: uc_property_t UC_PROPERTY_OTHER_ID_CONTINUE
- Constant: uc_property_t UC_PROPERTY_XID_START
- Constant: uc_property_t UC_PROPERTY_XID_CONTINUE
- Constant: uc_property_t UC_PROPERTY_ID_COMPAT_MATH_START
- Constant: uc_property_t UC_PROPERTY_ID_COMPAT_MATH_CONTINUE
- Constant: uc_property_t UC_PROPERTY_PATTERN_WHITE_SPACE
- Constant: uc_property_t UC_PROPERTY_PATTERN_SYNTAX
The following properties have an influence on shaping and rendering.
- Constant: uc_property_t UC_PROPERTY_JOIN_CONTROL
- Constant: uc_property_t UC_PROPERTY_GRAPHEME_BASE
- Constant: uc_property_t UC_PROPERTY_GRAPHEME_EXTEND
- Constant: uc_property_t UC_PROPERTY_OTHER_GRAPHEME_EXTEND
- Constant: uc_property_t UC_PROPERTY_GRAPHEME_LINK
- Constant: uc_property_t UC_PROPERTY_MODIFIER_COMBINING_MARK
The following properties relate to bidirectional reordering.
- Constant: uc_property_t UC_PROPERTY_BIDI_CONTROL
- Constant: uc_property_t UC_PROPERTY_BIDI_LEFT_TO_RIGHT
- Constant: uc_property_t UC_PROPERTY_BIDI_HEBREW_RIGHT_TO_LEFT
- Constant: uc_property_t UC_PROPERTY_BIDI_ARABIC_RIGHT_TO_LEFT
- Constant: uc_property_t UC_PROPERTY_BIDI_EUROPEAN_DIGIT
- Constant: uc_property_t UC_PROPERTY_BIDI_EUR_NUM_SEPARATOR
- Constant: uc_property_t UC_PROPERTY_BIDI_EUR_NUM_TERMINATOR
- Constant: uc_property_t UC_PROPERTY_BIDI_ARABIC_DIGIT
- Constant: uc_property_t UC_PROPERTY_BIDI_COMMON_SEPARATOR
- Constant: uc_property_t UC_PROPERTY_BIDI_BLOCK_SEPARATOR
- Constant: uc_property_t UC_PROPERTY_BIDI_SEGMENT_SEPARATOR
- Constant: uc_property_t UC_PROPERTY_BIDI_WHITESPACE
- Constant: uc_property_t UC_PROPERTY_BIDI_NON_SPACING_MARK
- Constant: uc_property_t UC_PROPERTY_BIDI_BOUNDARY_NEUTRAL
- Constant: uc_property_t UC_PROPERTY_BIDI_PDF
- Constant: uc_property_t UC_PROPERTY_BIDI_EMBEDDING_OR_OVERRIDE
- Constant: uc_property_t UC_PROPERTY_BIDI_OTHER_NEUTRAL
The following properties deal with number representations.
- Constant: uc_property_t UC_PROPERTY_HEX_DIGIT
- Constant: uc_property_t UC_PROPERTY_ASCII_HEX_DIGIT
The following properties deal with CJK.
- Constant: uc_property_t UC_PROPERTY_IDEOGRAPHIC
- Constant: uc_property_t UC_PROPERTY_UNIFIED_IDEOGRAPH
- Constant: uc_property_t UC_PROPERTY_RADICAL
- Constant: uc_property_t UC_PROPERTY_IDS_UNARY_OPERATOR
- Constant: uc_property_t UC_PROPERTY_IDS_BINARY_OPERATOR
- Constant: uc_property_t UC_PROPERTY_IDS_TRINARY_OPERATOR
The following properties deal with pictographic symbols.
- Constant: uc_property_t UC_PROPERTY_EMOJI
- Constant: uc_property_t UC_PROPERTY_EMOJI_PRESENTATION
- Constant: uc_property_t UC_PROPERTY_EMOJI_MODIFIER
- Constant: uc_property_t UC_PROPERTY_EMOJI_MODIFIER_BASE
- Constant: uc_property_t UC_PROPERTY_EMOJI_COMPONENT
- Constant: uc_property_t UC_PROPERTY_EXTENDED_PICTOGRAPHIC
Other miscellaneous properties are:
- Constant: uc_property_t UC_PROPERTY_ZERO_WIDTH
- Constant: uc_property_t UC_PROPERTY_SPACE
- Constant: uc_property_t UC_PROPERTY_NON_BREAK
- Constant: uc_property_t UC_PROPERTY_ISO_CONTROL
- Constant: uc_property_t UC_PROPERTY_FORMAT_CONTROL
- Constant: uc_property_t UC_PROPERTY_PREPENDED_CONCATENATION_MARK
- Constant: uc_property_t UC_PROPERTY_DASH
- Constant: uc_property_t UC_PROPERTY_HYPHEN
- Constant: uc_property_t UC_PROPERTY_PUNCTUATION
- Constant: uc_property_t UC_PROPERTY_LINE_SEPARATOR
- Constant: uc_property_t UC_PROPERTY_PARAGRAPH_SEPARATOR
- Constant: uc_property_t UC_PROPERTY_QUOTATION_MARK
- Constant: uc_property_t UC_PROPERTY_SENTENCE_TERMINAL
- Constant: uc_property_t UC_PROPERTY_TERMINAL_PUNCTUATION
- Constant: uc_property_t UC_PROPERTY_CURRENCY_SYMBOL
- Constant: uc_property_t UC_PROPERTY_MATH
- Constant: uc_property_t UC_PROPERTY_OTHER_MATH
- Constant: uc_property_t UC_PROPERTY_PAIRED_PUNCTUATION
- Constant: uc_property_t UC_PROPERTY_LEFT_OF_PAIR
- Constant: uc_property_t UC_PROPERTY_COMBINING
- Constant: uc_property_t UC_PROPERTY_COMPOSITE
- Constant: uc_property_t UC_PROPERTY_DECIMAL_DIGIT
- Constant: uc_property_t UC_PROPERTY_NUMERIC
- Constant: uc_property_t UC_PROPERTY_DIACRITIC
- Constant: uc_property_t UC_PROPERTY_EXTENDER
- Constant: uc_property_t UC_PROPERTY_IGNORABLE_CONTROL
- Constant: uc_property_t UC_PROPERTY_REGIONAL_INDICATOR
The following function looks up a property by its name.
- Function: uc_property_t uc_property_byname (const char *property_name)
Returns the property given by name, e.g. "White space". If a property
with the given name exists, the result will satisfy the
uc_property_is_valid predicate. Otherwise the result will not satisfy
this predicate and must not be passed to functions that expect a
uc_property_t argument.
This lookup ignores spaces, underscores, or hyphens as word separators, is
case-insensitive, and supports the aliases listed in Unicode's
‘PropertyAliases.txt’ file.
This function references a big table of all predefined properties. Its use
can significantly increase the size of your application.
- Function: bool uc_property_is_valid (uc_property_t property)
Returns true when the given property is valid, or false otherwise.
The following function views a property as a set of Unicode characters.
- Function: bool uc_is_property (ucs4_t uc, uc_property_t property)
Tests whether the Unicode character uc has the given property.
The following are general properties.
- Function: bool uc_is_property_white_space (ucs4_t uc)
- Function: bool uc_is_property_alphabetic (ucs4_t uc)
- Function: bool uc_is_property_other_alphabetic (ucs4_t uc)
- Function: bool uc_is_property_not_a_character (ucs4_t uc)
- Function: bool uc_is_property_default_ignorable_code_point (ucs4_t uc)
- Function: bool uc_is_property_other_default_ignorable_code_point (ucs4_t uc)
- Function: bool uc_is_property_deprecated (ucs4_t uc)
- Function: bool uc_is_property_logical_order_exception (ucs4_t uc)
- Function: bool uc_is_property_variation_selector (ucs4_t uc)
- Function: bool uc_is_property_private_use (ucs4_t uc)
- Function: bool uc_is_property_unassigned_code_value (ucs4_t uc)
The following properties are related to case folding.
- Function: bool uc_is_property_uppercase (ucs4_t uc)
- Function: bool uc_is_property_other_uppercase (ucs4_t uc)
- Function: bool uc_is_property_lowercase (ucs4_t uc)
- Function: bool uc_is_property_other_lowercase (ucs4_t uc)
- Function: bool uc_is_property_titlecase (ucs4_t uc)
- Function: bool uc_is_property_cased (ucs4_t uc)
- Function: bool uc_is_property_case_ignorable (ucs4_t uc)
- Function: bool uc_is_property_changes_when_lowercased (ucs4_t uc)
- Function: bool uc_is_property_changes_when_uppercased (ucs4_t uc)
- Function: bool uc_is_property_changes_when_titlecased (ucs4_t uc)
- Function: bool uc_is_property_changes_when_casefolded (ucs4_t uc)
- Function: bool uc_is_property_changes_when_casemapped (ucs4_t uc)
- Function: bool uc_is_property_soft_dotted (ucs4_t uc)
The following properties are related to identifiers.
- Function: bool uc_is_property_id_start (ucs4_t uc)
- Function: bool uc_is_property_other_id_start (ucs4_t uc)
- Function: bool uc_is_property_id_continue (ucs4_t uc)
- Function: bool uc_is_property_other_id_continue (ucs4_t uc)
- Function: bool uc_is_property_xid_start (ucs4_t uc)
- Function: bool uc_is_property_xid_continue (ucs4_t uc)
- Function: bool uc_is_property_id_compat_math_start (ucs4_t uc)
- Function: bool uc_is_property_id_compat_math_continue (ucs4_t uc)
- Function: bool uc_is_property_pattern_white_space (ucs4_t uc)
- Function: bool uc_is_property_pattern_syntax (ucs4_t uc)
The following properties have an influence on shaping and rendering.
- Function: bool uc_is_property_join_control (ucs4_t uc)
- Function: bool uc_is_property_grapheme_base (ucs4_t uc)
- Function: bool uc_is_property_grapheme_extend (ucs4_t uc)
- Function: bool uc_is_property_other_grapheme_extend (ucs4_t uc)
- Function: bool uc_is_property_grapheme_link (ucs4_t uc)
- Function: bool uc_is_property_modifier_combining_mark (ucs4_t uc)
The following properties relate to bidirectional reordering.
- Function: bool uc_is_property_bidi_control (ucs4_t uc)
- Function: bool uc_is_property_bidi_left_to_right (ucs4_t uc)
- Function: bool uc_is_property_bidi_hebrew_right_to_left (ucs4_t uc)
- Function: bool uc_is_property_bidi_arabic_right_to_left (ucs4_t uc)
- Function: bool uc_is_property_bidi_european_digit (ucs4_t uc)
- Function: bool uc_is_property_bidi_eur_num_separator (ucs4_t uc)
- Function: bool uc_is_property_bidi_eur_num_terminator (ucs4_t uc)
- Function: bool uc_is_property_bidi_arabic_digit (ucs4_t uc)
- Function: bool uc_is_property_bidi_common_separator (ucs4_t uc)
- Function: bool uc_is_property_bidi_block_separator (ucs4_t uc)
- Function: bool uc_is_property_bidi_segment_separator (ucs4_t uc)
- Function: bool uc_is_property_bidi_whitespace (ucs4_t uc)
- Function: bool uc_is_property_bidi_non_spacing_mark (ucs4_t uc)
- Function: bool uc_is_property_bidi_boundary_neutral (ucs4_t uc)
- Function: bool uc_is_property_bidi_pdf (ucs4_t uc)
- Function: bool uc_is_property_bidi_embedding_or_override (ucs4_t uc)
- Function: bool uc_is_property_bidi_other_neutral (ucs4_t uc)
The following properties deal with number representations.
- Function: bool uc_is_property_hex_digit (ucs4_t uc)
- Function: bool uc_is_property_ascii_hex_digit (ucs4_t uc)
The following properties deal with CJK.
- Function: bool uc_is_property_ideographic (ucs4_t uc)
- Function: bool uc_is_property_unified_ideograph (ucs4_t uc)
- Function: bool uc_is_property_radical (ucs4_t uc)
- Function: bool uc_is_property_ids_unary_operator (ucs4_t uc)
- Function: bool uc_is_property_ids_binary_operator (ucs4_t uc)
- Function: bool uc_is_property_ids_trinary_operator (ucs4_t uc)
The following properties deal with pictographic symbols.
- Function: bool uc_is_property_emoji (ucs4_t uc)
- Function: bool uc_is_property_emoji_presentation (ucs4_t uc)
- Function: bool uc_is_property_emoji_modifier (ucs4_t uc)
- Function: bool uc_is_property_emoji_modifier_base (ucs4_t uc)
- Function: bool uc_is_property_emoji_component (ucs4_t uc)
- Function: bool uc_is_property_extended_pictographic (ucs4_t uc)
Other miscellaneous properties are:
- Function: bool uc_is_property_zero_width (ucs4_t uc)
- Function: bool uc_is_property_space (ucs4_t uc)
- Function: bool uc_is_property_non_break (ucs4_t uc)
- Function: bool uc_is_property_iso_control (ucs4_t uc)
- Function: bool uc_is_property_format_control (ucs4_t uc)
- Function: bool uc_is_property_prepended_concatenation_mark (ucs4_t uc)
- Function: bool uc_is_property_dash (ucs4_t uc)
- Function: bool uc_is_property_hyphen (ucs4_t uc)
- Function: bool uc_is_property_punctuation (ucs4_t uc)
- Function: bool uc_is_property_line_separator (ucs4_t uc)
- Function: bool uc_is_property_paragraph_separator (ucs4_t uc)
- Function: bool uc_is_property_quotation_mark (ucs4_t uc)
- Function: bool uc_is_property_sentence_terminal (ucs4_t uc)
- Function: bool uc_is_property_terminal_punctuation (ucs4_t uc)
- Function: bool uc_is_property_currency_symbol (ucs4_t uc)
- Function: bool uc_is_property_math (ucs4_t uc)
- Function: bool uc_is_property_other_math (ucs4_t uc)
- Function: bool uc_is_property_paired_punctuation (ucs4_t uc)
- Function: bool uc_is_property_left_of_pair (ucs4_t uc)
- Function: bool uc_is_property_combining (ucs4_t uc)
- Function: bool uc_is_property_composite (ucs4_t uc)
- Function: bool uc_is_property_decimal_digit (ucs4_t uc)
- Function: bool uc_is_property_numeric (ucs4_t uc)
- Function: bool uc_is_property_diacritic (ucs4_t uc)
- Function: bool uc_is_property_extender (ucs4_t uc)
- Function: bool uc_is_property_ignorable_control (ucs4_t uc)
- Function: bool uc_is_property_regional_indicator (ucs4_t uc)
This section defines non-boolean attributes of Unicode characters.
The Indic_Conjunct_Break attribute is used when determining the grapheme
cluster boundary in Indic scripts.
The Indic_Conjunct_Break attribute has the following possible values:
- Constant: int UC_INDIC_CONJUNCT_BREAK_NONE
- Constant: int UC_INDIC_CONJUNCT_BREAK_CONSONANT
- Constant: int UC_INDIC_CONJUNCT_BREAK_LINKER
- Constant: int UC_INDIC_CONJUNCT_BREAK_EXTEND
The following functions implement the association between an
Indic_Conjunct_Break value and its name.
- Function: const char * uc_indic_conjunct_break_name (int indic_conjunct_break)
Returns the name of an Indic_Conjunct_Break value.
- Function: int uc_indic_conjunct_break_byname (const char *indic_conjunct_break_name)
Returns the Indic_Conjunct_Break value given by name, e.g. "Consonant".
This lookup ignores spaces, underscores, or hyphens as word separators and is
case-insensitive.
The following function gives the Indic_Conjunct_Break attribute of every
Unicode character.
- Function: int uc_indic_conjunct_break (ucs4_t uc)
Returns the Indic_Conjunct_Break attribute of a Unicode character.
The Unicode characters are subdivided into scripts.
The following type is used to represent a script:
- Type: uc_script_t
This data type is a structure type that refers to statically allocated
read-only data. It contains the following fields:
The name field contains the name of the script.
The following functions look up a script.
- Function: const uc_script_t * uc_script (ucs4_t uc)
Returns the script of a Unicode character. Returns NULL if uc does not
belong to any script.
- Function: const uc_script_t * uc_script_byname (const char *script_name)
Returns the script given by its name, e.g. "HAN". Returns NULL if a
script with the given name does not exist.
The following function views a script as a set of Unicode characters.
- Function: bool uc_is_script (ucs4_t uc, const uc_script_t *script)
Tests whether a Unicode character belongs to a given script.
The following gives a global picture of all scripts.
- Function: void uc_all_scripts (const uc_script_t **scripts, size_t *count)
Gets the list of all scripts. Stores a pointer to an array of all scripts in
*scripts and the length of this array in *count.
The Unicode characters are subdivided into blocks. A block is an interval of
Unicode code points.
The following type is used to represent a block.
- Type: uc_block_t
This data type is a structure type that refers to statically allocated data.
It contains the following fields:
    ucs4_t start;
    ucs4_t end;
    const char *name;
The start field is the first Unicode code point in the block.
The end field is the last Unicode code point in the block.
The name field is the name of the block.
The following function looks up a block.
- Function: const uc_block_t * uc_block (ucs4_t uc)
Returns the block a character belongs to.
The following function views a block as a set of Unicode characters.
- Function: bool uc_is_block (ucs4_t uc, const uc_block_t *block)
Tests whether a Unicode character belongs to a given block.
The following gives a global picture of all blocks.
- Function: void uc_all_blocks (const uc_block_t **blocks, size_t *count)
Gets the list of all blocks. Stores a pointer to an array of all blocks in
*blocks and the length of this array in *count.
The following properties are taken from language standards. The supported
language standards are ISO C 99 and Java.
- Function: bool uc_is_c_whitespace (ucs4_t uc)
Tests whether a Unicode character is considered whitespace in ISO C 99.
- Function: bool uc_is_java_whitespace (ucs4_t uc)
Tests whether a Unicode character is considered whitespace in Java.
The following enumerated values are the possible return values of the functions
uc_c_ident_category and uc_java_ident_category.
- Constant: int UC_IDENTIFIER_START
This return value means that the given character is valid as first or
subsequent character in an identifier.
- Constant: int UC_IDENTIFIER_VALID
This return value means that the given character is valid as subsequent
character only.
- Constant: int UC_IDENTIFIER_INVALID
This return value means that the given character is not valid in an identifier.
- Constant: int UC_IDENTIFIER_IGNORABLE
This return value (only for Java) means that the given character is ignorable.
The following functions determine whether a given character can be a constituent
of an identifier in the given programming language.
- Function: int uc_c_ident_category (ucs4_t uc)
Returns the categorization of a Unicode character with respect to the ISO C 99
identifier syntax.
- Function: int uc_java_ident_category (ucs4_t uc)
Returns the categorization of a Unicode character with respect to the Java
identifier syntax.
The following character classifications mimic those declared in the ISO C
header files <ctype.h> and <wctype.h>. These functions are
deprecated, because this set of functions was designed with ASCII in mind and
cannot reflect the more diverse reality of the Unicode character set. But
they can be a quick-and-dirty porting aid when migrating from wchar_t
APIs to Unicode strings.
- Function: bool uc_is_alnum (ucs4_t uc)
Tests for any character for which uc_is_alpha or uc_is_digit is true.
- Function: bool uc_is_alpha (ucs4_t uc)
Tests for any character for which uc_is_upper or uc_is_lower is
true, or any character that is one of a locale-specific set of characters for
which none of uc_is_cntrl, uc_is_digit, uc_is_punct, or uc_is_space is true.
- Function: bool uc_is_cntrl (ucs4_t uc)
Tests for any control character.
- Function: bool uc_is_digit (ucs4_t uc)
Tests for any character that corresponds to a decimal-digit character.
- Function: bool uc_is_graph (ucs4_t uc)
Tests for any character for which uc_is_print is true and uc_is_space
is false.
- Function: bool uc_is_lower (ucs4_t uc)
Tests for any character that corresponds to a lowercase letter or is one
of a locale-specific set of characters for which none of uc_is_cntrl,
uc_is_digit, uc_is_punct, or uc_is_space is true.
- Function: bool uc_is_print (ucs4_t uc)
Tests for any printing character.
- Function: bool uc_is_punct (ucs4_t uc)
Tests for any printing character that is one of a locale-specific set of
characters for which neither uc_is_space nor uc_is_alnum is true.
- Function: bool uc_is_space (ucs4_t uc)
Tests for any character that corresponds to a locale-specific set of characters
for which none of uc_is_alnum, uc_is_graph, or uc_is_punct is true.
- Function: bool uc_is_upper (ucs4_t uc)
Tests for any character that corresponds to an uppercase letter or is one
of a locale-specific set of characters for which none of uc_is_cntrl,
uc_is_digit, uc_is_punct, or uc_is_space is true.
- Function: bool uc_is_xdigit (ucs4_t uc)
Tests for any character that corresponds to a hexadecimal-digit character.
- Function: bool uc_is_blank (ucs4_t uc)
Tests for any character that corresponds to a standard blank character or
a locale-specific set of characters for which uc_is_alnum is false.
This document was generated by Bruno Haible on October, 16 2024 using texi2html 1.78a.