From a9a31b1de5776a3b08a82101a4fa711294f0dd1d Mon Sep 17 00:00:00 2001
From: "Manuel A. Fernandez Montecelo"
Date: Fri, 27 May 2016 14:28:30 +0100
Subject: Imported Upstream version 0.9.6+really0.9.3
---
 doc/libunistring_12.html | 475 ++++++++++++++++++++++++++++++++++++++---------
 1 file changed, 391 insertions(+), 84 deletions(-)
(limited to 'doc/libunistring_12.html')

diff --git a/doc/libunistring_12.html b/doc/libunistring_12.html
index f4bee9d3..1a4db370 100644
--- a/doc/libunistring_12.html
+++ b/doc/libunistring_12.html
@@ -1,6 +1,6 @@
- +
-GNU libunistring: 12. Line breaking <unilbrk.h>
+GNU libunistring: 12. Normalization forms (composition and decomposition) <uninorm.h>
- - + +
@@ -42,7 +42,7 @@ ul.toc {list-style: none}
- +
@@ -51,133 +51,440 @@ ul.toc {list-style: none}
- +

- - -

12. Line breaking <unilbrk.h>

+ + +

12. Normalization forms (composition and decomposition) <uninorm.h>

+ +

This include file defines functions for transforming Unicode strings to one +of the four normal forms, known as NFC, NFD, NFKC, NFKD. These +transformations involve decomposition and, for NFC and NFKC, composition +of Unicode characters. +

+ +
+ + +

12.1 Decomposition of Unicode characters

+ +

The following enumerated values are the possible types of decomposition of a +Unicode character. +

+
+
Constant: int UC_DECOMP_CANONICAL + +
+

Denotes canonical decomposition. +

+ +
+
Constant: int UC_DECOMP_FONT + +
+

UCD marker: <font>. Denotes a font variant (e.g. a blackletter form). +

+ +
+
Constant: int UC_DECOMP_NOBREAK + +
+

UCD marker: <noBreak>. +Denotes a no-break version of a space or hyphen. +

+ +
+
Constant: int UC_DECOMP_INITIAL + +
+

UCD marker: <initial>. +Denotes an initial presentation form (Arabic). +

+ +
+
Constant: int UC_DECOMP_MEDIAL + +
+

UCD marker: <medial>. +Denotes a medial presentation form (Arabic). +

+ +
+
Constant: int UC_DECOMP_FINAL + +
+

UCD marker: <final>. +Denotes a final presentation form (Arabic). +

+ +
+
Constant: int UC_DECOMP_ISOLATED + +
+

UCD marker: <isolated>. +Denotes an isolated presentation form (Arabic). +

+ +
+
Constant: int UC_DECOMP_CIRCLE + +
+

UCD marker: <circle>. +Denotes an encircled form. +

+ +
+
Constant: int UC_DECOMP_SUPER + +
+

UCD marker: <super>. +Denotes a superscript form. +

+ +
+
Constant: int UC_DECOMP_SUB + +
+

UCD marker: <sub>. +Denotes a subscript form. +

+ +
+
Constant: int UC_DECOMP_VERTICAL + +
+

UCD marker: <vertical>. +Denotes a vertical layout presentation form. +

+ +
+
Constant: int UC_DECOMP_WIDE + +
+

UCD marker: <wide>. +Denotes a wide (or zenkaku) compatibility character. +

+ +
+
Constant: int UC_DECOMP_NARROW + +
+

UCD marker: <narrow>. +Denotes a narrow (or hankaku) compatibility character. +

+ +
+
Constant: int UC_DECOMP_SMALL + +
+

UCD marker: <small>. +Denotes a small variant form (CNS compatibility). +

+ +
+
Constant: int UC_DECOMP_SQUARE + +
+

UCD marker: <square>. +Denotes a CJK squared font variant. +

+ +
+
Constant: int UC_DECOMP_FRACTION + +
+

UCD marker: <fraction>. +Denotes a vulgar fraction form. +

+ +
+
Constant: int UC_DECOMP_COMPAT + +
+

UCD marker: <compat>. +Denotes an otherwise unspecified compatibility character. +

+ +

The following constant denotes the maximum size of decomposition of a single +Unicode character. +

+
+
Macro: unsigned int UC_DECOMPOSITION_MAX_LENGTH + +
+

This macro expands to a constant that is the required size of buffer passed to +the uc_decomposition and uc_canonical_decomposition functions. +

-

This include file declares functions for determining where in a string -line breaks could or should be introduced, in order to make the displayed -string fit into a column of given width. +

The following functions decompose a Unicode character. +

+
+
Function: int uc_decomposition (ucs4_t uc, int *decomp_tag, ucs4_t *decomposition) + +
+

Returns the character decomposition mapping of the Unicode character uc. +decomposition must point to an array of at least +UC_DECOMPOSITION_MAX_LENGTH ucs4_t elements.

-

These functions are locale dependent. The encoding argument identifies -the encoding (e.g. "ISO-8859-2" for Polish). +

When a decomposition exists, decomposition[0..n-1] and +*decomp_tag are filled and n is returned. Otherwise -1 is +returned. +

+ +
+
Function: int uc_canonical_decomposition (ucs4_t uc, ucs4_t *decomposition) + +
+

Returns the canonical character decomposition mapping of the Unicode character +uc. decomposition must point to an array of at least +UC_DECOMPOSITION_MAX_LENGTH ucs4_t elements.

-

The following enumerated values indicate whether, at a given position, a line -break is possible or not. Given an string s as an array -s[0..n-1] and a position i, the values have the -following meanings: +

When a decomposition exists, decomposition[0..n-1] is filled +and n is returned. Otherwise -1 is returned. +

+ +
+ + +

12.2 Composition of Unicode characters

+ +

The following function composes a Unicode character from two Unicode +characters.

-
Constant: int UC_BREAK_MANDATORY - +
Function: ucs4_t uc_composition (ucs4_t uc1, ucs4_t uc2) +
-

This value indicates that s[i] is a line break character. +

Attempts to combine the Unicode characters uc1, uc2. +uc1 is known to have canonical combining class 0. +

+

Returns the combination of uc1 and uc2, if it exists. +Returns 0 otherwise. +

+

Not all decompositions can be recombined using this function. See the Unicode +file ‘CompositionExclusions.txt’ for details.

+
+ + +

12.3 Normalization of strings

+ +

The Unicode standard defines four normalization forms for Unicode strings. +The following type is used to denote a normalization form. +

-
Constant: int UC_BREAK_POSSIBLE - +
Type: uninorm_t +
-

This value indicates that a line break may be inserted between -s[i-1] and s[i]. +

An object of type uninorm_t denotes a Unicode normalization form. +This is a scalar type; its values can be compared with ==.

+

The following constants denote the four normalization forms. +

-
Constant: int UC_BREAK_HYPHENATION - +
Macro: uninorm_t UNINORM_NFD +
-

This value indicates that a hyphen and a line break may be inserted between -s[i-1] and s[i]. But beware of language -dependent hyphenation rules. +

Denotes Normalization form D: canonical decomposition.

-
Constant: int UC_BREAK_PROHIBITED - +
Macro: uninorm_t UNINORM_NFC +
-

This value indicates that s[i-1] and s[i] -must not be separated. +

Normalization form C: canonical decomposition, then canonical composition.

-
Constant: int UC_BREAK_UNDEFINED - +
Macro: uninorm_t UNINORM_NFKD +
-

This value is not used as a return value; rather, in the overriding argument of -the u*_width_linebreaks functions, it indicates the absence of an -override. +

Normalization form KD: compatibility decomposition.

-

The following functions determine the positions at which line breaks are -possible. +

+
Macro: uninorm_t UNINORM_NFKC + +
+

Normalization form KC: compatibility decomposition, then canonical composition. +

+ +

The following functions operate on uninorm_t objects.

-
Function: void u8_possible_linebreaks (const uint8_t *s, size_t n, const char *encoding, char *p) - +
Function: bool uninorm_is_compat_decomposing (uninorm_t nf) +
-
Function: void u16_possible_linebreaks (const uint16_t *s, size_t n, const char *encoding, char *p) - +

Tests whether the normalization form nf does compatibility decomposition. +

+ +
+
Function: bool uninorm_is_composing (uninorm_t nf) +
-
Function: void u32_possible_linebreaks (const uint32_t *s, size_t n, const char *encoding, char *p) - +

Tests whether the normalization form nf includes canonical composition. +

+ +
+
Function: uninorm_t uninorm_decomposing_form (uninorm_t nf) +
-
Function: void ulc_possible_linebreaks (const char *s, size_t n, const char *encoding, char *p) - +

Returns the decomposing variant of the normalization form nf. +This maps NFC,NFD → NFD and NFKC,NFKD → NFKD. +

+ +

The following functions apply a Unicode normalization form to a Unicode string. +

+
+
Function: uint8_t * u8_normalize (uninorm_t nf, const uint8_t *s, size_t n, uint8_t *resultbuf, size_t *lengthp) + +
+
Function: uint16_t * u16_normalize (uninorm_t nf, const uint16_t *s, size_t n, uint16_t *resultbuf, size_t *lengthp) + +
+
Function: uint32_t * u32_normalize (uninorm_t nf, const uint32_t *s, size_t n, uint32_t *resultbuf, size_t *lengthp) +
-

Determines the line break points in s, and stores the result at -p[0..n-1]. Every p[i] is assigned one of -the values UC_BREAK_MANDATORY, UC_BREAK_POSSIBLE, -UC_BREAK_HYPHENATION, UC_BREAK_PROHIBITED. +

Returns the specified normalization form of a string.

-

The following functions determine where line breaks should be inserted so that -each line fits in a given width, when output to a device that uses -non-proportional fonts. +


+ + +

12.4 Normalizing comparisons

+ +

The following functions compare Unicode strings, ignoring differences in +normalization.

-
Function: int u8_width_linebreaks (const uint8_t *s, size_t n, int width, int start_column, int at_end_columns, const char *override, const char *encoding, char *p) - +
Function: int u8_normcmp (const uint8_t *s1, size_t n1, const uint8_t *s2, size_t n2, uninorm_t nf, int *resultp) +
-
Function: int u16_width_linebreaks (const uint16_t *s, size_t n, int width, int start_column, int at_end_columns, const char *override, const char *encoding, char *p) - +
Function: int u16_normcmp (const uint16_t *s1, size_t n1, const uint16_t *s2, size_t n2, uninorm_t nf, int *resultp) +
-
Function: int u32_width_linebreaks (const uint32_t *s, size_t n, int width, int start_column, int at_end_columns, const char *override, const char *encoding, char *p) - +
Function: int u32_normcmp (const uint32_t *s1, size_t n1, const uint32_t *s2, size_t n2, uninorm_t nf, int *resultp) +
-
Function: int ulc_width_linebreaks (const char *s, size_t n, int width, int start_column, int at_end_columns, const char *override, const char *encoding, char *p) - +

Compares s1 and s2, ignoring differences in normalization. +

+

nf must be either UNINORM_NFD or UNINORM_NFKD. +

+

If successful, sets *resultp to -1 if s1 < s2, +0 if s1 = s2, 1 if s1 > s2, and returns 0. +Upon failure, returns -1 with errno set. +

+ + + +
+
Function: char * u8_normxfrm (const uint8_t *s, size_t n, uninorm_t nf, char *resultbuf, size_t *lengthp) + +
+
Function: char * u16_normxfrm (const uint16_t *s, size_t n, uninorm_t nf, char *resultbuf, size_t *lengthp) + +
+
Function: char * u32_normxfrm (const uint32_t *s, size_t n, uninorm_t nf, char *resultbuf, size_t *lengthp) + +
+

Converts the string s of length n to a NUL-terminated byte +sequence, in such a way that comparing u8_normxfrm (s1) and +u8_normxfrm (s2) with the u8_cmp2 function is equivalent to +comparing s1 and s2 with the u8_normcoll function. +

+

nf must be either UNINORM_NFC or UNINORM_NFKC. +

+ +
+
Function: int u8_normcoll (const uint8_t *s1, size_t n1, const uint8_t *s2, size_t n2, uninorm_t nf, int *resultp) + +
+
Function: int u16_normcoll (const uint16_t *s1, size_t n1, const uint16_t *s2, size_t n2, uninorm_t nf, int *resultp) + +
+
Function: int u32_normcoll (const uint32_t *s1, size_t n1, const uint32_t *s2, size_t n2, uninorm_t nf, int *resultp) + +
+

Compares s1 and s2, ignoring differences in normalization, using +the collation rules of the current locale. +

+

nf must be either UNINORM_NFC or UNINORM_NFKC. +

+

If successful, sets *resultp to -1 if s1 < s2, +0 if s1 = s2, 1 if s1 > s2, and returns 0. +Upon failure, returns -1 with errno set. +

+ +
+ + +

12.5 Normalization of streams of Unicode characters

+ +

A “stream of Unicode characters” is essentially a function that accepts a +ucs4_t argument repeatedly, optionally combined with a function that +“flushes” the stream. +

+
+
Type: struct uninorm_filter + +
+

This is the data type of a stream of Unicode characters that normalizes its +input according to a given normalization form and passes the normalized +character sequence to the encapsulated stream of Unicode characters. +

+ +
+
Function: struct uninorm_filter * uninorm_filter_create (uninorm_t nf, int (*stream_func) (void *stream_data, ucs4_t uc), void *stream_data) +
-

Chooses the best line breaks, assuming that every character occupies a width -given by the uc_width function (see Display width <uniwidth.h>). +

Creates and returns a normalization filter for Unicode characters.

-

The string is s[0..n-1]. +

The pair (stream_func, stream_data) is the encapsulated stream. +stream_func (stream_data, uc) receives the Unicode +character uc and returns 0 if successful, or -1 with errno set +upon failure.

-

The maximum number of columns per line is given as width. -The starting column of the string is given as start_column. -If the algorithm shall keep room after the last piece, this amount of room can -be given as at_end_columns. +

Returns the new filter, or NULL with errno set upon failure. +

+ +
+
Function: int uninorm_filter_write (struct uninorm_filter *filter, ucs4_t uc) + +
+

Stuffs a Unicode character into a normalizing filter. +Returns 0 if successful, or -1 with errno set upon failure. +

+ +
+
Function: int uninorm_filter_flush (struct uninorm_filter *filter) + +
+

Brings data buffered in the filter to its destination, the encapsulated stream.

-

override is an optional override; if -override[i] != UC_BREAK_UNDEFINED, -override[i] takes precedence over p[i] -as returned by the u*_possible_linebreaks function. +

Returns 0 if successful, or -1 with errno set upon failure.

-

The given encoding is used for disambiguating widths in uc_width. +

Note! If after calling this function, additional characters are written +into the filter, the resulting character sequence in the encapsulated stream +will not necessarily be normalized. +

+ +
+
Function: int uninorm_filter_free (struct uninorm_filter *filter) + +
+

Brings data buffered in the filter to its destination, the encapsulated stream, +then closes and frees the filter.

-

Returns the column after the end of the string, and stores the result at -p[0..n-1]. Every p[i] is assigned one of -the values UC_BREAK_MANDATORY, UC_BREAK_POSSIBLE, -UC_BREAK_HYPHENATION, UC_BREAK_PROHIBITED. Here the value -UC_BREAK_POSSIBLE indicates that a line break should be inserted. +

Returns 0 if successful, or -1 with errno set upon failure.


- + @@ -186,12 +493,12 @@ the values UC_BREAK_MANDATORY, UC_BREAK_POSSIBLE, - +

- This document was generated by Daiki Ueno on July, 8 2015 using texi2html 1.78a. + This document was generated by Bruno Haible on March, 30 2010 using texi2html 1.78a.