From 5f2b09982312c98863eb9a8dfe2c608b81f58259 Mon Sep 17 00:00:00 2001
From: "Manuel A. Fernandez Montecelo"
Date: Thu, 26 May 2016 16:48:15 +0100
Subject: Imported Upstream version 0.9.6

---
 doc/libunistring_12.html | 475 +++++++++-------------------------------------
 1 file changed, 84 insertions(+), 391 deletions(-)

(limited to 'doc/libunistring_12.html')

diff --git a/doc/libunistring_12.html b/doc/libunistring_12.html
index 1a4db370..f4bee9d3 100644
--- a/doc/libunistring_12.html
+++ b/doc/libunistring_12.html
@@ -1,6 +1,6 @@
-
+
-GNU libunistring: 12. Normalization forms (composition and decomposition) <uninorm.h>
+GNU libunistring: 12. Line breaking <unilbrk.h>
-
-
+
+
@@ -42,7 +42,7 @@ ul.toc {list-style: none}
-
+
@@ -51,440 +51,133 @@ ul.toc {list-style: none}
-
+


- - -

12. Normalization forms (composition and decomposition) `<uninorm.h>`

- -

This include file defines functions for transforming Unicode strings to one -of the four normal forms, known as NFC, NFD, NFKC, NFKD. These -transformations involve decomposition and, for NFC and NFKC, composition -of Unicode characters. -

- -

- - -

12.1 Decomposition of Unicode characters

- -

The following enumerated values are the possible types of decomposition of a -Unicode character. -

Constant: int UC_DECOMP_CANONICAL - -: Denotes canonical decomposition. -

- -

Constant: int UC_DECOMP_FONT - -: UCD marker: <font>. Denotes a font variant (e.g. a blackletter form). -

- -

Constant: int UC_DECOMP_NOBREAK - -: UCD marker: <noBreak>. -Denotes a no-break version of a space or hyphen. -

- -

Constant: int UC_DECOMP_INITIAL - -: UCD marker: <initial>. -Denotes an initial presentation form (Arabic). -

- -

Constant: int UC_DECOMP_MEDIAL - -: UCD marker: <medial>. -Denotes a medial presentation form (Arabic). -

- -

Constant: int UC_DECOMP_FINAL - -: UCD marker: <final>. -Denotes a final presentation form (Arabic). -

- -

Constant: int UC_DECOMP_ISOLATED - -: UCD marker: <isolated>. -Denotes an isolated presentation form (Arabic). -

- -

Constant: int UC_DECOMP_CIRCLE - -: UCD marker: <circle>. -Denotes an encircled form. -

- -

Constant: int UC_DECOMP_SUPER - -: UCD marker: <super>. -Denotes a superscript form. -

- -

Constant: int UC_DECOMP_SUB - -: UCD marker: <sub>. -Denotes a subscript form. -

- -

Constant: int UC_DECOMP_VERTICAL - -: UCD marker: <vertical>. -Denotes a vertical layout presentation form. -

- -

Constant: int UC_DECOMP_WIDE - -: UCD marker: <wide>. -Denotes a wide (or zenkaku) compatibility character. -

- -

Constant: int UC_DECOMP_NARROW - -: UCD marker: <narrow>. -Denotes a narrow (or hankaku) compatibility character. -

- -

Constant: int UC_DECOMP_SMALL - -: UCD marker: <small>. -Denotes a small variant form (CNS compatibility). -

- -

Constant: int UC_DECOMP_SQUARE - -: UCD marker: <square>. -Denotes a CJK squared font variant. -

- -

Constant: int UC_DECOMP_FRACTION - -: UCD marker: <fraction>. -Denotes a vulgar fraction form. -

- -

Constant: int UC_DECOMP_COMPAT - -: UCD marker: <compat>. -Denotes an otherwise unspecified compatibility character. -

- -

The following constant denotes the maximum size of decomposition of a single -Unicode character. -

Macro: unsigned int UC_DECOMPOSITION_MAX_LENGTH - -: This macro expands to a constant that is the required size of buffer passed to -the uc_decomposition and uc_canonical_decomposition functions. -

+ + +

12. Line breaking `<unilbrk.h>`

The following functions decompose a Unicode character. -

Function: int uc_decomposition (ucs4_t uc, int *decomp_tag, ucs4_t *decomposition) - -

Returns the character decomposition mapping of the Unicode character uc. -decomposition must point to an array of at least -UC_DECOMPOSITION_MAX_LENGTH ucs4_t elements. +

This include file declares functions for determining where in a string +line breaks could or should be introduced, in order to make the displayed +string fit into a column of given width.

When a decomposition exists, decomposition[0..n-1] and -*decomp_tag are filled and n is returned. Otherwise -1 is -returned. -

- -

Function: int uc_canonical_decomposition (ucs4_t uc, ucs4_t *decomposition) - -

Returns the canonical character decomposition mapping of the Unicode character -uc. decomposition must point to an array of at least -UC_DECOMPOSITION_MAX_LENGTH ucs4_t elements. +

These functions are locale dependent. The encoding argument identifies +the encoding (e.g. "ISO-8859-2" for Polish).

When a decomposition exists, decomposition[0..n-1] is filled -and n is returned. Otherwise -1 is returned. -

- -

- - -

12.2 Composition of Unicode characters

- -

The following function composes a Unicode character from two Unicode -characters. +

The following enumerated values indicate whether, at a given position, a line +break is possible or not. Given a string s as an array +s[0..n-1] and a position i, the values have the +following meanings:

Function: ucs4_t uc_composition (ucs4_t uc1, ucs4_t uc2) - +

Constant: int UC_BREAK_MANDATORY +

Attempts to combine the Unicode characters uc1, uc2. -uc1 is known to have canonical combining class 0. -

Returns the combination of uc1 and uc2, if it exists. -Returns 0 otherwise. -

Not all decompositions can be recombined using this function. See the Unicode -file ‘CompositionExclusions.txt’ for details. +

This value indicates that s[i] is a line break character.

- - -

12.3 Normalization of strings

- -

The Unicode standard defines four normalization forms for Unicode strings. -The following type is used to denote a normalization form. -

Type: uninorm_t - +
Constant: int UC_BREAK_POSSIBLE +: An object of type uninorm_t denotes a Unicode normalization form. -This is a scalar type; its values can be compared with ==. +; This value indicates that a line break may be inserted between +s[i-1] and s[i].

The following constants denote the four normalization forms. -

Macro: uninorm_t UNINORM_NFD - +
Constant: int UC_BREAK_HYPHENATION +: Denotes Normalization form D: canonical decomposition. +; This value indicates that a hyphen and a line break may be inserted between +s[i-1] and s[i]. But beware of language +dependent hyphenation rules.

Macro: uninorm_t UNINORM_NFC - +
Constant: int UC_BREAK_PROHIBITED +: Normalization form C: canonical decomposition, then canonical composition. +; This value indicates that s[i-1] and s[i] +must not be separated.

Macro: uninorm_t UNINORM_NFKD - +
Constant: int UC_BREAK_UNDEFINED +: Normalization form KD: compatibility decomposition. +; This value is not used as a return value; rather, in the overriding argument of +the u*_width_linebreaks functions, it indicates the absence of an +override.

Macro: uninorm_t UNINORM_NFKC - -: Normalization form KC: compatibility decomposition, then canonical composition. -

- -

The following functions operate on uninorm_t objects. +

The following functions determine the positions at which line breaks are +possible.

Function: bool uninorm_is_compat_decomposing (uninorm_t nf) - +
Function: void u8_possible_linebreaks (const uint8_t *s, size_t n, const char *encoding, char *p) +: Tests whether the normalization form nf does compatibility decomposition. -

- -

Function: bool uninorm_is_composing (uninorm_t nf) - +
Function: void u16_possible_linebreaks (const uint16_t *s, size_t n, const char *encoding, char *p) +: Tests whether the normalization form nf includes canonical composition. -

- -

Function: uninorm_t uninorm_decomposing_form (uninorm_t nf) - +
Function: void u32_possible_linebreaks (const uint32_t *s, size_t n, const char *encoding, char *p) +: Returns the decomposing variant of the normalization form nf. -This maps NFC,NFD → NFD and NFKC,NFKD → NFKD. -

- -

The following functions apply a Unicode normalization form to a Unicode string. -

Function: uint8_t * u8_normalize (uninorm_t nf, const uint8_t *s, size_t n, uint8_t *resultbuf, size_t *lengthp) - -
Function: uint16_t * u16_normalize (uninorm_t nf, const uint16_t *s, size_t n, uint16_t *resultbuf, size_t *lengthp) - -
Function: uint32_t * u32_normalize (uninorm_t nf, const uint32_t *s, size_t n, uint32_t *resultbuf, size_t *lengthp) - +
Function: void ulc_possible_linebreaks (const char *s, size_t n, const char *encoding, char *p) +: Returns the specified normalization form of a string. +; Determines the line break points in s, and stores the result at +p[0..n-1]. Every p[i] is assigned one of +the values UC_BREAK_MANDATORY, UC_BREAK_POSSIBLE, +UC_BREAK_HYPHENATION, UC_BREAK_PROHIBITED.
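As an illustration of how a caller might read the p[0..n-1] array filled in by the u*_possible_linebreaks functions, the following is a minimal sketch that collects the positions marked as mandatory or possible breaks. It deliberately does not call the library: the DEMO_BREAK_* constants and the function demo_collect_breaks are hypothetical stand-ins, and their numeric values are assumptions. Real code would include `<unilbrk.h>` and compare against the UC_BREAK_* constants instead.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the UC_BREAK_* constants from <unilbrk.h>.
   The numeric values are assumptions made for this sketch only. */
enum {
    DEMO_BREAK_PROHIBITED = 1,
    DEMO_BREAK_MANDATORY,
    DEMO_BREAK_POSSIBLE,
    DEMO_BREAK_HYPHENATION
};

/* Stores in idx[] every position i for which p[i] is a mandatory or a
   possible break, in increasing order, and returns how many were stored.
   idx must have room for n entries. */
size_t demo_collect_breaks(const char *p, size_t n, size_t *idx)
{
    size_t count = 0;
    assert(p != NULL && idx != NULL);
    for (size_t i = 0; i < n; i++) {
        if (p[i] == DEMO_BREAK_MANDATORY || p[i] == DEMO_BREAK_POSSIBLE)
            idx[count++] = i;
    }
    return count;
}
```

Hyphenation points are left out here because, as noted above, they depend on language-specific rules; a caller that wants them would add DEMO_BREAK_HYPHENATION to the test.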

- - -

12.4 Normalizing comparisons

- -

The following functions compare Unicode strings, ignoring differences in -normalization. +

The following functions determine where line breaks should be inserted so that +each line fits in a given width, when output to a device that uses +non-proportional fonts.

Function: int u8_normcmp (const uint8_t *s1, size_t n1, const uint8_t *s2, size_t n2, uninorm_t nf, int *resultp) - +

Function: int u8_width_linebreaks (const uint8_t *s, size_t n, int width, int start_column, int at_end_columns, const char *override, const char *encoding, char *p) +

Function: int u16_normcmp (const uint16_t *s1, size_t n1, const uint16_t *s2, size_t n2, uninorm_t nf, int *resultp) - +

Function: int u16_width_linebreaks (const uint16_t *s, size_t n, int width, int start_column, int at_end_columns, const char *override, const char *encoding, char *p) +

Function: int u32_normcmp (const uint32_t *s1, size_t n1, const uint32_t *s2, size_t n2, uninorm_t nf, int *resultp) - +

Function: int u32_width_linebreaks (const uint32_t *s, size_t n, int width, int start_column, int at_end_columns, const char *override, const char *encoding, char *p) +

Compares s1 and s2, ignoring differences in normalization. -

nf must be either UNINORM_NFD or UNINORM_NFKD. -

If successful, sets *resultp to -1 if s1 < s2, -0 if s1 = s2, 1 if s1 > s2, and returns 0. -Upon failure, returns -1 with errno set. -

- - - -

Function: char * u8_normxfrm (const uint8_t *s, size_t n, uninorm_t nf, char *resultbuf, size_t *lengthp) - -

Function: char * u16_normxfrm (const uint16_t *s, size_t n, uninorm_t nf, char *resultbuf, size_t *lengthp) - -

Function: char * u32_normxfrm (const uint32_t *s, size_t n, uninorm_t nf, char *resultbuf, size_t *lengthp) - -

Converts the string s of length n to a NUL-terminated byte -sequence, in such a way that comparing u8_normxfrm (s1) and -u8_normxfrm (s2) with the u8_cmp2 function is equivalent to -comparing s1 and s2 with the u8_normcoll function. -

nf must be either UNINORM_NFC or UNINORM_NFKC. -

- -

Function: int u8_normcoll (const uint8_t *s1, size_t n1, const uint8_t *s2, size_t n2, uninorm_t nf, int *resultp) - -

Function: int u16_normcoll (const uint16_t *s1, size_t n1, const uint16_t *s2, size_t n2, uninorm_t nf, int *resultp) - -

Function: int u32_normcoll (const uint32_t *s1, size_t n1, const uint32_t *s2, size_t n2, uninorm_t nf, int *resultp) - -

Compares s1 and s2, ignoring differences in normalization, using -the collation rules of the current locale. -

nf must be either UNINORM_NFC or UNINORM_NFKC. -

If successful, sets *resultp to -1 if s1 < s2, -0 if s1 = s2, 1 if s1 > s2, and returns 0. -Upon failure, returns -1 with errno set. -

- -

- - -

12.5 Normalization of streams of Unicode characters

- -

A “stream of Unicode characters” is essentially a function that accepts a -ucs4_t argument repeatedly, optionally combined with a function that -“flushes” the stream. -

Type: struct uninorm_filter - -: This is the data type of a stream of Unicode characters that normalizes its -input according to a given normalization form and passes the normalized -character sequence to the encapsulated stream of Unicode characters. -

- -

Function: struct uninorm_filter * uninorm_filter_create (uninorm_t nf, int (*stream_func) (void *stream_data, ucs4_t uc), void *stream_data) - +

Function: int ulc_width_linebreaks (const char *s, size_t n, int width, int start_column, int at_end_columns, const char *override, const char *encoding, char *p) +

Creates and returns a normalization filter for Unicode characters. +

Chooses the best line breaks, assuming that every character occupies a width +given by the uc_width function (see Display width <uniwidth.h>).

The pair (stream_func, stream_data) is the encapsulated stream. -stream_func (stream_data, uc) receives the Unicode -character uc and returns 0 if successful, or -1 with errno set -upon failure. +

The string is s[0..n-1].

Returns the new filter, or NULL with errno set upon failure. -

- -

Function: int uninorm_filter_write (struct uninorm_filter *filter, ucs4_t uc) - -: Stuffs a Unicode character into a normalizing filter. -Returns 0 if successful, or -1 with errno set upon failure. -

- -

Function: int uninorm_filter_flush (struct uninorm_filter *filter) - -

Brings data buffered in the filter to its destination, the encapsulated stream. +

The maximum number of columns per line is given as width. +The starting column of the string is given as start_column. +If the algorithm shall keep room after the last piece, this amount of room can +be given as at_end_columns.

Returns 0 if successful, or -1 with errno set upon failure. +

override is an optional override; if +override[i] != UC_BREAK_UNDEFINED, +override[i] takes precedence over p[i] +as returned by the u*_possible_linebreaks function.
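The precedence rule for override can be spelled out as a short sketch. This is not the library's implementation, only the rule as stated in the text: wherever the override entry is not the "undefined" value, it replaces the computed entry. The names DEMO_BREAK_* and demo_apply_override are hypothetical, the numeric values are assumptions, and treating a NULL pointer as "no override" is likewise an assumption made for the sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the UC_BREAK_* constants from <unilbrk.h>.
   The numeric values are assumptions made for this sketch only. */
enum {
    DEMO_BREAK_UNDEFINED = 0,
    DEMO_BREAK_PROHIBITED,
    DEMO_BREAK_MANDATORY,
    DEMO_BREAK_POSSIBLE
};

/* Applies the documented precedence rule in place: every override[i]
   that is not DEMO_BREAK_UNDEFINED replaces p[i]. A NULL override is
   treated as "no override" (an assumption of this sketch). */
void demo_apply_override(const char *override, char *p, size_t n)
{
    assert(p != NULL);
    if (override == NULL)
        return;
    for (size_t i = 0; i < n; i++) {
        if (override[i] != DEMO_BREAK_UNDEFINED)
            p[i] = override[i];
    }
}
```

A caller would typically fill the override array with the "undefined" value and then set only the few positions where a break must be forced or forbidden.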

Note! If after calling this function, additional characters are written -into the filter, the resulting character sequence in the encapsulated stream -will not necessarily be normalized. -

- -

Function: int uninorm_filter_free (struct uninorm_filter *filter) - -

Brings data buffered in the filter to its destination, the encapsulated stream, -then closes and frees the filter. +

The given encoding is used for disambiguating widths in uc_width.

Returns 0 if successful, or -1 with errno set upon failure. +

Returns the column after the end of the string, and stores the result at +p[0..n-1]. Every p[i] is assigned one of +the values UC_BREAK_MANDATORY, UC_BREAK_POSSIBLE, +UC_BREAK_HYPHENATION, UC_BREAK_PROHIBITED. Here the value +UC_BREAK_POSSIBLE indicates that a line break should be inserted.
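Once p[0..n-1] has been filled in by one of the u*_width_linebreaks functions, the caller still has to emit the broken text. The following is a minimal sketch of that step, under the reading given above that a "possible" mark at position i means a line break goes between s[i-1] and s[i]. It does not call the library; DEMO_BREAK_* and demo_apply_breaks are hypothetical names with assumed numeric values, standing in for the UC_BREAK_* constants from `<unilbrk.h>`.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the UC_BREAK_* constants from <unilbrk.h>.
   The numeric values are assumptions made for this sketch only. */
enum {
    DEMO_BREAK_PROHIBITED = 1,
    DEMO_BREAK_MANDATORY,
    DEMO_BREAK_POSSIBLE
};

/* Copies s[0..n-1] to out, writing a newline before s[i] wherever
   p[i] == DEMO_BREAK_POSSIBLE. Mandatory breaks need no extra newline,
   because there s[i] is itself a line break character and is copied
   through unchanged. out must have room for 2*n + 1 bytes.
   Returns the number of bytes written, not counting the final NUL. */
size_t demo_apply_breaks(const char *s, size_t n, const char *p, char *out)
{
    size_t k = 0;
    assert(s != NULL && p != NULL && out != NULL);
    for (size_t i = 0; i < n; i++) {
        if (p[i] == DEMO_BREAK_POSSIBLE)
            out[k++] = '\n';
        out[k++] = s[i];
    }
    out[k] = '\0';
    return k;
}
```

The sketch keeps any space that precedes an inserted newline; whether to trim such trailing spaces is a presentation choice left to the caller.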

- + @@ -493,12 +186,12 @@ then closes and frees the filter. - +


- This document was generated by Bruno Haible on March, 30 2010 using texi2html 1.78a. + This document was generated by Daiki Ueno on July, 8 2015 using texi2html 1.78a.
-- cgit v1.2.3