12. Normalization forms (composition and decomposition) &lt;uninorm.h&gt;

<p>This include file defines functions for transforming Unicode strings to one
of the four normal forms, known as NFC, NFD, NFKC, and NFKD.  These
transformations involve decomposition and &mdash; for NFC and NFKC &mdash; composition
of Unicode characters.

12.1 Decomposition of Unicode characters

<p>The following enumerated values are the possible types of decomposition of a
Unicode character.
<dt><u>Constant:</u> int <b>UC_DECOMP_CANONICAL</b>
<dd><p>Denotes canonical decomposition.

<dt><u>Constant:</u> int <b>UC_DECOMP_FONT</b>
<dd><p>UCD marker: <code>&lt;font&gt;</code>.  Denotes a font variant (e.g. a blackletter form).

<dt><u>Constant:</u> int <b>UC_DECOMP_NOBREAK</b>
<dd><p>UCD marker: <code>&lt;noBreak&gt;</code>.
Denotes a no-break version of a space or hyphen.

<dt><u>Constant:</u> int <b>UC_DECOMP_INITIAL</b>
<dd><p>UCD marker: <code>&lt;initial&gt;</code>.
Denotes an initial presentation form (Arabic).

<dt><u>Constant:</u> int <b>UC_DECOMP_MEDIAL</b>
<dd><p>UCD marker: <code>&lt;medial&gt;</code>.
Denotes a medial presentation form (Arabic).

<dt><u>Constant:</u> int <b>UC_DECOMP_FINAL</b>
<dd><p>UCD marker: <code>&lt;final&gt;</code>.
Denotes a final presentation form (Arabic).

<dt><u>Constant:</u> int <b>UC_DECOMP_ISOLATED</b>
<dd><p>UCD marker: <code>&lt;isolated&gt;</code>.
Denotes an isolated presentation form (Arabic).

<dt><u>Constant:</u> int <b>UC_DECOMP_CIRCLE</b>
<dd><p>UCD marker: <code>&lt;circle&gt;</code>.
Denotes an encircled form.

<dt><u>Constant:</u> int <b>UC_DECOMP_SUPER</b>
<dd><p>UCD marker: <code>&lt;super&gt;</code>.
Denotes a superscript form.

<dt><u>Constant:</u> int <b>UC_DECOMP_SUB</b>
<dd><p>UCD marker: <code>&lt;sub&gt;</code>.
Denotes a subscript form.

<dt><u>Constant:</u> int <b>UC_DECOMP_VERTICAL</b>
<dd><p>UCD marker: <code>&lt;vertical&gt;</code>.
Denotes a vertical layout presentation form.

<dt><u>Constant:</u> int <b>UC_DECOMP_WIDE</b>
<dd><p>UCD marker: <code>&lt;wide&gt;</code>.
Denotes a wide (or zenkaku) compatibility character.

<dt><u>Constant:</u> int <b>UC_DECOMP_NARROW</b>
<dd><p>UCD marker: <code>&lt;narrow&gt;</code>.
Denotes a narrow (or hankaku) compatibility character.

<dt><u>Constant:</u> int <b>UC_DECOMP_SMALL</b>
<dd><p>UCD marker: <code>&lt;small&gt;</code>.
Denotes a small variant form (CNS compatibility).

<dt><u>Constant:</u> int <b>UC_DECOMP_SQUARE</b>
<dd><p>UCD marker: <code>&lt;square&gt;</code>.
Denotes a CJK squared font variant.

<dt><u>Constant:</u> int <b>UC_DECOMP_FRACTION</b>
<dd><p>UCD marker: <code>&lt;fraction&gt;</code>.
Denotes a vulgar fraction form.

<dt><u>Constant:</u> int <b>UC_DECOMP_COMPAT</b>
<dd><p>UCD marker: <code>&lt;compat&gt;</code>.
Denotes an otherwise unspecified compatibility character.

<p>The following constant denotes the maximum size of the decomposition of a single
Unicode character.

<dt><u>Macro:</u> unsigned int <b>UC_DECOMPOSITION_MAX_LENGTH</b>
<dd><p>This macro expands to a constant that is the required size of the buffer passed to
the <code>uc_decomposition</code> and <code>uc_canonical_decomposition</code> functions.

<p>The following functions decompose a Unicode character.

<dt><u>Function:</u> int <b>uc_decomposition</b> (ucs4_t <var>uc</var>, int *<var>decomp_tag</var>, ucs4_t *<var>decomposition</var>)
<dd><p>Returns the character decomposition mapping of the Unicode character <var>uc</var>.
<var>decomposition</var> must point to an array of at least
<code>UC_DECOMPOSITION_MAX_LENGTH</code> <code>ucs4_t</code> elements.
<p>When a decomposition exists, <code><var>decomposition</var>[0..<var>n</var>-1]</code> and
<code>*<var>decomp_tag</var></code> are filled and <var>n</var> is returned.  Otherwise -1 is
returned.

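<dt><u>Function:</u> int <b>uc_canonical_decomposition</b> (ucs4_t <var>uc</var>, ucs4_t *<var>decomposition</var>)
<dd><p>Returns the canonical character decomposition mapping of the Unicode character
<var>uc</var>.  <var>decomposition</var> must point to an array of at least
<code>UC_DECOMPOSITION_MAX_LENGTH</code> <code>ucs4_t</code> elements.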
<p>When a decomposition exists, <code><var>decomposition</var>[0..<var>n</var>-1]</code> is filled
and <var>n</var> is returned.  Otherwise -1 is returned.

12.2 Composition of Unicode characters

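<p>The following function composes a Unicode character from two Unicode
characters.

<dt><u>Function:</u> ucs4_t <b>uc_composition</b> (ucs4_t <var>uc1</var>, ucs4_t <var>uc2</var>)
<dd><p>Attempts to combine the Unicode characters <var>uc1</var>, <var>uc2</var>.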
<var>uc1</var> is known to have canonical combining class 0.
<p>Returns the combination of <var>uc1</var> and <var>uc2</var>, if it exists.
Returns 0 otherwise.
<p>Not all decompositions can be recombined using this function.  See the Unicode
file &lsquo;<tt>CompositionExclusions.txt</tt>&rsquo; for details.

12.3 Normalization of strings

<p>The Unicode standard defines four normalization forms for Unicode strings.
The following type is used to denote a normalization form.

<dt><u>Type:</u> <b>uninorm_t</b>
<dd><p>An object of type <code>uninorm_t</code> denotes a Unicode normalization form.
This is a scalar type; its values can be compared with <code>==</code>.

<p>The following constants denote the four normalization forms.

<dt><u>Macro:</u> uninorm_t <b>UNINORM_NFD</b>
<dd><p>Denotes normalization form D: canonical decomposition.

<dt><u>Macro:</u> uninorm_t <b>UNINORM_NFC</b>
<dd><p>Denotes normalization form C: canonical decomposition, then canonical composition.

<dt><u>Macro:</u> uninorm_t <b>UNINORM_NFKD</b>
<dd><p>Denotes normalization form KD: compatibility decomposition.

<dt><u>Macro:</u> uninorm_t <b>UNINORM_NFKC</b>
<dd><p>Denotes normalization form KC: compatibility decomposition, then canonical composition.

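<p>The following functions operate on <code>uninorm_t</code> objects.

<dt><u>Function:</u> bool <b>uninorm_is_compat_decomposing</b> (uninorm_t <var>nf</var>)
<dd><p>Tests whether the normalization form <var>nf</var> does compatibility decomposition.

<dt><u>Function:</u> bool <b>uninorm_is_composing</b> (uninorm_t <var>nf</var>)
<dd><p>Tests whether the normalization form <var>nf</var> includes canonical composition.

<dt><u>Function:</u> uninorm_t <b>uninorm_decomposing_form</b> (uninorm_t <var>nf</var>)
<dd><p>Returns the decomposing variant of the normalization form <var>nf</var>.
This maps NFC, NFD → NFD and NFKC, NFKD → NFKD.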

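<p>The following functions apply a Unicode normalization form to a Unicode string.

<dt><u>Function:</u> uint8_t * <b>u8_normalize</b> (uninorm_t <var>nf</var>, const uint8_t *<var>s</var>, size_t <var>n</var>, uint8_t *<var>resultbuf</var>, size_t *<var>lengthp</var>)
<dt><u>Function:</u> uint16_t * <b>u16_normalize</b> (uninorm_t <var>nf</var>, const uint16_t *<var>s</var>, size_t <var>n</var>, uint16_t *<var>resultbuf</var>, size_t *<var>lengthp</var>)
<dt><u>Function:</u> uint32_t * <b>u32_normalize</b> (uninorm_t <var>nf</var>, const uint32_t *<var>s</var>, size_t <var>n</var>, uint32_t *<var>resultbuf</var>, size_t *<var>lengthp</var>)
<dd><p>Returns the specified normalization form of a string.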

12.4 Normalizing comparisons

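<p>The following functions compare Unicode strings, ignoring differences in
normalization.

<dt><u>Function:</u> int <b>u8_normcmp</b> (const uint8_t *<var>s1</var>, size_t <var>n1</var>, const uint8_t *<var>s2</var>, size_t <var>n2</var>, uninorm_t <var>nf</var>, int *<var>resultp</var>)
<dt><u>Function:</u> int <b>u16_normcmp</b> (const uint16_t *<var>s1</var>, size_t <var>n1</var>, const uint16_t *<var>s2</var>, size_t <var>n2</var>, uninorm_t <var>nf</var>, int *<var>resultp</var>)
<dt><u>Function:</u> int <b>u32_normcmp</b> (const uint32_t *<var>s1</var>, size_t <var>n1</var>, const uint32_t *<var>s2</var>, size_t <var>n2</var>, uninorm_t <var>nf</var>, int *<var>resultp</var>)
<dd><p>Compares <var>s1</var> and <var>s2</var>, ignoring differences in normalization.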
<p><var>nf</var> must be either <code>UNINORM_NFD</code> or <code>UNINORM_NFKD</code>.
<p>If successful, sets <code>*<var>resultp</var></code> to -1 if <var>s1</var> &lt; <var>s2</var>,
0 if <var>s1</var> = <var>s2</var>, 1 if <var>s1</var> &gt; <var>s2</var>, and returns 0.
Upon failure, returns -1 with <code>errno</code> set.

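<dt><u>Function:</u> char * <b>u8_normxfrm</b> (const uint8_t *<var>s</var>, size_t <var>n</var>, uninorm_t <var>nf</var>, char *<var>resultbuf</var>, size_t *<var>lengthp</var>)
<dt><u>Function:</u> char * <b>u16_normxfrm</b> (const uint16_t *<var>s</var>, size_t <var>n</var>, uninorm_t <var>nf</var>, char *<var>resultbuf</var>, size_t *<var>lengthp</var>)
<dt><u>Function:</u> char * <b>u32_normxfrm</b> (const uint32_t *<var>s</var>, size_t <var>n</var>, uninorm_t <var>nf</var>, char *<var>resultbuf</var>, size_t *<var>lengthp</var>)
<dd><p>Converts the string <var>s</var> of length <var>n</var> to a NUL-terminated byte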
sequence, in such a way that comparing <code>u8_normxfrm (<var>s1</var>)</code> and
<code>u8_normxfrm (<var>s2</var>)</code> with the <code>u8_cmp2</code> function is equivalent to
comparing <var>s1</var> and <var>s2</var> with the <code>u8_normcoll</code> function.
<p><var>nf</var> must be either <code>UNINORM_NFC</code> or <code>UNINORM_NFKC</code>.

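<dt><u>Function:</u> int <b>u8_normcoll</b> (const uint8_t *<var>s1</var>, size_t <var>n1</var>, const uint8_t *<var>s2</var>, size_t <var>n2</var>, uninorm_t <var>nf</var>, int *<var>resultp</var>)
<dt><u>Function:</u> int <b>u16_normcoll</b> (const uint16_t *<var>s1</var>, size_t <var>n1</var>, const uint16_t *<var>s2</var>, size_t <var>n2</var>, uninorm_t <var>nf</var>, int *<var>resultp</var>)
<dt><u>Function:</u> int <b>u32_normcoll</b> (const uint32_t *<var>s1</var>, size_t <var>n1</var>, const uint32_t *<var>s2</var>, size_t <var>n2</var>, uninorm_t <var>nf</var>, int *<var>resultp</var>)
<dd><p>Compares <var>s1</var> and <var>s2</var>, ignoring differences in normalization, using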
the collation rules of the current locale.
<p><var>nf</var> must be either <code>UNINORM_NFC</code> or <code>UNINORM_NFKC</code>.
<p>If successful, sets <code>*<var>resultp</var></code> to -1 if <var>s1</var> &lt; <var>s2</var>,
0 if <var>s1</var> = <var>s2</var>, 1 if <var>s1</var> &gt; <var>s2</var>, and returns 0.
Upon failure, returns -1 with <code>errno</code> set.

12.5 Normalization of streams of Unicode characters

<p>A &ldquo;stream of Unicode characters&rdquo; is essentially a function that accepts a
<code>ucs4_t</code> argument repeatedly, optionally combined with a function that
&ldquo;flushes&rdquo; the stream.

<dt><u>Type:</u> <b>struct uninorm_filter</b>
<dd><p>This is the data type of a stream of Unicode characters that normalizes its
input according to a given normalization form and passes the normalized
character sequence to the encapsulated stream of Unicode characters.

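<dt><u>Function:</u> struct uninorm_filter * <b>uninorm_filter_create</b> (uninorm_t <var>nf</var>, int (*<var>stream_func</var>) (void *<var>stream_data</var>, ucs4_t <var>uc</var>), void *<var>stream_data</var>)
<dd><p>Creates and returns a normalization filter for Unicode characters.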
<p>The pair (<var>stream_func</var>, <var>stream_data</var>) is the encapsulated stream.
<code><var>stream_func</var> (<var>stream_data</var>, <var>uc</var>)</code> receives the Unicode
character <var>uc</var> and returns 0 if successful, or -1 with <code>errno</code> set
upon failure.
<p>Returns the new filter, or NULL with <code>errno</code> set upon failure.

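<dt><u>Function:</u> int <b>uninorm_filter_write</b> (struct uninorm_filter *<var>filter</var>, ucs4_t <var>uc</var>)
<dd><p>Writes a Unicode character into a normalizing filter.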
Returns 0 if successful, or -1 with <code>errno</code> set upon failure.

<dt><u>Function:</u> int <b>uninorm_filter_flush</b> (struct uninorm_filter *<var>filter</var>)
<dd><p>Brings data buffered in the filter to its destination, the encapsulated stream.
<p>Returns 0 if successful, or -1 with <code>errno</code> set upon failure.
<p>Note: if additional characters are written into the filter after calling this
function, the resulting character sequence in the encapsulated stream
will not necessarily be normalized.

<dt><u>Function:</u> int <b>uninorm_filter_free</b> (struct uninorm_filter *<var>filter</var>)
<dd><p>Brings data buffered in the filter to its destination, the encapsulated stream,
then closes and frees the filter.
<p>Returns 0 if successful, or -1 with <code>errno</code> set upon failure.
