author Andreas Rottmann <a.rottmann@gmx.at> 2009-09-14 12:32:44 +0200
committer Andreas Rottmann <a.rottmann@gmx.at> 2009-09-14 12:32:44 +0200
commit fa095a4504cbe668e4244547e2c141597bea4ecf (patch)
tree 06135820a286ffec47804e75fbf8a147e92acd2e /doc/uninorm.texi
Imported Upstream version 0.9.1 (upstream/0.9.1)
Diffstat (limited to 'doc/uninorm.texi')
-rw-r--r-- doc/uninorm.texi 299
1 files changed, 299 insertions, 0 deletions
diff --git a/doc/uninorm.texi b/doc/uninorm.texi
new file mode 100644
index 00000000..d4206d50
--- /dev/null
+++ b/doc/uninorm.texi
@@ -0,0 +1,299 @@
+@node uninorm.h
+@chapter Normalization forms (composition and decomposition) @code{<uninorm.h>}
+
+@cindex normal forms
+@cindex normalizing
+This include file defines functions for transforming Unicode strings to one
+of the four normal forms, known as NFC, NFD, NFKC, NFKD. These
+transformations involve decomposition and --- for NFC and NFKC --- composition
+of Unicode characters.
+
+@menu
+* Decomposition of characters::
+* Composition of characters::
+* Normalization of strings::
+* Normalizing comparisons::
+* Normalization of streams::
+@end menu
+
+@node Decomposition of characters
+@section Decomposition of Unicode characters
+
+@cindex decomposing
+The following enumerated values are the possible types of decomposition of a
+Unicode character.
+
+@deftypevr Constant int UC_DECOMP_CANONICAL
+Denotes canonical decomposition.
+@end deftypevr
+
+@deftypevr Constant int UC_DECOMP_FONT
+UCD marker: @code{<font>}. Denotes a font variant (e.g. a blackletter form).
+@end deftypevr
+
+@deftypevr Constant int UC_DECOMP_NOBREAK
+UCD marker: @code{<noBreak>}.
+Denotes a no-break version of a space or hyphen.
+@end deftypevr
+
+@deftypevr Constant int UC_DECOMP_INITIAL
+UCD marker: @code{<initial>}.
+Denotes an initial presentation form (Arabic).
+@end deftypevr
+
+@deftypevr Constant int UC_DECOMP_MEDIAL
+UCD marker: @code{<medial>}.
+Denotes a medial presentation form (Arabic).
+@end deftypevr
+
+@deftypevr Constant int UC_DECOMP_FINAL
+UCD marker: @code{<final>}.
+Denotes a final presentation form (Arabic).
+@end deftypevr
+
+@deftypevr Constant int UC_DECOMP_ISOLATED
+UCD marker: @code{<isolated>}.
+Denotes an isolated presentation form (Arabic).
+@end deftypevr
+
+@deftypevr Constant int UC_DECOMP_CIRCLE
+UCD marker: @code{<circle>}.
+Denotes an encircled form.
+@end deftypevr
+
+@deftypevr Constant int UC_DECOMP_SUPER
+UCD marker: @code{<super>}.
+Denotes a superscript form.
+@end deftypevr
+
+@deftypevr Constant int UC_DECOMP_SUB
+UCD marker: @code{<sub>}.
+Denotes a subscript form.
+@end deftypevr
+
+@deftypevr Constant int UC_DECOMP_VERTICAL
+UCD marker: @code{<vertical>}.
+Denotes a vertical layout presentation form.
+@end deftypevr
+
+@deftypevr Constant int UC_DECOMP_WIDE
+UCD marker: @code{<wide>}.
+Denotes a wide (or zenkaku) compatibility character.
+@end deftypevr
+
+@deftypevr Constant int UC_DECOMP_NARROW
+UCD marker: @code{<narrow>}.
+Denotes a narrow (or hankaku) compatibility character.
+@end deftypevr
+
+@deftypevr Constant int UC_DECOMP_SMALL
+UCD marker: @code{<small>}.
+Denotes a small variant form (CNS compatibility).
+@end deftypevr
+
+@deftypevr Constant int UC_DECOMP_SQUARE
+UCD marker: @code{<square>}.
+Denotes a CJK squared font variant.
+@end deftypevr
+
+@deftypevr Constant int UC_DECOMP_FRACTION
+UCD marker: @code{<fraction>}.
+Denotes a vulgar fraction form.
+@end deftypevr
+
+@deftypevr Constant int UC_DECOMP_COMPAT
+UCD marker: @code{<compat>}.
+Denotes an otherwise unspecified compatibility character.
+@end deftypevr
+
+The following constant denotes the maximum size of decomposition of a single
+Unicode character.
+
+@deftypevr Macro {unsigned int} UC_DECOMPOSITION_MAX_LENGTH
+This macro expands to a constant that is the required size of buffer passed to
+the @code{uc_decomposition} and @code{uc_canonical_decomposition} functions.
+@end deftypevr
+
+The following functions decompose a Unicode character.
+
+@deftypefun int uc_decomposition (ucs4_t @var{uc}, int *@var{decomp_tag}, ucs4_t *@var{decomposition})
+Returns the character decomposition mapping of the Unicode character @var{uc}.
+@var{decomposition} must point to an array of at least
+@code{UC_DECOMPOSITION_MAX_LENGTH} @code{ucs4_t} elements.
+
+When a decomposition exists, @code{@var{decomposition}[0..@var{n}-1]} and
+@code{*@var{decomp_tag}} are filled and @var{n} is returned. Otherwise -1 is
+returned.
+@end deftypefun
+
+@deftypefun int uc_canonical_decomposition (ucs4_t @var{uc}, ucs4_t *@var{decomposition})
+Returns the canonical character decomposition mapping of the Unicode character
+@var{uc}. @var{decomposition} must point to an array of at least
+@code{UC_DECOMPOSITION_MAX_LENGTH} @code{ucs4_t} elements.
+
+When a decomposition exists, @code{@var{decomposition}[0..@var{n}-1]} is filled
+and @var{n} is returned. Otherwise -1 is returned.
+@end deftypefun
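The calling convention of the two decomposition functions above can be sketched without linking against the library. The block below is only an illustration under stated assumptions: `toy_canonical_decomposition`, `toy_table` and `TOY_DECOMPOSITION_MAX_LENGTH` are hypothetical stand-ins (not part of `<uninorm.h>`), the `ucs4_t` typedef replaces the one the library provides, and the table holds just three canonical mappings from the Unicode Character Database.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t ucs4_t;                /* stand-in for the library's typedef */

/* Hypothetical; real callers size the buffer with UC_DECOMPOSITION_MAX_LENGTH. */
#define TOY_DECOMPOSITION_MAX_LENGTH 2

/* Three canonical decomposition mappings taken from the UCD. */
static const struct { ucs4_t uc; int n; ucs4_t parts[2]; } toy_table[] = {
  { 0x00C5, 2, { 0x0041, 0x030A } },    /* A WITH RING ABOVE = A + COMBINING RING ABOVE */
  { 0x00E9, 2, { 0x0065, 0x0301 } },    /* e WITH ACUTE = e + COMBINING ACUTE ACCENT */
  { 0x212B, 1, { 0x00C5, 0 } },         /* ANGSTROM SIGN = A WITH RING ABOVE (singleton) */
};

/* Hypothetical stand-in that follows the documented contract: fills
   decomposition[0..n-1] and returns n, or returns -1 if no mapping exists. */
static int
toy_canonical_decomposition (ucs4_t uc, ucs4_t *decomposition)
{
  assert (decomposition != NULL);
  for (size_t i = 0; i < sizeof toy_table / sizeof toy_table[0]; i++)
    if (toy_table[i].uc == uc)
      {
        for (int k = 0; k < toy_table[i].n; k++)
          decomposition[k] = toy_table[i].parts[k];
        return toy_table[i].n;
      }
  return -1;
}
```

A caller declares `ucs4_t buf[TOY_DECOMPOSITION_MAX_LENGTH]`, passes it in, and reads only the first `n` elements when the return value `n` is not -1.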
+
+@node Composition of characters
+@section Composition of Unicode characters
+
+@cindex composing, Unicode characters
+@cindex combining, Unicode characters
+The following function composes a Unicode character from two Unicode
+characters.
+
+@deftypefun ucs4_t uc_composition (ucs4_t @var{uc1}, ucs4_t @var{uc2})
+Attempts to combine the Unicode characters @var{uc1}, @var{uc2}.
+@var{uc1} is known to have canonical combining class 0.
+
+Returns the combination of @var{uc1} and @var{uc2}, if it exists.
+Returns 0 otherwise.
+
+Not all decompositions can be recombined using this function. See the Unicode
+file @file{CompositionExclusions.txt} for details.
+@end deftypefun
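The return convention of `uc_composition` (the composed character, or 0 when the pair does not combine) can be illustrated with a toy lookup. This is a sketch, not the library's implementation: `toy_composition` and `toy_pairs` are hypothetical names, the `ucs4_t` typedef is a stand-in, and the table covers only two pairs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t ucs4_t;                /* stand-in for the library's typedef */

/* Two canonical compositions; a real table is far larger. */
static const struct { ucs4_t first, second, composed; } toy_pairs[] = {
  { 0x0041, 0x030A, 0x00C5 },           /* A + COMBINING RING ABOVE */
  { 0x0065, 0x0301, 0x00E9 },           /* e + COMBINING ACUTE ACCENT */
};

/* Hypothetical stand-in: returns the composed character, or 0 if the
   pair (uc1, uc2) has no composition in the toy table. */
static ucs4_t
toy_composition (ucs4_t uc1, ucs4_t uc2)
{
  for (size_t i = 0; i < sizeof toy_pairs / sizeof toy_pairs[0]; i++)
    {
      /* 0 is the "no composition" sentinel, so no entry may compose to it. */
      assert (toy_pairs[i].composed != 0);
      if (toy_pairs[i].first == uc1 && toy_pairs[i].second == uc2)
        return toy_pairs[i].composed;
    }
  return 0;
}
```

Note that U+212B ANGSTROM SIGN decomposes to U+00C5, but no pair of characters composes back to U+212B. Singleton decompositions like this one are an instance of the non-recombinable decompositions mentioned above.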
+
+@node Normalization of strings
+@section Normalization of strings
+
+The Unicode standard defines four normalization forms for Unicode strings.
+The following type is used to denote a normalization form.
+
+@deftp Type uninorm_t
+An object of type @code{uninorm_t} denotes a Unicode normalization form.
+This is a scalar type; its values can be compared with @code{==}.
+@end deftp
+
+The following constants denote the four normalization forms.
+
+@deftypevr Macro uninorm_t UNINORM_NFD
+Denotes Normalization form D: canonical decomposition.
+@end deftypevr
+
+@deftypevr Macro uninorm_t UNINORM_NFC
+Denotes Normalization form C: canonical decomposition, then canonical composition.
+@end deftypevr
+
+@deftypevr Macro uninorm_t UNINORM_NFKD
+Denotes Normalization form KD: compatibility decomposition.
+@end deftypevr
+
+@deftypevr Macro uninorm_t UNINORM_NFKC
+Denotes Normalization form KC: compatibility decomposition, then canonical composition.
+@end deftypevr
+
+The following functions operate on @code{uninorm_t} objects.
+
+@deftypefun bool uninorm_is_compat_decomposing (uninorm_t @var{nf})
+Tests whether the normalization form @var{nf} does compatibility decomposition.
+@end deftypefun
+
+@deftypefun bool uninorm_is_composing (uninorm_t @var{nf})
+Tests whether the normalization form @var{nf} includes canonical composition.
+@end deftypefun
+
+@deftypefun uninorm_t uninorm_decomposing_form (uninorm_t @var{nf})
+Returns the decomposing variant of the normalization form @var{nf}.
+This maps NFC,NFD @arrow{} NFD and NFKC,NFKD @arrow{} NFKD.
+@end deftypefun
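How the three functions above relate to the four forms can be shown with a toy model. Everything here is an assumption made for illustration: `enum toy_nf` and the `toy_*` functions are hypothetical and say nothing about how `uninorm_t` is actually represented in the library.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the four normalization forms. */
enum toy_nf { TOY_NFD, TOY_NFC, TOY_NFKD, TOY_NFKC };

/* True for the forms that apply compatibility decomposition (NFKD, NFKC). */
static bool
toy_is_compat_decomposing (enum toy_nf nf)
{
  return nf == TOY_NFKD || nf == TOY_NFKC;
}

/* True for the forms that end with canonical composition (NFC, NFKC). */
static bool
toy_is_composing (enum toy_nf nf)
{
  return nf == TOY_NFC || nf == TOY_NFKC;
}

/* Maps NFC,NFD to NFD and NFKC,NFKD to NFKD. */
static enum toy_nf
toy_decomposing_form (enum toy_nf nf)
{
  assert (nf >= TOY_NFD && nf <= TOY_NFKC);
  return toy_is_compat_decomposing (nf) ? TOY_NFKD : TOY_NFD;
}
```

In this model the decomposing form is never a composing form, and it keeps the compatibility property of the form it was derived from.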
+
+The following functions apply a Unicode normalization form to a Unicode string.
+
+@deftypefun {uint8_t *} u8_normalize (uninorm_t @var{nf}, const uint8_t *@var{s}, size_t @var{n}, uint8_t *@var{resultbuf}, size_t *@var{lengthp})
+@deftypefunx {uint16_t *} u16_normalize (uninorm_t @var{nf}, const uint16_t *@var{s}, size_t @var{n}, uint16_t *@var{resultbuf}, size_t *@var{lengthp})
+@deftypefunx {uint32_t *} u32_normalize (uninorm_t @var{nf}, const uint32_t *@var{s}, size_t @var{n}, uint32_t *@var{resultbuf}, size_t *@var{lengthp})
+Returns the normalization form @var{nf} of the string @var{s} of length @var{n}.
+@end deftypefun
+
+@node Normalizing comparisons
+@section Normalizing comparisons
+
+@cindex comparing, ignoring normalization
+The following functions compare Unicode strings, ignoring differences in
+normalization.
+
+@deftypefun int u8_normcmp (const uint8_t *@var{s1}, size_t @var{n1}, const uint8_t *@var{s2}, size_t @var{n2}, uninorm_t @var{nf}, int *@var{resultp})
+@deftypefunx int u16_normcmp (const uint16_t *@var{s1}, size_t @var{n1}, const uint16_t *@var{s2}, size_t @var{n2}, uninorm_t @var{nf}, int *@var{resultp})
+@deftypefunx int u32_normcmp (const uint32_t *@var{s1}, size_t @var{n1}, const uint32_t *@var{s2}, size_t @var{n2}, uninorm_t @var{nf}, int *@var{resultp})
+Compares @var{s1} and @var{s2}, ignoring differences in normalization.
+
+@var{nf} must be either @code{UNINORM_NFD} or @code{UNINORM_NFKD}.
+
+If successful, sets @code{*@var{resultp}} to -1 if @var{s1} < @var{s2},
+0 if @var{s1} = @var{s2}, 1 if @var{s1} > @var{s2}, and returns 0.
+Upon failure, returns -1 with @code{errno} set.
+@end deftypefun
+
+@cindex comparing, ignoring normalization, with collation rules
+@cindex comparing, with collation rules, ignoring normalization
+@deftypefun {char *} u8_normxfrm (const uint8_t *@var{s}, size_t @var{n}, uninorm_t @var{nf}, char *@var{resultbuf}, size_t *@var{lengthp})
+@deftypefunx {char *} u16_normxfrm (const uint16_t *@var{s}, size_t @var{n}, uninorm_t @var{nf}, char *@var{resultbuf}, size_t *@var{lengthp})
+@deftypefunx {char *} u32_normxfrm (const uint32_t *@var{s}, size_t @var{n}, uninorm_t @var{nf}, char *@var{resultbuf}, size_t *@var{lengthp})
+Converts the string @var{s} of length @var{n} to a NUL-terminated byte
+sequence, in such a way that comparing @code{u8_normxfrm (@var{s1})} and
+@code{u8_normxfrm (@var{s2})} with the @code{u8_cmp2} function is equivalent to
+comparing @var{s1} and @var{s2} with the @code{u8_normcoll} function.
+
+@var{nf} must be either @code{UNINORM_NFC} or @code{UNINORM_NFKC}.
+@end deftypefun
+
+@deftypefun int u8_normcoll (const uint8_t *@var{s1}, size_t @var{n1}, const uint8_t *@var{s2}, size_t @var{n2}, uninorm_t @var{nf}, int *@var{resultp})
+@deftypefunx int u16_normcoll (const uint16_t *@var{s1}, size_t @var{n1}, const uint16_t *@var{s2}, size_t @var{n2}, uninorm_t @var{nf}, int *@var{resultp})
+@deftypefunx int u32_normcoll (const uint32_t *@var{s1}, size_t @var{n1}, const uint32_t *@var{s2}, size_t @var{n2}, uninorm_t @var{nf}, int *@var{resultp})
+Compares @var{s1} and @var{s2}, ignoring differences in normalization, using
+the collation rules of the current locale.
+
+@var{nf} must be either @code{UNINORM_NFC} or @code{UNINORM_NFKC}.
+
+If successful, sets @code{*@var{resultp}} to -1 if @var{s1} < @var{s2},
+0 if @var{s1} = @var{s2}, 1 if @var{s1} > @var{s2}, and returns 0.
+Upon failure, returns -1 with @code{errno} set.
+@end deftypefun
+
+@node Normalization of streams
+@section Normalization of streams of Unicode characters
+
+@cindex stream, normalizing a
+A ``stream of Unicode characters'' is essentially a function that accepts a
+@code{ucs4_t} argument repeatedly, optionally combined with a function that
+``flushes'' the stream.
+
+@deftp Type {struct uninorm_filter}
+This is the data type of a stream of Unicode characters that normalizes its
+input according to a given normalization form and passes the normalized
+character sequence to the encapsulated stream of Unicode characters.
+@end deftp
+
+@deftypefun {struct uninorm_filter *} uninorm_filter_create (uninorm_t @var{nf}, int (*@var{stream_func}) (void *@var{stream_data}, ucs4_t @var{uc}), void *@var{stream_data})
+Creates and returns a normalization filter for Unicode characters.
+
+The pair (@var{stream_func}, @var{stream_data}) is the encapsulated stream.
+@code{@var{stream_func} (@var{stream_data}, @var{uc})} receives the Unicode
+character @var{uc} and returns 0 if successful, or -1 with @code{errno} set
+upon failure.
+
+Returns the new filter, or NULL with @code{errno} set upon failure.
+@end deftypefun
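The encapsulated stream described above is just a callback plus an opaque pointer. A minimal sketch of such a pair follows, assuming a fixed-size collecting buffer: `struct ucs4_sink`, `ucs4_sink_put` and `SINK_CAPACITY` are hypothetical names invented for this example, and the `ucs4_t` typedef stands in for the library's own.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t ucs4_t;                /* stand-in for the library's typedef */

#define SINK_CAPACITY 16                /* hypothetical fixed capacity */

/* Hypothetical stream_data: collects the characters it receives. */
struct ucs4_sink
{
  ucs4_t chars[SINK_CAPACITY];
  size_t count;
};

/* Has the shape int (*) (void *stream_data, ucs4_t uc) and follows the
   documented contract: 0 on success, -1 with errno set on failure. */
static int
ucs4_sink_put (void *stream_data, ucs4_t uc)
{
  struct ucs4_sink *sink = stream_data;
  assert (sink != NULL);
  if (sink->count >= SINK_CAPACITY)
    {
      errno = ENOMEM;                   /* buffer full: report failure */
      return -1;
    }
  sink->chars[sink->count++] = uc;
  return 0;
}
```

When building against the library, the pair (`ucs4_sink_put`, `&sink`) would be passed as @var{stream_func} and @var{stream_data} to `uninorm_filter_create`. The filter would then deliver normalized characters into `sink.chars`.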
+
+@deftypefun int uninorm_filter_write (struct uninorm_filter *@var{filter}, ucs4_t @var{uc})
+Stuffs a Unicode character into a normalizing filter.
+Returns 0 if successful, or -1 with @code{errno} set upon failure.
+@end deftypefun
+
+@deftypefun int uninorm_filter_flush (struct uninorm_filter *@var{filter})
+Brings data buffered in the filter to its destination, the encapsulated stream.
+
+Returns 0 if successful, or -1 with @code{errno} set upon failure.
+
+Note: if additional characters are written into the filter after this function
+has been called, the resulting character sequence in the encapsulated stream
+will not necessarily be normalized.
+@end deftypefun
+
+@deftypefun int uninorm_filter_free (struct uninorm_filter *@var{filter})
+Brings data buffered in the filter to its destination, the encapsulated stream,
+then closes and frees the filter.
+
+Returns 0 if successful, or -1 with @code{errno} set upon failure.
+@end deftypefun