author | Andreas Rottmann <a.rottmann@gmx.at> | 2009-09-14 12:32:44 +0200 |
---|---|---|
committer | Andreas Rottmann <a.rottmann@gmx.at> | 2009-09-14 12:32:44 +0200 |
commit | fa095a4504cbe668e4244547e2c141597bea4ecf (patch) | |
tree | 06135820a286ffec47804e75fbf8a147e92acd2e /doc/libunistring_1.html |
Imported Upstream version 0.9.1 (upstream/0.9.1)
Diffstat (limited to 'doc/libunistring_1.html')
-rw-r--r-- | doc/libunistring_1.html | 531 |
1 files changed, 531 insertions, 0 deletions
diff --git a/doc/libunistring_1.html b/doc/libunistring_1.html new file mode 100644 index 00000000..646fdc65 --- /dev/null +++ b/doc/libunistring_1.html @@ -0,0 +1,531 @@ +<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html401/loose.dtd"> +<html> +<!-- Created on July, 1 2009 by texi2html 1.78a --> +<!-- +Written by: Lionel Cons <Lionel.Cons@cern.ch> (original author) + Karl Berry <karl@freefriends.org> + Olaf Bachmann <obachman@mathematik.uni-kl.de> + and many others. +Maintained by: Many creative people. +Send bugs and suggestions to <texi2html-bug@nongnu.org> + +--> +<head> +<title>GNU libunistring: 1. Introduction</title> + +<meta name="description" content="GNU libunistring: 1. Introduction"> +<meta name="keywords" content="GNU libunistring: 1. Introduction"> +<meta name="resource-type" content="document"> +<meta name="distribution" content="global"> +<meta name="Generator" content="texi2html 1.78a"> +<meta http-equiv="Content-Type" content="text/html; charset=utf-8"> +<style type="text/css"> +<!-- +a.summary-letter {text-decoration: none} +pre.display {font-family: serif} +pre.format {font-family: serif} +pre.menu-comment {font-family: serif} +pre.menu-preformatted {font-family: serif} +pre.smalldisplay {font-family: serif; font-size: smaller} +pre.smallexample {font-size: smaller} +pre.smallformat {font-family: serif; font-size: smaller} +pre.smalllisp {font-size: smaller} +span.roman {font-family:serif; font-weight:normal;} +span.sansserif {font-family:sans-serif; font-weight:normal;} +ul.toc {list-style: none} +--> +</style> + + +</head> + +<body lang="en" bgcolor="#FFFFFF" text="#000000" link="#0000FF" vlink="#800080" alink="#FF0000"> + +<table cellpadding="1" cellspacing="1" border="0"> +<tr><td valign="middle" align="left">[ << ]</td> +<td valign="middle" align="left">[<a href="libunistring_2.html#SEC9" title="Next chapter"> >> </a>]</td> +<td valign="middle" align="left"> </td> +<td valign="middle" align="left"> 
</td> +<td valign="middle" align="left"> </td> +<td valign="middle" align="left"> </td> +<td valign="middle" align="left"> </td> +<td valign="middle" align="left">[<a href="libunistring.html#SEC_Top" title="Cover (top) of document">Top</a>]</td> +<td valign="middle" align="left">[<a href="libunistring.html#SEC_Contents" title="Table of contents">Contents</a>]</td> +<td valign="middle" align="left">[<a href="libunistring_18.html#SEC71" title="Index">Index</a>]</td> +<td valign="middle" align="left">[<a href="libunistring_abt.html#SEC_About" title="About (help)"> ? </a>]</td> +</tr></table> + +<hr size="2"> +<a name="Introduction"></a> +<a name="SEC1"></a> +<h1 class="chapter"> <a href="libunistring.html#TOC1">1. Introduction</a> </h1> + +<p>This library provides functions for manipulating Unicode strings and +for manipulating C strings according to the Unicode standard. +</p> +<p>It consists of the following parts: +</p> +<dl compact="compact"> +<dt> <code><unistr.h></code></dt> +<dd><p>elementary string functions +</p></dd> +<dt> <code><uniconv.h></code></dt> +<dd><p>conversion from/to legacy encodings +</p></dd> +<dt> <code><unistdio.h></code></dt> +<dd><p>formatted output to strings +</p></dd> +<dt> <code><uniname.h></code></dt> +<dd><p>character names +</p></dd> +<dt> <code><unictype.h></code></dt> +<dd><p>character classification and properties +</p></dd> +<dt> <code><uniwidth.h></code></dt> +<dd><p>string width when using nonproportional fonts +</p></dd> +<dt> <code><uniwbrk.h></code></dt> +<dd><p>word breaks +</p></dd> +<dt> <code><unilbrk.h></code></dt> +<dd><p>line breaking algorithm +</p></dd> +<dt> <code><uninorm.h></code></dt> +<dd><p>normalization (composition and decomposition) +</p></dd> +<dt> <code><unicase.h></code></dt> +<dd><p>case folding +</p></dd> +<dt> <code><uniregex.h></code></dt> +<dd><p>regular expressions (not yet implemented) +</p></dd> +</dl> + +<a name="IDX1"></a> +<a name="IDX2"></a> +<p>libunistring is for you if your application 
involves non-trivial text +processing, such as upper/lower case conversions, line breaking, operations +on words, or more advanced analysis of text. Text provided by the user can, +in general, contain characters of all kinds of scripts. The text processing +functions provided by this library handle all scripts and all languages. +</p> +<p>libunistring is for you if your application already uses the ISO C / POSIX +<code><ctype.h></code> and <code><wctype.h></code> functions and the text it operates on is +provided by the user and can be in any language. +</p> +<p>libunistring is also for you if your application uses Unicode strings as +its internal in-memory representation. +</p> + +<hr size="6"> +<a name="Unicode"></a> +<a name="SEC2"></a> +<h2 class="section"> <a href="libunistring.html#TOC2">1.1 Unicode</a> </h2> + +<p>Unicode is a standardized repertoire of characters that contains characters +from all scripts of the world, from Latin letters to Chinese ideographs +and Babylonian cuneiform glyphs. It also specifies how these characters +are to be rendered on a screen or on paper, and how common text processing +(word selection, line breaking, uppercasing of page titles etc.) is supposed +to behave on Unicode text. +</p> +<p>Unicode also specifies three ways of storing sequences of Unicode +characters in a computer whose basic unit of data is an 8-bit byte: +<a name="IDX3"></a> +<a name="IDX4"></a> +<a name="IDX5"></a> +<a name="IDX6"></a> +</p><dl compact="compact"> +<dt> UTF-8</dt> +<dd><p>Every character is represented as 1 to 4 bytes. +</p></dd> +<dt> UTF-16</dt> +<dd><p>Every character is represented as 1 to 2 units of 16 bits. +</p></dd> +<dt> UTF-32, a.k.a. UCS-4</dt> +<dd><p>Every character is represented as 1 unit of 32 bits. +</p></dd> +</dl> + +<p>For encoding Unicode text in a file, UTF-8 is usually used. For encoding +Unicode strings in memory for a program, any of the three encoding forms +can reasonably be used. 
+</p> +<p>Unicode is widely used on the web. Prior to the use of Unicode, web pages +were in many different encodings (ISO-8859-1 for English, French, Spanish, +ISO-8859-2 for Polish, ISO-8859-7 for Greek, KOI8-R for Russian, GB2312 or +BIG5 for Chinese, ISO-2022-JP-2 or EUC-JP or Shift_JIS for Japanese, and many +many others). It was next to impossible to create a document that contained +Chinese and Polish text in the same document. Due to the many encodings for +Japanese, even the processing of pure Japanese text was error prone. +</p> +<p>References: +</p><ul> +<li> +The Unicode standard: <a href="http://www.unicode.org/">http://www.unicode.org/</a> +</li><li> +Definition of UTF-8: <a href="http://www.rfc-editor.org/rfc/rfc3629.txt">http://www.rfc-editor.org/rfc/rfc3629.txt</a> +</li><li> +Definition of UTF-16: <a href="http://www.rfc-editor.org/rfc/rfc2781.txt">http://www.rfc-editor.org/rfc/rfc2781.txt</a> +</li><li> +Markus Kuhn's UTF-8 and Unicode FAQ: +<a href="http://www.cl.cam.ac.uk/~mgk25/unicode.html">http://www.cl.cam.ac.uk/~mgk25/unicode.html</a> +</li></ul> + +<hr size="6"> +<a name="Unicode-and-i18n"></a> +<a name="SEC3"></a> +<h2 class="section"> <a href="libunistring.html#TOC3">1.2 Unicode and Internationalization</a> </h2> + +<p>Internationalization is the process of changing the source code of a program +so that it can meet the expectations of users in any culture, if culture +specific data (translations, images etc.) are provided. +</p> +<p>Use of Unicode is not strictly required for internationalization, but it +makes internationalization much easier, because operations that need to +look at specific characters (like hyphenation, spell checking, or the +automatic conversion of double-quotes to opening and closing double-quote +characters) don't need to consider multiple possible encodings of the text. 
+</p> +<p>Use of Unicode also enables multilingualization: the ability to have text +in multiple languages present in the same document or even in the same line +of text. +</p> +<p>But use of Unicode is not everything. Internationalization usually consists +of three features: +</p><ul> +<li> +Use of Unicode where needed for text processing. This is what this library +is for. +</li><li> +Use of message catalogs for messages shown to the user. This is what +GNU gettext is about. +</li><li> +Use of locale-specific conventions for date and time formats, for numeric +formatting, or for sorting of text. This can be done adequately with the +POSIX APIs and the implementation of locales in the GNU C library. +</li></ul> + +<hr size="6"> +<a name="Locale-encodings"></a> +<a name="SEC4"></a> +<h2 class="section"> <a href="libunistring.html#TOC4">1.3 Locale encodings</a> </h2> + +<p>A locale is a set of cultural conventions. According to POSIX, for a program, +at any moment, one locale is designated as the “current locale”. +(Actually, POSIX also supports one locale per thread, but this feature is not +yet universally implemented and not widely used.) +<a name="IDX7"></a> +The locale is partitioned into several aspects, called the “categories” +of the locale. The main aspects are: +</p><ul class="toc"> +<li> +The character encoding and the character properties. This is the +<code>LC_CTYPE</code> category. +</li><li> +The sorting rules for text. This is the <code>LC_COLLATE</code> category. +</li><li> +The language-specific translations of messages. This is the +<code>LC_MESSAGES</code> category. +</li><li> +The formatting rules for numbers, such as the decimal separator. This is +the <code>LC_NUMERIC</code> category. +</li><li> +The formatting rules for amounts of money. This is the <code>LC_MONETARY</code> +category. +</li><li> +The formatting of date and time. This is the <code>LC_TIME</code> category. 
+</li></ul> + +<a name="IDX8"></a> +<p>In particular, the <code>LC_CTYPE</code> category of the current locale determines +the character encoding. This is the encoding of ‘<samp>char *</samp>’ strings. +We also call it the “locale encoding”. GNU libunistring has a function, +<code>locale_charset</code>, that returns a standardized (platform-independent) +name for this encoding. +</p> +<p>All locale encodings used on glibc systems are essentially ASCII compatible: +most graphic ASCII characters have the same representation, as a single byte, +in that encoding as in ASCII. +</p> +<p>Among the possible locale encodings are UTF-8 and GB18030. Both can +represent any Unicode character as a sequence of bytes. UTF-8 is used in +most of the world, whereas GB18030 is used in the People's Republic of China, +because it is backward compatible with the GB2312 encoding that was used in +that country earlier. +</p> +<p>The legacy locale encodings, ISO-8859-15 (which supplanted ISO-8859-1 in +most of Europe), ISO-8859-2, KOI8-R, EUC-JP, etc., are still in use in +many places, though. +</p> +<p>UTF-16 and UTF-32 are not used as locale encodings, because they are not +ASCII compatible. +</p> +<hr size="6"> +<a name="In_002dmemory-representation"></a> +<a name="SEC5"></a> +<h2 class="section"> <a href="libunistring.html#TOC5">1.4 Choice of in-memory representation of strings</a> </h2> + +<p>There are three ways of representing strings in the memory of a running +program. +</p><ul class="toc"> +<li> +As ‘<samp>char *</samp>’ strings. Such strings are represented in locale encoding. +This approach is employed when not much text processing is done by the +program. When some Unicode-aware processing is to be done, a string is +converted to Unicode on the fly and back to locale encoding afterwards. +</li><li> +As UTF-8 or UTF-16 or UTF-32 strings. This implies that conversion from +locale encoding to Unicode is performed on input, and in the opposite +direction on output. 
This approach is employed when the program does +a significant amount of text processing, or when the program has multiple +threads operating on the same data but in different locales. +</li><li> +As ‘<samp>wchar_t *</samp>’, a.k.a. “wide strings”. This approach is misguided; +see <a href="#SEC7">The <code>wchar_t</code> mess</a>. +</li></ul> + +<hr size="6"> +<a name="char-_002a-strings"></a> +<a name="SEC6"></a> +<h2 class="section"> <a href="libunistring.html#TOC6">1.5 ‘<samp>char *</samp>’ strings</a> </h2> + +<p>The classical C strings, with their C library support standardized by +ISO C and POSIX, can be used in internationalized programs with some +precautions. The problem with this API is that many of the C library +functions for strings don't work correctly on strings in locale +encodings, leading to bugs that only people in some cultures of the +world will experience. +</p> +<a name="IDX9"></a> +<p>The first problem with the C library API is the support of multibyte +locales. Depending on the locale encoding, in general, every character +is represented by one or more bytes (up to 4 bytes in practice; in code, +use <code>MB_LEN_MAX</code> instead of the number 4). +When every character is represented by only 1 byte, we speak of a +“unibyte locale”, otherwise of a “multibyte locale”. It is important +to realize that the majority of Unix installations nowadays use UTF-8 +or GB18030 as locale encoding; therefore, the majority of users are +using multibyte locales. +</p> +<a name="IDX10"></a> +<p>The important fact to remember is: +</p><table class="cartouche" border="1"><tr><td> +<p><em>A ‘<samp>char</samp>’ is a byte, not a character.</em> +</p></td></tr></table> + +<p>As a consequence: +</p><ul class="toc"> +<li> +The <code><ctype.h></code> API is useless in this context; it does not work in +multibyte locales. 
+</li><li> +The <a href="http://www.opengroup.org/onlinepubs/9699919799/functions/strlen.html"><code>strlen</code></a> function does not return the number of characters +in a string. Nor does it return the number of screen columns occupied +by a string after it is output. It merely returns the number of +<em>bytes</em> occupied by a string. +</li><li> +Truncating a string, for example, with <a href="http://www.opengroup.org/onlinepubs/9699919799/functions/strncpy.html"><code>strncpy</code></a>, can have the +effect of truncating it in the middle of a multibyte character. Such +a string will, when output, have a garbled character at its end, often +represented by a hollow box. +</li><li> +<a href="http://www.opengroup.org/onlinepubs/9699919799/functions/strchr.html"><code>strchr</code></a> and <a href="http://www.opengroup.org/onlinepubs/9699919799/functions/strrchr.html"><code>strrchr</code></a> do not work with multibyte strings +if the locale encoding is GB18030 and the character to be searched is +a digit. +</li><li> +<a href="http://www.opengroup.org/onlinepubs/9699919799/functions/strstr.html"><code>strstr</code></a> does not work with multibyte strings if the locale encoding +is different from UTF-8. +</li><li> +<a href="http://www.opengroup.org/onlinepubs/9699919799/functions/strcspn.html"><code>strcspn</code></a>, <a href="http://www.opengroup.org/onlinepubs/9699919799/functions/strpbrk.html"><code>strpbrk</code></a>, <a href="http://www.opengroup.org/onlinepubs/9699919799/functions/strspn.html"><code>strspn</code></a> cannot work +correctly in multibyte locales: they assume the second argument is a list of +single-byte characters. Even in this simple case, they do not work with +multibyte strings if the locale encoding is GB18030 and one of the +characters to be searched is a digit. 
+</li><li> +<a href="http://www.opengroup.org/onlinepubs/9699919799/functions/strsep.html"><code>strsep</code></a> and <a href="http://www.opengroup.org/onlinepubs/9699919799/functions/strtok_r.html"><code>strtok_r</code></a> do not work with multibyte strings +unless all of the delimiter characters are ASCII characters < 0x30. +</li><li> +The <a href="http://www.opengroup.org/onlinepubs/9699919799/functions/strcasecmp.html"><code>strcasecmp</code></a>, <a href="http://www.opengroup.org/onlinepubs/9699919799/functions/strncasecmp.html"><code>strncasecmp</code></a>, and <a href="http://www.opengroup.org/onlinepubs/9699919799/functions/strcasestr.html"><code>strcasestr</code></a> +functions do not work with multibyte strings. +</li></ul> + +<p>The workarounds can be found in GNU gnulib +<a href="http://www.gnu.org/software/gnulib/">http://www.gnu.org/software/gnulib/</a>. +</p><ul class="toc"> +<li> +gnulib has modules ‘<samp>mbchar</samp>’, ‘<samp>mbiter</samp>’, ‘<samp>mbuiter</samp>’ that +represent multibyte characters and allow to iterate across a multibyte +string with the same ease as through a unibyte string. +</li><li> +gnulib has functions <code>mbslen</code> and <code>mbswidth</code> that can be +used instead of <a href="http://www.opengroup.org/onlinepubs/9699919799/functions/strlen.html"><code>strlen</code></a> when the number of characters or the +number of screen columns of a string is requested. +</li><li> +gnulib has functions <code>mbschr</code> and <code>mbsrrchr</code> that are +like <a href="http://www.opengroup.org/onlinepubs/9699919799/functions/strchr.html"><code>strchr</code></a> and <a href="http://www.opengroup.org/onlinepubs/9699919799/functions/strrchr.html"><code>strrchr</code></a>, but work in multibyte locales. +</li><li> +gnulib has a function <code>mbsstr</code>, like <a href="http://www.opengroup.org/onlinepubs/9699919799/functions/strstr.html"><code>strstr</code></a>, but works +in multibyte locales. 
+</li><li> +gnulib has functions <code>mbscspn</code>, <code>mbspbrk</code>, <code>mbsspn</code> +that are like <a href="http://www.opengroup.org/onlinepubs/9699919799/functions/strcspn.html"><code>strcspn</code></a>, <a href="http://www.opengroup.org/onlinepubs/9699919799/functions/strpbrk.html"><code>strpbrk</code></a>, <a href="http://www.opengroup.org/onlinepubs/9699919799/functions/strspn.html"><code>strspn</code></a>, but +work in multibyte locales. +</li><li> +gnulib has functions <code>mbssep</code> and <code>mbstok_r</code> that are +like <a href="http://www.opengroup.org/onlinepubs/9699919799/functions/strsep.html"><code>strsep</code></a> and <a href="http://www.opengroup.org/onlinepubs/9699919799/functions/strtok_r.html"><code>strtok_r</code></a> but work in multibyte locales. +</li><li> +gnulib has functions <code>mbscasecmp</code>, <code>mbsncasecmp</code>, +<code>mbspcasecmp</code>, and <code>mbscasestr</code> that are like <a href="http://www.opengroup.org/onlinepubs/9699919799/functions/strcasecmp.html"><code>strcasecmp</code></a>, +<a href="http://www.opengroup.org/onlinepubs/9699919799/functions/strncasecmp.html"><code>strncasecmp</code></a>, and <a href="http://www.opengroup.org/onlinepubs/9699919799/functions/strcasestr.html"><code>strcasestr</code></a>, but +work in multibyte locales. Still, the function <code>ulc_casecmp</code> is +preferable to these functions; see below. +</li></ul> + +<p>The second problem with the C library API is that it has some assumptions built-in that are not valid in some languages: +</p><ul class="toc"> +<li> +It assumes that there are only two forms of every character: uppercase +and lowercase. This is not true for Croatian, where the character +<small>LETTER DZ WITH CARON</small> comes in three forms: +<small>LATIN CAPITAL LETTER DZ WITH CARON</small> (DZ), +<small>LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON</small> (Dz), +<small>LATIN SMALL LETTER DZ WITH CARON</small> (dz). 
+</li><li> +It assumes that uppercasing of 1 character leads to 1 character. This +is not true for German, where the <small>LATIN SMALL LETTER SHARP S</small>, when +uppercased, becomes ‘<samp>SS</samp>’. +</li><li> +It assumes that there is a 1:1 mapping between uppercase and lowercase forms. +This is not true for the Greek sigma: <small>GREEK CAPITAL LETTER SIGMA</small> is +the uppercase of both <small>GREEK SMALL LETTER SIGMA</small> and +<small>GREEK SMALL LETTER FINAL SIGMA</small>. +</li><li> +It assumes that the upper/lowercase mappings are position independent. +This is not true for the Greek sigma and the Lithuanian i. +</li></ul> + +<p>The correct way to deal with this problem is +</p><ol> +<li> +to provide functions for titlecasing, as well as for upper- and +lowercasing, +</li><li> +to view case transformations as functions that operate on strings, +rather than on characters. +</li></ol> + +<p>This is implemented in this library through the functions declared in <code><unicase.h></code>; see <a href="libunistring_13.html#SEC48">Case mappings <code><unicase.h></code></a>. +</p> +<hr size="6"> +<a name="The-wchar_005ft-mess"></a> +<a name="SEC7"></a> +<h2 class="section"> <a href="libunistring.html#TOC7">1.6 The <code>wchar_t</code> mess</a> </h2> + +<p>The ISO C and POSIX standard creators made an attempt to fix the first +problem mentioned in the previous section. They introduced +</p><ul class="toc"> +<li> +a type ‘<samp>wchar_t</samp>’, designed to encapsulate an entire character, +</li><li> +a “wide string” type ‘<samp>wchar_t *</samp>’, and +</li><li> +functions declared in <code><wctype.h></code> that were meant to supplant the +ones in <code><ctype.h></code>. +</li></ul> + +<p>Unfortunately, this API and its implementation have numerous problems: +</p> +<ul class="toc"> +<li> +On AIX and Windows platforms, <code>wchar_t</code> is a 16-bit type. This +means that it cannot accommodate every Unicode character. 
Either +the <code>wchar_t *</code> strings are limited to characters in UCS-2 (the +“Basic Multilingual Plane” of Unicode), or — if <code>wchar_t *</code> +strings are encoded in UTF-16 — a <code>wchar_t</code> represents only half +of a character in the worst case, making the <code><wctype.h></code> functions +pointless. + +</li><li> +On Solaris and FreeBSD, the <code>wchar_t</code> encoding is locale dependent +and undocumented. This means, if you want to know any property of a +<code>wchar_t</code> character, other than the properties defined by +<code><wctype.h></code> — such as whether it's a dash, currency symbol, +paragraph separator, or similar —, you have to convert it to +<code>char *</code> encoding first, by use of the function <a href="http://www.opengroup.org/onlinepubs/9699919799/functions/wctomb.html"><code>wctomb</code></a>. + +</li><li> +When you read a stream of wide characters, through the functions +<a href="http://www.opengroup.org/onlinepubs/9699919799/functions/fgetwc.html"><code>fgetwc</code></a> and <a href="http://www.opengroup.org/onlinepubs/9699919799/functions/fgetws.html"><code>fgetws</code></a>, and when the input stream/file is +not in the expected encoding, you have no way to determine the invalid +byte sequence and do some corrective action. If you use these +functions, your program becomes “garbage in - more garbage out” or +“garbage in - abort”. +</li></ul> + +<p>As a consequence, it is better to use multibyte strings, as explained in +the previous section. Such multibyte strings can bypass limitations +of the <code>wchar_t</code> type, if you use functions defined in gnulib and +libunistring for text processing. They can also faithfully transport +malformed characters that were present in the input, without requiring +the program to produce garbage or abort. 
+</p> +<hr size="6"> +<a name="Unicode-strings"></a> +<a name="SEC8"></a> +<h2 class="section"> <a href="libunistring.html#TOC8">1.7 Unicode strings</a> </h2> + +<p>libunistring supports Unicode strings in three representations: +<a name="IDX11"></a> +<a name="IDX12"></a> +<a name="IDX13"></a> +</p><ul class="toc"> +<li> +UTF-8 strings, through the type ‘<samp>uint8_t *</samp>’. The units are bytes +(<code>uint8_t</code>). +</li><li> +UTF-16 strings, through the type ‘<samp>uint16_t *</samp>’. The units are 16-bit +memory words (<code>uint16_t</code>). +</li><li> +UTF-32 strings, through the type ‘<samp>uint32_t *</samp>’. The units are 32-bit +memory words (<code>uint32_t</code>). +</li></ul> + +<p>As with C strings, there are two variants: +</p><ul class="toc"> +<li> +Unicode strings with a terminating NUL character are represented as +a pointer to the first unit of the string. There is a unit containing +a 0 value at the end. It is considered part of the string for all +memory allocation purposes, but is not considered part of the string +for all other logical purposes. +</li><li> +Unicode strings where embedded NUL characters are allowed. These +are represented by a pointer to the first unit and the number of units +(not bytes!) of the string. In this setting, there is no trailing +zero-valued unit used as an “end marker”. 
+</li></ul> + +<hr size="6"> +<table cellpadding="1" cellspacing="1" border="0"> +<tr><td valign="middle" align="left">[<a href="#SEC1" title="Beginning of this chapter or previous chapter"> << </a>]</td> +<td valign="middle" align="left">[<a href="libunistring_2.html#SEC9" title="Next chapter"> >> </a>]</td> +<td valign="middle" align="left"> </td> +<td valign="middle" align="left"> </td> +<td valign="middle" align="left"> </td> +<td valign="middle" align="left"> </td> +<td valign="middle" align="left"> </td> +<td valign="middle" align="left">[<a href="libunistring.html#SEC_Top" title="Cover (top) of document">Top</a>]</td> +<td valign="middle" align="left">[<a href="libunistring.html#SEC_Contents" title="Table of contents">Contents</a>]</td> +<td valign="middle" align="left">[<a href="libunistring_18.html#SEC71" title="Index">Index</a>]</td> +<td valign="middle" align="left">[<a href="libunistring_abt.html#SEC_About" title="About (help)"> ? </a>]</td> +</tr></table> +<p> + <font size="-1"> + This document was generated by <em>Bruno Haible</em> on <em>July, 1 2009</em> using <a href="http://www.nongnu.org/texi2html/"><em>texi2html 1.78a</em></a>. + </font> + <br> + +</p> +</body> +</html> |