From fa095a4504cbe668e4244547e2c141597bea4ecf Mon Sep 17 00:00:00 2001 From: Andreas Rottmann Date: Mon, 14 Sep 2009 12:32:44 +0200 Subject: Imported Upstream version 0.9.1 --- doc/libunistring_1.html | 531 ++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 531 insertions(+) create mode 100644 doc/libunistring_1.html (limited to 'doc/libunistring_1.html') diff --git a/doc/libunistring_1.html b/doc/libunistring_1.html new file mode 100644 index 00000000..646fdc65 --- /dev/null +++ b/doc/libunistring_1.html @@ -0,0 +1,531 @@ + + + + + +GNU libunistring: 1. Introduction + + + + + + + + + + + + + + + + + + + + + + + + + + +


+ +

+ + +

1. Introduction

+ +

This library provides functions for manipulating Unicode strings and +for manipulating C strings according to the Unicode standard. +

It consists of the following parts: +

<unistr.h>: elementary string functions +
<uniconv.h>: conversion from/to legacy encodings +
<unistdio.h>: formatted output to strings +
<uniname.h>: character names +
<unictype.h>: character classification and properties +
<uniwidth.h>: string width when using nonproportional fonts +
<uniwbrk.h>: word breaks +
<unilbrk.h>: line breaking algorithm +
<uninorm.h>: normalization (composition and decomposition) +
<unicase.h>: case folding +
<uniregex.h>: regular expressions (not yet implemented) +

+ + + +

libunistring is for you if your application involves non-trivial text +processing, such as upper/lower case conversions, line breaking, operations +on words, or more advanced analysis of text. Text provided by the user can, +in general, contain characters of all kinds of scripts. The text processing +functions provided by this library handle all scripts and all languages. +

libunistring is for you if your application already uses the ISO C / POSIX +<ctype.h>, <wctype.h> functions and the text it operates on is +provided by the user and can be in any language. +

libunistring is also for you if your application uses Unicode strings as +internal in-memory representation. +

+ +

+ + +

1.1 Unicode

+ +

Unicode is a standardized repertoire of characters that contains characters +from all scripts of the world, from Latin letters to Chinese ideographs +and Babylonian cuneiform glyphs. It also specifies how these characters +are to be rendered on a screen or on paper, and how common text processing +(word selection, line breaking, uppercasing of page titles etc.) is supposed +to behave on Unicode text. +

Unicode also specifies three ways of storing sequences of Unicode +characters in a computer whose basic unit of data is an 8-bit byte: + + + + +

UTF-8: Every character is represented as 1 to 4 bytes. +
UTF-16: Every character is represented as 1 to 2 units of 16 bits. +
UTF-32, a.k.a. UCS-4: Every character is represented as 1 unit of 32 bits. +
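The unit counts listed above can be made concrete with a small sketch in plain C. It does not use libunistring; the function names `utf8_encode` and `utf16_encode` are hypothetical and chosen for this illustration only. UTF-32 needs no encoder, since the single 32-bit unit is the code point itself.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode one Unicode code point as UTF-8.
   Writes 1 to 4 bytes into out and returns that count, or 0 if cp is
   not encodable (a surrogate, or a value above U+10FFFF). */
size_t utf8_encode(uint32_t cp, uint8_t out[4])
{
    if (cp < 0x80) {
        out[0] = (uint8_t) cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (uint8_t) (0xC0 | (cp >> 6));
        out[1] = (uint8_t) (0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                       /* surrogates are not characters */
        out[0] = (uint8_t) (0xE0 | (cp >> 12));
        out[1] = (uint8_t) (0x80 | ((cp >> 6) & 0x3F));
        out[2] = (uint8_t) (0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (uint8_t) (0xF0 | (cp >> 18));
        out[1] = (uint8_t) (0x80 | ((cp >> 12) & 0x3F));
        out[2] = (uint8_t) (0x80 | ((cp >> 6) & 0x3F));
        out[3] = (uint8_t) (0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}

/* Encode one Unicode code point as UTF-16.
   Writes 1 or 2 units of 16 bits into out and returns that count,
   or 0 if cp is not encodable. */
size_t utf16_encode(uint32_t cp, uint16_t out[2])
{
    if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;
        out[0] = (uint16_t) cp;
        return 1;
    }
    if (cp <= 0x10FFFF) {
        cp -= 0x10000;                      /* 20 bits remain */
        out[0] = (uint16_t) (0xD800 | (cp >> 10));    /* high surrogate */
        out[1] = (uint16_t) (0xDC00 | (cp & 0x3FF));  /* low surrogate */
        return 2;
    }
    return 0;
}
```

For instance, U+20AC EURO SIGN takes three bytes in UTF-8 and one unit in UTF-16, while a character outside the Basic Multilingual Plane such as U+1F600 takes four bytes in UTF-8 and two units in UTF-16.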

+ +

For encoding Unicode text in a file, UTF-8 is usually used. For encoding +Unicode strings in memory for a program, any of the three encoding forms +can be reasonably used. +

Unicode is widely used on the web. Prior to the use of Unicode, web pages +were in many different encodings (ISO-8859-1 for English, French, Spanish, +ISO-8859-2 for Polish, ISO-8859-7 for Greek, KOI8-R for Russian, GB2312 or +BIG5 for Chinese, ISO-2022-JP-2 or EUC-JP or Shift_JIS for Japanese, and many +others). It was next to impossible to create a single document that contained +both Chinese and Polish text. Due to the many encodings for +Japanese, even the processing of pure Japanese text was error-prone. +

References: +

+The Unicode standard: http://www.unicode.org/ +
+Definition of UTF-8: http://www.rfc-editor.org/rfc/rfc3629.txt +
+Definition of UTF-16: http://www.rfc-editor.org/rfc/rfc2781.txt +
+Markus Kuhn's UTF-8 and Unicode FAQ: +http://www.cl.cam.ac.uk/~mgk25/unicode.html +

+ +

+ + +

1.2 Unicode and Internationalization

+ +

Internationalization is the process of changing the source code of a program +so that it can meet the expectations of users in any culture, provided that +culture-specific data (translations, images etc.) are supplied. +

Use of Unicode is not strictly required for internationalization, but it +makes internationalization much easier, because operations that need to +look at specific characters (like hyphenation, spell checking, or the +automatic conversion of double-quotes to opening and closing double-quote +characters) don't need to consider multiple possible encodings of the text. +

Use of Unicode also enables multilingualization: the ability to have text +in multiple languages present in the same document or even in the same line +of text. +

But use of Unicode is not everything. Internationalization usually consists +of three features: +

+Use of Unicode where needed for text processing. This is what this library +is for. +
+Use of message catalogs for messages shown to the user. This is what +GNU gettext is about. +
+Use of locale-specific conventions for date and time formats, for numeric +formatting, or for sorting of text. This can be done adequately with the +POSIX APIs and the implementation of locales in the GNU C library. +

+ +

+ + +

1.3 Locale encodings

+ +

A locale is a set of cultural conventions. According to POSIX, for a program, +at any moment, one locale is designated as the “current locale”. +(Actually, POSIX also supports one locale per thread, but this feature is not +yet universally implemented and not widely used.) + +The locale is partitioned into several aspects, called the “categories” +of the locale. The main aspects are: +

+The character encoding and the character properties. This is the +LC_CTYPE category. +
+The sorting rules for text. This is the LC_COLLATE category. +
+The language specific translations of messages. This is the +LC_MESSAGES category. +
+The formatting rules for numbers, such as the decimal separator. This is +the LC_NUMERIC category. +
+The formatting rules for amounts of money. This is the LC_MONETARY +category. +
+The formatting of date and time. This is the LC_TIME category. +

+ + +

In particular, the LC_CTYPE category of the current locale determines +the character encoding. This is the encoding of ‘char *’ strings. +We also call it the “locale encoding”. GNU libunistring has a function, +locale_charset, that returns a standardized (platform independent) +name for this encoding. +
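As a rough illustration of how a program can discover its locale encoding, the sketch below uses only POSIX calls. It deliberately does not call `locale_charset`; instead it asks the C library through `nl_langinfo(CODESET)`, whose result is spelled differently from platform to platform, which is the gap a standardized name is meant to close. The function name `current_codeset` is hypothetical.

```c
#include <langinfo.h>
#include <locale.h>
#include <stddef.h>

/* Adopt the user's LC_CTYPE setting and return the C library's name
   for the resulting character encoding.  The returned string belongs
   to the C library and may be overwritten by later calls. */
const char *current_codeset(void)
{
    /* An empty locale name selects the locale from the environment
       (LC_ALL, LC_CTYPE, LANG).  Only the LC_CTYPE category is changed;
       if the requested locale is unavailable, the previous setting
       (initially the "C" locale) stays in effect. */
    (void) setlocale(LC_CTYPE, "");

    return nl_langinfo(CODESET);
}
```

A program would typically call this once at startup and compare the result against the encodings it knows how to handle.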

All locale encodings used on glibc systems are essentially ASCII compatible: +Most graphic ASCII characters have the same representation, as a single byte, +in that encoding as in ASCII. +

Among the possible locale encodings are UTF-8 and GB18030. Both can +represent any Unicode character as a sequence of bytes. UTF-8 is used in +most of the world, whereas GB18030 is used in the People's Republic of China, +because it is backward compatible with the GB2312 encoding that was used +there earlier. +

The legacy locale encodings, ISO-8859-15 (which supplanted ISO-8859-1 in +most of Europe), ISO-8859-2, KOI8-R, EUC-JP, etc., are still in use in +many places, though. +

UTF-16 and UTF-32 are not used as locale encodings, because they are not +ASCII compatible. +

+ + +

1.4 Choice of in-memory representation of strings

+ +

There are three ways of representing strings in memory of a running +program. +

+As ‘char *’ strings. Such strings are represented in locale encoding. +This approach is employed when not much text processing is done by the +program. When some Unicode aware processing is to be done, a string is +converted to Unicode on the fly and back to locale encoding afterwards. +
+As UTF-8 or UTF-16 or UTF-32 strings. This implies that conversion from +locale encoding to Unicode is performed on input, and in the opposite +direction on output. This approach is employed when the program does +a significant amount of text processing, or when the program has multiple +threads operating on the same data but in different locales. +
+As ‘wchar_t *’, a.k.a. “wide strings”. This approach is misguided, +see The wchar_t mess. +

+ +

+ + +

1.5 ‘`char *`’ strings

+ +

The classical C strings, with their C library support standardized by +ISO C and POSIX, can be used in internationalized programs with some +precautions. The problem with this API is that many of the C library +functions for strings don't work correctly on strings in locale +encodings, leading to bugs that only people in some cultures of the +world will experience. +

+ +

The first problem with the C library API is the support of multibyte +locales. According to the locale encoding, in general, every character +is represented by one or more bytes (up to 4 bytes in practice — but +use MB_LEN_MAX instead of the number 4 in the code). +When every character is represented by only 1 byte, we speak of an +“unibyte locale”, otherwise of a “multibyte locale”. It is important +to realize that the majority of Unix installations nowadays use UTF-8 +or GB18030 as locale encoding; therefore, the majority of users are +using multibyte locales. +

+ +

The important fact to remember is: +

A ‘char’ is a byte, not a character. +

+ +

As a consequence: +

+The <ctype.h> API is useless in this context; it does not work in +multibyte locales. +
+The strlen function does not return the number of characters +in a string. Nor does it return the number of screen columns occupied +by a string after it is output. It merely returns the number of +bytes occupied by a string. +
+Truncating a string, for example, with strncpy, can have the +effect of truncating it in the middle of a multibyte character. Such +a string will, when output, have a garbled character at its end, often +represented by a hollow box. +
+strchr and strrchr do not work with multibyte strings +if the locale encoding is GB18030 and the character to be searched is +a digit. +
+strstr does not work with multibyte strings if the locale encoding +is different from UTF-8. +
+strcspn, strpbrk, strspn cannot work +correctly in multibyte locales: they assume the second argument is a list of +single-byte characters. Even in this simple case, they do not work with +multibyte strings if the locale encoding is GB18030 and one of the +characters to be searched is a digit. +
+strsep and strtok_r do not work with multibyte strings +unless all of the delimiter characters are ASCII characters < 0x30. +
+The strcasecmp, strncasecmp, and strcasestr +functions do not work with multibyte strings. +
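Several of the pitfalls listed above can be reproduced with plain C. The sketch below is not taken from gnulib or libunistring; the function names are hypothetical, and the first two functions assume that the input is well-formed UTF-8. The third relies on the fact that GB18030 encodes U+0080 as the four bytes 81 30 81 30, so the byte value of the ASCII digit '0' occurs inside a multibyte character.

```c
#include <stddef.h>
#include <string.h>

/* Count the characters in a NUL-terminated UTF-8 string by counting
   the bytes that are not continuation bytes (bit pattern 10xxxxxx).
   strlen, by contrast, counts every byte. */
size_t utf8_char_count(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; s++)
        if (((unsigned char) *s & 0xC0) != 0x80)
            n++;
    return n;
}

/* Return the largest length <= max_bytes at which s can be cut
   without splitting a multibyte character. */
size_t utf8_prefix_len(const char *s, size_t max_bytes)
{
    size_t len = strlen(s);
    if (len <= max_bytes)
        return len;
    len = max_bytes;
    /* s[len] is the first byte that would be dropped.  If it is a
       continuation byte, the cut is inside a character: back up. */
    while (len > 0 && ((unsigned char) s[len] & 0xC0) == 0x80)
        len--;
    return len;
}

/* Demonstrate a false match: the string holds one GB18030 character
   and no digit, yet a byte-wise search for '0' succeeds.
   Returns 1 if strchr reports a match. */
int gb18030_false_digit_match(void)
{
    static const char one_char[] = "\x81\x30\x81\x30";
    return strchr(one_char, '0') != NULL;
}
```

With the UTF-8 string "Zürich" (`"Z\xC3\xBCrich"`), `strlen` reports 7 while `utf8_char_count` reports 6, and `utf8_prefix_len` with a limit of 2 bytes returns 1 rather than cutting the "ü" in half.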

+ +

The workarounds can be found in GNU gnulib +http://www.gnu.org/software/gnulib/. +

+gnulib has modules ‘mbchar’, ‘mbiter’, ‘mbuiter’ that +represent multibyte characters and allow iterating across a multibyte +string with the same ease as through a unibyte string. +
+gnulib has functions mbslen and mbswidth that can be +used instead of strlen when the number of characters or the +number of screen columns of a string is requested. +
+gnulib has functions mbschr and mbsrrchr that are +like strchr and strrchr, but work in multibyte locales. +
+gnulib has a function mbsstr that is like strstr, but works +in multibyte locales. +
+gnulib has functions mbscspn, mbspbrk, mbsspn +that are like strcspn, strpbrk, strspn, but +work in multibyte locales. +
+gnulib has functions mbssep and mbstok_r that are +like strsep and strtok_r but work in multibyte locales. +
+gnulib has functions mbscasecmp, mbsncasecmp, +mbspcasecmp, and mbscasestr that are like strcasecmp, +strncasecmp, and strcasestr, but +work in multibyte locales. Still, the function ulc_casecmp is +preferable to these functions; see below. +

+ +

The second problem with the C library API is that it has some assumptions built-in that are not valid in some languages: +

+It assumes that there are only two forms of every character: uppercase +and lowercase. This is not true for Croatian, where the character +LETTER DZ WITH CARON comes in three forms: +LATIN CAPITAL LETTER DZ WITH CARON (DZ), +LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON (Dz), +LATIN SMALL LETTER DZ WITH CARON (dz). +
+It assumes that uppercasing of 1 character leads to 1 character. This +is not true for German, where the LATIN SMALL LETTER SHARP S, when +uppercased, becomes ‘SS’. +
+It assumes that there is a 1:1 mapping between uppercase and lowercase forms. +This is not true for the Greek sigma: GREEK CAPITAL LETTER SIGMA is +the uppercase of both GREEK SMALL LETTER SIGMA and +GREEK SMALL LETTER FINAL SIGMA. +
+It assumes that the upper/lowercase mappings are position independent. +This is not true for the Greek sigma and the Lithuanian i. +

+ +

The correct way to deal with this problem is +

+to provide functions for titlecasing, as well as for upper- and +lowercasing, +
+to view case transformations as functions that operate on strings, +rather than on characters. +

+ +

This is implemented in this library, through the functions declared in <unicase.h>, see Case mappings <unicase.h>. +
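To show why case conversion has to be a string-to-string operation, here is a deliberately tiny sketch in plain C. It does not call the functions of <unicase.h>; the name `upper_sketch` is hypothetical, and the mapping table is hard-coded to ASCII letters plus LATIN SMALL LETTER SHARP S (bytes C3 9F in UTF-8). Everything else is copied unchanged, so this is an illustration of the interface shape, not a usable uppercasing routine.

```c
#include <stddef.h>

/* Uppercase the NUL-terminated UTF-8 string src into dst, which has
   room for dstsize bytes including the terminating NUL.
   Returns the number of bytes written (excluding the NUL), or
   (size_t) -1 if dst is too small.  Assumes an ASCII-based charset. */
size_t upper_sketch(const char *src, char *dst, size_t dstsize)
{
    size_t o = 0;

    while (*src != '\0') {
        char one;
        const char *piece;
        size_t in_len = 1, out_len = 1, i;

        if ((unsigned char) src[0] == 0xC3 && (unsigned char) src[1] == 0x9F) {
            /* One character in, two characters out. */
            piece = "SS";
            in_len = 2;
            out_len = 2;
        } else {
            one = (*src >= 'a' && *src <= 'z') ? (char) (*src - 'a' + 'A') : *src;
            piece = &one;
        }

        /* Keep one byte free for the terminating NUL. */
        if (o + out_len >= dstsize)
            return (size_t) -1;
        for (i = 0; i < out_len; i++)
            dst[o++] = piece[i];
        src += in_len;
    }

    if (o >= dstsize)
        return (size_t) -1;
    dst[o] = '\0';
    return o;
}
```

Given the six-character input "straße", the result is the seven-character string "STRASSE". A character-by-character interface in the style of `toupper` cannot express this, because it has no way to return two characters for one.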

+ + +

1.6 The `wchar_t` mess

+ +

The ISO C and POSIX standard creators made an attempt to fix the first +problem mentioned in the previous section. They introduced +

+a type ‘wchar_t’, designed to encapsulate an entire character, +
+a “wide string” type ‘wchar_t *’, and +
+functions declared in <wctype.h> that were meant to supplant the +ones in <ctype.h>. +

+ +

Unfortunately, this API and its implementation have numerous problems: +

+On AIX and Windows platforms, wchar_t is a 16-bit type. This +means that it can never accommodate an entire Unicode character. Either +the wchar_t * strings are limited to characters in UCS-2 (the +“Basic Multilingual Plane” of Unicode), or — if wchar_t * +strings are encoded in UTF-16 — a wchar_t represents only half +of a character in the worst case, making the <wctype.h> functions +pointless. + +
+On Solaris and FreeBSD, the wchar_t encoding is locale dependent +and undocumented. This means, if you want to know any property of a +wchar_t character, other than the properties defined by +<wctype.h> — such as whether it's a dash, currency symbol, +paragraph separator, or similar —, you have to convert it to +char * encoding first, by use of the function wctomb. + +
+When you read a stream of wide characters, through the functions +fgetwc and fgetws, and when the input stream/file is +not in the expected encoding, you have no way to determine the invalid +byte sequence and take corrective action. If you use these +functions, your program becomes “garbage in - more garbage out” or +“garbage in - abort”. +

+ +

As a consequence, it is better to use multibyte strings, as explained in +the previous section. Such multibyte strings can bypass limitations +of the wchar_t type, if you use functions defined in gnulib and +libunistring for text processing. They can also faithfully transport +malformed characters that were present in the input, without requiring +the program to produce garbage or abort. +
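The 16-bit problem described above can be sketched without `wchar_t` at all, using `uint16_t` as a stand-in for a 16-bit wide character type. The function names are hypothetical. The point is that a character outside the Basic Multilingual Plane occupies two units, so any function that looks at one unit in isolation sees only a surrogate, not a character.

```c
#include <stddef.h>
#include <stdint.h>

int is_high_surrogate(uint16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
int is_low_surrogate(uint16_t u)  { return u >= 0xDC00 && u <= 0xDFFF; }

/* Decode the character that starts at units[i], where i < n.
   Stores the code point in *cp and returns the number of units
   consumed (1 or 2).  An unpaired surrogate is reported as U+FFFD
   and consumes one unit. */
size_t utf16_decode(const uint16_t *units, size_t n, size_t i, uint32_t *cp)
{
    uint16_t u = units[i];

    if (is_high_surrogate(u) && i + 1 < n && is_low_surrogate(units[i + 1])) {
        *cp = 0x10000
              + (((uint32_t) (u - 0xD800) << 10)
                 | (uint32_t) (units[i + 1] - 0xDC00));
        return 2;
    }
    *cp = (is_high_surrogate(u) || is_low_surrogate(u)) ? 0xFFFD : u;
    return 1;
}

/* Count characters, not units, in a UTF-16 array of n units. */
size_t utf16_char_count(const uint16_t *units, size_t n)
{
    size_t count = 0, i = 0;
    uint32_t cp;

    while (i < n) {
        i += utf16_decode(units, n, i, &cp);
        count++;
    }
    return count;
}
```

For the three units 0041 D83D DE00, the character count is 2: the letter A followed by U+1F600. Classifying the unit D83D on its own, as a per-unit classification function would have to, says nothing about the character it belongs to.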

+ + +

1.7 Unicode strings

+ +

libunistring supports Unicode strings in three representations: + + + +

+UTF-8 strings, through the type ‘uint8_t *’. The units are bytes +(uint8_t). +
+UTF-16 strings, through the type ‘uint16_t *’. The units are 16-bit +memory words (uint16_t). +
+UTF-32 strings, through the type ‘uint32_t *’. The units are 32-bit +memory words (uint32_t). +

+ +

As with C strings, there are two variants: +

+Unicode strings with a terminating NUL character are represented as +a pointer to the first unit of the string. There is a unit containing +a 0 value at the end. It is considered part of the string for all +memory allocation purposes, but is not considered part of the string +for all other logical purposes. +
+Unicode strings where embedded NUL characters are allowed. These +are represented by a pointer to the first unit and the number of units +(not bytes!) of the string. In this setting, there is no trailing +zero-valued unit used as “end marker”. +
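The difference between the two variants can be sketched for UTF-16 as follows. The names `units16_until_nul`, `struct u16_span` and `u16_span_bytes` are hypothetical and exist only for this illustration; they are not the library's API. Note that the length in the second variant counts units, so the size in bytes is the unit count multiplied by the unit size.

```c
#include <stddef.h>
#include <stdint.h>

/* Variant 1: NUL-terminated.  The length is found by scanning for the
   zero-valued unit, which is not counted. */
size_t units16_until_nul(const uint16_t *s)
{
    size_t n = 0;
    while (s[n] != 0)
        n++;
    return n;
}

/* Variant 2: pointer plus explicit length in units.  Zero-valued units
   may appear anywhere, and no end marker is stored. */
struct u16_span {
    const uint16_t *units;
    size_t n_units;          /* number of 16-bit units, not bytes */
};

/* Memory occupied by the units of a span, in bytes. */
size_t u16_span_bytes(struct u16_span sp)
{
    return sp.n_units * sizeof (uint16_t);
}
```

A span over the three units 0041 0000 0042 has length 3 and occupies 6 bytes, whereas scanning the same memory as a NUL-terminated string stops after the first unit.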

+ +

+ + + + + + + + + + + + +


+ + This document was generated by Bruno Haible on July, 1 2009 using texi2html 1.78a. + +
+ +
