New upstream version 0.9.9 (upstream/0.9.9)

author: Jörg Frings-Fürst <debian@jff.email> 2018-03-07 05:31:03 +0100
committer: Jörg Frings-Fürst <debian@jff.email> 2018-03-07 05:31:03 +0100
commit: f7c3580478601e3a77dc864e5a1d91c1edad5187 (patch)
tree: 26f2b85233e76ce30cbd013559944404e66a6552 /doc/wchar_t.texi
parent: 44a3eaeba04ef78835ca741592c376428ada5f71 (diff)
1 files changed, 51 insertions, 0 deletions
diff --git a/doc/wchar_t.texi b/doc/wchar_t.texi
new file mode 100644
index 00000000..f5c239a5
--- /dev/null
+++ b/doc/wchar_t.texi
@@ -0,0 +1,51 @@
+@node The wchar_t mess
+@appendix The @code{wchar_t} mess
+
+@cindex wchar_t, type
+The ISO C and POSIX standard creators made an attempt to fix the first
+problem mentioned in the section @ref{char * strings}.  They introduced
+@itemize @bullet
+@item
+a type @samp{wchar_t}, designed to encapsulate an entire character,
+@item
+a ``wide string'' type @samp{wchar_t *}, and
+@item
+functions declared in @posixheader{wctype.h} that were meant to supplant the
+ones in @posixheader{ctype.h}.
+@end itemize
+
+Unfortunately, this API and its implementation have numerous problems:
+
+@itemize @bullet
+@item
+On AIX and Windows platforms, @code{wchar_t} is a 16-bit type.  This
+means that it can never accommodate an entire Unicode character.  Either
+the @code{wchar_t *} strings are limited to characters in UCS-2 (the
+``Basic Multilingual Plane'' of Unicode), or --- if @code{wchar_t *}
+strings are encoded in UTF-16 --- a @code{wchar_t} represents only half
+of a character in the worst case, making the @posixheader{wctype.h} functions
+pointless.
+
+@item
+On Solaris and FreeBSD, the @code{wchar_t} encoding is locale dependent
+and undocumented.  This means, if you want to know any property of a
+@code{wchar_t} character, other than the properties defined by
+@posixheader{wctype.h} --- such as whether it's a dash, currency symbol,
+paragraph separator, or similar ---, you have to convert it to
+@code{char *} encoding first, by use of the function @posixfunc{wctomb}.
+
+@item
+When you read a stream of wide characters, through the functions
+@posixfunc{fgetwc} and @posixfunc{fgetws}, and when the input stream/file is
+not in the expected encoding, you have no way to determine the invalid
+byte sequence and do some corrective action.  If you use these
+functions, your program becomes ``garbage in - more garbage out'' or
+``garbage in - abort''.
+@end itemize
+
+As a consequence, it is better to use multibyte strings, as explained in
+the section @ref{char * strings}.  Such multibyte strings can bypass
+limitations of the @code{wchar_t} type, if you use functions defined in gnulib
+and libunistring for text processing.  They can also faithfully transport
+malformed characters that were present in the input, without requiring
+the program to produce garbage or abort.
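
The first problem in the added text (a 16-bit @code{wchar_t} holding only half of a character) can be illustrated with a minimal C sketch. It is not part of this commit or of libunistring. The names utf16_encode and is_surrogate are hypothetical helpers introduced only for illustration, and uint16_t stands in for a 16-bit wchar_t so that the result does not depend on the platform.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not a libunistring or libc function):
   encode one Unicode code point as UTF-16 code units.
   uint16_t is an assumed stand-in for a 16-bit wchar_t.
   Returns the number of units written (1 or 2), or 0 if CP is
   not a valid Unicode scalar value.  */
size_t
utf16_encode (uint32_t cp, uint16_t out[2])
{
  /* Reject values beyond U+10FFFF and the surrogate range itself.  */
  if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
    return 0;

  /* Characters in the Basic Multilingual Plane fit in one unit.  */
  if (cp < 0x10000)
    {
      out[0] = (uint16_t) cp;
      return 1;
    }

  /* Everything else is split into a high and a low surrogate.  */
  cp -= 0x10000;
  out[0] = (uint16_t) (0xD800 + (cp >> 10));
  out[1] = (uint16_t) (0xDC00 + (cp & 0x3FF));
  return 2;
}

/* Hypothetical helper: true if UNIT is one half of a surrogate pair,
   i.e. not a character on its own.  */
int
is_surrogate (uint16_t unit)
{
  return unit >= 0xD800 && unit <= 0xDFFF;
}
```

For a character outside the Basic Multilingual Plane, such as U+1D11E, utf16_encode writes two units, and is_surrogate is true for each of them. A classification function that receives a single 16-bit unit therefore sees only half of the character, which is the situation the first bullet of the added text describes.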