From f7c3580478601e3a77dc864e5a1d91c1edad5187 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?J=C3=B6rg=20Frings-F=C3=BCrst?= <debian@jff.email>
Date: Wed, 7 Mar 2018 05:31:03 +0100
Subject: New upstream version 0.9.9

---
 doc/wchar_t.texi | 51 +++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 51 insertions(+)
 create mode 100644 doc/wchar_t.texi

(limited to 'doc/wchar_t.texi')

diff --git a/doc/wchar_t.texi b/doc/wchar_t.texi
new file mode 100644
index 00000000..f5c239a5
--- /dev/null
+++ b/doc/wchar_t.texi
@@ -0,0 +1,51 @@
+@node The wchar_t mess
+@appendix The @code{wchar_t} mess
+
+@cindex wchar_t, type
+The ISO C and POSIX standard creators made an attempt to fix the first
+problem mentioned in the section @ref{char * strings}.  They introduced
+@itemize @bullet
+@item
+a type @samp{wchar_t}, designed to encapsulate an entire character,
+@item
+a ``wide string'' type @samp{wchar_t *}, and
+@item
+functions declared in @posixheader{wctype.h} that were meant to supplant the
+ones in @posixheader{ctype.h}.
+@end itemize
+
+Unfortunately, this API and its implementation has numerous problems:
+
+@itemize @bullet
+@item
+On AIX and Windows platforms, @code{wchar_t} is a 16-bit type.  This
+means that it can never accommodate an entire Unicode character.  Either
+the @code{wchar_t *} strings are limited to characters in UCS-2 (the
+``Basic Multilingual Plane'' of Unicode), or --- if @code{wchar_t *}
+strings are encoded in UTF-16 --- a @code{wchar_t} represents only half
+of a character in the worst case, making the @posixheader{wctype.h} functions
+pointless.
+
+@item
+On Solaris and FreeBSD, the @code{wchar_t} encoding is locale dependent
+and undocumented.  This means, if you want to know any property of a
+@code{wchar_t} character, other than the properties defined by
+@posixheader{wctype.h} --- such as whether it's a dash, currency symbol,
+paragraph separator, or similar ---, you have to convert it to
+@code{char *} encoding first, by use of the function @posixfunc{wctomb}.
+
+@item
+When you read a stream of wide characters, through the functions
+@posixfunc{fgetwc} and @posixfunc{fgetws}, and when the input stream/file is
+not in the expected encoding, you have no way to determine the invalid
+byte sequence and do some corrective action.  If you use these
+functions, your program becomes ``garbage in - more garbage out'' or
+``garbage in - abort''.
+@end itemize
+
+As a consequence, it is better to use multibyte strings, as explained in
+the section @ref{char * strings}.  Such multibyte strings can bypass
+limitations of the @code{wchar_t} type, if you use functions defined in gnulib
+and libunistring for text processing.  They can also faithfully transport
+malformed characters that were present in the input, without requiring
+the program to produce garbage or abort.
-- 
cgit v1.2.3