Diffstat (limited to 'doc/wchar_t.texi')
-rw-r--r-- doc/wchar_t.texi 51
1 files changed, 51 insertions, 0 deletions
diff --git a/doc/wchar_t.texi b/doc/wchar_t.texi
new file mode 100644
index 00000000..f5c239a5
--- /dev/null
+++ b/doc/wchar_t.texi
@@ -0,0 +1,51 @@
+@node The wchar_t mess
+@appendix The @code{wchar_t} mess
+
+@cindex wchar_t, type
+The ISO C and POSIX standard creators made an attempt to fix the first
+problem mentioned in the section @ref{char * strings}. They introduced
+@itemize @bullet
+@item
+a type @samp{wchar_t}, designed to encapsulate an entire character,
+@item
+a ``wide string'' type @samp{wchar_t *}, and
+@item
+functions declared in @posixheader{wctype.h} that were meant to supplant the
+ones in @posixheader{ctype.h}.
+@end itemize
+
+Unfortunately, this API and its implementation have numerous problems:
+
+@itemize @bullet
+@item
+On Windows platforms and on AIX in 32-bit mode, @code{wchar_t} is a
+16-bit type. This means that it cannot accommodate every Unicode
+character. Either the @code{wchar_t *} strings are limited to
+characters in UCS-2 (the ``Basic Multilingual Plane'' of Unicode), or,
+if @code{wchar_t *} strings are encoded in UTF-16, a @code{wchar_t}
+may represent only half of a character (one unit of a surrogate pair),
+making the @posixheader{wctype.h} functions pointless for such characters.
+
+@item
+On Solaris and FreeBSD, the @code{wchar_t} encoding is locale dependent
+and undocumented. As a result, if you want to know any property of a
+@code{wchar_t} character other than those defined by
+@posixheader{wctype.h} (such as whether it's a dash, currency symbol,
+paragraph separator, or similar), you have to convert it to
+@code{char *} encoding first, using the function @posixfunc{wctomb}.
+
+@item
+When you read a stream of wide characters through the functions
+@posixfunc{fgetwc} and @posixfunc{fgetws}, and the input stream/file is
+not in the expected encoding, you have no way to locate the invalid
+byte sequence and take corrective action. If you use these
+functions, your program becomes ``garbage in - more garbage out'' or
+``garbage in - abort''.
+@end itemize
+
+As a consequence, it is better to use multibyte strings, as explained in
+the section @ref{char * strings}. Such multibyte strings can bypass
+limitations of the @code{wchar_t} type, if you use functions defined in gnulib
+and libunistring for text processing. They can also faithfully transport
+malformed characters that were present in the input, without requiring
+the program to produce garbage or abort.
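
Reviewer note on the 16-bit @code{wchar_t} item above: the "half of a character" case can be made concrete with a small sketch. The function `utf16_encode` below is a hypothetical helper written for this note, not something taken from gnulib or libunistring. It uses `uint16_t` in place of a 16-bit `wchar_t` so that it behaves the same on every platform.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not a gnulib or libunistring function).
   Encode one Unicode code point CP as UTF-16 into OUT.
   Returns the number of 16-bit units written (1 or 2),
   or 0 if CP is not a valid Unicode scalar value.  */
size_t
utf16_encode (uint32_t cp, uint16_t out[2])
{
  /* Reject values beyond U+10FFFF and the surrogate range itself.  */
  if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
    return 0;

  /* Characters in the Basic Multilingual Plane fit in one unit.  */
  if (cp < 0x10000)
    {
      out[0] = (uint16_t) cp;
      return 1;
    }

  /* Everything else is split into a high and a low surrogate.
     Neither unit alone identifies the character.  */
  cp -= 0x10000;
  out[0] = (uint16_t) (0xD800 | (cp >> 10));
  out[1] = (uint16_t) (0xDC00 | (cp & 0x3FF));
  return 2;
}
```

For U+20AC (EURO SIGN) the sketch produces the single unit 0x20AC, while for U+1F600 it produces the two units 0xD83D and 0xDE00. Passing either of those two units to a @posixheader{wctype.h} classification function cannot yield a meaningful answer about the character, which is the problem the item describes. libunistring offers `u16_uctomb` for real use.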