From f7c3580478601e3a77dc864e5a1d91c1edad5187 Mon Sep 17 00:00:00 2001
From: Jörg Frings-Fürst
Date: Wed, 7 Mar 2018 05:31:03 +0100
Subject: New upstream version 0.9.9

---
 doc/wchar_t.texi | 51 +++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 51 insertions(+)
 create mode 100644 doc/wchar_t.texi

diff --git a/doc/wchar_t.texi b/doc/wchar_t.texi
new file mode 100644
index 00000000..f5c239a5
--- /dev/null
+++ b/doc/wchar_t.texi
@@ -0,0 +1,51 @@
+@node The wchar_t mess
+@appendix The @code{wchar_t} mess
+
+@cindex wchar_t, type
+The ISO C and POSIX standard creators made an attempt to fix the first
+problem mentioned in the section @ref{char * strings}. They introduced
+@itemize @bullet
+@item
+a type @samp{wchar_t}, designed to encapsulate an entire character,
+@item
+a ``wide string'' type @samp{wchar_t *}, and
+@item
+functions declared in @posixheader{wctype.h} that were meant to supplant the
+ones in @posixheader{ctype.h}.
+@end itemize
+
+Unfortunately, this API and its implementation have numerous problems:
+
+@itemize @bullet
+@item
+On AIX and Windows platforms, @code{wchar_t} is a 16-bit type. This
+means that it cannot accommodate every Unicode character. Either
+the @code{wchar_t *} strings are limited to characters in UCS-2 (the
+``Basic Multilingual Plane'' of Unicode), or, if @code{wchar_t *}
+strings are encoded in UTF-16, a @code{wchar_t} represents only half
+of a character in the worst case, making the @posixheader{wctype.h} functions
+pointless.
+
+@item
+On Solaris and FreeBSD, the @code{wchar_t} encoding is locale dependent
+and undocumented. This means that if you want to know any property of a
+@code{wchar_t} character that @posixheader{wctype.h} does not define,
+such as whether it is a dash, currency symbol, paragraph separator,
+or similar, you have to convert it to @code{char *} encoding first,
+using the function @posixfunc{wctomb}.
+
+@item
+When you read a stream of wide characters through the functions
+@posixfunc{fgetwc} and @posixfunc{fgetws}, and the input stream or file is
+not in the expected encoding, you have no way to locate the invalid
+byte sequence and take corrective action. If you use these
+functions, your program becomes ``garbage in - more garbage out'' or
+``garbage in - abort''.
+@end itemize
+
+As a consequence, it is better to use multibyte strings, as explained in
+the section @ref{char * strings}. Such multibyte strings can bypass
+limitations of the @code{wchar_t} type, if you use functions defined in gnulib
+and libunistring for text processing. They can also faithfully transport
+malformed characters that were present in the input, without requiring
+the program to produce garbage or abort.
--
cgit v1.2.3
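
Note on the 16-bit case described in the patch above: the claim that a 16-bit wchar_t may hold "only half of a character" can be illustrated with a small C sketch. This is not code from the patch or from libunistring. The type name wchar16 and the helpers utf16_encode and is_surrogate are hypothetical, introduced only for illustration, and wchar16 merely simulates a 16-bit wchar_t so that the sketch behaves the same on platforms where wchar_t is wider.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a 16-bit wchar_t (assumption: models the
   AIX/Windows situation described in the text, not the real type). */
typedef uint16_t wchar16;

/* Encode one Unicode code point as UTF-16.
   Returns the number of 16-bit units written to out: 1 or 2. */
size_t utf16_encode(uint32_t cp, wchar16 out[2])
{
    /* Only valid scalar values: at most U+10FFFF and not a surrogate. */
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));

    if (cp < 0x10000) {
        /* Basic Multilingual Plane: fits in a single unit. */
        out[0] = (wchar16)cp;
        return 1;
    }

    /* Supplementary plane: split the 20-bit offset into two halves. */
    cp -= 0x10000;
    out[0] = (wchar16)(0xD800 + (cp >> 10));    /* high surrogate */
    out[1] = (wchar16)(0xDC00 + (cp & 0x3FF));  /* low surrogate */
    return 2;
}

/* Nonzero if the unit is a surrogate, that is, only half of a character. */
int is_surrogate(wchar16 u)
{
    return u >= 0xD800 && u <= 0xDFFF;
}
```

Under these assumptions, U+20AC (EURO SIGN) occupies one unit, while U+1D11E (MUSICAL SYMBOL G CLEF) occupies two, and each of those two units is a surrogate on its own. A per-unit classification function in the style of wctype.h would be handed one surrogate at a time and could say nothing useful about the character as a whole.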