summaryrefslogtreecommitdiff
path: root/doc/char32_t.texi
diff options
context:
space:
mode:
Diffstat (limited to 'doc/char32_t.texi')
-rw-r--r--doc/char32_t.texi50
1 files changed, 50 insertions, 0 deletions
diff --git a/doc/char32_t.texi b/doc/char32_t.texi
new file mode 100644
index 00000000..040e298e
--- /dev/null
+++ b/doc/char32_t.texi
@@ -0,0 +1,50 @@
+@node The char32_t problem
+@appendix The @code{char32_t} problem
+
+@cindex char32_t, type
+@cindex char16_t, type
+In response to the @code{wchar_t} mess described in the previous section,
+ISO C 11 introduces two new types: @code{char32_t} and @code{char16_t}.
+
+@code{char32_t} is a type like @code{wchar_t}, with the added guarantee that it
+is 32 bits wide. So, it is a type that is appropriate for encoding a Unicode
+character. It is meant to resolve the problems of the 16-bit wide
+@code{wchar_t} on AIX and Windows platforms, and allow a saner programming model
+for wide character strings across all platforms.
+
+@code{char16_t} is a type like @code{wchar_t}, with the added guarantee that it
+is 16 bits wide. It is meant to allow porting programs that use the broken wide
+character strings programming model from Windows to all platforms. Of course,
+no one needs this.
+
+These types are accompanied with a syntax for defining wide string literals with
+these element types: @code{u"..."} and @code{U"..."}.
+
+So far, so good. What the ISO C designers forgot, is to provide standardized C
+library functions that operate on these wide character strings. They
+standardized only the most basic functions, @code{mbrtoc32} and @code{c32rtomb},
+which are analogous to @code{mbrtowc} and @code{wcrtomb}, respectively. For the
+rest, GNU gnulib @url{https://www.gnu.org/software/gnulib/} provides the
+functions:
+@itemize @bullet
+@item
+Functions for converting an entire string: @code{mbstoc32s} -- like
+@code{mbstowcs}, @code{c32stombs} -- like @code{wcstombs}.
+@item
+Functions for testing the properties of a 32-bit wide character:
+@code{c32isalnum}, @code{c32isalpha}, etc. -- like @code{iswalnum},
+@code{iswalpha}, etc.
+@end itemize
+
+Still, this API has two problems:
+@itemize @bullet
+@item
+The @code{char32_t} encoding is locale dependent and undocumented. This means,
+if you want to know any property of a @code{char32_t} character, other than the
+properties defined by @code{<wctype.h>} -- such as whether it's a dash, currency
+symbol, paragraph separator, or similar --, you have to convert it to
+@code{char *} encoding first, by use of the function @code{c32tomb}.
+@item
+Even on platforms where @code{wchar_t} is 32 bits wide, the @code{char32_t}
+encoding may be different from the @code{wchar_t} encoding.
+@end itemize