summaryrefslogtreecommitdiff
path: root/doc/libunistring_10.html
blob: b7bee4b184fab2f3578c1fe08fdfc99239a0ca23 (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html401/loose.dtd">
<html>
<!-- Created on October, 16 2024 by texi2html 1.78a -->
<!--
Written by: Lionel Cons <Lionel.Cons@cern.ch> (original author)
            Karl Berry  <karl@freefriends.org>
            Olaf Bachmann <obachman@mathematik.uni-kl.de>
            and many others.
Maintained by: Many creative people.
Send bugs and suggestions to <texi2html-bug@nongnu.org>

-->
<head>
<title>GNU libunistring: 10. Grapheme cluster breaks in strings &lt;unigbrk.h&gt;</title>

<meta name="description" content="GNU libunistring: 10. Grapheme cluster breaks in strings &lt;unigbrk.h&gt;">
<meta name="keywords" content="GNU libunistring: 10. Grapheme cluster breaks in strings &lt;unigbrk.h&gt;">
<meta name="resource-type" content="document">
<meta name="distribution" content="global">
<meta name="Generator" content="texi2html 1.78a">
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<style type="text/css">
<!--
a.summary-letter {text-decoration: none}
pre.display {font-family: serif}
pre.format {font-family: serif}
pre.menu-comment {font-family: serif}
pre.menu-preformatted {font-family: serif}
pre.smalldisplay {font-family: serif; font-size: smaller}
pre.smallexample {font-size: smaller}
pre.smallformat {font-family: serif; font-size: smaller}
pre.smalllisp {font-size: smaller}
span.roman {font-family:serif; font-weight:normal;}
span.sansserif {font-family:sans-serif; font-weight:normal;}
ul.toc {list-style: none}
-->
</style>


</head>

<body lang="en" bgcolor="#FFFFFF" text="#000000" link="#0000FF" vlink="#800080" alink="#FF0000">

<table cellpadding="1" cellspacing="1" border="0">
<tr><td valign="middle" align="left">[<a href="libunistring_9.html#SEC55" title="Beginning of this chapter or previous chapter"> &lt;&lt; </a>]</td>
<td valign="middle" align="left">[<a href="libunistring_11.html#SEC59" title="Next chapter"> &gt;&gt; </a>]</td>
<td valign="middle" align="left"> &nbsp; </td>
<td valign="middle" align="left"> &nbsp; </td>
<td valign="middle" align="left"> &nbsp; </td>
<td valign="middle" align="left"> &nbsp; </td>
<td valign="middle" align="left"> &nbsp; </td>
<td valign="middle" align="left">[<a href="libunistring_toc.html#SEC_Top" title="Cover (top) of document">Top</a>]</td>
<td valign="middle" align="left">[<a href="libunistring_toc.html#SEC_Contents" title="Table of contents">Contents</a>]</td>
<td valign="middle" align="left">[<a href="libunistring_21.html#SEC94" title="Index">Index</a>]</td>
<td valign="middle" align="left">[<a href="libunistring_abt.html#SEC_About" title="About (help)"> ? </a>]</td>
</tr></table>

<hr size="2">
<a name="unigbrk_002eh"></a>
<a name="SEC56"></a>
<h1 class="chapter"> <a href="libunistring_toc.html#TOC56">10. Grapheme cluster breaks in strings <code>&lt;unigbrk.h&gt;</code></a> </h1>

<p>This include file declares functions for determining where in a string
&ldquo;grapheme clusters&rdquo; start and end.  A &ldquo;grapheme cluster&rdquo; is an
approximation to a user-perceived character, which sometimes
corresponds to multiple Unicode characters.  Editing operations such as
mouse selection, cursor movement, and backspacing often operate on
grapheme clusters as units, not on individual characters.
</p>
<p>Some grapheme clusters are built from a base character and a combining
character.  The letter &lsquo;<samp>&eacute;</samp>&rsquo;,
for example, is most commonly represented in Unicode as a single
character U+00E8 <small>LATIN SMALL LETTER E WITH ACUTE</small>.  It is,
however, equally valid to use the pair of characters U+0065 <small>LATIN
SMALL LETTER E</small> followed by U+0301 <small>COMBINING ACUTE ACCENT</small>.  Since
the user would perceive this pair of characters as a single character,
they would be grouped into a single grapheme cluster.
</p>
<p>But there are also grapheme clusters that consist of several base characters.
For example, a Devanagari letter and a Devanagari vowel sign that follows it
may form a grapheme cluster.  Similarly, some pairs of Thai characters and
Hangul syllables (formed by two or three Hangul characters) are grapheme
clusters.
</p>

<hr size="6">
<a name="Grapheme-cluster-breaks-in-a-string"></a>
<a name="SEC57"></a>
<h2 class="section"> <a href="libunistring_toc.html#TOC57">10.1 Grapheme cluster breaks in a string</a> </h2>

<p>The following functions find a single boundary between grapheme
clusters in a string.
</p>
<dl>
<dt><u>Function:</u> void <b>u8_grapheme_next</b><i> (const&nbsp;uint8_t&nbsp;*<var>s</var>, const&nbsp;uint8_t&nbsp;*<var>end</var>)</i>
<a name="IDX790"></a>
</dt>
<dt><u>Function:</u> void <b>u16_grapheme_next</b><i> (const&nbsp;uint16_t&nbsp;*<var>s</var>, const&nbsp;uint16_t&nbsp;*<var>end</var>)</i>
<a name="IDX791"></a>
</dt>
<dt><u>Function:</u> void <b>u32_grapheme_next</b><i> (const&nbsp;uint32_t&nbsp;*<var>s</var>, const&nbsp;uint32_t&nbsp;*<var>end</var>)</i>
<a name="IDX792"></a>
</dt>
<dd><p>Returns the start of the next grapheme cluster following <var>s</var>,
or <var>end</var> if no grapheme cluster break is encountered before it.
Returns NULL if and only if <code><var>s</var> == <var>end</var></code>.
</p>
<p>Note that these functions do not handle the case when a character
outside of the range between <var>s</var> and <var>end</var> is needed to
determine the boundary.
This is the case in particular with syllables in Indic scripts or emojis.
Use <code>_grapheme_breaks</code> functions for such cases.
</p></dd></dl>

<dl>
<dt><u>Function:</u> void <b>u8_grapheme_prev</b><i> (const&nbsp;uint8_t&nbsp;*<var>s</var>, const&nbsp;uint8_t&nbsp;*<var>start</var>)</i>
<a name="IDX793"></a>
</dt>
<dt><u>Function:</u> void <b>u16_grapheme_prev</b><i> (const&nbsp;uint16_t&nbsp;*<var>s</var>, const&nbsp;uint16_t&nbsp;*<var>start</var>)</i>
<a name="IDX794"></a>
</dt>
<dt><u>Function:</u> void <b>u32_grapheme_prev</b><i> (const&nbsp;uint32_t&nbsp;*<var>s</var>, const&nbsp;uint32_t&nbsp;*<var>start</var>)</i>
<a name="IDX795"></a>
</dt>
<dd><p>Returns the start of the grapheme cluster preceding <var>s</var>, or
<var>start</var> if no grapheme cluster break is encountered before it.
Returns NULL if and only if <code><var>s</var> == <var>start</var></code>.
</p>
<p>Note that these functions do not handle the case when a character
outside of the range between <var>start</var> and <var>s</var> is needed to
determine the boundary.
This is the case in particular with syllables in Indic scripts or emojis.
Use <code>_grapheme_breaks</code> functions for such cases.
</p>
<p>Note also that these functions work only on well-formed Unicode strings.
</p></dd></dl>

<p>The following functions determine all of the grapheme cluster
boundaries in a string.
</p>
<dl>
<dt><u>Function:</u> void <b>u8_grapheme_breaks</b><i> (const&nbsp;uint8_t&nbsp;*<var>s</var>, size_t&nbsp;<var>n</var>, char&nbsp;*<var>p</var>)</i>
<a name="IDX796"></a>
</dt>
<dt><u>Function:</u> void <b>u16_grapheme_breaks</b><i> (const&nbsp;uint16_t&nbsp;*<var>s</var>, size_t&nbsp;<var>n</var>, char&nbsp;*<var>p</var>)</i>
<a name="IDX797"></a>
</dt>
<dt><u>Function:</u> void <b>u32_grapheme_breaks</b><i> (const&nbsp;uint32_t&nbsp;*<var>s</var>, size_t&nbsp;<var>n</var>, char&nbsp;*<var>p</var>)</i>
<a name="IDX798"></a>
</dt>
<dt><u>Function:</u> void <b>ulc_grapheme_breaks</b><i> (const&nbsp;char&nbsp;*<var>s</var>, size_t&nbsp;<var>n</var>, char&nbsp;*<var>p</var>)</i>
<a name="IDX799"></a>
</dt>
<dt><u>Function:</u> void <b>uc_grapheme_breaks</b><i> (const&nbsp;ucs_t&nbsp;*<var>s</var>, size_t&nbsp;<var>n</var>, char&nbsp;*<var>p</var>)</i>
<a name="IDX800"></a>
</dt>
<dd><p>Determines the grapheme cluster break points in <var>s</var>, an array of
<var>n</var> units, and stores the result at <code><var>p</var>[0..<var>nx</var>-1]</code>.
</p><dl compact="compact">
<dt> <code><var>p</var>[i] = 1</code></dt>
<dd><p>means that there is a grapheme cluster boundary between
<code><var>s</var>[i-1]</code> and <code><var>s</var>[i]</code>.
</p></dd>
<dt> <code><var>p</var>[i] = 0</code></dt>
<dd><p>means that <code><var>s</var>[i-1]</code> and <code><var>s</var>[i]</code> are part of the
same grapheme cluster.
</p></dd>
</dl>
<p><code><var>p</var>[0]</code> is always set to 1, because there is always a
grapheme cluster break at start of text.
</p>
<p>In addition to the above variants for UTF-8, UTF-16, and UTF-32 strings,
<code>&lt;unigbrk.h&gt;</code> provides another variant: <code>uc_grapheme_breaks</code>.
</p>
<p>This is similar to <code>u32_grapheme_breaks</code>, but it accepts any
characters which may not be represented in UTF-32, such as control
characters.
</p></dd></dl>

<hr size="6">
<a name="Grapheme-cluster-break-property"></a>
<a name="SEC58"></a>
<h2 class="section"> <a href="libunistring_toc.html#TOC58">10.2 Grapheme cluster break property</a> </h2>

<p>This is a more low-level API.  The grapheme cluster break property is a
property defined in Unicode Standard Annex #29, section &ldquo;Grapheme Cluster
Boundaries&rdquo;, see
<a href="https://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries">https://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries</a>.
It is used for determining the grapheme cluster breaks in a string.
</p>
<p>The following are the possible values of the grapheme cluster break
property.  More values may be added in the future.
</p>
<dl>
<dt><u>Constant:</u> int <b>GBP_OTHER</b>
<a name="IDX801"></a>
</dt>
<dt><u>Constant:</u> int <b>GBP_CR</b>
<a name="IDX802"></a>
</dt>
<dt><u>Constant:</u> int <b>GBP_LF</b>
<a name="IDX803"></a>
</dt>
<dt><u>Constant:</u> int <b>GBP_CONTROL</b>
<a name="IDX804"></a>
</dt>
<dt><u>Constant:</u> int <b>GBP_EXTEND</b>
<a name="IDX805"></a>
</dt>
<dt><u>Constant:</u> int <b>GBP_PREPEND</b>
<a name="IDX806"></a>
</dt>
<dt><u>Constant:</u> int <b>GBP_SPACINGMARK</b>
<a name="IDX807"></a>
</dt>
<dt><u>Constant:</u> int <b>GBP_L</b>
<a name="IDX808"></a>
</dt>
<dt><u>Constant:</u> int <b>GBP_V</b>
<a name="IDX809"></a>
</dt>
<dt><u>Constant:</u> int <b>GBP_T</b>
<a name="IDX810"></a>
</dt>
<dt><u>Constant:</u> int <b>GBP_LV</b>
<a name="IDX811"></a>
</dt>
<dt><u>Constant:</u> int <b>GBP_LVT</b>
<a name="IDX812"></a>
</dt>
<dt><u>Constant:</u> int <b>GBP_RI</b>
<a name="IDX813"></a>
</dt>
<dt><u>Constant:</u> int <b>GBP_ZWJ</b>
<a name="IDX814"></a>
</dt>
<dt><u>Constant:</u> int <b>GBP_EB</b>
<a name="IDX815"></a>
</dt>
<dt><u>Constant:</u> int <b>GBP_EM</b>
<a name="IDX816"></a>
</dt>
<dt><u>Constant:</u> int <b>GBP_GAZ</b>
<a name="IDX817"></a>
</dt>
<dt><u>Constant:</u> int <b>GBP_EBG</b>
<a name="IDX818"></a>
</dt>
</dl>

<p>The following function looks up the grapheme cluster break property of a
character.
</p>
<dl>
<dt><u>Function:</u> int <b>uc_graphemeclusterbreak_property</b><i> (ucs4_t&nbsp;<var>uc</var>)</i>
<a name="IDX819"></a>
</dt>
<dd><p>Returns the Grapheme_Cluster_Break property of a Unicode character.
</p></dd></dl>

<p>The following function determines whether there is a grapheme cluster
break between two Unicode characters.  It is the primitive upon which
the higher-level functions in the previous section are directly based.
</p>
<dl>
<dt><u>Function:</u> bool <b>uc_is_grapheme_break</b><i> (ucs4_t&nbsp;<var>a</var>, ucs4_t&nbsp;<var>b</var>)</i>
<a name="IDX820"></a>
</dt>
<dd><p>Returns true if there is an grapheme cluster boundary between Unicode
characters <var>a</var> and <var>b</var>.
</p>
<p>There is always a grapheme cluster break at the start or end of text.
You can specify zero for <var>a</var> or <var>b</var> to indicate start of text or end
of text, respectively.
</p>
<p>This implements the extended (not legacy) grapheme cluster rules
described in the Unicode standard, because the standard says that they
are preferred.
</p>
<p>Note that this function does not handle the case when three or more
consecutive characters are needed to determine the boundary.
This is the case in particular with syllables in Indic scripts or emojis.
Use <code>uc_grapheme_breaks</code> for such cases.
</p></dd></dl>
<hr size="6">
<table cellpadding="1" cellspacing="1" border="0">
<tr><td valign="middle" align="left">[<a href="#SEC56" title="Beginning of this chapter or previous chapter"> &lt;&lt; </a>]</td>
<td valign="middle" align="left">[<a href="libunistring_11.html#SEC59" title="Next chapter"> &gt;&gt; </a>]</td>
<td valign="middle" align="left"> &nbsp; </td>
<td valign="middle" align="left"> &nbsp; </td>
<td valign="middle" align="left"> &nbsp; </td>
<td valign="middle" align="left"> &nbsp; </td>
<td valign="middle" align="left"> &nbsp; </td>
<td valign="middle" align="left">[<a href="libunistring_toc.html#SEC_Top" title="Cover (top) of document">Top</a>]</td>
<td valign="middle" align="left">[<a href="libunistring_toc.html#SEC_Contents" title="Table of contents">Contents</a>]</td>
<td valign="middle" align="left">[<a href="libunistring_21.html#SEC94" title="Index">Index</a>]</td>
<td valign="middle" align="left">[<a href="libunistring_abt.html#SEC_About" title="About (help)"> ? </a>]</td>
</tr></table>
<p>
 <font size="-1">
  This document was generated by <em>Bruno Haible</em> on <em>October, 16 2024</em> using <a href="https://www.nongnu.org/texi2html/"><em>texi2html 1.78a</em></a>.
 </font>
 <br>

</p>
</body>
</html>