From 5f2b09982312c98863eb9a8dfe2c608b81f58259 Mon Sep 17 00:00:00 2001
From: "Manuel A. Fernandez Montecelo"
Date: Thu, 26 May 2016 16:48:15 +0100
Subject: Imported Upstream version 0.9.6

---
 doc/libunistring_10.html | 228 ++++++++++++++++++++++++++++++-----------------
 1 file changed, 148 insertions(+), 80 deletions(-)

diff --git a/doc/libunistring_10.html b/doc/libunistring_10.html
index 617406d6..3f4f5dac 100644
--- a/doc/libunistring_10.html
+++ b/doc/libunistring_10.html
@@ -1,6 +1,6 @@
-GNU libunistring: 10. Word breaks in strings <uniwbrk.h>
+GNU libunistring: 10. Grapheme cluster breaks in strings <unigbrk.h>
@@ -42,8 +42,8 @@ ul.toc {list-style: none}
@@ -51,126 +51,194 @@ ul.toc {list-style: none}

- - -

10. Word breaks in strings <uniwbrk.h>

+ + +

10. Grapheme cluster breaks in strings <unigbrk.h>

This include file declares functions for determining where in a string
-“words” start and end. Here “words” are not necessarily the same as
-entities that can be looked up in dictionaries, but rather groups of
-consecutive characters that should not be split by text processing
-operations.
+“grapheme clusters” start and end. A “grapheme cluster” is an
+approximation to a user-perceived character, which sometimes
+corresponds to multiple Unicode characters. Editing operations such as
+mouse selection, cursor movement, and backspacing often operate on
+grapheme clusters as units, not on individual characters.

+

Some grapheme clusters are built from a base character and a combining +character. The letter ‘é’, +for example, is most commonly represented in Unicode as a single +character U+00E9 LATIN SMALL LETTER E WITH ACUTE. It is, +however, equally valid to use the pair of characters U+0065 LATIN +SMALL LETTER E followed by U+0301 COMBINING ACUTE ACCENT. Since +the user would perceive this pair of characters as a single character, +they would be grouped into a single grapheme cluster. +

+

But there are also grapheme clusters that consist of several base characters. +For example, a Devanagari letter and a Devanagari vowel sign that follows it +may form a grapheme cluster. Similarly, some pairs of Thai characters and +Hangul syllables (formed by two or three Hangul characters) are grapheme +clusters.


- - -

10.1 Word breaks in a string

+ + +

10.1 Grapheme cluster breaks in a string

+ +

The following functions find a single boundary between grapheme +clusters in a string. +

+
+
Function: const uint8_t * u8_grapheme_next (const uint8_t *s, const uint8_t *end) + +
+
Function: const uint16_t * u16_grapheme_next (const uint16_t *s, const uint16_t *end) + +
+
Function: const uint32_t * u32_grapheme_next (const uint32_t *s, const uint32_t *end) + +
+

Returns the start of the next grapheme cluster following s, +or end if no grapheme cluster break is encountered before it. +Returns NULL if and only if s == end. +
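
The contract described above (a pointer to the start of the next cluster, end when no break comes earlier, NULL only for an empty range) lends itself to a simple iteration loop. The following is a minimal sketch of that loop and does not call libunistring: toy_grapheme_next, toy_is_extend and count_clusters are hypothetical names, and the stand-in assumes that only the combining marks U+0300..U+036F extend a cluster, which is far simpler than the real rules of UAX #29.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for u32_grapheme_next, NOT the libunistring code.
   Assumption: only U+0300..U+036F (combining diacritical marks) extend
   a cluster; the real rules of UAX #29 cover many more cases. */
static int toy_is_extend(uint32_t uc)
{
    return uc >= 0x0300 && uc <= 0x036F;
}

/* Same contract as documented above: NULL if and only if s == end,
   otherwise the start of the next cluster, or end if none starts earlier. */
static const uint32_t *toy_grapheme_next(const uint32_t *s, const uint32_t *end)
{
    if (s == end)
        return NULL;
    s++;                                  /* consume the base character */
    while (s < end && toy_is_extend(*s))  /* then any combining marks */
        s++;
    return s;
}

/* Walks the whole range cluster by cluster and counts the clusters. */
static size_t count_clusters(const uint32_t *s, size_t n)
{
    const uint32_t *end = s + n;
    size_t count = 0;
    while ((s = toy_grapheme_next(s, end)) != NULL)
        count++;
    return count;
}
```

With the library itself the loop should keep the same shape, with u32_grapheme_next (or the u8/u16 variant) in place of the stand-in; the NULL return for s == end is what terminates it.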

+ +
+
Function: const uint8_t * u8_grapheme_prev (const uint8_t *s, const uint8_t *start) + +
+
Function: const uint16_t * u16_grapheme_prev (const uint16_t *s, const uint16_t *start) + +
+
Function: const uint32_t * u32_grapheme_prev (const uint32_t *s, const uint32_t *start) + +
+

Returns the start of the grapheme cluster preceding s, or +start if no grapheme cluster break is encountered before it. +Returns NULL if and only if s == start. +

-

The following functions determine the word breaks in a string. +

The following functions determine all of the grapheme cluster +boundaries in a string.

-
Function: void u8_wordbreaks (const uint8_t *s, size_t n, char *p) - +
Function: void u8_grapheme_breaks (const uint8_t *s, size_t n, char *p) +
-
Function: void u16_wordbreaks (const uint16_t *s, size_t n, char *p) - +
Function: void u16_grapheme_breaks (const uint16_t *s, size_t n, char *p) +
-
Function: void u32_wordbreaks (const uint32_t *s, size_t n, char *p) - +
Function: void u32_grapheme_breaks (const uint32_t *s, size_t n, char *p) +
-
Function: void ulc_wordbreaks (const char *s, size_t n, char *p) - +
Function: void ulc_grapheme_breaks (const char *s, size_t n, char *p) +
-

Determines the word break points in s, an array of n units, and -stores the result at p[0..n-1]. +

Determines the grapheme cluster break points in s, an array of +n units, and stores the result at p[0..n-1].

p[i] = 1
-

means that there is a word boundary between s[i-1] and -s[i]. +

means that there is a grapheme cluster boundary between +s[i-1] and s[i].

p[i] = 0
-

means that s[i-1] and s[i] must not be separated. +

means that s[i-1] and s[i] are part of the +same grapheme cluster.

-

p[0] is always set to 0. If an application wants to consider a -word break to be present at the beginning of the string (before -s[0]) or at the end of the string (after -s[0..n-1]), it has to treat these cases explicitly. +

p[0] is always set to 1, because there is always a +grapheme cluster break at the start of text.
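
To make the meaning of the result array concrete, here is a small sketch of how a caller might consume it once one of the *_grapheme_breaks functions has filled it in. Both helpers are hypothetical and are not part of libunistring; they rely only on the convention stated above, namely that p[i] = 1 marks the first unit of a cluster and p[i] = 0 marks a unit that continues the current one.

```c
#include <stddef.h>

/* Hypothetical helper: counts the clusters described by a break array p
   of length n. Every p[i] == 1 marks the first unit of a cluster. */
static size_t clusters_in_break_array(const char *p, size_t n)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++)
        if (p[i] == 1)
            count++;
    return count;
}

/* Hypothetical helper: returns the length, in units, of the cluster that
   starts at index i. The caller must pass an index with p[i] == 1. */
static size_t cluster_length_at(const char *p, size_t n, size_t i)
{
    size_t len = 1;
    while (i + len < n && p[i + len] == 0)
        len++;
    return len;
}
```

For instance, a break array of 1 0 0 1 describes two clusters, the first spanning three units and the second spanning one.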


- - -

10.2 Word break property

- -

This is a more low-level API. The word break property is a property defined -in Unicode Standard Annex #29, section “Word Boundaries”, see -http://www.unicode.org/reports/tr29/#Word_Boundaries. It is -used for determining the word breaks in a string. + + +

10.2 Grapheme cluster break property

+ +

This is a more low-level API. The grapheme cluster break property is a +property defined in Unicode Standard Annex #29, section “Grapheme Cluster +Boundaries”, see +http://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries. +It is used for determining the grapheme cluster breaks in a string.

-

The following are the possible values of the word break property. More values -may be added in the future. +

The following are the possible values of the grapheme cluster break +property. More values may be added in the future.

-
Constant: int WBP_OTHER - -
-
Constant: int WBP_CR - +
Constant: int GBP_OTHER +
-
Constant: int WBP_LF - +
Constant: int GBP_CR +
-
Constant: int WBP_NEWLINE - +
Constant: int GBP_LF +
-
Constant: int WBP_EXTEND - +
Constant: int GBP_CONTROL +
-
Constant: int WBP_FORMAT - +
Constant: int GBP_EXTEND +
-
Constant: int WBP_KATAKANA - +
Constant: int GBP_PREPEND +
-
Constant: int WBP_ALETTER - +
Constant: int GBP_SPACINGMARK +
-
Constant: int WBP_MIDNUMLET - +
Constant: int GBP_L +
-
Constant: int WBP_MIDLETTER - +
Constant: int GBP_V +
-
Constant: int WBP_MIDNUM - +
Constant: int GBP_T +
-
Constant: int WBP_NUMERIC - +
Constant: int GBP_LV +
-
Constant: int WBP_EXTENDNUMLET - +
Constant: int GBP_LVT +
-

The following function looks up the word break property of a character. +

The following function looks up the grapheme cluster break property of a +character. +

+
+
Function: int uc_graphemeclusterbreak_property (ucs4_t uc) + +
+

Returns the Grapheme_Cluster_Break property of a Unicode character. +
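
As an illustration of what such a property lookup classifies, the sketch below assigns the Hangul-related values (L, V, T, LV, LVT) by code point range. It is a partial, hypothetical classifier and not the libunistring implementation: the enum and function names are invented so as not to clash with the GBP_* constants, only the main Hangul Jamo block and the precomposed syllable block are covered, and every other character is reported as "other".

```c
#include <stdint.h>

/* Hypothetical names, chosen so as not to clash with the GBP_* constants. */
enum toy_hangul_class { TOY_OTHER, TOY_L, TOY_V, TOY_T, TOY_LV, TOY_LVT };

/* Partial classifier: covers U+1100..U+11FF (Hangul Jamo) and
   U+AC00..U+D7A3 (precomposed syllables) only. */
static enum toy_hangul_class toy_classify_hangul(uint32_t uc)
{
    if (uc >= 0x1100 && uc <= 0x115F) return TOY_L;   /* leading consonants */
    if (uc >= 0x1160 && uc <= 0x11A7) return TOY_V;   /* vowels */
    if (uc >= 0x11A8 && uc <= 0x11FF) return TOY_T;   /* trailing consonants */
    if (uc >= 0xAC00 && uc <= 0xD7A3)
        /* Each leading/vowel pair has 28 syllables; the first of each
           group of 28 has no trailing consonant, hence LV. */
        return (uc - 0xAC00) % 28 == 0 ? TOY_LV : TOY_LVT;
    return TOY_OTHER;
}
```

In real code, uc_graphemeclusterbreak_property would be called instead and its result compared against the GBP_* constants listed above.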

+ +

The following function determines whether there is a grapheme cluster +break between two Unicode characters. It is the primitive upon which +the higher-level functions in the previous section are directly based.

-
Function: int uc_wordbreak_property (ucs4_t uc) - +
Function: bool uc_is_grapheme_break (ucs4_t a, ucs4_t b) +
-

Returns the Word_Break property of a Unicode character. +

Returns true if there is a grapheme cluster boundary between Unicode +characters a and b. +

+

There is always a grapheme cluster break at the start or end of text. +You can specify zero for a or b to indicate start of text or end +of text, respectively. +

+

This implements the extended (not legacy) grapheme cluster rules +described in the Unicode standard, because the standard says that they +are preferred.
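
The relationship between the pairwise test and the break array of the previous section can be sketched as follows. This is an illustration under stated assumptions, not the library's code: toy_is_break is a hypothetical predicate with the same shape as uc_is_grapheme_break (zero stands for start or end of text) that only keeps U+0300..U+036F attached to the preceding character, and toy_fill_breaks is a hypothetical counterpart of u32_grapheme_breaks.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical pairwise predicate with the same shape as
   uc_is_grapheme_break: zero stands for start or end of text.
   Assumption: only U+0300..U+036F attach to the preceding character. */
static bool toy_is_break(uint32_t a, uint32_t b)
{
    if (a == 0 || b == 0)
        return true;                       /* boundary at start/end of text */
    return !(b >= 0x0300 && b <= 0x036F);  /* no break before a combining mark */
}

/* Fills p[0..n-1] so that p[i] == 1 exactly when a cluster starts at s[i].
   The first character is compared against 0, i.e. start of text, which is
   why p[0] always comes out as 1. */
static void toy_fill_breaks(const uint32_t *s, size_t n, char *p)
{
    for (size_t i = 0; i < n; i++) {
        uint32_t prev = (i == 0) ? 0 : s[i - 1];
        p[i] = toy_is_break(prev, s[i]) ? 1 : 0;
    }
}
```

One consequence of using zero as the start-of-text and end-of-text marker is that this sketch cannot tell an actual U+0000 in the input apart from a text edge.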


@@ -178,12 +246,12 @@ may be added in the future.

-This document was generated by Bruno Haible on March, 30 2010 using texi2html 1.78a.
+This document was generated by Daiki Ueno on July, 8 2015 using texi2html 1.78a.