From a9a31b1de5776a3b08a82101a4fa711294f0dd1d Mon Sep 17 00:00:00 2001
From: "Manuel A. Fernandez Montecelo"
Date: Fri, 27 May 2016 14:28:30 +0100
Subject: Imported Upstream version 0.9.6+really0.9.3

---
 doc/libunistring_10.html | 228 +++++++++++++++++------------------------------
 1 file changed, 80 insertions(+), 148 deletions(-)

(limited to 'doc/libunistring_10.html')

diff --git a/doc/libunistring_10.html b/doc/libunistring_10.html
index 3f4f5dac..617406d6 100644
--- a/doc/libunistring_10.html
+++ b/doc/libunistring_10.html
@@ -1,6 +1,6 @@
-
+
-GNU libunistring: 10. Grapheme cluster breaks in strings <unigbrk.h>
+GNU libunistring: 10. Word breaks in strings <uniwbrk.h>
-
-
+
+
@@ -42,8 +42,8 @@ ul.toc {list-style: none}
-
-
+
+
@@ -51,194 +51,126 @@ ul.toc {list-style: none}
-
+
[ << ][ >> ]
[ << ][ >> ]         [Top] [Contents][Index][Index] [ ? ]

- - -

10. Grapheme cluster breaks in strings <unigbrk.h>

+ + +

10. Word breaks in strings <uniwbrk.h>

This include file declares functions for determining where in a string
-“grapheme clusters” start and end. A “grapheme cluster” is an
-approximation to a user-perceived character, which sometimes
-corresponds to multiple Unicode characters. Editing operations such as
-mouse selection, cursor movement, and backspacing often operate on
-grapheme clusters as units, not on individual characters.
-

-

Some grapheme clusters are built from a base character and a combining
-character. The letter ‘é’,
-for example, is most commonly represented in Unicode as a single
-character U+00E8 LATIN SMALL LETTER E WITH ACUTE. It is,
-however, equally valid to use the pair of characters U+0065 LATIN
-SMALL LETTER E followed by U+0301 COMBINING ACUTE ACCENT. Since
-the user would perceive this pair of characters as a single character,
-they would be grouped into a single grapheme cluster.
-

-

But there are also grapheme clusters that consist of several base characters.
-For example, a Devanagari letter and a Devanagari vowel sign that follows it
-may form a grapheme cluster. Similarly, some pairs of Thai characters and
-Hangul syllables (formed by two or three Hangul characters) are grapheme
-clusters.
+“words” start and end. Here “words” are not necessarily the same as
+entities that can be looked up in dictionaries, but rather groups of
+consecutive characters that should not be split by text processing
+operations.


- - -

10.1 Grapheme cluster breaks in a string

- -

The following functions find a single boundary between grapheme -clusters in a string. -

-
-
Function: void u8_grapheme_next (const uint8_t *s, const uint8_t *end) - -
-
Function: void u16_grapheme_next (const uint16_t *s, const uint16_t *end) - -
-
Function: void u32_grapheme_next (const uint32_t *s, const uint32_t *end) - -
-

Returns the start of the next grapheme cluster following s, -or end if no grapheme cluster break is encountered before it. -Returns NULL if and only if s == end. -

- -
-
Function: void u8_grapheme_prev (const uint8_t *s, const uint8_t *start) - -
-
Function: void u16_grapheme_prev (const uint16_t *s, const uint16_t *start) - -
-
Function: void u32_grapheme_prev (const uint32_t *s, const uint32_t *start) - -
-

Returns the start of the grapheme cluster preceding s, or -start if no grapheme cluster break is encountered before it. -Returns NULL if and only if s == start. -

+ + +

10.1 Word breaks in a string

-

The following functions determine all of the grapheme cluster -boundaries in a string. +

The following functions determine the word breaks in a string.

-
Function: void u8_grapheme_breaks (const uint8_t *s, size_t n, char *p) - +
Function: void u8_wordbreaks (const uint8_t *s, size_t n, char *p) +
-
Function: void u16_grapheme_breaks (const uint16_t *s, size_t n, char *p) - +
Function: void u16_wordbreaks (const uint16_t *s, size_t n, char *p) +
-
Function: void u32_grapheme_breaks (const uint32_t *s, size_t n, char *p) - +
Function: void u32_wordbreaks (const uint32_t *s, size_t n, char *p) +
-
Function: void ulc_grapheme_breaks (const char *s, size_t n, char *p) - +
Function: void ulc_wordbreaks (const char *s, size_t n, char *p) +
-

Determines the grapheme cluster break points in s, an array of -n units, and stores the result at p[0..n-1]. +

Determines the word break points in s, an array of n units, and +stores the result at p[0..n-1].

p[i] = 1
-

means that there is a grapheme cluster boundary between -s[i-1] and s[i]. +

means that there is a word boundary between s[i-1] and +s[i].

p[i] = 0
-

means that s[i-1] and s[i] are part of the -same grapheme cluster. +

means that s[i-1] and s[i] must not be separated.

-

p[0] is always set to 1, because there is always a -grapheme cluster break at start of text. +

p[0] is always set to 0. If an application wants to consider a
+word break to be present at the beginning of the string (before
+s[0]) or at the end of the string (after
+s[0..n-1]), it has to treat these cases explicitly.


- - -

10.2 Grapheme cluster break property

- -

This is a more low-level API. The grapheme cluster break property is a
-property defined in Unicode Standard Annex #29, section “Grapheme Cluster
-Boundaries”, see
-http://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries.
-It is used for determining the grapheme cluster breaks in a string.
+
+
+

10.2 Word break property

+ +

This is a more low-level API. The word break property is a property defined
+in Unicode Standard Annex #29, section “Word Boundaries”, see
+http://www.unicode.org/reports/tr29/#Word_Boundaries. It is
+used for determining the word breaks in a string.

-

The following are the possible values of the grapheme cluster break -property. More values may be added in the future. +

The following are the possible values of the word break property. More values +may be added in the future.

-
Constant: int GBP_OTHER - +
Constant: int WBP_OTHER +
-
Constant: int GBP_CR - +
Constant: int WBP_CR +
-
Constant: int GBP_LF - +
Constant: int WBP_LF +
-
Constant: int GBP_CONTROL - +
Constant: int WBP_NEWLINE +
-
Constant: int GBP_EXTEND - +
Constant: int WBP_EXTEND +
-
Constant: int GBP_PREPEND - +
Constant: int WBP_FORMAT +
-
Constant: int GBP_SPACINGMARK - +
Constant: int WBP_KATAKANA +
-
Constant: int GBP_L - +
Constant: int WBP_ALETTER +
-
Constant: int GBP_V - +
Constant: int WBP_MIDNUMLET +
-
Constant: int GBP_T - +
Constant: int WBP_MIDLETTER +
-
Constant: int GBP_LV - +
Constant: int WBP_MIDNUM +
-
Constant: int GBP_LVT - +
Constant: int WBP_NUMERIC +
-
- -

The following function looks up the grapheme cluster break property of a -character. -

-
-
Function: int uc_graphemeclusterbreak_property (ucs4_t uc) - +
Constant: int WBP_EXTENDNUMLET +
-

Returns the Grapheme_Cluster_Break property of a Unicode character. -

+ -

The following function determines whether there is a grapheme cluster
-break between two Unicode characters. It is the primitive upon which
-the higher-level functions in the previous section are directly based.
+

The following function looks up the word break property of a character.

-
Function: bool uc_is_grapheme_break (ucs4_t a, ucs4_t b) - +
Function: int uc_wordbreak_property (ucs4_t uc) +
-

Returns true if there is an grapheme cluster boundary between Unicode -characters a and b. -

-

There is always a grapheme cluster break at the start or end of text.
-You can specify zero for a or b to indicate start of text or end
-of text, respectively.
-

-

This implements the extended (not legacy) grapheme cluster rules
-described in the Unicode standard, because the standard says that they
-are preferred.
+

Returns the Word_Break property of a Unicode character.


- - + + @@ -246,12 +178,12 @@ are preferred. - +
[ << ][ >> ]
[ << ][ >> ]         [Top] [Contents][Index][Index] [ ? ]

- This document was generated by Daiki Ueno on July, 8 2015 using texi2html 1.78a.
+ This document was generated by Bruno Haible on March, 30 2010 using texi2html 1.78a.
-- cgit v1.2.3