Unicode

struct Unicode {}

Members

Static functions

ucs4ToUtf16
wchar* ucs4ToUtf16(dchar* str, glong len, glong itemsRead, glong itemsWritten)

Convert a string from UCS-4 to UTF-16. A 0 character will be added to the result after the converted text.
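As a rough illustration of what this conversion involves, the sketch below re-implements it in Python rather than calling the binding. The function name `ucs4_to_utf16` and the list-based interface are hypothetical, and the sketch does not validate its input.

```python
def ucs4_to_utf16(code_points):
    """Encode code points as UTF-16 code units, followed by a terminating 0."""
    units = []
    for cp in code_points:
        if cp < 0x10000:
            units.append(cp)
        else:
            # Characters above U+FFFF are split into a surrogate pair.
            cp -= 0x10000
            units.append(0xD800 + (cp >> 10))
            units.append(0xDC00 + (cp & 0x3FF))
    units.append(0)  # the 0 character added after the converted text
    return units


units = ucs4_to_utf16([0x41, 0x1F600])
print([hex(u) for u in units])
```

Note that one input character can produce two output code units, so the number of items written may exceed the number of items read.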

ucs4ToUtf8
string ucs4ToUtf8(dchar* str, glong len, glong itemsRead, glong itemsWritten)

Convert a string from a 32-bit fixed-width representation (UCS-4) to UTF-8. The result will be terminated with a 0 byte.

unicharBreakType
GUnicodeBreakType unicharBreakType(dchar c)

Determines the break type of @c. @c should be a Unicode character (to derive a character from UTF-8 encoded text, use g_utf8_get_char()). The break type is used to find word and line breaks ("text boundaries"). Pango implements the Unicode boundary resolution algorithms, and normally you would use a function such as pango_break() instead of caring about break types yourself.

unicharCombiningClass
int unicharCombiningClass(dchar uc)

Determines the canonical combining class of a Unicode character.
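To give a feel for the values involved, here is a small sketch that uses Python's standard `unicodedata` module as a stand-in for this function. It reads the same Unicode property but is not the binding itself.

```python
import unicodedata

acute = unicodedata.combining("\u0301")      # COMBINING ACUTE ACCENT
dot_below = unicodedata.combining("\u0323")  # COMBINING DOT BELOW
base = unicodedata.combining("a")            # ordinary base letter

print(acute, dot_below, base)
```

Base characters have class 0, while combining marks carry a non-zero class that determines their relative order in canonical ordering.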

unicharCompose
bool unicharCompose(dchar a, dchar b, dchar* ch)

Performs a single composition step of the Unicode canonical composition algorithm.

unicharDecompose
bool unicharDecompose(dchar ch, dchar* a, dchar* b)

Performs a single decomposition step of the Unicode canonical decomposition algorithm.
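The difference between a single step and a full decomposition can be sketched with Python's `unicodedata` module. The helper `decompose_step` is hypothetical and only approximates the idea; it is not the binding's implementation.

```python
import unicodedata


def decompose_step(ch):
    """Return the one-step canonical decomposition of ch, or None if there is none."""
    mapping = unicodedata.decomposition(ch)
    if not mapping or mapping.startswith("<"):
        return None  # no mapping, or a compatibility-only mapping
    return [chr(int(code, 16)) for code in mapping.split()]


# U+1EA5 LATIN SMALL LETTER A WITH CIRCUMFLEX AND ACUTE
step = decompose_step("\u1ea5")
full = unicodedata.normalize("NFD", "\u1ea5")
print(step, [hex(ord(c)) for c in full])
```

One step yields U+00E2 plus U+0301; decomposing U+00E2 again is what eventually gives the three-character full form.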

unicharDigitValue
int unicharDigitValue(dchar c)

Determines the numeric value of a character as a decimal digit.
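A minimal sketch of the idea, using Python's `unicodedata.decimal` as a stand-in. Returning -1 for non-digits is a convention chosen for this example, not something stated above.

```python
import unicodedata


def digit_value(ch):
    # -1 for "not a decimal digit" is an assumption made for this sketch.
    return unicodedata.decimal(ch, -1)


ascii_seven = digit_value("7")
arabic_three = digit_value("\u0663")  # ARABIC-INDIC DIGIT THREE
not_a_digit = digit_value("x")
print(ascii_seven, arabic_three, not_a_digit)
```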

unicharFullyDecompose
size_t unicharFullyDecompose(dchar ch, bool compat, dchar* result, size_t resultLen)

Computes the canonical or compatibility decomposition of a Unicode character. For compatibility decomposition, pass %TRUE for @compat; for canonical decomposition pass %FALSE for @compat.
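The effect of the @compat flag can be illustrated with Python's `unicodedata.normalize`, where NFD plays the role of canonical decomposition and NFKD that of compatibility decomposition. This is an analogy using the standard library, not a call into the binding.

```python
import unicodedata

ligature = "\ufb01"  # LATIN SMALL LIGATURE FI

canonical = unicodedata.normalize("NFD", ligature)   # like compat = false
compat = unicodedata.normalize("NFKD", ligature)     # like compat = true
accented = unicodedata.normalize("NFD", "\u1ea5")    # a with circumflex and acute

print(repr(canonical), repr(compat), len(accented))
```

The ligature has only a compatibility mapping, so canonical decomposition leaves it unchanged while compatibility decomposition expands it to two letters.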

unicharGetMirrorChar
bool unicharGetMirrorChar(dchar ch, dchar* mirroredCh)

In Unicode, some characters are "mirrored". This means that their images are mirrored horizontally in text that is laid out from right to left. For instance, "(" would become its mirror image, ")", in right-to-left text.

unicharGetScript
GUnicodeScript unicharGetScript(dchar ch)

Looks up the #GUnicodeScript for a particular character (as defined by Unicode Standard Annex #24). No check is made for @ch being a valid Unicode character; if you pass in an invalid character, the result is undefined.

unicharIsalnum
bool unicharIsalnum(dchar c)

Determines whether a character is alphanumeric. Given some UTF-8 text, obtain a character value with g_utf8_get_char().

unicharIsalpha
bool unicharIsalpha(dchar c)

Determines whether a character is alphabetic (i.e. a letter). Given some UTF-8 text, obtain a character value with g_utf8_get_char().

unicharIscntrl
bool unicharIscntrl(dchar c)

Determines whether a character is a control character. Given some UTF-8 text, obtain a character value with g_utf8_get_char().

unicharIsdefined
bool unicharIsdefined(dchar c)

Determines if a given character is assigned in the Unicode standard.

unicharIsdigit
bool unicharIsdigit(dchar c)

Determines whether a character is numeric (i.e. a digit). This covers ASCII 0-9 and also digits in other languages/scripts. Given some UTF-8 text, obtain a character value with g_utf8_get_char().

unicharIsgraph
bool unicharIsgraph(dchar c)

Determines whether a character is printable and not a space (returns %FALSE for control characters, format characters, and spaces). g_unichar_isprint() is similar, but returns %TRUE for spaces. Given some UTF-8 text, obtain a character value with g_utf8_get_char().

unicharIslower
bool unicharIslower(dchar c)

Determines whether a character is a lowercase letter. Given some UTF-8 text, obtain a character value with g_utf8_get_char().

unicharIsmark
bool unicharIsmark(dchar c)

Determines whether a character is a mark (non-spacing mark, combining mark, or enclosing mark in Unicode speak). Given some UTF-8 text, obtain a character value with g_utf8_get_char().

unicharIsprint
bool unicharIsprint(dchar c)

Determines whether a character is printable. Unlike g_unichar_isgraph(), returns %TRUE for spaces. Given some UTF-8 text, obtain a character value with g_utf8_get_char().

unicharIspunct
bool unicharIspunct(dchar c)

Determines whether a character is punctuation or a symbol. Given some UTF-8 text, obtain a character value with g_utf8_get_char().

unicharIsspace
bool unicharIsspace(dchar c)

Determines whether a character is a space, tab, or line separator (newline, carriage return, etc.). Given some UTF-8 text, obtain a character value with g_utf8_get_char().

unicharIstitle
bool unicharIstitle(dchar c)

Determines if a character is titlecase. Some characters in Unicode which are composites, such as the DZ digraph, have three case variants instead of just two. The titlecase form is used at the beginning of a word where only the first letter is capitalized. The titlecase form of the DZ digraph is U+01F2 LATIN CAPITAL LETTER D WITH SMALL LETTER Z.
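The three case variants of the DZ digraph can be inspected with Python's standard library, shown here as an illustration of the Unicode data rather than of the binding.

```python
import unicodedata

dz_title = "\u01f2"  # titlecase form of the DZ digraph

category = unicodedata.category(dz_title)  # general category of the titlecase form
upper = dz_title.upper()                   # uppercase variant
lower = dz_title.lower()                   # lowercase variant
title_of_lower = "\u01f3".title()          # titlecasing the lowercase variant

print(category, hex(ord(upper)), hex(ord(lower)), hex(ord(title_of_lower)))
```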

unicharIsupper
bool unicharIsupper(dchar c)

Determines if a character is uppercase.

unicharIswide
bool unicharIswide(dchar c)

Determines if a character is typically rendered in a double-width cell.

unicharIswideCjk
bool unicharIswideCjk(dchar c)

Determines if a character is typically rendered in a double-width cell under legacy East Asian locales. If a character is wide according to g_unichar_iswide(), then it is also reported wide with this function, but the converse is not necessarily true. See Unicode Standard Annex #11 for details.
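The relationship between the two width checks can be sketched with the East Asian Width property from Python's `unicodedata`. The helpers `is_wide` and `is_wide_cjk` are hypothetical, and treating "ambiguous" characters as wide in the CJK variant is an assumption made for this sketch.

```python
import unicodedata


def is_wide(ch):
    # Wide (W) and Fullwidth (F) characters occupy two cells.
    return unicodedata.east_asian_width(ch) in ("W", "F")


def is_wide_cjk(ch):
    # Assumption: legacy East Asian locales also render Ambiguous (A) as wide.
    return is_wide(ch) or unicodedata.east_asian_width(ch) == "A"


for ch in ("\u4e2d", "\u03b1", "a"):  # CJK ideograph, Greek alpha, Latin a
    print(hex(ord(ch)), is_wide(ch), is_wide_cjk(ch))
```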

unicharIsxdigit
bool unicharIsxdigit(dchar c)

Determines if a character is a hexadecimal digit.

unicharIszerowidth
bool unicharIszerowidth(dchar c)

Determines if a given character typically takes zero width when rendered. The return value is %TRUE for all non-spacing and enclosing marks (e.g., combining accents), format characters, and the zero-width space, but not for U+00AD SOFT HYPHEN.

unicharToUtf8
int unicharToUtf8(dchar c, char[] outbuf)

Converts a single character to UTF-8.
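For orientation, the byte sequence and length produced for one character can be reproduced with Python's built-in encoder. This is a stand-in for the conversion, not the binding's own buffer-based interface.

```python
euro = 0x20AC  # EURO SIGN

encoded = chr(euro).encode("utf-8")
length = len(encoded)

print(encoded.hex(), length)
```

A single character takes between one and four bytes in UTF-8, so an output buffer should be sized accordingly.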

unicharTolower
dchar unicharTolower(dchar c)

Converts a character to lower case.

unicharTotitle
dchar unicharTotitle(dchar c)

Converts a character to titlecase.

unicharToupper
dchar unicharToupper(dchar c)

Converts a character to uppercase.

unicharType
GUnicodeType unicharType(dchar c)

Classifies a Unicode character by type.

unicharValidate
bool unicharValidate(dchar ch)

Checks whether @ch is a valid Unicode character. Some possible integer values of @ch will not be valid. 0 is considered a valid character, though it's normally a string terminator.
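A minimal sketch of such a check, assuming the usual definition of a valid character as a code point up to U+10FFFF that is not a surrogate. The function name `is_valid_unichar` is hypothetical.

```python
def is_valid_unichar(cp):
    """True if cp is in the Unicode range and not a surrogate code point."""
    in_range = 0 <= cp <= 0x10FFFF
    is_surrogate = 0xD800 <= cp <= 0xDFFF
    return in_range and not is_surrogate


for cp in (0, 0x10FFFF, 0x110000, 0xD800):
    print(hex(cp), is_valid_unichar(cp))
```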

unicharXdigitValue
int unicharXdigitValue(dchar c)

Determines the numeric value of a character as a hexadecimal digit.

unicodeCanonicalDecomposition
dchar* unicodeCanonicalDecomposition(dchar ch, size_t* resultLen)

Computes the canonical decomposition of a Unicode character.

unicodeCanonicalOrdering
void unicodeCanonicalOrdering(dchar* str, size_t len)

Computes the canonical ordering of a string in-place. This rearranges decomposed characters in the string according to their combining classes. See the Unicode manual for more information.
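The rearrangement described above can be approximated as a stable sort of each run of combining marks by combining class. The sketch below uses Python's `unicodedata`; `canonical_order` is a hypothetical helper that returns a new string instead of working in place.

```python
import unicodedata


def canonical_order(text):
    """Stably sort each run of combining marks by combining class."""
    chars = list(text)
    i = 0
    while i < len(chars):
        if unicodedata.combining(chars[i]) == 0:
            i += 1
            continue
        j = i
        while j < len(chars) and unicodedata.combining(chars[j]) != 0:
            j += 1
        # sorted() is stable, so marks with equal class keep their order.
        chars[i:j] = sorted(chars[i:j], key=unicodedata.combining)
        i = j
    return "".join(chars)


# Acute accent (class 230) followed by dot below (class 220).
reordered = canonical_order("a\u0301\u0323")
print([hex(ord(c)) for c in reordered])
```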

unicodeScriptFromIso15924
GUnicodeScript unicodeScriptFromIso15924(uint iso15924)

Looks up the Unicode script for @iso15924. ISO 15924 assigns four-letter codes to scripts. For example, the code for Arabic is 'Arab'. This function accepts four letter codes encoded as a @guint32 in a big-endian fashion. That is, the code expected for Arabic is 0x41726162 (0x41 is ASCII code for 'A', 0x72 is ASCII code for 'r', etc).

unicodeScriptToIso15924
uint unicodeScriptToIso15924(GUnicodeScript script)

Looks up the ISO 15924 code for @script. ISO 15924 assigns four-letter codes to scripts. For example, the code for Arabic is 'Arab'. The four letter codes are encoded as a @guint32 by this function in a big-endian fashion. That is, the code returned for Arabic is 0x41726162 (0x41 is ASCII code for 'A', 0x72 is ASCII code for 'r', etc).
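The big-endian packing of a four-letter code can be worked through in a few lines of Python. The helper names are hypothetical and only demonstrate the integer encoding, not the script lookup itself.

```python
def iso15924_to_int(code):
    """Pack a four-letter ISO 15924 code into a big-endian 32-bit integer."""
    return int.from_bytes(code.encode("ascii"), "big")


def int_to_iso15924(value):
    """Unpack a big-endian 32-bit integer into its four-letter code."""
    return value.to_bytes(4, "big").decode("ascii")


arabic = iso15924_to_int("Arab")
print(hex(arabic), int_to_iso15924(arabic))
```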

utf16ToUcs4
dchar* utf16ToUcs4(wchar* str, glong len, glong itemsRead, glong itemsWritten)

Convert a string from UTF-16 to UCS-4. The result will be nul-terminated.

utf16ToUtf8
string utf16ToUtf8(wchar* str, glong len, glong itemsRead, glong itemsWritten)

Convert a string from UTF-16 to UTF-8. The result will be terminated with a 0 byte.

utf8Casefold
string utf8Casefold(string str, ptrdiff_t len)

Converts a string into a form that is independent of case. The result will not correspond to any particular case, but can be compared for equality or ordered with the results of calling g_utf8_casefold() on other strings.
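Python's `str.casefold` implements the same Unicode concept and can serve as an illustration of why the folded result is meant for comparison rather than display. It is a stand-in, not the binding.

```python
folded_mixed = "Stra\u00dfe".casefold()  # "Straße"
folded_upper = "STRASSE".casefold()

print(folded_mixed, folded_upper, folded_mixed == folded_upper)
```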

utf8Collate
int utf8Collate(string str1, string str2)

Compares two strings for ordering using the linguistically correct rules for the current locale. When sorting a large number of strings, it will be significantly faster to obtain collation keys with g_utf8_collate_key() and compare the keys with strcmp() when sorting instead of sorting the original strings.

utf8CollateKey
string utf8CollateKey(string str, ptrdiff_t len)

Converts a string into a collation key that can be compared with other collation keys produced by the same function using strcmp().
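The key-based approach can be sketched with Python's `locale` module, where `locale.strxfrm` plays the role of the collation-key function and `locale.strcoll` the role of direct comparison. This is an analogy under the process's current locale, not the binding's behavior, and the sample names are made up.

```python
import functools
import locale

names = ["pear", "Apple", "banana", "apple"]

# Sort once by precomputed keys, and once by pairwise locale comparison.
by_key = sorted(names, key=locale.strxfrm)
by_compare = sorted(names, key=functools.cmp_to_key(locale.strcoll))

print(by_key)
```

Both orderings should agree; the key-based sort computes each key once instead of running a locale-aware comparison for every pair.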

utf8CollateKeyForFilename
string utf8CollateKeyForFilename(string str, ptrdiff_t len)

Converts a string into a collation key that can be compared with other collation keys produced by the same function using strcmp().

utf8FindNextChar
string utf8FindNextChar(string p, string end)

Finds the start of the next UTF-8 character in the string after @p.

utf8FindPrevChar
string utf8FindPrevChar(string str, string p)

Given a position @p within a UTF-8 encoded string @str, find the start of the previous UTF-8 character starting before @p. Returns %NULL if no UTF-8 characters are present in @str before @p.
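The backward scan can be sketched over a Python `bytes` object, using byte indices in place of pointers. `find_prev_char` is a hypothetical helper that relies on the fact that UTF-8 continuation bytes have the bit pattern 10xxxxxx.

```python
def find_prev_char(data, p):
    """Return the index where the character before index p starts, or None."""
    i = p - 1
    while i >= 0:
        if data[i] & 0xC0 != 0x80:  # not a continuation byte: a character starts here
            return i
        i -= 1
    return None


data = "a\u20acb".encode("utf-8")  # 'a', EURO SIGN (3 bytes), 'b'
print(find_prev_char(data, 4), find_prev_char(data, 1), find_prev_char(data, 0))
```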

utf8GetChar
dchar utf8GetChar(string p)

Converts a sequence of bytes encoded as UTF-8 to a Unicode character.

utf8GetCharValidated
dchar utf8GetCharValidated(string p, ptrdiff_t maxLen)

Convert a sequence of bytes encoded as UTF-8 to a Unicode character. This function checks for incomplete characters, for invalid characters such as characters that are out of the range of Unicode, and for overlong encodings of valid characters.
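The three failure cases can be demonstrated with Python's strict UTF-8 decoder, which rejects the same kinds of input. `get_char_validated` is a hypothetical helper, and returning None on failure is a choice made for this sketch.

```python
def get_char_validated(data):
    """Decode the first character of data, or return None if it is not valid UTF-8."""
    if not data:
        return None
    for length in range(1, 5):  # a UTF-8 character is 1 to 4 bytes long
        try:
            return data[:length].decode("utf-8")
        except UnicodeDecodeError:
            continue
    return None


print(get_char_validated(b"\xe2\x82\xacx"))  # complete character
print(get_char_validated(b"\xe2\x82"))       # incomplete character
print(get_char_validated(b"\xc0\xaf"))       # overlong encoding of '/'
print(get_char_validated(b"\xf4\x90\x80\x80"))  # beyond U+10FFFF
```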

utf8MakeValid
string utf8MakeValid(string str, ptrdiff_t len)

If the provided string is valid UTF-8, return a copy of it. If not, return a copy in which bytes that could not be interpreted as valid Unicode are replaced with the Unicode replacement character (U+FFFD).
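A comparable repair is available in Python through the `errors="replace"` decoding mode, shown here as an illustration of the replacement behavior rather than as the binding's implementation.

```python
broken = b"ab\xffcd"  # 0xFF can never appear in valid UTF-8

repaired = broken.decode("utf-8", errors="replace")
untouched = b"abcd".decode("utf-8", errors="replace")

print(repr(repaired), repr(untouched))
```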

utf8Normalize
string utf8Normalize(string str, ptrdiff_t len, GNormalizeMode mode)

Converts a string into canonical form, standardizing such issues as whether a character with an accent is represented as a base character and combining accent or as a single precomposed character. The string has to be valid UTF-8, otherwise %NULL is returned. You should generally call g_utf8_normalize() before comparing two Unicode strings.
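Why normalization matters before comparison can be seen with Python's `unicodedata.normalize`, used here as a stand-in for this function.

```python
import unicodedata

decomposed = "e\u0301"   # base letter plus COMBINING ACUTE ACCENT
precomposed = "\u00e9"   # single precomposed character

equal_raw = decomposed == precomposed
equal_nfc = (unicodedata.normalize("NFC", decomposed)
             == unicodedata.normalize("NFC", precomposed))

print(equal_raw, equal_nfc)
```

The two strings render identically but only compare equal once both are normalized to the same form.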

utf8OffsetToPointer
string utf8OffsetToPointer(string str, glong offset)

Converts from an integer character offset to a pointer to a position within the string.

utf8PointerToOffset
glong utf8PointerToOffset(string str, string pos)

Converts from a pointer to a position within a string to an integer character offset.
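The two conversions are inverses of each other. The sketch below models pointers as byte indices into the UTF-8 encoding; both helper names are hypothetical and assume valid input that is split on a character boundary.

```python
def offset_to_byte_index(text, offset):
    """Character offset -> byte index into the UTF-8 encoding of text."""
    return len(text[:offset].encode("utf-8"))


def byte_index_to_offset(text, byte_index):
    """Byte index into the UTF-8 encoding of text -> character offset."""
    return len(text.encode("utf-8")[:byte_index].decode("utf-8"))


text = "a\u20acb"  # the middle character takes three bytes
print(offset_to_byte_index(text, 2), byte_index_to_offset(text, 4))
```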

utf8PrevChar
string utf8PrevChar(string p)

Finds the previous UTF-8 character in the string before @p.

utf8Strchr
string utf8Strchr(string p, ptrdiff_t len, dchar c)

Finds the leftmost occurrence of the given Unicode character in a UTF-8 encoded string, while limiting the search to @len bytes. If @len is -1, allow unbounded search.

utf8Strdown
string utf8Strdown(string str, ptrdiff_t len)

Converts all Unicode characters in the string that have a case to lowercase. The exact manner that this is done depends on the current locale, and may result in the number of characters in the string changing.

utf8Strlen
glong utf8Strlen(string p, ptrdiff_t max)

Computes the length of the string in characters, not including the terminating nul character. If the @max'th byte falls in the middle of a character, the last (partial) character is not counted.
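The handling of a byte limit that cuts a character in half can be sketched as follows. `utf8_strlen` is a hypothetical helper that assumes otherwise valid UTF-8 and uses -1 to mean "no limit".

```python
def utf8_strlen(data, max_bytes=-1):
    """Count whole characters in data, looking at no more than max_bytes bytes."""
    if max_bytes >= 0:
        data = data[:max_bytes]
    # errors="ignore" drops the trailing partial character, if any.
    return len(data.decode("utf-8", errors="ignore"))


data = "a\u20acb".encode("utf-8")  # 1 + 3 + 1 bytes
print(utf8_strlen(data), utf8_strlen(data, 4), utf8_strlen(data, 3))
```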

utf8Strncpy
string utf8Strncpy(string dest, string src, size_t n)

Like the standard C strncpy() function, but copies a given number of characters instead of a given number of bytes. The @src string must be valid UTF-8 encoded text. (Use g_utf8_validate() on all text before trying to use UTF-8 utility functions with it.)

utf8Strrchr
string utf8Strrchr(string p, ptrdiff_t len, dchar c)

Find the rightmost occurrence of the given Unicode character in a UTF-8 encoded string, while limiting the search to @len bytes. If @len is -1, allow unbounded search.

utf8Strreverse
string utf8Strreverse(string str, ptrdiff_t len)

Reverses a UTF-8 string. @str must be valid UTF-8 encoded text. (Use g_utf8_validate() on all text before trying to use UTF-8 utility functions with it.)

utf8Strup
string utf8Strup(string str, ptrdiff_t len)

Converts all Unicode characters in the string that have a case to uppercase. The exact manner that this is done depends on the current locale, and may result in the number of characters in the string increasing. (For instance, the German ess-zet will be changed to SS.)
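The ess-zet case is easy to reproduce with Python's `str.upper`, which follows the same Unicode case mappings (though without the locale sensitivity mentioned above).

```python
word = "stra\u00dfe"  # "straße"
upper = word.upper()

print(upper, len(word), len(upper))
```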

utf8Substring
string utf8Substring(string str, glong startPos, glong endPos)

Copies a substring out of a UTF-8 encoded string. The substring will contain @end_pos - @start_pos characters.

utf8ToUcs4
dchar* utf8ToUcs4(string str, glong len, glong itemsRead, glong itemsWritten)

Convert a string from UTF-8 to a 32-bit fixed width representation as UCS-4. A trailing 0 character will be added to the string after the converted text.

utf8ToUcs4Fast
dchar* utf8ToUcs4Fast(string str, glong len, glong itemsWritten)

Convert a string from UTF-8 to a 32-bit fixed width representation as UCS-4, assuming valid UTF-8 input. This function is roughly twice as fast as g_utf8_to_ucs4() but does no error checking on the input. A trailing 0 character will be added to the string after the converted text.

utf8ToUtf16
wchar* utf8ToUtf16(string str, glong len, glong itemsRead, glong itemsWritten)

Convert a string from UTF-8 to UTF-16. A 0 character will be added to the result after the converted text.

utf8Validate
bool utf8Validate(string str, string end)

Validates UTF-8 encoded text. @str is the text to validate; if @str is nul-terminated, then @max_len can be -1, otherwise @max_len should be the number of bytes to validate. If @end is non-%NULL, then the end of the valid range will be stored there (i.e. the start of the first invalid character if some bytes were invalid, or the end of the text being validated otherwise).
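The "end of the valid range" idea can be sketched with Python's strict decoder, which reports where decoding failed. `utf8_validate` is a hypothetical helper that returns a byte index where the description above speaks of a pointer.

```python
def utf8_validate(data):
    """Return (is_valid, end), where end is the byte index ending the valid range."""
    try:
        data.decode("utf-8")
        return True, len(data)
    except UnicodeDecodeError as err:
        # err.start is the index of the first byte that could not be decoded.
        return False, err.start


print(utf8_validate(b"abcd"))
print(utf8_validate(b"ab\xffcd"))
```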

Meta