COLLATE

Collations are rules that specify how to compare text values in various international languages. The COLLATE command allows us to incorporate such rules into indexes for text values. Collations can be used with both Unicode and ANSI text values.  See the Collations topic to learn how collations are managed by Manifold.

 

 

COLLATE '<collate>'

 

<collate> -  Zero or more options for specifying the collation.   No options are expressed as two single quote ' characters with nothing between them, as in

 

COLLATE ''

 

which specifies the default collation language of neutral with no other options.

Options

<name>

A short string that specifies the collation language. Collations are identified by names such as en-US, en-GB and zh-CN.   Use neutral for the default collation language, or the name of a collation taken from a roster of over 150 supported collations, similar to those used by PostgreSQL.   Collations are identified by names such as en-US, en-GB and zh-CN.   An empty name is allowed and will specify the default collation.  The only option supported by neutral is nocase: all other options for neutral are ignored. The short form of '' for the entire default collation is accepted, but 'neutral, nocase' must spell the collation language as neutral and cannot omit it or use an empty string.

noaccent

Ignore diacritical marks, accents, and other similar characters, referred to as non-space characters in Unicode jargon.  The default is to use such characters when sorting.

nocase

Ignore case.  The default is case-sensitive sort order.

nokanatype

Ignore kana type for Asian languages.  The default is to utilize the kana type, so that two instances of the same character in, say, hiragana and katakana, are considered to be different when sorting.

nosymbols

Ignore symbols and punctuation.  The default is to utilize symbols and punctuation.

nowidth

Ignore character widths for Asian languages. The default is to utilize width, so that two instances of the same character which only vary in width are considered to be different when sorting.

wordsort

Ignore space between words, that is, sort only on the characters that make up words and not also the spaces between words. The default is that space between words matters when sorting, so that words are not treated distinctly but only the full content of the string, words and spaces both.

 

Examples:

CREATE TABLE [Table] (

[name] NVARCHAR,

INDEX [name_x] BTREENULL ([name] COLLATE 'en-US, nocase, noaccent')

);

 

Specifies US English language, ignoring case and ignoring accents.

 

CREATE TABLE [Table] (

[name] NVARCHAR,

INDEX [name_x] BTREENULL ([name] COLLATE '')

);

 

Specifies neutral collation language with no other options.

 

CREATE TABLE [Table] (

[name] NVARCHAR,

INDEX [name_x] BTREENULL ([name] COLLATE 'neutral, nocase')

);

 

Specifies neutral collation language, ignoring case.

 

Using Collations

Collations can be utilized in the following circumstances:

 

 

SQL Functions using Collations

The following SQL functions utilize collations.

 

CollateCode(<collate>) : <collate-code>

The Collate function takes a string with collation definition options like that used by the COLLATE SQL command, and returns a numeric value that encodes that collation which can be used as the <collate-code> argument in functions that use collations.

 

? CollateCode('en-US, nocase, noaccent')

 

The above generates the number 934, which encodes the use of a US English collation insensitive to case and insensitive to accents.

 

CollateCodeDef(<collate-code>): <collate>)

Given a numeric code for a (non-external) collation, the CollateCodeDef function returns the collation definition string for that collation.

 

StringCompareCollate(<string>, <string>, <collate-code>) : <value>

Same as Compare but compares two strings using the specified collate code.

 

Returns -1 if the first string is less than the second string, 1 if the first string is greater than the second string and 0 if the first and second strings are the same.   

 

In the examples that follow 928 is the collation code for en-US (US English) that is case sensitive.   930 is the collation code for en-US where case does not matter.

 

? StringCompareCollate('John','john', 928)

 

returns 1, meaning the first string is greater than the second string.  

 

? StringCompareCollate('John','john', 930)

 

returns 0, meaning the two strings are the same (as they are if case does not matter).

 

StringCompareKeyCollate(<string>, <collate-code>) : <value>

Given a string and a specified collation returns a binary representation of the text using the specified collation.  The produced key can be compared to keys for other strings with the same collation, and such comparisons will yield the same result as if the comparison was done using StringCompareCollate(). However, comparing binary keys is usually much faster than comparing strings using a given collation.

 

StringContainsCollate(<string>, <substring>, <collate-code>) : <boolean>

Given a string, a substring and a collation returns True if the string contains the substring using the rules of the specified collation.   

 

In the examples that follow 928 is the collation code for en-US (US English) that is case sensitive.   930 is the collation code for en-US where case does not matter.

 

? StringContainsCollate('Long John Silver', 'john', 928)

 

returns 0 for False,  because if case matters john does not occur in the first string.

 

? StringContainsCollate('Long John Silver', 'john', 930)

 

returns 1 for True,  because if case does not matter john occurs as the second word in the first string.

 

StringEndsWithCollate(<string>, <substring>, <collate-code>) : <boolean>

Given a string, a substring and a collation returns True if the string ends with the substring using the rules of the specified collation.    See the StringContainsCollate function for similar examples.

 

StringStartsWithCollate(<string>, <substring>, <collate-code>) : <boolean>

Given a string, a substring and a collation returns True if the string starts with the substring using the rules of the specified collation.    See the StringContainsCollate function for similar examples.

 

StringFindCollate(<string>, <substring>, <collate-code>) : <position>

Given a string, a substring and a collation returns the position, beginning with 0,  in the string of the first character of the substring if the string contains the substring using the rules of the specified collation.   If the string does not contain the substring return -1.

 

In the examples that follow 928 is the collation code for en-US (US English) that is case sensitive.   930 is the collation code for en-US where case does not matter.

 

? StringFindCollate('Long John Silver','john', 928)

 

returns -1 because if case matters john does not occur in the first string.

 

? StringFindCollate('Long John Silver','john', 930)

 

returns 5 because if case does not matter the substring john begins at the sixth character in the first string.

 

StringFindNthCollate(<string>, <substring>, <occurrence>, <collate-code>) : <position>

Given a string, a substring and an <occurrence> number, n, (zero based counting, with the first <occurrence> being 0, the second being 1 and so on) and a collation, return the number of the character position in the text, also zero based, where the substring matches the nth time using the rules of the specified collation.   If not found at the specified nth position return -1.

 

In the examples that follow 930 is the collation code for en-US (US English) that is not case sensitive.

 

? StringFindNthCollate('Long John Silver','L', 1, 930)

 

returns 12 because if case does not matter the second instance (indicated by 1 in zero based counting) of the substring 'L' occurs at the thirteenth character (indicated by 12 in zero based counting).

 

? StringFindNthCollate('Long John Silver','L', 2, 930)

 

returns -1 because there is no third instance of the substring 'L'.

 

StringReplaceCollate(<string>, <source>, <target>, <collate-code>) : <string>

Given a string, a source string, a target string and a collation return a modified string with all occurrences of the source string, using the rules of the specified collation,  replaced by the target string.   If no instances of the source string are found, returns the original string.   

 

In the example that follows 930 is the collation code for en-US (US English) that is not case sensitive.

 

? StringReplaceCollate('Long John Silver','long', 'Short', 930)

 

The above expression returns:

 

nvarchar: Short John Silver

 

See the Replace : text operation in the Transform - Text: Replace topic.

 

StringReplaceNthCollate(<string>, <source>, <target>, <occurrence>, <collate-code>) : <string>

Given a string, a source string, a target string, an <occurrence> number, n, (zero based counting, with the first <occurrence> being 0, the second being 1 and so on) and a collation, return a modified string with the nth occurrence of the source string, using the rules of the specified collation,  replaced by the target string.  If the source string is not found at the specified nth position returns the original string.

 

In the examples that follow 930 is the collation code for en-US (US English) that is not case sensitive.

 

? StringReplaceNthCollate('Long John Silver','L', 'X', 1, 930)

 

The above expression returns:

 

nvarchar: Long John SiXver

 

? StringReplaceNthCollate('Long John Silver','L', 'X', 2, 930)

 

The above expression returns the original string since there is no third occurrence of the source string 'L'.

 

See the Replace : text occurrence operation in the Transform - Text: Replace topic.

 

StringStartsWithCollate(<string>, <substring>, <collate-code>) : <boolean>

Given a string, a substring and a collation returns True if the string starts with the substring using the rules of the specified collation.    See the StringContainsCollate function for similar examples.

 

StringToLowerCaseCollate(<string>, <collate-code>) : <string>

Given a string and a collation returns the same string all in lower case using the rules of the specified collation.  Ignores <useCase> option in the encoded collation since the result is all lower case.

 

See the Case : lower operation in the Transform - Text: Case topic.

 

StringToTitleCaseCollate(<string>, <collate-code>) : <string>

Given a string and a collation returns the same string in title case using the rules of the specified collation.  Ignores <useCase> option in the encoded collation since the result is title case.

 

See the Case : title operation in the Transform - Text: Case topic.

 

StringToUpperCaseCollate(<string>, <collate-code>) : <string>

Given a string and a collation returns the same string all in upper case using the rules of the specified collation.  Ignores <useCase> option in the encoded collation since the result is all upper case.

 

See the Case : upper operation in the Transform - Text: Case topic.

 

 

Notes

 

Why are indexes on ANSI text fields limited to 'neutral'? Because the definition of ANSI text in Release 9 is 'interpret single-byte characters using whatever is the current language on the current system'. If we move a .map file from a German system to a Portuguese system, what ANSI characters map to changes: what was being displayed as the letter a might now become the letter b.  Consequently, when building indexes on ANSI character data we cannot rely on the linguistic meaning of characters, because that can change.   We can only rely on what does not change, that is, on character codes. That is what the neutral language choice does

 

Manifold collations - Manifold utilizes collations differently than some other database systems.     Many database systems tie collations to individual text fields so that each field gets a collation, either the default collation or a collation explicitly specified in CREATE TABLE.   In such systems when the query engine sees a construction such as ORDER BY f, it uses the collation for the field f and orders accordingly and it resolves expressions such as f = 'abc' using the collation for f, which may or may not be case sensitive. When the engine encounters an expression such as f <= g it will fail the operation in cases where the collation for field f differs from that for field g.  The expression fails because the operator has two competing sets of rules to apply and cannot choose between them.

 

Such systems allow resolutions of such conflicts by explicitly using COLLATE to temporarily match the collation of f to g for the duration of the operator, using an expression such as f COLLATE ... <= g.   Such systems provide other means to manage conflicts that can arise when collations are tied to individual fields, for example, providing rules showing how collations are conveyed through functions, with each function having its own custom rules.  For example, a rule might specify the result of Substring(f, ...) will have the same collation as f.

 

Manifold takes a different approach and ties collations to specific constructs which do comparisons such as ORDER.  There are advantages and disadvantages to that approach.

 

The biggest advantages are that there are no conflicts and no hidden rules.  Everything specific to collations is clearly visible.   For example, we can be sure that an expression such as f = 'A' is case-sensitive.  Another big and important advantage is that we do not face a choice of either creating a way of conveying collation information both ways (both for arguments and return values) for external functions, or having to live with external functions being  different from internal functions in that external functions could not use collations.

 

The main disadvantages to Manifold's approach is that first, if we always use a specific collation with some field we will have to write that collation more than once (although we can reduce the need for repetitive work by using functions) and second, other database systems more frequently tie collations to individual fields so the Manifold approach might at first be unfamiliar.

 

Collations in Manifold use PostgreSQL style names, typically using an abbreviation for the language, such as fr for French or en for English and a two letter code for the country, such as US for the United States, FR for France and BE for Belgium.   French as spoken in Belgium has the collation name fr-BE, French as spoken in France is fr-FR, English as spoken in the US is en-US, and English as spoken in Great Britain en-GB.  MySQL locale values are similar, but using an underscore character instead of a hyphen, for constructions such as fr_BE.

 

See Also

Collations

SQL Functions

 

Transform - Text: Case

 

Transform - Text: Replace