In a database, a collation is a set of rules that determine how data is sorted and compared, usually applied to how text data is sorted in different languages for the purpose of indexing text data and for making comparisons between text values. Different languages have different character sets and ordering. A collation allows character data for a given language to be sorted using rules that define the correct character sequence, with options for specifying case-sensitivity, accent marks, kana character types, use of symbols or punctuation, character width, and word sorting.
Collations are important anywhere text is sorted or compared and in many other settings as well. Using US-English standard settings or other defaults can get such operations wrong when the language is other than US-style English. Manifold therefore provides support for different collations in many settings, such as index on text fields for tables, within transform templates that operate on text fields, and when working with DBMS packages.
Manifold can specify a desired collation when creating an index on a table, so using of that index will result in correct ordering of text values and correct comparisons between text values. Manifold can connect to external databases and use indexes within those databases correctly by supporting collations such databases may use.
For the broadest support of modern DBMS and other applications that use the ICU library, run Manifold on Windows 10 or more recent Windows operating systems. Windows 10 and subsequent have extensive support for ICU collations whereas prior Windows editions do not. Manifold automatically takes advantage of the extensive support for ICU within Windows 10 and more recent Windows editions.
The Windows operating system provides a built-in collection of collations that programs, such as Manifold, which utilize Windows facilities can count on being available all of the time. Manifold refers to such collations that are built into Windows simply as collations.
We call a collation external when it is not implemented by Windows. Prior editions of Manifold once had only those collations implemented by Windows, but current editions have those and other collations as well, for example, custom collations used by PostgreSQL, MySQL, or other DBMS packages that have been re-implemented using Manifold code to support how those DBMS packages use collations. In addition, Manifold supports collations used within ICU, a library for all things Unicode developed and maintained by IBM, by using the massive collection of ICU collations included within Windows 10. ICU is used by many, if not all, significant applications that support Unicode, including Chrome.
Manifold refers to collations not built into Windows as external for three reasons:
We can think of external collations as not always guaranteed to be available. Normally, we would use them with data sources that already use them as a part of the database, so if a project can connect to the database the collation will be there as well. But when it comes to saving data persistently in .map project files, it is better to use a collation implemented within Windows.
When running on Windows 10, the collection of ICU collations embedded into Windows 10 is so huge that it is a safe bet any collation we may run across will be supported by Windows 10 and thus, by Manifold. In contrast, when running on an older version of Windows, such as Windows 7, the set of supported collations is so small that when connecting to relatively newer DBMS, sorts and other operations in the database which depend on full support for ICU will probably fail.
Databases very often use text values as unique identifiers, so being able to use indexes on text fields is very important for performance. That can cause a variety of problems, including:
Complex rules: Indexes on text fields often use complex rules to tell whether text values are equal to each other and how they are ordered relative to each other. This happens because default settings on many databases are set to use a specific linguistic collation, whether English or not.
Unicode a work in progress: Although support for Unicode has been gradually growing, many databases are only fully starting to support Unicode. For many databases, even famous ones, some changes to fully support Unicode are currently in development branches and are not in the released versions of the database. Before current efforts to support Unicode, various databases often implemented linguistic collations for various languages using custom code.
Manifold's ability to launch queries that partially execute within Manifold and partially within the external database requires Manifold to simultaneously handle external collations and collations specified within Windows.
When creating a data source on an external database, for example, in PostgreSQL or Oracle or some other DBMS, Manifold creates a virtual System Data folder that hosts two to four virtual tables, created on the fly for housekeeping. These tables appear to be part of the data source, but do not exist within the data source and instead are created on the fly for our convenience as if they were in the data source. They allow treating various external databases as if they were native Manifold components.
An mfd_meta and mfd_root table are always in such a virtual System Data folder. Usually an mfd_srid table listing coordinate systems as spatial reference identifiers also appears, if the database uses SRIDs. In addition, if the database supports collations an mfd_collate table appears, listing all the collations available in the database.
The illustration above shows a typical mfd_collate table, from a data source created on a gisdb database within PostgreSQL as enhanced with PostGIS. We can Copy the value of such a collation and then Paste it into the Custom tab of the Collations dialog (see below) for a database to define a Custom collation.
See the discussion in the DBMS Data Sources - Notes topic.
Transform templates in the Transform pane allow specifying the collation used for the language.
The Collation choice shows the current collation in use, with neutral meaning whatever is the default collation, that is, whatever is the current language set by default for the Windows system in use. By default, Manifold uses the neutral, nocase collation, using the neutral language in a case insensitive way for searches. To turn on case sensitivity in searches, we can choose the neutral collation.
Click the collation picker button.
Choose More... in the drop down menu.
In the Collation dialog, click the desired collation.
Check boxes for desired collation options.
Press OK.
Clicking the collation picker button calls up a menu that allows choosing any available collation we desire, including changing options for the current collation.
(list of favorites) |
A list provides one-click choice of any collations in the Favorites list. The neutral and neutral, nocase collations are listed by default. |
More... |
Launch the Collation dialog to choose a collation or to alter the current collation, for example, by changing options. |
Favorites |
Manage the Favorites list of collations, for example, by adding a collation to the list, deleting a favorite, or renaming a favorite. Only collations from the Standard tab (built into Manifold) may be added as favorites. |
Click the More... command in the drop down menu to launch the Collation dialog.
The Standard tab always appears. Use neutral for the default collation language, or the name of a collation taken from a roster of over 150 supported collations, similar to those used by PostgreSQL. Collations are identified by names such as en-US, en-GB and zh-CN. The only option allowed for neutral is Ignore case. Choosing a collation other than neutral will enable other options, even in cases where the option, such as Ignore kana type in the case of Italian languages, will not be used.
Standard tab |
The Standard tab always appears, listing collations built into Manifold. |
Database tab |
The Database tab appears for tables hosted within some DBMS data sources, listing collations available in the database. The collations listed are those listed in the mfd_collate table |
Custom tab |
The Custom tab appears for tables hosted within some DBMS data sources. We can Copy collations definitions from the mfd_collate table of a database and Paste into the Custom tab, and then modify them to add a custom collation. |
(Filter Box) |
Our best friend when sifting through long lists. Enter text, such as French into the filter box and only those collations which include that text in their names will be displayed. |
(list pane) |
Click on a collation in the list to choose it. |
Ignore case |
Ignore case, for case-insensitive searches and sort order. The default when choosing a new collation is case-sensitive sort order. |
Ignore accent |
Ignore diacritical marks, accents, and other similar characters, referred to as non-space characters in Unicode jargon. The default is to use such characters. |
Ignore symbols |
Ignore symbols and punctuation. The default is to utilize symbols and punctuation. |
Ignore kana type |
Ignore kana type for Asian languages. The default is to utilize the kana type, so that two instances of the same character in, say, hiragana and katakana, are considered to be different when sorting. |
Ignore width |
Ignore character widths for Asian languages. The default is to utilize width, so that two instances of the same character which only vary in width are considered to be different when sorting. |
Word sort |
Ignore space between words, that is, sort only on the characters that make up words and not also the spaces between words. The default is that space between words matters when sorting, so that words are not treated distinctly but only the full content of the string, words and spaces both. |
OK |
Choose the specified collation with given options. |
Cancel |
Exit the dialog without making any changes. |
Collations that have been added to the Favorites collection are automatically listed in the drop down menu for the collations picker button. By default, two collations are listed as favorites: the neutral collation and neutral, nocase, meaning to use the neutral collation but ignoring case.
Called by the Favorites choice in the collation picker button drop down menu, the Favorites dialog for collations allows building and editing a list of frequently used collations. Any favorites we add will appear in the pull down lists of favorites in menus to pick collations, allowing a one-click choice of a favorite. Only collations from the Standard tab of the Collations dialog (collations built into Manifold) can appear as favorites.
Add - Add a new favorite. Launches the Collation dialog. |
|
Add Defaults - Adds factory default favorite collations to the current list. Use this to restore factory settings if any of the original default favorites have been deleted. |
|
Ctrl-click |
Ctrl-click a favorite to select it or to de-select it. |
Ctrl-A |
Select all favorites. |
Ctrl-I |
Invert the selection. A quick way to select none is to Ctrl-A and then Ctrl-I. |
Move to Top - Moves the selected entries to the top of the list. The entry must be selected for this button to be enabled. |
|
Move Up - Moves the selected entries up one position in the list. The entry must be selected for this button to be enabled. |
|
Move Down - Moves the selected entries down one position in the list. The entry must be selected for this button to be enabled. |
|
Move to Bottom - Moves the selected entries to the bottom of the list. The entry must be selected for this button to be enabled. |
|
Delete - Remove the selected entries from the list. The entry must be selected for this button to be enabled. |
|
Add to Favorites |
Enabled when the lower pane contains a collation and that collation is selected (shown with red selection background color). Ctrl-click collation to select it. Pressing the Add to Favorites button will add it to the Favorites list.
Choosing a collation through other means and then launching the Favorites dialog will populate the lower pane with that data source. |
We can add a new collation to the Favorites list of collations through any Transform template that provides a collation picker button. In this example we have a component open, such as a table, with a text field in it. We choose the Transform pane and choose a text field as the target, in this case a field called Temp_text. We choose the Case : lower operation in the Transform - Text: Case template.
We click the collation picker button and choose Favorites in the drop down menu to launch the Favorites dialog.
We click the Add button, which launches the Collation dialog to allow us to choose a collation.
We choose the Italian (Italy) collation, which uses a collation identification of it-IT, and press OK. In this example we do not choose any of the options, such as to Ignore case.
The new collation appears in the Favorites list, already selected in case we want to move it up or down in the list of favorites. We press OK.
The next time we click the collation picker button, the new it-IT choice appears in the one-click list of favorites.
If we have already chosen a collation we can add it to the list of favorites. In the example below we first choose a new collation, and then we add it to the list of favorites.
We click the collation picker button and choose More... in the drop down menu to launch the Collation dialog.
In the Collation dialog we choose the Malay (Malaysia) collation identified with ms_MY and press OK.
The new collation appears as the specified collation in the Language parameter. Now that we have specified a new collation, suppose we decide to save it in the Favorites list? That is easy to do.
We click the collation picker button and choose Favorites in the drop down menu to launch the Favorites dialog.
The lower pane of the Favorites dialog shows the collation currently specified, in this case, the ms-MY collation.
To add it to the Favorites list we Ctrl-click it to select it, and then we press the Add to Favorites button.
It appears in the Favorites list. We press OK.
The next time we click the collation picker button, the new ms-MY choice appears in the one-click list of favorites. If we would like to go back to neutral, nocase, we simply click that choice.
Infrastructure for external collations - Given the varying nature of external collations in different databases, achieving good support for external collations requires significant infrastructure within Manifold. That infrastructure includes:
A type system that can support external collations for character data.
Explicit documentation of external collations: Database dataports will list available external collations in a system mfd_collate table within the data source hierarchy. Each collation has a unique name and a definition that can be used in a query. Unsupported collations have NULL definitions.
Management of issues that can arise when external collations change from the time a project that links external data has been saved to the time it is next opened. Loading a MAP file with indexes with external collations that cannot load will not stop the entire MAP file from loading. Instead, the affected index will be inoperable (reads via the index fail) and the table will be read-only. The index that failed to load reports the failure in the log window and can be dropped cleanly. If all indexes that failed to load for a particular table are dropped, the table becomes writable automatically.
Implementation of custom collations currently used by the more common and more popular DBMS products See the DBMS Data Sources - Notes topic.
Invalid collations are shown in red - If we create a situation where an invalid collation is referenced, the collation will be shown in red text in the collation picker.
ANSI collates are limited to the neutral language - ANSI collates are limited to the neutral language, because we cannot rely on a linguistic meaning of characters that changes between machines. Some discussion on what follows from that restriction:
If field f is ANSI, then ... ORDER BY f COLLATE 'fr-FR' automatically uses the neutral language. with no error raised, COLLATE 'fr-FR' is accepted and then replaced with 'neutral' at runtime. There are two reasons for why there is no error raised.
First, for compatibility with legacy MAP files. Manifold previously allowed ANSI fields in indexes to use arbitrary collates, and that information is already stored in existing MAP files. In some cases, therefore, Manifold should not raise any errors, and these cases are important.
Second, query operations on text fields frequently convert ANSI text to Unicode. Even if Manifold raised an error on an incompatible collate for ... ORDER BY f, with f being a field, Manifold should not raise an error for ... ORDER BY f & 'x' or for ... ORDER BY Coalesce(f, ''), because in both of these cases the expressions are Unicode. Why then raise an error on ... ORDER BY f ?
The second point above is actually a feature. If we have an ANSI field and still want to use a linguistic collate on it, we can convert it to Unicode dynamically: ... ORDER BY CAST(f AS NVARCHAR) COLLATE 'fr-FR' will use 'fr-FR'.
Title case in the 'neutral' language - Manifold builds in the past implemented casing for the neutral language using linguistic rules for en-US. Current builds now treat the neutral language differently, using the following rules: All uninterrupted sequences of letter + digit + apostrophe are treated as separate words. The first letter in each word is converted to upper case. All other letters are converted to lower case. Digits and apostrophes are left unchanged.
Linguistic rules for en-US are somewhat similar to the above, but they also handle abbreviations. For example USA in title case remains USA because all letters are upper case.
Log examples, showing the function and the result:
> ? StringToTitleCase('mary')
nvarchar: Mary
> ? StringToTitleCase('anDRew')
nvarchar: Andrew
> ? StringToTitleCase('abc123')
nvarchar: Abc123
> ? StringToTitleCase('123abc')
nvarchar: 123Abc
> ? StringToTitleCase('o\'hara')
nvarchar: O'hara --- why not O'Hara? Consider the en-US case...
> ? StringToTitleCaseCollate('o\'hara', CollateCode('en-US'))
nvarchar: O'hara --- Same result
Command Window - Query Builder
Example: Create a Table with a Constraint - Create a simple table that includes a simple constraint upon one of its fields.
Adding an Index to a Table - A basic topic on adding an index using the built-in capabilities of the mfd_id field.
Example: Add a Spatial Index to a Table - A typical use of an index is to provide a spatial index on a geom field in a table, so the geom data can be visualized in a drawing. This topic provides the step by step procedure for adding a spatial index.
Example: Add a UUID-based Index to a Table - Create a new computed field that is filled with UUID values on creation and then create an index on that field. This technique creates an indexed field that has guaranteed unique values for all records and thus the indexed field and record values may be used in other projects.