Example: Get Tokens From a String

We can use regular expressions to extract tokens from strings using the Transform pane's Reduce template, reducing to a regular expression occurence.   

 

A token in a string is a sequence of one or more characters delimited by a token separator character.  Token separator characters are often white space characters like space characters or tab characters, characters used in dates or latitude, longitude notation such as the degree ° character or the single quote ' character to mean minutes or the double quote " character to mean seconds.

 

 

Suppose we have a table like the above, which provides latitude and longitude values as plain text (nvarchar) data types using degrees, minutes and seconds notation where the degrees, minutes, and seconds values are all part of the same text string and separated by token separator characters such as a space character, a degree ° character,  a single quote ' character, or a double quote " character.   We can refer to the degrees, minutes, and seconds parts of the text string as tokens.

 

We would like to extract each of the three token values into its own field, for example, extracting the latitude degrees value of 53 into a new field called lat_deg.   That is easy to do with the Transform pane.

 

With the focus on the table window, in the Transform pane we choose the lat field and then the Reduce template

 

 

In the Reduce to box we choose regular expression occurrence.  For the regular expression Value, we enter the regular expression

 

[^\s]*\s

 

That regular expression will only pick out sequences that use any whitespace character (\s) between tokens, to start with a simple example.   The construction [^\s] means "any character except a white space character".   When followed by an asterisk * character as in [^\s]* that means "any sequence of zero or more characters other than white space characters".   All together, the expression [^\s]*\s means "any sequence of zero or more non-white space characters that is followed by a white space character".   That is exactly what we want to grab a token that is set off by a white space character from the rest of the string.

 

Tech tip: Note that we use the expression [^\s]*\s and not the expression .*\s as someone new to regular expressions might try.  The reason why is simple and explained in the The Repetition Count * Character is Greedy section of the Regular Expressions topic.

 

We use 0 for the Occurrence, meaning the first occurrence using zero-based counting.  For the Result  we enter a new field name, lat_deg, for the name of the new field to be created.    We press Preview to see what that will do.

 

 

The preview shows the regular expression will pick the first token out of the first record, but it will not pick anything out of the second record, where there are no whitespace characters.

 

If we wanted to get the second token from records using whitespace characters as token separators, we can use 1 for the Occurrence, meaning the second occurrence using zero-based counting:

 

 

Press Preview to see what that does:

 

 

Using 1 for the Occurrence matches the second token separated by white space token separator characters.   To get the third token, we use 2 for the Occurrence:

 

 

Press Preview.

 

 

That extracts the seconds value from the lat string, the third token.

 

Important: In the above example, the lat and lon strings in the table above both have a space character at the end of the string in each record that uses white space characters as separators.  We cannot see that space character in the screenshots, but it is there.   We added a space character to the end of only those records using a two step process.   First we used the Select pane to select all records that do not have a degree ° character in them, and then we used the Transform pane's Concatenate template to add a space character to only those records.

 

 

For example, to add a space character to the end of the lat field values for selected records, we would use the above settings in the Transform pane's Concatenate template.  In the Value box wehave entered a space character. It is not visible in the screenshot above, but it is there.  The arrangement above adds a space character to the end of the lat field for all select records

 

We add a space character to those records so the simple regular expression we are using will work for the last token as well as for the first two tokens.  Keep in mind the regular expression we are using is

 

[^\s]*\s

 

However, that regular expression ends with a \s white space character.   If the field does not end with a white space character, like a space character, the last token will not be picked up.  

 

If we did not want to bother with adding a space character to the end of the field, we could use a slightly different regular expression for the third and final token we want to get:

 

\s[^\s]*$

 

That regular expression matches any sequence of characters that start with a white space \s character, followed by one or more non-white space characters, followed by the end $ of the string.

 

We use that regular expression within the Transform pane's Reduce template:

 

 

Note that since the new regular expression is looking for a sequence of characters at the very end of the string, we use 0 in the Occurrence parameter to match the first such instance, since there are no additional instances after the end of the string.

 

 

Press Preview and we see that the modified expression will, indeed, pick out the final token we seek, even though there is not a space at the end of the string.

Other Token Separator Characters

We know how to extract from strings tokens that are separated by white space characters like spaces or tabs.  That is a very common task when rearranging attribute fields, an in in this case it works for records like the first one.  If the fields from which we want to extract tokens only use white space characters as token delimiters, we now know all we need to extract  tokens.

 

However, the simple regular expression we are using will not work not for the second record, where non-white space characters are used as separators within the string.  How can we add those to the regular expression?  

 

That is easy to do by modifying the regular expression we use to include the other characters that are used as separators in some records, the degree ° character,  the single quote ' character, and the  double quote " character.   The modified regular expression we will use is:

 

[^\s°'"]*(\s|°|'|")

 

To decode the above expression, [^\s°'"] means any character other than a white space character, the degree ° character,  the single quote ' character, or the  double quote " character.   Adding an asterisk * character in [^\s°'"]* adds a repretition count, meaning one or more of any character other than those listed.  

 

The expression (\s|°|'|") means a single character that is either a white space character, or the degree ° character,  or the single quote ' character, or the  double quote " character.   Therefore, the expression

 

[^\s°'"]*(\s|°|'|")

 

means any group of zero or more characters other than those we intend as separators followed by one of the separator characters.   That will match a 53 followed by a space or a 53 followed by a degree ° character or followed by a single or double quote character.

 

 

We can use that modified regular expression in the Transform pane's Reduce template, using a value of 0 in the Occurrence parameter to get the first token.  Press Preview.

 

 

The preview shows the modified regular expression now grabs the first token from the second record as well as the first record.    

 

Note that the results, if we were to press Transform to create and to populate the new lat_deg field with the results, also include the separator character, either a space character in the case of the first record or the degree ° character in the case of the second record.  If we just want the numeric digits in the result, after creating the new field we could use Trim to eliminate unwanted characters at the end of the lat_deg field.

 

In the Transform pane, we press Transform to see how that works.  

 

 

A new lat_deg field is created and is populated with the result of the Transform template.  

 

To trim the unwanted space and degree characters from the new lat_deg field we use the Transform pane, picking the lat_deg field and then the Trim template.

 

 

In the Trim template we choose start and end for the Trim parameter, and whitespace and custom for the Characters parameter.   In the Value box we enter the degree ° character,  the single quote ' character, and the  double quote " character as custom characters to trim.   

 

For the Result we leave the default choice of (same field).  Press Preview.

 

 

The preview shows how the unwanted extra space character (not visible in the screenshot, but it is there...) and the unwanted extra degree ° character will be trimmed.  Press Transform in the Transform pane to trim the characters as the preview shows:

 

 

We can repeat the use of the Reduce template, but with a 1 for the Occurrence parameter to extract the second token from the lat field.

 

 

We configure the Transform pane to put the result into a new field called lat_min, since the second token gives the minutes value for the latitude and longitude values in the table.

 

Press Preview.

 

 

The preview shows how the second token will be extracted.  If we like what we see, we press Transform to create a new lat_min field and populate it with the results.

 

We continue in this way to use the Reduce template one more time with an Occurrence of 2 to get the third token, and to put that into a new lat_sec field:

 

 

We then use the Trim template on the lat_min and lat_sec fields to get rid of unwanted extra characters (space, and the single and double quote characters) at the ends of the values.  Because the Trim template retains the values we enter when we switch fields in the Field box, it is quicker to trim a series of fields one after another.

 

 

Repeat the above for the lon field to get lon_deg, lon_min, and lon_sec fields.   Note that since characters such as the minus - sign and the decimal point . character are passed unmodified by the regular expression, the resulting values in the lat_deg and lon_deg fields will correctly have a minus sign in the text value for South latitudes and West longitudes, and the lat_sec and lon_sec values will have decimal seconds values.

Using the Results - An Application

One reason people extract tokens from strings like the lat and lon fields in the example table is to then recombine the resulting fields to create latitude and longitude fields using different formats or different data types.   For example, once we have extracted degrees, minutes, and seconds tokens from the nvarchar text fields lat and lon, that use dissimilar formats, it is very easy to create floating point, numeric latitude and longitude fields from the token fields using the Transform pane's Expression template for numeric fields.

 

To do that, using Edit - Schema we add latitude and longitude fields to the table that are type float64.    To create floating point latitude field values from the three, text, lat_deg, lat_min, and lat_sec values, in the Transform pane we choose the latitude field and then the Expression template.

 

In the Expression template we press the Edit Expression button and enter the following simple expression:

 

CASE WHEN StringContains([lat], '-')

THEN

  CAST([lat_deg] AS FLOAT64) -

  CAST([lat_min] AS FLOAT64)/60 -

  CAST([lat_sec] AS FLOAT64)/3600

ELSE

  CAST([lat_deg] AS FLOAT64) +

  CAST([lat_min] AS FLOAT64)/60 +

  CAST([lat_sec] AS FLOAT64)/3600

END

 

The initial CASE statement tests if there is a minus sign in the lat field.  We have to do that because correctly building a numeric decimal latitude value depends on whether we are dealing with a positive or negative latitude value.  If it is a negative value, we must subtract the minutes and seconds components taken from the tokens, and if it is a positive value, we add  the minutes and seconds components.

 

We use the StringContains function to test if the string contains a minus sign instead of a construction like

 

CASE WHEN (CAST([lat_deg] AS FLOAT64) < 0)

 

because there may be values in the table with zeroes for the lat string, but which are prefixed with a minus sign to indicate that the minutes and seconds values are to the South of the Equator.  

 

Extracting the degree token from such strings might end up with a -0 value, which when CAST as a floating point number (or as an INT32) might not result in a TRUE boolean condition when checked if it is "less than" zero.  In contrast, if we test for the presence of a minus sign then subtracting the numeric equivalents of the minutes and seconds tokens in the case of a "minus zero" degree token will still result in a correct, negative, decimal degree value.

 

 

We place the Result into the same field.  Press Preview to see how the expression will work.   Using Preview is always a good idea to test expressions before making big changes to many values.

 

 

The preview shows that the expression works correctly.   In the table above we have altered the lat value for the second record to put a minus - sign at the beginning of the text (as well as in the lat_deg value), to test if the CASE logic in the expression works correctly.

 

If we are happy with the preview, we press the Transform button, and the expression will populate the latitude field with the lat values converted into decimal degrees as a numeric data type.  We can then use those latitude values to create Geom values or for other purposes.

 

 

Hovering the mouse cursor over the latitude column header shows the field is a genuine, numeric, float64 field.

All in One Query

The procedure above gives a step by step procedure using the Transform pane.   An even quicker solution has been provided by SQL master Sloots in the Manifold user forum, using a query that uses two simple user-defined functions:

 

FUNCTION GetToken(@str NVARCHAR, @idx INT64) FLOAT64 AS (

  CAST(StringRegexpSubstringNth(@str, '[\-0-9.]+', @idx, 'i') AS FLOAT64)

) END;

 

FUNCTION ToDecimal(@s NVARCHAR) FLOAT64 AS (

 CASE WHEN StringContains(@s, '-') THEN

   GetToken(@s, 0) - GetToken(@s, 1) / 60 - GetToken(@s, 2) / 3600

 ELSE

   GetToken(@s, 0) + GetToken(@s, 1) / 60 + GetToken(@s, 2) / 3600

 END

) END;

 

SELECT

  lat, ToDecimal(lat) as latitude,

  lon, ToDecimal(lon) as longitude

INTO [Table 2]

FROM

 [Table];

 

There is no need to consider various possible token separator characters when instead, as the GetToken function does above, all you need is the regular expression [\-0-9.]+  to pick out groups of characters that are 0 through 9 or minus signs or decimal points.   The GetToken function also CASTs each token into float64 data type as well.  Using functions to modularize the query into simple functions that each do one part of the job results in clear, obvious SQL.

 

The result:

 

 

Getting only the characters within the token also eliminates the need to consider if the source field is terminated by a space, and also eliminates the need to later on get rid of unwanted token separator characters.  Well done, Sloots!

 

If we would like to update the table in place without writing results to a new table, that is easy to do with ALTER TABLE and UPDATE statements:

 

FUNCTION GetToken(@str NVARCHAR, @idx INT64) FLOAT64 AS (

  CAST(StringRegexpSubstringNth(@str, '[\-0-9.]+', @idx, 'i') AS FLOAT64)

) END;

 

FUNCTION ToDecimal(@s NVARCHAR) FLOAT64 AS (

 CASE WHEN StringContains(@s, '-') THEN

   GetToken(@s, 0) - GetToken(@s, 1) / 60 - GetToken(@s, 2) / 3600

 ELSE

   GetToken(@s, 0) + GetToken(@s, 1) / 60 + GetToken(@s, 2) / 3600

 END

) END;

 

ALTER TABLE [Table] (

  ADD [latitude] FLOAT64,

  ADD [longitude] FLOAT64

);

 

UPDATE [Table] SET

  [latitude] = ToDecimal(lat),

  [longitude] = ToDecimal(lon);

 

The result:

 

 

SQL sure can save a lot of time!

See Also

Regular Expressions

 

Transform - Numbers: Expression

 

Transform - Text: Concatenate

 

Transform - Text: Reduce

 

Transform - Text: Trim

 

SQL Statements

 

Example: Unique Names using Regular Expressions - We have a table with a text field that contains a list of names, separated by commas.  Some of the names are repeated.   We would like to transform the list of names into a similar list, but without any repetitions of names.   This topic shows how using a regular expression.   It shows the power and speed of a concise regular expression, and the flexibility with which regular expressions can be used in SQL queries as well as in the Transform pane.