CSV, ASC, PSV, TAB

Manifold can read Comma Separated Values (CSV)  and similar formats that represent tables within text files, where each line is a record and the values for fields are separated by a character such as a comma (CSV). Each CSV file contains one table.   A similar format uses a | "pipe" character and is often known as Pipe Separated Values, or PSV file format.    This topic shows how to acquire CSV data from CSV files.  A related topic, CSV Servers, shows how to acquire CSV data from web servers.

 

We can bring data from CSV files into Manifold in two ways:

 

 

 

The first option above, creating a new data source, provides more options when harvesting data from CSV files.   We therefore will focus on connecting to CSV files using the File - Create - New Data Source command.  If we prefer to import the data into our project we can copy the table from the CSV data source and paste it into the project.

 

Use Unicode: CSV as a text format is vulnerable to all the chaos that arises from using different character sets.  When international characters are found in a CSV data file, best results are usually obtained by opening the CSV in a text editor, such as Notepad, and saving it not using UTF-8 encoding but instead Unicode or UTF-8 with BOM encoding.   If strange characters appear in text fields in an imported CSV, try opening the original CSV file in Notepad and do a Save As to see if it is using UTF-8 encoding.  If so, switch to Unicode, save under a different filename, and then import that saved, Unicode, CSV.  If that is not an option in the version of Notepad being used, save the file under a different filename as UTF-8 with BOM encoding, and then import that UTF-8 with BOM version.

 

Connecting to a CSV file:

 

  1. Choose File - Create - New Data Source in the main menu, or right-click in the Project pane and choose New Data Source.

  2. Choose More... in the dropdown menu to launch the New Data Source dialog.

  3. Choose File: csv in the Type box.

  4. Click the [...] browse button to navigate to the desired .csv file and to open it.

  5. Default settings will usually work for most CSV files.   Press Create Data Source.

  6. A new data source appears in the project.  Open it to see the table that is linked in from the CSV file.

 

Creating a local copy of the CSV table:

 

  1. Click on the table linked in from the CSV to highlight it.

  2. Press Ctrl-C or click the Copy button in the Project pane toolbar.

  3. Click anywhere in the main part of the project outside of the new data source's hierarchy.

  4. Press Ctrl-V or click the Paste button in the Project pane toolbar.

 

Tables that are linked from CSV files are neither selectable nor editable since they do not have any indexes.  We cannot add an index to a linked CSV table since CSV format does not support indexes.  

 

Local tables that are imported from CSV or copied and pasted from a linked CSV table are fully capable, fully DBMS Manifold tables.   We can easily add a key field and index to them to enable editing and selection, using the procedure given in the Add an Index to a Table topic.

 

Adding an index to a table:

 

  1. Double-click on the table to open it.

  2. Choose Edit - Schema.

  3. In the Schema dialog click the Add Identity button.

  4. Press the Save Changes button.

 

Create a New Data Source

In the main menu choose File - Create - New Data Source.

 

 

Choose More... to launch the New Data Source dialog.

 

 

Click the browse button to choose a file.  

 

The Select File dialog is a variation of the well-known Windows File Open dialog.   By default, it shows files ending in the standard .asc, .csv, .psv, and .tab file extensions used with CSV files.   Click on the file to load it into the File name box.   If we like, we can write the name of any file manually into the File name box.   For example, if we have a CSV file that uses a .txt extension, we can put the name employees.txt into the File name box and that will be read as well.

 

Controls

 

Name

Name for the new data source, "Data Source" by default.  Specify a more  memorable name as desired.  If we forget the origin of a data source we can hover the mouse over the data source and a tool tip will provide connection information.

Type

To connect to a CSV file, choose File: csv from the very long list of data sources in the pull down menu.

Source

A connection string or path to the CSV file.  The source string can be manually entered or it can be created using a Windows dialog launched by the [...] browse button.

[...] Browse button

Click to use a standard Windows Select File dialog (very similar to the usual File Open dialog) to choose the desired CSV file.

Open as read-only

CSV files in general are not writable even if the file itself on disk is not specified read-only by Windows. All the same, Manifold creates a read/write mfd_meta file that can be used to store characteristics like the width of columns and other properties within the project file.  Checking the Open as read-only box forces the entire data source to be read-only.

First line contains field names

Set to auto detect (filled square) by default. Click this entry to toggle between auto detect (filled square), checked (check mark), and unchecked (empty square).  The first line in most CSV data gives the names of fields.  Some CSV data sources do not provide the names of fields in the first line.  If some of the field names are strange, like a single number, make sure to switch this from auto detect to checked.

Read all fields as text

The system will try to guess what fields contain text and what fields contain integers or floating point numbers.  In some cases, such as Zip codes in US postal addresses, what appears to be a number should be treated as text.  Checking this box imports all fields as text even if they are not marked as text and even if they appear to be numeric fields.

Allow multiline text values

When checked, ignore linefeed and other characters normally denoting the end of a line which occur within text, that is, within text qualifier characters.  Check this box to deal with "csv" files that include multiple paragraphs within a single text field, as often occur when CSV files are created from data originally harvested from web pages, KML or other web-oriented formats.

List delimiter

The character used within the CSV file to separate fields within a record. The default (auto) setting usually guesses correctly.   Enter any character or choose it from the list of standard characters: comma, colon, semicolon, pipe, space or tab.

Decimal separator

The character used within the CSV file to denote the decimal fractional part of a number, almost always either a period or a comma.    The default (auto) setting usually guesses correctly.   Enter any character or choose it from the list of two standard characters: period or comma.  A period is used in North America and many former UK colonies.  A comma is used in Europe, South America and many African countries.

Text qualifier

The character used within the CSV file to wrap values intended to be text.  The default (auto) setting usually guesses correctly.   Enter any character or choose it from the list of two standard characters:  a double " quote character or a single ' quote character.

Create Data Source

Create the new data source in the project pane and close the dialog.

Edit Query

Launch the Command Window loaded with a query that creates the data source using the given settings.  A great way to learn how to use SQL to create data sources.

Cancel

 Exit the dialog without doing anything.

 

 

 

About CSV Format

"Comma-separated value" format is a text file format used to convey information in tables.   Each line in the text file contains data for a single record.   Values for each field are separated by a special character, for example, a comma.   The specific character that is used to separate field values is called the list delimiter character.   Usually a comma is used (hence the name "comma" separated value files...), but other characters can be used as list delimiters, especially if the data in the file most convey text that contains commas.

 

Much of how CSV files are used is a matter of tradition that has become accepted over the centuries since the original introduction of CSV format, about the time of the domestication of the ox, and the introduction of IBM FORTRAN in 1972.    Over the years, various traditions and attempts at standardization have emerged, often trying to deal with how to represent commas in the data when the comma character itself is used as part of the format.

 

The idea behind CSV is simple.  Consider a file called employees.csv which contains the following text:

 

Last Name, Country, Height

Fuller, USA, 172

Suyama, UK, 165

Smith, USA, 185

Dodsworth, UK, 187

Martin, France, 180

Pereira, France, 169

 

By tradition, the first line in a CSV file gives the names of the fields in the table, with the name of each field being separated by a comma , character.   Based on the names of the fields and the contents of the file, we might guess that the intended table is a list of employees, giving the last name of each, their country, and the height of the employee in centimeters.   

 

 

Manifold will read our example employees.csv text file and will create a table from it that looks like the above, with the Last Name and Country fields being text data types and the Height field being an integer data type.  

Telling Text from Numbers

In the example file above it is obvious that the values for Height are intended to be numbers.   When all records for a specific field contain only the numeric characters 0, 1, 2, ..., 9 it is usually a fair guess that field is intended to be a number.  But sometimes that is not a correct guess.  For example, Zip codes for US postal addresses cannot be treated as numbers given that  leading zeros in Zip codes are significant.   If our Zip code is 02138, a Zip code in Cambridge, Massachusetts, the leading zero must be preserved.

 

To avoid the need to guess,  CSV files use long-accepted traditions for manually specifying what values are supposed to be text.  A convention almost universally respected in programs that create or consume CSV files is that any content wrapped with quote characters is to be treated as text.   We could write the employees.csv file as:

 

"Last Name","Country","Height"

"Fuller","USA",172

"Suyama","UK",165

"Smith","USA",185

"Dodsworth","UK",187

"Martin","France",180

"Pereira","France",169

 

Since of course it would be far too easy for everyone to agree on such a simple rule, some people use double quote " characters to wrap a value intended as text and other people using a single quote ' character to wrap text.  Manifold can handle either or any other character that is used.

 

Checking the Read all fields as text option tells Manifold to interpret all values as text data types, to force exclusively number content, like Zip codes, to be read as text data whether it is wrapped with quote characters or not.

Dealing with Commas

It seems obvious that whoever first decided to use commas as a list delimiter to separate field values in CSV files did not expect to use CSV files to store text values such as street addresses, since constructions such as "123 Main Street, Smallville, Ohio" obviously use commas, nor did they expect to store text that includes sentences as often appear in descriptions and comments.    That is such a painful restriction in CSV format that there are two ways to get around the restriction:

 

 

The first method is that commas used within quoted text do not count as list delimiters.   For example, if we wrote the employees.csv file as:

 

"Name","Country","Height"

"Fuller, Andrew","USA",172

"Suyama, Michael","UK",165

"Smith, Tim","USA",185

"Dodsworth, Anne","UK",187

"Martin, Xavier","France",180

"Pereira, Laurent","France",169

 

Then the resulting table would be

 

The above is obviously prone to error and confusing.    The second method is to avoid using a common character, like a comma, that is often found in text and instead to use a rare character such as a pipe | symbol.     For example, if we wrote the employees.csv file as:

 

Name|Country|Height

Fuller, Andrew|USA|172

Suyama, Michael|UK|165

Smith, Tim|USA|185

Dodsworth, Anne|UK|187

Martin, Xavier|France|180

Pereira, Laurent|France|169

 

Then the resulting table, whether or not we chose to continue using quotes to set off text, still would be

 

Using commas is such a wretched choice when saving text info in tables and the pipe character tends to lend such clarity that using the pipe character is becoming more popular.   In fact, it is so popular that Pipe separated values (PSV) files have acquired a name and a three letter extension all their own.  Super!

Commas and Periods in Numbers

If we have spent our entire lives in the US we may be surprised to learn that in many other countries people do not write pi as 3.141 but instead write it as 3,141 and by that they do not mean three thousand one hundred forty one but instead mean the same as 3.141 means in the US.   That is, they use the comma ,  character to denote the fractional part of a decimal fraction.   Likewise, someone in France might be surprised to learn that in the US they do not write pi as 3,141.  

 

Cultures in some countries, primarily in Europe, South America and Africa, utilize a comma , character to indicate a decimal point.   In those cultures, the value of pi, a magic number relating the circumference of a circle to the radius of the circle, is written approximately as 3,141.  Other countries, primarily in North America, and the UK and former British colonies, use the period . character, to write pi as approximately 3.141.    To confuse the matter further,  either the comma or the period may be used as a "thousands" separator to group digits.      

 

A CSV file that contains text such as 3,141 depending on the culture that wrote the text could be intended to mean a numeric value about three and a tenth, or a numeric value of three thousand one hundred and forty one.    Likewise, text such as 2.718 in some cultures could either mean the value of e (another magic number first named as "e" by Leonhard Euler in the early 1700s), that is, a fractional number more than two but less than three, or in other cultures 2.718 could mean two thousand seven hundred eighteen.

 

Given the popularization of CSV by American companies such as IBM, the use of a comma character to separate field values made European-style use of the comma character as a decimal separator infeasible in CSV.      The use of a comma character to separate field values was also a painful restriction, in that it prevented a uniform way of emerging to use commas in field names, such as Population, 1990 or in text values such as complete addresses, as in 123 Main Street, Richmond, Virginia.   

 

Almost immediately, therefore, Europeans and other hip people created variations on CSV format that used other characters to separate field values.  Popular choices over the years have been the space character (obviously a limiting choice given the tendency of text to use space characters), the tab character, a semicolon ; character, a colon : character and, most recently, the pipe | character.    Using pipes to separate field values has become so popular that such files even have a name of their own: pipe-separated value  files, using the .psv three-letter extension.

 

Manifold will auto-recognize the above popular choices in most circumstances, or the desired separator can be specified.

Other Characters used as List Delimiters

We have touched on using the pipe character as an alternative to commas.   Other characters are also popular.   Older CSV files will often use characters that are not commas but which are also frequently found in ordinary text, such as colons, semicolons, space characters or tab characters.     Sometimes CSV files are created using manual editing where HTML tables are converted into CSV format, using tricks like replacing <td> tags with a list delimiter.   In such cases users will sometimes use a character, like a caret ^ character, that does not occur within the data.  Manifold can handle such characters as well.

 

The following table will be created using of the various characters shown in the following Notepad windows in use as list delimiters:

 

 

Comma , character:

 

 

Pipe | character:

 

 

Caret ^ character:

 

 

Tilde ~ character:

 

 

We can enter whatever character we want into the List delimiter combo box in the New Data Source dialog.

 

 

For example, in the illustration above we have entered the caret ^ character as the List delimiter.

End of Line Characters within Text

In CSV format, each line is one record.  Within each line, groups of characters delimited by whatever is the List delimiter character, a comma by default, give the values that should go into each field in that record.  The end of the line is marked by some end of line character or charactesr, which within MS-DOS and, thus by inheritance, Windows, is a sequence of two characters, a Carriage Return character, often abbreviated CR, and a Line Feed character, often abbreviated LF.

 

 

Text editors such as Notepad++ can show the end of line character or characters.  For example, the illustration above shows the example employees.csv file as seen in Notepad++ with special character display turned on.  Each line ends in a CR LF two-character sequence.

 

As anybody who has worked with formatting text files from different sources knows, systems other than Windows, such as a Linux text editor or a web page, might use only a Line Feed character to end a file.   CSV files from such systems will have only a single, Line Feed character at the end of each line.

 

 

The illustration above uses Notepad++ to show the first two lines of a CSV file created by a non-Windows system.   Notepad++ has been told to "wrap" long lines so they fit into the available window space.   Each line ends in only a single LF character.   

 

 

The Manifold CSV dataport will automatically read that CSV file correctly, accepting either CR LF or only LF as an end of line designation.  Each LF character is understood to end a line, that is, the end of each record, with the result being the table seen above.

 

Suppose, however, that one of the fields was intended to contain longer text that consisted of several paragraphs.   Consider the text below:

 

Criterion (i): The Buddha statues and the cave art in Bamiyan Valley are an outstanding representation of the Gandharan school in Buddhist art in the Central Asian region.

 

Criterion (ii): The artistic and architectural remains of Bamiyan Valley, and an important Buddhist centre on the Silk Road, are an exceptional testimony to the interchange of Indian, Hellenistic, Roman, Sasanian influences as the basis for the development of a particular artistic expression in the Gandharan school. To this can be added the Islamic influence in a later period.

 

Criterion (iii): The Bamiyan Valley bears an exceptional testimony to a cultural tradition in the Central Asian region, which has disappeared.

 

Criterion (iv): The Bamiyan Valley is an outstanding example of a cultural landscape which illustrates a significant period in Buddhism.

 

Criterion (vi): The Bamiyan Valley is the most monumental expression of the western Buddhism. It was an important centre of pilgrimage over many centuries. Due to their symbolic values, the monuments have suffered at different times of their existence, including the deliberate destruction in 2001, which shook the whole world.

 

Suppose that text was intended to all fit within a single cell, with a LF character at the end of each paragraph not intended to mean "this is the end of the record" but, instead, intended to mean "this is the end of this paragraph."    The additional LF characters in the middle of a text field would cause the standard CSV approach of understanding an LF character as an end of record to create partial records for each paragraph seen above.

 

We can see that in real life with UNESCO's downloads providing a list of World Heritage Sites.   UNESCO publishes the list in a variety of formats, including Excel XLS.  If imported into Excel and then exported as a CSV, the result will have LF characters embedded into paragraphs that are all intended to be part of the same field.

 

 

In the illustration above, we see the world heritage CSV file opened in Notepad++.   It shows how the first (long) record is ended with a CR LF sequence.   All of the other records are also ended with a CR LF sequence.

 

 

However, if we scroll further down we see that fields for some of the records have entire paragraphs that end with LF characters.  The green arrow marks a double quote " character that indicates the beginning of a text field.   That is followed with several paragraphs of text which have LF characters.  The additional HTML tags are there because UNESCO publishes the list in a form that, apparently, was originally intended for publication on the web.

 

If we create a new data source using the CSV dataport and we do not check the Allow multiline text values box, the extra LF characters will be interpreted as end of record marks and will result in the creation of partial, wrong, records in the table:

 

The illustration above shows what happens:  Each LF ending a paragraph also ended the record and resulted in a partial record for the paragraph.

 

We can fix the error by creating a new data source using the CSV dataport and then checking the Allow multiline text values box.  The extra LF characters within text in a field will be ignored, and only a CR LF or an LF outside of any double quote " character brackets that indicate a text field will be interpreted as an end of record:

 

 

The result is a correctly-imported table.

 

 

If we right-click on the cell containing the text with extra LF characters, we see all of the paragraphs have been imported as the contents of a single cell.

Automatic Adjustments

In most cases, the (auto) setting will work fine given that Manifold is reasonably good at auto-detecting how a given CSV file should be interpreted.  By manually specifying options we can guide the system in better interpreting a file.   If options we specify conflict with each other, the Manifold CSV dataport ignores the value for the less important option.   For example if the List delimiter is set to (auto) and when Manifold scans the file the system discovers that the list separator is a comma, then the dataport will not allow the Decimal separator character to also be a comma.

Exports

Manifold exports CSV files always using comma characters as list delimiters, periods as decimal separators and double quotes as text qualifiers.   Exporting boolean true or false values to CSV puts them within quotes.  When text values contain line breaks, exporting text values to a CSV file replaces line breaks with spaces to make sure exported data can be imported by as many products as possible.

 

Notes

Dates - Different languages have different ways for expression dates.  When recognizing date values as dates, Manifold is guided by whatever languages are specified in the Help - About dialog in the Lang entry, the presumption being that the date formats in use are those employed by whatever language is in use.    Reading a CSV file allows date fields to use multiple different languages.  For example, one field can use US English while another field can use German.

 

See Also

Tables

 

Data Types

 

File - Create - New Data Source

 

CSV Servers

 

Example: Import CSV and Create a Drawing - Import a CSV file by creating a New Data Source, copying and pasting the table conveyed by the CSV, prepping the table for use as a geocoded table and then creating a drawing.  

 

SQL Example: Create Lines from a Table of Points - A classic task in GIS is to rebuild objects from data in file formats, such as CSV, which decompose GIS objects like lines or areas into individual points.  In this example we recover the original lines from a table of points that has been imported from a CSV file.   The points mark the positions of coordinates (vertices) that originally represented lines marking the boundaries of ice shelves in Antarctica.  Each point has a LineID field giving the line of which it was a part, and a PointID field giving the order of the point within that line.   We use SQL and Manifold SQL functions to build lines from that table, automatically grouping points from each line in the correct order to reconstruct each line.