Manifold can read Comma Separated Values (CSV) and similar formats that represent tables within text files, where each line is a record and the values for fields are separated by a character such as a comma. Each CSV file contains one table. A similar format uses a | "pipe" character as the list delimiter (the character between fields) and is often known as Pipe Separated Values, or PSV, format. The Tab Separated Values format is a variation of CSV that uses tabs between fields and usually uses a .tsv or .tab extension; in recent years .tsv has become much more popular than .tab. This topic shows how to acquire CSV data from CSV files. A related topic, CSV Servers, shows how to acquire CSV data from web servers. Manifold can handle CSV files using both ASCII and Unicode (Unicode recommended) encodings.
CSV is a very old format, dating back to the late 1970s and appearing using the "CSV" name by 1983. Many versions of it existed, leading to a movement in 2005 to standardize the format as RFC 4180. RFC 4180 describes CSV using the most popular conventions of the time and is cited in a section at the end of this topic. It was a big help in making CSV a more useful format, despite the tendency of many people to ignore the standard when authoring CSV files.
Because of the widespread lack of adherence to RFC 4180 or any other would-be standard, files differing from RFC 4180, for example using a separator other than commas, are still called "comma separated values" or "CSV" files. Manifold's dataport for CSV has (auto) settings and default settings that in most cases can guess what conventions a CSV file uses, always guessing an RFC 4180 compliant file correctly, with the exception that by default multiline text values will not be read (see the discussion in the RFC 4180 section below for why). Options can be specified when using the New Data Source dialog.
Tech tip: When importing or linking a CSV file, if strange results occur in a table, such as unexpected <NULL> values for the right-most fields in a table or other symptoms that indicate data from multiple fields has been combined into one field somewhere in a record, the problem is usually a CSV file that does not honor the basics of RFC 4180. That is often the result of extra or missing commas, or incorrect usage of double quote " characters in the file. For example, CSV files that contain a single double quote " character in the middle of a field without enclosing the field in double quotes do not honor RFC 4180. To get around that problem use the File - Create - New Data Source command to import or link the file with the Text qualifier set to an infrequently used character (such as ~ or ^) that is unlikely to occur in the file. If the file then imports or links properly (or if it still imports improperly but in a different way), the problem is likely incorrect use of double quotes in the file. Of course, if CSV files are so incorrectly written (or damaged) that they contain basic mistakes like incorrect use of double quotes, they may be full of other errors as well.
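As an illustration of the failure mode described in the tip, the sketch below uses Python's standard csv module (not Manifold's parser) to parse a line containing an unescaped double quote; the same kind of misparse shows up in any RFC 4180 reader as merged or split fields:

```python
import csv
import io

# Malformed per RFC 4180: the inner quotes around hello are not doubled,
# so the comma that was meant as text splits the field instead.
line = '"He said "hello", world",42\n'
row = next(csv.reader(io.StringIO(line)))
print(row)       # 3 fields instead of the intended 2, with quote debris
print(len(row))  # 3
```

Parsing the same line with a rare text qualifier, as the tip suggests, produces a different (still wrong) split, which confirms that quoting is the culprit.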
When exporting to a Manifold-created CSV file, Manifold always uses RFC 4180, replacing line breaks within fields by spaces to ensure compatibility with as many other CSV importers as possible.
We can bring data from CSV files into Manifold in six main ways, with an optional seventh way that in turn has three options. That provides a total of nine different ways to bring data from CSV files into a Manifold project.
Create a new data source using the File - Create - New Data Source command, choosing More... and then in the Type box choosing File.csv. This leaves the data in the CSV file and links that data into the project as a data source.
Create a new data source by using the File - Link command and choosing in the file type box CSV files (*.asc,*.csv,*.psv,*.tsv,*.tab). This is a simplified way to create a data source by using default settings without running the File - Create - New Data Source command. This leaves the data in the CSV file that uses one of those extensions and links that data into the project as a data source.
Create a new data source of type Web Server: csvserver using the File - Create - New Data Source command and specify a URL to a CSV file on a web server. This leaves the data in the CSV file on the web server and links that data into the project as a data source.
Create a new data source using any of the above three methods and then Import the data from the CSV file into the project by copying the linked table within the data source and pasting it into the main part of the project.
Import data from one or more CSV files using the File - Import command and choosing in the file type box CSV files (*.asc,*.csv,*.psv,*.tsv,*.tab). This copies the data from CSV files using those extensions and saves it in the project.
Import data from one or more files with a .csv or .psv extension by dragging and dropping those files into the Project pane. Make sure the focus is on a writable part of the Project pane, not on a data source. This copies the data from the CSV file and saves it in the project. Note that dragging and dropping only works for files with .csv or .psv extensions, since .asc and .tab extensions are used for other file formats besides CSV and thus do not automatically trigger use of the CSV dataport.
Optional:
Import data or Create a new data source from a file with a .csv extension using the optional GDAL package, an open source package that may be installed to use with Manifold in addition to Manifold's own dataports. GDAL may be used in three ways:
Import data using File - Import by choosing in the file type box .Files (GDAL/OGR) (*.*). This uses GDAL code as the importing code to copy data from files with a .csv extension and save it in the project.
Create a new data source by using File - Link by choosing in the file type box .Files (GDAL/OGR) (*.*). This uses GDAL code as the importing code to leave the data in a file with a .csv extension while linking that data into the project as a data source.
Create a new data source using the File - Create - New Data Source command, choosing More... and then in the Type box choosing the GDAL/OGR file type. This uses GDAL code to leave the data in the file with a .csv extension while linking that data into the project as a data source.
The first option above, creating a new data source, provides settable options when harvesting data from CSV files. We therefore will focus on connecting to CSV files using the File - Create - New Data Source command. If we prefer to import the data into our project we can copy the table from the CSV data source and paste it into the project.
Use Unicode: CSV as a text format is vulnerable to all the chaos that arises from using different character sets. When international characters are found in a CSV data file, best results are usually obtained by opening the CSV in a text editor, such as Notepad, and saving it not with plain UTF-8 encoding but instead with UTF-16 or UTF-8 with BOM encoding. If strange characters appear in text fields in an imported CSV, try opening the original CSV file in Notepad and do a Save As to see if it is using UTF-8 encoding. If so, switch the Save As encoding setting to UTF-16, save under a different filename, and then import that saved, Unicode, CSV. If UTF-16 is not an option in the version of Notepad being used, save the file under a different filename with UTF-8 with BOM encoding, and then import that version.
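For readers who prepare CSV files with scripts rather than Notepad, the same UTF-8 with BOM result can be produced with Python's "utf-8-sig" codec. This is a general illustration of the encoding, not a Manifold-specific requirement:

```python
import csv
import os
import tempfile

# Write a CSV as UTF-8 with BOM ("utf-8-sig" in Python) so importers
# that sniff for a byte order mark detect the Unicode encoding reliably.
rows = [["Name", "City"], ["Søren", "København"]]
path = os.path.join(tempfile.mkdtemp(), "cities.csv")
with open(path, "w", newline="", encoding="utf-8-sig") as f:
    csv.writer(f).writerows(rows)

# The raw file now starts with the UTF-8 BOM bytes EF BB BF.
with open(path, "rb") as f:
    head = f.read(3)
print(head)  # b'\xef\xbb\xbf'

# Reading back with the same codec strips the BOM transparently.
with open(path, newline="", encoding="utf-8-sig") as f:
    readback = list(csv.reader(f))
print(readback == rows)  # True
```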
Choose File - Import in the main menu.
Choose CSV files (*.asc,*.csv,*.psv,*.tsv,*.tab) in the file type box.
Navigate to the desired file.
Double-click the desired file or click it to highlight it and press Import.
Default settings will usually work for most CSV files. A new table appears in the project.
Recommended: Add an index - Open the new table (double-click it) and choose Edit - Schema and press the Add Identity button (the second yellow star button) in the dialog's toolbar. Press Save Changes.
A short cut way to import a .csv, .psv or .tsv file is to drag it from the Windows file explorer display and drop it into the Project pane. If you have data sources open, make sure the focus in the Project pane is not on or inside one of them, or the drag and drop will attempt to create the imported table within that data source. Put the focus on an empty part of the Project pane (click on an empty spot) and then drag and drop the file into the Project pane.
Dragging and dropping a .csv, .psv or .tsv file works correctly for files that conform to CSV standards in a common sense way. For example, if the first line in the file clearly consists of all text values the system will attempt to read the first line as the names of fields. However, if the first line is something like 1,2,3,4,5,6,7 the system will read those values as integers, not text, and thus not use them as the names of fields. In contrast, if the first line was "1","2","3","4","5","6","7" then the system would see from the enclosure of values by quotes that they are to be read as text, and will use the first line for the names of fields.
To import or link CSV files that have numbers in the first line while using those numbers as the names of fields (not a very portable way to name fields, but what the heck...) use the File - Create - New Data Source command and make sure a check mark is placed in the First line contains field names option. Names specified by the first line of the CSV file of course must be legal field names, and they all must be different (a table cannot have two different fields with the same name). See the discussion below on controls for creating a new data source.
Choose File - Create - New Data Source in the main menu, or right-click in the Project pane and choose Create - New Data Source.
Choose More... in the dropdown menu to launch the New Data Source dialog.
Choose File: csv in the Type box.
Click the [...] browse button to navigate to the desired .csv file and to open it.
Default settings will usually work for most CSV files. Press Create Data Source.
A new data source appears in the project. Open it to see the table that is linked in from the CSV file.
Click on the table linked in from the CSV to highlight it.
Press Ctrl-C or click the Copy button in the Project pane toolbar.
Click anywhere in the main part of the project outside of the new data source's hierarchy.
Press Ctrl-V or click the Paste button in the Project pane toolbar.
Tables that are imported or linked from CSV files are neither selectable nor editable by default since they do not have any indexes. We cannot add an index to a linked CSV table since CSV format does not support indexes, but we can add an index to an imported table.
Local tables that are imported from CSV or copied and pasted from a linked CSV table are fully capable, fully DBMS Manifold tables. With two clicks we can easily add a key field and index to them to enable editing and selection, using the procedure given in the Add an Index to a Table topic.
Double-click on the table to open it.
Choose Edit - Schema.
In the Schema dialog click the Add Identity button (the second yellow star button) in the dialog's toolbar.
Press the Save Changes button.
In the main menu choose File - Create - New Data Source.
Choose More... to launch the New Data Source dialog.
Click the browse button to choose a file.
The Select File dialog is a variation of the well-known Windows File Open dialog. By default, it shows files ending in the standard .asc, .csv, .psv, .tsv, and .tab file extensions used with CSV files. Click on the file to load it into the File name box. If we like, we can write the name of any file manually into the File name box. For example, if we have a CSV file that uses a .txt extension, we can put the name employees.txt into the File name box and that will be read as well.
Name: Name for the new data source, "Data Source" by default. Specify a more memorable name as desired. If we forget the origin of a data source we can hover the mouse over the data source and a tool tip will provide connection information.

Type: To connect to a CSV file, choose File: csv from the very long list of data sources in the pull down menu.

Source: A connection string or path to the CSV file. The source string can be entered manually or created using a Windows dialog launched by the [...] browse button.

[...] Browse button: Click to use a standard Windows Select File dialog (very similar to the usual File Open dialog) to choose the desired CSV file.

Open as read-only: CSV files in general are not writable, even if the file itself on disk is not marked read-only by Windows. All the same, Manifold creates a read/write mfd_meta file that can be used to store characteristics like the width of columns and other properties within the project file. Checking the Open as read-only box forces the entire data source to be read-only.

First line contains field names: The checkbox allows us to choose three different settings by clicking the box:
auto - Box filled with a dash in Windows 11 or a small filled box in older Windows systems. This is the default. If auto-detection indicates the first line uses text for all fields, use those text values for field names.
unchecked - An empty box. Do not use the first line as field names.
checked - Box filled with a check mark. Read all values in the first line as text and use those values for field names.
The first line in most CSV data gives the names of fields. Some CSV data sources do not provide the names of fields in the first line. If some of the field names in the first line are strange, like a single number, make sure to switch this from auto-detect to checked.
Important: when this option is checked the field names given by the CSV file must be legal field names in Manifold. For example, they cannot contain line breaks, and they all must be unique (cannot repeat the same name). See the discussion on Names in the Schema topic.

Read all fields as text: By default the system will try to guess which fields contain text and which fields contain integers or floating point numbers. In some cases, such as ZIP codes in US postal addresses, what appears to be a number should be treated as text. Checking this box imports all fields as text, even if they are not marked as text and even if they appear to be numeric fields.

Allow multiline text values: When checked, ignore linefeed and other characters normally denoting the end of a line when they occur within text, that is, within text qualifier characters. Check this box to deal with "csv" files that include multiple paragraphs within a single text field, as often occurs when CSV files are created from data originally harvested from web pages, KML or other web-oriented formats.

List delimiter: The character used within the CSV data to separate fields within a record. The default (auto) setting usually guesses correctly. Enter any character directly into the box instead of (auto), or choose from the list of standard characters: comma, colon, semicolon, pipe, space or tab. RFC 4180 compliant usage is a comma , character.

Decimal separator: The character used within the CSV file to denote the fractional part of a number, almost always either a period or a comma. The default (auto) setting usually guesses correctly. Enter any character or choose from the list of two standard characters: period or comma. A period is used in North America and many former UK colonies. A comma is used in Europe, South America and many African countries.

Text qualifier: The character used within the CSV data to wrap values intended to be text. The default (auto) setting usually guesses correctly. Choose the character from the drop down list that provides two standard characters, a double " quote character or a single ' quote character, or enter the character of your choice into the same box instead of (auto). The text qualifier character also escapes the text qualifier itself. RFC 4180 compliant usage is a double " quote character.

Create Data Source: Create the new data source in the project pane and close the dialog.

Edit Query: Launch the Command Window loaded with a query that creates the data source using the given settings. A great way to learn how to use SQL to create data sources.

Cancel: Exit the dialog without doing anything.
"Comma-separated value" format is a text file format used to convey information in tables. Each line in the text file contains data for a single record. Values for each field are separated by a special character, for example, a comma. The specific character that is used to separate field values is called the list delimiter character. Usually a comma is used (hence the name "comma" separated value files...), but other characters can be used as list delimiters, especially if the data in the file most convey text that contains commas.
Much of how CSV files are used is a matter of tradition that has become accepted over the centuries since the original introduction of CSV format, about the time of the domestication of the ox, and the introduction of IBM FORTRAN in 1972. Over the years, various traditions and attempts at standardization have emerged, often trying to deal with how to represent commas in the data when the comma character itself is used as part of the format.
The idea behind CSV is simple. Consider a file called employees.csv which contains the following text:
Last Name, Country, Height
Fuller, USA, 172
Suyama, UK, 165
Smith, USA, 185
Dodsworth, UK, 187
Martin, France, 180
Pereira, France, 169
By tradition, the first line in a CSV file gives the names of the fields in the table, with the name of each field being separated by a comma , character. Based on the names of the fields and the contents of the file, we might guess that the intended table is a list of employees, giving the last name of each, their country, and the height of the employee in centimeters.
Manifold will read our example employees.csv text file and will create a table from it that looks like the above, with the Last Name and Country fields being text data types and the Height field being an integer data type.
In the example file above it is obvious that the values for Height are intended to be numbers. When all records for a specific field contain only the numeric characters 0, 1, 2, ..., 9 it is usually a fair guess that field is intended to be a number. But sometimes that is not a correct guess. For example, ZIP codes for US postal addresses cannot be treated as numbers given that leading zeros in ZIP codes are significant. If our ZIP code is 02138, a ZIP code in Cambridge, Massachusetts, the leading zero must be preserved.
To avoid the need to guess, CSV files use long-accepted traditions for manually specifying what values are supposed to be text. A convention almost universally respected in programs that create or consume CSV files is that any content wrapped with quote characters is to be treated as text. We could write the employees.csv file as:
"Last Name","Country","Height"
"Fuller","USA",172
"Suyama","UK",165
"Smith","USA",185
"Dodsworth","UK",187
"Martin","France",180
"Pereira","France",169
Since of course it would be far too easy for everyone to agree on such a simple rule, some people use double quote " characters to wrap a value intended as text and other people use a single quote ' character to wrap text. Manifold can handle either, or any other character that is used.
Checking the Read all fields as text option tells Manifold to interpret all values as text data types, forcing exclusively numeric content, like ZIP codes, to be read as text whether it is wrapped with quote characters or not.
Manifold can handle CSV files using both ASCII (ANSI) and Unicode (UTF-8 recommended) encodings. When importing text fields from a CSV file that uses ASCII encoding, the text fields will be imported as data type varchar. When importing text fields from a CSV file that uses Unicode encodings, the text fields will be imported as data type nvarchar.
Everything in a CSV file is text, but sometimes, as in the example above, it is clear that some fields are intended as numbers and not text. Manifold will try to guess what data type a field should be and import it that way. Fields that contain only text characters will usually reliably be imported as text, either as varchar or nvarchar depending on what encoding the source file used. Fields containing numbers that have a decimal separator will be imported as FLOAT64 data types. Otherwise, fields containing numbers will be imported as INT32 unless the numbers they contain are too large for the INT32 data type, in which case they will be imported as text. INT32 is used since in the vast majority of CSV files integer values will fit into INT32, which takes up half the space of INT64 values. If a CSV file's fields have such large numbers that INT64 is required, they will be imported as text, and then it is easy to add an INT64 field to the table and to copy/CAST the text version of the number into an INT64 data type. Dates and times are usually recognized in most formats as DATETIME types, so long as the date comes first (with the year last in the date format) followed by the time. Fields intended as Booleans will be imported either as integers if they represent Boolean values as 0 or 1, or as text if they use True or False or equivalent words.
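The guessing order described above can be sketched roughly as follows. This is an illustrative reconstruction, not Manifold's actual code, and it ignores the date/time and Boolean cases:

```python
# Rough sketch of the guessing order: INT32 if the value fits,
# FLOAT64 if it has a decimal separator, otherwise text (nvarchar).
INT32_MIN, INT32_MAX = -2**31, 2**31 - 1

def guess_type(value: str) -> str:
    try:
        n = int(value)
        # Integers too large for INT32 fall back to text, per the
        # behavior described above.
        return "int32" if INT32_MIN <= n <= INT32_MAX else "nvarchar"
    except ValueError:
        pass
    try:
        float(value)
        return "float64"
    except ValueError:
        return "nvarchar"

print(guess_type("172"))          # int32
print(guess_type("3.141"))        # float64
print(guess_type("99999999999"))  # nvarchar — too large for INT32
print(guess_type("Fuller"))       # nvarchar
```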
It seems obvious that whoever first decided to use commas as a list delimiter to separate field values in CSV files did not expect to use CSV files to store text values such as street addresses, since constructions such as "123 Main Street, Smallville, Ohio" obviously use commas, nor did they expect to store text that includes sentences as often appear in descriptions and comments. That is such a painful restriction in CSV format that there are two ways to get around the restriction:
The first method is that commas used within quoted text do not count as list delimiters. For example, if we wrote the employees.csv file as:
"Name","Country","Height"
"Fuller, Andrew","USA",172
"Suyama, Michael","UK",165
"Smith, Tim","USA",185
"Dodsworth, Anne","UK",187
"Martin, Xavier","France",180
"Pereira, Laurent","France",169
Then the resulting table would be
The above is obviously prone to error and confusing. The second method is to avoid using a common character, like a comma, that is often found in text and instead to use a rare character such as a pipe | symbol. For example, if we wrote the employees.csv file as:
Name|Country|Height
Fuller, Andrew|USA|172
Suyama, Michael|UK|165
Smith, Tim|USA|185
Dodsworth, Anne|UK|187
Martin, Xavier|France|180
Pereira, Laurent|France|169
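The second method works the same way in any CSV reader that accepts a configurable delimiter; for example, with Python's standard csv module a pipe delimiter leaves commas free for the data, with no quoting needed at all:

```python
import csv
import io

# With a pipe delimiter, commas in the data are just ordinary text.
data = 'Name|Country|Height\nFuller, Andrew|USA|172\n'
rows = list(csv.reader(io.StringIO(data), delimiter='|'))
print(rows)
# [['Name', 'Country', 'Height'], ['Fuller, Andrew', 'USA', '172']]
```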
Then the resulting table, whether or not we chose to continue using quotes to set off text, still would be
Using commas is such a wretched choice when saving text info in tables and the pipe character tends to lend such clarity that using the pipe character is becoming more popular. In fact, it is so popular that Pipe separated values (PSV) files have acquired a name and a three letter extension all their own. Super!
If we have spent our entire lives in the US we may be surprised to learn that in many other countries people do not write pi as 3.141 but instead write it as 3,141 and by that they do not mean three thousand one hundred forty one but instead mean the same as 3.141 means in the US. That is, they use the comma , character to denote the fractional part of a decimal fraction. Likewise, someone in France might be surprised to learn that in the US they do not write pi as 3,141.
Cultures in some countries, primarily in Europe, South America and Africa, utilize a comma , character to indicate a decimal point. In those cultures, the value of pi, a magic number relating the circumference of a circle to its diameter, is written approximately as 3,141. Other countries, primarily in North America, the UK and former British colonies, use the period . character, to write pi as approximately 3.141. To confuse the matter further, either the comma or the period may be used as a "thousands" separator to group digits.
A CSV file that contains text such as 3,141 depending on the culture that wrote the text could be intended to mean a numeric value about three and a tenth, or a numeric value of three thousand one hundred and forty one. Likewise, text such as 2.718 in some cultures could either mean the value of e (another magic number first named as "e" by Leonhard Euler in the early 1700s), that is, a fractional number more than two but less than three, or in other cultures 2.718 could mean two thousand seven hundred eighteen.
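The ambiguity can be sketched as follows. Here parse_number is a hypothetical helper, not part of any library, showing why a CSV reader must know the decimal separator before converting text to numbers:

```python
# Why the decimal separator must be known in advance: the same text
# yields very different numbers depending on the convention assumed.
def parse_number(text: str, decimal_separator: str) -> float:
    if decimal_separator == ',':
        # European style: strip period thousands separators, then
        # turn the decimal comma into a period for float().
        text = text.replace('.', '').replace(',', '.')
    else:
        # US style: strip comma thousands separators.
        text = text.replace(',', '')
    return float(text)

print(parse_number("3,141", ','))  # 3.141 — roughly pi
print(parse_number("3,141", '.'))  # 3141.0 — three thousand odd
```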
Given the popularization of CSV by American companies such as IBM, the use of a comma character to separate field values made European-style use of the comma character as a decimal separator infeasible in CSV. The use of a comma character to separate field values was also a painful restriction, in that it prevented any uniform way from emerging to use commas in field names, such as Population, 1990, or in text values such as complete addresses, as in 123 Main Street, Richmond, Virginia.
Almost immediately, therefore, Europeans and other hip people created variations on CSV format that used other characters to separate field values. Popular choices over the years have been the space character (obviously a limiting choice given the tendency of text to use space characters), the tab character, a semicolon ; character, a colon : character and, most recently, the pipe | character. Using pipes to separate field values has become so popular that such files even have a name of their own: pipe-separated value files, using the .psv three-letter extension.
Manifold will auto-recognize the above popular choices in most circumstances, or the desired separator can be specified.
We have touched on using the pipe character as an alternative to commas. Other characters are also popular. Older CSV files will often use characters that are not commas but which are also frequently found in ordinary text, such as colons, semicolons, space characters or tab characters. Sometimes CSV files are created using manual editing where HTML tables are converted into CSV format, using tricks like replacing <td> tags with a list delimiter. In such cases users will sometimes use a character, like a caret ^ character, that does not occur within the data. Manifold can handle such characters as well.
The following table will be created using any of the various characters shown in the following Notepad windows as list delimiters:
Comma , character:
Pipe | character:
Caret ^ character:
Tilde ~ character:
We can enter whatever character we want into the List delimiter combo box in the New Data Source dialog.
For example, in the illustration above we have entered the caret ^ character as the List delimiter.
In CSV format, each line is one record. Within each line, groups of characters delimited by whatever is the List delimiter character, a comma by default, give the values that should go into each field in that record. The end of the line is marked by some end of line character or characters, which within MS-DOS and, thus by inheritance, Windows, is a sequence of two characters, a Carriage Return character, often abbreviated CR, and a Line Feed character, often abbreviated LF.
Text editors such as Notepad++ can show the end of line character or characters. For example, the illustration above shows the example employees.csv file as seen in Notepad++ with special character display turned on. Each line ends in a CR LF two-character sequence.
As anybody who has worked with formatting text files from different sources knows, systems other than Windows, such as a Linux text editor or a web page, might use only a Line Feed character to end a file. CSV files from such systems will have only a single, Line Feed character at the end of each line.
The illustration above uses Notepad++ to show the first two lines of a CSV file created by a non-Windows system. Notepad++ has been told to "wrap" long lines so they fit into the available window space. Each line ends in only a single LF character.
The Manifold CSV dataport will automatically read that CSV file correctly, accepting either CR LF or only LF as an end of line designation. Each LF character is understood to end a line, that is, the end of each record, with the result being the table seen above.
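The same acceptance of either line ending is common to most CSV readers; for example, Python's standard csv module parses both forms to identical records:

```python
import csv
import io

# Windows-style CR LF and Unix-style LF endings parse to the same
# records; either sequence marks the end of a record.
windows = 'Name,Height\r\nFuller,172\r\n'
unix = 'Name,Height\nFuller,172\n'
rows_w = list(csv.reader(io.StringIO(windows)))
rows_u = list(csv.reader(io.StringIO(unix)))
print(rows_w == rows_u)  # True
print(rows_u)            # [['Name', 'Height'], ['Fuller', '172']]
```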
Suppose, however, that one of the fields was intended to contain longer text that consisted of several paragraphs. Consider the text below:
Criterion (i): The Buddha statues and the cave art in Bamiyan Valley are an outstanding representation of the Gandharan school in Buddhist art in the Central Asian region.
Criterion (ii): The artistic and architectural remains of Bamiyan Valley, and an important Buddhist centre on the Silk Road, are an exceptional testimony to the interchange of Indian, Hellenistic, Roman, Sasanian influences as the basis for the development of a particular artistic expression in the Gandharan school. To this can be added the Islamic influence in a later period.
Criterion (iii): The Bamiyan Valley bears an exceptional testimony to a cultural tradition in the Central Asian region, which has disappeared.
Criterion (iv): The Bamiyan Valley is an outstanding example of a cultural landscape which illustrates a significant period in Buddhism.
Criterion (vi): The Bamiyan Valley is the most monumental expression of the western Buddhism. It was an important centre of pilgrimage over many centuries. Due to their symbolic values, the monuments have suffered at different times of their existence, including the deliberate destruction in 2001, which shook the whole world.
Suppose that text was intended to all fit within a single cell, with a LF character at the end of each paragraph not intended to mean "this is the end of the record" but, instead, intended to mean "this is the end of this paragraph." The additional LF characters in the middle of a text field would cause the standard CSV approach of understanding an LF character as an end of record to create partial records for each paragraph seen above.
We can see that in real life with UNESCO's downloads providing a list of World Heritage Sites. UNESCO publishes the list in a variety of formats, including Excel XLS. If imported into Excel and then exported as a CSV, the result will have LF characters embedded into paragraphs that are all intended to be part of the same field.
In the illustration above, we see the world heritage CSV file opened in Notepad++. It shows how the first (long) record is ended with a CR LF sequence. All of the other records are also ended with a CR LF sequence.
However, if we scroll further down we see that fields for some of the records have entire paragraphs that end with LF characters. The green arrow marks a double quote " character that indicates the beginning of a text field. That is followed with several paragraphs of text which have LF characters. The additional HTML tags are there because UNESCO publishes the list in a form that, apparently, was originally intended for publication on the web.
If we create a new data source using the CSV dataport and we do not check the Allow multiline text values box, the extra LF characters will be interpreted as end of record marks and will result in the creation of partial, wrong, records in the table:
The illustration above shows what happens: Each LF ending a paragraph also ended the record and resulted in a partial record for the paragraph.
We can fix the error by creating a new data source using the CSV dataport and then checking the Allow multiline text values box. The extra LF characters within text in a field will be ignored, and only a CR LF or an LF outside of any double quote " character brackets that indicate a text field will be interpreted as an end of record:
The result is a correctly-imported table.
If we right-click on the cell containing the text with extra LF characters, we see all of the paragraphs have been imported as the contents of a single cell.
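The RFC 4180 behavior that makes this work, namely that line breaks inside a quoted field are data rather than record ends, can be seen with Python's standard csv module as well. The text below is a shortened stand-in for the UNESCO description field:

```python
import csv
import io

# A quoted field may span several lines: the LF characters inside the
# quotes are part of the value, not record terminators.
data = 'Site,Description\nBamiyan,"Criterion (i): ...\nCriterion (ii): ..."\n'
rows = list(csv.reader(io.StringIO(data)))
print(len(rows))           # 2 records, not 3
print('\n' in rows[1][1])  # True — the line break survived in the field
```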
In most cases the (auto) setting will work fine, since Manifold is reasonably good at auto-detecting how a given CSV file should be interpreted. Manually specifying options lets us guide the system when automatic detection is not enough. If options we specify conflict with each other, the Manifold CSV dataport ignores the value for the less important option. For example, if the List delimiter is set to (auto) and Manifold's scan of the file discovers that the list separator is a comma, the dataport will not allow the Decimal separator character to also be a comma.
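As a rough analogue of the (auto) setting (this uses Python's csv.Sniffer, not Manifold's actual detection logic), a scanner can guess the list delimiter from a sample of the file:

```python
import csv

# A small pipe-separated sample, as might come from a PSV file.
sample = "name|state|elevation\nAspen|CO|2405\nLeadville|CO|3094\n"

# Sniffer inspects the sample and guesses the delimiter from the set of
# candidates, similar in spirit to the dataport's (auto) setting.
dialect = csv.Sniffer().sniff(sample, delimiters=",;|\t")
print(dialect.delimiter)  # '|'
```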
Manifold exports CSV files always using comma characters as list delimiters, periods as decimal separators and double quotes as text qualifiers. Exporting boolean true or false values to CSV puts them within quotes. When text values contain line breaks, exporting text values to a CSV file replaces line breaks with spaces to make sure exported data can be imported by as many products as possible.
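The line-break convention on export can be sketched in Python. This is an illustration of the convention described above, not Manifold's exporter; the replacement of line breaks with spaces is done as a pre-processing step before writing:

```python
import csv
import io

rows = [
    ["id", "notes"],
    [1, "First paragraph.\nSecond paragraph."],  # text value with a line break
]

buf = io.StringIO()
writer = csv.writer(buf, delimiter=",", quotechar='"', lineterminator="\r\n")
for row in rows:
    # Replace line breaks in text values with spaces so the exported file
    # can be read by importers that do not support multiline fields.
    writer.writerow([v.replace("\r\n", " ").replace("\n", " ")
                     if isinstance(v, str) else v for v in row])

print(buf.getvalue())
```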
RFC 4180 implies that all fields in CSV files are text. Although RFC 4180 does not mention it, for the convenience of users almost all CSV importers will scan a CSV file and by default will import fields that contain only numbers as numeric data types. Manifold does the same by default, importing numeric fields that contain a decimal separator as floating point numbers and other numeric fields as integers.
RFC 4180 mentions that each field "may or may not be enclosed in double quotes" but it does not assign any special meaning to the use of double quotes to enclose a field except to say that if a field is not enclosed by double quotes then double quotes, line breaks and commas may not appear within the field. A double quote that appears inside a field must be escaped by preceding it with a double quote.
In addition to those uses of double quotes specified by RFC 4180, a nearly universal rule (also followed by Manifold) is that a field enclosed by double quotes must be imported as a text field. That is essential when using CSV files to convey data like US ZIP codes (postal codes), which use numeric characters but in which a leading zero is significant: the ZIP code 02138 is a different code than 2138 when comparing data. If ZIP code fields are not enclosed by double quotes, most CSV implementations will import them as numeric fields, so the five-character text sequence 02138 will be imported as the number 2138. The File - Create - New Data Source command provides an option, when importing or linking CSV files, to read all fields as text whether or not they are enclosed by double quotes.
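Python's csv module can demonstrate the quoted-means-text rule: with QUOTE_NONNUMERIC the reader converts every unquoted field to a number, while quoted fields stay text. The two ZIP code values below illustrate the leading-zero problem:

```python
import csv
import io

# The same ZIP code twice: first enclosed in double quotes, then not.
raw = '"02138",2138\n'

# QUOTE_NONNUMERIC converts unquoted fields to numbers; quoted fields
# remain text, preserving the leading zero.
row = next(csv.reader(io.StringIO(raw), quoting=csv.QUOTE_NONNUMERIC))
print(row)  # ['02138', 2138.0]
```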
RFC 4180 mentions text encoding only in the MIME discussion, to wit: "Common usage of CSV is US-ASCII, but other character sets defined by IANA for the "text" tree may be used in conjunction with the "charset" parameter." Manifold can handle CSV files using both ASCII and Unicode (Unicode recommended) encodings. When importing text fields from a CSV file that uses ASCII encoding, the text fields will be imported as data type varchar. When importing text fields from a CSV file that uses Unicode encodings, the text fields will be imported as data type nvarchar.
The RFC 4180 standard defines CSV format by citing the rules followed by most CSV implementations, as follows:
1. Each record is located on a separate line, delimited by a line break (CRLF). For example:
aaa,bbb,ccc CRLF
zzz,yyy,xxx CRLF
2. The last record in the file may or may not have an ending line break. For example:
aaa,bbb,ccc CRLF
zzz,yyy,xxx
3. There may be an optional header line appearing as the first line of the file with the same format as normal record lines. This header will contain names corresponding to the fields in the file and should contain the same number of fields as the records in the rest of the file (the presence or absence of the header line should be indicated via the optional "header" parameter of this MIME type). For example:
field_name,field_name,field_name CRLF
aaa,bbb,ccc CRLF
zzz,yyy,xxx CRLF
4. Within the header and each record, there may be one or more fields, separated by commas. Each line should contain the same number of fields throughout the file. Spaces are considered part of a field and should not be ignored. The last field in the record must not be followed by a comma. For example:
aaa,bbb,ccc
5. Each field may or may not be enclosed in double quotes (however some programs, such as Microsoft Excel, do not use double quotes at all). If fields are not enclosed with double quotes, then double quotes may not appear inside the fields. For example:
"aaa","bbb","ccc" CRLF
zzz,yyy,xxx
6. Fields containing line breaks (CRLF), double quotes, and commas should be enclosed in double-quotes. For example:
"aaa","b CRLF
bb","ccc" CRLF
zzz,yyy,xxx
7. If double-quotes are used to enclose fields, then a double-quote appearing inside a field must be escaped by preceding it with another double quote. For example:
"aaa","b""bb","ccc"
To describe a modernized form of RFC 4180 as it is generally applied we can add four more rules to the above seven rules given by RFC 4180:
8. When the header line of a CSV file is used to specify the names of fields, those names must be legal field names. For example, they must be different from each other and they cannot contain line breaks.
9. Fields enclosed by double quotes must be imported or linked as text type fields regardless of their contents.
10. When fields not enclosed by double quotes contain text that can be interpreted as numeric values or dates, import them as numeric or datetime data types.
11. When fields contain line breaks as allowed by point 6 above, provide an option for importing such multiline text values, but by default treat the line breaks within them as ends of record.
Although the sixth rule in RFC 4180 explicitly states that multiline fields are allowed so long as the field is enclosed by double quotes, in real world CSV files such multiline fields overwhelmingly indicate an error in the file rather than a deliberate attempt to enclose a multiline text value. Manifold therefore by default does not read such fields as multiline text values. That makes errors in CSV files much easier to find, because the afflicted records will stand out as being shorter in terms of real data contents. The CSV file can then be corrected by manually editing it with Notepad++ or some other text editor and then re-imported or re-linked to get correct data.
If a file legitimately contains multiline text fields, it can be imported accurately by using the File - Create - New Data Source command, and checking the Allow multiline text values option to read multiline fields that are enclosed by text qualifier characters (double quotes by default).
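One reason the default behavior helps find errors, as described above, is that damaged records are easy to detect mechanically. A toy scan (assuming, for simplicity, that fields contain no embedded commas) flags the partial lines produced by a stray line break:

```python
# Hypothetical CSV in which record 2 has been split by an embedded LF.
raw = ('id,notes,name\n'
       '1,ok,Alpha\n'
       '2,"start of\n'
       'note",Beta\n'
       '3,ok,Gamma\n')

# The header line has two commas; any physical line with a different
# count is a suspect partial record.
expected = raw.splitlines()[0].count(",")
suspect = [number for number, line in enumerate(raw.splitlines(), start=1)
           if line.count(",") != expected]
print(suspect)  # [3, 4]: the two halves of the split record
```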
Dates - Different languages have different ways of expressing dates. When recognizing date values as dates, Manifold is guided by whatever languages are specified in the Lang entry of the Help - About dialog, the presumption being that the date formats in use are those of the specified languages. Reading a CSV file allows date fields to use multiple different languages. For example, one field can use US English while another field can use German.
File - Create - New Data Source
Example: Import CSV and Create a Drawing - Import a CSV file by creating a New Data Source, copying and pasting the table conveyed by the CSV, prepping the table for use as a geocoded table and then creating a drawing.
SQL Example: Create Lines from a Table of Points - A classic task in GIS is to rebuild objects from data in file formats, such as CSV, which decompose GIS objects like lines or areas into individual points. In this example we recover the original lines from a table of points that has been imported from a CSV file. The points mark the positions of coordinates (vertices) that originally represented lines marking the boundaries of ice shelves in Antarctica. Each point has a LineID field giving the line of which it was a part, and a PointID field giving the order of the point within that line. We use SQL and Manifold SQL functions to build lines from that table, automatically grouping points from each line in the correct order to reconstruct each line.