HTML

Manifold can read tables within HTML files.   This capability is often used to harvest tables from saved web pages.    

 

Manifold uses Microsoft facilities to connect to all Microsoft Office formats, including .html, .htm and other legacy Office formats such as .db, .mdb, .xls, and .wkx, together with newer Office formats such as .xlsx and .accdb.  If Manifold cannot import from such formats, that means the Windows system we are using is missing the necessary facilities.  Please see the Microsoft Office Formats - MDB, XLS and Friends topic for a solution.

Create an Example HTML File

In this example we visit a Wikipedia page giving a table with a list of Roman amphitheaters.   We would like to import that table into a Manifold project. To do that we will save the table in an HTML file.

 

 

In theory, we could simply tell our browser to save the page we are looking at as an HTML file.   The problem with that is that modern web pages contain a seeming infinity of junk, often including many tables that are not of interest.    It is easier to simply copy the table we are interested in, paste it into some convenient editor, and then save as an HTML file.   

 

We highlight the table of interest in the web page and we press Ctrl-C to copy the table to the Windows Clipboard.

 

 

We launch Microsoft Word with a new, blank document and press Ctrl-V to paste the table from the clipboard.   Word tries its best to copy everything it can, including links and images from the copied Wikipedia table.

 

 

We save the document as a web page.   Manifold can import from either .htm or .html.  

 

Import from HTML

Launch Manifold and choose File - Import.

 

 

To import from HTML format:

 

  1. Choose File - Import from the main menu.

  2. In the Import dialog, browse to the folder containing the data of interest.

  3. Double-click the file ending in .htm or .html for the data of interest.

  4. One or more tables will be created.
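
For readers who want to see what harvesting tables from a saved page involves outside of Manifold, the following is a minimal Python sketch using only the standard library's html.parser module. It is not how Manifold performs the import (Manifold relies on Microsoft facilities, as noted earlier in this topic). The class name TableExtractor, the helper extract_tables and the sample markup are all hypothetical, and nested tables are not handled.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect each <table> as a list of rows, each row a list of cell strings."""

    def __init__(self):
        super().__init__()
        self.tables = []    # one list of rows per <table> encountered
        self._row = None    # cells of the row being read, or None
        self._cell = None   # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text outside of a cell (layout whitespace, page clutter) is ignored.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None


def extract_tables(html_text):
    """Return every table found in html_text as nested lists of strings."""
    parser = TableExtractor()
    parser.feed(html_text)
    parser.close()
    return parser.tables


# Hypothetical sample markup standing in for a saved .htm file.
SAMPLE_HTML = """
<html><body>
<table>
  <tr><th>City</th><th>Country</th></tr>
  <tr><td>Arles</td><td>France</td></tr>
  <tr><td>Pula</td><td>Croatia</td></tr>
</table>
</body></html>
"""

tables = extract_tables(SAMPLE_HTML)
for row in tables[0]:
    print(row)
```

A page with several tables yields several entries in the returned list, which mirrors step 4 above: one import can create more than one table.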

 

 

 

If tables from the .htm are not created as shown above, that means the Windows system we are using is missing facilities necessary for a connection to HTML. Please see the Microsoft Office Formats - MDB, XLS and Friends topic for a solution.

 

 

We can double-click on tables that are created to view them.  

 

 

The table appears as imported.  Given the astonishing amount of junk encountered in tables in web pages in modern times, it is mildly surprising the table is as clean as it is.   The gray background shows that the table has no index and thus is neither selectable nor editable.  

 

We follow the one-click procedure in the Add an Index to a Table topic to add an index to the table.  This step is not illustrated, but the table will now have a white background color to show it is selectable and editable.

 

 

We can now edit the table as we like.   For example, we can right-click on the first cell of the first row and choose Edit to see the contents of that cell.   This is a fairly typical situation for a Wikipedia table, where numerous links are embedded in the table.   

 

We can increase the display width of the first column by dragging the boundary of the column header to the right, or by using the Layers pane to increase the width, as we have done in the illustration below, to 320 points.     We can see that almost all of the records in the first column have links embedded.

 

 

We can get rid of those links by using Regular Expressions and the Transform pane.

 

With the focus on the table, in the Transform pane we pick the City_(Roman name) field, and then we double-click the Replace template.

 

 

In the Replace template, we choose regular expression for the Replace option.  

 

In the Search for box, we enter the regular expression #.*# and we enter nothing in the Replace with box.    All text matching the regular expression pattern will be replaced with nothing, which deletes every instance of text that matches the pattern.

 

The regular expression #.*# says to match any sequence of characters that begins with a # character, followed by zero or more of any character, and ending with a # character.  That's exactly the link expression we want to eliminate from that field, to leave only the city names.
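
To make the behavior of the pattern concrete, here is a small sketch using Python's re module rather than Manifold itself. Manifold's regular expression dialect may differ in details, and the cell values below are hypothetical stand-ins for what a harvested field might contain.

```python
import re

# The same pattern entered in the Search for box.
LINK_PATTERN = re.compile(r"#.*#")

# Hypothetical cell value: a city name followed by a link wrapped in # characters.
cell = "Arelate #https://example.org/wiki/Arles#"
cleaned = LINK_PATTERN.sub("", cell).strip()

# .* also matches zero characters, so an empty ## pair is removed as well.
empty_span = LINK_PATTERN.sub("", "a##b")

# .* is greedy: with two #...# spans in one cell, the match runs from the
# first # to the last #, so the text between the spans is removed too.
two_links = "Arelate #link-one# or Arles #link-two#"
greedy = re.sub(r"#.*#", "", two_links).strip()

# The lazy form .*? stops at the nearest closing #, keeping the text between.
lazy = re.sub(r"#.*?#", "", two_links).strip()

print(repr(cleaned), repr(empty_span), repr(greedy), repr(lazy))
```

If a field can hold more than one link per cell, the greedy and lazy forms give different results, which is one more reason to check a preview before applying the change.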

 

For the Result destination, we choose Same Field, to edit the City_(Roman name) field in place.   If we preferred, from the pull down menu in the Result box we could have chosen any other text field in the table as the result destination, or we could have chosen New Field and entered the name of a new text field that the template would create in the table for the result destination.

 

Press the Preview button to see a preview.

 

 

We do not have to do a preview before applying the transform, but when using regular expressions to modify the source field in place, it is often a good idea to check our work with a preview before making mass changes.   When we press the Preview button, a preview column in blue preview color appears on the right side of the window, with the name of the template being previewed in the column head.  We can drag that column to a different position or resize it to make comparisons easier.  

 

The preview column shows that the regular expression we use will pick out the text between hash # characters, and will replace it with nothing, thus deleting that text.  That is exactly what we want, so we can go ahead and apply the transform template with confidence.
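
The preview-then-apply habit can be sketched in Python as follows. This is only an illustration of the workflow, not of how the Transform pane is implemented. The function preview_replace and the sample field values are hypothetical.

```python
import re


def preview_replace(values, pattern, replacement=""):
    """Return (original, proposed) pairs without modifying the input list."""
    regex = re.compile(pattern)
    return [(value, regex.sub(replacement, value)) for value in values]


# Hypothetical contents of the City_(Roman name) field.
city_field = [
    "Arelate #link#",
    "Pola #link#",
    "Nemausus",
]

# Step 1: build the preview and look only at the rows that would change.
preview = preview_replace(city_field, r"#.*#")
changed = [(before, after) for before, after in preview if before != after]
for before, after in changed:
    print(f"{before!r} -> {after!r}")

# Step 2: once the preview looks right, write the results back in place,
# trimming the space left behind where a link was removed.
city_field[:] = [after.strip() for _, after in preview]
```

Keeping the proposed values separate from the originals until they have been inspected plays the same role as the blue preview column.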

 

Press Transform.

 

 

The template immediately replaces all text in the City_(Roman name) field that matches the regular expression, that is, the text between hash # characters, with nothing, deleting the various links.  

 

Notes

Plenty to do - Most tables we harvest from web pages will require significant tinkering.  We will adjust the field names to more sensible names, we will use many different editing techniques, and we may find ourselves copying between fields to clean up messy imports.   The more expertise we develop with tools like the Transform pane, transform templates, regular expressions, the Select pane and similar, the less effort we will expend and the quicker our work will go.

 

See Also

Tables

 

Add an Index to a Table

 

Regular Expressions

 

Transform Pane

 

File - Create - New Data Source

 

DBMS Data Sources - Notes

 

Example: Closing without Saving - An example that shows how File - Close without saving the project can affect local tables and components differently from those saved already into a data source, such as an .mdb file database.

 

Example: Create and Use New Data Source using an MDB Database - This example illustrates the step-by-step creation of a new data source using an .mdb file database, followed by use of SQL.  Although now deprecated in favor of the more current Access Database Engine formats, .mdb files are ubiquitous in the Microsoft world, one of the more popular file formats in which file databases are encountered.  

 

Example: Switching between Manifold and Native Query Engines - How to use the !manifold and !native commands to switch a query in the Command window from using the Manifold query engine to whatever query engine is provided by a data source.

 

Microsoft Office Formats - MDB, XLS and Friends