Table Windows and Big Data

If we are used to working with small data sets, we might have expectations of table windows that don't match the reality of working with big data.  For many people, Manifold is the first application they've ever used that can open a table window into a table which, laid out one screen after another, would be longer than the radius of the Earth.  The reality of working with such big tables can conflict with what we've grown accustomed to when working with smaller, more human-scale tables.

 

For example, we might expect to open a table window, see the first records at the beginning of the table, and then scroll to the very bottom of the table by dragging the vertical scroll bar at the right of the window all the way down.  When we do that with a big table in Manifold and the display moves only a few thousand records out of the millions we know are in the table, we might wonder, "Hey... What happened?  Where's the bottom?"

 

When we work with spreadsheets involving a few thousand rows, or with a database that has a few thousand records, it makes sense that the table window is a view into the entire table and that the vertical scroll bar shows, at least in some relative way, the position of the current view between the beginning and the end of the table.  Scroll the bar all the way down and we see the last few records.  Scroll the bar all the way up and we see the first few.

 

But that is not a good mental model to apply when working with tables involving billions, or even just millions, of records.    A window showing a few dozen records out of millions shows such a microscopic fraction of the total table that there is no sensible meaning to a vertical scroll bar in the context of the entire table.   

 

We could, of course, use Manifold controls such as Ctrl-End to jump to the end of a table and display a screen full of records from the end of the table.  We could then scroll up from there.  But even if we do that, we still see only the few records that fit on the screens we jump to or scroll through, which is a tiny fraction of a large table.

 

Using the vertical scroll bar to get near a single record out of millions would be like trying to use a horizontal scroll bar on a map showing all of the United States to jump to a particular street address in Kansas City.  That is not a realistic expectation.

 

Instead of using a scroll bar for the entire US to try to find a specific address in Kansas City, we would use a different approach: we would use an automated search tool, for example, entering the address into a search box that zooms us into just the immediate area around that address.

 

Table windows in Manifold are like that as well.  They are a view of the table intended for browsing a screen's worth of records at a time.  Once we find individual records using some automated means, like the Select pane, we can see the desired record in context in the table window, edit that record and so on.  Table windows also use fill strategies, as discussed in the Big Tables and Table Fill Strategies sections of the Tables topic.

 

When we use table windows with small tables it is true we can browse the data in a window to review many of the records and thus get our heads around the data or find records that interest us.    But that doesn't work when data sets have millions of records because we could spend weeks peeking at data through windows showing a few dozen records at a time and still not see more than a fraction of the records.  

 

For example, a famous thread in the georeference forum discussed a LiDAR point cloud data set that contained 1.72 billion records in the table.  How big is a table that holds 1.72 billion records?  If we displayed the table as a series of screens, with each screen full of records the height of a typical computer monitor, the total length of the table would be over 8600 km (over 5300 miles), about 1.35 times the radius of the Earth.  That is such a large table that no amount of interactive viewing would show anything more than the tiniest fraction of it.  Such large tables can be handled with SQL or programmatically, or they can display their contents in drawing or image layers, where most of the data will not be displayed because zoomed out views are averaged down and simplified.  They cannot be productively browsed with interactive table windows.
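
The arithmetic behind those figures can be checked with a short sketch.  The record count comes from the example above.  The row height of 5 mm is our own assumption, chosen because it reproduces the 8600 km figure, and is not a value stated for any particular monitor or for Manifold.

```python
# Back-of-the-envelope check of the table length quoted in the text.
RECORDS = 1_720_000_000      # record count quoted in the text
ROW_HEIGHT_MM = 5.0          # ASSUMPTION: on-screen height of one table row
EARTH_RADIUS_KM = 6371.0     # mean radius of the Earth
KM_PER_MILE = 1.609344


def table_length_km(records: int, row_height_mm: float) -> float:
    """Length of the table in km if every row were laid end to end."""
    return records * row_height_mm / 1_000_000  # millimeters -> kilometers


length_km = table_length_km(RECORDS, ROW_HEIGHT_MM)
length_miles = length_km / KM_PER_MILE
earth_radii = length_km / EARTH_RADIUS_KM

print(f"{length_km:,.0f} km, {length_miles:,.0f} miles, {earth_radii:.2f} Earth radii")
# prints: 8,600 km, 5,344 miles, 1.35 Earth radii
```

A different row height scales the result in proportion, so the exact figure matters less than the order of magnitude: thousands of kilometers of rows.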

 

Table windows usually are not an effective way of getting our heads around such big data.  Instead, they create a false impression of having seen representative data.  Consider the table in the LiDAR example above, whose screens laid end to end would cover 1.35 times the radius of the Earth.  As a thought experiment, suppose that each time you browse a screen full of records you add that screen to a line that extends from your computer monitor, across the room where you sit, and then outside and across the street.  How many thousands of screens would you have to browse to get a few blocks from your computer?  Imagine laying a path with tiles the size of one screen full of records.  Imagine extending that path through the town where you live and out into the country, for miles and kilometers, until you reach the next town.  In all those tens of thousands of screens, you still wouldn't have seen even 1/10th of 1% of the records, but you might think you have seen a representative sample.
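
The thought experiment can be put into rough numbers.  This is only a sketch: the 60 rows per screen and the 0.30 m height of each screen "tile" are our own assumptions, not values taken from the example or from Manifold.

```python
# Rough numbers for the "path of screens" thought experiment.
RECORDS = 1_720_000_000   # record count quoted in the text
ROWS_PER_SCREEN = 60      # ASSUMPTION: rows visible in one table window
SCREEN_HEIGHT_M = 0.30    # ASSUMPTION: height of one screen "tile" in meters


def fraction_seen(screens: int) -> float:
    """Share of all records viewed after browsing this many distinct screens."""
    return min(1.0, screens * ROWS_PER_SCREEN / RECORDS)


def path_length_km(screens: int) -> float:
    """Length of the path made by laying the browsed screens end to end."""
    return screens * SCREEN_HEIGHT_M / 1000.0


for screens in (1_000, 10_000, 25_000):
    print(f"{screens:>6,} screens: path {path_length_km(screens):5.1f} km, "
          f"{fraction_seen(screens):.4%} of the records seen")
```

Under these assumptions, 25,000 screens make a path about 7.5 km long and still cover less than 1/10th of 1% of the records.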

 

The reality is that a scroll of pages twice as long as the North American continent is wide is such an unthinkably large amount of text that humans cannot visualize how immensely larger it is than what makes sense to browse.  That leads to the false impression that browsing such large tables interactively provides some genuine insight, when it does the opposite: it gives impressions of what is in the data that are not genuinely representative.  The way to get correct impressions of what is in the data, to get our heads around what is really there, is to use tools like SQL to write sensible queries and to perform insightful analyses that, like magic, slice and dice their way through millions of records to find or to manipulate just those records we want.
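
To illustrate the difference between browsing and querying, here is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in.  It is not Manifold SQL, and the points table, its id and z fields and the synthetic values are all hypothetical, invented for the illustration.  The first screen of records looks perfectly ordinary, while two short queries summarize every record and find the handful of bad values hidden among them.

```python
import sqlite3

# Stand-in database: an in-memory SQLite table, NOT a Manifold table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE points (id INTEGER PRIMARY KEY, z REAL)")

# Synthetic data (hypothetical): elevations cycle through 0..99, and
# every 10,000th record carries a bad value of -9999.
rows = ((i, -9999.0 if i % 10_000 == 0 else float(i % 100))
        for i in range(1, 100_001))
conn.executemany("INSERT INTO points (id, z) VALUES (?, ?)", rows)

# "Browsing": the first screen of 50 records shows nothing unusual.
first_screen = conn.execute(
    "SELECT id, z FROM points ORDER BY id LIMIT 50").fetchall()

# "Querying": one statement summarizes every record in the table.
total, z_min, z_max = conn.execute(
    "SELECT COUNT(*), MIN(z), MAX(z) FROM points").fetchone()

# Another statement finds exactly the records we care about.
bad_ids = [row[0] for row in conn.execute(
    "SELECT id FROM points WHERE z < 0 ORDER BY id")]

print("lowest z on first screen:", min(z for _, z in first_screen))
print("records:", total, "min z:", z_min, "max z:", z_max)
print("bad records:", len(bad_ids), "first at id", bad_ids[0])
```

The same pattern, aggregate first and then select the records of interest, is what lets a query engine get through millions of records that no one could browse.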

 

Manually sifting through millions of records a screen full at a time is no way to find a needle in a haystack, and that's not what table windows are for in big data.  Instead, table windows are just a way to browse very small glimpses of a big data set.  They are convenient for editing records found by other means in the context of the records around them, for looking at a few hundred records here or there to see if some command had wildly unintended effects, and for other such specific, usually limited, purposes.

 

Nonetheless, once we have a tool like Manifold on hand, even if we procured it for our big data projects, we might also use it casually, as a personal information manager or for data sets involving just a few hundred or a few thousand records.  In the same way, many IT professionals who use Oracle for their enterprise might also use Oracle to keep track of a hobby collection like stamps or coins or wines.  In such cases the table window will be very handy for browsing tables, editing records and so on.

 

For a discussion of Manifold facilities that allow working with bigger tables while retaining the casual convenience of interactive browsing of table windows, see the Big Tables and Table Fill Strategies sections of the Tables topic.