Shapefiles Strangely Out of Shape

Shapefiles are a file standard for storing GIS data, introduced many years ago by ESRI for a desktop GIS package. Despite employing a Stone Age level of database technology, they have become one of the most common standards for interchanging GIS information.


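Unfortunately, shapefiles include numerous limitations that make them a poor choice as an interchange format in modern times. These include the painful limits of the .dbf format, the restriction to a single object type per shapefile, and the difficulties of conveying projected information.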


Shapefiles are fine for data interchange of very simple, unprojected data, the original purpose for which they were created.  Using them for data interchange or data storage beyond that causes no end of trouble.


While the painful limits of .dbf, the restriction to a single object type per shapefile and the endless difficulties caused by using shapefiles to convey projected information are old news to GIS practitioners, a new generation of troubles can arise with software like Manifold that can open shapefiles in dynamic read/write mode and edit them on the fly. These troubles arise because the shapefile spec says nothing about dynamic modifications, so different software packages can interpret the same shapefile differently.


Problems may arise due to differing interpretations of how best to harmonize the files that make up a "shapefile," which is not just one file but an ensemble of files whose contents must correspond exactly with each other. For example, each .shp file has an accompanying .dbf file in dBASE DBMS format. The number of objects in the .shp file must coincide exactly with the number of records in the .dbf file.
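The correspondence described above can be checked mechanically. The following is a minimal sketch, not a definitive implementation. It assumes the usual layouts: a 100-byte .shp main header followed by records that each begin with a big-endian record number and content length (in 16-bit words), and a .dbf header that stores its record count as a little-endian integer in bytes 4 through 7. The function names are hypothetical, and the .dbf count includes records flagged as deleted.

```python
import struct

def dbf_record_count(dbf_bytes):
    """Return the record count stored in bytes 4-7 of the .dbf header."""
    return struct.unpack_from("<I", dbf_bytes, 4)[0]

def shp_record_count(shp_bytes):
    """Count the objects in a .shp by walking its record headers."""
    count = 0
    offset = 100  # the .shp main file header occupies the first 100 bytes
    while offset + 8 <= len(shp_bytes):
        # Each record header holds a record number and a content length,
        # both big-endian; the length is measured in 16-bit words.
        _rec_no, content_words = struct.unpack_from(">ii", shp_bytes, offset)
        if content_words < 0:
            raise ValueError("corrupt record header at offset %d" % offset)
        offset += 8 + content_words * 2
        count += 1
    return count

def counts_agree(shp_bytes, dbf_bytes):
    """True when the .shp object count equals the .dbf record count."""
    return shp_record_count(shp_bytes) == dbf_record_count(dbf_bytes)
```

To try this on real data, read each file with open(path, "rb").read() and pass the resulting bytes to counts_agree.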


The official spec is written with the expectation that a .shp file will be written out together with its corresponding .dbf, with both populated with objects and records as required. However, the spec provides no documented way to dynamically delete an object in an existing .shp file that has been opened in read/write mode. The .dbf format provides a way to delete a record by marking it as "deleted" (by changing the one-byte deletion flag at the start of the record), but the shapefile spec does not specify any corresponding way to delete the object associated with that record in the .shp. The shapefile spec thus leaves open some ambiguity in how to represent dynamic deletions of objects.
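To illustrate how small the .dbf side of a deletion is, here is a minimal sketch that flags one record as deleted in an in-memory copy of a .dbf. It assumes the usual dBASE layout: record count in header bytes 4 through 7, header length in bytes 8 and 9, record length in bytes 10 and 11 (all little-endian), and a first byte in each record that is a space for a live record or an asterisk for a deleted one. The name mark_deleted is hypothetical.

```python
import struct

DELETED = 0x2A  # '*' in the first byte marks a record as deleted
ACTIVE = 0x20   # ' ' in the first byte marks a live record

def mark_deleted(dbf, index):
    """Flag record `index` (0-based) as deleted in a bytearray holding a .dbf.

    Only one byte changes; the record's data and the header's record
    count are left untouched.
    """
    n_records, header_len, record_len = struct.unpack_from("<IHH", dbf, 4)
    if not 0 <= index < n_records:
        raise IndexError("record index out of range")
    dbf[header_len + index * record_len] = DELETED
```

Note that nothing in the .shp changes when this is done, which is exactly the ambiguity described above: the object's geometry is still present in the .shp, and it is up to each program to decide what the flag means for that object.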


A software package could use the "nuclear option" of simply writing out a new .shp file along with a new .dbf for all objects but the deleted one, thus guaranteeing lowest common denominator conformity with other programs. But doing that is not really dynamic editing of an existing .shp file: it is, instead, a fake dynamism that simply edits objects in memory and writes out new files to replace the original files in a classically non-dynamic manner. One key reason to want dynamic read/write capability is to avoid the slowness of having to write out an entire, potentially large, file when making simple changes such as the deletion of a few objects.


Many programs, including Manifold, offer dynamic editing and allow deletion of objects by leveraging the primary role of .dbf in the shapefile standard. When an object is deleted, say, in a drawing window, the corresponding record for that object in the .dbf is marked as deleted. When Manifold opens a shapefile and finds an object in the .shp whose corresponding record in the .dbf is marked as deleted, Manifold treats that object as deleted as well. The rationale is that since each object in a .shp should have a corresponding .dbf record, an object whose record has been marked deleted should itself be considered deleted. That's not a bad call in a universe of programs which provide dynamic read/write editing of shapefiles. But it is not the only possible call.


Programs which do not offer dynamic read/write editing can simply always write out new .dbf and .shp files which harmonize exactly.   However, even those programs which always write out harmonized .dbf  and .shp pairs can encounter shapefiles created by other programs where the .dbf and .shp files have implicit or explicit disagreements.   For example, they may open a shapefile set where the .dbf has marked a record as deleted while the .shp file retains an object for that record.


In real life, where GIS users will encounter an entire zoo of shapefile pathologies, a working program must make pragmatic decisions about how best to handle shapefiles where the files involved disagree. Such disagreements can occur as a result of file damage, erroneous program operation or many other reasons. The usual strategy is to try to recover as much data as possible from "broken" shapefiles. Whether a .shp file that contains objects whose associated .dbf records are marked as deleted should be treated as "broken" and a candidate for recovery is a matter of opinion, as is the decision to consider the deletion flag on the record an error rather than an intentional edit.


Programs which put a higher value on recovering data from incorrectly written or damaged shapefiles may choose to assume the deletion flag is an error and that the object should be retained. That can be a logical approach in a universe of programs which are committed to always writing harmonious .dbf and .shp pairs, and where inconsistency between the two files is logically taken as an error. Manifold Release 8 and prior Manifold GIS editions take that approach.


Given recent trends toward dynamic read/write editing of shapefiles, there are now many more programs which implement dynamic editing by utilizing the deletion flag on records in the .dbf file. It is therefore now routine to encounter shapefiles where the .dbf and .shp files are discordant, and where one style of interpretation will show objects that the other will not. For example, suppose we create a data source from a shapefile in Manifold, open it for editing and delete an object. If we then open that shapefile in a program such as Manifold 8, which gives priority to the .shp and ignores deletion flags in the .dbf, the object will still be there.
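The two styles of interpretation can be sketched side by side. This is an illustration under stated assumptions, not a description of how Manifold or any other package is implemented: it assumes the usual dBASE layout (record count, header length and record length stored little-endian in header bytes 4 through 11, and an asterisk in the first byte of a deleted record), and it assumes object i in the .shp pairs with record i in the .dbf. The function names are hypothetical.

```python
import struct

def deletion_flags(dbf):
    """Return one boolean per .dbf record: True if flagged as deleted."""
    n_records, header_len, record_len = struct.unpack_from("<IHH", dbf, 4)
    return [dbf[header_len + i * record_len] == 0x2A  # '*' means deleted
            for i in range(n_records)]

def visible_objects(dbf, honor_deletion_flags):
    """Return the 0-based indices of objects a program would show.

    honor_deletion_flags=True models a program that treats a flagged
    record as a deleted object; False models a program that gives
    priority to the .shp and ignores the flag.
    """
    return [i for i, deleted in enumerate(deletion_flags(dbf))
            if not (deleted and honor_deletion_flags)]
```

Run against the same .dbf bytes, the two settings return different lists whenever any record is flagged, which is the discordance described above.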


Given the lack of explicit guidance in what has now become a very old shapefile spec, both programs will be "right." Resolving such discordances therefore boils down to which approach we prefer if we want to have dynamic read/write editing of shapefiles.