This post has been floating around in email conversations with colleagues for the past month. Last week, however, a semi-related Twitter discussion prompted me to finally document my thoughts in public. If you’re interested in that conversation, you should also read Andrew Hoppin’s excellent post on the CivicCommons blog. I’m a big fan of what CivicCommons is doing, but I think the situation with ESRI demonstrates some of the challenges in evaluating civic software options, as well as opportunities for further refinement of the CC Marketplace.
I recently made a frustrating–and failed–attempt to access a government data set encoded in ESRI’s proprietary and closed File Geodatabase format.
This led to an email exchange with Jack Dangermond at ESRI, in which I asked about the gap between ESRI’s stated desire to help open up government data and my experience, where ESRI’s commitment to closed formats prevented me from accessing data, at least without first buying a copy of ArcGIS.
Jack wrote a lengthy and thoughtful response that can be summarized as follows:
1) ESRI’s closed formats shouldn’t prevent me from accessing the data–there are many options for extracting what I need without buying their software. While true in concept, this turned out not to be the case given the particularities of the data I needed. There was absolutely no way for me to read that data without buying ESRI software.
Sure, the government could intervene and reformat their data to better serve me, but they were unable or unwilling to do so. And from their perspective, they had already met the spirit of the Open Government Directive by releasing a bulk download of what I needed. Unfortunately for me, that download was in ArcGIS’ default format, a format that can only be opened with ESRI tools.
2) ESRI isn’t interested in repeating the experience with Shapefile where any developer could read or write to the files. In Jack’s words, this created “lots of work for users ‘fixing’ these corrupt files because some software developers did not implement the specification correctly.”
That is pretty different from my experience where the open Shapefile spec ushered in an era of collaboration and led to a proliferation of new tools. Never once in my fifteen years in the industry have I encountered a corrupted file caused by a bad tool.
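To make that concrete, here is a minimal Python sketch of what an open spec allows: reading the fixed 100-byte header of a .shp file with nothing but the standard library. The byte layout follows the published Shapefile specification. The function name `read_shapefile_header` and the dictionary keys are illustrative choices, not part of any library, and a real tool would go on to parse the records that follow the header.

```python
import struct


def read_shapefile_header(data: bytes) -> dict:
    """Parse the fixed 100-byte header found at the start of a .shp file."""
    if len(data) < 100:
        raise ValueError("a .shp header is 100 bytes long")

    # The file code and file length are big-endian 32-bit integers.
    (file_code,) = struct.unpack(">i", data[0:4])
    (length_words,) = struct.unpack(">i", data[24:28])
    if file_code != 9994:
        raise ValueError("not a Shapefile: unexpected file code")

    # The version, shape type and bounding box are little-endian.
    version, shape_type = struct.unpack("<ii", data[28:36])
    xmin, ymin, xmax, ymax = struct.unpack("<4d", data[36:68])

    return {
        "version": version,
        "shape_type": shape_type,          # e.g. 1 = Point, 3 = PolyLine, 5 = Polygon
        "length_bytes": length_words * 2,  # the length is stored in 16-bit words
        "bbox": (xmin, ymin, xmax, ymax),
    }
```

Calling it on the first 100 bytes of a file, for example `read_shapefile_header(open(path, "rb").read(100))` with `path` pointing at any .shp file, is enough to learn what kind of geometry the file holds and where it sits on the map.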
Jack frames the closed format as a technical consideration. However, anyone familiar with geospatial software should also notice a strong business case for this perspective: ESRI doesn’t benefit from others participating in the software marketplace or building these “bad tools.”
Closed formats result in closed data. Despite the availability of interchange formats, no one is using them, and it’s not clear that they are a complete solution. In fact, I couldn’t find any open source tools that read or write ESRI’s XML-based interchange format, nor could I find anything on the web that had been released in that format. ESRI may think they’ve checked the data portability box, but no one else seems to agree. The same is true of proprietary drivers for file access: ESRI showed up several years late with a binary driver solution that is still incomplete (it couldn’t open my data).
There’s simply no technical excuse for closed formats. If ESRI doesn’t want to maintain its own format spec it should use someone else’s: SQLite or many other formats like it could offer a great foundation for an open file-based spatial data store.
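As a rough illustration of the idea, and not a proposal for an actual format, here is a minimal Python sketch of a file-based spatial store built on SQLite. Everything about the schema is an assumption made for the example: the `streams` table, the choice to store geometry as WKT text, and the plain bounding-box columns used for filtering. A real design would also need a proper spatial index, coordinate system metadata, and a compact geometry encoding.

```python
import sqlite3


def create_store(path=":memory:"):
    """Open (or create) a single-file store with a hypothetical 'streams' table."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS streams (
               id INTEGER PRIMARY KEY,
               name TEXT,
               wkt TEXT NOT NULL,
               xmin REAL, ymin REAL, xmax REAL, ymax REAL)"""
    )
    return conn


def add_line(conn, name, coords):
    """Store a line as WKT text alongside its bounding box."""
    xs = [x for x, _ in coords]
    ys = [y for _, y in coords]
    wkt = "LINESTRING (" + ", ".join(f"{x} {y}" for x, y in coords) + ")"
    conn.execute(
        "INSERT INTO streams (name, wkt, xmin, ymin, xmax, ymax) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        (name, wkt, min(xs), min(ys), max(xs), max(ys)),
    )


def query_bbox(conn, xmin, ymin, xmax, ymax):
    """Return (name, wkt) for features whose bounding box overlaps the query box."""
    rows = conn.execute(
        "SELECT name, wkt FROM streams "
        "WHERE xmax >= ? AND xmin <= ? AND ymax >= ? AND ymin <= ? "
        "ORDER BY id",
        (xmin, xmax, ymin, ymax),
    )
    return rows.fetchall()
```

The point of the sketch is that anything written this way can be opened by any tool that speaks SQLite, with no driver or license required.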
3) ESRI is a private company and has every right to create proprietary formats.
But I have a problem when ESRI also claims to be an enabler of open government, facilitating data sharing and collaboration. I have an even bigger problem when it takes taxpayer money to manage public data, and then requires taxpayers to buy ESRI software in order to access that data.
One of the fundamental tenets of the OGD is providing data in bulk, platform-independent formats. Instead of supporting this outcome, Jack explained his preferred approach: APIs exposed via ESRI software that allow users to access what they need from data stored within closed ESRI file formats.
This is a great strategy if you’re in the business of selling GIS software, but a bad idea if you’re interested in sharing data. APIs aren’t the same as bulk access, and bulk access is critical for many GIS applications. And APIs with EULAs != open data.
ESRI has chosen to vertically integrate their GIS platform and has no interest in allowing others in. This is a fundamental incompatibility with the current conception of open government. They have every right to follow this course but there’s no reason why those of us that have invested years specifying the terms for dissemination of public data should celebrate their participation until they follow the guidelines we’ve set.
It doesn’t have to be this way. Their software is incredibly useful and in many cases has no counterpart, open or closed. ESRI could become a valuable partner in opening government data–if they embrace open formats and let others in.
For those actually interested in the gory details, here’s the TL;DR’er version of my recent encounter:
Several weeks back I needed to make a map for a big chunk of the Pacific Northwest. I leveraged all kinds of useful open data (OSM for streets, Lidar from local governments, etc.) but above all else I needed really good stream and river data. Lucky for me the USGS maintains a detailed data set that maps every stream and pond in the entire U.S., even the tiny intermittent ones!
I surfed over to the “Data” page, and a few minutes and several hundred megabytes later had the data I needed on my hard drive. Then I unzipped the files and encountered something unexpected: a constellation of files with new and strange file extensions.
I’ve been working with GIS tools and data in a professional capacity for going on fifteen years and I consider myself pretty savvy. However, over the last decade all of my work has come to depend on open source GIS tools; my ArcGIS license and the parallel port dongle it required stayed behind when I left university. So while I can tell you all about spatial indexes and encoding formats for transmitting geometric primitives, I missed the memo on ESRI’s new File Geodatabase format, the format now being used to manage and disseminate data at the USGS.
Never fear, there are all kinds of cool open source tools designed just for these situations. I opened up my terminal and started trying out various translation utilities, thinking I’d just convert this data into some more familiar and readable (by me) format.
Several hours later I was starting to get frustrated.
All my normal go-to tools were letting me down and my Google searches were coming back empty. Clearly this new Geodatabase format mattered: it is ESRI’s official replacement for the now deprecated Shapefile format. I learned about the new features it offered (useful things I take for granted now that I primarily use spatial databases like PostGIS). I was also getting a sneaking suspicion I wouldn’t be able to open the file.
ESRI’s new file format, while exceedingly useful, is also intentionally closed. And not just in the “we don’t support others opening the file” sense; this format is closed in that no one else knows how to open it, period.
Given the complexity and lack of a roadmap from ESRI, I learned that reverse engineering was a futile undertaking. But after calling in a lifeline to a colleague I discovered that ESRI had recently developed a set of binary drivers that offer API level access. These drivers were several years late to the game and were still incomplete, but at least they might let me get access to the data I needed.
Again, no dice.
After creating an ESRI.com account, signing an EULA, installing the drivers, and recompiling GDAL with the ESRI FileDB plugin turned on, I get back a cryptic error message saying that my particular files are in an unsupported format. I then discover that the USGS maintains its data in version 9.2 of the file format, while ESRI’s drivers only support version 10. Further investigation reveals I can forward port the data, but only if I own a copy of ArcGIS. Mine for a mere $1,500.
Having invested a full day already, I regroup and attempt to use the USGS Data Portal (an ESRI product, of course). It turns out that the portal offers a Shapefile export for the layers I need. Shapefiles offer a lower-fidelity version of the data, but fortunately one that is still useful for my application. After a few failed attempts at navigating its clunky interface, I manage to select most of Oregon and Washington, the area I need. Submit. Nothing.
The next day I received a polite email from a USGS staffer letting me know my request for data was too large and could not be processed by ESRI’s data portal. If I wanted this data I’d need to submit dozens of smaller requests and recombine the data once downloaded. After a bit of testing I discovered that this process would take several days to complete, given the sluggishness of the portal’s processing pipeline.
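The splitting step the staffer described can be sketched in a few lines of Python. This assumes each request is defined by a simple longitude/latitude bounding box, which is an assumption for illustration; the function name `split_bbox`, the grid size, and the coordinates in the usage note are all hypothetical, and the portal’s actual request interface is not modeled here.

```python
def split_bbox(xmin, ymin, xmax, ymax, cols, rows):
    """Split one large bounding box into a cols x rows grid of smaller boxes."""
    if cols < 1 or rows < 1:
        raise ValueError("cols and rows must be at least 1")
    width = (xmax - xmin) / cols
    height = (ymax - ymin) / rows
    tiles = []
    for r in range(rows):
        for c in range(cols):
            x0 = xmin + c * width
            y0 = ymin + r * height
            # Snap the last column and row to the outer edge so that
            # floating-point drift cannot leave a gap at the boundary.
            x1 = xmax if c == cols - 1 else xmin + (c + 1) * width
            y1 = ymax if r == rows - 1 else ymin + (r + 1) * height
            tiles.append((x0, y0, x1, y1))
    return tiles
```

For a rough box around Oregon and Washington, something like `split_bbox(-125.0, 42.0, -116.0, 49.0, 6, 6)` yields 36 smaller boxes to request one at a time. Recombining is the harder half: streams that cross a tile edge come back in more than one download and have to be de-duplicated or stitched together afterwards.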
I emailed back to the USGS staffer asking if I could download the data I needed in a bulk (but readable by me) format. I could not. I asked if others had encountered similar problems. They had. I asked if there was a solution in the works. There was not.
However, they kindly offered to mail me a series of DVDs containing an older Shapefile export covering the entire country. This was not ideal but under the circumstances it was as good as it would get.
Two weeks and a hundred gigabytes of data later I’d managed to filter down an outdated and incomplete copy of the data I downloaded in those first naive moments of this journey.
At least I had my map.
Full disclosure: I work for an organization that, among other things, builds open source GIS software. The views expressed here are my own.