@saulpw (last active December 11, 2016)

Through my frustration as a data engineer, I have developed a taste for how I want my data to be organized, to make its consumption as convenient and efficient as possible.

tl;dr:

- one .hdf5 file per dataset, from 1MB to 10GB
- all raw source data in an embedded .zip
- the automated construction process also in the .zip
- example: elections-us-1776-2016.hdf5, a 1GB file with all election data

--

As I'm working to codify this taste, I came across this writeup:

https://github.com/jtleek/datasharing

My own conclusions and preferences are virtually identical. Jeff Leek does a great job of explaining them concisely and clearly.

In fact I only have two differences of opinion:

  1. 'Data locality' should be one of the foundational principles of tidy data sharing.

    In other words, multiple tables can and should be contained within the same file.

    Otherwise there is a large risk of cross-contamination (mistakenly combining related tables from different experiments).

    Also, it makes data sharing incredibly easy: datasets under 50MB can be emailed around, and up to 10GB can be uploaded to and stored in standard cloud drives. It is much less convenient to manage even 5 or 10 separate files.

  2. I recommend different file formats.

    I propose using HDF5 for columnar data sharing at this scale. It is 20 years old and a bit clunky to use if you haven't crafted your 20 lines of helper functions yet, but it is ubiquitous, mature, typed, compact--adequate along every axis.[1]

    In addition, the code book, raw data, and translation scripts should be in the same package as the data and metadata. These can be stored in a .zip file that is embedded within the .hdf5 file, either jammed into the userdata portion (user block) of the .hdf5 (safest) or appended to it (most convenient); see the sketch just after this list. The code book should not be a Word file, but .txt or some readable markup like reST or Markdown.

    This format fulfills the spirit of the desires expressed in a way that nothing else I've seen does. Better formats may exist or be in development, but .h5 is adequate, so tidy data should still be published in .h5 format until a better format is more widely adopted.
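To make the embedding concrete, here is a minimal sketch in Python with h5py. The file names, the dataset path, and the attribute names are my own placeholders, and the two example rows are the ones from footnote [3]; the user-block trick is one way to do the "jammed into the userdata portion" option above.

```python
# Minimal sketch: one tidy table in the .hdf5, with the raw-source .zip
# embedded in the HDF5 user block. Names and values are illustrative.
import h5py
import numpy as np

ZIP_PATH = "raw-sources.zip"                 # hypothetical zip of raw data + scripts
H5_PATH = "elections-us-1776-2016.hdf5"

zip_bytes = open(ZIP_PATH, "rb").read()

# The HDF5 user block must be a power of two and at least 512 bytes.
userblock = 512
while userblock < len(zip_bytes):
    userblock *= 2

# One tidy table as a typed, columnar dataset.
results = np.array(
    [(b"2016-11-07", b"us-wa-43", b"D-Clinton", 145273),
     (b"2016-11-07", b"us-wa-43", b"R-Trump",    71632)],
    dtype=[("date", "S10"), ("place", "S16"), ("candidate", "S24"), ("votes", "<i8")])

with h5py.File(H5_PATH, "w", userblock_size=userblock) as f:
    f.create_dataset("/elections/us/president", data=results, compression="gzip")
    f.attrs["embedded_zip_bytes"] = len(zip_bytes)   # so readers know how much to pull back out
    f.attrs["codebook"] = "see codebook.txt inside the embedded raw-sources.zip"

# Jam the zip into the user block; HDF5 readers skip these leading bytes.
with open(H5_PATH, "r+b") as f:
    f.write(zip_bytes)
```

The user block is just leading bytes that HDF5 readers skip, so the .zip rides along without disturbing any h5 tooling; the only catch is that the block size has to be a power of two, so the zip gets zero-padded (hence recording its exact length as an attribute).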

HDF5 with embedded .zip

The .hdf5 file itself has the tidy data ready for import by major stats packages, with all parsed data and reasonable aggregations in predictably structured tables. The embedded .zip has:

  • all the purely raw source data files[2]
  • a script which can use the raw source data to reproduce an identical .hdf5 file
  • a .tsv file with the sha1 and source url of each of the raw data files
  • all metadata in machine-parseable and human-readable format (.txt)
  • a .log file of the conversion process

Obviously the compilation scripts can be whatever works, but they should be complete and produce the whole package with a single command (recorded in the log file). They should also not have any non-public dependencies, the unfortunate nature of software law permitting.
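As a sketch of what that single command could look like (standard-library Python only; build.py, make_hdf5.py, sources.tsv, build.log, raw-sources.zip, and the example source URL are all assumed names, not a fixed convention):

```python
# One-command build driver sketch: python build.py
import hashlib, logging, subprocess, zipfile
from pathlib import Path

RAW_DIR = Path("raw")                                  # purely raw source files
SOURCES = {"wa-2016-general.csv":                      # hypothetical file -> source url
           "https://example.org/wa-2016-general.csv"}

logging.basicConfig(filename="build.log", level=logging.INFO,
                    format="%(asctime)s %(message)s")

def main():
    logging.info("build started: python build.py")

    # 1. Manifest: sha1 and source url of every raw data file.
    with open("sources.tsv", "w") as tsv:
        for name, url in SOURCES.items():
            digest = hashlib.sha1((RAW_DIR / name).read_bytes()).hexdigest()
            tsv.write(f"{digest}\t{url}\n")

    # 2. Bundle raw data, build scripts, manifest, and log into the .zip.
    with zipfile.ZipFile("raw-sources.zip", "w", zipfile.ZIP_DEFLATED) as z:
        for p in (*RAW_DIR.iterdir(), Path("sources.tsv"),
                  Path("build.py"), Path("make_hdf5.py"), Path("build.log")):
            z.write(p)

    # 3. Parse the raw data into the tidy .hdf5 and embed the .zip
    #    (as in the h5py sketch above); make_hdf5.py is hypothetical.
    subprocess.run(["python", "make_hdf5.py"], check=True)
    logging.info("build finished")

if __name__ == "__main__":
    main()
```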

Example Tidy Dataset

So imagine a file, elections-us-1776-2016.hdf5, that had all actual election data in tidy form. It would probably be organized at the top level by Election/Race, with axes for Region and Date.

I would like a sincere debate on which internal form is most convenient for the most popular analyses[3]. But the end goal would be this giant cube of data, less than 10GB and ideally less than 1GB, that would just be the gold standard of US election data. How to assess the scale? In terms of units: 89,000 local governments now; 240 years, but maybe we only have data for 60 elections in that time; 10 races; 3 candidates/options. So a matrix of roughly 160m election units. We've been generous on every estimate so far, so it's reasonable to round down to 100m election data points. Maybe only 1m-10m once we see how sparsely populated it actually is.

So that's 4 bytes * 100m = 400MB? With metadata, 500MB; with source data included, suppose 1GB.
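For the record, that back-of-envelope as runnable arithmetic (the counts are the rough estimates above, not measured figures):

```python
# Back-of-envelope from the text: places * elections * races * candidates.
cells = 89_000 * 60 * 10 * 3
print(f"{cells:,} cells")                                 # 160,200,000 -> round down to 100m
print(100_000_000 * 4 / 1e6, "MB of 4-byte vote counts")  # 400.0 MB
```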

An estimated 50 x 60 = 3,000 individual pieces need to be done; maybe only 500-1000 (the most recent 20 years) are actually readily available.

That's achievable. A 1GB download that includes the data all tidy and regular and easily sliceable, with all raw source data included. The entire chain from source to tidy .h5 is auditable and reproducible, so any problems can be traced back to the code or the source, and be fixed (and a patch sent to the home office to be incorporated) or worked around as needed.

The source data and build code are checked into github.com/deluxe-data/elections-us/.
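A sketch of what that audit could look like, assuming the layout from the sketches above (zip size recorded as an attribute, sources.tsv and a raw/ directory inside the embedded .zip; the url-basename convention is another assumption):

```python
# Recover the embedded .zip from the .hdf5 and re-check every raw file
# against the sha1 manifest.
import hashlib, io, zipfile
import h5py

H5_PATH = "elections-us-1776-2016.hdf5"

with h5py.File(H5_PATH, "r") as f:
    n = int(f.attrs["embedded_zip_bytes"])    # exact size of the embedded zip

with open(H5_PATH, "rb") as f:
    z = zipfile.ZipFile(io.BytesIO(f.read(n)))

for line in z.read("sources.tsv").decode().splitlines():
    digest, url = line.split("\t")
    name = url.rsplit("/", 1)[-1]             # assumes raw files keep the url basename
    actual = hashlib.sha1(z.read(f"raw/{name}")).hexdigest()
    print(name, "OK" if actual == digest else f"MISMATCH -> recheck {url}")
```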

Roles:

  1. Spiders: find, fetch, and determine what content there is and estimate its value; check in the raw data sources and their notes as a proto-codebook. Probably 50 people (some could do more than one, though).

  2. Parsers: take unparsed data in the repository, add it to the build, and make sure it all works (another 50).

  3. Verifiers: consume the results and make sure they make sense, probably doing some exploration to find natural anomalies (as they happen).

  4. Releaser: releases versions with the appropriate process (1).

  5. Organizer: organizes people to do the above (1).

So dozens, maybe 100 people.

Include a table of contributors, with the estimated number of hours contributed per year.

I'm working on the framework that could be used for #2. Each major data project like this could use the same framework but would need its own roles 1-5 filled.


[1] If anyone has any better candidates than .h5, I'd love to hear them. I don't think there are any without sacrificing a large amount of more than one of: bundling, precision, immediacy, and compatibility--and this last is why I don't think there is anything better than .h5 for the current purpose, or we would all have heard about it, and it would be at least as ubiquitous as .h5 already. If there aren't already mature open-source libraries for at least R, MATLAB, and Python, it's not a contender. It has to be usable out-of-the-box.

I can't imagine another container format besides .zip that is reasonable for the source data.

[2] When this format was in practical use at Jawbone, the compressed .zip was 30% of the total file.

[3] Whether it should be a matrix called /elections/us/president, which would have values like:

    (2016-11-07, us-wa-43, D-Clinton)  145273
    (2016-11-07, us-wa-43, R-Trump)     71632
    (2016-11-07, us-wa-43, G-Stein)      4232
    (2016-11-07, us-wa-43, L-Johnson)    2872

These labels actually map to numeric indexes, explicitly defined in other tables.

Values would of course be continually filled out on every axis, with more added every election year. Summary rows would also be provided for each place subgrouping (/elections/us-wa, /elections/us). There'd also be /elections/us-wa-43/governor, /elections/us-wa-43/wa-senate, /elections/us-wa-43/us-senate, etc.

Alternatively, it could be a compound array grouped at the top level by place and date, e.g. /elections/us-wa-43/2016/us-senate. This has the advantage of being sealable, and is perhaps less prone to error due to one less indirection.
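A sketch of how the numeric-index idea could look with h5py, using a toy file name and the example values above; the /index/... paths are placeholders, not a settled convention:

```python
# The vote counts live in a dense cube; the axis labels live in small side
# tables that map labels to numeric indexes.
import h5py
import numpy as np

dates = np.array([b"2016-11-07"])
places = np.array([b"us-wa-43"])
candidates = np.array([b"D-Clinton", b"R-Trump", b"G-Stein", b"L-Johnson"])
votes = np.array([[[145273, 71632, 4232, 2872]]], dtype=np.int32)  # (date, place, candidate)

with h5py.File("example.h5", "w") as f:
    f["/elections/us/president"] = votes
    f["/index/dates"] = dates
    f["/index/places"] = places
    f["/index/candidates"] = candidates

# Reading back: look up numeric indexes by label, then slice the cube.
with h5py.File("example.h5", "r") as f:
    d = list(f["/index/dates"][:]).index(b"2016-11-07")
    p = list(f["/index/places"][:]).index(b"us-wa-43")
    c = list(f["/index/candidates"][:]).index(b"R-Trump")
    print(f["/elections/us/president"][d, p, c])   # -> 71632
```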
