When Election Results Aren’t Yet Data
As our volunteers have gathered details on the scope and availability of election results from across the country, one thing has become clear: not all election results are created equal.
Some states provide results data in multiple formats and variations; all you have to do is choose and click. Florida has a download for every election. In Pennsylvania, we found that for $7 the state provides a CD with 12 years of consistently formatted results. Idaho has multiple files covering different reporting levels.
But that's not every state. Some, such as West Virginia, have made the transition from producing PDFs to producing CSVs. Others, like Mississippi, basically provide a picture of the results. For states where data isn't the norm, OpenElections needs to fill the gap, turning results into data.
This isn't a glamorous job, but we'd like to tell you a little about how we go about it. For states that provide electronically generated PDFs, as West Virginia does for elections from 2000 to 2006, there are several good options for parsing PDF tables into data. The command-line utility pdftotext from the xPDF package works well in many cases, while the excellent Tabula (a product of ProPublica, La Nación and the Knight-Mozilla OpenNews project) can do wonders with more complex files. For West Virginia, xPDF was all we needed (along with some search and replace in a text editor) to make CSV files from the original PDFs. Here's an example command that generates a fixed-width text file while preserving the row and column layout of the original file:
$ pdftotext -layout "2000 House of Rep Pri.pdf"
We used TextWrangler, a free text editor for the Mac, to convert the spaces between columns into tabs, and from there it was trivial to copy results into CSV files. In the process of converting these results, we found several apparent errors (typos or likely copy and paste mistakes) and notified the Secretary of State's office. To its credit, the office responded quickly and is in the process of fixing the original files (and we'll update our CSVs when they do).
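If you'd rather script the spaces-to-columns step than do it in a text editor, here is a minimal sketch in Python. It is not what we ran for West Virginia. It assumes that columns in the pdftotext -layout output are separated by two or more spaces and that no cell is empty; the function names and the sample rows are made up for illustration.

```python
import csv
import io
import re

# Assumption: a run of two or more spaces marks a column boundary.
# Single spaces (as in "Candidate A") stay inside a cell.
COLUMN_GAP = re.compile(r" {2,}")


def layout_text_to_rows(text):
    """Split fixed-width text (e.g. from `pdftotext -layout`) into rows of cells."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines between table rows
        rows.append(COLUMN_GAP.split(line))
    return rows


def rows_to_csv(rows):
    """Serialize rows as CSV text; the csv module quotes cells containing commas."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


# Hypothetical sample, not real results.
sample = (
    "County            Candidate A    Candidate B\n"
    "\n"
    "Example County          1,204            987\n"
)

if __name__ == "__main__":
    print(rows_to_csv(layout_text_to_rows(sample)), end="")
```

A blank cell would shift every later value in that row one column to the left, so output from a script like this still needs to be checked against the original PDF.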
In Mississippi, however, there are no programmatic options, or at least no good ones. Data entry is the best way for us to get the precinct-level results contained in county-by-county files like this one. Here's what we're dealing with: a scanned image of a fax.
When it comes to doing data entry, we need to be very specific about what we want and how we want it stored in the CSV file. For our Mississippi files, we've developed a guide to the process that we'll adapt to other states where manual entry is required. Which is where you come in, if you're up for it. If you'd like to try your hand at a Mississippi file (or another state with PDFs), let us know in the Google Group and we can get you set up. Or you can fork the Mississippi results repository on GitHub and send us an email per the instructions in the README file.
We know that data entry is neither fun nor exciting (well, most of the time), but think of this: you'll be part of a project that will provide a great service to journalists, researchers and citizens. And we still have some t-shirts left, too.