For many Tableau dashboards, we generate PDFs of specific tabular layouts, e.g. Idaho's Public Health District #5.
The Case total.pdf itself looks fine:
But when parsing the PDF's digital text layer with PDF Parse for Node, we see:
Total Cases by County
Fields without a number equal zero. Total cases in
Twin Falls county include outbreak in county jail.
County Confirmed Probable
Twin Falls
Cassia
Minidoka
Jerome
Blaine
Gooding
Lincoln
Camas 7
24
83
40
126
94
98
388
32
111
369
773
851
985
1,078
2,970
The column order "snakes" around, with no clear repeated order within the visible columns or rows. There are no clean row entries, like Twin Falls 2,970 388\r\n. We can put some assumptions in place to untangle headers from county names from column data, but this is very fragile compared to a more standard PDF structure order.
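To illustrate how brittle such assumptions are, here is a minimal sketch in Node-style JavaScript. It assumes, and this is only a guess consistent with the one row we know (Twin Falls 2,970 388), that the text layer lists county names top to bottom, then the Probable column bottom to top, then the Confirmed column bottom to top. The function name unsnakeCountyTable is hypothetical, and the text argument stands for whatever string the PDF parser returns.

```javascript
// Sketch only. ASSUMED reading order (not confirmed beyond the Twin Falls row):
//   1. county names, top to bottom
//   2. Probable values, bottom to top
//   3. Confirmed values, bottom to top
function unsnakeCountyTable(text) {
  const lines = text.split(/\r?\n/).map((l) => l.trim()).filter(Boolean);
  const headerIndex = lines.findIndex((l) => /^County\b/.test(l));
  if (headerIndex === -1) throw new Error('header row not found');

  const names = [];
  const numbers = [];
  for (const line of lines.slice(headerIndex + 1)) {
    // Split an optional leading name from an optional trailing number,
    // so "Camas 7" yields both a name and a value.
    const match = /^(.*?)\s*(\d[\d,]*)?$/.exec(line);
    if (match[1]) names.push(match[1]);
    if (match[2]) numbers.push(Number(match[2].replace(/,/g, '')));
  }

  // Blank fields mean zero and vanish from the text layer, which would
  // shift every later value. Refuse to guess when the counts disagree.
  if (numbers.length !== names.length * 2) {
    throw new Error(
      `expected ${names.length * 2} values, found ${numbers.length}`
    );
  }

  const probable = numbers.slice(0, names.length).reverse();
  const confirmed = numbers.slice(names.length).reverse();
  return names.map((county, i) => ({
    county,
    confirmed: confirmed[i],
    probable: probable[i],
  }));
}
```

The length check is the important part: because fields without a number equal zero, a single blank cell changes the number of extracted values, and the sketch throws rather than silently assigning figures to the wrong county.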
Here is a PDF from the Green River Health Department, Kentucky, which also includes a data table.
Parsing the PDF's digital text layer, we get:
GRDHD COVID-19 Case Summary as of 9:00 AM October 20, 2020
County Confirmed
Cases
Recovered
Cases
Current
Hospitalizations
Ever
Hospitalized
Deaths
Daviess 1,785 1,580 13 127 27
Hancock 112 87 1 7 1
Henderson 1,200 873 12 94 24
McLean 150 103 3 15 3
Ohio 584 494 4 40 9
Union 481 395 3 34 6
Webster 282 228 1 21 5
Total 4,594 3,760 37 338 75
With normalized spacing in our PDF-reading code, this is much closer to a semi-reliable extraction. Once we hard-code some (validated) assumptions, we can safely separate and extract figures for our data-collection workflow without asking humans to hand-enter each figure.
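As a rough sketch of what those hard-coded assumptions could look like, the JavaScript below normalizes whitespace, keeps only lines shaped like a name followed by exactly five numbers, and then checks that the county rows add up to the Total row. The function names, the COLUMNS keys, and the five-number row shape are illustrative assumptions based on the output above, not our actual pipeline.

```javascript
// Sketch only. ASSUMED column order, taken from the header in the text layer.
const COLUMNS = [
  'confirmed',
  'recovered',
  'currentHospitalizations',
  'everHospitalized',
  'deaths',
];

function parseCaseSummary(text) {
  // A row is a non-numeric name followed by exactly five numeric fields.
  const rowPattern = /^(\D+?) ((?:[\d,]+ ){4}[\d,]+)$/;
  const rows = [];
  for (const rawLine of text.split(/\r?\n/)) {
    // Normalize spacing: collapse any run of whitespace to one space.
    const line = rawLine.replace(/\s+/g, ' ').trim();
    const match = rowPattern.exec(line);
    if (!match) continue; // title and header fragments fall through here
    const values = match[2]
      .split(' ')
      .map((v) => Number(v.replace(/,/g, '')));
    const row = { county: match[1] };
    COLUMNS.forEach((name, i) => {
      row[name] = values[i];
    });
    rows.push(row);
  }
  return rows;
}

// Validation step: every column of the county rows must sum to the Total row.
function totalsMatch(rows) {
  const total = rows.find((r) => r.county === 'Total');
  if (!total) return false;
  const counties = rows.filter((r) => r.county !== 'Total');
  return COLUMNS.every(
    (name) => counties.reduce((sum, r) => sum + r[name], 0) === total[name]
  );
}
```

Checking the sums against the Total row gives a cheap signal that the layout has not shifted: if the health department adds or reorders a column, totalsMatch is likely to return false, and the figures can be routed to a human for review instead.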