Tableau's generated PDF challenges

For many Tableau dashboards, we generate PDFs of specific tabular layouts, e.g. the one published by Idaho's Public Health District #5.

An example dashboard table

Its PDF generation

The PDF contents

The Case total.pdf itself looks fine:

But when parsing the PDF's digital text layer with PDF Parse for Node, we see:

Total Cases by County
Fields without a number equal zero. Total cases in
Twin Falls county include outbreak in county jail.
County Confirmed Probable
Twin Falls
Cassia
Minidoka
Jerome
Blaine
Gooding
Lincoln
Camas 7
24
83
40
126
94
98
388
32
111
369
773
851
985
1,078
2,970

Significant Parsing Challenges

The column order "snakes" around, following no clear repeated order within the visible columns or rows. There are no clean row entries, like Twin Falls 2,970 388\r\n. We can put some assumptions in place to untangle the headers, the county names, and the column data, but the result is very fragile compared to a more standard PDF structure order.
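To make the fragility concrete, here is a minimal sketch of the kind of assumptions involved. It is not our production parser: the function name `unsnake` is hypothetical, and the ordering rules are inferred from this one sample (county names top to bottom, then values bottom to top with the Probable column first). Only the Twin Falls row, 2,970 confirmed and 388 probable, is confirmed by the rendered table; the rest of the mapping is an assumption.

```javascript
// Hypothetical helper: untangle the "snaked" text layer shown above.
// Assumptions, inferred from this single sample:
//   1. county names come first, top to bottom;
//   2. values follow bottom to top, Probable column before Confirmed;
//   3. no cell is blank (a blank cell would shift every later value).
function unsnake(lines) {
  const headerIndex = lines.findIndex((l) =>
    /^County\s+Confirmed\s+Probable$/.test(l.trim())
  );
  if (headerIndex === -1) throw new Error("header row not found");

  const counties = [];
  const values = [];
  for (const raw of lines.slice(headerIndex + 1)) {
    const line = raw.trim();
    if (line === "") continue;
    // A name may be fused with the first value, as in "Camas 7".
    const match = line.match(/^(.*?)\s*([\d,]+)$/);
    if (!match) {
      counties.push(line);
      continue;
    }
    if (match[1] !== "") counties.push(match[1]);
    values.push(Number(match[2].replace(/,/g, "")));
  }

  // Two columns of values are expected for every county name.
  const n = counties.length;
  if (values.length !== 2 * n) {
    throw new Error(`expected ${2 * n} values, found ${values.length}`);
  }
  const probable = values.slice(0, n).reverse();
  const confirmed = values.slice(n).reverse();
  return counties.map((county, i) => ({
    county,
    confirmed: confirmed[i],
    probable: probable[i],
  }));
}

// The text layer from above, one array entry per extracted line.
const idahoLines = [
  "Total Cases by County",
  "Fields without a number equal zero. Total cases in",
  "Twin Falls county include outbreak in county jail.",
  "County Confirmed Probable",
  "Twin Falls", "Cassia", "Minidoka", "Jerome",
  "Blaine", "Gooding", "Lincoln", "Camas 7",
  "24", "83", "40", "126", "94", "98", "388",
  "32", "111", "369", "773", "851", "985", "1,078", "2,970",
];

const idahoRows = unsnake(idahoLines);
console.log(idahoRows[0]);
// { county: 'Twin Falls', confirmed: 2970, probable: 388 }
```

The length check is the only guard here. Because the dashboard notes that "fields without a number equal zero", a single blank cell would leave fewer values than expected, and the sketch can only refuse to parse; it has no means of telling which county the missing value belongs to.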

More Typical PDF contents

Here is a PDF from the Green River Health Department, Kentucky, which also includes a data table.

The PDF contents

Parsing the PDF's digital text layer, we get:

GRDHD COVID-19 Case Summary as of 9:00 AM October 20, 2020
County Confirmed
Cases
Recovered
Cases
Current
Hospitalizations
Ever
Hospitalized
Deaths
Daviess 1,785 1,580 13 127 27
Hancock 112 87 1 7 1
Henderson 1,200 873 12 94 24
McLean 150 103 3 15 3
Ohio 584 494 4 40 9
Union 481 395 3 34 6
Webster 282 228 1 21 5
Total 4,594 3,760 37 338 75

Fewer parsing challenges

With normalized spacing in our PDF-reading code, this is much closer to a semi-reliable extraction. Once we hard-code some (validated) assumptions, we can safely separate and extract figures for our data-collection workflow without asking humans to hand-enter each figure.
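As a rough illustration of that approach, the sketch below normalizes spacing, keeps only lines shaped like a name followed by five numbers, and then checks the county rows against the Total row. The function names, the column keys, and the totals check are assumptions made for this example; they are one possible way to validate the extraction, not a description of our actual workflow code.

```javascript
// Hypothetical column keys, in the order the figures appear on each row.
const COLUMNS = [
  "confirmed",
  "recovered",
  "currentHospitalizations",
  "everHospitalized",
  "deaths",
];

// Keep only lines that look like "<name> n n n n n".
function parseRows(text) {
  const rowPattern = /^(\D+?)\s+((?:[\d,]+\s+){4}[\d,]+)$/;
  const rows = [];
  for (const raw of text.split(/\r?\n/)) {
    const line = raw.replace(/\s+/g, " ").trim(); // normalize spacing
    const match = line.match(rowPattern);
    if (!match) continue; // title and wrapped header lines
    const numbers = match[2]
      .split(" ")
      .map((n) => Number(n.replace(/,/g, "")));
    const row = { county: match[1] };
    COLUMNS.forEach((key, i) => {
      row[key] = numbers[i];
    });
    rows.push(row);
  }
  return rows;
}

// Validate the extraction: every column must sum to the Total row.
function checkTotals(rows) {
  const total = rows.find((r) => r.county === "Total");
  if (!total) throw new Error("no Total row found");
  const counties = rows.filter((r) => r.county !== "Total");
  for (const key of COLUMNS) {
    const sum = counties.reduce((acc, r) => acc + r[key], 0);
    if (sum !== total[key]) {
      throw new Error(`${key}: counties sum to ${sum}, Total row says ${total[key]}`);
    }
  }
  return counties;
}

const greenRiverText = `GRDHD COVID-19 Case Summary as of 9:00 AM October 20, 2020
County Confirmed
Cases
Recovered
Cases
Current
Hospitalizations
Ever
Hospitalized
Deaths
Daviess 1,785 1,580 13 127 27
Hancock 112 87 1 7 1
Henderson 1,200 873 12 94 24
McLean 150 103 3 15 3
Ohio 584 494 4 40 9
Union 481 395 3 34 6
Webster 282 228 1 21 5
Total 4,594 3,760 37 338 75`;

const greenRiverCounties = checkTotals(parseRows(greenRiverText));
console.log(greenRiverCounties.length); // 7
console.log(greenRiverCounties[0].confirmed); // 1785
```

Because each row arrives intact, the row pattern alone carries most of the structure, and the Total row gives an independent check that no figure was dropped or shifted.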
