title | author | date | output |
---|---|---|---|
glimpse_labels() -- glimpse() with variable labels! | Trey Spiller | February 22, 2017 | html_document |
Most data analysts use non-R statistics software packages such as Stata, SPSS, and SAS.
All of those packages allow one to specify certain pieces of metadata about the contents of a dataset. One type is "variable labels," which are brief descriptions of what a variable is or means. For data from a survey questionnaire, the variable label might include the text of the question from which the variable came. In such situations, having the variable labels attached to the data can be quite useful. Some packages include commands that display the data contents with the labels (e.g., Stata's describe command).
Fortunately, the tidyverse package haven imports datasets created by the three stats packages listed above. An imported dataset is of the tbl_df class, and each variable has a label attribute that contains its variable label.
The function glimpse_labels() is a first pass at including variable labels in the printout generated by tibble::glimpse(). This gist shows how it works and points out some problems that would need to be solved for it to be fully functional.
Before we get going, let's create some data with variable labels to work with. We create a small tbl_df and add a label to the variables by assigning text to their label attribute. We're just adding label text for the x variable right now; the other variables get an empty string.
Note that the code that assigns the label text simultaneously creates the label attribute.
# create some labelled data
dat1 <- tibble::tibble(x = 1:20, y = 21:40, z = letters[1:20], a = 41:60)
attributes(dat1[["x"]])$label <- "this is the x label"
attributes(dat1[["y"]])$label <- ""
attributes(dat1[["z"]])$label <- ""
attributes(dat1[["a"]])$label <- ""
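As a small sketch of that point, the same kind of assignment can be written with base R's attr(). The vectors v and w are hypothetical names used only for illustration; they are not part of the example data above.

```r
# Hypothetical vector, used only for illustration
v <- 1:5

# Before assignment there is no label attribute
is.null(attr(v, "label", exact = TRUE))   # TRUE

# Assigning the text creates the attribute in the same step
attr(v, "label") <- "this is the v label"
attr(v, "label", exact = TRUE)            # "this is the v label"

# An empty-string label and a missing label attribute are different things
w <- 6:10
identical(attr(w, "label", exact = TRUE), "")   # FALSE: w has no label at all
```

Using exact = TRUE avoids attr()'s partial matching, which could otherwise pick up a differently named attribute such as "labels".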
First, let's see what the regular glimpse() function shows:
tibble::glimpse(dat1)
## Observations: 20
## Variables: 4
## $ x <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1...
## $ y <int> 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, ...
## $ z <chr> "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", ...
## $ a <int> 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, ...
There's some useful stuff here. As expected, however, the existence of the label attribute is not reflected in the printout.
Currently, in order to view the variable labels for this tbl_df, we'd do something like the following, which gives us a named character vector of the variable labels.
# Ugly character vector output
vapply(dat1, function(x) attributes(x)[["label"]], character(1))
## x y z
## "this is the x label" "" ""
## a
## ""
It's clear how much less useful this is than what we get from tibble::glimpse().
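One caveat with the vapply() call above: if any variable has no label attribute at all, the anonymous function returns NULL and vapply() stops with an error. The following is a minimal sketch of a more forgiving helper, assuming it is acceptable to treat a missing label as an empty string. The name get_labels() and the example_df data are hypothetical and not part of spillr.

```r
# Hypothetical helper (not part of spillr): collect variable labels into a
# data frame, using "" for variables that have no label attribute.
get_labels <- function(df) {
  labs <- vapply(
    df,
    function(col) {
      lab <- attr(col, "label", exact = TRUE)
      if (is.null(lab)) "" else as.character(lab)[1]
    },
    character(1)
  )
  data.frame(variable = names(df), label = unname(labs),
             stringsAsFactors = FALSE)
}

# Small example data; only x carries a label attribute
example_df <- data.frame(x = 1:3, y = c("a", "b", "c"),
                         stringsAsFactors = FALSE)
attr(example_df[["x"]], "label") <- "this is the x label"

get_labels(example_df)
```

Returning one row per variable is easier to scan than a named character vector, though it still lacks the type and data preview that glimpse() provides.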
The glimpse_labels() function is currently in the R package spillr, available on GitHub. You can install it from GitHub as below, or copy the function code directly from here.
If you want to install the package, make sure you have devtools and run:
library(devtools)
install_github("treysp/spillr")
Now, let's see the printout for glimpse_labels():
library(spillr)
glimpse_labels(dat1)
## Observations: 20
## Variables: 4
## $ x this is the x label <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1...
## $ y                     <int> 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, ...
## $ z                     <chr> "a", "b", "c", "d", "e", "f", "g", "h", ...
## $ a                     <int> 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, ...
The label text is now displayed between the variable name x and its class <int>. The other variables don't have label text, so they have blank space of the same width as the x variable's label. If the other variables didn't have a label attribute at all, the printout would still look like this.
By default, the width of the printout is the width of the console. Depending on the width of your console when you run it and the length of the variable labels, a label may be longer than the console is wide. Let's see what happens when that occurs:
dat2 <- dat1
attributes(dat2[["x"]])$label <- paste0(rep("[x-label]", 8), collapse = " ")
attributes(dat2[["y"]])$label <- paste0(rep("[y-label]", 3), collapse = " ")
attributes(dat2[["z"]])$label <- paste0(rep("[z-label]", 1), collapse = " ")
glimpse_labels(dat2)
## Observations: 20
## Variables: 4
## $ x [x-label] [x-label] [x-label] [x-label] [x-label] [x-label] [x-
##     label] [x-label]              <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10,...
## $ y [y-label] [y-label] [y-label] <int> 21, 22, 23, 24, 25, 26, 27, 28...
## $ z [z-label]                     <chr> "a", "b", "c", "d", "e", "f", ...
## $ a                               <int> 41, 42, 43, 44, 45, 46, 47, 48...
We can see that x's label was wrapped over two lines. Because its last line is shorter than y's label, it is padded so the variable classes are lined up (x's <int> is directly above y's <int>).
When there is a wrapped label, glimpse_labels() identifies the longest label that does NOT wrap, makes the label column that wide, and pads any wrapped labels shorter than that width to make everything line up.
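The padding rule just described can be sketched in base R as follows. This is not the spillr implementation: layout_labels() and its wrap_width argument are hypothetical names, and base strwrap() is swapped in for stringr::str_wrap(). Unlike the output shown in this gist, strwrap() only breaks at whitespace, not at hyphens.

```r
# Sketch of the padding rule described above. NOT the spillr implementation;
# layout_labels() and wrap_width are hypothetical names.
layout_labels <- function(labels, wrap_width = 64) {
  # Wrap each label; strwrap() returns one string per output line
  wrapped <- lapply(labels, function(lab) {
    lines <- strwrap(lab, width = wrap_width)
    if (length(lines) == 0) "" else lines
  })
  did_wrap <- vapply(wrapped, length, integer(1)) > 1

  # Label column width = longest label that did NOT wrap
  col_width <- max(c(0L, nchar(unlist(wrapped[!did_wrap]))))

  # Pad the last line of every label up to the column width;
  # last lines that are already longer are left unchanged
  lapply(wrapped, function(lines) {
    n <- length(lines)
    lines[n] <- formatC(lines[n], width = col_width, flag = "-")
    lines
  })
}

# Labels shaped like those in dat2
labs <- c(
  paste(rep("[x-label]", 8), collapse = " "),
  paste(rep("[y-label]", 3), collapse = " "),
  "[z-label]",
  ""
)
out <- layout_labels(labs, wrap_width = 64)
vapply(out, function(l) nchar(l[length(l)]), integer(1))
```

With these inputs the x label wraps onto two lines, and every label's last line ends up as wide as the unwrapped y label, which is what lets the class column line up.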
However, sometimes the last line of a wrapped label is longer than the label column. Here's what happens then:
dat3 <- dat2
attributes(dat3[["x"]])$label <- paste0(rep("[x label]", 10), collapse = " ")
glimpse_labels(dat3)
## Observations: 20
## Variables: 4
## $ x [x label] [x label] [x label] [x label] [x label] [x label] [x
##     label] [x label] [x label] [x label] <int> 1, 2, 3, 4, 5, 6, 7, 8,...
## $ y [y-label] [y-label] [y-label] <int> 21, 22, 23, 24, 25, 26, 27, 28...
## $ z [z-label]                     <chr> "a", "b", "c", "d", "e", "f", ...
## $ a                               <int> 41, 42, 43, 44, 45, 46, 47, 48...
The x variable's class and data display have now been pushed over to the right such that they don't line up with those for the other variables.
Things start to get messy when there are multiple wrapped labels in close proximity:
dat4 <- dat3
attributes(dat4[["z"]])$label <- paste0(rep("[z-label]", 11), collapse = " ")
attributes(dat4[["a"]])$label <- "[a-label]"
glimpse_labels(dat4)
## Observations: 20
## Variables: 4
## $ x [x label] [x label] [x label] [x label] [x label] [x label] [x
##     label] [x label] [x label] [x label] <int> 1, 2, 3, 4, 5, 6, 7, 8,...
## $ y [y-label] [y-label] [y-label] <int> 21, 22, 23, 24, 25, 26, 27, 28...
## $ z [z-label] [z-label] [z-label] [z-label] [z-label] [z-label] [z-
##     label] [z-label] [z-label] [z-label] [z-label] <chr> "a", "b", "c"...
## $ a [a-label]                     <int> 41, 42, 43, 44, 45, 46, 47, 48...
The variable classes are vertically aligned for variables y and a, but it's very hard to tell.
One potential solution would be to put the variable class and data display on the next row when the last line of a wrapped label is longer than the label column width (so in the example above, the x class and data display would be on a new line before the y variable, and the z class and data display would be on a new line before the a variable).
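That first idea could look roughly like the sketch below. It is only an illustration of the proposed rule, not code from spillr: place_type_and_data() and its arguments are hypothetical, and it handles the label and the text after it, leaving out the "$ name" prefix and continuation indent.

```r
# Hypothetical helper sketching the proposed layout; not part of spillr.
# lines:         wrapped label lines for one variable
# col_width:     width of the label column
# type_and_data: text that follows the label, e.g. "<int> 1, 2, 3..."
place_type_and_data <- function(lines, col_width, type_and_data) {
  n <- length(lines)
  if (nchar(lines[n]) > col_width) {
    # Last label line overflows the column: put the class and data on a
    # new row, with blank space where the label would be
    c(lines, paste0(strrep(" ", col_width), " ", type_and_data))
  } else {
    # Last label line fits: pad it and append the class and data
    lines[n] <- paste0(formatC(lines[n], width = col_width, flag = "-"),
                       " ", type_and_data)
    lines
  }
}

# Label lines shaped like x and y in dat3; the label column is 29 wide
x_lines <- c("[x label] [x label] [x label] [x label] [x label] [x label] [x",
             "label] [x label] [x label] [x label]")
y_lines <- "[y-label] [y-label] [y-label]"

res_x <- place_type_and_data(x_lines, 29, "<int> 1, 2, 3...")
res_y <- place_type_and_data(y_lines, 29, "<int> 21, 22, 23...")
cat(res_x, res_y, sep = "\n")
```

The trade-off is an extra output row for each overflowing label in exchange for a class column that always starts at the same position.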
Another option would be to add spacing lines between variable rows, perhaps depending on whether a variable's label wrapped.
Suggested solutions welcome!
The current implementation will break if stringr::str_wrap() can't wrap the label such that it fits in the console width. The wrapping algorithm is based on text with spaces or hyphens between words, so I think this would only happen if there were words longer than the console width.
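If that guess is right, a simple pre-check could flag problem labels before wrapping. This is a minimal sketch under that assumption; has_unwrappable_word() is a hypothetical name and is not part of spillr, and "word" here means any run of text between spaces or hyphens.

```r
# Hypothetical check, not part of spillr: flag labels that contain a "word"
# (text between spaces or hyphens) longer than the available width.
has_unwrappable_word <- function(labels, width) {
  vapply(labels, function(lab) {
    words <- strsplit(lab, "[ -]+")[[1]]
    length(words) > 0 && max(nchar(words)) > width
  }, logical(1), USE.NAMES = FALSE)
}

has_unwrappable_word(
  c("short words only", "a_very_long_unbroken_token"),
  width = 10
)
# FALSE TRUE
```

A caller could use the result to truncate such labels or fall back to the plain glimpse() layout for those variables.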