Created
July 12, 2017 15:56
-
-
Save MrFlick/4297321683f9b3d483a4303e5acd586e to your computer and use it in GitHub Desktop.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
--- | |
title: "Dplyr Introduction" | |
author: "Matthew Flickinger" | |
date: "July 12, 2017" | |
output: | |
pdf_document: default | |
html_document: default | |
--- | |
```{r setup, include=FALSE} | |
knitr::opts_chunk$set(echo = TRUE, error=TRUE) | |
``` | |
## Introduction to Dplyr | |
This document gives an overview of many of the features of the dplyr library include in the "tidyverse" of related R pacakges. First we will load the library and a sample dataset. | |
```{r init} | |
#install.packages("tidyverse") | |
library(tidyverse) | |
#install.packages("nycflights13") | |
library(nycflights13) | |
# Show fewer rows by default in this document | |
options(tibble.print_min = 5L, tibble.print_max = 5L) | |
``` | |
We will primarly be using the `flights` data | |
```{r flights} | |
flights | |
``` | |
## Filtering Rows | |
Find all flights from Detroit in June (in 2013) | |
```{r filtering1} | |
# Same as using base R | |
# flights[flights$dest=="DTW" & flights$month==6, ] | |
# subset(flights, dest=="DTW" & month==6) | |
filter(flights, dest=="DTW" & month==6) | |
``` | |
`filter()` expects a data source as the first parameter, and a single expression as the second parameter. Combine multiple criteria with `&` for "and" -- `|` for "or". | |
## Selecting Columns | |
```{r selection1} | |
# List columns | |
select(flights, dep_time, arr_time, carrier) | |
# Exclude columns | |
select(flights, -year, -tailnum) | |
# Select column range (in data.frame order) | |
select(flights, month:dep_delay) | |
# Name starts with | |
select(flights, starts_with("d")) | |
# Name ends with | |
select(flights, ends_with("time")) | |
# Name contains | |
select(flights, contains("arr")) | |
# Name doesn't start with | |
select(flights, -starts_with("d")) | |
# Move column to the beginning | |
select(flights, flight, everything()) | |
``` | |
Look at the `?select` help page for a list of function to help you select multiple columns. | |
## Verb composition with pipes | |
Traditioanlly, we combine functions via nesting, which works but is hard to read | |
```{r pipe1} | |
select(filter(flights, dest=="DTW"), carrier) | |
``` | |
The `%>%` allows us to take an object, and pass it as the first parameter to another function. The above is the same as | |
```{r pipe2} | |
flights %>% | |
filter(dest=="DTW") %>% | |
select(carrier) | |
``` | |
You can unroll any function with this operator | |
```{r pipe3} | |
round(exp(sin(.5)),2) | |
.5 %>% sin() %>% exp %>% round(2) | |
``` | |
## Sorting Data | |
Use `arrange()` to sort data. You just specify the column names you want to sort by, use `desc()` to reverse the sort order for a given column. | |
```{r arrange1} | |
flights %>% arrange(sched_dep_time) | |
flights %>% arrange(month, desc(day)) | |
flights %>% arrange(desc(dep_time-sched_dep_time )) | |
``` | |
## Creating New Variables | |
Use `mutate()` to create columns from existing columns or values | |
```{r mutate1} | |
flights %>% | |
mutate(speed = distance/(air_time/60)) %>% | |
arrange(desc(speed)) %>% | |
select(flight, speed) | |
``` | |
You can create multiple columns by separating them with a comma; you can use any previously created columns as well | |
```{r mutate2} | |
flights %>% | |
mutate( | |
dist_km = distance * 1.61, | |
hours = air_time / 60, | |
kph = dist_km/hours ) %>% | |
select(flight, kph) | |
``` | |
Use `summarize()` to collapse observations (only keeps columns for which you specified a summarization strategy) | |
```{r mutate3} | |
flights %>% | |
filter(!is.na(arr_delay)) %>% | |
summarize(avg_arr_delay = mean(arr_delay)) | |
``` | |
## Grouping Data | |
Perhaps the most powerful feature of `dplyr` is its grouping abilities. You can specify a column (or columns) for which `mutate()` and `summarize()` happen independently for each unique value in that column (or unique combination or values). | |
Using `summarize()` will reduce the total number of rows | |
```{r groupby1} | |
flights %>% | |
filter(!is.na(arr_delay)) %>% | |
group_by(carrier) %>% | |
summarize(avg_arr_delay = mean(arr_delay)) | |
``` | |
Using `mutate()` will keep the same number of rows and won't drop any columns | |
```{r groupby2} | |
flights %>% | |
filter(!is.na(arr_delay)) %>% | |
group_by(carrier) %>% | |
mutate(avg_arr_delay = mean(arr_delay)) %>% | |
select(carrier, arr_delay, avg_arr_delay) | |
``` | |
## Joining data | |
When finding carriers with the largest flight delay, we were left with a carrier code rather than a carrier name; but who exactly is `9E`? | |
```{r join1} | |
flights %>% | |
filter(!is.na(arr_delay)) %>% | |
group_by(carrier) %>% | |
summarize(avg_arr_delay = mean(arr_delay)) | |
``` | |
There is another table that has a lookup from carrier code to carrier name called airlines | |
```{r join2} | |
airlines | |
``` | |
We can use `left_join` to merge in the carrier name | |
```{r join3} | |
flights %>% | |
filter(!is.na(arr_delay)) %>% | |
group_by(carrier) %>% | |
summarize(avg_arr_delay = mean(arr_delay)) %>% | |
left_join(airlines) | |
``` | |
Here we use two sample tables `x` and `y` to demonstrate the other types of joins | |
```{r join4} | |
x <- tribble( | |
~key, ~xval, | |
1, "x1", | |
2, "x2", | |
3, "x3") | |
y <- tribble( | |
~key, ~yval, | |
1, "y1", | |
2, "y2", | |
4, "y3") | |
inner_join(x, y) | |
left_join(x, y) | |
right_join(x, y) | |
full_join(x, y) | |
``` | |
And you can use non-merging joins to keep or drop rows that match keys from another table. Note that no new columns are added, just the rows of the input tables are filtered | |
```{r join5} | |
z <- tribble( | |
~key, ~zval, | |
1, "z1", | |
3, "z2") | |
semi_join(x,z) | |
semi_join(y,z) | |
anti_join(x,z) | |
``` | |
The join commands will join on all matching column names. You can more explicitly control this as well. The planes table has information about the aircraft used during the flight. It also happens to have a column named "year" indicating when the aircraft was built. When joining this data to flights, we only want to join on "tailnum" -- not "tailnum" and "year". | |
```{r join6} | |
flights %>% | |
inner_join(planes) %>% | |
nrow() # wrong, only planes from 2013 are selected | |
flights %>% | |
inner_join(planes, "tailnum") %>% | |
nrow() # right | |
``` | |
## Subsetting functions | |
`distinct()` will return unique combinations of column values and nothing else | |
```{r} | |
flights %>% | |
distinct(tailnum, carrier) | |
``` | |
The `count()` is like `distinct()` except it also returns the number of times each value was observed. It's basically a shortcut for `group_by() %>% summarize()`. For example | |
```{r} | |
flights %>% count(carrier) | |
flights %>% group_by(carrier) %>% summarize(n=n()) | |
``` | |
`sample_n()` will randomly choose a set of rows from your table (different each time) | |
```{r} | |
flights %>% sample_n(3) | |
flights %>% sample_n(3) | |
``` | |
You might also consider `anti_join` and `semi_join` to be subsetting commands rather than joining commands. | |
## _at/_if/_all | |
The `summarize()`, `mutate()` and `group_by()` functions all have `_all()`, `_at()` and `_if()` variants that make it easier to apply the same function or functions to multiple columns. | |
`mutate_at()`, `summarize_at()` and `group_by_at()` allow you to choose columns in the same way you can do with select using the `vars()` helper function. This will take the mean of all columns that end in "_time" | |
```{r} | |
flights %>% | |
summarize_at(vars(ends_with("time")), mean, na.rm=T) | |
``` | |
`mutate_if()`, `summarize_if()` and `group_by_if()` allow you run a function on each column to choose only columns that meet a certain criteria. This can use useful for extracting columns of a certain class. Note you can also apply more than one function to these columns if you use the `funs()` helper function. This example will calculate the mean and variance for all numeric columns. | |
```{r} | |
flights %>% | |
summarize_if(is.numeric, funs(mean, var), na.rm=T) | |
``` | |
The `_all()` versions of these functions will apply the same transformations to call non-grouped columns in the data source. | |
## Other useful functions | |
The `lead()` and `lag()` functions are useful for selecting the next or previous values in a sequence (especially for time series data). | |
```{r} | |
x<-1:5 | |
lead(x) | |
lag(x) | |
``` | |
The `coalesce()` function will return the first non-missing value from the vectors you pass to it. This is useful when you have multiple columns where only one column contains a value and you want to collapse them to a single vector | |
```{r} | |
coalesce(c(NA,2,NA), c(1, NA, NA), 3) | |
``` | |
When using other `dplyr` verbs, the `n()` and `n_distinct()` functions will return the total number of observations or the number of unique observations respectively. In this example we look at the tail number for each plane to see how many total flights it took and also look at the number of distinct flight numbers that plane was a part of | |
```{r} | |
flights %>% | |
group_by(tailnum) %>% | |
summarize(flights=n(), routes=n_distinct(flight)) | |
``` | |
The `recode()` function allows you to swap out certain values in a vector with different values. | |
```{r} | |
recode(letters[1:5], b="boo") | |
``` | |
The `case_when()` function allows more complex transformations than `recode()`. It's a good alternative to a bunch of nested `ifelse()` calls that you might need to use in base R. Each parameter should be a formula with a left-hand side value that evaulates to TRUE or FALSE and a right-hand side to return when that boolean value is TRUE. Only the value for the first TRUE is returned. | |
Here's a classic example of the "fizz buzz" problem where you are supposed to return the numbers 1-50 but replace all those values divisible by 5 with "fizz" and the values divisible by 7 with "buzz" and those divisible by both 5 and 7 by "fizz buzz" | |
```{r} | |
x <- 1:50 | |
case_when( | |
x %% 35 == 0 ~ "fizz buzz", | |
x %% 5 == 0 ~ "fizz", | |
x %% 7 == 0 ~ "buzz", | |
TRUE ~ as.character(x) | |
) | |
``` | |
## Combining data frames | |
The `bind_rows()` and `bind_columns()` functions are alternatives to the base functions `rbind()` and `cbind()` that are list-friendly. Many times you end up with data.frames in a list that you want to combine in a single data.frame. These functions can help. | |
In this example, we have a list of two tibbles. We can combine them with `bind_rows` | |
```{r} | |
x <- list( | |
data_frame(a=1:2, z=letters[1:2]), | |
data_frame(a=14:20, z=letters[14:20]) | |
) | |
bind_rows(x) | |
bind_rows(x[[1]], x[[2]]) | |
``` | |
## Programming with dplyr | |
Since dplyr uses non-standard evaluation to allow you to soecify data.frame column names without quotes, it can be tricky to write functions that use dply commands. Note that the first attempt at writting a function doesn't work | |
```{r} | |
# Normal command, works fun | |
flights %>% | |
group_by(carrier) %>% | |
summarize(delay=mean(arr_delay, na.rm=T)) | |
# DOESN'T WORK | |
f <- function(x) { | |
flights %>% | |
group_by(x) %>% | |
summarize(delay=mean(arr_delay, na.rm=T)) | |
} | |
f(carrier) | |
``` | |
The latest version of dplyr (0.7) introduced new way to write functions. Previously you would use the standard-evaluation version of functions that ended in an underscore (use `mutate_` rather than `mutate`); but the new version now uses "quosures" to allow you to pass column names. Here are two examples of functions that will work | |
```{r} | |
f <- function(x) { | |
flights %>% group_by(!!x) %>% | |
summarize(delay = mean(arr_delay, na.rm=T)) | |
} | |
f(quo(carrier)) | |
g <- function(x) { | |
x <- enquo(x) | |
flights %>% group_by(!!x) %>% | |
summarize(delay = mean(arr_delay, na.rm=T)) | |
} | |
g(carrier) | |
``` | |
We can either use `quo()` to create our own quosure with the column name, or we can use `enquo()` to turn a function parameter into a quosure. | |
Finally, in base R it's complicated to dynamically set the name of a parameter to a function (the name being the part to the left of the `=` in a call like `f(a=b)`). The latest dplyr functions now also allow you to use the value of a variable as a parameter name if you use `:=` rather than `=`. For example | |
```{r} | |
h <- function(x) { | |
x <- enquo(x) | |
outname <- paste(quo_name(x), "delay", sep="_") | |
flights %>% group_by(!!x) %>% | |
summarize(!!outname := mean(arr_delay, na.rm=T)) | |
} | |
h(carrier) | |
``` |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment