A basic/naive test
- Found gem about
join + update-by-reference
from Left join using data.table - Saw a tweet reply by Michael chirico about using
setkey
during join - Combine them together
"I want to do fast & memory efficient joins but I don't want to read the performance benchmark...". You can skip to Keyfindings & Examples sections below.
# NOTE: Performance and memory efficiency also depend on data/hardware/systems/etc
library(data.table) ; set.seed(1)
x <- data.table(a = 1:1e8, b = sample(12:19, 1e7, TRUE))
y <- data.table(a = sample(x$a, 2e5), c = sample(2:8, 2e5, TRUE))
x_2 <- copy(x) ; y_2 <- copy(y)
x_3 <- copy(x) ; y_3 <- copy(y)
x_4 <- copy(x) ; y_4 <- copy(y)
benchmark <- bench::mark( # TODO validate with `microbenchmark`
time_unit = 'ms', check = FALSE, # Turn off `check` for `setkey`
normal_join = x <- y[x, on = 'a'],
update_by_reference = x_2[y_2, on = 'a', c := c],
setkey_n_update = setkey(x_3, a) [ setkey(y_3, a), on = 'a', c := c ],
# Wrapper for complex cases (ie many cols). See codes below.
wrap_setkey_n_update = setjoin(x_4, y_4, on = 'a')
)
benchmark[, c('expression', 'median', 'mem_alloc')]
# # A tibble: 4 x 3
# expression median mem_alloc
# <bch:expr> <dbl> <bch:byt>
# 1 normal_join 5593. 4.1GB
# 2 update_by_reference 724. 767.6MB
# 3 setkey_n_update 48.5 767.7MB
# 4 wrap_setkey_n_update 56.9 772.2MB
# NOTE: `dplyr::left_join` was also tested and it's the slowest with ~9,000 ms,
# use more memory than both `update_by_reference` and `setkey_n_update`
# but use less memory than normal_join. It consumed about ~2.0GB of memory.
# I didn't include it as I want to focus solely on {data.table}
# Wrapper
setjoin <- function(dt, dt_join, on, cols, drop) {
# TODO (WORK IN PROGRESS ONLY) Get cols
# Focus on the main codes about `join + update-by-reference + setkey` below, not these 'aesthetic' elements
cols <- {
if ( ! deparse(substitute(on)) %like% '\\.\\(' ) {
if (is.character(on)) {
if ( missing(cols) & missing(drop) )
setdiff( names(dt_join), on )
else if ( ! missing(drop) )
setdiff( names(dt_join), c(on, drop) )
else cols
}
}
# TODO Case: complex NSE. Currently, if `on` is complex, `cols` must be specified.
else cols
}
# NSE
expr <- quote({
setkey(dt, on) ; setkey(dt_join, on)
dt[dt_join, on = on, (cols) := mget( paste0('i.', cols) ) ]
# dt[dt_join, on = on, (cols) := dt_join[, ..cols] ]
})
eval( do.call(
substitute,
list(expr, list(on = substitute(on),
cols = substitute(cols)))
))
}
I thought the performance gains were unrealistic and decided to test again using system time. Two rounds were tested, where the first round was the same as above and the second round was tested for two rounds of similar joins to investigate the performance gains of setkey
(ie setkey
is only set for the first time and used for subsequent joins). Wrapper of setjoin
was not tested here, for brevity. It has similar performance as setkey + update
.
# `bench::system_time` ---------------------------
library(data.table) ; set.seed(1)
x <- data.table(a = 1:1e8, b = sample(12:19, 1e7, TRUE))
y <- data.table(a = sample(x$a, 2e5), c = sample(2:8, 2e5, TRUE))
z <- data.table(a = sample(x$a, 2e5), c = sample(2:8, 2e5, TRUE))
x_2 <- copy(x) ; y_2 <- copy(y) ; z_2 <- copy(z)
x_3 <- copy(x) ; y_3 <- copy(y) ; z_3 <- copy(z)
# _ Join once ------------------------------------
normal_join <- bench::system_time({
x <- y[x, on = 'a']
})
update_by_reference <- bench::system_time({
x_2[y_2, on = 'a', c := c]
})
update_n_setkey <- bench::system_time({
setkey(x_3, a) [ setkey(y_3, a), on = 'a', c := c ]
})
join_once <- rbind(normal_join, update_by_reference, update_n_setkey)[, 'real']
# _ Join twice -----------------------------------
# Same chunk as above
library(data.table) ; set.seed(1)
x <- data.table(a = 1:1e8, b = sample(12:19, 1e7, TRUE))
y <- data.table(a = sample(x$a, 2e5), c = sample(2:8, 2e5, TRUE))
z <- data.table(a = sample(x$a, 2e5), c = sample(2:8, 2e5, TRUE))
x_2 <- copy(x) ; y_2 <- copy(y) ; z_2 <- copy(z)
x_3 <- copy(x) ; y_3 <- copy(y) ; z_3 <- copy(z)
# Different chunk
normal_join <- bench::system_time({
x <- y[x, on = 'a']
y <- z[x, on = 'a']
})
update_by_reference <- bench::system_time({
x_2[y_2, on = 'a', c := c]
x_2[z_2, on = 'a', c := c]
})
update_n_setkey <- bench::system_time({
setkey(x_3, a) [ setkey(y_3, a), on = 'a', c := c ]
x_3[ setkey(z_3, a), on = 'a', c := c ] # NOTE -------- No `setkey` for x_3 here --------
})
join_twice <- rbind(normal_join, update_by_reference, update_n_setkey)[, 'real']
# _ Compare --------------------------------------
data <- data.table(n_join = c( rep('join_once', 3), rep('join_twice', 3) ),
type = names(join_once),
seconds = c(join_once, join_twice)) [,
seconds := round(seconds, 1)]
data
# n_join type seconds
# <char> <char> <num>
# 1: join_once normal_join 5.6
# 2: join_once update_by_reference 0.8
# 3: join_once update_n_setkey 0.7
# 4: join_twice normal_join 10.2
# 5: join_twice update_by_reference 1.5
# 6: join_twice update_n_setkey 0.9
library(ggplot2)
ggplot(data, aes(n_join, seconds, fill = type)) +
geom_col(position = 'dodge', width = .5) +
geom_text(aes(label = seconds), position = position_dodge(width = 0.9), vjust = - .25, color = 'grey')
# Personal package was used below for theming
Somehow bench::mark
didn't capture the overhead of setkey
during the initial test, hence the unrealistic of ~100x gains were shown.
setkey + update
andupdate
are ~11 and ~6.5 times faster thannormal join
, respectively- on first join, performance of
setkey + update
is similar toupdate
as overhead ofsetkey
largely offsets its own performance gains - on second and subsequent joins, as
setkey
is not required,setkey + update
is faster thanupdate
by ~1.8 times (or faster thannormal join
by ~11 times)
For performant & memory efficient joins, use either update
or setkey + update
, where the latter is faster at the cost of more codes.
Let's see some pseudo codes, for brevity. The logics are the same.
For one or a few columns
a <- data.table(x = ..., y = ..., z = ..., ...)
b <- data.table(x = ..., y = ..., z = ..., ...)
# `update`
a[b, on = .(x), y := y]
a[b, on = .(x), `:=` (y = y, z = z, ...)]
# `setkey + update`
setkey(a, x) [ setkey(b, x), on = .(x), y := y ]
setkey(a, x) [ setkey(b, x), on = .(x), `:=` (y = y, z = z, ...) ]
For many columns
cols <- c('x', 'y', ...)
# `update`
a[b, on = .(x), (cols) := mget( paste0('i.', cols) )]
# `setkey + update`
setkey(a, x) [ setkey(b, x), on = .(x), (cols) := mget( paste0('i.', cols) ) ]
Wrapper for fast & memory efficient joins...many of them...with similar join-pattern, wrap them like setjoin
above
- with
update
- with or without
setkey
setjoin(a, b, on = ...) # join all columns
setjoin(a, b, on = ..., select = c('columns_to_be_included', ...))
setjoin(a, b, on = ..., drop = c('columns_to_be_excluded', ...))
# With that, you can even use it with `magrittr` pipe
a %>%
setjoin(...) %>%
setjoin(...)
With setkey
, argument on
can be omitted. It can also be included for readability, especially for collaborating with others.
Have a nice day ☺☺☺♥♣♠
The above are not my ideas at all. Needed some fast join for some projects, did some benchmark and just sharing with #RStats community ☺. Please give all credits to the people below.
- The R Foundation
- CRAN
- 75 contributors of data.table team, more than 4,000 commits, since 2008
- Left join using data.table
- Tweet reply by Michael chirico about using
setkey
during join - #RStats
- and more! (too tired typing...you know ☺)
using setkey and repeated runs seems pointless because it may escape sorting evaluation when it detects object is already sortedsorry, just realized it is not repeated evaluation, but you just use extra package for measuring time.