This is a short exploration of the most efficient way to read a complete file (including newlines) into R. Previously I'd used readLines() plus paste(), but that's clearly the least efficient option.
Here are the options:
- Use readLines() and paste():
  read_file1 <- function(path) {
    paste0(paste0(readLines(path), collapse = "\n"), "\n")
  }
- Find out the size of the file and then use readChar():
  read_file2 <- function(path) {
    size <- file.info(path)$size
    readChar(path, size, useBytes = TRUE)
  }
- As above, but using readBin(), then converting to a character vector. Unfortunately you can't read into a character vector directly because reading with what = "character" is limited to 10000 characters:
  read_file3 <- function(path) {
    size <- file.info(path)$size
    rawToChar(readBin(path, "raw", size))
  }
- A safer approach that doesn't rely on a separate call to file.info() to decide how much to read (the size is only used to guess how many chunks to preallocate). This avoids race conditions where the file changes between asking for its size and reading it. (Suggested by @klmr)
  read_file4 <- function(path, chunk_size = 1e4) {
    con <- file(path, "rb", raw = TRUE)
    on.exit(close(con))

    # Guess approximate number of chunks
    n <- file.info(path)$size / chunk_size
    chunks <- vector("list", n)

    i <- 1L
    chunks[[i]] <- readBin(con, "raw", n = chunk_size)
    # A full chunk means there may be more to read; a short one means end of file
    while (length(chunks[[i]]) == chunk_size) {
      i <- i + 1L
      chunks[[i]] <- readBin(con, "raw", n = chunk_size)
    }

    rawToChar(unlist(chunks, use.names = FALSE))
  }
- An alternative would be to use C++. read_file_cpp1 came from @tim_yates, and read_file_cpp2 from @the_belial:
  library(Rcpp)
  sourceCpp("read-file.cpp")
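To make the race condition mentioned for the chunked approach concrete, here is a minimal sketch that simulates a file growing between the size query and the read. The names read_chunked, stale_size, size_based and chunked are hypothetical and only used for this illustration; read_chunked is a simplified variant of read_file4 that skips the preallocation guess, not the version benchmarked here.

```r
# Simplified chunked reader modelled on read_file4; it never asks for the file size.
read_chunked <- function(path, chunk_size = 1e4) {
  con <- file(path, "rb", raw = TRUE)
  on.exit(close(con))
  chunks <- list()
  repeat {
    chunk <- readBin(con, "raw", n = chunk_size)
    chunks[[length(chunks) + 1L]] <- chunk
    # A short (or empty) read means we have reached the end of the file
    if (length(chunk) < chunk_size) break
  }
  rawToChar(unlist(chunks, use.names = FALSE))
}

tmp <- tempfile()
writeBin(charToRaw("first line\n"), tmp)

# Step 1 of the size-based readers: ask for the size...
stale_size <- file.info(tmp)$size

# ...then simulate another process appending before the read happens.
# A binary connection is used so the bytes written are the same on every platform.
con <- file(tmp, "ab")
writeBin(charToRaw("second line\n"), con)
close(con)

size_based <- readChar(tmp, stale_size, useBytes = TRUE)
chunked <- read_chunked(tmp)
unlink(tmp)

print(size_based)
print(chunked)
```

In this simulation the size-based read silently returns only the bytes that existed when the size was taken, while the chunked read picks up the appended line as well.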
We'll compare the results on a file included with R:
path <- file.path(R.home("doc"), "COPYING")
file.info(path)$size / 1024
# [1] 17.6
First we need to check that they all return the same results. (They won't if the file doesn't include a trailing newline, because read_file1() always appends one.)
stopifnot(identical(read_file1(path), read_file2(path)))
stopifnot(identical(read_file1(path), read_file3(path)))
stopifnot(identical(read_file1(path), read_file4(path)))
stopifnot(identical(read_file1(path), read_file_cpp1(path)))
stopifnot(identical(read_file1(path), read_file_cpp2(path)))
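The trailing-newline caveat can be checked with a small temporary file. This is a sketch only: read_lines_paste and read_char are hypothetical stand-ins for read_file1 and read_file2, redefined so the block runs on its own, and warn = FALSE is added to silence the incomplete-final-line warning that readLines() would otherwise give.

```r
# Stand-ins for read_file1 and read_file2 (names are illustrative).
read_lines_paste <- function(path) {
  paste0(paste0(readLines(path, warn = FALSE), collapse = "\n"), "\n")
}
read_char <- function(path) {
  readChar(path, file.info(path)$size, useBytes = TRUE)
}

tmp <- tempfile()
# writeBin() on a raw vector writes exactly these bytes: no trailing newline
writeBin(charToRaw("a\nb"), tmp)

with_paste <- read_lines_paste(tmp)
with_char <- read_char(tmp)
unlink(tmp)

# The readLines() version has added a newline that is not in the file
print(identical(with_paste, with_char))
```

The readLines()-based reader returns "a\nb\n" here while readChar() returns the file's actual contents, "a\nb", so the identical() checks above would fail on such a file.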
The benchmarking results are clear: readChar() is the best base R option, and is about four times faster than readLines() for this file. The safer approach using chunked readBin() reads is about 50% slower than readChar(). The C++ functions are both fast (roughly 2x faster than readChar() and 10x faster than readLines()) and safe.
library(microbenchmark)
microbenchmark(
readLines = read_file1(path),
readChar = read_file2(path),
readBin = read_file3(path),
chunked_read = read_file4(path),
Rcpp = read_file_cpp1(path),
Rcpp2 = read_file_cpp2(path)
)
# Unit: microseconds
#          expr   min    lq median    uq  max neval
#     readLines 608.8 613.3  628.7 654.2  707   100
#      readChar 124.0 130.6  135.1 140.4  261   100
#       readBin 141.2 146.9  150.5 157.1  229   100
#  chunked_read 199.9 205.7  211.3 218.5 1301   100
#          Rcpp  68.4  72.3   75.7  87.9  115   100
#         Rcpp2  53.5  57.3   59.5  61.6 1109   100
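The rough multiples quoted above can be recomputed from the median column. The sketch below simply copies the medians from the output; the variable names are illustrative.

```r
# Median timings in microseconds, copied from the benchmark output above
medians <- c(
  readLines = 628.7,
  readChar = 135.1,
  readBin = 150.5,
  chunked_read = 211.3,
  Rcpp = 75.7,
  Rcpp2 = 59.5
)

# How many times slower (or faster, if below 1) each approach is than readChar()
relative_to_readChar <- round(medians / medians[["readChar"]], 1)

# How many times faster each approach is than readLines() + paste()
speedup_over_readLines <- round(medians[["readLines"]] / medians, 1)

print(relative_to_readChar)
print(speedup_over_readLines)
```

On these medians, readChar() comes out a little under five times faster than readLines(), the chunked reader takes about 1.6 times as long as readChar(), and the faster of the two C++ versions is a bit over ten times faster than readLines().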