#r · GitHub

Ubuntu Install
General observations
Basics
Data Types
Subsetting
regular expression
Tidyverse library
- installation
- use/load
- pipe operator
- dplyr
- tidyr
- readr
  - Import
    - RStudio
    - code
  - Export
Descriptive statistics
Statistical model
- linear regression
Plot

Ubuntu Install

sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys E298A3A825C0D65DFD57CBB651716619E084DAB9
sudo add-apt-repository 'deb https://cloud.r-project.org/bin/linux/ubuntu focal-cran40/'
sudo apt update
sudo apt install r-base

General observations

get help by prepending ? to a function name, data set, symbol... (e.g. ?mean)
printing is implicit, so you usually don't need to write print()
counting starts at 1, so the first item of a vector x is x[1]
in function calls, you can specify the arguments by name or order, e.g.
- plot(iris$Species, iris$Petal.Width), plot(x = iris$Species, y = iris$Petal.Width)
vectorized language, therefore there's much less need to iterate over objects, e.g.
- to get double of each element of this vector x <- c(3, 5, 8), you only need x * 2
AND and OR comparisons can be written with single or double signs (&, &&, |, ||), which behave differently
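
The observations above can be tried directly in the console. This is a minimal sketch: the vector x and the use of seq() to show argument matching are illustrative choices, not part of the original notes.

```r
x <- c(3, 5, 8)

# counting starts at 1, so x[1] is the first element
first <- x[1]

# vectorized arithmetic: every element is doubled without a loop
doubled <- x * 2

# arguments can be matched by position or by name
by_position <- seq(1, 10, 3)
by_name <- seq(from = 1, to = 10, by = 3)

first                             # implicit printing, no print() needed
doubled
identical(by_position, by_name)
```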

Basics

help

? = prepend a question mark to a function, data set or library name to get info about it

Comments

# = single-line comments
R doesn’t support multi-lined comments

clear console

cat("\014") is the code to send "CTRL + L" to the console

list available data sets

data() = lists all available data sets, including from libraries (if they're loaded)

view data set as table

View(data set)

Declare variables and functions

NOTE: it's also permissible to declare variables and functions with the equal sign (=)

variables: x <- 20
functions: myF <- function() {...}
vectors: x <- c(5, 8, 12)

function

functions with ellipsis

add <- function(...) {
  args <- list(...)
  sum <- 0
  
  for (n in args) {
    sum <- sum + n
  }
  
  return(sum)
}

add(2, 3, 5, 4)

variable info

class() or typeof() = for a plain number, class() reports 'numeric' while typeof() reports 'double'; in general, class() gives the object's class and typeof() its internal storage type
str() = short for structure, displays the internal structure of the given object
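
As a rough illustration of how these functions differ, the sketch below compares them on a few objects. The variable names x, y and f are hypothetical, chosen only for this example.

```r
x <- 20                   # a plain number is stored as a double
y <- 20L                  # the L suffix creates an integer
f <- factor(c("a", "b"))  # a factor is built on top of integer codes

class(x)    # "numeric"
typeof(x)   # "double"
class(y)    # "integer"
typeof(y)   # "integer"
class(f)    # "factor"
typeof(f)   # "integer"

# str() gives a compact overview of an object's structure
str(iris)
```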

Operators

%%, %/% = remainder and integer quotient
: = creates a sequence of consecutive numbers, e.g. 1:5
%in% = checks whether an element belongs to a vector
& |, && || =
- single operators compare the vectors element by element and return a vector of logical values (TRUE or FALSE)
- double operators work on single values and return a single logical value (TRUE or FALSE); since R 4.3.0, passing them a vector longer than 1 is an error (older versions silently used only the first element)

x <- c(TRUE, FALSE, TRUE)
y <- c(TRUE, FALSE, FALSE)

print(x & y) # TRUE FALSE FALSE
print(x[1] && y[1]) # TRUE; && expects length-1 operands (x && y is an error since R 4.3.0)
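
The remaining operators can be sketched the same way. The values below are arbitrary, and combining all() and any() with && is one common way (an assumption about usage, not something stated in these notes) to reduce vectors to a single logical value.

```r
remainder <- 17 %% 5            # 2
quotient  <- 17 %/% 5           # 3
sequence  <- 1:5                # 1 2 3 4 5
is_member <- 3 %in% c(1, 3, 5)  # TRUE

a <- c(TRUE, FALSE, TRUE)
b <- c(TRUE, TRUE, FALSE)

either <- a | b                 # element-wise: TRUE TRUE TRUE

# reduce each vector to one value before using the double operator
single <- all(a) && any(b)      # FALSE, because not every element of a is TRUE
```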

if

ifelse() function

ifelse() is a vector equivalent form of the if...else statement

x <- c(3, 5, 8, 12)

# ifelse(test, yes, no)
# returns a value with the same shape as test, usually a vector
# filled with elements selected from either yes or no
# depending on whether the element of test is TRUE or FALSE
ifelse(x %% 2 == 0, "even", "odd")

for loop

# loop 1 through 10 (inclusive)
for (n in 1:10) {
  print(n)
}

# loop vector elements
x <- c(5, 8, 12, 15)
for (n in x) {
  print(n)
}

while

x <- 1

while (x <= 10) {
  print(x)
  x <- x + 1
}

switch

color <- "b"
switch(color, "r" = "red", "b" = "blue", "unknown")

get user input

x <- scan()

filter data

iris$Petal.Width[iris$Species == "setosa"]
plot(iris$Petal.Width[iris$Species == "setosa" | iris$Species == "virginica"])

apply

# named times_two to avoid masking base R's double()
times_two <- function(x) {
  return(x * 2)
}
x <- matrix(c(3, 5, 8, 12), nrow = 2)

# apply(X, MARGIN, FUN, ...)
# X = matrix; MARGIN = 1 for rows, 2 for cols; FUN = function to apply
apply(x, 2, times_two)
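
Because the function above works element by element, the result is the same whichever MARGIN is used. A sketch with an aggregating function makes the difference between rows and columns visible; the matrix m and the choice of sum() are assumptions made for this example.

```r
# filled column by column: the rows are (1, 3, 5) and (2, 4, 6)
m <- matrix(1:6, nrow = 2)

row_sums <- apply(m, 1, sum)  # MARGIN = 1: one result per row
col_sums <- apply(m, 2, sum)  # MARGIN = 2: one result per column

row_sums  # 9 12
col_sums  # 3 7 11
```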

Data Types

In R, everything is an object

vectors

vector of single value

R doesn't have primitive data types in the way that other languages do. In R even the simplest numeric value is an example of a vector.

used often:
- logical = TRUE, FALSE
- numeric/double = can be a whole number or contain a decimal value
- character = enclosed in quotes (single or double)
not used often:
- integer = declare explicitly with x <- 10L
- complex = complex numbers, e.g. 3 + 2i
- raw = created with charToRaw()

vector of multiple values

must contain only one data type
created with c()
e.g. numeric vectors x <- c(5, 8, 12)
index starts at 1
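
Since a vector can hold only one data type, c() silently converts mixed inputs to the most general type among them. A small sketch of that behaviour, with hypothetical variable names:

```r
mixed <- c(1, "a", TRUE)     # a character value is present, so everything becomes character
nums  <- c(1, TRUE, FALSE)   # logicals are converted to numbers (TRUE = 1, FALSE = 0)

class(mixed)  # "character"
mixed         # "1" "a" "TRUE"

class(nums)   # "numeric"
nums          # 1 1 0
```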

list

can contain many different types of elements
like vectors, functions, lists...
l <- list(c(3, 5, 8), "my string...", TRUE, list("a", "b"), myFunction)
x <- list(a = "aaa", b = "bbb") = can have named elements

matrix

matrix(data = NA, nrow = 1, ncol = 1, byrow = FALSE, dimnames = NULL)
like vectors, matrices store information of the same data type
two-dimensional
e.g. m <- matrix(c(3, 5, 8, 12, 15, 18), 2, 3)
m[1, 2] = access a specific element (row 1, column 2)
m[1,] = access all of the row 1, m[,2] = access all of the column 2

array

while matrices are two-dimensional, arrays can be any number of dimensions
store only one data type
array(c('green','yellow'), dim = c(2,3,2)) = creates 2 matrices with 2 rows and 3 columns each
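
Indexing an array takes one index per dimension. The sketch below uses numbers instead of colours so that each position is easy to trace; the name a is arbitrary.

```r
# 12 values arranged as two 2 x 3 matrices, filled column by column
a <- array(1:12, dim = c(2, 3, 2))

dim(a)      # 2 3 2
a[, , 1]    # the whole first matrix
a[1, 2, 2]  # row 1, column 2 of the second matrix: 9

# shorter input is recycled to fill all 12 cells
colors <- array(c("green", "yellow"), dim = c(2, 3, 2))
length(colors)  # 12
```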

factor

created from a vector of categorical values, it stores the vector along with its distinct values (levels)

v <- c("pineapple", "banana", "banana", "apple", "pineapple", "banana")
f <- factor(v)

print(f) # print vector and levels (each distinct vector value)
print(nlevels(f)) # print how many distinct values the vector has
print(levels(f)) # print each distinct vector value

data frame

is tabular like a matrix, but each column can hold a different data type (internally it is a list of equal-length vectors)
columns are variables and rows are observations

df <- data.frame(
  Name = c("John", "Matt"),
  Age = c(25, 27),
  City = c("Boston", "NY")
)

print(nrow(df))
print(ncol(df))
print(dim(df)) # get both nrow and ncol

Subsetting

Subsetting in R is a useful indexing feature for accessing object elements,
it can be used to select and filter variables and observations.

subsetting symbols

single brackets

[] = get a subset of length 1 or more
- usually, an object and its subset are of the same type; a subset of a vector is a vector, a subset of a data frame is a data frame...
  - however, there's one inconsistency: when matrix-style indexing such as m[1, ] or df[, 1] selects a single row or column, R reduces the result to the lowest possible dimension, so subset and container may have different types (pass drop = FALSE to prevent this)
- both names and indices can be used
- negative integers indicate exclusion
- variables can be used as indices

double brackets

[[]] = extract only one element (not necessarily just one value); i.e. vectors yield a single value, data frames yield a column vector
- names or indices can be used
- variables can be used as names or indices
- usually, the result is not of the same type as the containing object
- the length of the returned value isn't necessarily 1

dollar sign

$ = special case of [[ in which you access a single item by name
- therefore, iris$Species and iris[["Species"]] are equivalent
- cannot use integer indices
- if the name contains special characters, it must be enclosed in backticks
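
The differences between the three symbols are easiest to see side by side. This is a minimal sketch on a small data frame; df, its columns and the variable col are made up for this example.

```r
df <- data.frame(id = 1:3, score = c(9.5, 7, 8))

class(df["score"])                  # "data.frame": [ keeps the container type
class(df[["score"]])                # "numeric": [[ extracts the column itself
identical(df$score, df[["score"]])  # TRUE: $ is shorthand for [[ with a name

class(df[, "score"])                # "numeric": a single column is dropped to a vector
class(df[, "score", drop = FALSE])  # "data.frame": drop = FALSE keeps the data frame

col <- "score"
df[[col]]                           # works: the variable's value is used as the name
is.null(df$col)                     # TRUE: $ looks for a column literally named "col"
```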

atomic vectors

a <- c(3, 5, 8, 12)

# accessing with numbers
a[1]
a[c(1, 3)] # positive indices select the specified elements
a[-c(2, 4)] # negative indices exclude the specified elements

# accessing with logical values
a[c(TRUE,FALSE,TRUE,FALSE)] # select elements where the value is TRUE
a[a > 5] # therefore, this is possible

recycling rule

if two vectors are of unequal length, the shorter one will be recycled in order to match the longer vector
if longer object length is not a multiple of shorter object length, the program will throw a warning but it'll still return a result

a <- c(2, 3, 5, 8)
b <- c(1, 2)
a * b # result: 2, 6, 5, 16

lists

x <- list("a", "b", "c")

# single brackets return an object of class 'list'
class(x[2]) # list
# double brackets return a single element (not of class 'list')
class(x[[2]]) # character

# named lists
y <- list(f = 1:3, s = "a", t = 4:6)
y$f
y[["f"]]

matrices and arrays

m <- matrix(c(3, 5, 8, 12, 15, 18, 21, 25, 30), nrow = 3, byrow = TRUE)

m[1,] # entire first row
m[, 1] # blank subsetting selects all rows/column; here entire first column
m[2, 1] # element at second row, first column
m[1:2, 2:3] # rows 1 to 2, columns 2 to 3
m[c(1, 3), c(1, 3)] # get rows 1 and 3, their columns 1 and 3

# using a 2 column matrix to subset a matrix
# each row of the matrix will specify a row and a column
select <- matrix(c(1, 1, 1, 3, 3, 1, 3, 3), ncol = 2, byrow = TRUE)
m[select] # result: 3 8 21 30

Data frames and tibbles

mtcars[3] # single index will return specified column(s)
mtcars[3, 1] # two indices will behave like matrices, first is row and second is column

mtcars$mpg # or mtcars[["mpg"]]; access a column by name
mtcars[3, "mpg"] # access by both, index and name; third row, column named "mpg"
mtcars$mpg[3] # access by both, name and index

# filtering rows by column values
# the column index (second argument) is left blank to return all columns
iris[iris$Species == "setosa", ]
iris[iris$Petal.Width > 0.5 & iris$Species == "setosa", ] # multiple filters

regular expression

# grepl returns a vector of logical values
g <- grepl("Toyota", rownames(mtcars))
mtcars[g, ]

# grep returns a vector with the indices that contain a match
g <- grep("Toyota", rownames(mtcars))
mtcars[g, ]

# using grepl together with dplyr
library(tidyverse)
iris %>%
  filter(grepl("setosa", Species))

Tidyverse library

tidyverse is a set of packages that make it easier to perform everyday data analyses and that work in harmony (the packages share a common API)

installation

sudo apt install libcurl4-openssl-dev libssl-dev libxml2-dev = Ubuntu system packages needed to build tidyverse
- then install.packages("tidyverse") = to install from an R script or the console
library(tidyverse) = to load the library

use/load

library(tidyverse)
- from now on, any tidyverse function (like dplyr::filter) can be called without the dplyr:: prefix
- you only need to prepend dplyr:: if there are name collisions and you need to call a function that was masked

pipe operator

%>% = simplifies chaining, that is, passing the same data through several functions

library(tidyverse)

# without pipe ('.data')
f <- filter(.data = mpg, model == "a4")
s <- select(.data = f, manufacturer, model, year)
s

# using pipe
mpg %>%
  filter(model == "a4") %>%
  select(manufacturer, model, year)

dplyr

manipulate data sets

library(tidyverse)

mtcars %>%
  filter(
    mpg > 20,
    cyl == 4,
    wt < 2.5,
    grepl("Toyota", rownames(mtcars))
  ) %>%
  arrange(mpg) %>%
  select(mpg, cyl, wt)

tidyr

helps create tidy data, that is:
- every column is a variable
- every row is an observation
- every cell is a single value

gather, pivot_longer()

lengthens data, increasing the number of rows and decreasing the number of columns
gather() is superseded (no longer developed); the recommendation is to use pivot_longer() instead

library(tidyverse)

df <- data.frame(
  name = c("John", "Mary", "Jake"),
  a = c(7, 9, 18),
  b = c(18, 5, 3),
  c = c(32, 17, 35)
)

# 'key' and 'value' will be the names of the new cols 
# 'key' will be a categorical variable holding the 'multiple columns' names
# and 'value' will hold the 'multiple columns' values

df %>%
  # gather(key, value, ...multiple columns)
  gather("drug", "volume", a, b, c)

df %>%
  # pivot_longer(columns vector, names_to, values_to)
  pivot_longer(
    cols = c(a, b, c),
    names_to = "drug",
    values_to = "volume")

spread, pivot_wider()

widens data, increasing the number of columns and decreasing the number of rows
spread() is superseded (no longer developed); the recommendation is to use pivot_wider() instead

library(tidyverse)

df <- data.frame(
  name = c("John", "John", "Mary", "Mary"),
  drug = c("a", "b", "a", "b"),
  volume = c(7, 18, 9, 5)
)

# spread
df %>%
  # each individual value in key
  # will be converted to a column
  spread(key = "drug", value = "volume")

# pivot_wider()
df %>%
  # each individual value in names_from
  # will be converted to a column
  pivot_wider(names_from = "drug", values_from = "volume")

separate

splits a single column into multiple columns

library(tidyverse)

df <- data.frame(
  Name = c("John", "Mary"),
  Job = c("Teacher, Designer", "Manager, Developer")
)

df %>%
  separate(
    col = "Job",
    into = c("Job 1", "Job 2"), # names of the new columns to be created
    sep = ", "
  )

unite

combines multiple columns into one

library(tidyverse)

df <- data.frame(
  Name = c("John", "Mary"),
  Job1 = c("Teacher", "Manager"),
  Job2 = c("Designer", "Developer")
)

df %>%
  # unite(col = name of new column, ...columns to unite, sep = separator)
  unite(
    col = "Jobs",
    "Job1",
    "Job2",
    sep = ", "
  )

extract

given a regular expression with capturing groups, extract() turns each group into a new column

library(tidyverse)

df <- data.frame(
  Name = c(
    "John Edwards Smith",
    "Mary Kate Miller Brown",
    "Matt Richards"
  )
)

df %>%
  extract(
    col = "Name",
    into = c("First name", "Last name"),
    regex = "([A-Za-z]*).*\\s([A-Za-z]*)" # [A-z] would also match [, \, ], ^, _ and `
  )

readr

Import

RStudio

in the bottom-right panel, click on the file name and select 'Import', 'From text (readr)'
as you configure your data, the corresponding line of code is shown

code

# read_csv is from readr (included in tidyverse)
library(tidyverse)

setwd("~/Dev/r/")
hp <- read_csv("hp.csv")
hp

Export

# write_csv is from readr (included in tidyverse)
library(tidyverse)

setwd("~/Dev/r/")
write_csv(iris, "iris.csv")

Descriptive statistics

summarize and describe a given data set

table()

create a frequency table from a categorical variable (column)
table(iris$Species)

min, median, mean, max, quantile

min(mtcars$cyl)
median(mtcars$cyl)
mean(mtcars$cyl)
max(mtcars$cyl)
quantile(mtcars$cyl)

# get all at once
summary(mtcars$cyl)

summary()

summary(iris), summary(iris$Petal.Width) = details about an object
- if the variable is categorical, the result is a frequency table
- if the variable is quantitative, the result contains measures of center (mean, median) and of position and spread (min, 1st qu., 3rd qu., max)
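
A short sketch of both cases, using built-in data sets; the variable names by_species and cyl_stats are assumptions introduced for this example.

```r
by_species <- summary(iris$Species)  # categorical (factor): counts per level
cyl_stats  <- summary(mtcars$cyl)    # quantitative: min, quartiles, median, mean, max

by_species
names(cyl_stats)  # "Min." "1st Qu." "Median" "Mean" "3rd Qu." "Max."
```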

cor()

correlation

# correlation between weight and miles per gallon
cor(mtcars$wt, mtcars$mpg) # result: approximately -0.87

Statistical model

is a set of mathematical equations based on probabilities and used to describe the relationship between two or more variables
purpose: description, inference (estimates the parameters of a larger population), comparison (compare if two sets of data are different in a statistically significant way) and prediction (about new, unknown observations)

linear regression

describes the relationship between two variables: how changes in one variable affect the other
it is called a linear model because it assumes the relationship follows a straight line
both variables must be continuous numeric values
the variable on the x axis is called the 'explanatory variable', and the one on the y axis the 'outcome variable'
linear predictor function - y = m * x + b
- m is the slope of the line (for each unit increase in x, how much does y increase)
- b is the y intercept (the y value when x is equal to 0)
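
To connect m and b to actual numbers, the sketch below computes them by hand with the usual least-squares formulas and compares them with what lm() estimates for the iris data. The names x, y and fit are chosen for this example only.

```r
x <- iris$Petal.Length
y <- iris$Petal.Width

# least-squares estimates of the slope and the intercept
m <- cov(x, y) / var(x)
b <- mean(y) - m * mean(x)

# the same quantities as estimated by lm(): intercept first, then slope
fit <- lm(Petal.Width ~ Petal.Length, data = iris)
coef(fit)

all.equal(unname(coef(fit)), c(b, m))  # TRUE
```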

plot(iris$Petal.Length, iris$Petal.Width)

# lm is the R function to create linear models
model <- lm(
  formula = Petal.Width ~ Petal.Length,
  data = iris
)

# draw the fitted straight line on top of the plot
lines(
  x = iris$Petal.Length,
  y = fitted(model), # fitted values from the model
  col = "red",
  lwd = 3
)

# predict new values from model
predict(
  object = model,
  newdata = data.frame(
    Petal.Length = c(2, 5, 7) # arbitrary values
  )
)

Plot

a plot is a graphical technique for representing a data set
usually a graph showing the distribution of one variable or the relationship between two or more variables
in R, plotting is usually done
- with base R, that is, without any third-party library
- with a library called ggplot2 (included with tidyverse)

base R vs ggplot

base R mostly uses the plot(x, y) function
- but there are also specialized functions such as barplot() and hist()
ggplot always uses the
- ggplot(data = data, mapping = aes()) function,
- followed by + (ggplot2's operator for adding components; it is not the %>% pipe)
- and then layers, scales, facets and/or coordinates

basic ggplot

save plot to variable and then transform it

library(tidyverse)

# save plot to variable
# only save it, don't display it
p <- ggplot(mtcars, aes(x = cyl)) +
  geom_bar()

# won't save the flipped plot into the variable
# only displays it
p + coord_flip()

customization

library(tidyverse)

# needed for third variable in aes()
f <- factor(mtcars$am)
levels(f) <- c("Automatic", "Manual")

ggplot(mtcars, aes( x = wt, y = mpg, shape = f, color = f )) +
  geom_point() +
  labs(
    title = "WT VS MPG",
    x = "weight",
    y = "miles per gallon",
    # change legend title with the aes names
    shape = "Transmission",
    color = "Transmission"
  ) +
  theme( # theme() customize non-data components
    plot.title =
      element_text( face = "bold",
                    hjust = 0.5,
                    margin = margin(8, 0, 16, 0)),
    axis.title =
      element_text( face = "italic"),
    axis.title.x = 
      element_text( margin = margin(8, 0, 4, 0) ),
    axis.title.y = 
      element_text( margin = margin(0, 8, 0, 4) ),
    axis.ticks = element_blank() # remove ticks
  )

zoom, coord_cartesian

library(tidyverse)

ggplot(ChickWeight, aes(x = weight)) +
  geom_histogram() +
  coord_cartesian(xlim = c(200, 300)) # zoom

fill areas under plot

base R

df <- data.frame(
  Month = 1:12,
  Num = as.vector(AirPassengers)[1:12]
)

plot(df$Num, type = "l")

polygon(c(min(df$Month), df$Month, max(df$Month)), c(0, df$Num, 0), col = "steelblue")

ggplot

df <- data.frame(
  Month = 1:12,
  Num = as.vector(AirPassengers)[1:12]
)

ggplot(df, aes(x = Month, y = Num)) +
  # geom_area() + # ymin fixed to 0, which would make plot very high
  geom_ribbon(aes(ymin = 100, ymax = Num)) +
  geom_line()

Categorical univariable analysis

frequency bar chart

x axis: categorical variable
y axis: frequency/count

base R

# plot()
plot(iris$Species)

# barplot()
t <- table(iris$Species) # creates frequency table
barplot(t)

ggplot

ggplot(iris, aes(x = Species)) +
  geom_bar()

Cleveland dot plot

base R

dotchart(table(mtcars$cyl))

ggplot

ggplot(mtcars, aes(x = cyl)) +
  # stat = the statistical transformation to use on the data for this layer
  geom_point(stat = "count") +
  coord_flip()

pie chart

base R

pie(table(mtcars$cyl))

ggplot

ggplot(
  mtcars, aes(x = "", fill = as.factor(cyl))) +
  geom_bar() +
  coord_polar(theta = "y")

Quantitative univariable analysis

histogram

base R

hist(mtcars$mpg)

ggplot

ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(binwidth = 5) # binwidth = width of each bin (bar)

density plot

base R

plot(density(mtcars$mpg))

ggplot

ggplot(mtcars, aes(x = mpg)) +
  geom_density()

Categorical bivariable analysis

percent, grouped and stacked frequency bar chart

base R

t <- table(mtcars$cyl, mtcars$am)

barplot(t, beside = TRUE) # grouped

barplot(t) # stacked

# percent
percentage <- apply(t, 2, function(x) { x * 100 / sum(x, na.rm = TRUE) })
barplot(percentage)

ggplot

# grouped
ggplot(
  data = mtcars,
  aes(x = factor(am), fill = factor(cyl))) +
  geom_bar(position = "dodge")

# stacked
ggplot(
  data = mtcars,
  aes(x = factor(am), fill = factor(cyl))) +
  geom_bar()

# percent
ggplot(
  data = mtcars,
  aes(x = factor(am), fill = factor(cyl))) +
  geom_bar(position = "fill")

Categ. & quant. bivariable analysis

box plot

base R

plot(ChickWeight$Diet, ChickWeight$weight)

ggplot

ggplot(ChickWeight, aes( x = Diet, y = weight)) +
  geom_boxplot()

Quantitative bivariable analysis

scatter plot

base R

plot(mtcars$wt, mtcars$mpg)

ggplot

ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()

3 variables (by size, color, shape; facet)

ggplot

library(tidyverse)

ggplot(mtcars, aes(x = wt, y = mpg, size = hp)) +
  geom_point()

# both col and shape will need a categorical variable
f <- as.factor(mtcars$am)
levels(f) <- c("Automatic", "Manual") # rename levels

ggplot(mtcars, aes(x = wt, y = mpg, col = f)) +
  geom_point()

ggplot(mtcars, aes(x = wt, y = mpg, shape = f)) +
  geom_point()

# both col and shape
ggplot(mtcars, aes(x = wt, y = mpg, col = f, shape = f)) +
  geom_point()

# multi-panel
ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  facet_grid(. ~ cyl)

gustavoalbuquerquebr/r.md
