R Tidyverse
Tidyverse is a collection of R packages.
Installation
Install the tidyverse package.
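A minimal sketch of the usual route, assuming the package is installed from CRAN and that network access is available; the mirror URL is only one common choice.

```r
# Install once from CRAN (needs network access), then attach in each session.
if (!requireNamespace("tidyverse", quietly = TRUE)) {
  install.packages("tidyverse", repos = "https://cloud.r-project.org")
}

# Attaching tidyverse also attaches its core packages (dplyr, tibble, ...).
library(tidyverse)
```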
Usage
Tidyverse is a collection of packages such as ggplot2, tibble, readr, haven, etc.
These libraries serve as a modernized standard library. At the core are tibbles (a re-engineered data frame), magrittr (the forward pipe operator %>%), and dplyr. As an example:
> a <- tibble(foo = c(1,2,4),
              bar = c("a","b","c"))
> b <- a %>%
    group_by(foo) %>%
    mutate(x = foo+2,
           seq = row_number(),
           groupsize = n()) %>%
    ungroup()
> b %>%
    summarize(xbar = mean(x))
The primary dplyr method is mutate, which is a vectorized data step. Each argument to the method is an expression like foo = bar that evaluates to a column; the expressions are evaluated sequentially, so a later one can use a column created by an earlier one. Variables are either declared or overwritten. The RHS must be either a scalar value (in which case it is recycled for all rows) or a vector exactly as long as the tibble (or the current group, if grouped) has rows.
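The sequential evaluation and scalar recycling described above can be sketched as follows. The column names y and z are made up for illustration.

```r
library(dplyr)

a <- tibble(foo = c(1, 2, 4))

b <- a %>%
  mutate(
    x = foo + 2,  # a vector as long as the tibble
    y = x * 10,   # may refer to x, which was defined just above
    z = 0         # a scalar, recycled for every row
  )
```

Reordering the arguments so that y comes before x would fail, because x would not exist yet when y is evaluated.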
Functions available within a mutate block that set a scalar value include:
n() returns the number of rows in the current group (or in the whole tibble if it is not grouped).
n_distinct(foo) returns the number of distinct levels of foo.
Vector functions available within a mutate block include:
row_number() creates the sequential row index.
lag(foo) accesses the lagged value of foo; the first row accesses NA.
For a lag other than 1, try lag(foo, n = 2).
lead(foo) is the leading value equivalent.
- Note that these access values as they were when the expression began evaluating; they never see values updated by the same expression.
cummean(foo) creates the running mean of foo.
ntile(foo, 10) creates n-tiles (deciles in this case) of foo.
Most mathematical functions from base R, such as sum(foo), are useful.
Useful vector functions that actually come from base R include:
cumsum(foo) creates the running total of foo.
To reference columns of the tibble explicitly within a dplyr method, use the .data pronoun, which is always available there. For example, .data[[varname]] looks up the column named by the string varname.
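As a rough illustration of the functions listed above, the following sketch applies several of them to a three-row tibble. The new column names and the varname variable are hypothetical.

```r
library(dplyr)

a <- tibble(foo = c(1, 2, 4))
varname <- "foo"

b <- a %>%
  mutate(
    idx   = row_number(),      # sequential row index
    prev  = lag(foo),          # previous value; NA in the first row
    nxt   = lead(foo),         # next value; NA in the last row
    total = cumsum(foo),       # running total (base R)
    avg   = cummean(foo),      # running mean
    rows  = n(),               # scalar, recycled for every row
    copy  = .data[[varname]]   # column looked up by a string name
  )
```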
Logic
To set values conditionally based on a binary condition, use the if_else function within the mutate block. This works like the ifelse function in base R (both are vectorized), but it is stricter: the two outcomes must have compatible types.
When there are multiple conditions, instead use the case_when method. Each argument to the method is an expression like foo ~ bar; the LHS is a logical expression. For example:
> x %>%
    mutate(case_when(bar>0 ~ 1,
                     bar==0 ~ 10,
                     TRUE ~ 100))
As demonstrated by this example, there should be a terminal fallback case. The arguments are evaluated in sequence, and a row's value is taken from the RHS of the first expression evaluating to TRUE.
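A small sketch of both functions side by side, assuming a tibble x with a numeric column bar. The output column names are hypothetical.

```r
library(dplyr)

x <- tibble(bar = c(-1, 0, 3))

y <- x %>%
  mutate(
    # binary condition: both outcomes are character
    positive = if_else(bar > 0, "yes", "no"),
    # multiple conditions: the first matching LHS wins
    code = case_when(
      bar > 0  ~ 1,
      bar == 0 ~ 10,
      TRUE     ~ 100   # terminal fallback
    )
  )
```

Here the row with bar = -1 matches neither of the first two conditions and so takes the fallback value.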
Grouping
The group_by and rowwise methods transform the tibble to be a grouped tibble. Note that this does not re-order the underlying data. Note also that, by default, groups are ordered ascending.
The group_by function can take any number of grouping variables and also a few options:
By default, any existing groupings are replaced. Specify .add = TRUE to retain them as higher level groups.
If a grouping variable is a factor, then defined levels that do not actually appear in the data may be respected. (The default behavior depends on how the tibble was defined.) Specify .drop = TRUE to ignore levels that do not appear, or .drop = FALSE to respect them.
Grouping affects all subsequent dplyr methods, so reset the tibble with the ungroup function once the grouped work is done. Again, note that this does not re-order the underlying data.
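The grouping behavior described above can be sketched as follows. The data and column names are made up for illustration.

```r
library(dplyr)

x <- tibble(foo = c("b", "a", "b"), bar = c(1, 2, 3))

g <- x %>% group_by(foo)

# n() is evaluated per group; the rows keep their original order.
counted <- g %>%
  mutate(groupsize = n()) %>%
  ungroup()

# .add = TRUE keeps foo as the higher level group instead of replacing it.
g2 <- g %>% group_by(bar, .add = TRUE)
vars <- group_vars(g2)
```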
Subsetting
To subset the rows in a tibble, use the filter method.
To subset the variables in a tibble, use the select method. There are several selection helper functions that take formulaic instructions that identify variables and return a vector of integer indices.
starts_with("foo") finds variables based on a prefix
ends_with("foo") finds variables based on a suffix
contains("foo") finds variables based on a literal substring
matches("foo") finds variables based on a regular expression
everything() is a self-documenting way to refer to all variables
any_of(c("foo","bar")) finds variables by name, silently ignoring any that do not exist in the data
all_of(c("foo","bar")) finds variables by name, but throws an error if any do not exist in the data
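As an illustration of filter, select, and a few of the helpers above, here is a sketch on a hypothetical tibble whose column names were chosen to exercise the helpers.

```r
library(dplyr)

x <- tibble(foo_a = 1:3, foo_b = 4:6, bar = c("u", "v", "w"))

rows   <- x %>% filter(foo_a > 1)                     # subset rows
prefix <- x %>% select(starts_with("foo"))            # foo_a, foo_b
regex  <- x %>% select(matches("^foo_[ab]$"), bar)    # regular expression
lax    <- x %>% select(any_of(c("bar", "missing")))   # "missing" is ignored

# all_of is strict: a name that does not exist raises an error.
strict <- try(x %>% select(all_of(c("bar", "missing"))), silent = TRUE)
```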
Reordering
The arrange method reorders rows in a tibble.
> x %>% arrange(foo, bar, desc(baz))
By default, cases are reordered without respect to groups. To instead use grouping variables as higher level sorting variables, specify .by_group = TRUE.
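The difference that the grouping option makes can be sketched with a small grouped tibble. The columns g and v are hypothetical.

```r
library(dplyr)

x <- tibble(g = c("b", "a", "b", "a"), v = c(1, 4, 3, 2)) %>%
  group_by(g)

# Groups are ignored by default: rows are sorted on v alone.
plain <- x %>% arrange(desc(v))

# With .by_group = TRUE, rows are sorted by g first, then by v within g.
grouped <- x %>% arrange(desc(v), .by_group = TRUE)
```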
Forward Pipe
magrittr provides the forward pipe operator (%>%). This serves to forward the LHS as the first argument of the RHS. In other words, f(x) is equivalent to x %>% f().
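There are also advanced features like forwarding into an explicit alias (.):
f(1, x) is equivalent to x %>% f(1, .);
f(x$foo, x$bar) is equivalent to x %>% { f(.$foo, .$bar) }. The braces are needed because a dot that appears only inside a nested call such as .$foo does not stop the LHS from also being forwarded as the first argument.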
In version 4.1 of R, a similar forward operator (|>) was introduced into the language. This behaves exactly the same in the simple case, but lacks the advanced features of magrittr. Until requiring R 4.1 or later is no longer a meaningful constraint, the recommendation is to use magrittr for portability.
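A brief sketch of the magrittr forms discussed above, using base R functions so that the equivalences are easy to check. The data frame df is hypothetical.

```r
library(magrittr)

x <- c(1, 4, 9)

# The LHS becomes the first argument: same as sqrt(x).
r1 <- x %>% sqrt()

# The dot places the LHS elsewhere: same as round(3.14159, 2).
r2 <- 2 %>% round(3.14159, .)

# Braces suppress first-argument forwarding, so only the dotted
# expressions are passed: same as sum(df$foo, df$bar).
df <- data.frame(foo = c(1, 2), bar = c(3, 4))
r3 <- df %>% { sum(.$foo, .$bar) }
```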
Interpolation
rlang provides several functions and operators for injecting R syntax.
To convert a string into a syntactic symbol, try:
> mysym <- sym("foo")
> myvec <- syms(c("foo", "bar"))
There are also parallel data_sym and data_syms functions, which convert strings into expressions like .data$foo. As the .data pronoun is only guaranteed in dplyr methods, these functions should only be used in those contexts.
To inject a symbol into syntax, use the 'bang-bang' (!!) or 'triple bang' (!!!) operators. The latter splices a vector of symbols before injection.
> x %>% select(!!mysym)
> x %>% select(!!!myvec)
All dplyr methods support these operators immediately. To enable injection in any other context, wrap the function call in the inject function.
> inject(interaction(!!!myvec))
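Putting these pieces together, here is a minimal sketch assuming a tibble x with columns foo, bar, and baz. The inject call uses paste with a hypothetical argument list, so that it does not depend on any columns existing in the calling environment.

```r
library(dplyr)
library(rlang)

x <- tibble(foo = 1:2, bar = 3:4, baz = 5:6)

mysym <- sym("foo")
myvec <- syms(c("foo", "bar"))

one <- x %>% select(!!mysym)    # same as select(foo)
two <- x %>% select(!!!myvec)   # same as select(foo, bar)

# Outside dplyr, inject enables the same operators for ordinary calls.
args <- list("a", "b", sep = "-")
joined <- inject(paste(!!!args))  # same as paste("a", "b", sep = "-")
```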
On a related note, it is possible to define a function that takes a syntactic symbol as an argument rather than a string. This relies on 'enquosing' the argument, creating an 'enquosure'. The output of enquo can be injected with !!; the output of enquos can be spliced and injected with !!!.
In addition, the 'curly curly' operator ({{) implicitly creates the enquosure and injects it in one step. As an example:
foo <- function(x, bar) {
  x %>% summarize(maximum = max( {{ bar }} ))
}
Note that this operator should only be used on the right hand side of an expression, as above.
For the left hand side, on the other hand, a glue-like syntax is supported as:
foo <- function(x, bar, baz) {
  x %>% summarize("max{baz}" := max( {{ bar }} ))
}
Note that the equals sign (=) had to be replaced with a walrus operator (:=).
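To make the calling convention concrete, the following sketch defines a variant of the function above and calls it. The name col_max, the underscore in the output name, and the data are assumptions for illustration.

```r
library(dplyr)

# bar is passed as a bare column name; baz is an ordinary string
# that is interpolated into the output column name.
col_max <- function(x, bar, baz) {
  x %>% summarize("max_{baz}" := max({{ bar }}))
}

out <- col_max(tibble(v = c(1, 5, 3)), v, "v")
```

The result is a one-row tibble whose single column is named max_v.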
Warning
It is almost always safer and easier to keep arguments as strings, and utilize the .data pronoun that is guaranteed in all dplyr methods.
foo <- function(x, bar) {
  x %>% summarize(maximum = max(.data[[bar]]))
}
Consider the all_of function when working with vectors of string variable names. It is only valid in certain dplyr methods (like select), but it can be chained with across for almost all other contexts.
> x %>% group_by(across(all_of(c("foo","bar"))))
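A short sketch of the string-based approach end to end. The function name col_max_str, the grp vector, and the data are hypothetical.

```r
library(dplyr)

# The column is named by a string and looked up through .data.
col_max_str <- function(x, bar) {
  x %>% summarize(maximum = max(.data[[bar]]))
}

x <- tibble(g = c("a", "a", "b"), v = c(1, 5, 3))

overall <- col_max_str(x, "v")

# Grouping variables supplied as a character vector.
grp <- c("g")
totals <- x %>%
  group_by(across(all_of(grp))) %>%
  summarize(total = sum(v))
```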
