R Tidyverse

Tidyverse is a collection of R packages.


Installation

Install the tidyverse package from CRAN with install.packages("tidyverse"), then load it with library(tidyverse).


Usage

The tidyverse includes packages such as ggplot2, tibble, readr, and haven.

These libraries serve as a modernized standard library. At the core are tibbles (a re-engineered data frame), magrittr (the forward pipe operator %>%), and dplyr. As an example:

> a <- tibble(foo = c(1,2,4),
              bar = c("a","b","c"))

> b <- a %>%
    group_by(foo) %>%
    mutate(x = foo+2,
           seq = row_number(),
           groupsize = n()) %>%
    ungroup()

> b %>%
    summarize(xbar = mean(x))

The primary dplyr method is mutate, which is a vectorized data step. Each argument to the method is an expression like foo = bar that evaluates to a column; the expressions are evaluated sequentially, so later ones can use columns set by earlier ones. Variables are either declared or overwritten. The RHS must either be a scalar value (in which case it is recycled for all rows) or a vector exactly as long as the tibble has rows (or, for a grouped tibble, as long as the current group).
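
As a minimal sketch of sequential evaluation and recycling, consider the following. The tibble d and its column names are hypothetical, chosen only for illustration.

```r
library(dplyr)

# Hypothetical data for illustration.
d <- tibble(foo = c(1, 2, 4))

d2 <- d %>%
  mutate(doubled = foo * 2,      # a vector as long as the tibble
         shifted = doubled + 1,  # uses the column declared just above
         label   = "constant",   # a scalar, recycled for every row
         foo     = foo / 2)      # overwrites the existing column
```

Because foo is overwritten last, doubled and shifted are computed from the original values.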

Functions available within a mutate block that set a scalar value include:

Vector functions available within a mutate block include:

Most mathematical functions from base R, such as sum(foo), are useful.

Useful vector functions that actually come from base R include:

To reference the tibble explicitly within a pipeline, use the .data pronoun, which is available inside dplyr methods. For example, .data[[varname]] looks up the column whose name is stored as a string in varname.
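
A short sketch of the pronoun in use, assuming a hypothetical tibble d and a column name held in the string varname:

```r
library(dplyr)

# Hypothetical data for illustration.
d <- tibble(foo = c(1, 2, 4), bar = c("a", "b", "c"))
varname <- "foo"  # the column name, held as a string

d2 <- d %>%
  mutate(plus_two = .data[[varname]] + 2,  # look up a column by string
         same     = .data$foo + 2)         # look up a column by name
```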

Logic

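To operate conditionally within a data step based on a binary condition, use the if_else method within the mutate block. This works like the ifelse function in base R, but is stricter: the two outcomes must have compatible types, and a separate value can be supplied for missing conditions.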
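
For illustration only, a small sketch with a hypothetical tibble d. The missing argument is optional and sets the value used where the condition is NA.

```r
library(dplyr)

# Hypothetical data for illustration.
d <- tibble(bar = c(-1, 0, 3, NA))

d2 <- d %>%
  mutate(sign  = if_else(bar > 0, "positive", "not positive"),
         # missing supplies a value for rows where the condition is NA
         sign2 = if_else(bar > 0, "positive", "not positive",
                         missing = "unknown"))
```

Mixing types between the two outcomes, as in if_else(bar > 0, "positive", 0), raises an error rather than silently coercing.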

When there are multiple conditions, instead use the case_when method. Each argument to the method is an expression like foo ~ bar; the LHS is a logical expression. For example:

> x %>%
    mutate(baz = case_when(bar > 0 ~ 1,
                           bar == 0 ~ 10,
                           TRUE ~ 100))

As demonstrated by this example, there should be a terminal fallback case. The arguments are evaluated in sequence, and a row's value is taken from the RHS of the first expression evaluating to TRUE.

Grouping

The group_by and rowwise methods transform the tibble to be a grouped tibble. Note that this does not re-order the underlying data. Note also that, by default, groups are ordered ascending.
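
To illustrate the effect of rowwise, here is a minimal sketch. The tibble d is hypothetical.

```r
library(dplyr)

# Hypothetical data for illustration.
d <- tibble(a = c(1, 5), b = c(4, 2))

# Without rowwise, max sees both whole columns and returns one value.
plain <- d %>% mutate(biggest = max(a, b))

# With rowwise, each row is its own group, so max sees one a and one b.
by_row <- d %>% rowwise() %>% mutate(biggest = max(a, b)) %>% ungroup()
```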

The group_by function can take any number of grouping variables and also a few options:

Grouping affects all dplyr methods, so ensure that a tibble is reset with the ungroup function once grouped operations are done. Again, note that this does not re-order the underlying data.
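
The following sketch, using a hypothetical tibble d, shows the same mutate expression giving different results before and after ungroup.

```r
library(dplyr)

# Hypothetical data for illustration.
d <- tibble(g = c("b", "a", "b"), v = c(1, 2, 3))

# While grouped, sum(v) is computed per group.
grouped <- d %>% group_by(g) %>% mutate(total = sum(v))

# After ungroup, sum(v) is computed over the whole tibble.
flat <- grouped %>% ungroup() %>% mutate(total = sum(v))
```

In both results the rows stay in their original order (b, a, b).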

Subsetting

To subset the rows in a tibble, use the filter method.

To subset the variables in a tibble, use the select method. There are several selection helper functions that take formulaic instructions that identify variables and return a vector of integer indices.
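
A brief sketch of both methods, assuming a hypothetical tibble d. starts_with is one of the selection helpers.

```r
library(dplyr)

# Hypothetical data for illustration.
d <- tibble(id      = 1:4,
            score_a = c(10, 20, 30, 40),
            score_b = c(4, 3, 2, 1),
            note    = c("w", "x", "y", "z"))

# Multiple conditions in filter are combined with AND.
kept <- d %>% filter(score_a > 15, score_b > 1)

# Select one variable by name and the rest with a helper.
cols <- d %>% select(id, starts_with("score_"))
```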

Reordering

The arrange method reorders rows in a tibble.

> x %>% arrange(foo, bar, desc(baz))

By default, cases are reordered without respect to groups. To instead use grouping variables as higher level sorting variables, specify .by_group = TRUE.
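
As a sketch of the difference on a grouped tibble, using the .by_group argument of arrange and a hypothetical tibble d:

```r
library(dplyr)

# Hypothetical data for illustration.
d <- tibble(g = c("b", "a", "b", "a"), v = c(1, 4, 3, 2))

# Groups are ignored: rows are sorted by v alone.
plain <- d %>% group_by(g) %>% arrange(v)

# Grouping variables sort first, then v within each group.
by_group <- d %>% group_by(g) %>% arrange(v, .by_group = TRUE)
```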

Forward Pipe

magrittr provides the forward pipe operator (%>%). This serves to forward the LHS as the first argument of the RHS. In other words, f(x) is equivalent to x %>% f().

There are also advanced features like forwarding into an explicit alias (.). For example, x %>% f(y, .) is equivalent to f(y, x).

In version 4.1 of R, a similar forward operator (|>) was introduced into the language. This behaves exactly the same in the simple case, but lacks the advanced features of magrittr. Until R 4.1 is old enough that requiring it is no longer a practical constraint, the recommendation is to use magrittr for portability.
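
The equivalences can be sketched as follows. The last line assumes R 4.1 or later, since the native pipe does not parse in earlier versions.

```r
library(magrittr)

x <- c(1, 4, 9)

direct <- sqrt(x)
piped  <- x %>% sqrt()             # the LHS becomes the first argument

# The dot alias forwards the LHS to a position other than the first.
steps  <- 2 %>% seq(10, 20, by = .)

native <- x |> sqrt()              # requires R >= 4.1
```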

Interpolation

rlang provides several functions and operators for injecting R syntax.

To convert a string into a syntactic symbol, try:

> mysym <- sym("foo")
> myvec <- syms(c("foo", "bar"))

There are also parallel data_sym and data_syms functions, which evaluate strings into symbols like .data$foo. As the .data pronoun is only guaranteed in dplyr methods, these functions should only be used in those contexts.

To inject a symbol into syntax, use the 'bang-bang' (!!) or 'triple bang' (!!!) operators. The latter splices a vector of symbols before injection.

> x %>% select(!!mysym)
> x %>% select(!!!myvec)
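
A self-contained sketch of injection with both sym and data_sym follows. The tibble d is hypothetical, and data_sym assumes a reasonably recent version of rlang.

```r
library(dplyr)
library(rlang)

# Hypothetical data for illustration.
d <- tibble(foo = c(1, 2, 4), bar = c("a", "b", "c"))

mysym <- sym("foo")       # the symbol foo
mydat <- data_sym("foo")  # the call .data$foo

a <- d %>% select(!!mysym)
b <- d %>% summarize(maximum = max(!!mydat))
```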

All dplyr methods support these operators immediately. To enable injection in any other context, wrap the function call in the inject function.

> inject(interaction(!!!myvec))
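
For a runnable version of this call, the symbols must resolve to objects in the calling environment. The vectors foo and bar below are hypothetical stand-ins.

```r
library(rlang)

# Hypothetical vectors for illustration.
foo <- c("a", "b")
bar <- c("x", "y")
myvec <- syms(c("foo", "bar"))

# Equivalent to calling interaction(foo, bar) directly.
res <- inject(interaction(!!!myvec))
```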

On a related note, it is possible to define a function that takes a syntactic symbol as an argument rather than a string. This relies on 'enquosing' the argument with enquo (or enquos for several arguments), creating an 'enquosure' (a quosure, in rlang's terms). The output of enquo can be injected with !!; the output of enquos can be spliced with !!!.
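
A minimal sketch of this pattern, using enquo together with the injection operator. The helper col_max and the tibble d are hypothetical names.

```r
library(dplyr)
library(rlang)

# Hypothetical helper: col is passed as a bare column name, not a string.
col_max <- function(data, col) {
  col <- enquo(col)                          # capture the argument unevaluated
  data %>% summarize(maximum = max(!!col))   # inject it into the dplyr call
}

# Hypothetical data for illustration.
d <- tibble(foo = c(1, 2, 4))
res <- col_max(d, foo)
```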

In addition, the 'curly curly' operator ({{) implicitly creates the enquosure and injects it in one step. As an example:

foo <- function(x, bar) {
  x %>% summarize(maximum = max( {{ bar }} ))
}

Note that this operator should only be used on the right hand side of an expression, as above.

For the left hand side, on the other hand, a glue-like syntax is supported as:

foo <- function(x, bar, baz) {
  x %>% summarize("max{baz}" := max( {{ bar }} ))
}

Note that the equals sign (=) had to be replaced with a walrus operator (:=).
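
Calling such a function might look like the sketch below. The definition is repeated so that the example is self-contained; the tibble d and the suffix "_v" are hypothetical.

```r
library(dplyr)

foo <- function(x, bar, baz) {
  # baz is a string interpolated into the name; bar is a bare column name.
  x %>% summarize("max{baz}" := max({{ bar }}))
}

# Hypothetical data for illustration.
d <- tibble(v = c(1, 5, 3))
res <- foo(d, v, "_v")
```

Here the resulting column is named max_v.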

Warning

It is almost always safer and easier to keep arguments as strings, and utilize the .data pronoun that is guaranteed in all dplyr methods.

foo <- function(x, bar) {
  x %>% summarize(maximum = max(.data[[bar]]))
}

Consider the all_of function when working with vectors of string variable names. It is only valid in certain dplyr methods (like select), but it can be chained with across for almost all other contexts.

> x %>% group_by(across(all_of(c("foo","bar"))))


See also

Tidyverse project reference


R/Tidyverse (last edited 2026-05-21 15:53:53 by DominicRicottone)