R Tidyverse

Tidyverse is a collection of R packages.


Installation

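Install the tidyverse package from CRAN by running install.packages("tidyverse") in an R session, then load it with library(tidyverse).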


Usage

The tidyverse includes packages such as ggplot2, tibble, readr, and haven.

These libraries serve as a modernized standard library. At the core are tibble (a re-engineered data frame), magrittr (which provides the forward pipe operator %>%), and dplyr. As an example:

> a <- tibble(foo = c(1,2,4),
              bar = c("a","b","c"))

> b <- a %>%
    group_by(foo) %>%
    mutate(x = foo+2,
           seq = row_number(),
           groupsize = n()) %>%
    ungroup()

> b %>%
    summarize(xbar = mean(x))

The primary dplyr method is mutate, which works like a vectorized data step. Each argument to the method is an expression like foo = bar that evaluates to a column. The expressions are evaluated in order, so a later expression can refer to a column created by an earlier one. Each expression either creates a new column or overwrites an existing one. The RHS must be either a length-1 value (in which case it is recycled for all rows) or a vector exactly as long as the tibble has rows.
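The evaluation rules above can be sketched as follows. The column names y and flag are made up for illustration.

```r
library(dplyr)
library(tibble)

a <- tibble(foo = c(1, 2, 4))

b <- a %>%
  mutate(x = foo + 2,     # new column computed from an existing one
         y = x * 10,      # may refer to x, created one argument earlier
         flag = TRUE,     # length-1 value, recycled to every row
         foo = foo / 2)   # overwrites the existing foo column
```

Because the arguments run in order, x is computed from the original foo before foo is overwritten.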

Functions available within a mutate block that set a scalar value include:

Vector functions available within a mutate block include:

Most mathematical functions from base R, such as sum(foo), are useful.

Useful vector functions that actually come from base R include:

To reference the tibble explicitly within a pipeline, use the .data pronoun, which is available inside dplyr functions such as mutate. For example, .data[[varname]] selects the column whose name is stored as a string in varname.
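A minimal sketch of the .data pronoun, assuming the column name arrives as a string. The variable varname and the column doubled are illustrative names only.

```r
library(dplyr)
library(tibble)

a <- tibble(foo = c(1, 2, 4),
            bar = c("a", "b", "c"))

varname <- "foo"   # column name held as a string (illustrative)

b <- a %>%
  mutate(doubled = .data[[varname]] * 2)   # looks up the column named "foo"
```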

Logic

To set values based on a binary condition, use the if_else method within the mutate block. This works like the ifelse function in base R (both are vectorized), but is stricter: the true and false values must have compatible types.
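As a small illustration, the sketch below labels rows by sign. The data and the label column are made up, and the optional missing argument (not discussed above) supplies a value for rows where the condition is NA.

```r
library(dplyr)
library(tibble)

x <- tibble(bar = c(-1, 0, 3, NA))

y <- x %>%
  mutate(label = if_else(bar > 0,
                         "positive",            # used where the condition is TRUE
                         "not positive",        # used where the condition is FALSE
                         missing = "unknown"))  # used where the condition is NA
```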

When there are multiple conditions, instead use the case_when method. Each argument to the method is an expression like foo ~ bar; the LHS is a logical expression and the RHS is the value to assign. For example:

> x %>%
    mutate(baz = case_when(bar>0 ~ 1,
                           bar==0 ~ 10,
                           TRUE ~ 100))

As demonstrated by this example, there should be a terminal fallback case; rows that match no condition otherwise receive NA. The arguments are evaluated in sequence, and a row's value is taken from the RHS of the first expression evaluating to TRUE.
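The first-match rule matters when conditions overlap. In this sketch, with made-up data and a hypothetical size column, a value of 5 satisfies both of the first two conditions but takes the value of the first.

```r
library(dplyr)
library(tibble)

x <- tibble(bar = c(5, 2, 0, -2))

y <- x %>%
  mutate(size = case_when(bar > 3 ~ "large",   # checked first
                          bar > 0 ~ "small",   # only reached if bar <= 3
                          TRUE ~ "none"))      # terminal fallback
```

Swapping the first two conditions would label every positive value "small", since bar > 0 would then match before bar > 3 is ever checked.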

Grouping

The group_by and rowwise methods transform a tibble into a grouped tibble; rowwise treats each row as its own group. Note that this does not re-order the underlying data. Note also that, by default, groups are ordered ascending.

The group_by function can take any number of grouping variables and also a few options:

Grouping changes how subsequent dplyr functions behave, so reset the tibble with the ungroup function once the grouped operations are done. Again, note that this does not re-order the underlying data.
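A short sketch of the group and ungroup cycle, using made-up data. It checks that row order is preserved and that ungroup removes the grouping.

```r
library(dplyr)
library(tibble)

a <- tibble(foo = c(2, 1, 2),
            bar = c("a", "b", "c"))

g <- a %>%
  group_by(foo) %>%
  mutate(groupsize = n(),      # number of rows in the current group
         seq = row_number())   # position of the row within its group

u <- g %>% ungroup()

is_grouped_df(g)   # TRUE
is_grouped_df(u)   # FALSE
g$foo              # still 2 1 2: grouping did not reorder the rows
```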

Reordering

The arrange method reorders rows in a tibble.

> x %>% arrange(foo, bar, desc(baz))

By default, rows are reordered without respect to groups. To instead use the grouping variables as higher-level sorting variables, specify .by_group = TRUE.
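The effect of grouping on arrange can be sketched with made-up data. The argument that makes arrange sort by the grouping variables first is .by_group.

```r
library(dplyr)
library(tibble)

x <- tibble(g = c("b", "a", "b", "a"),
            v = c(1, 4, 3, 2)) %>%
  group_by(g)

# Groups are ignored: rows are sorted by v alone
flat <- x %>% arrange(desc(v))

# Grouping variable g is sorted first, then v within each group
nested <- x %>% arrange(desc(v), .by_group = TRUE)

flat$v     # 4 3 2 1
nested$g   # "a" "a" "b" "b"
nested$v   # 4 2 3 1
```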

Forward Pipe

magrittr provides the forward pipe operator (%>%). This serves to forward the LHS as the first argument of the RHS. In other words, f(x) is equivalent to x %>% f().

There are also advanced features, such as forwarding into an explicit placeholder (.). For example, x %>% f(y, .) is equivalent to f(y, x).

In version 4.1 of R, a similar forward operator (|>) was introduced into the language. This behaves exactly the same in the simple case, but lacks the advanced features of magrittr. Until R 4.1 or later can be assumed everywhere the code will run, the recommendation is to use magrittr for portability.
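The following sketch contrasts the two pipes on made-up values. The last line uses the native pipe and therefore assumes R 4.1 or later; on older versions it does not parse.

```r
library(magrittr)

x <- c(1, 4, 9)

# The LHS becomes the first argument of the RHS call
r1 <- x %>% sqrt()              # same as sqrt(x)

# With the placeholder, the LHS goes where the dot is
r2 <- 2 %>% seq(1, 9, by = .)   # same as seq(1, 9, by = 2)

# Native pipe, simple case only (requires R >= 4.1)
r3 <- x |> sqrt()
```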

Curly Curly

rlang provides several functions for evaluating raw R syntax.

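More importantly, it provides the 'curly curly' operator ({{ }}). This serves for interpolation within dplyr methods: a column name passed as an argument to a wrapper function is used as a column inside that function.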

Consider this demo:

foo <- function(x, bar, baz) {
  x %>%
    group_by( {{ baz }} ) %>%
    summarize(maximum = max( {{ bar }} ))
}
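A usage sketch for the demo above. The definition of foo is repeated so the block runs on its own; the tibble d and its columns grp and val are made up for illustration.

```r
library(dplyr)
library(tibble)

# Same function as in the demo above
foo <- function(x, bar, baz) {
  x %>%
    group_by( {{ baz }} ) %>%
    summarize(maximum = max( {{ bar }} ))
}

d <- tibble(grp = c("a", "a", "b"),
            val = c(1, 5, 3))

# Column names are passed bare, as they would be to dplyr itself
res <- foo(d, bar = val, baz = grp)

res$grp       # "a" "b"
res$maximum   # 5 3
```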


See also

Tidyverse project reference

