Differences between revisions 2 and 6 (spanning 4 versions)

R Tidyverse

Tidyverse is a collection of R packages.

Installation

Install the tidyverse package from CRAN, for example with install.packages("tidyverse").

Usage

Tidyverse is a collection of packages such as ggplot2, tibble, readr, haven, etc.

These libraries serve as a modernized standard library. At the core are tibbles (a re-engineered data frame), magrittr (the forward pipe operator %>%), and dplyr. As an example:

> a <- tibble(foo = c(1,2,4),
              bar = c("a","b","c"))

> b <- a %>%
    group_by(foo) %>%
    mutate(x = foo+2,
           seq = row_number(),
           groupsize = n()) %>%
    ungroup()

> b %>%
    summarize(xbar = mean(x))

The primary dplyr method is mutate, which acts as a vectorized data step. Each argument to the method is an expression like foo = bar that evaluates to a column; the expressions are evaluated sequentially, so later ones can use columns created by earlier ones. Each expression either declares a new variable or overwrites an existing one. The RHS must be either a scalar value (in which case it is recycled for all rows) or a vector with exactly as many elements as the tibble (or the current group, if grouped) has rows.
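As a minimal sketch of these rules (the tibble and the column names y and const are hypothetical, chosen only for illustration):

```r
library(dplyr)

# Hypothetical data for illustration.
a <- tibble(foo = c(1, 2, 4))

b <- a %>%
  mutate(x = foo + 2,     # a vector as long as the tibble
         y = x * 10,      # may use x, which was created just above
         const = 0,       # a scalar, recycled to every row
         foo = foo / 2)   # overwrites the existing column
```

Because the expressions run in order, y sees the new x, while x and y were computed from the original foo before it was overwritten.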

Functions available within a mutate block that set a scalar value include:

n() returns the number of rows in the current group (or in the whole tibble, if ungrouped).
n_distinct(foo) returns the number of distinct values of foo.
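A small sketch of both functions on grouped data; the tibble d and its columns are made up for the example:

```r
library(dplyr)

# Hypothetical data: three rows in group "a", one row in group "b".
d <- tibble(grp = c("a", "a", "a", "b"),
            foo = c(1, 1, 3, 2))

out <- d %>%
  group_by(grp) %>%
  mutate(groupsize = n(),              # rows in the current group
         distinct = n_distinct(foo)) %>%  # distinct foo values in the group
  ungroup()
```

Each scalar result is recycled across the rows of its group.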

Vector functions available within a mutate block include:

row_number() creates the sequential row index.
lag(foo) accesses the lagged value of foo; the first row gets NA.
- For a lag other than 1, try lag(foo, n = 2).
- lead(foo) is the leading value equivalent.
- Note that these access values as they were when the expression began evaluating; they cannot refer back to values updated within the same expression.
cummean(foo) creates the running mean of foo.
ntile(foo, 10) assigns each row to one of n roughly equal-sized buckets ranked by foo (deciles in this case).
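The functions in this list can be tried side by side. This is only a sketch on made-up data, and it uses two buckets instead of ten so that the result is easy to read:

```r
library(dplyr)

# Hypothetical data for illustration.
d <- tibble(foo = c(10, 20, 30, 40))

out <- d %>%
  mutate(seq = row_number(),           # 1, 2, 3, 4
         prev = lag(foo),              # previous row's value, NA on the first row
         prev2 = lag(foo, n = 2),      # value from two rows back
         nxt = lead(foo),              # next row's value, NA on the last row
         running_mean = cummean(foo),  # mean of all rows so far
         half = ntile(foo, 2))         # bucket 1 or 2, ranked by foo
```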

Most mathematical functions from base R, such as sum(foo), are also useful.

Useful vector functions that actually come from base R include:

cumsum(foo) creates the running total of foo.
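As a sketch of how base R functions behave inside mutate (data and column names are hypothetical): sum returns a scalar that is recycled within each group, while cumsum returns a vector as long as the group.

```r
library(dplyr)

# Hypothetical data for illustration.
d <- tibble(grp = c("a", "a", "b", "b"),
            foo = c(1, 2, 3, 4))

out <- d %>%
  group_by(grp) %>%
  mutate(total = sum(foo),        # one value per group, recycled
         running = cumsum(foo),   # running total within the group
         share = foo / sum(foo)) %>%
  ungroup()
```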

To reference the data explicitly within a dplyr function such as mutate, use the .data pronoun. For example, .data[[varname]] looks up the column whose name is stored as a string in varname.
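A minimal sketch, assuming the column name arrives as a string in a variable (the names d, varname and doubled are hypothetical):

```r
library(dplyr)

# Hypothetical data for illustration.
d <- tibble(foo = c(1, 2, 4))

# The column to use is only known as a string.
varname <- "foo"

out <- d %>%
  mutate(doubled = .data[[varname]] * 2)
```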

Logic

To set values conditionally based on a binary condition, use the if_else method within the mutate block. This works like the ifelse function in base R, which is also vectorized, but it is stricter: the true and false values must have compatible types.
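A short sketch on made-up data. The missing argument, which sets the value used where the condition is NA, is part of if_else but is not discussed above:

```r
library(dplyr)

# Hypothetical data for illustration, including a missing value.
d <- tibble(bar = c(-1, 0, 2, NA))

out <- d %>%
  mutate(sign = if_else(bar > 0,
                        "positive",            # used where the condition is TRUE
                        "non-positive",        # used where the condition is FALSE
                        missing = "unknown"))  # used where the condition is NA
```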

When there are multiple conditions, instead use the case_when method. Each argument to the method is an expression like foo ~ bar; the LHS is a logical expression. For example:

> x %>%
    mutate(baz = case_when(bar>0 ~ 1,
                           bar==0 ~ 10,
                           TRUE ~ 100))

As demonstrated by this example, there should be a terminal fallback case; rows that match no condition are otherwise set to NA. The arguments are evaluated in sequence, and a row's value is taken from the RHS of the first expression evaluating to TRUE.
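The following sketch, on hypothetical data, contrasts a case_when call with and without a fallback, and shows that an earlier matching condition wins over a later one:

```r
library(dplyr)

# Hypothetical data for illustration.
x <- tibble(bar = c(5, 0, -3))

out <- x %>%
  mutate(with_fallback = case_when(bar > 0 ~ 1,
                                   bar == 0 ~ 10,
                                   TRUE ~ 100),
         # Without a fallback, unmatched rows become NA.
         no_fallback = case_when(bar > 0 ~ 1,
                                 bar == 0 ~ 10),
         # bar > 3 is never reached for 5, because bar > 0 matches first.
         first_wins = case_when(bar > 0 ~ "positive",
                                bar > 3 ~ "large",
                                TRUE ~ "other"))
```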

Grouping

The group_by and rowwise methods transform the tibble to be a grouped tibble. Note that this does not re-order the underlying data. Note also that, by default, groups are ordered ascending.

The group_by function can take any number of grouping variables and also a few options:

By default, any existing groupings are replaced. Specify .add = TRUE to retain them as higher level groups.
If a grouping variable is a factor, then defined levels that do not actually appear in the data may be respected. (By default they are dropped, unless the tibble was previously grouped with .drop = FALSE.) Specify .drop = TRUE to ignore levels that do not appear, or .drop = FALSE to respect them.
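Both options can be sketched on a small made-up tibble, where the factor f defines a level "c" that never appears in the data:

```r
library(dplyr)

# Hypothetical data: level "c" is defined but unused.
d <- tibble(f = factor(c("a", "a", "b"), levels = c("a", "b", "c")),
            g = c("x", "y", "x"),
            foo = c(1, 2, 3))

# A second group_by replaces the first grouping unless .add = TRUE.
replaced <- d %>% group_by(f) %>% group_by(g) %>% group_vars()
added <- d %>% group_by(f) %>% group_by(g, .add = TRUE) %>% group_vars()

# .drop controls whether the unused level "c" forms an (empty) group.
kept <- d %>% group_by(f, .drop = FALSE) %>% summarize(rows = n())
dropped <- d %>% group_by(f, .drop = TRUE) %>% summarize(rows = n())
```

Here kept has a row for "c" with a count of 0, while dropped has no such row.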

Grouping affects most dplyr functions, so reset a tibble with the ungroup function once grouped operations are done. Again, note that this does not re-order the underlying data.
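To illustrate why the reset matters, this sketch (hypothetical data) computes the same expression on a grouped and an ungrouped tibble:

```r
library(dplyr)

# Hypothetical data for illustration; note the rows are not sorted by grp.
d <- tibble(grp = c("b", "a", "b"),
            foo = c(1, 10, 3))

grouped <- d %>%
  group_by(grp) %>%
  mutate(m = mean(foo))     # mean within each group

reset <- grouped %>%
  ungroup() %>%
  mutate(m = mean(foo))     # mean over all rows
```

In both results the rows stay in their original order ("b", "a", "b").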

Reordering

The arrange method reorders rows in a tibble.

> x %>% arrange(foo, bar, desc(baz))

By default, cases are reordered without respect to groups. To instead use grouping variables as higher level sorting variables, specify .by_group = TRUE.
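A sketch of the difference on a small grouped tibble (the data are made up for the example):

```r
library(dplyr)

# Hypothetical grouped data for illustration.
x <- tibble(grp = c("b", "a", "b", "a"),
            foo = c(1, 4, 3, 2)) %>%
  group_by(grp)

# Groups are ignored: rows are sorted by foo alone.
plain <- x %>% arrange(foo)

# Groups are sorted first, then foo within each group.
by_group <- x %>% arrange(foo, .by_group = TRUE)

# desc() reverses the order for one variable.
descending <- x %>% arrange(desc(foo))
```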

Forward Pipe

magrittr provides the forward pipe operator (%>%). This serves to forward the LHS as the first argument of the RHS. In other words, f(x) is equivalent to x %>% f().

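There are also advanced features, like forwarding into an explicit alias (.):

f(1, x) is equivalent to x %>% f(1, .);
f(x$foo, x$bar) is equivalent to x %>% { f(.$foo, .$bar) }. The braces are needed because, when the alias appears only inside nested calls such as .$foo, magrittr still inserts the LHS as the first argument.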
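These forms can be checked with a toy function. In this sketch f and x are hypothetical; the braces around the last RHS make magrittr evaluate it as an expression with . bound to the LHS, rather than also inserting the LHS as the first argument:

```r
library(magrittr)

# Hypothetical two-argument function and list for illustration.
f <- function(a, b) a - b
x <- list(foo = 2, bar = 5)

first_arg <- 10 %>% f(1)               # same as f(10, 1)
explicit <- 10 %>% f(1, .)             # same as f(1, 10)
braced <- x %>% { f(.$foo, .$bar) }    # same as f(x$foo, x$bar)
```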

In version 4.1 of R, a similar forward operator (|>) was introduced into the language. This behaves the same in the simple case, but lacks the advanced features of magrittr. Until R 4.1 is old enough that requiring it is no longer a meaningful constraint, the recommendation is to use magrittr for portability.
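For the simple case, the two operators can be compared directly. This sketch uses made-up data and only parses on R 4.1 or later:

```r
library(dplyr)

# Hypothetical data for illustration.
d <- tibble(foo = c(1, 2, 4))

with_magrittr <- d %>% mutate(x = foo + 2)
with_native <- d |> mutate(x = foo + 2)   # requires R >= 4.1

same <- identical(with_magrittr, with_native)
```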

Curly Curly

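rlang provides several functions for capturing and evaluating unevaluated R expressions.

More importantly, it provides the 'curly curly' operator ({{ }}). This interpolates a function's argument into a dplyr method, so that a column name passed unquoted by the caller is used as if it had been written there directly.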

Consider this demo:

foo <- function(x, bar, baz) {
  x %>%
    group_by( {{ baz }} ) %>%
    summarize(maximum = max( {{ bar }} ))
}
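As a usage sketch, the demo function can be called with bare column names. The tibble d and its columns grp and val are hypothetical:

```r
library(dplyr)

# The demo function from above, repeated so the example is self-contained.
foo <- function(x, bar, baz) {
  x %>%
    group_by( {{ baz }} ) %>%
    summarize(maximum = max( {{ bar }} ))
}

# Hypothetical data for illustration.
d <- tibble(grp = c("a", "a", "b"),
            val = c(1, 5, 3))

# Column names are passed unquoted, as in any dplyr call.
out <- foo(d, bar = val, baz = grp)
```

The result has one row per value of grp, with the largest val in the maximum column.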

-  ⇤ ← Revision 2 as of 2023-09-10 17:20:39 → 
  Size: 501
  Editor: DominicRicottone
  Comment:
+   ← Revision 6 as of 2026-05-15 21:10:11 → ⇥
  Size: 5400
  Editor: DominicRicottone
  Comment: Note
 Line 21:
-Tidyverse includes a variety of packages, including [[R/Ggplot2|ggplot2]], [[R/Dplyr|dplyr]], [[R/Tidyr|tidyr]], [[R/Readr|readr]], and [[R/Tibble|tibble]].
+Tidyverse is a collection of packages such as [[R/Ggplot2|ggplot2]], [[R/Tibble|tibble]], [[R/Readr|readr]], [[R/Haven|haven]], etc.

These libraries serve as a modernized standard library. At the core are tibbles (a re-engineered [[R/DataFrames|data frame]]), `magrittr` (the forward pipe operator `%>%`), and `dplyr`. As an example:

{{{
> a <- tibble(foo = c(1,2,4),
              bar = c("a","b","c"))

> b <- a %>%
    group_by(foo) %>%
    mutate(x = foo+2,
           seq = row_number(),
           groupsize = n()) %>%
    ungroup()

> b %>%
    summarize(xbar = mean(x))
}}}

The primary `dplyr` method is `mutate`, which acts as a vectorized data step. Each argument to the method is an expression like `foo = bar` that evaluates to a column; the expressions are evaluated sequentially, so later ones can use columns created by earlier ones. Each expression either declares a new variable or overwrites an existing one. The RHS must be either a scalar value (in which case it is recycled for all rows) or a vector with exactly as many elements as the tibble (or the current group, if grouped) has rows.

Functions available within a `mutate` block that set a scalar value include:
 * `n()` returns the number of rows in the current group (or in the whole tibble, if ungrouped).
 * `n_distinct(foo)` returns the number of distinct values of `foo`.

Vector functions available within a `mutate` block include:
 * `row_number()` creates the sequential row index.
 * `lag(foo)` accesses the lagged value of `foo`; the first row gets `NA`.
   * For a lag other than 1, try `lag(foo, n = 2)`.
   * `lead(foo)` is the leading value equivalent.
   * Note that these access values as they were when the expression began evaluating; they cannot refer back to values updated within the same expression.
 * `cummean(foo)` creates the running mean of `foo`.
 * `ntile(foo, 10)` assigns each row to one of n roughly equal-sized buckets ranked by `foo` (deciles in this case).

Most mathematical functions from base R, such as `sum(foo)`, are also useful.

Useful vector functions that actually come from base R include:
 * `cumsum(foo)` creates the running total of `foo`.

To reference the data explicitly within a `dplyr` function such as `mutate`, use the `.data` pronoun. For example, `.data[[varname]]` looks up the column whose name is stored as a string in `varname`.



=== Logic ===

To set values conditionally based on a binary condition, use the `if_else` method within the `mutate` block. This works like the `ifelse` function in base R, which is also vectorized, but it is stricter: the true and false values must have compatible types.

When there are multiple conditions, instead use the `case_when` method. Each argument to the method is an expression like `foo ~ bar`; the LHS is a logical expression. For example:

{{{
> x %>%
    mutate(baz = case_when(bar>0 ~ 1,
                           bar==0 ~ 10,
                           TRUE ~ 100))
}}}

As demonstrated by this example, there should be a terminal fallback case; rows that match no condition are otherwise set to `NA`. The arguments are evaluated in sequence, and a row's value is taken from the RHS of the first expression evaluating to `TRUE`.



=== Grouping ===

The `group_by` and `rowwise` methods transform the tibble to be a grouped tibble. Note that this does not re-order the underlying data. Note also that, by default, groups are ordered ascending.

The `group_by` function can take any number of grouping variables and also a few options:
 * By default, any existing groupings are replaced. Specify `.add = TRUE` to retain them as higher level groups.
 * If a grouping variable is a [[R/DataTypes#Factors|factor]], then defined levels that do not actually appear in the data may be respected. (By default they are dropped, unless the tibble was previously grouped with `.drop = FALSE`.) Specify `.drop = TRUE` to ignore levels that do not appear, or `.drop = FALSE` to respect them.

Grouping affects most `dplyr` functions, so reset a tibble with the `ungroup` function once grouped operations are done. Again, note that this does not re-order the underlying data.



=== Reordering ===

The `arrange` method reorders rows in a tibble.

{{{
> x %>% arrange(foo, bar, desc(baz))
}}}

By default, cases are reordered without respect to groups. To instead use grouping variables as higher level sorting variables, specify `.by_group = TRUE`.



=== Forward Pipe ===

`magrittr` provides the forward pipe operator (`%>%`). This serves to forward the LHS as the first argument of the RHS. In other words, `f(x)` is equivalent to `x %>% f()`.

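There are also advanced features, like forwarding into an explicit alias (`.`):
 * `f(1, x)` is equivalent to `x %>% f(1, .)`;
 * `f(x$foo, x$bar)` is equivalent to `x %>% { f(.$foo, .$bar) }`. The braces are needed because, when the alias appears only inside nested calls such as `.$foo`, `magrittr` still inserts the LHS as the first argument.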

In version 4.1 of R, a similar forward operator (`|>`) was introduced into the language. This behaves the same in the simple case, but lacks the advanced features of `magrittr`. Until R 4.1 is old enough that requiring it is no longer a meaningful constraint, the recommendation is to use `magrittr` for portability.



=== Curly Curly ===

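`rlang` provides several functions for capturing and evaluating unevaluated R expressions.

More importantly, it provides the 'curly curly' operator (`{{ }}`). This interpolates a function's argument into a `dplyr` method, so that a column name passed unquoted by the caller is used as if it had been written there directly.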

Consider this demo:

{{{
foo <- function(x, bar, baz) {
  x %>%
    group_by( {{ baz }} ) %>%
    summarize(maximum = max( {{ bar }} ))
}
}}}

Diff for "R/Tidyverse"