R Tidyverse

Tidyverse is a collection of R packages.


Installation

Install the tidyverse package from CRAN by running install.packages("tidyverse") in an R session.


Usage

The collection includes packages such as ggplot2, tibble, readr, and haven.

These libraries serve as a modernized standard library. At the core are tibbles (a re-engineered data frame), magrittr (the forward pipe operator %>%), and dplyr. As an example:

> a <- tibble(foo = c(1,2,4),
              bar = c("a","b","c"))

> b <- a %>%
    group_by(foo) %>%
    mutate(x = foo+2,
           seq = row_number(),   # index within each group
           groupsize = n()) %>%  # number of rows in each group
    ungroup()

> b %>%
    summarize(xbar = mean(x))

The primary dplyr method is mutate, which is a vectorized data step. Each argument to the method is an expression like foo = bar that evaluates to a column; the expressions are evaluated sequentially, so a later expression can use a column set by an earlier one. Variables are either declared or overwritten. The RHS must either be a scalar value (in which case it is recycled for all rows) or a vector exactly as long as the tibble (or the current group, if grouped) has rows.
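The sequential evaluation and recycling rules can be seen in a small sketch. The column names one, double, and quad are hypothetical and chosen only for illustration.

```r
library(dplyr)
library(tibble)

# Hypothetical data, for illustration only
a <- tibble(foo = c(1, 2, 4))

b <- a %>%
  mutate(one = 1,            # scalar RHS, recycled to every row
         double = foo * 2,   # vector RHS, one value per row
         quad = double * 2)  # uses a column declared just above
```

Here quad can refer to double only because double appears earlier in the same call.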

Functions available within a mutate block that set a scalar value include:

  • n() returns the number of rows in the current group (or in the whole tibble if it is not grouped).

  • n_distinct(foo) returns the number of distinct values of foo.

Vector functions available within a mutate block include:

  • row_number() creates the sequential row index.

  • lag(foo) accesses the lagged value of foo; the first row accesses NA.

    • For a lag other than 1, try lag(foo, n = 2).

    • lead(foo) is the leading value equivalent.

    • Note that these access values as they were when the expression began evaluating; they cannot reference values updated by the same expression.
  • cummean(foo) creates the running mean of foo.

  • ntile(foo, 10) assigns each row to an n-tile (a decile in this case) of foo.

Most mathematical functions from base R, such as sum(foo), are also usable within a mutate block.

Useful vector functions that actually come from base R include:

  • cumsum(foo) creates the running total of foo.
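As a rough illustration of the functions listed above, the following sketch applies several of them to an ungrouped tibble. The data and the output column names are hypothetical.

```r
library(dplyr)
library(tibble)

# Hypothetical data, for illustration only
a <- tibble(foo = c(1, 2, 4))

b <- a %>%
  mutate(idx = row_number(),     # 1, 2, 3
         size = n(),             # 3, recycled to every row
         prev = lag(foo),        # NA, 1, 2
         nxt = lead(foo),        # 2, 4, NA
         running = cumsum(foo))  # 1, 3, 7
```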

To reference the columns of the tibble explicitly within a dplyr method, use the .data pronoun. For example, .data[[varname]] accesses the column whose name is stored as a string in varname.
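A minimal sketch of the .data pronoun, assuming a hypothetical variable varname that holds a column name as a string:

```r
library(dplyr)
library(tibble)

# Hypothetical data, for illustration only
a <- tibble(foo = c(1, 2, 4))

# Hypothetical: the column name is only known as a string
varname <- "foo"

b <- a %>%
  mutate(x = .data[[varname]] + 2)  # same as foo + 2 here
```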

Logic

To set values conditionally based on a binary condition, use the if_else method within the mutate block. This works like the ifelse function in base R (both are vectorized), but it is stricter: the two outcomes must have compatible types.
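A short sketch of if_else inside a mutate block. The tibble x and the new column sign are hypothetical.

```r
library(dplyr)
library(tibble)

# Hypothetical data, for illustration only
x <- tibble(bar = c(-1, 0, 3))

y <- x %>%
  mutate(sign = if_else(bar > 0, "positive", "not positive"))
# Both outcomes are character strings, so the types are compatible.
```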

When there are multiple conditions, instead use the case_when method. Each argument to the method is an expression like foo ~ bar; the LHS is a logical expression. For example:

> x %>%
    mutate(baz = case_when(bar>0 ~ 1,
                           bar==0 ~ 10,
                           TRUE ~ 100))

As demonstrated by this example, there should be a terminal fallback case. The arguments are evaluated in sequence, and a row's value is taken from the RHS of the first expression whose LHS evaluates to TRUE. Rows that match no condition are set to NA.
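To make the evaluation order concrete, here is a runnable version of the same idea. The data and the column name baz are hypothetical.

```r
library(dplyr)
library(tibble)

# Hypothetical data, for illustration only
x <- tibble(bar = c(-1, 0, 3))

y <- x %>%
  mutate(baz = case_when(bar > 0  ~ 1,    # checked first
                         bar == 0 ~ 10,   # checked second
                         TRUE     ~ 100)) # fallback for everything else
# baz is 100, 10, 1
```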

Grouping

The group_by and rowwise methods transform the tibble to be a grouped tibble. Note that this does not re-order the underlying data. Note also that, by default, groups are ordered ascending.

The group_by function can take any number of grouping variables and also a few options:

  • By default, any existing groupings are replaced. Specify .add = TRUE to retain them as higher level groups.

  • If a grouping variable is a factor, then defined levels that do not actually appear in the data may be respected. By default they are ignored, unless the tibble was previously grouped with .drop = FALSE. Specify .drop = TRUE to ignore levels that do not appear, or .drop = FALSE to respect them.

Grouping affects all dplyr functions, so reset a tibble with the ungroup function once the grouped operations are done. Again, note that this does not re-order the underlying data.
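The following sketch shows grouped evaluation, the .add option, and ungroup. The tibble and its columns g, h, and v are hypothetical.

```r
library(dplyr)
library(tibble)

# Hypothetical data, for illustration only
x <- tibble(g = c("b", "a", "b"),
            h = c(1, 1, 2),
            v = c(10, 20, 30))

# n() and sum() are evaluated per group; row order is unchanged
by_g <- x %>%
  group_by(g) %>%
  mutate(groupsize = n(),
         total = sum(v))

# .add = TRUE keeps g as the higher level group
by_gh <- x %>%
  group_by(g) %>%
  group_by(h, .add = TRUE)

# ungroup() removes all grouping
flat <- by_gh %>% ungroup()
```

Calling group_vars(by_gh) is one way to check which grouping variables are currently in effect.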

Reordering

The arrange method reorders rows in a tibble.

> x %>% arrange(foo, bar, desc(baz))

By default, cases are reordered without respect to groups. To instead use grouping variables as higher level sorting variables, specify .by_group = TRUE.
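The difference between the two behaviors can be sketched on a small grouped tibble. The data and column names are hypothetical.

```r
library(dplyr)
library(tibble)

# Hypothetical data, for illustration only
x <- tibble(g = c("b", "a", "b", "a"),
            v = c(1, 4, 3, 2)) %>%
  group_by(g)

# Groups are ignored: v is 4, 3, 2, 1
plain <- x %>% arrange(desc(v))

# Sorted by g first, then by descending v: v is 4, 2, 3, 1
grouped <- x %>% arrange(desc(v), .by_group = TRUE)
```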

Forward Pipe

magrittr provides the forward pipe operator (%>%). This serves to forward the LHS as the first argument of the RHS. In other words, f(x) is equivalent to x %>% f().

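There are also advanced features like forwarding into an explicit alias (.):

  • f(1, x) is equivalent to x %>% f(1, .);

  • f(x$foo, x$bar) is equivalent to x %>% { f(.$foo, .$bar) }. The braces are needed because a dot that appears only inside nested expressions does not stop x from also being forwarded as the first argument.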

In version 4.1 of R, a similar forward operator (|>) was introduced into the language. This behaves exactly the same in the simple case, but lacks the advanced features of magrittr. Until 4.1 is old enough that requiring it is no longer a practical constraint, the recommendation is to use magrittr for portability.
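The forwarding rules for the magrittr pipe can be checked with a small sketch. The function f and the values below are hypothetical and exist only to make the argument positions visible.

```r
library(magrittr)

# Hypothetical two-argument function: order of arguments matters
f <- function(a, b) a - b

x <- 10

first <- x %>% f(1)      # same as f(x, 1), which is 9
second <- x %>% f(1, .)  # same as f(1, x), which is -9

# Hypothetical data frame with columns foo and bar
d <- data.frame(foo = 3, bar = 2)

# Braces stop d from also being forwarded as the first argument
third <- d %>% { f(.$foo, .$bar) }  # same as f(d$foo, d$bar), which is 1
```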

Curly Curly

rlang provides several functions for evaluating raw R syntax.

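More importantly, it provides the 'curly curly' operator ({{ }}). This serves for interpolation within dplyr methods: wrapping a function argument in {{ }} passes the column named by the caller through to the dplyr method.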

Consider this demo:

foo <- function(x, bar, baz) {
  x %>%
    group_by( {{ baz }} ) %>%
    summarize(maximum = max( {{ bar }} ))
}
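A usage sketch for the demo above, repeating its definition so that the block is self-contained. The tibble a and its columns grp and val are hypothetical.

```r
library(dplyr)
library(tibble)

# Same function as in the demo
foo <- function(x, bar, baz) {
  x %>%
    group_by( {{ baz }} ) %>%
    summarize(maximum = max( {{ bar }} ))
}

# Hypothetical data, for illustration only
a <- tibble(grp = c("a", "a", "b"),
            val = c(1, 5, 3))

# Column names are passed unquoted, as in a direct dplyr call
res <- foo(a, bar = val, baz = grp)
# res has one row per grp: maximum is 5 for "a" and 3 for "b"
```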


See also

Tidyverse project reference


CategoryRicottone

R/Tidyverse (last edited 2026-05-21 15:53:53 by DominicRicottone)