R Tidyverse
Tidyverse is a collection of R packages.
Installation
Install the tidyverse package from CRAN, for example with install.packages("tidyverse").
Usage
The tidyverse includes packages such as ggplot2, tibble, readr, and haven.
These packages serve as a modernized standard library. At the core are tibble (a re-engineered data frame), magrittr (the forward pipe operator %>%), and dplyr (data manipulation verbs). As an example:
> a <- tibble(foo = c(1, 2, 4),
              bar = c("a", "b", "c"))
> b <- a %>%
    group_by(foo) %>%
    mutate(x = foo + 2,
           seq = row_number(),   # row index within each group
           groupsize = n()) %>%  # number of rows in each group
    ungroup()
> b %>%
    summarize(xbar = mean(x))
The primary dplyr method is mutate, which is a vectorized data step. Each argument to the method is an expression like foo = bar that evaluates to a column; the expressions are evaluated sequentially, so a later expression can use a column set by an earlier one. Each expression either declares a new variable or overwrites an existing one. The RHS must be either a scalar value (in which case it is recycled for all rows) or a vector with exactly as many elements as the tibble (or, when grouped, the current group) has rows.
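The sequential evaluation and scalar recycling described above can be sketched as follows. This is a minimal illustration, not a canonical recipe; the tibble d and the column names y and k are made up for the example.

```r
library(dplyr)
library(tibble)

# Hypothetical data for illustration
d <- tibble(foo = c(1, 2, 4))

out <- d %>%
  mutate(x = foo + 2,    # declares a new column, one value per row
         y = x * 10,     # may use x, which was set just above
         foo = foo / 2,  # overwrites the existing column
         k = 0)          # scalar, recycled to every row
```

Because x is already defined when y is evaluated, the order of the arguments matters.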
Functions available within a mutate block that set a scalar value include:
n() returns the number of cases.
sum(foo) returns the total of foo across all cases.
Vector functions available within a mutate block include:
row_number() creates the sequential row index.
lag(foo) accesses the lagged value of foo; the first case accesses NA.
For a lag other than 1, try lag(foo, n = 2).
lead(foo) is the leading value equivalent.
- Note that these access values as they were when the expression began evaluating; they do not see values updated by the same expression.
cummean(foo) creates the running mean of foo.
Useful vector functions that actually come from base R include:
cumsum(foo) creates the running total of foo.
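As a rough sketch, the scalar and vector functions listed above can be tried together on a small ungrouped tibble. The data and the output column names are assumptions chosen for the example.

```r
library(dplyr)
library(tibble)

# Hypothetical data for illustration
d <- tibble(foo = c(1, 2, 4))

out <- d %>%
  mutate(cases    = n(),               # scalar, recycled: 3 3 3
         total    = sum(foo),          # scalar, recycled: 7 7 7
         idx      = row_number(),      # 1 2 3
         previous = lag(foo),          # NA 1 2
         two_back = lag(foo, n = 2),   # NA NA 1
         nxt      = lead(foo),         # 2 4 NA
         run_mean = cummean(foo),      # running mean of foo
         run_sum  = cumsum(foo))       # 1 3 7
```

On a grouped tibble, the same calls would be evaluated once per group.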
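To reference the tibble's columns explicitly within a mutate block (or another dplyr method), the data is always available through the .data pronoun. For example, .data[[varname]] accesses the column whose name is stored as a string in varname.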
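A minimal sketch of the .data[[varname]] form, assuming the column name is only known at run time. The tibble d and the variable varname are hypothetical.

```r
library(dplyr)
library(tibble)

# Hypothetical data; varname stands in for a name chosen at run time
d <- tibble(foo = c(1, 2, 4), bar = c(10, 20, 40))
varname <- "bar"

out <- d %>%
  mutate(doubled = .data[[varname]] * 2)  # looks up the column named "bar"
```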
Logic
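To operate conditionally within a data step based on a binary condition, use the if_else method within the mutate block. This works like the ifelse function in base R, which is also vectorized, but is stricter: the true and false values must have compatible types, and a separate missing argument sets the value used where the condition is NA.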
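For illustration, a small sketch of if_else inside mutate, including the missing argument. The data and labels are invented for the example.

```r
library(dplyr)
library(tibble)

# Hypothetical data, including an NA to show the missing argument
d <- tibble(bar = c(-1, 0, 3, NA))

out <- d %>%
  mutate(label = if_else(bar > 0,
                         "positive",            # used where the condition is TRUE
                         "non-positive",        # used where the condition is FALSE
                         missing = "unknown"))  # used where the condition is NA
```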
When there are multiple conditions, instead use the case_when method. Each argument to the method is an expression like foo ~ bar; the LHS is a logical expression. For example:
> x %>%
    mutate(baz = case_when(bar > 0 ~ 1,
                           bar == 0 ~ 10,
                           TRUE ~ 100))
As demonstrated by this example, there should be a terminal fallback case; a row that matches no condition otherwise gets NA. The arguments are evaluated in sequence, and a row's value is taken from the RHS of the first expression evaluating to TRUE.
Grouping
The group_by and rowwise methods transform the tibble to be a grouped tibble. Note that this does not re-order the underlying data. Note also that, by default, groups are ordered ascending.
The group_by function can take any number of grouping variables and also a few options:
By default, any existing groupings are replaced. Specify .add = TRUE to retain them as higher level groups.
If a grouping variable is a factor, then defined levels that do not actually appear in the data may be respected. (By default they are dropped, unless the tibble was previously grouped with .drop = FALSE.) Specify .drop = TRUE to ignore levels that do not appear, or .drop = FALSE to respect them.
Grouping affects all subsequent dplyr functions, so reset the tibble with the ungroup function once the grouped operations are done. Again, note that this does not re-order the underlying data.
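The grouping options above might be exercised as in the following sketch. The tibble d, its columns, and the unused factor level "c" are assumptions made for the example.

```r
library(dplyr)
library(tibble)

# Hypothetical data: factor g defines a level "c" that never appears
d <- tibble(g = factor(c("a", "a", "b"), levels = c("a", "b", "c")),
            h = c("x", "y", "x"),
            v = c(1, 2, 4))

by_g  <- d %>% group_by(g)
by_gh <- by_g %>% group_by(h, .add = TRUE)  # keeps g, adds h below it
group_vars(by_gh)

# .drop = FALSE keeps the empty level "c" as a group of size zero
kept <- d %>%
  group_by(g, .drop = FALSE) %>%
  summarize(n = n())

# Per-group totals, then reset so later steps are not grouped
totals <- by_g %>%
  mutate(total = sum(v)) %>%
  ungroup()
```

Without .add = TRUE, the second group_by call would replace g with h rather than nest h under it.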
Reordering
The arrange method reorders rows in a tibble.
> x %>% arrange(foo, bar, desc(baz))
By default, cases are reordered without respect to groups. To instead use grouping variables as higher level sorting variables, specify .by_group = TRUE.
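A short sketch of how arrange treats a grouped tibble with and without group-aware sorting. In dplyr the argument is spelled .by_group; the data here is hypothetical.

```r
library(dplyr)
library(tibble)

# Hypothetical grouped data
d <- tibble(g = c("b", "a", "b", "a"),
            v = c(1, 4, 3, 2)) %>%
  group_by(g)

plain  <- d %>% arrange(v)                    # groups ignored: v is 1 2 3 4
within <- d %>% arrange(v, .by_group = TRUE)  # sorted by g first, then v
```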
Forward Pipe
magrittr provides the forward pipe operator (%>%). This serves to forward the LHS as the first argument of the RHS. In other words, f(x) is equivalent to x %>% f().
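There are also advanced features like forwarding into an explicit alias (.):
f(1, x) is equivalent to x %>% f(1, .);
f(x$foo, x$bar) is equivalent to x %>% {f(.$foo, .$bar)}. The braces are needed because a dot that appears only inside a nested call such as .$foo does not stop the LHS from also being passed as the first argument.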
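The forwarding rules can be checked with a toy function. This is only a sketch; f and x are hypothetical names, and the braces in the last line keep magrittr from also inserting x as the first argument.

```r
library(magrittr)

# Hypothetical helper and data for illustration
f <- function(a, b) a - b
x <- list(foo = 2, bar = 5)

r1 <- 10 %>% f(1)                # same as f(10, 1)
r2 <- 10 %>% f(1, .)             # same as f(1, 10)
r3 <- x %>% { f(.$foo, .$bar) }  # same as f(x$foo, x$bar)
```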
In version 4.1 of R, a similar forward operator (|>) was introduced into the language. This behaves the same in the simple case, but lacks the advanced features of magrittr. Until R 4.1 is old enough that requiring it is no longer a practical constraint, the recommendation is to use magrittr for portability.
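In the simple case the two operators can be compared side by side. The sketch below assumes R 4.1 or later (it will not parse on older versions), and sq is a hypothetical helper.

```r
library(magrittr)

# Hypothetical helper for illustration
sq <- function(v) v^2

native <- c(1, 2, 3) |> sq() |> sum()     # native pipe, requires R >= 4.1
piped  <- c(1, 2, 3) %>% sq() %>% sum()   # magrittr pipe
```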
