R Tidyverse
Tidyverse is a collection of R packages.
Installation
Install the tidyverse package from CRAN, for example with install.packages("tidyverse").
Usage
The tidyverse includes packages such as ggplot2, tibble, readr, and haven.
These libraries serve as a modernized standard library. At the core are tibbles (a re-engineered data frame), magrittr (the forward pipe operator %>%), and dplyr. As an example:
> library(tidyverse)
> a <- tibble(foo = c(1,2,4),
              bar = c("a","b","c"))
> b <- a %>%
    group_by(foo) %>%
    mutate(x = foo+2,
           seq = row_number(),
           groupsize = n()) %>%
    ungroup()
> b %>%
    summarize(xbar = mean(x))
The primary dplyr method is mutate, which is a vectorized data step. Each argument to the method is an expression like foo = bar that evaluates to a column. The expressions are evaluated in order, so a later expression can use a column created by an earlier one. Each expression either creates a new column or overwrites an existing one. The RHS must be either a scalar value (in which case it is recycled for all rows) or a vector exactly as long as the tibble (or, for a grouped tibble, the current group) has rows.
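As a minimal sketch of the evaluation rules described above (the column names y and flag are made up for illustration), the following shows sequential evaluation, scalar recycling, and overwriting in one mutate call:

```r
library(dplyr)

a <- tibble(foo = c(1, 2, 4))

b <- a %>%
  mutate(x    = foo + 2,   # a vector as long as the tibble
         y    = x * 10,    # uses x, created one expression earlier
         flag = TRUE,      # a scalar, recycled to every row
         foo  = foo / 2)   # overwrites the existing column

print(b)
```

A RHS of any other length (for example c(1, 2) here) is rejected with an error rather than silently recycled.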
Functions available within a mutate block that set a scalar value include:
 * n() returns the number of rows (in the current group, if the tibble is grouped).
 * n_distinct(foo) returns the number of distinct values of foo.
Vector functions available within a mutate block include:
 * row_number() creates the sequential row index.
 * lag(foo) accesses the lagged value of foo; the first row gets NA.
   * For a lag other than 1, try lag(foo, n = 2).
 * lead(foo) is the leading value equivalent.
   * Note that lag and lead read the values of foo as they were when the expression began evaluating; an expression cannot refer to values it has itself just updated.
 * cummean(foo) creates the running mean of foo.
 * ntile(foo, 10) assigns each row to one of n roughly equal-sized buckets ranked by foo (deciles in this case).
Most mathematical functions from base R, such as sum(foo), are also useful.
Useful vector functions that actually come from base R include:
 * cumsum(foo) creates the running total of foo.
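The functions listed above can be tried side by side on a small tibble. This is only an illustrative sketch; the data and the output column names are invented:

```r
library(dplyr)

d <- tibble(foo = c(3, 1, 4, 1, 5))

out <- d %>%
  mutate(rows     = n(),              # scalar: number of rows
         distinct = n_distinct(foo),  # scalar: number of distinct values
         idx      = row_number(),     # 1, 2, 3, ...
         prev     = lag(foo),         # previous row's value, NA in row 1
         prev2    = lag(foo, n = 2),  # value two rows back
         nxt      = lead(foo),        # next row's value, NA in last row
         total    = cumsum(foo),      # running total (base R)
         avg      = cummean(foo),     # running mean
         half     = ntile(foo, 2))    # bucket 1 or 2, ranked by foo

print(out)
```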
To reference the tibble explicitly within a pipeline, use the .data pronoun, which is available inside dplyr methods such as mutate. For example, .data[[varname]] looks up the column whose name is stored as a string in varname.
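A short sketch of the .data pronoun follows. The variable varname and the column names are assumptions chosen for the example:

```r
library(dplyr)

d <- tibble(foo = c(1, 2, 4), bar = c(10, 20, 40))
varname <- "bar"   # hypothetical: a column name held in a string

out <- d %>%
  mutate(doubled = .data[[varname]] * 2,      # column chosen at run time
         total   = .data$foo + .data$bar)     # explicit reference by name

print(out)
```

The .data[[varname]] form is mostly useful inside functions or loops, where the column to use is not known when the code is written.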
Logic
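To set a value conditionally based on a binary condition, use the if_else method within the mutate block. This works like the ifelse function in base R (both are vectorized), but it is stricter: the two outcomes must have compatible types, and missing values in the condition can be handled explicitly.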
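As an illustration of if_else inside mutate (the data and labels are invented), including the optional missing argument for rows where the condition is NA:

```r
library(dplyr)

d <- tibble(bar = c(-1, 0, 2, NA))

out <- d %>%
  mutate(label  = if_else(bar > 0, "positive", "non-positive"),
         label2 = if_else(bar > 0, "positive", "non-positive",
                          missing = "unknown"))  # used where bar > 0 is NA

print(out)
```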
When there are multiple conditions, instead use the case_when method. Each argument to the method is an expression like foo ~ bar; the LHS is a logical expression. For example:
> x %>%
    mutate(baz = case_when(bar>0 ~ 1,
                           bar==0 ~ 10,
                           TRUE ~ 100))
As demonstrated by this example, there should be a terminal fallback case. The arguments are evaluated in sequence, and a row's value is taken from the RHS of the first expression whose LHS evaluates to TRUE.
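Because the first matching condition wins, the order of the arguments matters. The sketch below (with made-up data and labels) puts the narrower condition bar > 10 before the broader bar > 0; reversing them would label every positive value "small":

```r
library(dplyr)

d <- tibble(bar = c(-3, 0, 2, 50))

out <- d %>%
  mutate(size = case_when(bar > 10 ~ "large",     # checked first
                          bar > 0  ~ "small",     # only reached if bar <= 10
                          bar == 0 ~ "zero",
                          TRUE     ~ "negative")) # terminal fallback

print(out)
```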
Grouping
The group_by and rowwise methods transform the tibble into a grouped tibble (with rowwise, each row forms its own group). Note that this does not reorder the underlying data. Note also that, by default, the groups themselves are ordered ascending by the grouping variables.
The group_by function can take any number of grouping variables and also a few options:
 * By default, any existing groupings are replaced. Specify .add = TRUE to retain them as higher level groups.
 * If a grouping variable is a factor, then defined levels that do not actually appear in the data may be respected. (By default they are dropped, unless the tibble was previously grouped with .drop = FALSE.) Specify .drop = TRUE to ignore levels that do not appear, or .drop = FALSE to respect them.
Grouping affects subsequent dplyr functions, so reset the tibble with the ungroup function once the grouped operations are done. Again, note that this does not reorder the underlying data.
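The grouping options above can be sketched as follows. The tibble, its columns, and the unused factor level "c" are all invented for the example:

```r
library(dplyr)

d <- tibble(g = factor(c("b", "a", "b"), levels = c("a", "b", "c")),
            h = c("x", "x", "y"),
            v = c(1, 2, 3))

by_g  <- d %>% group_by(g)
by_gh <- by_g %>% group_by(h, .add = TRUE)   # keeps g, adds h beneath it
print(group_vars(by_gh))

# .drop = FALSE keeps the empty level "c" as a group of size 0
kept <- d %>%
  group_by(g, .drop = FALSE) %>%
  summarize(n = n())
print(kept)

# Grouped mutate, then reset; the row order of d is unchanged
flat <- by_g %>%
  mutate(share = v / sum(v)) %>%
  ungroup()
print(flat)
```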
Reordering
The arrange method reorders rows in a tibble.
> x %>% arrange(foo, bar, desc(baz))
By default, cases are reordered without respect to groups. To instead use grouping variables as higher level sorting variables, specify .by_group = TRUE.
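A small sketch of the difference, using an invented grouped tibble:

```r
library(dplyr)

d <- tibble(g = c("b", "a", "b", "a"), v = c(1, 4, 3, 2))
grouped <- d %>% group_by(g)

# Groups are ignored: rows are sorted by v alone
plain <- grouped %>% arrange(desc(v))

# Grouping variable g is sorted first, then v within each group
within <- grouped %>% arrange(desc(v), .by_group = TRUE)

print(plain)
print(within)
```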
Forward Pipe
magrittr provides the forward pipe operator (%>%). This serves to forward the LHS as the first argument of the RHS. In other words, f(x) is equivalent to x %>% f().
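There are also advanced features like forwarding into an explicit alias (.):
 * f(1, x) is equivalent to x %>% f(1, .).
 * f(x$foo, x$bar) is equivalent to x %>% { f(.$foo, .$bar) }. The braces matter: when the alias appears only inside nested calls such as .$foo, magrittr still passes the LHS as the first argument, so x %>% f(.$foo, .$bar) is actually f(x, x$foo, x$bar).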
In version 4.1 of R, a similar forward operator (|>) was introduced into the language. This behaves exactly the same in the simple case, but lacks the advanced features of magrittr (R 4.2 later added a more limited placeholder, _, that can only be passed as a named argument). Until R 4.1 is old enough that requiring it excludes practically no one, the recommendation is to use magrittr for portability.
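The equivalences described in this section can be checked directly. This sketch assumes R 4.1 or later (for the native pipe) and uses base functions in place of the generic f:

```r
library(dplyr)   # re-exports the magrittr pipe %>%

x <- c(1, 4, 9)

# Simple case: the LHS becomes the first argument
a1 <- sqrt(x)
a2 <- x %>% sqrt()
a3 <- x |> sqrt()              # native pipe, needs R >= 4.1

# Explicit alias: the LHS goes where the dot is
b1 <- paste("value", x)
b2 <- x %>% paste("value", .)

# Alias used only inside nested calls: wrap the RHS in braces
df <- data.frame(foo = 1:3, bar = 4:6)
c1 <- paste(df$foo, df$bar)
c2 <- df %>% { paste(.$foo, .$bar) }

print(identical(a1, a2) && identical(a1, a3))
print(identical(b1, b2))
print(identical(c1, c2))
```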
Curly Curly
rlang provides several functions for evaluating raw R syntax.
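More importantly, it provides the 'curly curly' operator ({{ }}). This serves for interpolation within dplyr methods: it lets a function pass its own arguments, given as bare column names, through to methods such as group_by and summarize.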
Consider this demo:
foo <- function(x, bar, baz) {
  x %>%
    group_by( {{ baz }} ) %>%
    summarize(maximum = max( {{ bar }} ))
}
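To show how such a function is called, the sketch below repeats the demo function and applies it to an invented tibble. The column names grp and val are assumptions; they are passed unquoted, exactly as they would be to group_by or summarize directly:

```r
library(dplyr)

foo <- function(x, bar, baz) {
  x %>%
    group_by({{ baz }}) %>%
    summarize(maximum = max({{ bar }}))
}

d <- tibble(grp = c("a", "a", "b"), val = c(1, 5, 3))

# bar and baz are given as bare column names
res <- foo(d, bar = val, baz = grp)
print(res)
```

Without the curly curly operator, group_by(baz) would look for a column literally named baz and fail.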
