Data Frames
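Data frames (data.frame) are table-like structures similar to matrices, except that each column can hold a different type. Internally, a data frame is a list of equal-length column vectors.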
Usage
A data frame is instantiated like:
> df1 <- data.frame(foo = c(1,2,4),
bar = c("a","b","c"))
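> df2 <- as.data.frame(matrix(1:9, nrow=3))
Note that if the column vectors have different lengths, the shorter inputs will be 'recycled' (i.e., values are repeated as necessary), but only when the longest length is a whole multiple of each shorter length; otherwise data.frame raises an error. Relying on recycling can easily hide mistakes and should be avoided.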
Note also that df2 here has column names like V1 and so on.
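The recycling behaviour noted above can be checked with a small sketch. The variable names (ok, res) are illustrative only, and tryCatch is used here simply to capture the error instead of stopping the script.

```r
# Lengths 4 and 2: the shorter vector is repeated to fill all four rows.
ok <- data.frame(x = 1:4, y = c("a", "b"))
print(ok)

# Lengths 3 and 2: 3 is not a multiple of 2, so data.frame() raises an error.
res <- tryCatch(
  data.frame(x = 1:3, y = c("a", "b")),
  error = function(e) "error"
)
print(res)
```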
To append a row, try:
> df1 <- rbind(df1, data.frame(foo = 8, bar = "d"))
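One thing to watch when appending rows: a row supplied as a plain vector of mixed types, such as c(8, "d"), is coerced to character before binding, which silently changes the type of the numeric column. The sketch below compares that with binding a one-row data frame; the names coerced and kept are illustrative, and stringsAsFactors = FALSE is set explicitly so the result does not depend on the R version.

```r
df1 <- data.frame(foo = c(1, 2, 4), bar = c("a", "b", "c"),
                  stringsAsFactors = FALSE)

# c(8, "d") is a character vector, so foo stops being numeric after binding.
coerced <- rbind(df1, c(8, "d"))
class(coerced$foo)

# A one-row data frame with matching column names preserves the column types.
kept <- rbind(df1, data.frame(foo = 8, bar = "d", stringsAsFactors = FALSE))
class(kept$foo)
```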
To append a column, the following are essentially equivalent:
> df2 <- cbind(df2, V4 = c(10, 10, 10))
> df2$V4 <- c(10, 10, 10)
To access the dimensions of a data frame, try the dim, ncol, and nrow functions.
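As a brief illustration, using the 3 x 3 data frame built the same way as df2 above:

```r
df2 <- as.data.frame(matrix(1:9, nrow = 3))

dim(df2)   # number of rows, then number of columns
nrow(df2)  # number of rows only
ncol(df2)  # number of columns only
```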
By default, data frames use sequential integers as row names. To use a column, try:
rownames(df) <- df$id
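A minimal sketch of this, assuming a hypothetical data frame with an id column. Row names must be unique and non-missing, so the column used should satisfy both conditions. Once set, row names can be used as a row index.

```r
# Hypothetical data: 'id' and 'value' are illustrative column names.
df <- data.frame(id = c("x", "y", "z"), value = c(10, 20, 30),
                 stringsAsFactors = FALSE)

rownames(df) <- df$id

# Look up a cell by row name and column name.
df["y", "value"]
```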
Indexing
There are several methods for indexing a data frame. All of the following select a single column, but some of the methods return a vector while others return a new data frame containing just a single column.
> # 1. Single bracket, single index
> df2[3]
##   V3
## 1  7
## 2  8
## 3  9
> # 2. Double brackets, numeric index
> df2[[3]]
## [1] 7 8 9
> # 3. Double brackets, variable name
> df2[["V3"]]
## [1] 7 8 9
> # 4. Single bracket, two indices
> df2[, 3]
## [1] 7 8 9
> # 5. Single bracket, two indices, disallow dropping dimensions
> df2[, 3, drop = FALSE]
##   V3
## 1  7
## 2  8
## 3  9
> # 6. Dollar sign syntax
> df2$V3
## [1] 7 8 9
More generally, indexing is done with single brackets and two indices (row and column). An index can be a single integer or a vector of integers; logical and character vectors are also accepted. Leaving an index blank is equivalent to selecting all.
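> # df2 here is the 3 x 3 data frame as first created (columns V1 to V3)
> df2[1, 1]
## [1] 1
> df2[1, ]
##   V1 V2 V3
## 1  1  4  7
> df2[, 1]
## [1] 1 2 3
> df2[1:2, ]
##   V1 V2 V3
## 1  1  4  7
## 2  2  5  8
> df2[c(1,3), ]
##   V1 V2 V3
## 1  1  4  7
## 3  3  6  9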
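Which forms return a data frame and which return a plain vector can be checked directly with is.data.frame. The sketch below rebuilds the 3 x 3 df2 so that it runs on its own; picked is an illustrative name, and the last statement combines a logical row index with a character column index.

```r
df2 <- as.data.frame(matrix(1:9, nrow = 3))

# Forms that keep the data frame structure
is.data.frame(df2[3])
is.data.frame(df2[, 3, drop = FALSE])
is.data.frame(df2[1, ])   # selecting a row still gives a data frame

# Forms that return a plain vector
is.data.frame(df2[[3]])
is.data.frame(df2[, 3])

# Logical row index combined with a character column index
picked <- df2[df2$V1 > 1, c("V1", "V3")]
print(picked)
```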
Subsetting
A convenient method for removing rows and columns, especially in interactive use, is the subset function. To filter rows based on a condition, try:
> df1 <- subset(df, foo <= 10 | bar == 1)
Use the select option to filter columns.
> df2 <- subset(df, select = foo)
> df2 <- subset(df, select = -foo)
> df2 <- subset(df, select = foo:bar)
> df2 <- subset(df, select = c(foo, bar))
This also demonstrates that subset can be used without a condition, meaning that no rows are filtered.
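To make the subset calls above concrete, here is a sketch on a small made-up data frame. The data and the result names (rows, cols, both) are hypothetical; the last call combines a row condition with select.

```r
# Hypothetical data with the column names used in the examples above.
df <- data.frame(foo = c(5, 12, 20), bar = c(0, 1, 0), baz = c("p", "q", "r"),
                 stringsAsFactors = FALSE)

# Rows where either condition holds (rows 1 and 2 here).
rows <- subset(df, foo <= 10 | bar == 1)

# A range of columns, from foo through bar.
cols <- subset(df, select = foo:bar)

# Row condition and column selection together.
both <- subset(df, foo > 10, select = -baz)
print(both)
```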
Logical vector indexing is an alternative. For example, to select rows based on a condition, try:
> df3 <- df[df$foo <= 10 | df$bar == 1, ]
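The two approaches differ when the condition evaluates to NA: subset treats NA as FALSE and drops the row, whereas logical indexing returns a row filled with NA for each NA in the index. A small sketch on made-up data; wrapping the condition in which is one common way to get the subset-like behaviour with brackets.

```r
# Hypothetical data with a missing value in foo.
df <- data.frame(foo = c(5, NA, 20), bar = c(0, 0, 1))

n_subset  <- nrow(subset(df, foo <= 10))        # NA condition is dropped
n_bracket <- nrow(df[df$foo <= 10, ])           # NA condition yields an NA row
n_which   <- nrow(df[which(df$foo <= 10), ])    # which() discards NA first

c(n_subset, n_bracket, n_which)
```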
A further extension of this is negative indexing. For example, to remove the second variable, try:
> df4 <- df[, -2]
> df4 <- df[, -c(2)]
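A caveat with the two-index form: if only one column remains after the removal, the result is dropped to a plain vector unless drop = FALSE is given. A sketch on made-up data (the names df and two are illustrative):

```r
df <- data.frame(foo = 1:3, bar = 4:6, baz = 7:9)

# Removing the second of three columns leaves a two-column data frame.
names(df[, -2])

# With only two columns, removing one leaves a single column.
two <- df[, c("foo", "bar")]
is.data.frame(two[, -2])                # dropped to a vector
is.data.frame(two[, -2, drop = FALSE])  # stays a data frame
is.data.frame(two[-2])                  # single-index form also stays a data frame
```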
A final alternative method for removing a single column is to assign NULL to it.
> df$baz <- NULL
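For instance, on a small made-up data frame, the column disappears from names after the assignment:

```r
# Hypothetical data; baz is the column to remove.
df <- data.frame(foo = 1:3, baz = 4:6)

df$baz <- NULL
names(df)
```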
