4 Subsetting
4.1 Quiz
More details are available in the Subsetting Quiz book chapter.
4.1.1 Quiz 1
What is the result of subsetting a vector with positive integers, negative integers, a logical vector, or a character vector?
Answers
- Positive integers: Select elements at the specified indices
- Negative integers: Exclude the specified indices
- Logical vector: Select elements where the logical vector element is TRUE (this is how subsetting via a condition works)
- Character vector: Return elements with matching names
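A minimal sketch of the four cases. The named vector x is made up for illustration only:

```r
# Hypothetical named vector used only for illustration
x <- c(a = 10, b = 20, c = 30, d = 40)

pos <- x[c(1, 3)]      # positive integers: keep elements 1 and 3
neg <- x[-c(1, 3)]     # negative integers: drop elements 1 and 3
lgl <- x[x > 20]       # logical vector: keep elements where the condition is TRUE
chr <- x[c("a", "d")]  # character vector: keep elements named "a" and "d"
```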
4.1.2 Quiz 2
What’s the difference between [, [[, and $ when applied to a list?
Answers
- [ gets a sub-list containing the selected elements (select via index, name, or logical condition)
- [[ gets the selected list element itself (select via index or name)
- x$var gets the selected list element (select via name, shorthand for x[["var"]])
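To illustrate the difference, here is a small sketch with a made-up list lst:

```r
# Hypothetical list used only for illustration
lst <- list(a = 1:3, b = "text")

sub_list <- lst["a"]    # still a list (of length 1) that keeps the name "a"
element  <- lst[["a"]]  # the integer vector stored under "a"
shortcut <- lst$a       # same result as lst[["a"]]
```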
4.1.3 Quiz 3
When should you use drop = FALSE?
Answers
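By default, subsetting a matrix, array, or dataframe drops any dimension of length 1 (for example, selecting a single dataframe column with df[, "col"] returns a vector). Use drop = FALSE to prevent this and keep the original structure.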
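A short sketch of the effect, using a made-up matrix m and the built-in mtcars dataset:

```r
# Hypothetical 2 x 3 matrix used only for illustration
m <- matrix(1:6, nrow = 2)

dropped <- m[1, ]                # the row dimension is dropped: a plain vector
kept    <- m[1, , drop = FALSE]  # stays a 1 x 3 matrix

col_vec <- mtcars[, "cyl"]                # a numeric vector
col_df  <- mtcars[, "cyl", drop = FALSE]  # a dataframe with one column
```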
4.1.4 Quiz 4
If x is a matrix, what does x[] <- 0 do? How is it different from x <- 0?
Answers
x[] <- 0 replaces every element of the matrix with 0 while keeping its attributes (such as its dimensions). x <- 0 replaces the whole matrix with the single value 0.
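A quick sketch with two copies of a made-up matrix shows the difference:

```r
# Two copies of a hypothetical 2 x 3 matrix
x <- matrix(1:6, nrow = 2)
y <- matrix(1:6, nrow = 2)

x[] <- 0  # fills the existing matrix; dim(x) is still 2 x 3
y <- 0    # rebinds y to a plain length-1 vector; the matrix is gone
```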
4.1.5 Quiz 5
How can you use a named vector to relabel categorical variables?
Answers
- Create a lookup vector, use original vector values as lookup vector names
- Subset lookup vector using original vector
- (Optional) remove lookup vector names for clarity
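The steps above can be sketched as follows. The codes and labels are invented for illustration:

```r
# Hypothetical categorical variable and its labels
codes  <- c("m", "f", "u", "f")
lookup <- c(m = "Male", f = "Female", u = NA)

# Subset the lookup vector using the original values as names
relabelled <- lookup[codes]

# Optionally drop the names for a cleaner result
relabelled <- unname(relabelled)
```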
Alternatively, one can also use dplyr::case_when()
4.2 Selecting multiple elements
More details are available in the Selecting multiple elements book chapter.
4.2.1 Ex. 1
Fix each of the following common data frame subsetting errors:
Answers
Error 1
Error 2
Error 3
Error 4
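The erroneous snippets are not reproduced on this page. Assuming they are the four from the corresponding Advanced R exercise (shown in the comments below), one possible set of fixes is:

```r
# Assumed original: mtcars[mtcars$cyl = 4, ]
# "=" is assignment; a comparison needs "=="
fix1 <- mtcars[mtcars$cyl == 4, ]

# Assumed original: mtcars[-1:4, ]
# -1:4 mixes negative and positive indices; negate the whole range instead
fix2 <- mtcars[-(1:4), ]

# Assumed original: mtcars[mtcars$cyl <= 5]
# A row condition needs the trailing comma
fix3 <- mtcars[mtcars$cyl <= 5, ]

# Assumed original: mtcars[mtcars$cyl == 4 | 6, ]
# "| 6" is always TRUE; test membership with %in%
fix4 <- mtcars[mtcars$cyl %in% c(4, 6), ]
```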
4.2.2 Ex. 2
Why does the following code yield five missing values? (Hint: why is it different from x[NA_real_]?)
Answers
- NA is a logical vector of length 1, so it is recycled to the length of x and the check is performed on all five elements.
- At each element, a missing index returns NA, because NA propagates.
NB: This is why we use is.na() to get missing values in an object instead.
As to why NA_real_ is different from NA:
- NA is a logical value and is recycled in the subsetting. This means that x[NA] is equivalent to x[c(NA, NA, NA, NA, NA)].
- NA_real_ is a double value and is not recycled in the subsetting, so x[NA_real_] returns a single NA.
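A small sketch of both cases, assuming x is a vector of length five such as 1:5:

```r
# Assumed input: any vector of length five
x <- 1:5

by_logical <- x[NA]       # logical NA is recycled to length 5
by_double  <- x[NA_real_] # double NA is treated as one missing position
```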
4.2.3 Ex. 3
What does upper.tri() return? How does subsetting a matrix with it work? Do we need any additional subsetting rules to describe its behaviour?
Answers
- upper.tri() returns a logical matrix with the same dimensions as its input, which is TRUE for the upper triangle (top right) portion of the matrix, where the row number is smaller than the column number.
- The logical matrix can then be used to subset the original matrix.
- The returned object is a vector, dropping the dim attribute of a matrix.
- Extra behaviour includes the option to include diagonals from the matrix (i.e. where row number equals column number) via diag = TRUE.
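As a sketch, here is the behaviour on a made-up 3 x 3 matrix:

```r
# Hypothetical 3 x 3 matrix, filled column by column
m <- matrix(1:9, nrow = 3)

mask        <- upper.tri(m)               # logical matrix, same shape as m
upper       <- m[mask]                    # plain vector, in column-major order
upper_diag  <- m[upper.tri(m, diag = TRUE)]  # also includes the diagonal
```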
4.2.4 Ex. 4
Why does mtcars[1:20] return an error? How does it differ from the similar mtcars[1:20, ]?
Answers
mtcars[1:20] tries to get the first 20 columns of mtcars. There are not enough columns (mtcars has only 11), hence the error.
mtcars[1:20, ] gets the first 20 rows of mtcars, which is valid.
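A short sketch that captures the error instead of stopping. The use of tryCatch() here is only a convenience for the demonstration:

```r
# Column subsetting: 20 columns requested, only 11 exist
res_cols <- tryCatch(mtcars[1:20], error = function(e) e)

# Row subsetting: mtcars has 32 rows, so 1:20 is valid
res_rows <- mtcars[1:20, ]
```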
4.2.5 Ex. 5
Implement your own function that extracts the diagonal entries from a matrix (it should behave like diag(x) where x is a matrix).
Answers
Let’s make the function extra special by allowing it to extract diagonals from matrices and arrays (2 dimensions and more).
Desired behaviour
Build function
Diagonal entries are entries where the row index is equal to the column index. We can programmatically create a matrix containing the relevant indices then subset the matrix for those indices.
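The original function is not reproduced here. One possible sketch of the approach described above, with the hypothetical name my_diag(), is:

```r
# Hypothetical helper: extract diagonal entries of a matrix or array
my_diag <- function(x) {
  stopifnot(is.array(x))
  n <- min(dim(x))  # the diagonal is limited by the shortest dimension
  # Index matrix: row i is (i, i, ..., i), one column per dimension of x
  idx <- matrix(seq_len(n), nrow = n, ncol = length(dim(x)))
  x[idx]            # matrix subsetting picks one element per row of idx
}

m <- matrix(1:12, nrow = 3)
a <- array(1:27, dim = c(3, 3, 3))
diag_m <- my_diag(m)
diag_a <- my_diag(a)
```

For a plain matrix without dimnames, this should agree with diag(x). For arrays it returns the entries whose indices are equal in every dimension.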
4.2.6 Ex. 6
What does df[is.na(df)] <- 0 do? How does it work?
Answers
It replaces NA with 0 in all dataframe cells. is.na(df) creates a logical matrix showing the missing status of every dataframe cell, which is then used to subset the dataframe. The selected cells are then overwritten with 0.
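A minimal sketch on a made-up dataframe:

```r
# Hypothetical dataframe with a few missing values
df <- data.frame(a = c(1, NA, 3), b = c(NA, 5, 6))

na_mask <- is.na(df)  # logical matrix with the same shape as df
df[na_mask] <- 0      # overwrite only the cells flagged TRUE
```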
4.3 Selecting a single element
More details are available in the Selecting a single element book chapter.
4.3.1 Ex. 1
Brainstorm as many ways as possible to extract the third value from the cyl variable in the mtcars dataset.
Answers
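The original answer code is not reproduced here. A few possibilities, as a non-exhaustive sketch:

```r
v1 <- mtcars$cyl[3]         # $ then vector subsetting
v2 <- mtcars[["cyl"]][3]    # [[ by name then vector subsetting
v3 <- mtcars[[2]][3]        # [[ by position (cyl is the second column)
v4 <- mtcars[3, "cyl"]      # row and column in a single [
v5 <- mtcars[3, 2]          # same, with the column given by position
v6 <- mtcars[[3, "cyl"]]    # [[ with a row and a column
v7 <- mtcars[["cyl"]][[3]]  # [[ on both the dataframe and the vector
```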
4.3.2 Ex. 2
Given a linear model, e.g., mod <- lm(mpg ~ wt, data = mtcars), extract the residual degrees of freedom. Then extract the R squared from the model summary (summary(mod))
Answers
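The original answer code is not reproduced here. One way to do it, as a sketch, is to treat the model and its summary as lists and extract the relevant components by name:

```r
mod <- lm(mpg ~ wt, data = mtcars)

# Residual degrees of freedom are stored on the model object
res_df <- mod$df.residual

# R squared is stored on the summary object
r_squared <- summary(mod)$r.squared
```

The accessor df.residual(mod) should give the same degrees of freedom.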
4.4 Applications
More details are available in the Applications book chapter.
4.4.1 Ex. 1
How would you randomly permute the columns of a data frame? (This is an important technique in random forests.) Can you simultaneously permute the rows and columns in one step?
Answers
For both, generate a random permutation of the column indices (and of the row indices) with sample(), then subset the dataframe accordingly.
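A minimal sketch using mtcars as the example dataframe. The seed is set only to make the sketch reproducible:

```r
set.seed(1)  # only for reproducibility of this sketch

# Permute columns only
col_perm <- mtcars[sample(ncol(mtcars))]

# Permute rows and columns in one step
both_perm <- mtcars[sample(nrow(mtcars)), sample(ncol(mtcars))]
```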
4.4.2 Ex. 2
How would you select a random sample of m rows from a data frame? What if the sample had to be contiguous (i.e., with an initial row, a final row, and every row in between)?
Answers
- Select sample of m rows: Sample like Ex. 1, with an extra argument to sample()
- Select sample of m contiguous rows: Randomly select the first row index to sample from, then select rows beginning from that index.
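Both variants can be sketched as follows, with mtcars as the example dataframe and an arbitrary m of 5:

```r
set.seed(1)  # only for reproducibility of this sketch
m <- 5       # arbitrary sample size chosen for illustration

# Random sample of m rows
random_rows <- mtcars[sample(nrow(mtcars), m), ]

# Contiguous sample: the start must leave room for m rows
start <- sample(nrow(mtcars) - m + 1, 1)
contiguous_rows <- mtcars[start:(start + m - 1), ]
```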
4.4.3 Ex. 3
How could you put the columns in a data frame in alphabetical order?
Answers
- Get vector depicting order of column names
- Subset accordingly
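The two steps above can be sketched as follows, again with mtcars as the example:

```r
# Positions that would sort the column names alphabetically
col_order <- order(names(mtcars))

# Subset the columns in that order
sorted_df <- mtcars[col_order]
```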