4 Subsetting

4.1 Quiz

More details are available in the Subsetting Quiz book chapter.

4.1.1 Quiz 1

What is the result of subsetting a vector with positive integers, negative integers, a logical vector, or a character vector?

Answers

Positive integers: Select elements at the specified indices

Negative integers: Exclude the specified indices

Logical vector: Select elements where the logical vector element is TRUE (this is how subsetting via a condition works)

Character vector: Return elements with matching names
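The four index types can be compared on one small named vector. This is a minimal sketch; the vector x and its values are made up for illustration.

```r
x <- c(a = 10, b = 20, c = 30, d = 40)

pos  <- x[c(1, 3)]                      # positive integers: keep elements 1 and 3
neg  <- x[-c(1, 3)]                     # negative integers: drop elements 1 and 3
lgl  <- x[c(TRUE, FALSE, TRUE, FALSE)]  # logical: keep positions that are TRUE
cond <- x[x > 25]                       # a condition is just a logical vector
chr  <- x[c("a", "d")]                  # character: match on element names
```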

4.1.2 Quiz 2

What’s the difference between [, [[, and $ when applied to a list?

Answers

[ gets a sub-list containing the selected elements; the result is always a list, even when only one element is selected

[[ gets the selected list element (select via index or name)

x$var gets the selected list element (select via name, shorthand for x[["var"]])
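A small sketch of the three operators on a hypothetical list x (the element names var and other are illustrative):

```r
x <- list(var = 1:3, other = "a")

sub <- x["var"]    # [ returns a list of length 1 that contains var
el  <- x[["var"]]  # [[ returns the element itself (an integer vector)
el2 <- x$var       # $ is shorthand for selecting by name
```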

4.1.3 Quiz 3

When should you use drop = FALSE?

Answers

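By default, subsetting a matrix, array, or data frame drops any dimension of length 1, so the result can be a lower-dimensional object (for example, a vector instead of a one-column data frame). drop = FALSE prevents this, which matters most inside functions, where the output type should be predictable.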
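The effect of drop = FALSE can be seen on a small matrix and data frame. This is an illustrative sketch with made-up objects m and df.

```r
m <- matrix(1:6, nrow = 2)
row_vec <- m[1, ]                # dimension dropped: plain vector of length 3
row_mat <- m[1, , drop = FALSE]  # still a 1 x 3 matrix

df <- data.frame(a = 1:2, b = c("x", "y"))
col_vec <- df[, "a"]                # dimension dropped: integer vector
col_df  <- df[, "a", drop = FALSE]  # still a one-column data frame
```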

4.1.4 Quiz 4

If x is a matrix, what does x[] <- 0 do? How is it different from x <- 0?

Answers

x[] <- 0 replaces every element of the matrix with 0 while keeping its attributes (such as its dimensions). x <- 0 replaces the whole matrix with the single value 0.
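A quick sketch of the difference, using a made-up 2 x 3 matrix:

```r
x <- matrix(1:6, nrow = 2)

y <- x
y[] <- 0  # every element becomes 0, the dim attribute is kept

z <- x
z <- 0    # z is now just the length-one vector 0
```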

4.1.5 Quiz 5

How can you use a named vector to relabel categorical variables?

Answers

Create a lookup vector, use original vector values as lookup vector names
Subset lookup vector using original vector
(Optional) remove lookup vector names for clarity

Alternatively, one can also use dplyr::case_when()
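The lookup-vector steps can be sketched in base R as follows. The grade codes and labels are hypothetical.

```r
grades <- c("a", "b", "a", "c")                       # original categorical values

lookup <- c(a = "Excellent", b = "Good", c = "Poor")  # names are the original values
relabelled <- lookup[grades]                          # subset the lookup by the original vector
relabelled <- unname(relabelled)                      # optional: drop the names for clarity
```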

4.2 Selecting multiple elements

More details are available in the Selecting multiple elements book chapter.

4.2.1 Ex. 1

Fix each of the following common data frame subsetting errors:

Answers

Error 1

Error 2

Error 3

Error 4

4.2.2 Ex. 2

Why does the following code yield five missing values? (Hint: why is it different from x[NA_real_]?)

Answers

NA is a logical value, so x[NA] is logical subsetting and the index is recycled to the length of x.
At each position the index is NA, and a missing value in a logical index always yields a missing value in the output.

NB: Comparisons behave the same way (x == NA only returns NA), which is why we use is.na() to find missing values in an object instead.

As to why NA_real_ is different from NA:
- NA is a logical value and is recycled in the subsetting. This means that x[NA] is equivalent to x[c(NA, NA, NA, NA, NA)].
- NA_real_ is a double value and is not recycled in the subsetting, so x[NA_real_] returns a single NA.
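The contrast is easy to check directly. This sketch assumes x is a five-element vector, consistent with the five missing values mentioned in the question.

```r
x <- 1:5  # assumed example vector

by_lgl <- x[NA]       # logical NA is recycled to length 5: five NAs
by_dbl <- x[NA_real_] # double NA is a single (missing) position: one NA
```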

4.2.3 Ex. 3

What does upper.tri() return? How does subsetting a matrix with it work? Do we need any additional subsetting rules to describe its behaviour?

Answers

upper.tri() returns a logical matrix with the same dimensions as its input. It is TRUE in the upper triangle (top right), where the row number is smaller than the column number, and FALSE elsewhere.

The logical matrix can then be used to subset the original matrix.
The returned object is a vector, dropping the dim attribute of a matrix.

upper.tri() also has a diag argument: diag = TRUE includes the diagonal of the matrix (i.e. where the row number equals the column number).
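A short sketch on a made-up 3 x 3 matrix shows the logical mask and the vector that results from subsetting with it:

```r
m <- matrix(1:9, nrow = 3)  # filled column by column

mask <- upper.tri(m)        # logical matrix, TRUE where row < column
upper <- m[mask]            # vector of the selected cells, in column-major order

upper_with_diag <- m[upper.tri(m, diag = TRUE)]  # also keeps the diagonal
```

Here upper contains 4, 7 and 8, and the dim attribute is gone because logical subsetting treats the matrix as a vector.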

4.2.4 Ex. 4

Why does mtcars[1:20] return an error? How does it differ from the similar mtcars[1:20, ]?

Answers

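mtcars[1:20] tries to get the first 20 columns of mtcars, because subsetting a data frame with a single index selects columns (as with a list). mtcars has only 11 columns, so this is an error.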

mtcars[1:20, ] gets the first 20 rows of mtcars, which is valid.
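The two calls can be compared without stopping the session by catching the error. The use of tryCatch() here is only for illustration.

```r
# Column subsetting: fails because mtcars has fewer than 20 columns
cols <- tryCatch(
  mtcars[1:20],
  error = function(e) conditionMessage(e)  # keep the error message instead of stopping
)

# Row subsetting: works because mtcars has 32 rows
rows <- mtcars[1:20, ]
```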

4.2.5 Ex. 5

Implement your own function that extracts the diagonal entries from a matrix (it should behave like diag(x) where x is a matrix).

Answers

Let’s make the function extra special by allowing it to extract diagonals from matrices and arrays (2 dimensions and more).

Desired behaviour

Build function

Diagonal entries are entries where the row index is equal to the column index (and, for arrays, to the index along every other dimension). We can programmatically create a matrix containing the relevant indices, then subset the matrix or array with it.
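One possible sketch of this approach is shown below. The function name diag_nd is hypothetical, and the sketch only covers extraction from an existing matrix or array, not the other uses of diag().

```r
# Extract the diagonal of a matrix or array (2 or more dimensions)
diag_nd <- function(x) {
  dims <- dim(x)
  stopifnot(length(dims) >= 2)

  n <- min(dims)  # the diagonal is as long as the shortest dimension

  # Index matrix with one row per diagonal entry and one column per dimension:
  # row i is (i, i, ..., i)
  idx <- matrix(rep(seq_len(n), times = length(dims)), ncol = length(dims))

  x[idx]  # matrix subsetting picks exactly those cells
}

m   <- matrix(1:12, nrow = 3)
arr <- array(1:27, dim = c(3, 3, 3))

diag_m   <- diag_nd(m)    # same result as diag(m)
diag_arr <- diag_nd(arr)  # entries [1,1,1], [2,2,2], [3,3,3]
```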

4.2.6 Ex. 6

What does df[is.na(df)] <- 0 do? How does it work?

Answers

It replaces every NA in the dataframe with 0. is.na(df) creates a logical matrix with the same dimensions as df that is TRUE wherever a cell is missing. Subsetting df with that matrix selects exactly those cells, and the assignment then overwrites them with 0.
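A minimal sketch with a made-up data frame shows both the intermediate logical matrix and the result:

```r
df <- data.frame(a = c(1, NA, 3), b = c(NA, 5, 6))

na_mask <- is.na(df)  # logical matrix, same shape as df

df[is.na(df)] <- 0    # overwrite only the cells flagged TRUE
```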

4.3 Selecting a single element

More details are available in the Selecting a single element book chapter.

4.3.1 Ex. 1

Brainstorm as many ways as possible to extract the third value from the cyl variable in the mtcars dataset.

Answers
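A non-exhaustive sketch of possible approaches follows. It relies on cyl being the second column of mtcars and on the third row being named "Datsun 710".

```r
ways <- list(
  mtcars$cyl[3],               # column by $, then position
  mtcars$cyl[[3]],             # same, with [[ for the element
  mtcars[["cyl"]][3],          # column by [[ and name
  mtcars[[2]][3],              # column by [[ and position
  mtcars[3, "cyl"],            # row and column in one step
  mtcars[3, 2],                # row and column by position
  mtcars[[3, "cyl"]],          # [[ with a row and a column
  mtcars[[c(2, 3)]],           # recursive [[ : column 2, then element 3
  mtcars["Datsun 710", "cyl"], # row by name
  with(mtcars, cyl[3])         # evaluate cyl inside the data frame
)
```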

4.3.2 Ex. 2

Given a linear model, e.g., mod <- lm(mpg ~ wt, data = mtcars), extract the residual degrees of freedom. Then extract the R squared from the model summary (summary(mod)).

Answers
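Both objects are lists, so their components can be extracted with $ (or [[). A minimal sketch:

```r
mod <- lm(mpg ~ wt, data = mtcars)

res_df <- mod$df.residual         # residual degrees of freedom
r2     <- summary(mod)$r.squared  # R squared from the summary object

# The same components via [[
res_df_alt <- mod[["df.residual"]]
r2_alt     <- summary(mod)[["r.squared"]]
```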

4.4 Applications

More details are available in the Applications book chapter.

4.4.1 Ex. 1

How would you randomly permute the columns of a data frame? (This is an important technique in random forests.) Can you simultaneously permute the rows and columns in one step?

Answers

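For both, use sample() to generate a random permutation of the column indices (and, for the second part, of the row indices as well), then subset the dataframe with those indices. Rows and columns can be permuted in one step by supplying both index vectors in a single subsetting call.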
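A sketch of both permutations on a small made-up data frame (the seed is set only to make the example reproducible):

```r
df <- data.frame(a = 1:3, b = letters[1:3], c = c(TRUE, FALSE, TRUE))
set.seed(1)

# Permute columns only: a single index selects columns
cols_shuffled <- df[sample(ncol(df))]

# Permute rows and columns in one step
both_shuffled <- df[sample(nrow(df)), sample(ncol(df))]
```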

4.4.2 Ex. 2

How would you select a random sample of m rows from a data frame? What if the sample had to be contiguous (i.e., with an initial row, a final row, and every row in between)?

Answers

Select sample of m rows: Sample row indices like Ex. 1, passing m as the size argument to sample()
Select sample of m contiguous rows: Randomly select the first row index from 1 to nrow(df) - m + 1 (so that the block fits), then select the m rows beginning from that index.
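Both variants can be sketched as follows, using mtcars and m = 5 as example inputs:

```r
df <- mtcars
m  <- 5
set.seed(42)  # only for reproducibility of the example

# m random rows
random_rows <- df[sample(nrow(df), m), ]

# m contiguous rows: pick a start that leaves room for the whole block
start <- sample(nrow(df) - m + 1, 1)
contiguous_rows <- df[start:(start + m - 1), ]
```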

4.4.3 Ex. 3

How could you put the columns in a data frame in alphabetical order?

Answers

Use order() on the column names to get an index vector that puts them in alphabetical order
Subset the columns of the dataframe with that index vector
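The two steps can be sketched on a made-up data frame whose columns start out of order:

```r
df <- data.frame(b = 1:2, c = 3:4, a = 5:6)

col_order <- order(names(df))  # positions of the names in alphabetical order
sorted <- df[col_order]        # reorder the columns
```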