1 Preface

In this report, I demonstrate the use of blocked (multi-omics), sparse (selected variables), projection to latent structures discriminant analysis (sPLS-DA) to classify breast cancer subtypes using multi-omics data from the Cancer Genome Atlas (TCGA) provided as an example dataset in the mixOmics package.

2 Methods

2.1 Dataset preparation

The training set was loaded from the mixOmics package, which contains 150 breast cancer samples from three subtypes: Basal, Her2, and LumA.

Code

# Build training dataset
## Multiomics data
breast_tcga_train <- list(
  mrna = breast.TCGA$data.train$mrna,
  mirna = breast.TCGA$data.train$mirna,
  protein = breast.TCGA$data.train$protein
)

## Training data
breast_tcga_train_outcome <- breast.TCGA$data.train$subtype

## Show cancer subtypes
table(breast_tcga_train_outcome)

breast_tcga_train_outcome
Basal  Her2  LumA 
   45    30    75

2.2 Parameter selection

The following parameters were selected:

Design matrix, defining desired weights between omics levels.
Number of components, determined by cross-validation within the training set.
Number of variables to select for each omics level, determined by cross-validation within the training set across multiple numbers of variables.

2.2.1 Design matrix

The design matrix was selected using cross-correlations between components associated to each omics level.

Code

# Select weights for sPLS-DA
# The current dataset contains paired miRNA, mRNA, and protein data
# The weight could be based on the pairwise correlation between them.
## Calculate pairwise correlations using PLS
correlations <- list(
  # Calculate PLS between pairs of omics
  mrna_protein = pls(
    breast_tcga_train$mrna,
    breast_tcga_train$protein,
    ncomp = 1
  ),
  mrna_mirna = pls(
    breast_tcga_train$mrna,
    breast_tcga_train$mirna,
    ncomp = 1
  ),
  mirna_protein = pls(
    breast_tcga_train$mirna,
    breast_tcga_train$protein,
    ncomp = 1
  )
) %>%
  # Calculate correlation within each PLS
  map(\(pls) cor(pls$variates$X, pls$variates$Y))

print(correlations)

$mrna_protein
          comp1
comp1 0.9031761

$mrna_mirna
          comp1
comp1 0.8456299

$mirna_protein
          comp1
comp1 0.7982008

As the range of correlation was between 0.798 and 0.903, a weight of 0.85 was chosen for the design matrix.

Code

## Define a design matrix with 0.85 as the weight between omics.
breast_tcga_design <- matrix(
  0.85,
  nrow = length(breast_tcga_train),
  ncol = length(breast_tcga_train),
  dimnames = list(names(breast_tcga_train), names(breast_tcga_train))
) %>%
  `diag<-`(0)

print(breast_tcga_design)

        mrna mirna protein
mrna    0.00  0.85    0.85
mirna   0.85  0.00    0.85
protein 0.85  0.85    0.00

2.2.2 Number of components

After a block PLS-DA was performed using 5 components, 10-fold cross-validation with 10 repeats was performed.

Code

# Select number of components for sPLS-DA
## First, fit a test block PLS-DA (non-sparse)
## to choose the number of components
breast_tcga_plsda_test <- block.plsda(
  X = breast_tcga_train,
  Y = breast_tcga_train_outcome,
  design = breast_tcga_design,
  ncomp = 5
)

Design matrix has changed to include Y; each block will be
            linked to Y.

Code

## Test and plot model statistics
## Use 10-fold cross-validation with 10 repeats
breast_tcga_plsda_test_perf <- breast_tcga_plsda_test %>%
  perf(
    validation = "Mfold",
    folds = 10,
    nrepeat = 10
  )

breast_tcga_ncomp <- breast_tcga_plsda_test_perf$choice.ncomp$WeightedVote %>%
  max()


## Plot statistics
plot(breast_tcga_plsda_test_perf)

## Return recommendation
message(
  str_c(
    "The highest recommended number of components is ",
    max(breast_tcga_plsda_test_perf$choice.ncomp$WeightedVote),
    "."
  )
)

The highest recommended number of components is 5.

Figure 1: Cross-validation metrics for selecting the number of components for block sPLS-DA. X axis represents the number of component, Y axis represents the error rate.

Accordingly, block sPLS-DA was performed using 5 components.

2.2.3 Number of variables

The number of variables for components 1 and 2 in each omics level was selected by cross-validation within the training set across multiple numbers of variables.

The tested numbers of variables are 5, 7, 9, 10, 15, 20, 25

As these values take a long time to compute, they have been pre-calculated using the following code:

Code

breast_tcga_tune_nvars <- list(
  mrna = c(seq(5, 9, 2), seq(10, 25, 5)),
  mirna = c(seq(5, 9, 2), seq(10, 25, 5)),
  protein = c(seq(5, 9, 2), seq(10, 25, 5))
)

breast_tcga_tune <- tune.block.splsda(
  X = breast_tcga_train,
  Y = breast_tcga_train_outcome,
  ncomp = 2,
  test.keepX = breast_tcga_tune_nvars,
  design = breast_tcga_design,
  validation = "Mfold",
  folds = 5,
  nrepeat = 2,
  dist = "centroids.dist",
  BPPARAM = current_bpparam
)

## Get the number of selected variables for each component.
breast_tcga_keep_vars <- breast_tcga_tune$choice.keepX

Code

# Manually define the number of variables to keep for each component based on past calculations.
breast_tcga_keep_vars <- list(
  mrna = c(15, 5),
  mirna = c(20, 7),
  protein = c(9, 5)
)

2.3 Final model

The final block sPLS-DA model includes:

Weight matrix with 0.85 as the weight between omics levels.
5 components.
15, 5 variables for mRNA components 1 and 2.
20, 7 variables for miRNA components 1 and 2.
9, 5 variables for protein components 1 and 2.

Code

breast_tcga_splsda <- block.splsda(
  X = breast_tcga_train,
  Y = breast_tcga_train_outcome,
  design = breast_tcga_design,
  ncomp = breast_tcga_ncomp,
  keepX = breast_tcga_keep_vars
)

Design matrix has changed to include Y; each block will be
            linked to Y.

3 Results

3.1 Per-sample plots

These plots demonstrate the position of each samples within the model.

Code

breast_tcga_splsda %>%
  plotDiablo(ncomp = 1)

Figure 2: Correlation between components from each omics level. Samples are represented based on their component 1 score and coloured by cancer subtype. Bottom left corner represents correlation coefficients between components 1 of each omics level. mRNA and protein data are highly correlated.

All samples are clustered by cancer subtype, demonstrating the classification power of this model. The omics levels are highly correlated, with correlation coefficients ranging from 0.8 - 0.9.

Code

breast_tcga_splsda %>%
  plotIndiv(
    ind.names = FALSE,
    legend = TRUE,
    title = "TCGA, block sPLS-DA comps 1 - 2"
  )

Figure 3: Projection of each sample into each omics level’s components 1 and 2. Samples are represented based on their components 1 and 2 scores and coloured by cancer subtype. Clearer clustering by cancer subtypes are seen in the mRNA and protein datasets.

Across all datasets, component 1 scores (x axis) are seen to be well separated by cancer subtype across all omics levels. In contrast, component 2 scores (y axis) separate Her2 samples from other subtypes only in protein data.

Code

breast_tcga_splsda %>%
  plotArrow(
    ind.names = FALSE,
    legend = TRUE
  )

Figure 4: Distance between each omics levels’ centroid and sample’s position. Samples are represented based on their components 1 and 2 scores and coloured by cancer subtype. Start of each arrow represent the omics levels’ centroid, tip of arrow represents each sample’s position.

3.2 Per-variable plots

These plots demonstrate the contribution of each variable to the model.

Code

breast_tcga_splsda %>%
  plotVar(
    var.names = FALSE,
    legend = TRUE,
    title = "TCGA, block sPLS-DA comps 1 - 2"
  )

Contribution of each variable to components 1 and 2. Variables are represented based on their contribution to components 1 and 2 scores and coloured by omics level.

From this plot, miRNA and mRNA data are seen to be negatively correlated in component 1, white protein data is seen to be positively correlated with both miRNA and mRNA data. Component 2 comprises primarily of miRNA and mRNA data.

Code

breast_tcga_splsda %>%
  plotLoadings(
    comp = 1,
    contrib = "max",
    method = "median"
  )

Contributions of each variable to component 1 separated by omics level. The most important variables, based on their loadings, are sorted bottom to top.

From this plot, component 1 is seen to primarily classify basal samples from LumA samples. This supports the findings of Figure 2 and Figure 3, where Basal and LumA clusters are well-defined.

Code

breast_tcga_splsda %>%
  plotLoadings(
    comp = 2,
    contrib = "max",
    method = "median"
  )

Contributions of each variable to component 2 separated by omics level. The most important variables, based on their loadings, are sorted bottom to top.

From this plot, component 2 is seen to primarily classify Her2 samples from LumA samples. Interestingly, the protein data in this component can classify all three cancer subtypes.

Code

breast_tcga_splsda %>%
  circosPlot(
    cutoff = 0.7,
    line = TRUE,
    color.cor = c("red", "blue")
  )

Correlation between variables in different omics in circos plot. Only correlations greater than 0.7 are shown. Internal connecting lines between omics levels show pairwise correlations between variables in each of these omics levels.

In this plot, negative correlations are shown to occur between only miRNA and other omics levels, whereas positive correlations are shown to occur between all omics levels. The features that are most highly correlated (i.e. connected by lines) are shown to be highly variable between cancer subtypes, suggesting that these features are important for classification.

3.3 Model performance

Model performance was assessed using:

Cross-validation within training dataset
Validation against test dataset

3.3.1 Cross-validation within training dataset

Cross-validation was performed using 10-fold cross-validation with 10 repeats within the 150 training samples. The weighted vote error rate is shown:

Code

breast_tcga_splsda_perf <- breast_tcga_splsda %>%
  perf(
    validation = "Mfold",
    folds = 10,
    nrepeat = 10,
    dist = "centroids.dist"
  )

breast_tcga_splsda_perf$MajorityVote.error.rate %>%
  print()

$centroids.dist
                 comp1      comp2     comp3      comp4      comp5
Basal       0.05555556 0.06444444 0.0600000 0.12444444 0.10666667
Her2        0.19333333 0.16333333 0.1966667 0.17333333 0.19000000
LumA        0.06533333 0.03200000 0.1040000 0.06933333 0.07466667
Overall.ER  0.08800000 0.06800000 0.1093333 0.10666667 0.10733333
Overall.BER 0.10474074 0.08659259 0.1202222 0.12237037 0.12377778

3.3.1.1 ROC and AUC plot per omics level

Code

plot(grid)

This plot demonstrates that the Her2 subtype is the most difficult to classific by this model, with sensitivity and specificity consistently weaker than other subtypes. This is supported by other evaluation figures, such as Figure 2 and Figure 3, where the Her2 samples were not shown to be clearly separated from other cancer subtypes.

3.3.2 Validation against test dataset

The model was used to classify 70 test samples from the TCGA breast cancer dataset by the 3 subtypes: Basal, Her2, LumA. Only mRNA and miRNA data was used for classification. The first 2 components of the model were used for classification.

Code

breast_tcga_test <- list(
  mrna = breast.TCGA$data.test$mrna,
  mirna = breast.TCGA$data.test$mirna
)

## Perform prediction
breast_tcga_test_predict <- predict(
  breast_tcga_splsda,
  newdata = breast_tcga_test
)

Warning in predict.block.spls(breast_tcga_splsda, newdata = breast_tcga_test):
Some blocks are missing in 'newdata'; the prediction is based on the following
blocks only: mrna, mirna

Code

## Check prediction performance using confusion matrix.
## Use 2 components in model.
breast_tcga_test_confmatrix <- get.confusion_matrix(
  truth = breast.TCGA$data.test$subtype,
  predicted = breast_tcga_test_predict$WeightedVote$centroids.dist[, 2]
)

print(breast_tcga_test_confmatrix)

      predicted.as.Basal predicted.as.Her2 predicted.as.LumA
Basal                 17                 4                 0
Her2                   1                12                 1
LumA                   0                 4                31

The confusion matrix demonstrated good performance, with the following errors:

4 Basal sample defined as Her2.
0 Basal sample defined as LumA.
1 Her2 sample defined as Basal.
1 Her2 sample defined as LumA.
0 LumA sample defined as Basal.
4 LumA sample defined as Her2.

The remaining 10 samples were correctly classified.

The error rate of this model is 0.149.