In this report, I demonstrate the use of blocked (multi-omics), sparse (selected variables), projection to latent structures discriminant analysis (sPLS-DA) to classify breast cancer subtypes using multi-omics data from the Cancer Genome Atlas (TCGA) provided as an example dataset in the mixOmics package.
2 Methods
2.1 Dataset preparation
The training set was loaded from the mixOmics package, which contains 150 breast cancer samples from three subtypes: Basal, Her2, and LumA.
Code
# Build training dataset## Multiomics databreast_tcga_train <-list(mrna = breast.TCGA$data.train$mrna,mirna = breast.TCGA$data.train$mirna,protein = breast.TCGA$data.train$protein)## Training databreast_tcga_train_outcome <- breast.TCGA$data.train$subtype## Show cancer subtypestable(breast_tcga_train_outcome)
Design matrix, defining desired weights between omics levels.
Number of components, determined by cross-validation within the training set.
Number of variables to select for each omics level, determined by cross-validation within the training set across multiple numbers of variables.
2.2.1 Design matrix
The design matrix was selected using cross-correlations between components associated to each omics level.
Code
# Select weights for sPLS-DA# The current dataset contains paired miRNA, mRNA, and protein data# The weight could be based on the pairwise correlation between them.## Calculate pairwise correlations using PLScorrelations <-list(# Calculate PLS between pairs of omicsmrna_protein =pls( breast_tcga_train$mrna, breast_tcga_train$protein,ncomp =1 ),mrna_mirna =pls( breast_tcga_train$mrna, breast_tcga_train$mirna,ncomp =1 ),mirna_protein =pls( breast_tcga_train$mirna, breast_tcga_train$protein,ncomp =1 )) %>%# Calculate correlation within each PLSmap(\(pls) cor(pls$variates$X, pls$variates$Y))print(correlations)
As the range of correlation was between 0.798 and 0.903, a weight of 0.85 was chosen for the design matrix.
Code
## Define a design matrix with 0.85 as the weight between omics.breast_tcga_design <-matrix(0.85,nrow =length(breast_tcga_train),ncol =length(breast_tcga_train),dimnames =list(names(breast_tcga_train), names(breast_tcga_train))) %>%`diag<-`(0)print(breast_tcga_design)
mrna mirna protein
mrna 0.00 0.85 0.85
mirna 0.85 0.00 0.85
protein 0.85 0.85 0.00
2.2.2 Number of components
After a block PLS-DA was performed using 5 components, 10-fold cross-validation with 10 repeats was performed.
Code
# Select number of components for sPLS-DA## First, fit a test block PLS-DA (non-sparse)## to choose the number of componentsbreast_tcga_plsda_test <-block.plsda(X = breast_tcga_train,Y = breast_tcga_train_outcome,design = breast_tcga_design,ncomp =5)
Design matrix has changed to include Y; each block will be
linked to Y.
Code
## Test and plot model statistics## Use 10-fold cross-validation with 10 repeatsbreast_tcga_plsda_test_perf <- breast_tcga_plsda_test %>%perf(validation ="Mfold",folds =10,nrepeat =10 )breast_tcga_ncomp <- breast_tcga_plsda_test_perf$choice.ncomp$WeightedVote %>%max()## Plot statisticsplot(breast_tcga_plsda_test_perf)## Return recommendationmessage(str_c("The highest recommended number of components is ",max(breast_tcga_plsda_test_perf$choice.ncomp$WeightedVote),"." ))
The highest recommended number of components is 5.
Figure 1: Cross-validation metrics for selecting the number of components for block sPLS-DA. X axis represents the number of component, Y axis represents the error rate.
Accordingly, block sPLS-DA was performed using 5 components.
2.2.3 Number of variables
The number of variables for components 1 and 2 in each omics level was selected by cross-validation within the training set across multiple numbers of variables.
The tested numbers of variables are 5, 7, 9, 10, 15, 20, 25
As these values take a long time to compute, they have been pre-calculated using the following code:
Code
breast_tcga_tune_nvars <-list(mrna =c(seq(5, 9, 2), seq(10, 25, 5)),mirna =c(seq(5, 9, 2), seq(10, 25, 5)),protein =c(seq(5, 9, 2), seq(10, 25, 5)))breast_tcga_tune <-tune.block.splsda(X = breast_tcga_train,Y = breast_tcga_train_outcome,ncomp =2,test.keepX = breast_tcga_tune_nvars,design = breast_tcga_design,validation ="Mfold",folds =5,nrepeat =2,dist ="centroids.dist",BPPARAM = current_bpparam)## Get the number of selected variables for each component.breast_tcga_keep_vars <- breast_tcga_tune$choice.keepX
Code
# Manually define the number of variables to keep for each component based on past calculations.breast_tcga_keep_vars <-list(mrna =c(15, 5),mirna =c(20, 7),protein =c(9, 5))
2.3 Final model
The final block sPLS-DA model includes:
Weight matrix with 0.85 as the weight between omics levels.
Figure 2: Correlation between components from each omics level. Samples are represented based on their component 1 score and coloured by cancer subtype. Bottom left corner represents correlation coefficients between components 1 of each omics level. mRNA and protein data are highly correlated.
All samples are clustered by cancer subtype, demonstrating the classification power of this model. The omics levels are highly correlated, with correlation coefficients ranging from 0.8 - 0.9.
Figure 3: Projection of each sample into each omics level’s components 1 and 2. Samples are represented based on their components 1 and 2 scores and coloured by cancer subtype. Clearer clustering by cancer subtypes are seen in the mRNA and protein datasets.
Across all datasets, component 1 scores (x axis) are seen to be well separated by cancer subtype across all omics levels. In contrast, component 2 scores (y axis) separate Her2 samples from other subtypes only in protein data.
Figure 4: Distance between each omics levels’ centroid and sample’s position. Samples are represented based on their components 1 and 2 scores and coloured by cancer subtype. Start of each arrow represent the omics levels’ centroid, tip of arrow represents each sample’s position.
3.2 Per-variable plots
These plots demonstrate the contribution of each variable to the model.
Contribution of each variable to components 1 and 2. Variables are represented based on their contribution to components 1 and 2 scores and coloured by omics level.
From this plot, miRNA and mRNA data are seen to be negatively correlated in component 1, white protein data is seen to be positively correlated with both miRNA and mRNA data. Component 2 comprises primarily of miRNA and mRNA data.
Contributions of each variable to component 1 separated by omics level. The most important variables, based on their loadings, are sorted bottom to top.
From this plot, component 1 is seen to primarily classify basal samples from LumA samples. This supports the findings of Figure 2 and Figure 3, where Basal and LumA clusters are well-defined.
Contributions of each variable to component 2 separated by omics level. The most important variables, based on their loadings, are sorted bottom to top.
From this plot, component 2 is seen to primarily classify Her2 samples from LumA samples. Interestingly, the protein data in this component can classify all three cancer subtypes.
Correlation between variables in different omics in circos plot. Only correlations greater than 0.7 are shown. Internal connecting lines between omics levels show pairwise correlations between variables in each of these omics levels.
In this plot, negative correlations are shown to occur between only miRNA and other omics levels, whereas positive correlations are shown to occur between all omics levels. The features that are most highly correlated (i.e. connected by lines) are shown to be highly variable between cancer subtypes, suggesting that these features are important for classification.
3.3 Model performance
Model performance was assessed using:
Cross-validation within training dataset
Validation against test dataset
3.3.1 Cross-validation within training dataset
Cross-validation was performed using 10-fold cross-validation with 10 repeats within the 150 training samples. The weighted vote error rate is shown:
This plot demonstrates that the Her2 subtype is the most difficult to classific by this model, with sensitivity and specificity consistently weaker than other subtypes. This is supported by other evaluation figures, such as Figure 2 and Figure 3, where the Her2 samples were not shown to be clearly separated from other cancer subtypes.
3.3.2 Validation against test dataset
The model was used to classify 70 test samples from the TCGA breast cancer dataset by the 3 subtypes: Basal, Her2, LumA. Only mRNA and miRNA data was used for classification. The first 2 components of the model were used for classification.
Warning in predict.block.spls(breast_tcga_splsda, newdata = breast_tcga_test):
Some blocks are missing in 'newdata'; the prediction is based on the following
blocks only: mrna, mirna
Code
## Check prediction performance using confusion matrix.## Use 2 components in model.breast_tcga_test_confmatrix <-get.confusion_matrix(truth = breast.TCGA$data.test$subtype,predicted = breast_tcga_test_predict$WeightedVote$centroids.dist[, 2])print(breast_tcga_test_confmatrix)