For this project, I aim to train and evaluate three supervised learning models to classify lung cancer tissue samples as either adenocarcinoma (LUAD) or squamous cell carcinoma (LUSC). The dataset comprises 1,153 samples and 388 columns: the “label” column serves as the target variable, indicating the true classification of each sample as LUAD or LUSC, while the remaining 387 columns contain gene expression levels for genes measured in these tumor samples. The goal is to determine the most effective model for accurately distinguishing between these two lung cancer subtypes.
I will begin by loading the required libraries and importing the dataset, checking its dimensions to ensure it has been loaded correctly. Next, I will convert the last column, “label”, into a factor, as it represents the categorical classification of lung cancer samples (LUAD or LUSC). To understand the class distribution, I will analyze the frequency of each label, which will help determine the most appropriate method for splitting the data.
A 70/30 train-test split will be applied to create separate training and testing datasets while maintaining class proportions. I will then train three supervised learning models—Random Forest, XGBoost, and Support Vector Machine (SVM)—on the training data and evaluate their performance on the test set. Model comparison will be based on key classification metrics such as accuracy, precision, recall, and F1-score to determine the most effective classifier for distinguishing between LUAD and LUSC samples.
suppressMessages(suppressWarnings(library(dplyr)))
suppressMessages(suppressWarnings(library(stringr)))
suppressMessages(suppressWarnings(library(caret)))
suppressMessages(suppressWarnings(library(nnet)))
suppressMessages(suppressWarnings(library(randomForest)))
suppressMessages(suppressWarnings(library(e1071)))
suppressMessages(suppressWarnings(library(xgboost)))
suppressMessages(suppressWarnings(library(SHAPforxgboost)))
suppressMessages(suppressWarnings(library(readxl)))
# Load and check data
df <- read_xlsx("C:\\Users\\fasci\\OneDrive\\Desktop\\BMI\\BMIDS HW 2-1.xlsx")
dim(df)
## [1] 1153 388
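Before converting the label, it is worth a quick check for missing values, since none of the models used below handle NAs natively (a minimal sanity check; output omitted):
# Count missing values across the entire data frame (should be 0 before modeling)
sum(is.na(df))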
df$label <- as.factor(df$label)
We want an overview of the distribution of the ground-truth labels.
# Frequency of ground-truth label
occurrence <- prop.table(table(df$label))
cbind(freq=table(df$label), occurrence=occurrence)
## freq occurrence
## luad 600 0.5203816
## lusc 553 0.4796184
We first try a random (unstratified) split.
set.seed(18)
dt <- sample(nrow(df), floor(nrow(df) * 0.7))
random_train <- df[dt,]
random_test <- df[-dt,]
prop.table(table(random_train$label))
##
## luad lusc
## 0.5092937 0.4907063
Then the proportional (stratified) split.
set.seed(1234)
index <- createDataPartition(df$label, p = 0.7, list = FALSE)
train_set <- df[index,]
test_set <- df[-index,]
prop.table(table(train_set$label))
##
## luad lusc
## 0.519802 0.480198
I prefer the proportional split, since it keeps the training-set class frequencies closest to those of the full dataset.
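To make this comparison explicit, the training proportions from both splits can be tabulated side by side against the full dataset (a small illustrative check using the objects created above; output omitted):
# Compare training-set class proportions from each split with the full data
overall    <- prop.table(table(df$label))
random     <- prop.table(table(random_train$label))
stratified <- prop.table(table(train_set$label))
round(rbind(overall, random, stratified), 4)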
I will train the following models (Random Forest, XGBoost, and SVM) and, for each, generate the corresponding confusion matrix on the test set.
# Set seed
set.seed(1234)
# Build model (ntree = 10 keeps training fast; randomForest's default is 500)
rf <- randomForest(label ~ ., data = train_set, ntree = 10)
# Make predictions
pred_rf <- predict(rf, test_set)
rf_cm <- confusionMatrix(pred_rf, test_set$label, mode = "prec_recall", positive= "luad")
rf_cm
## Confusion Matrix and Statistics
##
## Reference
## Prediction luad lusc
## luad 177 13
## lusc 3 152
##
## Accuracy : 0.9536
## 95% CI : (0.9258, 0.9733)
## No Information Rate : 0.5217
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.9068
##
## Mcnemar's Test P-Value : 0.02445
##
## Precision : 0.9316
## Recall : 0.9833
## F1 : 0.9568
## Prevalence : 0.5217
## Detection Rate : 0.5130
## Detection Prevalence : 0.5507
## Balanced Accuracy : 0.9523
##
## 'Positive' Class : luad
##
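Beyond the confusion matrix, it can be informative to ask which genes the forest relies on. randomForest records Gini-based importance by default, so a quick look is possible (an illustrative sketch; output not shown):
# Top 10 genes by mean decrease in Gini impurity
imp <- importance(rf)
head(imp[order(imp[, "MeanDecreaseGini"], decreasing = TRUE), , drop = FALSE], 10)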
# Set seed
set.seed(1234)
# Set parameters
tune_grid <- data.frame(nrounds = 100, max_depth = 2, eta = 0.3, gamma = 0,
                        colsample_bytree = 0.8, min_child_weight = 1, subsample = 1)
# Build model
xgb1 <- train(label ~ ., data = train_set, method = "xgbTree",
              metric = "Accuracy", trControl = trainControl(method = "cv"),
              tuneGrid = tune_grid)
# Make predictions
pred_xgb <- predict(xgb1, test_set)
confusionMatrix(pred_xgb, test_set$label, mode = "prec_recall", positive="luad")
## Confusion Matrix and Statistics
##
## Reference
## Prediction luad lusc
## luad 179 5
## lusc 1 160
##
## Accuracy : 0.9826
## 95% CI : (0.9625, 0.9936)
## No Information Rate : 0.5217
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9651
##
## Mcnemar's Test P-Value : 0.2207
##
## Precision : 0.9728
## Recall : 0.9944
## F1 : 0.9835
## Prevalence : 0.5217
## Detection Rate : 0.5188
## Detection Prevalence : 0.5333
## Balanced Accuracy : 0.9821
##
## 'Positive' Class : luad
##
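SHAPforxgboost is loaded above but not yet used; a natural follow-up is to inspect per-gene SHAP contributions from the fitted booster, which caret stores in xgb1$finalModel. A minimal sketch, assuming caret kept the original column names (they may need aligning if any predictors were re-encoded):
# SHAP summary plot for the fitted booster (sketch; output not shown)
X <- as.matrix(train_set[, setdiff(names(train_set), "label")])
shap_long <- shap.prep(xgb_model = xgb1$finalModel, X_train = X)
shap.plot.summary(shap_long)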
# Set seed
set.seed(1234)
# Build model (a linear kernel is a common default when features far outnumber samples)
svm1 <- svm(label ~ ., data = train_set, type = 'C-classification', kernel = 'linear')
# Make predictions
pred_svm <- predict(svm1, test_set)
confusionMatrix(pred_svm, test_set$label, mode = "prec_recall", positive="luad")
## Confusion Matrix and Statistics
##
## Reference
## Prediction luad lusc
## luad 180 2
## lusc 0 163
##
## Accuracy : 0.9942
## 95% CI : (0.9792, 0.9993)
## No Information Rate : 0.5217
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9884
##
## Mcnemar's Test P-Value : 0.4795
##
## Precision : 0.9890
## Recall : 1.0000
## F1 : 0.9945
## Prevalence : 0.5217
## Detection Rate : 0.5217
## Detection Prevalence : 0.5275
## Balanced Accuracy : 0.9939
##
## 'Positive' Class : luad
##
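The cost parameter was left at its default (C = 1); to check sensitivity, e1071 offers a simple cross-validated grid search (an optional sketch; the grid values are illustrative and the search can be slow with 387 features):
# Optional: 10-fold CV over the cost parameter (illustrative grid; output not shown)
set.seed(1234)
svm_tuned <- tune(svm, label ~ ., data = train_set, kernel = "linear",
                  ranges = list(cost = c(0.01, 0.1, 1, 10)))
svm_tuned$best.parameters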
# Load knitr package
library(knitr)
# Create a dataframe with model performance metrics
model_performance <- data.frame(
  Model = c("Random Forest", "XGBoost", "SVM"),
  Accuracy = c("0.9536", "0.9826", "0.9942"),
  Precision = c("0.9316", "0.9728", "0.9890"),
  Recall = c("0.9833", "0.9944", "1.0000"),
  F1_Score = c("0.9568", "0.9835", "0.9945")
)
# Display the table using knitr::kable()
kable(model_performance, caption = "Performance Comparison of Machine Learning Models")
| Model | Accuracy | Precision | Recall | F1_Score |
|---|---|---|---|---|
| Random Forest | 0.9536 | 0.9316 | 0.9833 | 0.9568 |
| XGBoost | 0.9826 | 0.9728 | 0.9944 | 0.9835 |
| SVM | 0.9942 | 0.9890 | 1.0000 | 0.9945 |
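These metrics all depend on a fixed decision threshold; as a complementary check, ROC AUC could be compared for the two models that output class probabilities (a sketch; pROC is already pulled in as a caret dependency, and the SVM would need refitting with probability = TRUE to join the comparison):
# Compare ROC AUC for the probabilistic classifiers (sketch; output not shown)
library(pROC)
rf_prob  <- predict(rf,   test_set, type = "prob")[, "luad"]
xgb_prob <- predict(xgb1, test_set, type = "prob")[, "luad"]
auc(roc(test_set$label, rf_prob,  levels = c("lusc", "luad"), direction = "<"))
auc(roc(test_set$label, xgb_prob, levels = c("lusc", "luad"), direction = "<"))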
This study evaluated the performance of three supervised machine learning models—Random Forest (RF), XGBoost, and Support Vector Machine (SVM)—in classifying lung cancer samples as either adenocarcinoma (LUAD) or squamous cell carcinoma (LUSC). The models were assessed using key classification metrics, including accuracy, precision, recall, F1-score, and balanced accuracy.
Random Forest (RF) achieved an accuracy of 95.36%, with a recall of 98.33% and an F1-score of 95.68%. While the model demonstrated strong classification ability, its precision (93.16%) was the lowest of the three, reflecting a higher false-positive rate. McNemar's test (p = 0.02445) points to an asymmetry in the errors: the model misclassified LUSC samples as LUAD (13 cases) far more often than the reverse (3 cases).
XGBoost performed markedly better, with an accuracy of 98.26%, a precision of 97.28%, and an F1-score of 98.35%. Its recall of 99.44% means it correctly identified almost all LUAD cases while maintaining a strong balance between precision and recall. The McNemar's test p-value (0.2207) indicates that misclassification errors were relatively balanced between the two classes.
Support Vector Machine (SVM) emerged as the best-performing model, achieving the highest accuracy (99.42%), perfect recall (100%), and an F1-score of 99.45%. It also had the highest kappa (0.9884), indicating near-perfect agreement between predicted and actual labels. The model correctly classified every LUAD case and produced only two false positives, as reflected in its precision of 98.90%. The McNemar's test p-value (0.4795) suggests well-balanced errors across the two classes.
Among the three models, SVM demonstrated the best overall classification performance, with the highest accuracy, perfect recall, and the highest F1-score. The model identified all LUAD samples without any false negatives, which is crucial in clinical applications where misclassification could lead to incorrect treatment decisions. XGBoost also performed exceptionally well, with only a slight drop in precision and accuracy relative to SVM. Random Forest, while still highly effective, showed more misclassification errors than the other two models.
Given these results, SVM is the most reliable model for classifying lung cancer subtypes in this dataset. Future improvements could include hyperparameter tuning, feature selection, and ensemble methods to further enhance model performance and computational efficiency.
sessionInfo()
## R version 4.4.2 (2024-10-31 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 26100)
##
## Matrix products: default
##
##
## locale:
## [1] LC_COLLATE=English_Europe.utf8 LC_CTYPE=English_Europe.utf8
## [3] LC_MONETARY=English_Europe.utf8 LC_NUMERIC=C
## [5] LC_TIME=English_Europe.utf8
##
## time zone: America/Chicago
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] knitr_1.49 readxl_1.4.3 SHAPforxgboost_0.1.3
## [4] xgboost_1.7.8.1 e1071_1.7-16 randomForest_4.7-1.2
## [7] nnet_7.3-20 caret_7.0-1 lattice_0.22-6
## [10] ggplot2_3.5.1 stringr_1.5.1 dplyr_1.1.4
##
## loaded via a namespace (and not attached):
## [1] tidyselect_1.2.1 timeDate_4041.110 farver_2.1.2
## [4] fastmap_1.2.0 tweenr_2.0.3 pROC_1.18.5
## [7] digest_0.6.37 rpart_4.1.23 timechange_0.3.0
## [10] lifecycle_1.0.4 survival_3.7-0 magrittr_2.0.3
## [13] compiler_4.4.2 rlang_1.1.5 sass_0.4.9
## [16] tools_4.4.2 yaml_2.3.10 data.table_1.16.4
## [19] ggsignif_0.6.4 plyr_1.8.9 RColorBrewer_1.1-3
## [22] abind_1.4-8 withr_3.0.2 purrr_1.0.2
## [25] grid_4.4.2 polyclip_1.10-7 stats4_4.4.2
## [28] ggpubr_0.6.0 colorspace_2.1-1 future_1.34.0
## [31] globals_0.16.3 scales_1.3.0 iterators_1.0.14
## [34] MASS_7.3-61 BBmisc_1.13 cli_3.6.3
## [37] rmarkdown_2.29 generics_0.1.3 future.apply_1.11.3
## [40] reshape2_1.4.4 cachem_1.1.0 ggforce_0.4.2
## [43] proxy_0.4-27 splines_4.4.2 parallel_4.4.2
## [46] cellranger_1.1.0 vctrs_0.6.5 hardhat_1.4.1
## [49] Matrix_1.7-1 carData_3.0-5 jsonlite_1.8.9
## [52] car_3.1-3 rstatix_0.7.2 Formula_1.2-5
## [55] listenv_0.9.1 foreach_1.5.2 tidyr_1.3.1
## [58] gower_1.0.2 jquerylib_0.1.4 recipes_1.1.0
## [61] glue_1.8.0 parallelly_1.42.0 codetools_0.2-20
## [64] lubridate_1.9.4 stringi_1.8.4 gtable_0.3.6
## [67] munsell_0.5.1 tibble_3.2.1 pillar_1.10.1
## [70] htmltools_0.5.8.1 ipred_0.9-15 lava_1.8.1
## [73] R6_2.5.1 evaluate_1.0.3 backports_1.5.0
## [76] broom_1.0.7 bslib_0.9.0 class_7.3-22
## [79] Rcpp_1.0.14 nlme_3.1-166 prodlim_2024.06.25
## [82] checkmate_2.3.2 xfun_0.50 ModelMetrics_1.2.2.2
## [85] pkgconfig_2.0.3