For this project, I aim to train and evaluate three supervised learning models to classify lung cancer tissue samples as either adenocarcinoma (LUAD) or squamous cell carcinoma (LUSC). The dataset comprises 1,153 samples and 388 columns: the “label” column serves as the target variable, indicating the true classification of each sample as LUAD or LUSC, while the remaining 387 columns contain gene expression levels for genes measured in these tumor samples. The goal is to determine the most effective model for accurately distinguishing between these two lung cancer subtypes.
I will begin by loading the required libraries and importing the dataset, checking its dimensions to ensure it has been loaded correctly. Next, I will convert the last column, “label”, into a factor, as it represents the categorical classification of lung cancer samples (LUAD or LUSC). To understand the class distribution, I will analyze the frequency of each label, which will help determine the most appropriate method for splitting the data.
A 70/30 train-test split will be applied to create separate training and testing datasets while maintaining class proportions. I will then train three supervised learning models—Random Forest, XGBoost, and Support Vector Machine (SVM)—on the training data and evaluate their performance on the test set. Model comparison will be based on key classification metrics such as accuracy, precision, recall, and F1-score to determine the most effective classifier for distinguishing between LUAD and LUSC samples.
suppressMessages(suppressWarnings(library(dplyr)))
suppressMessages(suppressWarnings(library(stringr)))
suppressMessages(suppressWarnings(library(caret)))
suppressMessages(suppressWarnings(library(nnet)))
suppressMessages(suppressWarnings(library(randomForest)))
suppressMessages(suppressWarnings(library(e1071)))
suppressMessages(suppressWarnings(library(xgboost)))
suppressMessages(suppressWarnings(library(SHAPforxgboost)))
suppressMessages(suppressWarnings(library(readxl)))
# Load and check data
df <- read_xlsx("C:\\Users\\fasci\\OneDrive\\Desktop\\BMI\\BMIDS HW 2-1.xlsx")
dim(df)
## [1] 1153 388
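Before converting the label, it is worth a quick check for missing values, since none of the models used below handle NAs natively (a minimal sanity check; output omitted):
# Count missing values across the entire data frame (should be 0 before modeling)
sum(is.na(df))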
df$label <- as.factor(df$label)
We want an overview of the distribution of the ground-truth labels.
# Frequency of ground-truth label
occurrence <- prop.table(table(df$label))
cbind(freq=table(df$label), occurrence=occurrence)
## freq occurrence
## luad 600 0.5203816
## lusc 553 0.4796184
We first try a random (unstratified) split.
set.seed(18)
dt <- sample(nrow(df), floor(nrow(df) * 0.7))
random_train <- df[dt,]
random_test <- df[-dt,]
prop.table(table(random_train$label))
##
## luad lusc
## 0.5092937 0.4907063
Then the proportional (stratified) split.
set.seed(1234)
index <- createDataPartition(df$label, p = 0.7, list = FALSE)
train_set <- df[index,]
test_set <- df[-index,]
prop.table(table(train_set$label))
##
## luad lusc
## 0.519802 0.480198
I prefer the proportional split, since it keeps the training-set class frequencies closest to those of the full dataset.
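To make this comparison explicit, the training proportions from both splits can be tabulated side by side against the full dataset (a small illustrative check using the objects created above; output omitted):
# Compare training-set class proportions from each split with the full data
overall    <- prop.table(table(df$label))
random     <- prop.table(table(random_train$label))
stratified <- prop.table(table(train_set$label))
round(rbind(overall, random, stratified), 4)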
I will train the following models (Random Forest, XGBoost, and SVM) and, for each, generate the corresponding confusion matrix on the test set.
# Set seed
set.seed(1234)
# Build model (ntree = 10 keeps training fast; randomForest's default is 500)
rf <- randomForest(label ~ ., data = train_set, ntree = 10)
# Make predictions
pred_rf <- predict(rf, test_set)
rf_cm <- confusionMatrix(pred_rf, test_set$label, mode = "prec_recall", positive= "luad")
rf_cm
## Confusion Matrix and Statistics
##
## Reference
## Prediction luad lusc
## luad 177 13
## lusc 3 152
##
## Accuracy : 0.9536
## 95% CI : (0.9258, 0.9733)
## No Information Rate : 0.5217
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.9068
##
## Mcnemar's Test P-Value : 0.02445
##
## Precision : 0.9316
## Recall : 0.9833
## F1 : 0.9568
## Prevalence : 0.5217
## Detection Rate : 0.5130
## Detection Prevalence : 0.5507
## Balanced Accuracy : 0.9523
##
## 'Positive' Class : luad
##
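Beyond the confusion matrix, it can be informative to ask which genes the forest relies on. randomForest records Gini-based importance by default, so a quick look is possible (an illustrative sketch; output not shown):
# Top 10 genes by mean decrease in Gini impurity
imp <- importance(rf)
head(imp[order(imp[, "MeanDecreaseGini"], decreasing = TRUE), , drop = FALSE], 10)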
# Set seed
set.seed(1234)
# Set parameters
tune_grid <- data.frame(nrounds = 100, max_depth = 2, eta = 0.3, gamma = 0,
                        colsample_bytree = 0.8, min_child_weight = 1, subsample = 1)
# Build model
xgb1 <- train(label ~ ., data = train_set, method = "xgbTree",
              metric = "Accuracy", trControl = trainControl(method = "cv"),
              tuneGrid = tune_grid)
# Make predictions
pred_xgb <- predict(xgb1, test_set)
confusionMatrix(pred_xgb, test_set$label, mode = "prec_recall", positive="luad")
## Confusion Matrix and Statistics
##
## Reference
## Prediction luad lusc
## luad 179 5
## lusc 1 160
##
## Accuracy : 0.9826
## 95% CI : (0.9625, 0.9936)
## No Information Rate : 0.5217
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9651
##
## Mcnemar's Test P-Value : 0.2207
##
## Precision : 0.9728
## Recall : 0.9944
## F1 : 0.9835
## Prevalence : 0.5217
## Detection Rate : 0.5188
## Detection Prevalence : 0.5333
## Balanced Accuracy : 0.9821
##
## 'Positive' Class : luad
##
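SHAPforxgboost is loaded above but not yet used; a natural follow-up is to inspect per-gene SHAP contributions from the fitted booster, which caret stores in xgb1$finalModel. A minimal sketch, assuming caret kept the original column names (they may need aligning if any predictors were re-encoded):
# SHAP summary plot for the fitted booster (sketch; output not shown)
X <- as.matrix(train_set[, setdiff(names(train_set), "label")])
shap_long <- shap.prep(xgb_model = xgb1$finalModel, X_train = X)
shap.plot.summary(shap_long)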
# Set seed
set.seed(1234)
# Build model (a linear kernel is a common default when features far outnumber samples)
svm1 <- svm(label ~ ., data = train_set, type = 'C-classification', kernel = 'linear')
# Make predictions
pred_svm <- predict(svm1, test_set)
confusionMatrix(pred_svm, test_set$label, mode = "prec_recall", positive="luad")
## Confusion Matrix and Statistics
##
## Reference
## Prediction luad lusc
## luad 180 2
## lusc 0 163
##
## Accuracy : 0.9942
## 95% CI : (0.9792, 0.9993)
## No Information Rate : 0.5217
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9884
##
## Mcnemar's Test P-Value : 0.4795
##
## Precision : 0.9890
## Recall : 1.0000
## F1 : 0.9945
## Prevalence : 0.5217
## Detection Rate : 0.5217
## Detection Prevalence : 0.5275
## Balanced Accuracy : 0.9939
##
## 'Positive' Class : luad
##
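The cost parameter was left at its default (C = 1); to check sensitivity, e1071 offers a simple cross-validated grid search (an optional sketch; the grid values are illustrative and the search can be slow with 387 features):
# Optional: 10-fold CV over the cost parameter (illustrative grid; output not shown)
set.seed(1234)
svm_tuned <- tune(svm, label ~ ., data = train_set, kernel = "linear",
                  ranges = list(cost = c(0.01, 0.1, 1, 10)))
svm_tuned$best.parameters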
# Load knitr package
library(knitr)
# Create a dataframe with model performance metrics
model_performance <- data.frame(
  Model = c("Random Forest", "XGBoost", "SVM"),
  Accuracy = c("0.9536", "0.9826", "0.9942"),
  Precision = c("0.9316", "0.9728", "0.9890"),
  Recall = c("0.9833", "0.9944", "1.0000"),
  F1_Score = c("0.9568", "0.9835", "0.9945")
)
# Display the table using knitr::kable()
kable(model_performance, caption = "Performance Comparison of Machine Learning Models")
| Model | Accuracy | Precision | Recall | F1_Score |
|---|---|---|---|---|
| Random Forest | 0.9536 | 0.9316 | 0.9833 | 0.9568 |
| XGBoost | 0.9826 | 0.9728 | 0.9944 | 0.9835 |
| SVM | 0.9942 | 0.9890 | 1.0000 | 0.9945 |
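These metrics all depend on a fixed decision threshold; as a complementary check, ROC AUC could be compared for the two models that output class probabilities (a sketch; pROC is already pulled in as a caret dependency, and the SVM would need refitting with probability = TRUE to join the comparison):
# Compare ROC AUC for the probabilistic classifiers (sketch; output not shown)
library(pROC)
rf_prob  <- predict(rf,   test_set, type = "prob")[, "luad"]
xgb_prob <- predict(xgb1, test_set, type = "prob")[, "luad"]
auc(roc(test_set$label, rf_prob,  levels = c("lusc", "luad"), direction = "<"))
auc(roc(test_set$label, xgb_prob, levels = c("lusc", "luad"), direction = "<"))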
This study evaluated the performance of three supervised machine learning models—Random Forest (RF), XGBoost, and Support Vector Machine (SVM)—in classifying lung cancer samples as either adenocarcinoma (LUAD) or squamous cell carcinoma (LUSC). The models were assessed using key classification metrics, including accuracy, precision, recall, F1-score, and balanced accuracy.
Random Forest (RF) achieved an accuracy of 95.36%, with a recall of 98.33% and an F1-score of 95.68%. While the model demonstrated strong classification ability, its precision (93.16%) was the lowest of the three, reflecting a higher false-positive rate. McNemar's test (p = 0.02445) points to an asymmetry in the errors: the model misclassified LUSC samples as LUAD (13 cases) far more often than the reverse (3 cases).
XGBoost performed markedly better, with an accuracy of 98.26%, a precision of 97.28%, and an F1-score of 98.35%. Its recall of 99.44% means it correctly identified almost all LUAD cases while maintaining a strong balance between precision and recall. The McNemar's test p-value (0.2207) indicates that misclassification errors were relatively balanced between the two classes.
Support Vector Machine (SVM) emerged as the best-performing model, achieving the highest accuracy (99.42%), perfect recall (100%), and an F1-score of 99.45%. It also had the highest kappa (0.9884), indicating near-perfect agreement between predicted and actual labels. The model correctly classified every LUAD case and produced only two false positives, as reflected in its precision of 98.90%. The McNemar's test p-value (0.4795) suggests well-balanced errors across the two classes.
Among the three models, SVM demonstrated the best overall classification performance, with the highest accuracy, perfect recall, and the highest F1-score. The model identified all LUAD samples without any false negatives, which is crucial in clinical applications where misclassification could lead to incorrect treatment decisions. XGBoost also performed exceptionally well, with only a slight drop in precision and accuracy relative to SVM. Random Forest, while still highly effective, showed more misclassification errors than the other two models.
Given these results, SVM is the most reliable model for classifying lung cancer subtypes in this dataset. Future improvements could include hyperparameter tuning, feature selection, and ensemble methods to further enhance model performance and computational efficiency.
sessionInfo()
## R version 4.4.2 (2024-10-31 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 26100)
##
## Matrix products: default
##
##
## locale:
## [1] LC_COLLATE=English_Europe.utf8 LC_CTYPE=English_Europe.utf8
## [3] LC_MONETARY=English_Europe.utf8 LC_NUMERIC=C
## [5] LC_TIME=English_Europe.utf8
##
## time zone: America/Chicago
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] knitr_1.49 readxl_1.4.3 SHAPforxgboost_0.1.3
## [4] xgboost_1.7.8.1 e1071_1.7-16 randomForest_4.7-1.2
## [7] nnet_7.3-20 caret_7.0-1 lattice_0.22-6
## [10] ggplot2_3.5.1 stringr_1.5.1 dplyr_1.1.4
##
## loaded via a namespace (and not attached):
## [1] tidyselect_1.2.1 timeDate_4041.110 farver_2.1.2
## [4] fastmap_1.2.0 tweenr_2.0.3 pROC_1.18.5
## [7] digest_0.6.37 rpart_4.1.23 timechange_0.3.0
## [10] lifecycle_1.0.4 survival_3.7-0 magrittr_2.0.3
## [13] compiler_4.4.2 rlang_1.1.5 sass_0.4.9
## [16] tools_4.4.2 yaml_2.3.10 data.table_1.16.4
## [19] ggsignif_0.6.4 plyr_1.8.9 RColorBrewer_1.1-3
## [22] abind_1.4-8 withr_3.0.2 purrr_1.0.2
## [25] grid_4.4.2 polyclip_1.10-7 stats4_4.4.2
## [28] ggpubr_0.6.0 colorspace_2.1-1 future_1.34.0
## [31] globals_0.16.3 scales_1.3.0 iterators_1.0.14
## [34] MASS_7.3-61 BBmisc_1.13 cli_3.6.3
## [37] rmarkdown_2.29 generics_0.1.3 future.apply_1.11.3
## [40] reshape2_1.4.4 cachem_1.1.0 ggforce_0.4.2
## [43] proxy_0.4-27 splines_4.4.2 parallel_4.4.2
## [46] cellranger_1.1.0 vctrs_0.6.5 hardhat_1.4.1
## [49] Matrix_1.7-1 carData_3.0-5 jsonlite_1.8.9
## [52] car_3.1-3 rstatix_0.7.2 Formula_1.2-5
## [55] listenv_0.9.1 foreach_1.5.2 tidyr_1.3.1
## [58] gower_1.0.2 jquerylib_0.1.4 recipes_1.1.0
## [61] glue_1.8.0 parallelly_1.42.0 codetools_0.2-20
## [64] lubridate_1.9.4 stringi_1.8.4 gtable_0.3.6
## [67] munsell_0.5.1 tibble_3.2.1 pillar_1.10.1
## [70] htmltools_0.5.8.1 ipred_0.9-15 lava_1.8.1
## [73] R6_2.5.1 evaluate_1.0.3 backports_1.5.0
## [76] broom_1.0.7 bslib_0.9.0 class_7.3-22
## [79] Rcpp_1.0.14 nlme_3.1-166 prodlim_2024.06.25
## [82] checkmate_2.3.2 xfun_0.50 ModelMetrics_1.2.2.2
## [85] pkgconfig_2.0.3