Phase 1: Dataset Preparation and Feature Selection — Days 1-4
Acquire bulk transcriptome datasets for atherosclerosis from the GEO database, including training cohorts (GSE100927, GSE28829) and validation cohorts (GSE57691, GSE120521). Download normalized expression matrices and clinical metadata. Merge datasets using ComBat batch-effect correction (sva package in R). Remove genes with low variance (coefficient of variation <0.1) or low expression (mean TPM <1). From the single-cell analysis results, extract the 7 C1Q-associated differentially expressed genes identified in previous experiments. Verify gene expression distributions and check for missing values across all datasets. Perform log2 transformation (for any matrix not already on a log scale) and per-gene z-score normalization for machine learning compatibility.
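The filtering and scaling steps can be sketched in base R as follows. This is a minimal illustration on a simulated genes-by-samples matrix, not the real GEO data; the object names and the log2(x + 1) pseudocount are assumptions, and the ComBat step is omitted.

```r
# Toy genes-by-samples matrix of TPM-like values (simulated, not real GEO data).
set.seed(1)
expr <- matrix(rexp(50 * 12, rate = 0.2), nrow = 50,
               dimnames = list(paste0("gene", 1:50), paste0("s", 1:12)))
expr["gene1", ] <- 0.1                        # low expression: mean < 1
expr["gene2", ] <- 20 + rnorm(12, sd = 0.01)  # near-constant: CV < 0.1

gene_mean <- rowMeans(expr)
gene_cv   <- apply(expr, 1, sd) / gene_mean   # coefficient of variation

# Keep genes that pass both thresholds stated in the protocol.
keep     <- gene_mean >= 1 & gene_cv >= 0.1
filtered <- expr[keep, , drop = FALSE]

# log2(x + 1) avoids taking the log of zero; the pseudocount is an assumption.
logged <- log2(filtered + 1)

# Z-score each gene across samples (scale() works column-wise, hence t()).
z <- t(scale(t(logged)))

dim(filtered)
round(range(rowMeans(z)), 10)
```

When the z-scores feed a classifier, one common choice is to compute the centering and scaling constants on the training samples only and reuse them for validation samples.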
Phase 2: Machine Learning Model Development — Days 5-10
Split the training data into 80% training and 20% internal validation sets using stratified sampling to preserve class proportions. Implement three complementary machine learning algorithms: 1) Gradient Boosting Machine (GBM) using the gbm package with parameters n.trees=1000, interaction.depth=3, shrinkage=0.01, cv.folds=10; 2) LASSO regression using the glmnet package with alpha=1 and lambda determined by 10-fold cross-validation; 3) XGBoost using the xgboost package with nrounds=100, max_depth=6, eta=0.3, subsample=0.8. Treat these values as starting points and, for each algorithm, perform hyperparameter tuning using grid search with 5-fold cross-validation. Extract feature importance scores from each model and rank the 7 C1Q-related genes.
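A stratified 80/20 split can be sketched in base R as below. The labels are simulated and the helper name stratified_split is hypothetical; in practice a package helper could be used instead.

```r
set.seed(42)
# Hypothetical labels: 1 = atherosclerosis, 0 = control (simulated).
labels <- c(rep(1, 60), rep(0, 40))

# Draw train_frac of the indices within each class so that class
# proportions are preserved in both partitions.
stratified_split <- function(y, train_frac = 0.8) {
  train_idx <- unlist(lapply(split(seq_along(y), y), function(idx) {
    # Index via sample.int() to avoid sample()'s behaviour on length-1 input.
    idx[sample.int(length(idx), size = round(train_frac * length(idx)))]
  }))
  sort(unname(train_idx))
}

train_idx <- stratified_split(labels)
valid_idx <- setdiff(seq_along(labels), train_idx)

table(labels[train_idx])  # 32 controls, 48 cases
table(labels[valid_idx])  # 8 controls, 12 cases
```

Both partitions keep the original 60:40 case-to-control ratio, which is what stratification guarantees; it does not equalize the classes.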
Phase 3: Hub Gene Selection and Model Integration — Days 11-13
Combine feature importance rankings from all three algorithms using rank aggregation methods (RankAggreg package). Calculate ensemble importance scores as the weighted average of the normalized importance scores from each algorithm (equal weights initially). Select top-ranking genes based on consistency across algorithms: a gene must rank in the top 50% for at least 2 of the 3 algorithms. Validate the selection using recursive feature elimination (RFE) with 10-fold cross-validation to identify the optimal gene subset. Focus the analysis on C1QA and C1QC as primary hub genes based on combined ranking scores and biological relevance.
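The equal-weight ensemble score and the "top 50% in at least 2 of 3 algorithms" rule can be illustrated with a small base-R sketch. The importance values are made-up numbers, gene3 to gene7 are placeholder names for the unnamed candidates, and reading "top 50%" of 7 genes as rank <= 3.5 is an assumption. It does not use RankAggreg.

```r
# Hypothetical importance scores (made-up numbers, for illustration only).
genes <- c("C1QA", "C1QC", "gene3", "gene4", "gene5", "gene6", "gene7")
importance <- data.frame(
  gbm     = c(30, 25, 15, 10, 8, 7, 5),
  lasso   = c(0.9, 0.7, 0.0, 0.4, 0.1, 0.0, 0.2),
  xgboost = c(0.35, 0.30, 0.05, 0.10, 0.12, 0.05, 0.03),
  row.names = genes
)

# Min-max normalise each algorithm's scores to [0, 1] so they are comparable.
normalise <- function(x) (x - min(x)) / (max(x) - min(x))
norm_imp  <- as.data.frame(lapply(importance, normalise), row.names = genes)

# Equal-weight ensemble score (weights sum to 1).
weights  <- c(gbm = 1, lasso = 1, xgboost = 1) / 3
ensemble <- as.matrix(norm_imp) %*% weights

# Rank within each algorithm (1 = most important); ties share the mean rank.
ranks <- sapply(importance, function(x) rank(-x, ties.method = "average"))
rownames(ranks) <- genes

# Count in how many algorithms each gene falls in the top half.
top_half_votes <- rowSums(ranks <= nrow(ranks) / 2)
selected <- names(top_half_votes)[top_half_votes >= 2]
selected
```

With an odd number of candidates the cutoff for "top 50%" has to be fixed explicitly (top 3 or top 4), since the choice can change which genes pass the consistency rule.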
Phase 4: Diagnostic Model Development and Validation — Days 14-17
Develop generalized linear models (GLM) using the selected hub genes (C1QA and C1QC) as predictors for atherosclerosis classification. Create multiple model variants: 1) individual gene models (C1QA only, C1QC only), 2) a combined gene model (C1QA + C1QC), 3) an extended model including the C1QA-by-C1QC interaction term. Use the binomial family with a logit link function for binary classification. Train the models on the 80% training split and validate them on the held-out internal validation set. Perform 10-fold cross-validation to assess model stability and calculate confidence intervals for performance metrics.
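The model variants map directly onto glm() formulas. The sketch below fits them to simulated data; the effect sizes, sample size, and the column name status are assumptions chosen only to make the example run.

```r
set.seed(7)
n <- 200
# Simulated z-scored expression and disease labels; effect sizes are arbitrary.
dat <- data.frame(C1QA = rnorm(n), C1QC = rnorm(n))
dat$status <- rbinom(n, 1, plogis(-0.2 + 1.0 * dat$C1QA + 0.8 * dat$C1QC))

# One formula per model variant described in the protocol.
formulas <- list(
  c1qa_only   = status ~ C1QA,
  c1qc_only   = status ~ C1QC,
  combined    = status ~ C1QA + C1QC,
  interaction = status ~ C1QA * C1QC   # main effects plus C1QA:C1QC
)

# Logistic regression: binomial family with logit link.
models <- lapply(formulas, function(f) {
  glm(f, data = dat, family = binomial(link = "logit"))
})

# Compare fits and obtain a predicted probability for a new sample.
sapply(models, AIC)
predict(models$combined,
        newdata = data.frame(C1QA = 1, C1QC = 1),
        type = "response")
```

The predicted probabilities from type = "response" are the inputs that the later ROC and calibration analyses would consume.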
Phase 5: ROC Analysis and External Validation — Days 18-20
Generate receiver operating characteristic (ROC) curves for all model variants using the pROC package in R. Calculate the area under the curve (AUC), sensitivity, specificity, positive predictive value, and negative predictive value with 95% confidence intervals using bootstrapping (n=2000 iterations). Determine optimal probability thresholds using Youden's J statistic. Validate the final models on the external validation cohorts, ensuring that no samples overlap with the training data. Perform calibration analysis using the Hosmer-Lemeshow test and calibration plots to assess prediction reliability. Compare model performance between training and validation cohorts using DeLong's test.
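To make the quantities concrete, the following base-R sketch computes the AUC, the Youden-optimal threshold, and a percentile bootstrap interval on simulated scores. It is an illustration of the definitions rather than a replacement for pROC; the function names auc and youden are hypothetical, and resampling within each class is one of several possible bootstrap schemes.

```r
set.seed(11)
# Simulated predicted probabilities: cases tend to score higher than controls.
status <- c(rep(1, 50), rep(0, 50))
prob   <- c(rbeta(50, 4, 2), rbeta(50, 2, 4))

# AUC via the Mann-Whitney rank-sum identity.
auc <- function(y, p) {
  r  <- rank(p)
  n1 <- sum(y == 1)
  n0 <- sum(y == 0)
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Youden's J = sensitivity + specificity - 1 at each observed score;
# a sample is called positive when its score is >= the threshold.
youden <- function(y, p) {
  cuts <- sort(unique(p))
  j <- sapply(cuts, function(th) {
    sens <- mean(p[y == 1] >= th)
    spec <- mean(p[y == 0] <  th)
    sens + spec - 1
  })
  c(threshold = cuts[which.max(j)], J = max(j))
}

# Percentile bootstrap CI for the AUC (2000 replicates, resampled per class).
boot_auc <- replicate(2000, {
  i <- c(sample(which(status == 1), replace = TRUE),
         sample(which(status == 0), replace = TRUE))
  auc(status[i], prob[i])
})
ci <- quantile(boot_auc, c(0.025, 0.975))

auc(status, prob)
ci
youden(status, prob)
```

The threshold chosen this way should be fixed on the training data and then applied unchanged to the external cohorts, otherwise the validation sensitivity and specificity are optimistically biased.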
Phase 6: Model Performance Assessment and Clinical Utility — Days 21-23
Evaluate clinical utility using decision curve analysis to determine net benefit across probability thresholds. Calculate the number needed to diagnose (NND) and the likelihood ratios for positive and negative results. Perform subgroup analyses stratified by age, sex, and atherosclerosis severity when the metadata are available. Generate prediction nomograms for clinical application using the rms package. Conduct sensitivity analyses by testing model performance with different probability cutoffs and evaluating robustness to missing data. Create comprehensive performance reports with forest plots showing AUC values with confidence intervals across all validation cohorts.
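The clinical-utility quantities follow from standard formulas: net benefit at threshold pt is TP/n - (FP/n) * pt / (1 - pt), NND is 1 / (sensitivity + specificity - 1), and the likelihood ratios are sensitivity / (1 - specificity) and (1 - sensitivity) / specificity. A minimal base-R sketch on simulated data follows; the function names and the simulated outcome and scores are assumptions.

```r
# Likelihood ratios and number needed to diagnose from sensitivity/specificity.
diag_metrics <- function(sens, spec) {
  c(LR_pos = sens / (1 - spec),
    LR_neg = (1 - sens) / spec,
    NND    = 1 / (sens + spec - 1))
}

# Net benefit of acting on patients whose predicted probability is >= pt.
net_benefit <- function(y, p, pt) {
  n  <- length(y)
  tp <- sum(p >= pt & y == 1)
  fp <- sum(p >= pt & y == 0)
  tp / n - (fp / n) * pt / (1 - pt)
}

set.seed(3)
# Simulated outcome and predicted probabilities (illustrative only).
y <- rbinom(300, 1, 0.4)
p <- plogis(qlogis(0.4) + 1.5 * (y - 0.4) + rnorm(300))

# Decision curve: model vs. the treat-all and treat-none reference strategies.
thresholds <- seq(0.05, 0.95, by = 0.05)
curve <- data.frame(
  threshold  = thresholds,
  model      = sapply(thresholds, function(pt) net_benefit(y, p, pt)),
  treat_all  = mean(y) - (1 - mean(y)) * thresholds / (1 - thresholds),
  treat_none = 0
)

round(diag_metrics(sens = 0.8, spec = 0.9), 3)
head(curve)
```

A model is considered useful over the range of thresholds where its net benefit exceeds both reference strategies.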