
ml-multiple-myeloma

This repository contains the second project for the Machine Learning course at NOVA FCT.

Authors:

  • Weronika Łoś
  • Paweł Spychała
  • Piotr Ratajczak

Note

CatBoost seems to require a virtual environment with Python 3.13 to run.

pip install pandas numpy matplotlib seaborn missingno scikit-learn xgboost catboost

Here's a rundown of the project from Prof. Claudia Soares's website:

Semi-Supervised Learning for Predicting Survival Time in Multiple Myeloma Patients

Objective Apply semi-supervised techniques to develop predictive models for estimating the survival time of patients with multiple myeloma. This approach is particularly relevant given the presence of missing data and potentially unlabeled instances in the dataset.

Note that there is a field of statistics called Survival Analysis that uses specific formalisms and models to address this problem. We will not use this approach. Our goal is to formulate the problem in line with our Machine Learning (ML) tasks, select models from the toolset learned during the course, and define metrics aligned with the objective of predicting survival time.

Dataset Description The provided synthetic dataset simulates clinical data for multiple myeloma patients, including various features with correlated missing values. Importantly, a portion of the dataset will have unlabeled 'SurvivalTime' values, simulating a common real-world scenario in which complete information is not always available.

The dataset comprises features such as age, gender, disease stage, genetic risk, treatment type, comorbidity index, treatment response, survival time, and a censoring indicator.

Your goal is to accurately predict survival time from features, accounting for missing values in the features and missing labels (censored data).

Censoring is a type of missing-data issue in which the time to an event is not observed because the study ends before all subjects have experienced the event of interest, or a subject leaves the study before an event occurs. This is common in survival analysis.

If only the lower limit l for the actual event time T is known, such that T > l, it's referred to as right censoring. Right censoring occurs, for example, when the birth date of a subject is known but they are still alive at the time of loss to follow-up or when the study concludes. We typically encounter right-censored data.

Consider the following example, from the Python toolbox scikit-survival documentation.

It illustrates censoring in a medical study examining coronary heart disease, conducted over a 12-month period.

(Figure: study timeline for patients A–E over the 12-month observation period, marking censored and uncensored records.)

The graphic is interpreted as follows:

  • Patient A: lost to follow-up after 3 months; no cardiovascular event recorded; record is censored
  • Patient B: experienced an event 4.5 months after enrollment; record is uncensored
  • Patient C: experienced an event 3.5 months after enrollment; record is uncensored
  • Patient D: withdrew from the study after 2 months; record is censored
  • Patient E: did not experience any event before the study ended; record is censored

The takeaways from this example are the following:

  • Only patients B and C have uncensored records (exact time of cardiovascular event known)
  • Patients A, D, and E have censored records (only known to be event-free up to their last follow-up)
  • For censored patients, it is unknown whether they experienced an event after leaving the study or after the study ended.
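The example above can be encoded as a small table with a censoring indicator, as we will do for the assignment data. A minimal sketch (patient labels and times taken from the example; the column names are our own choice):

```python
import pandas as pd

# Observed times (months) and censoring indicators for patients A-E:
# c = 1 means censored (exact event time unknown), c = 0 means event observed.
patients = pd.DataFrame({
    "patient": ["A", "B", "C", "D", "E"],
    "time":    [3.0, 4.5, 3.5, 2.0, 12.0],
    "c":       [1,   0,   0,   1,   1],
})

# Only the uncensored rows carry an exact event time.
uncensored = patients[patients["c"] == 0]
print(uncensored["patient"].tolist())  # ['B', 'C']
```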

Missing data

In real datasets, it is common to observe missing data.

Missing data can arise from various sources, including incomplete patient records, unnecessary exams for a given patient, or data entry errors. Understanding the patterns and mechanisms of missing data is crucial for selecting appropriate strategies and ensuring the validity of our predictive models.

Tasks

Task 1. Setting the baseline

As in the previous assignment, the first step is to set a baseline. The ML task is regression, and we will fit a linear regression model as a baseline to compare against subsequent developments.

Task 1.1 Data preparation and validation pipeline

Explore the dataset, focusing on patterns of missing data and the distribution of labeled versus unlabeled examples. To this end, install the package missingno with pip or conda. The project's GitHub page can be found here. It has been featured in Towards Data Science, Kaggle, and other sources.

  • Plot a bar chart of the missing values using the following code
import missingno as msno

msno.bar(df)

where df is assumed to contain the training data.

Perform other visualizations using msno.heatmap(df), msno.matrix(df), and msno.dendrogram(df). Analyze the plots and comment on them in the slides. If one were to drop all data points with missing values, as well as the censored ones, would it be possible to fit a model?

  • Drop the columns containing features with missing data, the censored data points, and the points with missing survival times. How many points are left?
  • Check the pairplot between the remaining features and the target variable. Analyze and comment on the slides.
  • Define the matrix X with the features as columns and examples as rows, and y as a vector with the Survival Time.
  • Consider a train, validation, and test split versus a train and test split with cross-validation. What validation procedure is more data-efficient? Justify your answer with evidence from the dataset.
  • Define a metric. As the data is right-censored, we will use the censored Mean Squared Error (cMSE) that can be coded as
import numpy as np

def error_metric(y, y_hat, c):
    # cMSE: full squared error for uncensored rows (c == 0); for censored
    # rows (c == 1), only under-predictions (y_hat < y) are penalized.
    err = y - y_hat
    err = (1 - c) * err**2 + c * np.maximum(0, err)**2
    return np.sum(err) / err.shape[0]

Here, c is the censoring indicator and y is the true Survival Time, as determined by the ground truth. The variable y_hat contains the predicted Survival Time.
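The metric can be sanity-checked on toy values (numbers invented for illustration; the function definition is repeated so the snippet runs on its own):

```python
import numpy as np

def error_metric(y, y_hat, c):
    # cMSE: for censored rows only under-predictions are penalized.
    err = y - y_hat
    err = (1 - c) * err**2 + c * np.maximum(0, err)**2
    return np.sum(err) / err.shape[0]

y     = np.array([10.0, 5.0])   # true survival times (toy values)
y_hat = np.array([ 8.0, 6.0])   # predictions
c     = np.array([ 0,   1  ])   # second patient is censored

# Uncensored row contributes (10-8)^2 = 4; the censored row over-predicts
# (y_hat > y), so it contributes 0. Mean = 4 / 2 = 2.0.
print(error_metric(y, y_hat, c))  # 2.0
```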

Task 1.2 Learn the baseline model

  • Learn a baseline model that, given the features without missing data, can predict the uncensored, non-missing target variable. Your baseline is a Linear Regression model.
  • For the baseline model, make a pipeline and add a StandardScaler instance before the regressor. Note that, for the uncensored data used in this task, the cMSE is equivalent to the MSE.
  • To assess the quality of your model, build the y-y hat plot, calculate the cMSE, and examine them. Comment and document your plots and data on the slides.
  • Submit the predictions of the baseline model to Kaggle with the name baseline-submission-xx.csv where xx is a natural number. The submission used for grading is the one with the larger value.
  • Include in the slides if there are any large discrepancies between the cMSE you estimated locally from your test split and the one computed by Kaggle.
  • If there is such a discrepancy, work on your validation strategy and correct it. Update the slides accordingly to describe the issue.
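The baseline pipeline described above can be sketched as follows. The arrays here are synthetic stand-ins; in the project, X and y come from the data preparation of Task 1.1:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; replace with the Task 1.1 features and targets.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ rng.normal(size=5) + rng.normal(scale=0.1, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# StandardScaler before the regressor, as required for the baseline.
baseline = make_pipeline(StandardScaler(), LinearRegression())
baseline.fit(X_train, y_train)
y_hat = baseline.predict(X_test)

# On uncensored data the cMSE reduces to the plain MSE.
mse = np.mean((y_test - y_hat) ** 2)
```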

Task 2. Nonlinear models

Task 2.1 Development

  • Develop functions for training Polynomial Regression and k-Nearest Neighbors models on the data prepared in Task 1.1, using the validation procedure determined in Tasks 1.1 and 1.2.
  • Select the model hyperparameters, such as the polynomial degree and the number of neighbors k, using cross-validation for model selection.
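One way to select the polynomial degree and k by cross-validation, sketched on synthetic data (in the project, swap in the Task 1.1 data and a cMSE-based scorer):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in data with a quadratic dependence on the first feature.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(150, 3))
y = X[:, 0] ** 2 + X[:, 1] + rng.normal(scale=0.05, size=150)

# Polynomial regression: search over the degree.
poly = make_pipeline(PolynomialFeatures(), StandardScaler(), LinearRegression())
poly_cv = GridSearchCV(poly, {"polynomialfeatures__degree": [1, 2, 3]},
                       scoring="neg_mean_squared_error", cv=5)
poly_cv.fit(X, y)

# k-NN: search over the number of neighbors.
knn = make_pipeline(StandardScaler(), KNeighborsRegressor())
knn_cv = GridSearchCV(knn, {"kneighborsregressor__n_neighbors": [3, 5, 10, 20]},
                      scoring="neg_mean_squared_error", cv=5)
knn_cv.fit(X, y)

print(poly_cv.best_params_, knn_cv.best_params_)
```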

Task 2.2 Evaluation

  • Evaluate the models developed in Task 2.1 against the baseline. Always back up your analysis with evidence, e.g., by presenting a table that displays the different models and their maximum, minimum, mean error, and standard deviation of error.
  • Submit the best predictions from Task 2 to Kaggle with the file name Nonlinear-submission-xx.csv where xx is a natural number. The submission used for grading is the one with the larger value.

Task 3. Handling missing data

We now add the data points where the features have missing data. We still cannot take advantage of the unlabeled data, as our ML task is regression, a supervised learning task.

Task 3.1 Missing data imputation

  • Experiment with completing missing data using imputation techniques in Scikit-Learn and here, using the baseline model.
  • Compare the results with Task 1.2 in the slides, using a table with the error statistics and the y-y hat plot. Present evidence of your analysis.
  • Choose the best imputation strategies obtained with the baseline and apply them to the best models of Task 2. Analyze your results and report them in the slides, with evidence from your experiments.
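Plugging a Scikit-Learn imputer in front of the baseline can be sketched as below (SimpleImputer and KNNImputer shown; IterativeImputer follows the same pattern; the data is a synthetic stand-in):

```python
import numpy as np
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with ~10% of the feature entries knocked out.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = X @ np.ones(4)
X[rng.random(X.shape) < 0.1] = np.nan

for imputer in (SimpleImputer(strategy="mean"), KNNImputer(n_neighbors=5)):
    # The imputer fills NaNs before scaling and regression.
    model = make_pipeline(imputer, StandardScaler(), LinearRegression())
    model.fit(X, y)
    print(type(imputer).__name__, model.score(X, y))
```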

Task 3.2 Train models that do not require imputation

  • Develop code to apply models and techniques that can directly handle missing data, such as tree-based methods, like decision trees.
  • Experiment with Scikit-Learn's HistGradientBoostingRegressor and CatBoost's CatBoostRegressor. For installation instructions for the CatBoost library, check here. You can use conda or pip.
  • There is a tutorial on using CatBoost for censored data here. Try the Accelerated Failure Time (AFT) CatBoost applied to the assignment data.

Task 3.3 Evaluation

  • Compare the results of the strategies developed in Task 3.1 and 3.2 with the baseline model in the slides, using a table with the error statistics and the y-y hat plot. Present evidence of your analysis.
  • Try the best imputation strategies of Task 3.1, impute the data, run the best model of task 3.2 and compare with the baseline in the slides.
  • Submit the best predictions from Task 3 to Kaggle with the file name handle-missing-submission-xx.csv where xx is a natural number. The submission used for grading is the one with the larger value.

Task 4. Semi-supervised learning for unlabeled data

There are missing values in the target variable. Nevertheless, we can take advantage of the unlabeled subset to help with the supervised learning task.

Task 4.1 Imputation with labeled and unlabeled data

  • Fit the best imputation methods from Task 3.1 on both the labeled and unlabeled datasets. Then, use the imputed data with labels to train a Linear Regression model. Compare with the baseline and with the model trained in Task 3.1.
  • Use the labeled and unlabeled data to train an Isomap lower-dimensional representation of the data. See here for more details on the model. You will need to train the Isomap transformer on the combined supervised + unsupervised dataset.
  • To be able to add the semi-supervised Isomap transformer model to a pipeline, you will need the following wrapper code
from sklearn.base import BaseEstimator

class FrozenTransformer(BaseEstimator):
    def __init__(self, fitted_transformer):
        self.fitted_transformer = fitted_transformer

    def __getattr__(self, name):
        # `fitted_transformer`'s attributes are now accessible
        return getattr(self.fitted_transformer, name)

    def __sklearn_clone__(self):
        return self

    def fit(self, X, y=None):
        # Fitting does not change the state of the estimator
        return self

    def transform(self, X, y=None):
        # transform only transforms the data
        return self.fitted_transformer.transform(X)

    def fit_transform(self, X, y=None):
        # fit_transform only transforms the data
        return self.fitted_transformer.transform(X)

and the FrozenTransformer can be used as

from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.manifold import Isomap
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Impute missing values with the best imputers from Task 3.1
imp = SimpleImputer()
# X is the union of the unsupervised and (train) supervised feature datasets
X = imp.fit_transform(X)

scaler = StandardScaler()
X = scaler.fit_transform(X)
# Try different numbers of components.
iso = Isomap(n_components=2)
iso.fit(X)

pipe = make_pipeline(SimpleImputer(),
                     scaler,
                     FrozenTransformer(iso), # <- Here is the frozen Isomap
                     LinearRegression())

# (X_train, y_train) is the labeled, supervised data
pipe.fit(X_train, y_train)

Task 4.2 Evaluation

  • Compare the results of the strategies developed in Task 4.1 with the baseline model in the slides, using a table with error statistics and a y-y hat plot. Present evidence of your analysis.
  • Try the best imputation strategies from Task 3.1, impute the data, run the best model from Task 3.2, and compare with the baseline in the slides.
  • Submit the best predictions from Task 4 to Kaggle with the file name semisupervised-submission-xx.csv where xx is a natural number. The submission used for grading is the one with the larger value.

Task 5 [optional]

To build your final model, use anything you have learned in the ML course. Submit the predictions to Kaggle as optional-submission-xx.csv where xx is a natural number. The submission used for grading is the one with the larger value. Describe your model architecture and options in the slides.

About

Project for Machine Learning at Universidade NOVA de Lisboa
