easypheno.postprocess.model_reuse

Module Contents

Functions

apply_final_model(results_directory_model, old_data_dir, new_data_dir, new_genotype_matrix, new_phenotype_matrix, save_dir = None)

Apply a final model to a new dataset. The model is applied to the whole dataset.

retrain_on_new_data(results_directory_model, data_dir, genotype_matrix, phenotype_matrix, phenotype, encoding = None, maf_percentage = 0, save_dir = None, datasplit = 'nested-cv', n_outerfolds = 5, n_innerfolds = 5, test_set_size_percentage = 20, val_set_size_percentage = 20, save_final_model = True)

Train a model on a new dataset using the hyperparameters that worked best for the specified model results.

easypheno.postprocess.model_reuse.apply_final_model(results_directory_model, old_data_dir, new_data_dir, new_genotype_matrix, new_phenotype_matrix, save_dir=None)

Apply a final model to a new dataset. The model is applied to the whole dataset, so the main use case is predicting on newly obtained samples. If the final model was saved, it is used for inference on the new dataset. Otherwise, the model is retrained on the initial dataset and then used for inference on the new dataset.

The new dataset will be filtered for the SNP ids that the model was initially trained on.

CAUTION: the SNPs of the old and the new dataset have to be the same!
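To make the SNP id filtering more concrete, here is a minimal pure-Python sketch. It is not easyPheno's implementation: the helper `filter_to_training_snps`, the toy SNP ids, and the choice to raise an error when a training SNP is missing are all assumptions made for illustration.

```python
def filter_to_training_snps(new_snp_ids, new_genotypes, training_snp_ids):
    """Restrict the new genotype rows to the SNP ids used in training.

    new_snp_ids: list of SNP ids, one per column of new_genotypes
    new_genotypes: list of rows (one per sample), each a list of values
    training_snp_ids: SNP ids the model was trained on, in training order
    """
    # Map each SNP id in the new dataset to its column index.
    column_of = {snp_id: idx for idx, snp_id in enumerate(new_snp_ids)}
    missing = [s for s in training_snp_ids if s not in column_of]
    if missing:
        # Assumption of this sketch: a missing training SNP is treated as an error.
        raise ValueError(f"SNPs missing from the new dataset: {missing}")
    # Select the columns in the order the model saw them during training.
    columns = [column_of[s] for s in training_snp_ids]
    return [[row[c] for c in columns] for row in new_genotypes]


# Hypothetical toy data: the new dataset carries one extra SNP (snp_2).
training_ids = ["snp_1", "snp_3"]
new_ids = ["snp_1", "snp_2", "snp_3"]
new_matrix = [[0, 1, 2], [2, 0, 1]]
filtered = filter_to_training_snps(new_ids, new_matrix, training_ids)
```

In this sketch the extra column is dropped, while a training SNP that is absent from the new data stops the run instead of silently producing a matrix with the wrong shape.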

Parameters
  • results_directory_model (str) – directory that contains the model results that you want to use

  • old_data_dir (str) – directory that contains the data that the model was trained on

  • new_data_dir (str) – directory that contains the new genotype and phenotype matrix

  • new_genotype_matrix (str) – new genotype matrix (incl. file suffix)

  • new_phenotype_matrix (str) – new phenotype matrix (incl. file suffix)

  • save_dir (str) – directory to store the results
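A call could look roughly like the sketch below. The argument names come from the signature above; every path and file name is a hypothetical placeholder, and the helper functions `build_apply_final_model_kwargs` and `run_apply_final_model` exist only for this example.

```python
def build_apply_final_model_kwargs():
    """Collect keyword arguments for apply_final_model (placeholder values)."""
    return {
        # Directory with the results of the model to reuse (hypothetical path).
        "results_directory_model": "/path/to/results/model_dir",
        # Directory with the data the model was originally trained on.
        "old_data_dir": "/path/to/old_data",
        # Directory with the new genotype and phenotype matrix.
        "new_data_dir": "/path/to/new_data",
        # File names include the file suffix (hypothetical names).
        "new_genotype_matrix": "new_genotypes.csv",
        "new_phenotype_matrix": "new_phenotypes.csv",
        # None is the default given in the signature.
        "save_dir": None,
    }


def run_apply_final_model():
    # Imported lazily so the sketch loads even without easyPheno installed.
    from easypheno.postprocess import model_reuse

    model_reuse.apply_final_model(**build_apply_final_model_kwargs())
```

Calling `run_apply_final_model()` requires an installed easyPheno package and real directories in place of the placeholders.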

easypheno.postprocess.model_reuse.retrain_on_new_data(results_directory_model, data_dir, genotype_matrix, phenotype_matrix, phenotype, encoding=None, maf_percentage=0, save_dir=None, datasplit='nested-cv', n_outerfolds=5, n_innerfolds=5, test_set_size_percentage=20, val_set_size_percentage=20, save_final_model=True)

Train a model on a new dataset using the hyperparameters that worked best for the specified model results.

Parameters
  • data_dir (str) – data directory where the phenotype and genotype matrix are stored

  • genotype_matrix (str) – name of the genotype matrix, including the file suffix

  • phenotype_matrix (str) – name of the phenotype matrix, including the file suffix

  • phenotype (str) – name of the phenotype to predict

  • encoding (str) – encoding to use. Default is None, so the standard encoding of each model is used. Options are: ‘012’, ‘onehot’, ‘raw’

  • maf_percentage (int) – threshold for the MAF filter as a percentage value. Default is 0, so no MAF filtering is applied

  • save_dir (str) – directory for saving the results. Default is None, so same directory as data_dir

  • datasplit (str) – datasplit to use. Options are: nested-cv, cv-test, train-val-test

  • n_outerfolds (int) – number of outer folds; relevant for nested-cv

  • n_innerfolds (int) – number of folds; relevant for nested-cv and cv-test

  • test_set_size_percentage (int) – size of the test set as a percentage; relevant for cv-test and train-val-test

  • val_set_size_percentage (int) – size of the validation set as a percentage; relevant for train-val-test

  • save_final_model (bool) – specify if the final model should be saved

  • results_directory_model (str) – directory that contains the model results whose best hyperparameters should be reused
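The parameter list states which split settings matter for which datasplit. The sketch below encodes that mapping and shows a possible call. The names `RELEVANT_SPLIT_PARAMS`, `relevant_split_settings`, and `run_retrain` are hypothetical helpers, not part of easyPheno, and all paths, file names, and the phenotype name are placeholders.

```python
# Which split settings are relevant for which datasplit, per the list above.
RELEVANT_SPLIT_PARAMS = {
    "nested-cv": ("n_outerfolds", "n_innerfolds"),
    "cv-test": ("n_innerfolds", "test_set_size_percentage"),
    "train-val-test": ("test_set_size_percentage", "val_set_size_percentage"),
}


def relevant_split_settings(datasplit, **settings):
    """Return only the settings that are relevant for the chosen datasplit."""
    if datasplit not in RELEVANT_SPLIT_PARAMS:
        raise ValueError(f"Unknown datasplit: {datasplit!r}")
    names = RELEVANT_SPLIT_PARAMS[datasplit]
    return {name: settings[name] for name in names if name in settings}


# Defaults taken from the signature of retrain_on_new_data.
defaults = dict(
    n_outerfolds=5,
    n_innerfolds=5,
    test_set_size_percentage=20,
    val_set_size_percentage=20,
)
cv_test_settings = relevant_split_settings("cv-test", **defaults)


def run_retrain(datasplit="cv-test"):
    # Imported lazily so the sketch loads even without easyPheno installed.
    from easypheno.postprocess import model_reuse

    model_reuse.retrain_on_new_data(
        results_directory_model="/path/to/results/model_dir",  # placeholder
        data_dir="/path/to/new_data",  # placeholder
        genotype_matrix="new_genotypes.csv",  # placeholder file name
        phenotype_matrix="new_phenotypes.csv",  # placeholder file name
        phenotype="trait_name",  # placeholder phenotype name
        datasplit=datasplit,
        **relevant_split_settings(datasplit, **defaults),
    )
```

Passing only the relevant settings keeps the call short; the remaining parameters fall back to the defaults from the signature.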