easypheno.postprocess.model_reuse
Module Contents
Functions
| apply_final_model | Apply a final model on a new dataset. It will be applied to the whole dataset. |
| retrain_on_new_data | Train a model on a new dataset using the hyperparameters that worked best for the specified model results. |
- easypheno.postprocess.model_reuse.apply_final_model(results_directory_model, old_data_dir, new_data_dir, new_genotype_matrix, new_phenotype_matrix, save_dir=None)
Apply a final model to a new dataset. The model is applied to the whole dataset, so the main use case is predicting on newly obtained samples. If the final model was saved, it is used for inference on the new dataset. Otherwise, the model is retrained on the initial dataset and then used for inference on the new dataset.
The new dataset will be filtered for the SNP ids that the model was initially trained on.
CAUTION: the SNPs of the old and the new dataset have to be the same!
- Parameters
results_directory_model (str) – directory that contains the model results that you want to use
old_data_dir (str) – directory that contains the data that the model was trained on
new_data_dir (str) – directory that contains the new genotype and phenotype matrix
new_genotype_matrix (str) – new genotype matrix (incl. file suffix)
new_phenotype_matrix (str) – new phenotype matrix (incl. file suffix)
save_dir (str) – directory to store the results
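The SNP filtering described above can be pictured with a small, pure-Python sketch. It is not easyPheno's implementation: the helper name `filter_to_training_snps`, the SNP ids, and the samples-by-SNPs list layout are assumptions made for illustration only.

```python
def filter_to_training_snps(new_genotypes, new_snp_ids, train_snp_ids):
    """Keep only the columns of the new genotype matrix that the model saw.

    Hypothetical helper for illustration, not part of easyPheno.
    new_genotypes: list of rows (one per sample), one value per SNP.
    """
    # Every SNP the model was trained on has to be present in the new data
    missing = [snp for snp in train_snp_ids if snp not in set(new_snp_ids)]
    if missing:
        raise ValueError(f"New dataset lacks training SNPs: {missing[:5]}")
    # Map each SNP id of the new dataset to its column index
    column_of = {snp: idx for idx, snp in enumerate(new_snp_ids)}
    columns = [column_of[snp] for snp in train_snp_ids]
    # Select the columns in the order used during training
    return [[row[col] for col in columns] for row in new_genotypes]


# Toy data: 2 samples, 4 SNPs in additive (0/1/2) coding
new_X = [[0, 1, 2, 0],
         [2, 2, 0, 1]]
new_ids = ["snp_a", "snp_b", "snp_c", "snp_d"]
train_ids = ["snp_c", "snp_a"]  # SNPs (and order) the model was trained on

filtered = filter_to_training_snps(new_X, new_ids, train_ids)
print(filtered)
```

Raising an error when a training SNP is missing mirrors the caution above: the model expects exactly the features it was trained on.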
- easypheno.postprocess.model_reuse.retrain_on_new_data(results_directory_model, data_dir, genotype_matrix, phenotype_matrix, phenotype, encoding=None, maf_percentage=0, save_dir=None, datasplit='nested-cv', n_outerfolds=5, n_innerfolds=5, test_set_size_percentage=20, val_set_size_percentage=20, save_final_model=True)
Train a model on a new dataset using the hyperparameters that worked best for the specified model results.
- Parameters
data_dir (str) – data directory where the phenotype and genotype matrix are stored
genotype_matrix (str) – name of the genotype matrix (incl. file suffix)
phenotype_matrix (str) – name of the phenotype matrix (incl. file suffix)
phenotype (str) – name of the phenotype to predict
encoding (str) – encoding to use. Default is None, so standard encoding of each model will be used. Options are: ‘012’, ‘onehot’, ‘raw’
maf_percentage (int) – threshold for MAF filter as percentage value. Default is 0, so no MAF filtering
save_dir (str) – directory for saving the results. Default is None, so same directory as data_dir
datasplit (str) – datasplit to use. Options are: nested-cv, cv-test, train-val-test
n_outerfolds (int) – number of outer folds, relevant for nested-cv
n_innerfolds (int) – number of inner folds, relevant for nested-cv and cv-test
test_set_size_percentage (int) – size of the test set as a percentage, relevant for cv-test and train-val-test
val_set_size_percentage (int) – size of the validation set as a percentage, relevant for train-val-test
save_final_model (bool) – specify if the final model should be saved
results_directory_model (str) – directory that contains the model results whose best hyperparameters should be reused
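A call to `retrain_on_new_data` might be prepared as sketched below. The defaults and option values are copied from the signature and parameter list above. The helper names `build_retrain_kwargs` and `retrain`, all paths, file names, and the phenotype name are hypothetical placeholders. The import is deferred so that the argument handling can be inspected without easyPheno installed.

```python
# Documented options for `datasplit` and `encoding`
VALID_DATASPLITS = ("nested-cv", "cv-test", "train-val-test")
VALID_ENCODINGS = (None, "012", "onehot", "raw")

# Which split settings are relevant for which datasplit (per the parameter list)
RELEVANT_SPLIT_PARAMS = {
    "nested-cv": ("n_outerfolds", "n_innerfolds"),
    "cv-test": ("n_innerfolds", "test_set_size_percentage"),
    "train-val-test": ("test_set_size_percentage", "val_set_size_percentage"),
}


def build_retrain_kwargs(results_directory_model, data_dir, genotype_matrix,
                         phenotype_matrix, phenotype, **overrides):
    """Hypothetical helper: merge overrides into the documented defaults."""
    kwargs = {
        "results_directory_model": results_directory_model,
        "data_dir": data_dir,
        "genotype_matrix": genotype_matrix,
        "phenotype_matrix": phenotype_matrix,
        "phenotype": phenotype,
        "encoding": None,
        "maf_percentage": 0,
        "save_dir": None,
        "datasplit": "nested-cv",
        "n_outerfolds": 5,
        "n_innerfolds": 5,
        "test_set_size_percentage": 20,
        "val_set_size_percentage": 20,
        "save_final_model": True,
    }
    unknown = set(overrides) - set(kwargs)
    if unknown:
        raise TypeError(f"Unknown arguments: {sorted(unknown)}")
    kwargs.update(overrides)
    if kwargs["datasplit"] not in VALID_DATASPLITS:
        raise ValueError(f"datasplit must be one of {VALID_DATASPLITS}")
    if kwargs["encoding"] not in VALID_ENCODINGS:
        raise ValueError(f"encoding must be one of {VALID_ENCODINGS}")
    return kwargs


def retrain(kwargs):
    """Run the retraining; requires easyPheno to be installed."""
    from easypheno.postprocess import model_reuse
    return model_reuse.retrain_on_new_data(**kwargs)


# All paths and names below are placeholders
kwargs = build_retrain_kwargs(
    results_directory_model="/path/to/previous/results",
    data_dir="/path/to/new/data",
    genotype_matrix="new_genotypes.csv",
    phenotype_matrix="new_phenotypes.csv",
    phenotype="trait_1",
    datasplit="cv-test",
    test_set_size_percentage=25,
)
# Show only the split settings that matter for the chosen datasplit
relevant = {name: kwargs[name] for name in RELEVANT_SPLIT_PARAMS[kwargs["datasplit"]]}
print(relevant)
# retrain(kwargs)  # uncomment once easyPheno and the data are available
```

Checking `datasplit` and `encoding` before the call surfaces typos early, before any data is loaded.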