`easypheno.optimization.optuna_optim`

Module Contents

Classes

OptunaOptim

Class that contains all info for the whole optimization using optuna for one model and dataset.

class easypheno.optimization.optuna_optim.OptunaOptim(save_dir, genotype_matrix_name, phenotype_matrix_name, phenotype, n_outerfolds, n_innerfolds, val_set_size_percentage, test_set_size_percentage, maf_percentage, n_trials, save_final_model, batch_size, n_epochs, task, current_model_name, dataset, models_start_time, intermediate_results_interval=50, outerfold_number_to_run=None)

Class that contains all info for the whole optimization using optuna for one model and dataset.

Attributes

task (str): ML task (regression or classification) depending on target variable

current_model_name (str): name of the current model according to naming of .py file in package model

dataset (Dataset): dataset to use for optimization run

datasplit_subpath (str): subpath with datasplit info relevant for saving / naming

base_path (str): base_path for save_path

save_path (str): path for model and results storing

study (optuna.study.Study): optuna study for optimization run

current_best_val_result (float): the best validation result so far

early_stopping_point (int): point at which early stopping occurred (relevant for some models)

user_input_params (dict): all params handed over to the constructor that are needed in the whole class

Parameters

save_dir (pathlib.Path) – directory for saving the results
genotype_matrix_name (str) – name of the genotype matrix including datatype ending
phenotype_matrix_name (str) – name of the phenotype matrix including datatype ending
phenotype (str) – name of the phenotype to predict
n_outerfolds (int) – number of outerfolds relevant for nested-cv
n_innerfolds (int) – number of folds relevant for nested-cv and cv-test
test_set_size_percentage (int) – size of the test set relevant for cv-test and train-val-test
val_set_size_percentage (int) – size of the validation set relevant for train-val-test
maf_percentage (int) – threshold for MAF filter as percentage value
n_trials (int) – number of trials for optuna
save_final_model (bool) – specify if the final model should be saved
batch_size (int) – batch size for neural network models
n_epochs (int) – number of epochs for neural network models
task (str) – ML task (regression or classification) depending on target variable
current_model_name (str) – name of the current model according to naming of .py file in package model
dataset (easypheno.preprocess.base_dataset.Dataset) – dataset to use for optimization run
models_start_time (str) – string combining the optimized models and the start time of the optimization run, used for saving purposes
intermediate_results_interval (int) – number of trials after which intermediate results will be saved
outerfold_number_to_run (int) – outerfold to run in case you do not want to run all

create_new_study(self)

Create a new optuna study.

Returns: a new optuna study instance
Return type: optuna.study.Study

objective(self, trial, train_val_indices)

Objective function for the optuna optimization. Returns the score of the suggested hyperparameter configuration.

Parameters

trial (optuna.trial.Trial) – trial of optuna for optimization
train_val_indices (dict) – indices of train and validation sets

Returns

score of the current hyperparameter config

Return type

float
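The idea of reducing one hyperparameter configuration to a single score can be sketched without optuna. This is a minimal illustration, not the package's implementation: the layout of `train_val_indices` (fold name mapped to `train` and `val` index lists), the helper name `mean_validation_score`, and the `fit_and_score` callable are all assumptions made for this example.

```python
from statistics import mean
from typing import Callable, Dict, List, Sequence


def mean_validation_score(
    params: Dict[str, float],
    train_val_indices: Dict[str, Dict[str, List[int]]],
    fit_and_score: Callable[[Dict[str, float], Sequence[int], Sequence[int]], float],
) -> float:
    """Average a validation score over all folds (hypothetical index layout)."""
    scores = []
    for split in train_val_indices.values():
        # Assumed layout: {"fold_0": {"train": [...], "val": [...]}, ...}
        scores.append(fit_and_score(params, split["train"], split["val"]))
    return mean(scores)


# Toy usage: the "score" is simply the fraction of samples held out for validation
toy_indices = {
    "fold_0": {"train": [0, 1, 2], "val": [3]},
    "fold_1": {"train": [1, 2, 3], "val": [0]},
}
toy_score = mean_validation_score(
    {"alpha": 0.1},
    toy_indices,
    lambda params, train, val: len(val) / (len(train) + len(val)),
)
```

In a real run, `fit_and_score` would train the model on the training indices and evaluate it on the validation indices; the returned mean is what an optimizer would then maximize or minimize.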

clean_up_after_exception(self, trial_number, trial_params, reason)

Clean up after an exception: delete the unfitted model if it exists and update the runtime csv

Parameters

trial_number (int) – number of the trial
trial_params (dict) – parameters of the trial
reason (str) – hint for the reason of the Exception

write_runtime_csv(self, dict_runtime)

Write runtime info to runtime csv file

Parameters: dict_runtime (dict) – dictionary with runtime information
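Appending one dictionary per trial to a CSV file could look roughly like the sketch below. It uses only the standard library; the function name `append_runtime_row`, the file name `runtime_overview.csv`, and the column names are hypothetical and not taken from easypheno.

```python
import csv
import os
import tempfile


def append_runtime_row(csv_path, dict_runtime):
    """Append one row of runtime info; write the header only for a new file."""
    file_exists = os.path.isfile(csv_path)
    with open(csv_path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(dict_runtime.keys()))
        if not file_exists:
            writer.writeheader()
        writer.writerow(dict_runtime)


# Toy usage with hypothetical column names
tmp_dir = tempfile.mkdtemp()
runtime_csv = os.path.join(tmp_dir, "runtime_overview.csv")
append_runtime_row(runtime_csv, {"Trial": 0, "process_time_s": 1.5})
append_runtime_row(runtime_csv, {"Trial": 1, "process_time_s": 2.5})

# Read the file back to check what was written
with open(runtime_csv, newline="") as f:
    rows = list(csv.DictReader(f))
```

Note that `csv.DictReader` returns every value as a string, so numeric columns need to be converted again before computing statistics.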

calc_runtime_stats(self)

Calculate runtime stats for saved csv file.

Returns: dict with runtime info enhanced with runtime stats
Return type: dict
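As a rough illustration of what "runtime stats" might contain, the sketch below summarizes one numeric column of previously saved rows. Which statistics the method actually computes is not stated here; the choice of mean, standard deviation, minimum and maximum, as well as the names `runtime_stats` and `process_time_s`, are assumptions.

```python
from statistics import mean, stdev


def runtime_stats(rows, column):
    """Summarize one numeric column of runtime rows (values may be strings)."""
    values = [float(row[column]) for row in rows]
    return {
        f"{column}_mean": mean(values),
        # The sample standard deviation needs at least two values
        f"{column}_std": stdev(values) if len(values) > 1 else 0.0,
        f"{column}_min": min(values),
        f"{column}_max": max(values),
    }


# Toy usage: rows as they would come back from csv.DictReader
toy_rows = [
    {"process_time_s": "2.0"},
    {"process_time_s": "4.0"},
    {"process_time_s": "6.0"},
]
stats = runtime_stats(toy_rows, "process_time_s")
```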

check_params_for_duplicate(self, current_params)

Check if params were already suggested which might happen by design of TPE sampler.

Parameters: current_params (dict) – dictionary with current parameters
Returns: True if the current params were already used in the same study, False otherwise
Return type: bool
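A duplicate check of this kind boils down to comparing the current parameter dictionary with those of earlier trials. The following is a minimal sketch under the assumption that past parameters are available as a list of dictionaries; the helper name `params_already_tried` is hypothetical.

```python
def params_already_tried(current_params, past_trial_params):
    """Return True if an identical parameter dict was already evaluated."""
    return any(current_params == past for past in past_trial_params)


# Toy usage with made-up hyperparameters
past = [
    {"n_estimators": 100, "max_depth": 3},
    {"n_estimators": 200, "max_depth": 5},
]
is_dup = params_already_tried({"max_depth": 3, "n_estimators": 100}, past)
is_new = params_already_tried({"n_estimators": 100, "max_depth": 4}, past)
```

Dictionary equality ignores key order, which is why the first call reports a duplicate. Float-valued parameters are compared exactly here; whether a tolerance is appropriate depends on the search space.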

generate_results_on_test(self, outerfold_info)

Generate the results on the testing data

Parameters: outerfold_info (dict) – dictionary with outerfold datasplit indices
Returns: evaluation metrics dictionary
Return type: dict

get_feature_importance(self, model, X, y, top_n=1000, include_perm_importance=False)

Get feature importances for models that possess such a feature, e.g. XGBoost

Parameters

model (easypheno.model._base_model.BaseModel) – model to analyze
X (numpy.array) – feature matrix for permutation
y (numpy.array) – target vector for permutation
top_n (int) – top n features to select
include_perm_importance (bool) – include permutation based feature importance or not

Returns

DataFrame with feature importance information

Return type

pandas.DataFrame
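The role of `top_n` can be illustrated with a small, dependency-free ranking. This sketch assumes that features are ranked by the absolute value of their importance, which may differ from what the method does; the function name `top_n_features` and the feature names are made up.

```python
def top_n_features(feature_names, importances, top_n=1000):
    """Return the top_n (name, importance) pairs, ranked by absolute importance."""
    ranked = sorted(
        zip(feature_names, importances),
        key=lambda pair: abs(pair[1]),
        reverse=True,
    )
    return ranked[:top_n]


# Toy usage with three hypothetical SNP features
selected = top_n_features(
    ["snp_a", "snp_b", "snp_c"],
    [0.1, -0.7, 0.3],
    top_n=2,
)
```

The actual method returns a `pandas.DataFrame`; the list of tuples here only stands in for its rows.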

run_optuna_optimization(self)

Run whole optuna optimization for one model, dataset and datasplit.

Returns: dictionary with results overview
Return type: dict
