easypheno.preprocess.base_dataset
Module Contents
Classes
Dataset: Class containing dataset ready for optimization (e.g. geno/phenotype matched).
- class easypheno.preprocess.base_dataset.Dataset(data_dir, genotype_matrix_name, phenotype_matrix_name, phenotype, datasplit, n_outerfolds, n_innerfolds, test_set_size_percentage, val_set_size_percentage, encoding, maf_percentage, do_snp_filters=True)
Class containing dataset ready for optimization (e.g. geno/phenotype matched).
Attributes
encoding (str): the encoding to use (standard encoding or user-defined)
X_full (numpy.array): all SNPs after matching, MAF filtering and duplicate filtering
y_full (numpy.array): all target values
sample_ids_full (numpy.array): all sample ids
snp_ids (numpy.array): SNP ids
datasplit (str): datasplit to use
datasplit_indices (dict): dictionary containing all indices for the specified datasplit
- Parameters
data_dir (pathlib.Path) – data directory where the phenotype and genotype matrix are stored
genotype_matrix_name (str) – name of the genotype matrix including datatype ending
phenotype_matrix_name (str) – name of the phenotype matrix including datatype ending
phenotype (str) – name of the phenotype to predict
datasplit (str) – datasplit to use. Options are: nested-cv, cv-test, train-val-test
n_outerfolds (int) – number of outer folds (relevant for nested-cv)
n_innerfolds (int) – number of inner folds (relevant for nested-cv and cv-test)
test_set_size_percentage (int) – size of the test set as a percentage (relevant for cv-test and train-val-test)
val_set_size_percentage (int) – size of the validation set as a percentage (relevant for train-val-test)
maf_percentage (int) – threshold for MAF filter as percentage value
encoding (str) – the encoding to use (standard encoding or user-defined)
do_snp_filters (bool) – specify if SNP filters (e.g. duplicates, maf etc.) should be applied
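The parameter descriptions above state which split-size arguments matter for each datasplit option. A minimal sketch of that mapping follows; the helper `relevant_params` is hypothetical and not part of easypheno.

```python
# Which split-size parameters are relevant for each datasplit option,
# taken from the parameter descriptions above.
RELEVANT_PARAMS = {
    "nested-cv": ("n_outerfolds", "n_innerfolds"),
    "cv-test": ("n_innerfolds", "test_set_size_percentage"),
    "train-val-test": ("test_set_size_percentage", "val_set_size_percentage"),
}


def relevant_params(datasplit: str) -> tuple:
    """Hypothetical helper: return the parameter names used by a datasplit."""
    if datasplit not in RELEVANT_PARAMS:
        raise ValueError(
            f"Unknown datasplit {datasplit!r}; options are {sorted(RELEVANT_PARAMS)}"
        )
    return RELEVANT_PARAMS[datasplit]
```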
- load_match_raw_data(self, data_dir, genotype_matrix_name, do_snp_filters=True)
Load the full genotype and phenotype matrices specified and match them
- Parameters
data_dir (pathlib.Path) – data directory where the phenotype and genotype matrix are stored
genotype_matrix_name (str) – name of the genotype matrix including datatype ending
do_snp_filters (bool) – specify if SNP filters (e.g. duplicates, maf etc.) should be applied
- Returns
matched genotype, phenotype and sample ids
- Return type
(numpy.ndarray, numpy.ndarray, numpy.ndarray)
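To illustrate what matching genotype and phenotype data can look like, here is a minimal NumPy sketch. It assumes each matrix comes with an array of unique sample ids; the function `match_by_sample_id` and all variable names are hypothetical, and the actual implementation may differ.

```python
import numpy as np


def match_by_sample_id(X, geno_ids, y, pheno_ids):
    """Keep only samples present in both datasets, in the same order.

    Assumption: sample ids are unique within each array.
    """
    # intersect1d returns the sorted common ids plus their positions in each input
    common, geno_idx, pheno_idx = np.intersect1d(
        geno_ids, pheno_ids, return_indices=True
    )
    return X[geno_idx], y[pheno_idx], common


# Toy data: three genotyped samples, three phenotyped samples, two in common
X = np.array([[0, 1], [2, 2], [1, 0]])
geno_ids = np.array(["s3", "s1", "s2"])
y = np.array([5.0, 7.0, 9.0])
pheno_ids = np.array(["s2", "s4", "s3"])

X_matched, y_matched, ids_matched = match_by_sample_id(X, geno_ids, y, pheno_ids)
```

After matching, row `i` of `X_matched` and entry `i` of `y_matched` belong to the sample `ids_matched[i]`.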
- maf_filter_raw_data(self, data_dir, maf_percentage)
Apply the MAF filter to the full raw data. If maf_percentage is 0, only non-informative SNPs are removed.
- Parameters
data_dir (pathlib.Path) – data directory where the phenotype and genotype matrix are stored
maf_percentage (int) – threshold for MAF filter as percentage value
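A minor allele frequency (MAF) filter can be sketched as follows. This is an illustration only: it assumes an additive 0/1/2 encoding for diploid samples, treats a SNP as non-informative when it has the same value for every sample, and keeps SNPs whose MAF is at or above the threshold. The function name `maf_keep_mask` and the boundary handling are assumptions, not the library's implementation.

```python
import numpy as np


def maf_keep_mask(X, maf_percentage):
    """Return a boolean mask over SNP columns of X (samples x SNPs).

    Assumptions: additive encoding (0, 1, 2 copies of the counted allele),
    diploid samples, no missing values.
    """
    n_samples = X.shape[0]
    # Frequency of the counted allele, then fold to the minor allele
    freq = X.sum(axis=0) / (2 * n_samples)
    maf = np.minimum(freq, 1 - freq)
    # Constant columns carry no information, regardless of the threshold
    informative = X.min(axis=0) != X.max(axis=0)
    return informative & (maf >= maf_percentage / 100)


# Toy matrix: columns 0, 2 and 3 are constant; column 1 has MAF = 1/8
X = np.array([
    [0, 0, 2, 1],
    [0, 1, 2, 1],
    [0, 0, 2, 1],
    [0, 0, 2, 1],
])
mask_maf0 = maf_keep_mask(X, 0)
mask_maf20 = maf_keep_mask(X, 20)
```

With a threshold of 0 only the constant columns are dropped, which mirrors the behaviour described above for maf=0.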
- filter_duplicate_snps(self)
Remove duplicate SNPs, i.e. SNPs that are completely the same for all samples and therefore do not add information.
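One way to drop duplicate SNP columns is `numpy.unique` along the SNP axis. The sketch below is not taken from the library; the function `drop_duplicate_snps` is hypothetical and keeps the first occurrence of each distinct column in its original order.

```python
import numpy as np


def drop_duplicate_snps(X, snp_ids):
    """Remove SNP columns that are identical to an earlier column."""
    # return_index gives the first occurrence of each unique column
    _, first_idx = np.unique(X, axis=1, return_index=True)
    keep = np.sort(first_idx)  # restore the original column order
    return X[:, keep], snp_ids[keep]


# Toy matrix: column 2 duplicates column 0
X = np.array([
    [0, 1, 0, 2],
    [1, 1, 1, 0],
    [2, 0, 2, 0],
])
snp_ids = np.array(["rs1", "rs2", "rs3", "rs4"])
X_unique, ids_unique = drop_duplicate_snps(X, snp_ids)
```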
- check_and_save_filtered_snp_ids(self, data_dir, maf_percentage)
Check whether the SNP ids for the given MAF percentage and encoding are already saved in the index file. If not, save them under ‘matched_data/final_snp_ids/{encoding}/maf_{maf_percentage}_snp_ids’.
- Parameters
data_dir (pathlib.Path) – data directory where the phenotype and genotype matrix are stored
maf_percentage (int) – threshold for MAF filter as percentage value
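The storage location named above is a template filled in with the encoding and MAF percentage. A small sketch of how that template expands is shown below; the helper `snp_ids_key` and the example values are illustrative assumptions.

```python
def snp_ids_key(encoding: str, maf_percentage: int) -> str:
    """Hypothetical helper: expand the storage template from the description."""
    return f"matched_data/final_snp_ids/{encoding}/maf_{maf_percentage}_snp_ids"


# Example values chosen for illustration only
key = snp_ids_key("012", 5)
```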
- load_datasplit_indices(self, data_dir, n_outerfolds, n_innerfolds, test_set_size_percentage, val_set_size_percentage)
Load the datasplit indices saved during file unification.
Structure:
    {
        'outerfold_0': {
            'innerfold_0': {'train': indices_train, 'val': indices_val},
            'innerfold_1': {'train': indices_train, 'val': indices_val},
            ...
            'innerfold_n': {'train': indices_train, 'val': indices_val},
            'test': test_indices
        },
        ...
        'outerfold_m': {
            'innerfold_0': {'train': indices_train, 'val': indices_val},
            'innerfold_1': {'train': indices_train, 'val': indices_val},
            ...
            'innerfold_n': {'train': indices_train, 'val': indices_val},
            'test': test_indices
        }
    }
Caution: The actual structure depends on the datasplit specified by the user, e.g. for a train-val-test split only ‘outerfold_0’ and its subelements ‘innerfold_0’ and ‘test’ exist.
- Parameters
data_dir (pathlib.Path) – data directory where the phenotype and genotype matrix are stored
n_outerfolds (int) – number of outer folds (relevant for nested-cv)
n_innerfolds (int) – number of inner folds (relevant for nested-cv and cv-test)
test_set_size_percentage (int) – size of the test set as a percentage (relevant for cv-test and train-val-test)
val_set_size_percentage (int) – size of the validation set as a percentage (relevant for train-val-test)
- Returns
dictionary with the above-described structure containing all indices for the specified data split
- Return type
dict
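A dictionary with the structure described for load_datasplit_indices can be walked with two nested loops. The sketch below builds a toy dictionary by hand rather than loading one; the generator `iter_folds` is a hypothetical helper, not part of the class.

```python
import numpy as np


def iter_folds(datasplit_indices):
    """Yield (outer_name, inner_name, train, val, test) for every inner fold."""
    for outer_name, outer in datasplit_indices.items():
        test = outer["test"]
        for inner_name, inner in outer.items():
            if inner_name == "test":
                continue  # 'test' sits next to the inner folds, skip it here
            yield outer_name, inner_name, inner["train"], inner["val"], test


# Toy indices mimicking one outer fold with two inner folds
toy_indices = {
    "outerfold_0": {
        "innerfold_0": {"train": np.array([0, 1, 2]), "val": np.array([3])},
        "innerfold_1": {"train": np.array([0, 1, 3]), "val": np.array([2])},
        "test": np.array([4, 5]),
    }
}
folds = list(iter_folds(toy_indices))
```

Because the loop only relies on the key names, it also works for a train-val-test split, where only 'outerfold_0' with 'innerfold_0' and 'test' exist.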
- check_datasplit(self, n_outerfolds, n_innerfolds)
Check if the datasplit is valid. Raises an exception if the train, val or test sets share any samples.
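The validity check described here amounts to testing that the index sets are pairwise disjoint. A minimal sketch under that assumption is given below; `check_disjoint` is a hypothetical function and the exception type is a choice made for the example.

```python
import numpy as np


def check_disjoint(train, val, test):
    """Raise ValueError if any two of the index sets share a sample."""
    pairs = {
        "train/val": (train, val),
        "train/test": (train, test),
        "val/test": (val, test),
    }
    for name, (a, b) in pairs.items():
        overlap = np.intersect1d(a, b)
        if overlap.size > 0:
            raise ValueError(f"{name} sets share samples: {overlap.tolist()}")


# Passes silently: no index appears in more than one set
check_disjoint(np.array([0, 1, 2]), np.array([3]), np.array([4, 5]))
```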
- static get_index_file_name(genotype_matrix_name, phenotype_matrix_name, phenotype)
Get the name of the file containing the indices for MAF filters and data splits.