easypheno.preprocess.base_dataset

Module Contents

Classes

Dataset

Class containing a dataset that is ready for optimization (e.g. genotype and phenotype matched).

class easypheno.preprocess.base_dataset.Dataset(data_dir, genotype_matrix_name, phenotype_matrix_name, phenotype, datasplit, n_outerfolds, n_innerfolds, test_set_size_percentage, val_set_size_percentage, encoding, maf_percentage, do_snp_filters=True)

Class containing a dataset that is ready for optimization (e.g. genotype and phenotype matched).

Attributes

  • encoding (str): the encoding to use (standard encoding or user-defined)

  • X_full (numpy.ndarray): all (matched, MAF- and duplicate-filtered) SNPs

  • y_full (numpy.ndarray): all target values

  • sample_ids_full (numpy.ndarray): all sample ids

  • snp_ids (numpy.ndarray): SNP ids

  • datasplit (str): datasplit to use

  • datasplit_indices (dict): dictionary containing all indices for the specified datasplit

Parameters
  • data_dir (pathlib.Path) – data directory where the phenotype and genotype matrix are stored

  • genotype_matrix_name (str) – name of the genotype matrix including datatype ending

  • phenotype_matrix_name (str) – name of the phenotype matrix including datatype ending

  • phenotype (str) – name of the phenotype to predict

  • datasplit (str) – datasplit to use. Options are: nested-cv, cv-test, train-val-test

  • n_outerfolds (int) – number of outer folds (relevant for nested-cv)

  • n_innerfolds (int) – number of inner folds (relevant for nested-cv and cv-test)

  • test_set_size_percentage (int) – size of the test set in percent (relevant for cv-test and train-val-test)

  • val_set_size_percentage (int) – size of the validation set in percent (relevant for train-val-test)

  • encoding (str) – the encoding to use (standard encoding or user-defined)

  • maf_percentage (int) – threshold for MAF filter as percentage value

  • do_snp_filters (bool) – specify whether SNP filters (e.g. duplicates, MAF, etc.) should be applied

load_match_raw_data(self, data_dir, genotype_matrix_name, do_snp_filters=True)

Load the specified full genotype and phenotype matrices and match them.

Parameters
  • data_dir (pathlib.Path) – data directory where the phenotype and genotype matrix are stored

  • genotype_matrix_name (str) – name of the genotype matrix including datatype ending

  • do_snp_filters (bool) – specify whether SNP filters (e.g. duplicates, MAF, etc.) should be applied

Returns

matched genotype, phenotype and sample ids

Return type

(numpy.ndarray, numpy.ndarray, numpy.ndarray)
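The matching step can be illustrated with a minimal NumPy sketch. It is not the library's implementation: the helper name match_by_sample_id and the toy arrays are hypothetical, and it assumes that sample ids are unique within each matrix and that samples are stored in rows.

```python
import numpy as np


def match_by_sample_id(X, geno_ids, y, pheno_ids):
    """Hypothetical helper: keep only samples present in both inputs."""
    # intersect1d returns the sorted common ids plus their positions
    # in each of the two input arrays.
    common_ids, geno_idx, pheno_idx = np.intersect1d(
        geno_ids, pheno_ids, return_indices=True
    )
    return X[geno_idx], y[pheno_idx], common_ids


# Toy data: 3 genotyped samples, 3 phenotyped samples, 2 in common
X = np.array([[0, 1], [2, 0], [1, 1]])
geno_ids = np.array(["s1", "s2", "s3"])
y = np.array([3.5, 1.2, 4.8])
pheno_ids = np.array(["s3", "s1", "s4"])

X_matched, y_matched, ids_matched = match_by_sample_id(X, geno_ids, y, pheno_ids)
```

After matching, row i of X_matched and entry i of y_matched belong to the same sample, ids_matched[i]. Handling of missing phenotype values is left out of this sketch.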

maf_filter_raw_data(self, data_dir, maf_percentage)

Apply the MAF filter to the full raw data. If maf_percentage is 0, only non-informative SNPs are removed.

Parameters
  • data_dir (pathlib.Path) – data directory where the phenotype and genotype matrix are stored

  • maf_percentage (int) – threshold for MAF filter as percentage value
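As a rough illustration of a minor allele frequency (MAF) filter, the sketch below assumes an additive 0/1/2 encoding with samples in rows. The function name maf_filter and the exact comparison (keep SNPs with MAF at or above the threshold) are assumptions, not taken from the library.

```python
import numpy as np


def maf_filter(X, maf_percentage):
    """Hypothetical sketch of a MAF filter on 0/1/2 allele counts."""
    # Frequency of the counted allele per SNP (each sample carries 2 alleles)
    allele_freq = X.mean(axis=0) / 2
    # The minor allele is whichever of the two alleles is rarer
    maf = np.minimum(allele_freq, 1 - allele_freq)
    # Always drop monomorphic SNPs (MAF == 0); additionally apply the threshold
    keep = (maf > 0) & (maf >= maf_percentage / 100)
    return X[:, keep], keep


X = np.array([
    [0, 2, 0, 1],
    [0, 2, 0, 0],
    [0, 2, 1, 1],
    [0, 2, 0, 2],
])

X_0, keep_0 = maf_filter(X, 0)     # removes only the two constant columns
X_20, keep_20 = maf_filter(X, 20)  # also removes the SNP with MAF 0.125
```

With a threshold of 0, only the SNPs that do not vary across samples are dropped, which mirrors the behaviour described for maf=0.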

filter_duplicate_snps(self)

Remove duplicate SNPs, i.e. SNPs whose values are identical to those of another SNP across all samples and that therefore add no information.
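One possible way to drop duplicate SNP columns with NumPy is sketched below. The helper name drop_duplicate_snps is hypothetical, and keeping the first occurrence of each duplicated column is an assumption about the desired behaviour.

```python
import numpy as np


def drop_duplicate_snps(X, snp_ids):
    """Hypothetical helper: keep the first occurrence of each unique column."""
    # return_index gives the position of the first occurrence of every
    # unique column; np.unique sorts the columns, so restore original order.
    _, first_idx = np.unique(X, axis=1, return_index=True)
    keep = np.sort(first_idx)
    return X[:, keep], snp_ids[keep]


X = np.array([
    [0, 1, 0, 2],
    [1, 1, 1, 0],
    [2, 0, 2, 1],
])
snp_ids = np.array(["snp_a", "snp_b", "snp_c", "snp_d"])

# snp_c has the same values as snp_a for every sample and is removed
X_filtered, ids_filtered = drop_duplicate_snps(X, snp_ids)
```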

check_and_save_filtered_snp_ids(self, data_dir, maf_percentage)

Check whether the snp_ids for the given MAF percentage and encoding are already saved in the index file. If not, save them under ‘matched_data/final_snp_ids/{encoding}/maf_{maf_percentage}_snp_ids’.

Parameters
  • data_dir (pathlib.Path) – data directory where the phenotype and genotype matrix are stored

  • maf_percentage (int) – threshold for MAF filter as percentage value

load_datasplit_indices(self, data_dir, n_outerfolds, n_innerfolds, test_set_size_percentage, val_set_size_percentage)

Load the datasplit indices saved during file unification.

Structure:

{
    'outerfold_0': {
        'innerfold_0': {'train': indices_train, 'val': indices_val},
        'innerfold_1': {'train': indices_train, 'val': indices_val},
        ...
        'innerfold_n': {'train': indices_train, 'val': indices_val},
        'test': test_indices
        },
    ...
    'outerfold_m': {
        'innerfold_0': {'train': indices_train, 'val': indices_val},
        'innerfold_1': {'train': indices_train, 'val': indices_val},
        ...
        'innerfold_n': {'train': indices_train, 'val': indices_val},
        'test': test_indices
        }
}

Caution: The actual structure depends on the datasplit specified by the user. For example, for a train-val-test split, only ‘outerfold_0’ and its subelements ‘innerfold_0’ and ‘test’ exist.

Parameters
  • data_dir (pathlib.Path) – data directory where the phenotype and genotype matrix are stored

  • n_outerfolds (int) – number of outer folds (relevant for nested-cv)

  • n_innerfolds (int) – number of inner folds (relevant for nested-cv and cv-test)

  • test_set_size_percentage (int) – size of the test set in percent (relevant for cv-test and train-val-test)

  • val_set_size_percentage (int) – size of the validation set in percent (relevant for train-val-test)

Returns

dictionary with the above-described structure containing all indices for the specified data split

Return type

dict
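To show how a dictionary with the structure described for load_datasplit_indices might be consumed, here is a small sketch. The toy dictionary, the array X_full, and the helper iter_splits are illustrative assumptions; only the key names follow the documented structure.

```python
import numpy as np

# Toy indices for a train-val-test split over 10 samples (values are made up)
datasplit_indices = {
    "outerfold_0": {
        "innerfold_0": {
            "train": np.array([0, 1, 2, 3, 4, 5]),
            "val": np.array([6, 7]),
        },
        "test": np.array([8, 9]),
    }
}


def iter_splits(indices):
    """Hypothetical helper: yield one tuple per (outer fold, inner fold)."""
    for outer_name, outer in indices.items():
        test = outer["test"]
        for inner_name, inner in outer.items():
            if inner_name == "test":
                continue  # 'test' sits next to the inner folds, skip it here
            yield outer_name, inner_name, inner["train"], inner["val"], test


X_full = np.arange(20).reshape(10, 2)
for outer_name, inner_name, train, val, test in iter_splits(datasplit_indices):
    X_train, X_val, X_test = X_full[train], X_full[val], X_full[test]
```

The same loop works unchanged for nested-cv or cv-test dictionaries, since it only relies on the key layout and not on the number of folds.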

check_datasplit(self, n_outerfolds, n_innerfolds)

Check whether the datasplit is valid. Raises an exception if the train, val, or test sets share any samples.

Parameters
  • n_outerfolds (int) – number of outer folds in the datasplit_indices dictionary

  • n_innerfolds (int) – number of inner folds in the datasplit_indices dictionary
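A disjointness check of this kind could look roughly like the following sketch. The function name check_disjoint and the use of ValueError are assumptions; the documentation only states that an exception is raised when sets overlap.

```python
import numpy as np


def check_disjoint(datasplit_indices, n_outerfolds, n_innerfolds):
    """Hypothetical check: train, val and test must not share samples."""
    for m in range(n_outerfolds):
        outer = datasplit_indices[f"outerfold_{m}"]
        test = outer["test"]
        for n in range(n_innerfolds):
            inner = outer[f"innerfold_{n}"]
            train, val = inner["train"], inner["val"]
            pairs = {
                "train/val": (train, val),
                "train/test": (train, test),
                "val/test": (val, test),
            }
            for name, (a, b) in pairs.items():
                # A non-empty intersection means a sample is in both sets
                if np.intersect1d(a, b).size > 0:
                    raise ValueError(
                        f"outerfold_{m}, innerfold_{n}: {name} sets overlap"
                    )


valid = {
    "outerfold_0": {
        "innerfold_0": {"train": np.array([0, 1, 2]), "val": np.array([3])},
        "test": np.array([4, 5]),
    }
}
check_disjoint(valid, n_outerfolds=1, n_innerfolds=1)  # passes silently
```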

static get_index_file_name(genotype_matrix_name, phenotype_matrix_name, phenotype)

Get the name of the file containing the indices for MAF filters and data splits.

Parameters
  • genotype_matrix_name (str) – name of the genotype matrix including datatype ending

  • phenotype_matrix_name (str) – name of the phenotype matrix including datatype ending

  • phenotype (str) – name of the phenotype to predict

Returns

name of index file

Return type

str