easypheno.preprocess.raw_data_functions

Module Contents

Functions

prepare_data_files(data_dir, genotype_matrix_name, phenotype_matrix_name, phenotype, datasplit, n_outerfolds, n_innerfolds, test_set_size_percentage, val_set_size_percentage, models, user_encoding, maf_percentage)

Prepare all data files for a common format: genotype matrix, phenotype matrix and index file.

check_genotype_h5_file(data_dir, genotype_matrix_name, encodings)

Check .h5 genotype file. Should contain:

check_index_file(data_dir, genotype_matrix_name, phenotype_matrix_name, phenotype)

Check if index file is available and if the datasets 'y', 'matched_sample_ids', 'X_index', 'y_index' and 'ma_frequency' exist.

save_all_data_files(data_dir, genotype_matrix_name, phenotype_matrix_name, phenotype, models, user_encoding, maf_percentage, datasplit, n_outerfolds, n_innerfolds, test_set_size_percentage, val_set_size_percentage)

Prepare and save all required data files:

check_transform_format_genotype_matrix(data_dir, genotype_matrix_name, models, user_encoding, save_h5 = True)

Check the format of the specified genotype matrix.

check_genotype_csv_file(data_dir, genotype_matrix_name, encodings)

Load .csv genotype file. File must have the following structure:

check_genotype_binary_plink_file(data_dir, genotype_matrix_name)

Load binary PLINK files. The .bim, .fam and .bed files with the same prefix need to be in the same folder.

check_genotype_plink_file(data_dir, genotype_matrix_name)

Load PLINK files. The .map and .ped files with the same prefix need to be in the same folder.

check_duplicate_samples(sample_ids)

Check if the genotype matrix contains duplicate samples.

check_genotype_shape(X, sample_ids, snp_ids)

Check if number of samples in sample_ids and genotype matrix match and if number of markers in snp_ids and genotype matrix match.

create_genotype_h5_file(data_dir, genotype_matrix_name, sample_ids, snp_ids, X)

Save genotype matrix in unified .h5 file.

check_and_load_phenotype_matrix(data_dir, phenotype_matrix_name, phenotype)

Check and load the specified phenotype matrix. Only accept .csv, .pheno, .txt files.

genotype_phenotype_matching(X, X_ids, y)

Match the handed over genotype and phenotype matrix for the phenotype specified by the user, i.e. compare sample ids

get_matched_data(data, index)

Get elements of data specified in index array

append_index_file(data_dir, genotype_matrix_name, phenotype_matrix_name, phenotype, datasplit, n_outerfolds, n_innerfolds, test_set_size_percentage, val_set_size_percentage, maf_percentage)

Check index file, described in create_index_file(), and append datasets if necessary

create_index_file(data_dir, genotype_matrix_name, phenotype_matrix_name, phenotype, datasplit, n_outerfolds, n_innerfolds, test_set_size_percentage, val_set_size_percentage, maf_percentage, X, y, sample_ids, X_index, y_index)

Create the .h5 index file containing the maf filters and data splits for the combination of genotype matrix, phenotype matrix and phenotype.

filter_non_informative_snps(X)

Remove non-informative SNPs, i.e. SNPs that are constant

get_minor_allele_freq(X)

Compute minor allele frequencies of genotype matrix

create_maf_filter(maf, freq)

Create minor allele frequency filter

check_datasplit_user_input(user_datasplit, user_n_outerfolds, user_n_innerfolds, user_test_set_size_percentage, user_val_set_size_percentage, datasplit, param_to_check)

Check if user input of data split parameters differs from standard values.

check_train_test_splits(y, datasplit, datasplit_params)

Create stratified train-test splits. Continuous values will be grouped into bins and stratified according to those.

make_bins(y, datasplit, datasplit_params)

Create bins of continuous values for stratification.

make_nested_cv(y, outerfolds, innerfolds)

Create index dictionary for stratified nested cross validation with the following structure:

make_stratified_cv(x, y, split_number)

Create index dictionary for stratified cross-validation with following structure:

make_train_test_split(y, test_size, val_size=None, val=False, random=42)

Create index arrays for stratified train-test or train-val-test splits.

easypheno.preprocess.raw_data_functions.prepare_data_files(data_dir, genotype_matrix_name, phenotype_matrix_name, phenotype, datasplit, n_outerfolds, n_innerfolds, test_set_size_percentage, val_set_size_percentage, models, user_encoding, maf_percentage)

Prepare all data files for a common format: genotype matrix, phenotype matrix and index file.

First check if genotype file is .h5 file (standard format of this framework):

  • YES: First check if all required information is present in the file, raise Exception if not. Then check if index file exists:

    • NO: Load genotype and create all required index files

    • YES: Append all required data splits and maf-filters to index file

  • NO: Load genotype and create all required files

Parameters
  • data_dir (pathlib.Path) – data directory where the phenotype and genotype matrix are stored

  • genotype_matrix_name (str) – name of the genotype matrix including datatype ending

  • phenotype_matrix_name (str) – name of the phenotype matrix including datatype ending

  • phenotype (str) – name of the phenotype to predict

  • datasplit (str) – datasplit to use. Options are: nested-cv, cv-test, train-val-test

  • n_outerfolds (int) – number of outerfolds relevant for nested-cv

  • n_innerfolds (int) – number of folds relevant for nested-cv and cv-test

  • test_set_size_percentage (int) – size of the test set relevant for cv-test and train-val-test

  • val_set_size_percentage (int) – size of the validation set relevant for train-val-test

  • models – models to consider

  • user_encoding (str) – encoding specified by the user

  • maf_percentage (int) – threshold for MAF filter as percentage value
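
The decision flow described for prepare_data_files can be sketched as follows. This is only an illustration, not the package's implementation: `plan_preparation` and its two boolean arguments are hypothetical stand-ins for the actual file checks, and recognising the .h5 format by file ending is an assumption.

```python
from pathlib import Path


def plan_preparation(genotype_matrix_name: str, h5_is_valid: bool, index_file_exists: bool) -> str:
    """Return the step the documented decision flow would take (hypothetical helper)."""
    # Assumption: the genotype format is recognised by its file ending.
    is_h5 = Path(genotype_matrix_name).suffix in {".h5", ".hdf5", ".h5py"}
    if not is_h5:
        return "load genotype and create all required files"
    if not h5_is_valid:
        # The description states that an Exception is raised in this case.
        raise Exception("genotype .h5 file lacks required information")
    if index_file_exists:
        return "append required data splits and maf filters to index file"
    return "load genotype and create all required index files"


print(plan_preparation("genotypes.csv", h5_is_valid=False, index_file_exists=False))
print(plan_preparation("genotypes.h5", h5_is_valid=True, index_file_exists=True))
```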

easypheno.preprocess.raw_data_functions.check_genotype_h5_file(data_dir, genotype_matrix_name, encodings)

Check .h5 genotype file. Should contain:

  • sample_ids: vector with sample names of genotype matrix,

  • snp_ids: vector with SNP identifiers of genotype matrix,

  • X_{enc}: (samples x SNPs)-genotype matrix in enc encoding, where enc might refer to:

    • ‘012’: additive (number of minor alleles)

    • ‘raw’: raw (alleles)

Parameters
  • data_dir (pathlib.Path) – data directory where the phenotype and genotype matrix are stored

  • genotype_matrix_name (str) – name of the genotype matrix including datatype ending

  • encodings (list) – list of needed encodings
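
As a rough illustration of the check described above, the following sketch lists which required entries are missing. `missing_h5_entries` is a hypothetical helper, and a plain set of names stands in for the contents of a real .h5 file; how the package reacts to missing entries is not shown here.

```python
def missing_h5_entries(available_keys, encodings):
    """List required dataset names absent from an .h5 file (hypothetical helper)."""
    required = ["sample_ids", "snp_ids"] + [f"X_{enc}" for enc in encodings]
    return [key for key in required if key not in available_keys]


# Stand-in for the dataset names found in a genotype file.
keys_in_file = {"sample_ids", "snp_ids", "X_012"}
print(missing_h5_entries(keys_in_file, ["012"]))         # nothing missing
print(missing_h5_entries(keys_in_file, ["012", "raw"]))  # 'X_raw' is missing
```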

easypheno.preprocess.raw_data_functions.check_index_file(data_dir, genotype_matrix_name, phenotype_matrix_name, phenotype)

Check if index file is available and if the datasets ‘y’, ‘matched_sample_ids’, ‘X_index’, ‘y_index’ and ‘ma_frequency’ exist.

Parameters
  • data_dir (pathlib.Path) – data directory where the phenotype and genotype matrix are stored

  • genotype_matrix_name (str) – name of the genotype matrix including datatype ending

  • phenotype_matrix_name (str) – name of the phenotype matrix including datatype ending

  • phenotype (str) – name of the phenotype to predict

Returns

bool reflecting check result

Return type

bool

easypheno.preprocess.raw_data_functions.save_all_data_files(data_dir, genotype_matrix_name, phenotype_matrix_name, phenotype, models, user_encoding, maf_percentage, datasplit, n_outerfolds, n_innerfolds, test_set_size_percentage, val_set_size_percentage)

Prepare and save all required data files:

  • genotype matrix in unified format as .h5 file,

  • phenotype matrix in unified format as .csv file,

  • file containing maf filter and data split indices as .h5

Parameters
  • data_dir (pathlib.Path) – data directory where the phenotype and genotype matrix are stored

  • genotype_matrix_name (str) – name of the genotype matrix including datatype ending

  • phenotype_matrix_name (str) – name of the phenotype matrix including datatype ending

  • phenotype (str) – name of the phenotype to predict

  • models – models to consider

  • user_encoding (str) – encoding specified by the user

  • maf_percentage (int) – threshold for MAF filter as percentage value

  • datasplit (str) – datasplit to use. Options are: nested-cv, cv-test, train-val-test

  • n_outerfolds (int) – number of outerfolds relevant for nested-cv

  • n_innerfolds (int) – number of folds relevant for nested-cv and cv-test

  • test_set_size_percentage (int) – size of the test set relevant for cv-test and train-val-test

  • val_set_size_percentage (int) – size of the validation set relevant for train-val-test

easypheno.preprocess.raw_data_functions.check_transform_format_genotype_matrix(data_dir, genotype_matrix_name, models, user_encoding, save_h5=True)

Check the format of the specified genotype matrix.

Unified genotype matrix will be saved in subdirectory data and named NAME_OF_GENOTYPE_MATRIX.h5

Unified format of the .h5 file of the genotype matrix required for the further processes:

  • mandatory:

    • sample_ids: vector with sample names of genotype matrix,

    • snp_ids: vector with SNP identifiers of genotype matrix,

    • X_{enc}: (samples x SNPs)-genotype matrix in enc encoding, where enc might refer to:

      • ‘012’: additive (number of minor alleles)

      • ‘raw’: raw (alleles)

  • optional: genotype in additional encodings

Accepts .h5, .hdf5, .h5py, .csv, PLINK binary and PLINK files. .h5, .hdf5, .h5py files must satisfy the unified format. If the genotype matrix contains constant SNPs, those will be removed and a new file will be saved. Will open .csv, PLINK and binary PLINK files and generate required .h5 format.

Parameters
  • data_dir (pathlib.Path) – data directory where the phenotype and genotype matrix are stored

  • genotype_matrix_name (str) – name of the genotype matrix including datatype ending

  • models – models to consider

  • user_encoding (str) – encoding specified by the user

  • save_h5 (bool) – save genotype in unified h5 format if True, default is True

Returns

genotype matrix (raw encoded if present, 012 encoded otherwise), sample ids and SNP ids

Return type

(numpy.array, numpy.array, numpy.array)

easypheno.preprocess.raw_data_functions.check_genotype_csv_file(data_dir, genotype_matrix_name, encodings)

Load .csv genotype file. File must have the following structure: the first column must contain the sample ids, and the column names should be the SNP ids. The values should be the genotype matrix either in additive encoding or in raw encoding. If the name of the first column is ‘MarkerID’, it is assumed that the rows contain the markers and the columns contain the samples, and the genotype matrix will be transposed. If the csv file contains the genotype in biallelic notation (i.e. ‘AA’, ‘AT’, …), this function generates a genotype matrix in IUPAC notation (i.e. ‘A’, ‘W’, …).

Parameters
  • data_dir (pathlib.Path) – data directory where the phenotype and genotype matrix are stored

  • genotype_matrix_name (str) – name of the genotype matrix including datatype ending

  • encodings (list) – list of needed encodings

Returns

sample ids, SNP ids and genotype in additive / raw encoding

Return type

(numpy.array, numpy.array, numpy.array)
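
The conversion from biallelic to IUPAC notation mentioned for check_genotype_csv_file could look roughly like this. The mapping uses the standard IUPAC nucleotide codes; `biallelic_to_iupac` is a hypothetical helper and not necessarily how the package performs the conversion.

```python
import numpy as np

# Standard IUPAC nucleotide codes for unordered allele pairs.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C", frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
}


def biallelic_to_iupac(X):
    """Map entries such as 'AA' or 'AT' to IUPAC codes (hypothetical helper)."""
    # frozenset ignores allele order, so 'AT' and 'TA' share one code.
    convert = np.vectorize(lambda pair: IUPAC[frozenset(pair)])
    return convert(X)


X_biallelic = np.array([["AA", "AT"], ["GG", "TA"]])
print(biallelic_to_iupac(X_biallelic))
```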

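easypheno.preprocess.raw_data_functions.check_genotype_binary_plink_file(data_dir, genotype_matrix_name)

Load binary PLINK files. The .bim, .fam and .bed files with the same prefix need to be in the same folder. Compute additive and raw encoding of genotype.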

Parameters
  • data_dir (pathlib.Path) – data directory where the phenotype and genotype matrix are stored

  • genotype_matrix_name (str) – name of the genotype matrix including datatype ending

Returns

sample ids, SNP ids and genotype in raw encoding

Return type

(numpy.array, numpy.array, numpy.array)

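easypheno.preprocess.raw_data_functions.check_genotype_plink_file(data_dir, genotype_matrix_name)

Load PLINK files. The .map and .ped files with the same prefix need to be in the same folder. Accepts GENOTYPENAME.ped and GENOTYPENAME.map as input.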

Parameters
  • data_dir (pathlib.Path) – data directory where the phenotype and genotype matrix are stored

  • genotype_matrix_name (str) – name of the genotype matrix including datatype ending

Returns

sample ids, SNP ids and genotype in raw encoding

Return type

(numpy.array, numpy.array, numpy.array)

easypheno.preprocess.raw_data_functions.check_duplicate_samples(sample_ids)

Check if the genotype matrix contains duplicate samples.

Parameters

sample_ids (numpy.array) – sample ids of genotype matrix

Returns

True if duplicates are present, False if not

Return type

bool
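
A duplicate check of this kind can be expressed in one line with NumPy. The sketch below uses a hypothetical helper name and only mirrors the documented return value (True if duplicates are present).

```python
import numpy as np


def has_duplicates(sample_ids):
    """Return True if any sample id occurs more than once (hypothetical helper)."""
    return len(np.unique(sample_ids)) != len(sample_ids)


print(has_duplicates(np.array(["s1", "s2", "s3"])))
print(has_duplicates(np.array(["s1", "s2", "s1"])))
```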

easypheno.preprocess.raw_data_functions.check_genotype_shape(X, sample_ids, snp_ids)

Check if number of samples in sample_ids and genotype matrix match and if number of markers in snp_ids and genotype matrix match.

Parameters
  • X (numpy.array) – genotype matrix

  • sample_ids (numpy.array) – vector containing sample ids of genotype

  • snp_ids (numpy.array) – vector containing SNP ids of genotype
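
The two shape conditions can be sketched as below. The helper name and the use of ValueError are assumptions; the documentation above does not state which exception type the package raises.

```python
import numpy as np


def assert_genotype_shape(X, sample_ids, snp_ids):
    """Raise if the genotype matrix does not match its id vectors (hypothetical helper)."""
    if X.shape[0] != len(sample_ids):
        raise ValueError("number of samples in sample_ids and X does not match")
    if X.shape[1] != len(snp_ids):
        raise ValueError("number of markers in snp_ids and X does not match")


# A (2 samples x 3 SNPs) matrix with matching id vectors passes silently.
assert_genotype_shape(np.zeros((2, 3)), ["a", "b"], ["snp1", "snp2", "snp3"])
```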

easypheno.preprocess.raw_data_functions.create_genotype_h5_file(data_dir, genotype_matrix_name, sample_ids, snp_ids, X)

Save genotype matrix in unified .h5 file.

Structure:

  • sample_ids

  • snp_ids

  • X_raw (or X_012 if X_raw not available)

Parameters
  • data_dir (pathlib.Path) – data directory where the phenotype and genotype matrix are stored

  • genotype_matrix_name (str) – name of the genotype matrix including datatype ending

  • sample_ids (numpy.array) – array containing sample ids of genotype data

  • snp_ids (numpy.array) – array containing snp ids of genotype data

  • X (numpy.array) – matrix containing genotype either in raw or in additive encoding

easypheno.preprocess.raw_data_functions.check_and_load_phenotype_matrix(data_dir, phenotype_matrix_name, phenotype)

Check and load the specified phenotype matrix. Only accept .csv, .pheno, .txt files. Sample ids need to be in first column, remaining columns should contain phenotypic values with phenotype name as column name

Parameters
  • data_dir (pathlib.Path) – data directory where the phenotype and genotype matrix are stored

  • phenotype_matrix_name (str) – name of the phenotype matrix including datatype ending

  • phenotype (str) – name of the phenotype to predict

Returns

DataFrame with sample_ids as index and phenotype values as single column without NaN values

Return type

pandas.DataFrame

easypheno.preprocess.raw_data_functions.genotype_phenotype_matching(X, X_ids, y)

Match the handed over genotype and phenotype matrix for the phenotype specified by the user, i.e. compare sample ids

Parameters
  • X (numpy.array) – genotype matrix in additive encoding

  • X_ids (numpy.array) – sample ids of genotype matrix

  • y (pandas.DataFrame) – pd.DataFrame containing sample ids of phenotype as index and phenotype values as single column

Returns

matched genotype matrix, matched sample ids, index arrays for genotype and phenotype to redo matching

Return type

tuple
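
Matching by sample id can be illustrated with `numpy.intersect1d`, which returns the common ids together with their positions in both inputs. This is a sketch under assumptions: `match_ids` is hypothetical, the phenotype is passed as plain arrays instead of a DataFrame, and sample ids are assumed to be unique.

```python
import numpy as np


def match_ids(X, X_ids, y_ids, y_values):
    """Keep samples present in both genotype and phenotype (hypothetical helper)."""
    # intersect1d returns the sorted common ids and their positions in each input.
    matched_ids, X_index, y_index = np.intersect1d(X_ids, y_ids, return_indices=True)
    return X[X_index], y_values[y_index], matched_ids, X_index, y_index


X = np.array([[0, 1], [2, 0], [1, 1]])
X_ids = np.array(["s3", "s1", "s2"])
y_ids = np.array(["s1", "s3", "s4"])
y_values = np.array([1.5, 2.5, 3.5])
X_m, y_m, ids, X_index, y_index = match_ids(X, X_ids, y_ids, y_values)
print(ids, X_index, y_index)
```

The index arrays can later be reused to redo the matching without comparing ids again, e.g. `X[X_index]`.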

easypheno.preprocess.raw_data_functions.get_matched_data(data, index)

Get elements of data specified in index array

Parameters
  • data (numpy.array) – matrix or array

  • index (numpy.array) – index array

Returns

data at selected indices

Return type

numpy.array

easypheno.preprocess.raw_data_functions.append_index_file(data_dir, genotype_matrix_name, phenotype_matrix_name, phenotype, datasplit, n_outerfolds, n_innerfolds, test_set_size_percentage, val_set_size_percentage, maf_percentage)

Check index file, described in create_index_file(), and append datasets if necessary

Parameters
  • data_dir (pathlib.Path) – data directory where the phenotype and genotype matrix are stored

  • genotype_matrix_name (str) – name of the genotype matrix including datatype ending

  • phenotype_matrix_name (str) – name of the phenotype matrix including datatype ending

  • phenotype (str) – name of the phenotype to predict

  • maf_percentage (int) – threshold for MAF filter as percentage value

  • datasplit (str) – datasplit to use. Options are: nested-cv, cv-test, train-val-test

  • n_outerfolds (int) – number of outerfolds relevant for nested-cv

  • n_innerfolds (int) – number of folds relevant for nested-cv and cv-test

  • test_set_size_percentage (int) – size of the test set relevant for cv-test and train-val-test

  • val_set_size_percentage (int) – size of the validation set relevant for train-val-test

easypheno.preprocess.raw_data_functions.create_index_file(data_dir, genotype_matrix_name, phenotype_matrix_name, phenotype, datasplit, n_outerfolds, n_innerfolds, test_set_size_percentage, val_set_size_percentage, maf_percentage, X, y, sample_ids, X_index, y_index)

Create the .h5 index file containing the maf filters and data splits for the combination of genotype matrix, phenotype matrix and phenotype. It will be created using the standard values in addition to the user inputs for the maf filters and data splits.

Unified format of .h5 file containing the maf filters and data splits:

'matched_data': {
        'y': matched phenotypic values,
        'matched_sample_ids': sample ids of matched genotype/phenotype,
        'X_index': indices of genotype matrix to redo matching,
        'y_index': indices of phenotype vector to redo matching,
        'ma_frequency': minor allele frequency of each SNP of genotype file to create new MAF filters
        }
'maf_filter': {
        'maf_{maf_percentage}': indices of SNPs to delete  # (with MAF < maf_percentage),
        ...
        }
'datasplits': {
        'nested_cv': {
                '#outerfolds-#innerfolds': {
                        'outerfold_0': {
                            'innerfold_0': {'train': indices_train, 'val': indices_val},
                            ...
                            'innerfold_n': {'train': indices_train, 'val': indices_val},
                            'test': test_indices
                            },
                        ...
                        'outerfold_m': {
                            'innerfold_0': {'train': indices_train, 'val': indices_val},
                            ...
                            'innerfold_n': {'train': indices_train, 'val': indices_val},
                            'test': test_indices
                            }
                        },
                ...
                }
        'cv-test': {
                '#folds-test_percentage': {
                        'outerfold_0': {
                            'innerfold_0': {'train': indices_train, 'val': indices_val},
                            ...
                            'innerfold_n': {'train': indices_train, 'val': indices_val},
                            'test': test_indices
                            }
                        },
                ...
                }
        'train-val-test': {
                'train_percentage-val_percentage-test_percentage': {
                        'outerfold_0': {
                            'innerfold_0': {'train': indices_train, 'val': indices_val},
                            'test': test_indices
                            }
                        },
                ...
                }
        }

Standard values for the maf filters and data splits:

  • maf thresholds: 1, 3, 5

  • folds (inner-/outerfolds for ‘nested-cv’ and folds for ‘cv-test’): 5

  • test percentage (for ‘cv-test’ and ‘train-val-test’): 20

  • val percentage (for ‘train-val-test’): 20

Parameters
  • data_dir (pathlib.Path) – data directory where the phenotype and genotype matrix are stored

  • genotype_matrix_name (str) – name of the genotype matrix including datatype ending

  • phenotype_matrix_name (str) – name of the phenotype matrix including datatype ending

  • phenotype (str) – name of the phenotype to predict

  • maf_percentage (int) – threshold for MAF filter as percentage value

  • datasplit (str) – datasplit to use. Options are: nested-cv, cv-test, train-val-test

  • n_outerfolds (int) – number of outerfolds relevant for nested-cv

  • n_innerfolds (int) – number of folds relevant for nested-cv and cv-test

  • test_set_size_percentage (int) – size of the test set relevant for cv-test and train-val-test

  • val_set_size_percentage (int) – size of the validation set relevant for train-val-test

  • X (numpy.array) – genotype in additive encoding to create ma-frequencies

  • y (numpy.array) – matched phenotype values

  • sample_ids (numpy.array) – matched sample ids of genotype/phenotype

  • X_index (numpy.array) – index file of genotype to redo matching

  • y_index (numpy.array) – index file of phenotype to redo matching
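
To illustrate how standard values and user inputs might combine into the group names of the layout above, consider the following sketch. `index_file_keys` is hypothetical, and the exact naming and de-duplication used by the package are assumptions; only the standard values (maf 1, 3, 5; 5 folds; 20 percent test) and the key patterns are taken from the description.

```python
def index_file_keys(maf_percentage, n_outerfolds, n_innerfolds, test_pct):
    """Build group names for standard plus user values (hypothetical helper)."""
    standard_maf, standard_folds, standard_test = [1, 3, 5], 5, 20
    # Sets avoid duplicate entries when the user input equals a standard value.
    maf_values = sorted(set(standard_maf + [maf_percentage]))
    nested = {(standard_folds, standard_folds), (n_outerfolds, n_innerfolds)}
    cv_test = {(standard_folds, standard_test), (n_innerfolds, test_pct)}
    return {
        "maf_filter": [f"maf_{m}" for m in maf_values],
        "nested_cv": [f"{o}-{i}" for o, i in sorted(nested)],
        "cv-test": [f"{f}-{t}" for f, t in sorted(cv_test)],
    }


print(index_file_keys(maf_percentage=10, n_outerfolds=3, n_innerfolds=5, test_pct=20))
```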

easypheno.preprocess.raw_data_functions.filter_non_informative_snps(X)

Remove non-informative SNPs, i.e. SNPs that are constant

Parameters

X (numpy.array) – genotype matrix in raw or additive encoding

Returns

filtered genotype matrix and filter-vector

Return type

(numpy.array, numpy.array)
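
Removing constant SNPs amounts to dropping every column whose entries are all equal. A minimal sketch, assuming a boolean mask as the filter vector (the package may return indices instead) and a hypothetical helper name:

```python
import numpy as np


def drop_constant_snps(X):
    """Remove columns whose value is identical for all samples (hypothetical helper)."""
    keep = ~np.all(X == X[0, :], axis=0)  # True for SNPs that vary across samples
    return X[:, keep], keep


X = np.array([[0, 2, 1], [0, 1, 1], [0, 0, 1]])
X_filtered, keep = drop_constant_snps(X)
print(X_filtered.shape, keep)
```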

easypheno.preprocess.raw_data_functions.get_minor_allele_freq(X)

Compute minor allele frequencies of genotype matrix

Parameters

X (numpy.array) – genotype matrix in additive encoding

Returns

array with frequencies

Return type

numpy.array
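
For a genotype matrix in additive (0/1/2) encoding, the allele frequency of each SNP is the column sum divided by twice the number of samples. The sketch below assumes diploid samples and folds the result to the rarer allele; `minor_allele_freq` is a hypothetical helper, not the package's code.

```python
import numpy as np


def minor_allele_freq(X):
    """Per-SNP minor allele frequency for a 0/1/2 matrix (hypothetical helper)."""
    # Each diploid sample carries two alleles, so divide the allele count by 2 * n_samples.
    freq = X.sum(axis=0) / (2 * X.shape[0])
    # Assumption: fold to the rarer allele in case the counted allele is the major one.
    return np.minimum(freq, 1 - freq)


X = np.array([[0, 2], [1, 2], [0, 1], [1, 2]])
print(minor_allele_freq(X))
```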

easypheno.preprocess.raw_data_functions.create_maf_filter(maf, freq)

Create minor allele frequency filter

Parameters
  • maf (int) – maf threshold as percentage value

  • freq (numpy.array) – array containing minor allele frequencies as decimal value

Returns

array containing indices of SNPs with MAF smaller than specified threshold, i.e. SNPs to delete

Return type

numpy.array
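
Since the threshold is given as a percentage and the frequencies as decimals, the filter needs one unit conversion. A sketch with a hypothetical helper name:

```python
import numpy as np


def maf_filter(maf, freq):
    """Indices of SNPs whose frequency is below maf percent (hypothetical helper)."""
    return np.where(freq < maf / 100)[0]


freq = np.array([0.01, 0.20, 0.30, 0.04])
to_delete = maf_filter(5, freq)
print(to_delete)
# The indices can then be used to drop the SNPs, e.g. np.delete(X, to_delete, axis=1).
```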

easypheno.preprocess.raw_data_functions.check_datasplit_user_input(user_datasplit, user_n_outerfolds, user_n_innerfolds, user_test_set_size_percentage, user_val_set_size_percentage, datasplit, param_to_check)

Check if user input of data split parameters differs from standard values. If it does, add input to list of parameters

Parameters
  • user_datasplit (str) – datasplit specified by the user

  • user_n_outerfolds (int) – number of outerfolds relevant for nested-cv specified by the user

  • user_n_innerfolds (int) – number of folds relevant for nested-cv and cv-test specified by the user

  • user_test_set_size_percentage (int) – size of the test set relevant for cv-test and train-val-test specified by the user

  • user_val_set_size_percentage (int) – size of the validation set relevant for train-val-test specified by the user

  • datasplit (str) – type of data split

  • param_to_check (list) – standard parameters to compare to

Returns

adapted list of parameters

Return type

list

easypheno.preprocess.raw_data_functions.check_train_test_splits(y, datasplit, datasplit_params)

Create stratified train-test splits. Continuous values will be grouped into bins and stratified according to those.

Datasplit parameters:

  • nested-cv: [n_outerfolds, n_innerfolds]

  • cv-test: [n_innerfolds, test_set_size_percentage]

  • train-val-test: [val_set_size_percentage, train_set_size_percentage]

Parameters
  • datasplit (str) – type of datasplit (‘nested-cv’, ‘cv-test’, ‘train-val-test’)

  • y (numpy.array) – array with phenotypic values for stratification

  • datasplit_params (list) – parameters to use for split

Returns

dictionary or arrays with indices, depending on the datasplit

easypheno.preprocess.raw_data_functions.make_bins(y, datasplit, datasplit_params)

Create bins of continuous values for stratification.

Datasplit parameters:

  • nested-cv: [n_outerfolds, n_innerfolds]

  • cv-test: [n_innerfolds, test_set_size_percentage]

  • train-val-test: [val_set_size_percentage, train_set_size_percentage]

Parameters
  • y (numpy.array) – array containing phenotypic values

  • datasplit (str) – train test split to use

  • datasplit_params (list) – parameters to use for split

Returns

binned array

Return type

numpy.array
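
One common way to bin continuous values for stratification is by quantiles, so that each bin holds a similar number of samples. The sketch below shows that idea only; how make_bins chooses the number of bins from the datasplit parameters is not documented here, so `quantile_bins` and its `n_bins` argument are assumptions.

```python
import numpy as np


def quantile_bins(y, n_bins):
    """Assign each continuous value to one of n_bins quantile bins (hypothetical helper)."""
    # Interior quantile edges: n_bins - 1 cut points give n_bins groups.
    edges = np.quantile(y, np.linspace(0, 1, n_bins + 1)[1:-1])
    return np.digitize(y, edges)


y = np.array([0.1, 0.4, 0.35, 0.8, 0.9, 0.05, 0.6, 0.7])
y_binned = quantile_bins(y, n_bins=4)
print(y_binned)
```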

easypheno.preprocess.raw_data_functions.make_nested_cv(y, outerfolds, innerfolds)

Create index dictionary for stratified nested cross validation with the following structure:

{
    'outerfold_0_test': test_indices,
    'outerfold_0': {
        'fold_0_train': innerfold_0_train_indices,
        'fold_0_test': innerfold_0_test_indices,
        ...
        'fold_n_train': innerfold_n_train_indices,
        'fold_n_test': innerfold_n_test_indices
    },
    ...
    'outerfold_m_test': test_indices,
    'outerfold_m': {
        'fold_0_train': innerfold_0_train_indices,
        'fold_0_test': innerfold_0_test_indices,
        ...
        'fold_n_train': innerfold_n_train_indices,
        'fold_n_test': innerfold_n_test_indices
    }
}
Parameters
  • y (numpy.array) – target values grouped in bins for stratification

  • outerfolds (int) – number of outer folds

  • innerfolds (int) – number of inner folds

Returns

index dictionary

Return type

dict
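
The dictionary layout above can be reproduced with a small NumPy-only sketch. It is not the package's implementation: `stratified_folds` and `nested_cv_indices` are hypothetical helpers, and the round-robin assignment is just one simple way to keep bin proportions similar across folds.

```python
import numpy as np


def stratified_folds(indices, labels, n_folds, seed=42):
    """Split indices into n_folds parts with similar label proportions (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    folds = [[] for _ in range(n_folds)]
    for label in np.unique(labels):
        members = rng.permutation(indices[labels == label])
        # Deal the members of each bin round-robin across the folds.
        for position, idx in enumerate(members):
            folds[position % n_folds].append(int(idx))
    return [np.array(sorted(fold)) for fold in folds]


def nested_cv_indices(y_binned, outerfolds, innerfolds):
    """Build a dictionary with the layout documented for make_nested_cv (sketch only)."""
    all_idx = np.arange(len(y_binned))
    result = {}
    for m, test_idx in enumerate(stratified_folds(all_idx, y_binned, outerfolds)):
        train_idx = np.setdiff1d(all_idx, test_idx)
        result[f"outerfold_{m}_test"] = test_idx
        inner = {}
        for n, val_idx in enumerate(stratified_folds(train_idx, y_binned[train_idx], innerfolds)):
            inner[f"fold_{n}_train"] = np.setdiff1d(train_idx, val_idx)
            inner[f"fold_{n}_test"] = val_idx
        result[f"outerfold_{m}"] = inner
    return result


y_binned = np.repeat([0, 1, 2], 6)  # three bins with six samples each
index_dict = nested_cv_indices(y_binned, outerfolds=3, innerfolds=2)
print(sorted(index_dict))
```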

easypheno.preprocess.raw_data_functions.make_stratified_cv(x, y, split_number)

Create index dictionary for stratified cross-validation with following structure:

{
    'fold_0_train': fold_0_train_indices,
    'fold_0_test': fold_0_test_indices,
    ...
    'fold_n_train': fold_n_train_indices,
    'fold_n_test': fold_n_test_indices
}
Parameters
  • x (numpy.array) – whole train indices

  • y (numpy.array) – target values binned in groups for stratification

  • split_number (int) – number of folds

Returns

dictionary containing train and validation indices for each fold

Return type

dict

easypheno.preprocess.raw_data_functions.make_train_test_split(y, test_size, val_size=None, val=False, random=42)

Create index arrays for stratified train-test or train-val-test splits.

Parameters
  • y (numpy.array) – target values grouped in bins for stratification

  • test_size (int) – size of test set as percentage value

  • val_size (int) – size of validation set as percentage value, default is None

  • val (bool) – if True, function returns validation set in addition to train and test set, default is False

  • random (int) – controls shuffling of data, default is 42

Returns

either train, val and test index arrays or train and test index arrays and corresponding binned target values

Return type

(numpy.array, numpy.array, numpy.array)
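
A stratified train-test split takes the same percentage of samples from every bin. The NumPy-only sketch below illustrates this for the two-way case; `stratified_split` is a hypothetical helper, and the package may rely on a library routine instead.

```python
import numpy as np


def stratified_split(y_binned, test_size, seed=42):
    """Return train and test index arrays with test_size percent per bin (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    test_parts = []
    for label in np.unique(y_binned):
        members = rng.permutation(np.flatnonzero(y_binned == label))
        n_test = int(round(len(members) * test_size / 100))
        test_parts.append(members[:n_test])
    test_idx = np.sort(np.concatenate(test_parts))
    train_idx = np.setdiff1d(np.arange(len(y_binned)), test_idx)
    return train_idx, test_idx


y_binned = np.repeat([0, 1], 10)  # two bins with ten samples each
train_idx, test_idx = stratified_split(y_binned, test_size=20)
print(len(train_idx), len(test_idx))
```

A train-val-test split could be obtained by applying the same helper a second time to the training indices.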