`easypheno.preprocess.raw_data_functions`

Module Contents

Functions

`prepare_data_files`(data_dir, genotype_matrix_name, phenotype_matrix_name, phenotype, datasplit, n_outerfolds, n_innerfolds, test_set_size_percentage, val_set_size_percentage, models, user_encoding, maf_percentage)	Prepare all data files for a common format: genotype matrix, phenotype matrix and index file.
`check_genotype_h5_file`(data_dir, genotype_matrix_name, encodings)	Check that the .h5 genotype file contains sample_ids, snp_ids and X_{enc}.
`check_index_file`(data_dir, genotype_matrix_name, phenotype_matrix_name, phenotype)	Check if index file is available and if the datasets 'y', 'matched_sample_ids', 'X_index', 'y_index' and 'ma_frequency' exist.
`save_all_data_files`(data_dir, genotype_matrix_name, phenotype_matrix_name, phenotype, models, user_encoding, maf_percentage, datasplit, n_outerfolds, n_innerfolds, test_set_size_percentage, val_set_size_percentage)	Prepare and save all required data files (genotype .h5, phenotype .csv, index .h5).
`check_transform_format_genotype_matrix`(data_dir, genotype_matrix_name, models, user_encoding, save_h5=True)	Check the format of the specified genotype matrix.
`check_genotype_csv_file`(data_dir, genotype_matrix_name, encodings)	Load .csv genotype file with sample ids in the first column and SNP ids as column names.
`check_genotype_binary_plink_file`(data_dir, genotype_matrix_name)	Load binary PLINK file, .bim, .fam, .bed files with same prefix need to be in same folder.
`check_genotype_plink_file`(data_dir, genotype_matrix_name)	Load PLINK files, .map and .ped file with same prefix need to be in same folder.
`check_duplicate_samples`(sample_ids)	Check if the genotype matrix contains duplicate samples.
`check_genotype_shape`(X, sample_ids, snp_ids)	Check if number of samples in sample_ids and genotype matrix match
`create_genotype_h5_file`(data_dir, genotype_matrix_name, sample_ids, snp_ids, X)	Save genotype matrix in unified .h5 file.
`check_and_load_phenotype_matrix`(data_dir, phenotype_matrix_name, phenotype)	Check and load the specified phenotype matrix. Only accept .csv, .pheno, .txt files.
`genotype_phenotype_matching`(X, X_ids, y)	Match the handed over genotype and phenotype matrix for the phenotype specified by the user, i.e. compare sample ids
`get_matched_data`(data, index)	Get elements of data specified in index array
`append_index_file`(data_dir, genotype_matrix_name, phenotype_matrix_name, phenotype, datasplit, n_outerfolds, n_innerfolds, test_set_size_percentage, val_set_size_percentage, maf_percentage)	Check index file, described in create_index_file(), and append datasets if necessary
`create_index_file`(data_dir, genotype_matrix_name, phenotype_matrix_name, phenotype, datasplit, n_outerfolds, n_innerfolds, test_set_size_percentage, val_set_size_percentage, maf_percentage, X, y, sample_ids, X_index, y_index)	Create the .h5 index file containing the maf filters and data splits for the combination of genotype matrix, phenotype matrix and phenotype.
`filter_non_informative_snps`(X)	Remove non-informative SNPs, i.e. SNPs that are constant
`get_minor_allele_freq`(X)	Compute minor allele frequencies of genotype matrix
`create_maf_filter`(maf, freq)	Create minor allele frequency filter
`check_datasplit_user_input`(user_datasplit, user_n_outerfolds, user_n_innerfolds, user_test_set_size_percentage, user_val_set_size_percentage, datasplit, param_to_check)	Check if user input of data split parameters differs from standard values.
`check_train_test_splits`(y, datasplit, datasplit_params)	Create stratified train-test splits. Continuous values will be grouped into bins and stratified according to those.
`make_bins`(y, datasplit, datasplit_params)	Create bins of continuous values for stratification.
`make_nested_cv`(y, outerfolds, innerfolds)	Create index dictionary for stratified nested cross-validation.
`make_stratified_cv`(x, y, split_number)	Create index dictionary for stratified cross-validation.
`make_train_test_split`(y, test_size, val_size=None, val=False, random=42)	Create index arrays for stratified train-test, respectively train-val-test splits.

easypheno.preprocess.raw_data_functions.prepare_data_files(data_dir, genotype_matrix_name, phenotype_matrix_name, phenotype, datasplit, n_outerfolds, n_innerfolds, test_set_size_percentage, val_set_size_percentage, models, user_encoding, maf_percentage)

Prepare all data files for a common format: genotype matrix, phenotype matrix and index file.

First check if genotype file is .h5 file (standard format of this framework):

YES: First check if all required information is present in the file, raise Exception if not. Then check if index file exists:
- NO: Load genotype and create all required index files
- YES: Append all required data splits and maf-filters to index file
NO: Load genotype and create all required files

Parameters

data_dir (pathlib.Path) – data directory where the phenotype and genotype matrix are stored
genotype_matrix_name (str) – name of the genotype matrix including datatype ending
phenotype_matrix_name (str) – name of the phenotype matrix including datatype ending
phenotype (str) – name of the phenotype to predict
datasplit (str) – datasplit to use. Options are: nested-cv, cv-test, train-val-test
n_outerfolds (int) – number of outerfolds relevant for nested-cv
n_innerfolds (int) – number of folds relevant for nested-cv and cv-test
test_set_size_percentage (int) – size of the test set relevant for cv-test and train-val-test
val_set_size_percentage (int) – size of the validation set relevant for train-val-test
models – models to consider
user_encoding (str) – encoding specified by the user
maf_percentage (int) – threshold for MAF filter as percentage value
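
The branching described for prepare_data_files can be sketched as a small pure function. This is only an illustration: `plan_preparation` and its two boolean arguments are hypothetical names, not part of easypheno, and the set of suffixes is taken from the formats listed for check_transform_format_genotype_matrix.

```python
from pathlib import Path

# Suffixes treated as the unified HDF5 format (assumption: same list as the
# one given for check_transform_format_genotype_matrix).
H5_SUFFIXES = {".h5", ".hdf5", ".h5py"}


def plan_preparation(genotype_matrix_name: str, h5_is_valid: bool,
                     index_file_exists: bool) -> str:
    """Hypothetical helper: name the branch of the workflow that applies."""
    if Path(genotype_matrix_name).suffix in H5_SUFFIXES:
        if not h5_is_valid:
            # The documentation states that an exception is raised here.
            raise ValueError("genotype .h5 file lacks required information")
        if index_file_exists:
            return "append missing data splits and MAF filters to index file"
        return "load genotype and create index file"
    # Any other format: everything has to be created from scratch.
    return "load genotype and create all required files"
```

For example, `plan_preparation("geno.csv", False, False)` falls through to the last branch, because a .csv genotype always has to be converted first.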

easypheno.preprocess.raw_data_functions.check_genotype_h5_file(data_dir, genotype_matrix_name, encodings)

Check .h5 genotype file. Should contain:

sample_ids: vector with sample names of genotype matrix,
snp_ids: vector with SNP identifiers of genotype matrix,
X_{enc}: (samples x SNPs)-genotype matrix in enc encoding, where enc might refer to:
- ‘012’: additive (number of minor alleles)
- ‘raw’: raw (alleles)

Parameters

data_dir (pathlib.Path) – data directory where the phenotype and genotype matrix are stored
genotype_matrix_name (str) – name of the genotype matrix including datatype ending
encodings (list) – list of needed encodings
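
A minimal sketch of the key check, written against a plain collection of dataset names so it runs without an HDF5 file. `missing_h5_datasets` is a hypothetical helper, and it assumes that every requested encoding must be stored explicitly, which the real check may relax (for instance if one encoding can be derived from another).

```python
def missing_h5_datasets(available, encodings):
    """Hypothetical helper: list required dataset names absent from `available`."""
    # sample_ids and snp_ids are always required, plus one X_{enc} per encoding.
    required = ["sample_ids", "snp_ids"] + [f"X_{enc}" for enc in encodings]
    return [name for name in required if name not in available]
```

With h5py, the keys of an open file could be passed as `available`.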

easypheno.preprocess.raw_data_functions.check_index_file(data_dir, genotype_matrix_name, phenotype_matrix_name, phenotype)

Check if index file is available and if the datasets ‘y’, ‘matched_sample_ids’, ‘X_index’, ‘y_index’ and ‘ma_frequency’ exist.

Parameters

data_dir (pathlib.Path) – data directory where the phenotype and genotype matrix are stored
genotype_matrix_name (str) – name of the genotype matrix including datatype ending
phenotype_matrix_name (str) – name of the phenotype matrix including datatype ending
phenotype (str) – name of the phenotype to predict

Returns

bool reflecting check result

Return type

bool

easypheno.preprocess.raw_data_functions.save_all_data_files(data_dir, genotype_matrix_name, phenotype_matrix_name, phenotype, models, user_encoding, maf_percentage, datasplit, n_outerfolds, n_innerfolds, test_set_size_percentage, val_set_size_percentage)

Prepare and save all required data files:

genotype matrix in unified format as .h5 file,
phenotype matrix in unified format as .csv file,
file containing maf filter and data split indices as .h5

Parameters

data_dir (pathlib.Path) – data directory where the phenotype and genotype matrix are stored
genotype_matrix_name (str) – name of the genotype matrix including datatype ending
phenotype_matrix_name (str) – name of the phenotype matrix including datatype ending
phenotype (str) – name of the phenotype to predict
models – models to consider
user_encoding (str) – encoding specified by the user
maf_percentage (int) – threshold for MAF filter as percentage value
datasplit (str) – datasplit to use. Options are: nested-cv, cv-test, train-val-test
n_outerfolds (int) – number of outerfolds relevant for nested-cv
n_innerfolds (int) – number of folds relevant for nested-cv and cv-test
test_set_size_percentage (int) – size of the test set relevant for cv-test and train-val-test
val_set_size_percentage (int) – size of the validation set relevant for train-val-test

easypheno.preprocess.raw_data_functions.check_transform_format_genotype_matrix(data_dir, genotype_matrix_name, models, user_encoding, save_h5=True)

Check the format of the specified genotype matrix.

Unified genotype matrix will be saved in subdirectory data and named NAME_OF_GENOTYPE_MATRIX.h5

Unified format of the .h5 file of the genotype matrix required for the further processes:

mandatory:
- sample_ids: vector with sample names of genotype matrix,
- snp_ids: vector with SNP identifiers of genotype matrix,
- X_{enc}: (samples x SNPs)-genotype matrix in enc encoding, where enc might refer to:
  - ‘012’: additive (number of minor alleles)
  - ‘raw’: raw (alleles)
optional: genotype in additional encodings

Accepts .h5, .hdf5, .h5py, .csv, PLINK binary and PLINK files. .h5, .hdf5, .h5py files must satisfy the unified format. If the genotype matrix contains constant SNPs, those will be removed and a new file will be saved. Will open .csv, PLINK and binary PLINK files and generate required .h5 format.

Parameters

data_dir (pathlib.Path) – data directory where the phenotype and genotype matrix are stored
genotype_matrix_name (str) – name of the genotype matrix including datatype ending
models – models to consider
user_encoding (str) – encoding specified by the user
save_h5 (bool) – save genotype in unified h5 format if True, default is True

Returns

genotype matrix (raw encoded if present, 012 encoded otherwise), sample ids and SNP ids

Return type

(numpy.array, numpy.array, numpy.array)

easypheno.preprocess.raw_data_functions.check_genotype_csv_file(data_dir, genotype_matrix_name, encodings)

Load .csv genotype file. The file must have the following structure: the first column contains the sample ids and the column names are the SNP ids. The values are the genotype matrix in either additive or raw encoding.

If the first column is named ‘MarkerID’, the rows are assumed to contain the markers and the columns the samples, and the genotype matrix is transposed. If the .csv file contains the genotype in biallelic notation (e.g. ‘AA’, ‘AT’, …), this function generates a genotype matrix in IUPAC notation (e.g. ‘A’, ‘W’, …).

Parameters

data_dir (pathlib.Path) – data directory where the phenotype and genotype matrix are stored
genotype_matrix_name (str) – name of the genotype matrix including datatype ending
encodings (list) – list of needed encodings

Returns

sample ids, SNP ids and genotype in additive / raw encoding

Return type

(numpy.array, numpy.array, numpy.array)
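
The conversion from biallelic to IUPAC notation mentioned for check_genotype_csv_file can be illustrated with the standard IUPAC nucleotide codes. The helper name `biallelic_to_iupac` is hypothetical, and the sketch handles one genotype string at a time rather than a whole matrix.

```python
# Standard IUPAC codes for unordered pairs of bases; homozygous genotypes
# collapse to the single base because frozenset("AA") == frozenset("A").
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
}


def biallelic_to_iupac(genotype: str) -> str:
    """Hypothetical helper: map e.g. 'AT' to 'W' and 'AA' to 'A'."""
    return IUPAC[frozenset(genotype.upper())]
```

Using a frozenset as the key makes the lookup independent of allele order, so ‘AT’ and ‘TA’ give the same code.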

easypheno.preprocess.raw_data_functions.check_genotype_binary_plink_file(data_dir, genotype_matrix_name)

Load binary PLINK files. The .bim, .fam and .bed files with the same prefix need to be in the same folder. Computes additive and raw encoding of the genotype.

Parameters

data_dir (pathlib.Path) – data directory where the phenotype and genotype matrix are stored
genotype_matrix_name (str) – name of the genotype matrix including datatype ending

Returns

sample ids, SNP ids and genotype in raw encoding

Return type

(numpy.array, numpy.array, numpy.array)

easypheno.preprocess.raw_data_functions.check_genotype_plink_file(data_dir, genotype_matrix_name)

Load PLINK files. The .map and .ped files with the same prefix need to be in the same folder. Accepts GENOTYPENAME.ped and GENOTYPENAME.map as input.

Parameters

data_dir (pathlib.Path) – data directory where the phenotype and genotype matrix are stored
genotype_matrix_name (str) – name of the genotype matrix including datatype ending

Returns

sample ids, SNP ids and genotype in raw encoding

Return type

(numpy.array, numpy.array, numpy.array)

easypheno.preprocess.raw_data_functions.check_duplicate_samples(sample_ids)

Check if the genotype matrix contains duplicate samples.

Parameters: sample_ids (numpy.array) – sample ids of genotype matrix
Returns: True if duplicates are present, False if not
Return type: bool
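
One way such a duplicate check could look with NumPy is shown below. `has_duplicate_samples` is a hypothetical name, and the sketch is not necessarily how the package implements it.

```python
import numpy as np


def has_duplicate_samples(sample_ids) -> bool:
    """Hypothetical helper: True if any sample id occurs more than once."""
    sample_ids = np.asarray(sample_ids)
    # np.unique drops repeats, so a smaller size means duplicates exist.
    return bool(np.unique(sample_ids).size != sample_ids.size)
```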

easypheno.preprocess.raw_data_functions.check_genotype_shape(X, sample_ids, snp_ids)

Check if number of samples in sample_ids and genotype matrix match and if number of markers in snp_ids and genotype matrix match.

Parameters

X (numpy.array) – genotype matrix
sample_ids (numpy.array) – vector containing sample ids of genotype
snp_ids (numpy.array) – vector containing SNP ids of genotype
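
A short sketch of the two comparisons, assuming the matrix is laid out as samples x SNPs. The documentation does not say how a mismatch is reported; raising ValueError here is an assumption, and `assert_genotype_shape` is a hypothetical name.

```python
import numpy as np


def assert_genotype_shape(X, sample_ids, snp_ids) -> None:
    """Hypothetical helper: raise if ids and matrix dimensions disagree."""
    n_samples, n_snps = np.asarray(X).shape
    if n_samples != len(sample_ids):
        raise ValueError("number of sample ids does not match rows of X")
    if n_snps != len(snp_ids):
        raise ValueError("number of SNP ids does not match columns of X")
```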

easypheno.preprocess.raw_data_functions.create_genotype_h5_file(data_dir, genotype_matrix_name, sample_ids, snp_ids, X)

Save genotype matrix in unified .h5 file.

Structure:

sample_ids
snp_ids
X_raw (or X_012 if X_raw not available)

Parameters

data_dir (pathlib.Path) – data directory where the phenotype and genotype matrix are stored
genotype_matrix_name (str) – name of the genotype matrix including datatype ending
sample_ids (numpy.array) – array containing sample ids of genotype data
snp_ids (numpy.array) – array containing snp ids of genotype data
X (numpy.array) – matrix containing genotype either in raw or in additive encoding

easypheno.preprocess.raw_data_functions.check_and_load_phenotype_matrix(data_dir, phenotype_matrix_name, phenotype)

Check and load the specified phenotype matrix. Only .csv, .pheno and .txt files are accepted. Sample ids need to be in the first column; the remaining columns should contain phenotypic values with the phenotype name as column name.

Parameters

data_dir (pathlib.Path) – data directory where the phenotype and genotype matrix are stored
phenotype_matrix_name (str) – name of the phenotype matrix including datatype ending
phenotype (str) – name of the phenotype to predict

Returns

DataFrame with sample_ids as index and phenotype values as single column without NaN values

Return type

pandas.DataFrame

easypheno.preprocess.raw_data_functions.genotype_phenotype_matching(X, X_ids, y)

Match the given genotype and phenotype matrix for the phenotype specified by the user, i.e. compare sample ids.

Parameters

X (numpy.array) – genotype matrix in additive encoding
X_ids (numpy.array) – sample ids of genotype matrix
y (pandas.DataFrame) – pd.DataFrame containing sample ids of phenotype as index and phenotype values as single column

Returns

matched genotype matrix, matched sample ids, index arrays for genotype and phenotype to redo matching

Return type

tuple
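
The matching step can be sketched with `numpy.intersect1d`, which returns the common ids together with their positions in both inputs. To keep the example free of pandas, the hypothetical helper `match_samples` takes the phenotype sample ids as a plain array (`y_ids`) instead of a DataFrame; the real function may order or handle ids differently.

```python
import numpy as np


def match_samples(X, X_ids, y_ids):
    """Hypothetical helper: keep only samples present in both id arrays."""
    # return_indices=True also yields the positions of the common ids
    # in X_ids and y_ids, which is what allows the matching to be redone.
    matched_ids, X_index, y_index = np.intersect1d(
        X_ids, y_ids, return_indices=True
    )
    return X[X_index], matched_ids, X_index, y_index
```

The stored index arrays play the role described for get_matched_data: `X_ids[X_index]` and `y_ids[y_index]` both reproduce the matched sample ids.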

easypheno.preprocess.raw_data_functions.get_matched_data(data, index)

Get elements of data specified in index array

Parameters

data (numpy.array) – matrix or array
index (numpy.array) – index array

Returns

data at selected indices

Return type

numpy.array

easypheno.preprocess.raw_data_functions.append_index_file(data_dir, genotype_matrix_name, phenotype_matrix_name, phenotype, datasplit, n_outerfolds, n_innerfolds, test_set_size_percentage, val_set_size_percentage, maf_percentage)

Check index file, described in create_index_file(), and append datasets if necessary

Parameters

data_dir (pathlib.Path) – data directory where the phenotype and genotype matrix are stored
genotype_matrix_name (str) – name of the genotype matrix including datatype ending
phenotype_matrix_name (str) – name of the phenotype matrix including datatype ending
phenotype (str) – name of the phenotype to predict
maf_percentage (int) – threshold for MAF filter as percentage value
datasplit (str) – datasplit to use. Options are: nested-cv, cv-test, train-val-test
n_outerfolds (int) – number of outerfolds relevant for nested-cv
n_innerfolds (int) – number of folds relevant for nested-cv and cv-test
test_set_size_percentage (int) – size of the test set relevant for cv-test and train-val-test
val_set_size_percentage (int) – size of the validation set relevant for train-val-test

easypheno.preprocess.raw_data_functions.create_index_file(data_dir, genotype_matrix_name, phenotype_matrix_name, phenotype, datasplit, n_outerfolds, n_innerfolds, test_set_size_percentage, val_set_size_percentage, maf_percentage, X, y, sample_ids, X_index, y_index)

Create the .h5 index file containing the maf filters and data splits for the combination of genotype matrix, phenotype matrix and phenotype. It is created using the standard values in addition to the user inputs for the maf filters and data splits.

Unified format of .h5 file containing the maf filters and data splits:

'matched_data': {
        'y': matched phenotypic values,
        'matched_sample_ids': sample ids of matched genotype/phenotype,
        'X_index': indices of genotype matrix to redo matching,
        'y_index': indices of phenotype vector to redo matching,
        'ma_frequency': minor allele frequency of each SNP of genotype file to create new MAF filters
        }
'maf_filter': {
        'maf_{maf_percentage}': indices of SNPs to delete  # (with MAF < maf_percentage),
        ...
        }
'datasplits': {
        'nested_cv': {
                '#outerfolds-#innerfolds': {
                        'outerfold_0': {
                            'innerfold_0': {'train': indices_train, 'val': indices_val},
                            ...
                            'innerfold_n': {'train': indices_train, 'val': indices_val},
                            'test': test_indices
                            },
                        ...
                        'outerfold_m': {
                            'innerfold_0': {'train': indices_train, 'val': indices_val},
                            ...
                            'innerfold_n': {'train': indices_train, 'val': indices_val},
                            'test': test_indices
                            }
                        },
                ...
                }
        'cv-test': {
                '#folds-test_percentage': {
                        'outerfold_0': {
                            'innerfold_0': {'train': indices_train, 'val': indices_val},
                            ...
                            'innerfold_n': {'train': indices_train, 'val': indices_val},
                            'test': test_indices
                            }
                        },
                ...
                }
        'train-val-test': {
                'train_percentage-val_percentage-test_percentage': {
                        'outerfold_0': {
                            'innerfold_0': {'train': indices_train, 'val': indices_val},
                            'test': test_indices
                            }
                        },
                ...
                }
        }

Standard values for the maf filters and data splits:

maf thresholds: 1, 3, 5
folds (inner-/outerfolds for ‘nested-cv’ and folds for ‘cv-test’): 5
test percentage (for ‘cv-test’ and ‘train-val-test’): 20
val percentage (for ‘train-val-test’): 20

Parameters

data_dir (pathlib.Path) – data directory where the phenotype and genotype matrix are stored
genotype_matrix_name (str) – name of the genotype matrix including datatype ending
phenotype_matrix_name (str) – name of the phenotype matrix including datatype ending
phenotype (str) – name of the phenotype to predict
maf_percentage (int) – threshold for MAF filter as percentage value
datasplit (str) – datasplit to use. Options are: nested-cv, cv-test, train-val-test
n_outerfolds (int) – number of outerfolds relevant for nested-cv
n_innerfolds (int) – number of folds relevant for nested-cv and cv-test
test_set_size_percentage (int) – size of the test set relevant for cv-test and train-val-test
val_set_size_percentage (int) – size of the validation set relevant for train-val-test
X (numpy.array) – genotype in additive encoding to create ma-frequencies
y (numpy.array) – matched phenotype values
sample_ids (numpy.array) – matched sample ids of genotype/phenotype
X_index (numpy.array) – index file of genotype to redo matching
y_index (numpy.array) – index file of phenotype to redo matching
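
The parameter keys used inside the ‘datasplits’ group follow the patterns shown in the layout above. The sketch below builds such a key; `datasplit_param_key` is a hypothetical name, and the train percentage is assumed to be whatever remains after the validation and test shares, which the documentation does not state explicitly.

```python
def datasplit_param_key(datasplit, n_outerfolds, n_innerfolds,
                        test_set_size_percentage, val_set_size_percentage):
    """Hypothetical helper: build the key under which a split is stored."""
    if datasplit == "nested-cv":
        # '#outerfolds-#innerfolds'
        return f"{n_outerfolds}-{n_innerfolds}"
    if datasplit == "cv-test":
        # '#folds-test_percentage'
        return f"{n_innerfolds}-{test_set_size_percentage}"
    if datasplit == "train-val-test":
        # Assumption: train share is the remainder of 100 percent.
        train = 100 - val_set_size_percentage - test_set_size_percentage
        return f"{train}-{val_set_size_percentage}-{test_set_size_percentage}"
    raise ValueError(f"unknown datasplit: {datasplit}")
```

With the standard values listed above, the three keys would be ‘5-5’, ‘5-20’ and ‘60-20-20’ under this assumption.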

easypheno.preprocess.raw_data_functions.filter_non_informative_snps(X)

Remove non-informative SNPs, i.e. SNPs that are constant

Parameters: X (numpy.array) – genotype matrix in raw or additive encoding
Returns: filtered genotype matrix and filter-vector
Return type: (numpy.array, numpy.array)
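
A minimal NumPy sketch of removing constant SNPs. `drop_constant_snps` is a hypothetical name, and returning a boolean keep-mask as the filter vector is an assumption; the package may return indices instead.

```python
import numpy as np


def drop_constant_snps(X):
    """Hypothetical helper: remove columns whose values never vary."""
    X = np.asarray(X)
    # A column is constant when every entry equals the entry in the first row.
    keep = ~np.all(X == X[0, :], axis=0)
    return X[:, keep], keep
```

The comparison is elementwise, so the same code works for additive (numeric) and raw (string) encodings.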

easypheno.preprocess.raw_data_functions.get_minor_allele_freq(X)

Compute minor allele frequencies of genotype matrix

Parameters: X (numpy.array) – genotype matrix in additive encoding
Returns: array with frequencies
Return type: numpy.array
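
For an additive (012) matrix the frequency of the counted allele is the column sum divided by twice the number of samples. The sketch below assumes diploid samples and no missing values; `minor_allele_freq` is a hypothetical name.

```python
import numpy as np


def minor_allele_freq(X):
    """Hypothetical helper: per-SNP minor allele frequency from 012 counts."""
    X = np.asarray(X)
    # Each diploid sample carries two alleles per SNP.
    freq = X.sum(axis=0) / (2 * X.shape[0])
    # Take the rarer of the two alleles, whichever one was counted.
    return np.minimum(freq, 1 - freq)
```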

easypheno.preprocess.raw_data_functions.create_maf_filter(maf, freq)

Create minor allele frequency filter

Parameters

maf (int) – maf threshold as percentage value
freq (numpy.array) – array containing minor allele frequencies as decimal value

Returns

array containing indices of SNPs with MAF smaller than specified threshold, i.e. SNPs to delete

Return type

numpy.array
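
Because the threshold is given as a percentage and the frequencies as decimals, the threshold has to be divided by 100 before comparing. A sketch with the hypothetical name `maf_filter_indices`:

```python
import numpy as np


def maf_filter_indices(maf, freq):
    """Hypothetical helper: indices of SNPs with MAF below `maf` percent."""
    freq = np.asarray(freq)
    # maf is a percentage (e.g. 5), freq holds decimals (e.g. 0.05).
    return np.where(freq < maf / 100)[0]
```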

easypheno.preprocess.raw_data_functions.check_datasplit_user_input(user_datasplit, user_n_outerfolds, user_n_innerfolds, user_test_set_size_percentage, user_val_set_size_percentage, datasplit, param_to_check)

Check if user input of data split parameters differs from standard values. If it does, add the input to the list of parameters.

Parameters

user_datasplit (str) – datasplit specified by the user
user_n_outerfolds (int) – number of outerfolds relevant for nested-cv specified by the user
user_n_innerfolds (int) – number of folds relevant for nested-cv and cv-test specified by the user
user_test_set_size_percentage (int) – size of the test set relevant for cv-test and train-val-test specified by the user
user_val_set_size_percentage (int) – size of the validation set relevant for train-val-test specified by the user
datasplit (str) – type of data split
param_to_check (list) – standard parameters to compare to

Returns

adapted list of parameters

Return type

list

easypheno.preprocess.raw_data_functions.check_train_test_splits(y, datasplit, datasplit_params)

Create stratified train-test splits. Continuous values will be grouped into bins and stratified according to those.

Datasplit parameters:

nested-cv: [n_outerfolds, n_innerfolds]
cv-test: [n_innerfolds, test_set_size_percentage]
train-val-test: [val_set_size_percentage, train_set_size_percentage]

Parameters

datasplit (str) – type of datasplit (‘nested-cv’, ‘cv-test’, ‘train-val-test’)
y (numpy.array) – array with phenotypic values for stratification
datasplit_params (list) – parameters to use for split

Returns

dictionary or arrays with indices, depending on the datasplit

easypheno.preprocess.raw_data_functions.make_bins(y, datasplit, datasplit_params)

Create bins of continuous values for stratification.

Datasplit parameters:

nested-cv: [n_outerfolds, n_innerfolds]
cv-test: [n_innerfolds, test_set_size_percentage]
train-val-test: [val_set_size_percentage, train_set_size_percentage]

Parameters

y (numpy.array) – array containing phenotypic values
datasplit (str) – train test split to use
datasplit_params (list) – parameters to use for split

Returns

binned array

Return type

numpy.array
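
The documentation does not state how the bins are formed. As one possible illustration, the sketch below uses quantile-based binning so that each bin holds roughly the same number of samples; `quantile_bins` and the `n_bins` parameter are hypothetical and not taken from easypheno.

```python
import numpy as np


def quantile_bins(y, n_bins=4):
    """Hypothetical helper: assign each value of y to a quantile bin."""
    y = np.asarray(y, dtype=float)
    # Inner quantile edges, e.g. 25%, 50%, 75% for four bins.
    edges = np.quantile(y, np.linspace(0, 1, n_bins + 1)[1:-1])
    # np.digitize returns the bin number (0 .. n_bins - 1) for each value.
    return np.digitize(y, edges)
```

The resulting integer labels can be passed to any stratified splitter in place of the continuous phenotype.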

easypheno.preprocess.raw_data_functions.make_nested_cv(y, outerfolds, innerfolds)

Create index dictionary for stratified nested cross validation with the following structure:

{
    'outerfold_0_test': test_indices,
    'outerfold_0': {
        'fold_0_train': innerfold_0_train_indices,
        'fold_0_test': innerfold_0_test_indices,
        ...
        'fold_n_train': innerfold_n_train_indices,
        'fold_n_test': innerfold_n_test_indices
    },
    ...
    'outerfold_m_test': test_indices,
    'outerfold_m': {
        'fold_0_train': innerfold_0_train_indices,
        'fold_0_test': innerfold_0_test_indices,
        ...
        'fold_n_train': innerfold_n_train_indices,
        'fold_n_test': innerfold_n_test_indices
    }
}

Parameters

y (numpy.array) – target values grouped in bins for stratification
outerfolds (int) – number of outer folds
innerfolds (int) – number of inner folds

Returns

index dictionary

Return type

dict

easypheno.preprocess.raw_data_functions.make_stratified_cv(x, y, split_number)

Create index dictionary for stratified cross-validation with following structure:

{
    'fold_0_train': fold_0_train_indices,
    'fold_0_test': fold_0_test_indices,
    ...
    'fold_n_train': fold_n_train_indices,
    'fold_n_test': fold_n_test_indices
}

Parameters

x (numpy.array) – whole train indices
y (numpy.array) – target values binned in groups for stratification
split_number (int) – number of folds

Returns

dictionary containing train and validation indices for each fold

Return type

dict
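
To make the dictionary layout concrete, the sketch below builds it with a hand-rolled stratified assignment: within each bin, samples are shuffled and dealt round-robin across folds. This is not necessarily the method the package uses; `stratified_folds` and its `seed` parameter are hypothetical, and `y_binned` is assumed to be aligned position by position with `x`.

```python
import numpy as np


def stratified_folds(x, y_binned, n_folds, seed=42):
    """Hypothetical helper: fold dictionary with 'fold_k_train' / 'fold_k_test'."""
    x = np.asarray(x)
    y_binned = np.asarray(y_binned)
    rng = np.random.default_rng(seed)
    fold_of = np.empty(len(x), dtype=int)
    for label in np.unique(y_binned):
        # Shuffle the positions of this bin, then deal them across folds.
        members = rng.permutation(np.flatnonzero(y_binned == label))
        fold_of[members] = np.arange(len(members)) % n_folds
    index_dict = {}
    for k in range(n_folds):
        index_dict[f"fold_{k}_train"] = x[fold_of != k]
        index_dict[f"fold_{k}_test"] = x[fold_of == k]
    return index_dict
```

A nested variant in the style of make_nested_cv could call such a function once for the outer folds and once more on each outer training set.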

easypheno.preprocess.raw_data_functions.make_train_test_split(y, test_size, val_size=None, val=False, random=42)

Create index arrays for stratified train-test or train-val-test splits.

Parameters

y (numpy.array) – target values grouped in bins for stratification
test_size (int) – size of test set as percentage value
val_size (int) – size of validation set as percentage value, default is None
val (bool) – if True, the function returns a validation set in addition to the train and test set, default is False
random (int) – controls shuffling of data, default is 42

Returns

either train, val and test index arrays or train and test index arrays and corresponding binned target values

Return type

(numpy.array, numpy.array, numpy.array)
