easypheno.preprocess.raw_data_functions
Module Contents
Functions
prepare_data_files | Prepare all data files for a common format: genotype matrix, phenotype matrix and index file.
check_genotype_h5_file | Check .h5 genotype file.
check_index_file | Check if index file is available and if the datasets 'y', 'matched_sample_ids', 'X_index', 'y_index' and 'ma_frequency' exist.
save_all_data_files | Prepare and save all required data files.
check_transform_format_genotype_matrix | Check the format of the specified genotype matrix.
check_genotype_csv_file | Load .csv genotype file.
check_genotype_binary_plink_file | Load binary PLINK files; .bim, .fam and .bed files with the same prefix need to be in the same folder.
check_genotype_plink_file | Load PLINK files; .map and .ped files with the same prefix need to be in the same folder.
check_duplicate_samples | Check if the genotype matrix contains duplicate samples.
check_genotype_shape | Check if the number of samples in sample_ids and the genotype matrix match and if the number of markers in snp_ids and the genotype matrix match.
create_genotype_h5_file | Save genotype matrix in unified .h5 file.
check_and_load_phenotype_matrix | Check and load the specified phenotype matrix. Only accepts .csv, .pheno, .txt files.
genotype_phenotype_matching | Match the handed over genotype and phenotype matrix for the phenotype specified by the user, i.e. compare sample ids.
get_matched_data | Get elements of data specified in index array.
append_index_file | Check index file, described in create_index_file(), and append datasets if necessary.
create_index_file | Create the .h5 index file containing the maf filters and data splits for the combination of genotype matrix, phenotype matrix and phenotype.
filter_non_informative_snps | Remove non-informative SNPs, i.e. SNPs that are constant.
get_minor_allele_freq | Compute minor allele frequencies of genotype matrix.
create_maf_filter | Create minor allele frequency filter.
check_datasplit_user_input | Check if user input of data split parameters differs from standard values.
check_train_test_splits | Create stratified train-test splits. Continuous values will be grouped into bins and stratified according to those.
make_bins | Create bins of continuous values for stratification.
make_nested_cv | Create index dictionary for stratified nested cross validation.
make_stratified_cv | Create index dictionary for stratified cross-validation.
make_train_test_split | Create index arrays for stratified train-test or train-val-test splits.
- easypheno.preprocess.raw_data_functions.prepare_data_files(data_dir, genotype_matrix_name, phenotype_matrix_name, phenotype, datasplit, n_outerfolds, n_innerfolds, test_set_size_percentage, val_set_size_percentage, models, user_encoding, maf_percentage)
Prepare all data files for a common format: genotype matrix, phenotype matrix and index file.
First check whether the genotype file is an .h5 file (standard format of this framework):
YES: Check that all required information is present in the file and raise an Exception if not. Then check whether the index file exists:
    NO: Load genotype and create all required index files
    YES: Append all required data splits and maf-filters to index file
NO: Load genotype and create all required files
- Parameters
data_dir (pathlib.Path) – data directory where the phenotype and genotype matrix are stored
genotype_matrix_name (str) – name of the genotype matrix including datatype ending
phenotype_matrix_name (str) – name of the phenotype matrix including datatype ending
phenotype (str) – name of the phenotype to predict
datasplit (str) – datasplit to use. Options are: nested-cv, cv-test, train-val-test
n_outerfolds (int) – number of outerfolds relevant for nested-cv
n_innerfolds (int) – number of folds relevant for nested-cv and cv-test
test_set_size_percentage (int) – size of the test set relevant for cv-test and train-val-test
val_set_size_percentage (int) – size of the validation set relevant for train-val-test
models – models to consider
user_encoding (str) – encoding specified by the user
maf_percentage (int) – threshold for MAF filter as percentage value
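The decision flow described for prepare_data_files can be sketched as a small pure function. This is an illustration only: the name plan_preparation, its boolean arguments, the returned strings and the use of ValueError are assumptions made for this example and are not part of easypheno.

```python
from pathlib import Path


def plan_preparation(genotype_matrix_name: str, h5_is_valid: bool, index_file_exists: bool) -> str:
    """Return the action that the documented decision flow would take (hypothetical helper)."""
    # Assumption: only the '.h5' suffix counts as the framework's standard format here.
    if Path(genotype_matrix_name).suffix != ".h5":
        return "load genotype and create all required files"
    if not h5_is_valid:
        # The documentation only says that an Exception is raised; ValueError is a stand-in.
        raise ValueError("genotype .h5 file is missing required information")
    if index_file_exists:
        return "append required data splits and maf-filters to index file"
    return "load genotype and create all required index files"


print(plan_preparation("genotype.csv", h5_is_valid=False, index_file_exists=False))
print(plan_preparation("genotype.h5", h5_is_valid=True, index_file_exists=True))
```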
- easypheno.preprocess.raw_data_functions.check_genotype_h5_file(data_dir, genotype_matrix_name, encodings)
Check .h5 genotype file. Should contain:
sample_ids: vector with sample names of genotype matrix,
snp_ids: vector with SNP identifiers of genotype matrix,
X_{enc}: (samples x SNPs)-genotype matrix in enc encoding, where enc might refer to:
‘012’: additive (number of minor alleles)
‘raw’: raw (alleles)
- Parameters
data_dir (pathlib.Path) – data directory where the phenotype and genotype matrix are stored
genotype_matrix_name (str) – name of the genotype matrix including datatype ending
encodings (list) – list of needed encodings
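One possible reading of this check, sketched on a plain collection of dataset names instead of an open .h5 file. The helper missing_datasets is hypothetical, and whether the real function requires every requested encoding to be present is an assumption.

```python
def missing_datasets(available_keys, encodings):
    """List required dataset names that are absent (hypothetical helper, not easypheno API)."""
    # sample_ids and snp_ids are always expected; X_{enc} once per needed encoding.
    required = ["sample_ids", "snp_ids"] + [f"X_{enc}" for enc in encodings]
    return [name for name in required if name not in available_keys]


keys = {"sample_ids", "snp_ids", "X_012"}
print(missing_datasets(keys, ["012"]))         # []
print(missing_datasets(keys, ["012", "raw"]))  # ['X_raw']
```

With h5py, the top-level names of an opened file could be passed as available_keys, since an h5py.File supports membership tests like a mapping.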
- easypheno.preprocess.raw_data_functions.check_index_file(data_dir, genotype_matrix_name, phenotype_matrix_name, phenotype)
Check if index file is available and if the datasets ‘y’, ‘matched_sample_ids’, ‘X_index’, ‘y_index’ and ‘ma_frequency’ exist.
- Parameters
data_dir (pathlib.Path) – data directory where the phenotype and genotype matrix are stored
genotype_matrix_name (str) – name of the genotype matrix including datatype ending
phenotype_matrix_name (str) – name of the phenotype matrix including datatype ending
phenotype (str) – name of the phenotype to predict
- Returns
bool reflecting check result
- Return type
bool
- easypheno.preprocess.raw_data_functions.save_all_data_files(data_dir, genotype_matrix_name, phenotype_matrix_name, phenotype, models, user_encoding, maf_percentage, datasplit, n_outerfolds, n_innerfolds, test_set_size_percentage, val_set_size_percentage)
Prepare and save all required data files:
genotype matrix in unified format as .h5 file,
phenotype matrix in unified format as .csv file,
file containing maf filter and data split indices as .h5
- Parameters
data_dir (pathlib.Path) – data directory where the phenotype and genotype matrix are stored
genotype_matrix_name (str) – name of the genotype matrix including datatype ending
phenotype_matrix_name (str) – name of the phenotype matrix including datatype ending
phenotype (str) – name of the phenotype to predict
models – models to consider
user_encoding (str) – encoding specified by the user
maf_percentage (int) – threshold for MAF filter as percentage value
datasplit (str) – datasplit to use. Options are: nested-cv, cv-test, train-val-test
n_outerfolds (int) – number of outerfolds relevant for nested-cv
n_innerfolds (int) – number of folds relevant for nested-cv and cv-test
test_set_size_percentage (int) – size of the test set relevant for cv-test and train-val-test
val_set_size_percentage (int) – size of the validation set relevant for train-val-test
- easypheno.preprocess.raw_data_functions.check_transform_format_genotype_matrix(data_dir, genotype_matrix_name, models, user_encoding, save_h5=True)
Check the format of the specified genotype matrix.
Unified genotype matrix will be saved in subdirectory data and named NAME_OF_GENOTYPE_MATRIX.h5
Unified format of the .h5 file of the genotype matrix required for the further processes:
mandatory:
sample_ids: vector with sample names of genotype matrix,
snp_ids: vector with SNP identifiers of genotype matrix,
X_{enc}: (samples x SNPs)-genotype matrix in enc encoding, where enc might refer to:
‘012’: additive (number of minor alleles)
‘raw’: raw (alleles)
optional: genotype in additional encodings
Accepts .h5, .hdf5, .h5py, .csv, PLINK binary and PLINK files. .h5, .hdf5 and .h5py files must satisfy the unified format. If the genotype matrix contains constant SNPs, those will be removed and a new file will be saved. .csv, PLINK and binary PLINK files are loaded and converted to the required .h5 format.
- Parameters
data_dir (pathlib.Path) – data directory where the phenotype and genotype matrix are stored
genotype_matrix_name (str) – name of the genotype matrix including datatype ending
models – models to consider
user_encoding (str) – encoding specified by the user
save_h5 (bool) – save genotype in unified h5 format if True, default is True
- Returns
genotype matrix (raw encoded if present, 012 encoded otherwise), sample ids and SNP ids
- Return type
(numpy.array, numpy.array, numpy.array)
- easypheno.preprocess.raw_data_functions.check_genotype_csv_file(data_dir, genotype_matrix_name, encodings)
Load .csv genotype file. File must have the following structure: the first column must contain the sample ids and the column names should be the SNP ids. The values should be the genotype matrix either in additive encoding or in raw encoding.
If the name of the first column is ‘MarkerID’, it is assumed that the rows contain the markers and the columns contain the samples, and the genotype matrix will be transposed.
If the csv file contains the genotype in biallelic notation (i.e. ‘AA’, ‘AT’, …), this function generates a genotype matrix in IUPAC notation (i.e. ‘A’, ‘W’, …).
- Parameters
data_dir (pathlib.Path) – data directory where the phenotype and genotype matrix are stored
genotype_matrix_name (str) – name of the genotype matrix including datatype ending
encodings (list) – list of needed encodings
- Returns
sample ids, SNP ids and genotype in additive / raw encoding
- Return type
(numpy.array, numpy.array, numpy.array)
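The conversion from biallelic to IUPAC notation mentioned above can be illustrated with a lookup table of the standard IUPAC nucleotide codes. This is a minimal sketch: the helper biallelic_to_iupac is hypothetical and does not reproduce how easypheno performs the conversion internally.

```python
# Standard IUPAC codes for single bases and two-base ambiguities.
IUPAC_CODES = {
    frozenset("A"): "A", frozenset("C"): "C", frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
}


def biallelic_to_iupac(genotype: str) -> str:
    """Map a two-letter genotype such as 'AA' or 'AT' to its IUPAC code (hypothetical helper)."""
    # A frozenset ignores allele order, so 'AT' and 'TA' map to the same code.
    return IUPAC_CODES[frozenset(genotype.upper())]


print([biallelic_to_iupac(g) for g in ["AA", "AT", "TA", "CG"]])  # ['A', 'W', 'W', 'S']
```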
- easypheno.preprocess.raw_data_functions.check_genotype_binary_plink_file(data_dir, genotype_matrix_name)
Load binary PLINK files. The .bim, .fam and .bed files with the same prefix need to be in the same folder. Computes the additive and raw encoding of the genotype.
- Parameters
data_dir (pathlib.Path) – data directory where the phenotype and genotype matrix are stored
genotype_matrix_name (str) – name of the genotype matrix including datatype ending
- Returns
sample ids, SNP ids and genotype in raw encoding
- Return type
(numpy.array, numpy.array, numpy.array)
- easypheno.preprocess.raw_data_functions.check_genotype_plink_file(data_dir, genotype_matrix_name)
Load PLINK files. The .map and .ped files with the same prefix need to be in the same folder. Accepts GENOTYPENAME.ped and GENOTYPENAME.map as input.
- Parameters
data_dir (pathlib.Path) – data directory where the phenotype and genotype matrix are stored
genotype_matrix_name (str) – name of the genotype matrix including datatype ending
- Returns
sample ids, SNP ids and genotype in raw encoding
- Return type
(numpy.array, numpy.array, numpy.array)
- easypheno.preprocess.raw_data_functions.check_duplicate_samples(sample_ids)
Check if the genotype matrix contains duplicate samples.
- Parameters
sample_ids (numpy.array) – sample ids of genotype matrix
- Returns
True if duplicates are present, False if not
- Return type
bool
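A duplicate check of this kind can be written in a few lines with NumPy. The sketch below assumes sample ids are compared by exact equality; has_duplicate_samples is a hypothetical name, not the library function.

```python
import numpy as np


def has_duplicate_samples(sample_ids) -> bool:
    """Return True if any sample id occurs more than once (hypothetical helper)."""
    sample_ids = np.asarray(sample_ids)
    # np.unique drops repeated ids, so a smaller size means duplicates exist.
    return bool(np.unique(sample_ids).size != sample_ids.size)


print(has_duplicate_samples(["s1", "s2", "s3"]))  # False
print(has_duplicate_samples(["s1", "s2", "s1"]))  # True
```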
- easypheno.preprocess.raw_data_functions.check_genotype_shape(X, sample_ids, snp_ids)
Check if number of samples in sample_ids and genotype matrix match and if number of markers in snp_ids and genotype matrix match.
- Parameters
X (numpy.array) – genotype matrix
sample_ids (numpy.array) – vector containing sample ids of genotype
snp_ids (numpy.array) – vector containing SNP ids of genotype
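A minimal sketch of such a shape check, assuming a (samples x SNPs) layout as in the unified format. The name assert_genotype_shape and the use of ValueError are assumptions; the documentation does not state how a mismatch is reported.

```python
import numpy as np


def assert_genotype_shape(X, sample_ids, snp_ids) -> None:
    """Raise if X is not (number of samples) x (number of SNPs) (hypothetical helper)."""
    X = np.asarray(X)
    expected = (len(sample_ids), len(snp_ids))
    if X.shape != expected:
        raise ValueError(f"genotype matrix has shape {X.shape}, expected {expected}")


# Passes silently: 3 samples and 2 SNPs match a 3 x 2 matrix.
assert_genotype_shape(np.zeros((3, 2)), ["s1", "s2", "s3"], ["snp1", "snp2"])
```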
- easypheno.preprocess.raw_data_functions.create_genotype_h5_file(data_dir, genotype_matrix_name, sample_ids, snp_ids, X)
Save genotype matrix in unified .h5 file.
Structure:
sample_ids
snp_ids
X_raw (or X_012 if X_raw not available)
- Parameters
data_dir (pathlib.Path) – data directory where the phenotype and genotype matrix are stored
genotype_matrix_name (str) – name of the genotype matrix including datatype ending
sample_ids (numpy.array) – array containing sample ids of genotype data
snp_ids (numpy.array) – array containing snp ids of genotype data
X (numpy.array) – matrix containing genotype either in raw or in additive encoding
- easypheno.preprocess.raw_data_functions.check_and_load_phenotype_matrix(data_dir, phenotype_matrix_name, phenotype)
Check and load the specified phenotype matrix. Only accepts .csv, .pheno and .txt files. Sample ids need to be in the first column; the remaining columns should contain phenotypic values with the phenotype name as column name.
- Parameters
data_dir (pathlib.Path) – data directory where the phenotype and genotype matrix are stored
phenotype_matrix_name (str) – name of the phenotype matrix including datatype ending
phenotype (str) – name of the phenotype to predict
- Returns
DataFrame with sample_ids as index and phenotype values as single column without NaN values
- Return type
pandas.DataFrame
- easypheno.preprocess.raw_data_functions.genotype_phenotype_matching(X, X_ids, y)
Match the handed over genotype and phenotype matrix for the phenotype specified by the user, i.e. compare sample ids
- Parameters
X (numpy.array) – genotype matrix in additive encoding
X_ids (numpy.array) – sample ids of genotype matrix
y (pandas.DataFrame) – pd.DataFrame containing sample ids of phenotype as index and phenotype values as single column
- Returns
matched genotype matrix, matched sample ids, index arrays for genotype and phenotype to redo matching
- Return type
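Matching by sample id and keeping index arrays "to redo matching" can be sketched with numpy.intersect1d. For simplicity this example passes the phenotype as two plain arrays instead of a pandas.DataFrame and assumes ids are unique on both sides; match_samples is a hypothetical helper, not the library implementation.

```python
import numpy as np


def match_samples(X, X_ids, y_ids, y_values):
    """Keep only samples present in both genotype and phenotype (hypothetical helper)."""
    # return_indices=True also yields the positions of the common ids in each input.
    matched_ids, X_index, y_index = np.intersect1d(
        np.asarray(X_ids), np.asarray(y_ids), return_indices=True
    )
    return np.asarray(X)[X_index], np.asarray(y_values)[y_index], matched_ids, X_index, y_index


X = np.array([[0, 1], [2, 0], [1, 1]])
X_ids = np.array(["s3", "s1", "s2"])
y_ids = np.array(["s1", "s3", "s4"])
y_values = np.array([10.0, 30.0, 40.0])

X_m, y_m, ids, X_index, y_index = match_samples(X, X_ids, y_ids, y_values)
print(ids.tolist())      # ['s1', 's3']
print(X_index.tolist())  # [1, 0]
print(y_index.tolist())  # [0, 1]
```

Applying X_index and y_index to the unmatched data later, for example via data[index] as get_matched_data describes, reproduces the same alignment without repeating the id comparison.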
- easypheno.preprocess.raw_data_functions.get_matched_data(data, index)
Get elements of data specified in index array
- Parameters
data (numpy.array) – matrix or array
index (numpy.array) – index array
- Returns
data at selected indices
- Return type
numpy.array
- easypheno.preprocess.raw_data_functions.append_index_file(data_dir, genotype_matrix_name, phenotype_matrix_name, phenotype, datasplit, n_outerfolds, n_innerfolds, test_set_size_percentage, val_set_size_percentage, maf_percentage)
Check index file, described in create_index_file(), and append datasets if necessary
- Parameters
data_dir (pathlib.Path) – data directory where the phenotype and genotype matrix are stored
genotype_matrix_name (str) – name of the genotype matrix including datatype ending
phenotype_matrix_name (str) – name of the phenotype matrix including datatype ending
phenotype (str) – name of the phenotype to predict
maf_percentage (int) – threshold for MAF filter as percentage value
datasplit (str) – datasplit to use. Options are: nested-cv, cv-test, train-val-test
n_outerfolds (int) – number of outerfolds relevant for nested-cv
n_innerfolds (int) – number of folds relevant for nested-cv and cv-test
test_set_size_percentage (int) – size of the test set relevant for cv-test and train-val-test
val_set_size_percentage (int) – size of the validation set relevant for train-val-test
- easypheno.preprocess.raw_data_functions.create_index_file(data_dir, genotype_matrix_name, phenotype_matrix_name, phenotype, datasplit, n_outerfolds, n_innerfolds, test_set_size_percentage, val_set_size_percentage, maf_percentage, X, y, sample_ids, X_index, y_index)
Create the .h5 index file containing the maf filters and data splits for the combination of genotype matrix, phenotype matrix and phenotype. It will be created using standard values in addition to the user inputs for the maf filters and data splits.
Unified format of .h5 file containing the maf filters and data splits:
'matched_data': {
    'y': matched phenotypic values,
    'matched_sample_ids': sample ids of matched genotype/phenotype,
    'X_index': indices of genotype matrix to redo matching,
    'y_index': indices of phenotype vector to redo matching,
    'ma_frequency': minor allele frequency of each SNP of genotype file to create new MAF filters
}
'maf_filter': {
    'maf_{maf_percentage}': indices of SNPs to delete  # (with MAF < maf_percentage),
    ...
}
'datasplits': {
    'nested_cv': {
        '#outerfolds-#innerfolds': {
            'outerfold_0': {
                'innerfold_0': {'train': indices_train, 'val': indices_val},
                ...
                'innerfold_n': {'train': indices_train, 'val': indices_val},
                'test': test_indices
            },
            ...
            'outerfold_m': {
                'innerfold_0': {'train': indices_train, 'val': indices_val},
                ...
                'innerfold_n': {'train': indices_train, 'val': indices_val},
                'test': test_indices
            }
        },
        ...
    }
    'cv-test': {
        '#folds-test_percentage': {
            'outerfold_0': {
                'innerfold_0': {'train': indices_train, 'val': indices_val},
                ...
                'innerfold_n': {'train': indices_train, 'val': indices_val},
                'test': test_indices
            }
        },
        ...
    }
    'train-val-test': {
        'train_percentage-val_percentage-test_percentage': {
            'outerfold_0': {
                'innerfold_0': {'train': indices_train, 'val': indices_val},
                'test': test_indices
            }
        },
        ...
    }
}
Standard values for the maf filters and data splits:
maf thresholds: 1, 3, 5
folds (inner-/outerfolds for ‘nested-cv’ and folds for ‘cv-test’): 5
test percentage (for ‘cv-test’ and ‘train-val-test’): 20
val percentage (for ‘train-val-test’): 20
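How the standard maf thresholds might be combined with a user-supplied value can be sketched as follows. The helper maf_filter_keys is hypothetical, and the treatment of special values (for instance a threshold of 0) by the real function is not documented here, so it is not modelled.

```python
STANDARD_MAF_THRESHOLDS = [1, 3, 5]  # standard values listed above


def maf_filter_keys(user_maf_percentage: int) -> list:
    """Build 'maf_{maf_percentage}' dataset names for standard plus user thresholds (hypothetical)."""
    # A set avoids creating the same filter twice when the user picks a standard value.
    thresholds = sorted(set(STANDARD_MAF_THRESHOLDS) | {user_maf_percentage})
    return [f"maf_{t}" for t in thresholds]


print(maf_filter_keys(5))   # ['maf_1', 'maf_3', 'maf_5']
print(maf_filter_keys(10))  # ['maf_1', 'maf_3', 'maf_5', 'maf_10']
```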
- Parameters
data_dir (pathlib.Path) – data directory where the phenotype and genotype matrix are stored
genotype_matrix_name (str) – name of the genotype matrix including datatype ending
phenotype_matrix_name (str) – name of the phenotype matrix including datatype ending
phenotype (str) – name of the phenotype to predict
maf_percentage (int) – threshold for MAF filter as percentage value
datasplit (str) – datasplit to use. Options are: nested-cv, cv-test, train-val-test
n_outerfolds (int) – number of outerfolds relevant for nested-cv
n_innerfolds (int) – number of folds relevant for nested-cv and cv-test
test_set_size_percentage (int) – size of the test set relevant for cv-test and train-val-test
val_set_size_percentage (int) – size of the validation set relevant for train-val-test
X (numpy.array) – genotype in additive encoding to create ma-frequencies
y (numpy.array) – matched phenotype values
sample_ids (numpy.array) – matched sample ids of genotype/phenotype
X_index (numpy.array) – index file of genotype to redo matching
y_index (numpy.array) – index file of phenotype to redo matching
- easypheno.preprocess.raw_data_functions.filter_non_informative_snps(X)
Remove non-informative SNPs, i.e. SNPs that are constant
- Parameters
X (numpy.array) – genotype matrix in raw or additive encoding
- Returns
filtered genotype matrix and filter-vector
- Return type
(numpy.array, numpy.array)
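Removing constant SNPs amounts to dropping every column whose entries all equal the first row's entry. A minimal NumPy sketch, assuming a (samples x SNPs) matrix; drop_constant_snps is a hypothetical name, and returning a boolean mask as the "filter-vector" is an assumption.

```python
import numpy as np


def drop_constant_snps(X):
    """Return X without constant columns and the boolean keep-mask (hypothetical helper)."""
    X = np.asarray(X)
    # A column is informative if at least one entry differs from the first row.
    keep = np.any(X != X[0, :], axis=0)
    return X[:, keep], keep


X = np.array([[0, 1, 2],
              [0, 2, 2],
              [0, 0, 2]])
X_filtered, keep = drop_constant_snps(X)
print(keep.tolist())        # [False, True, False]
print(X_filtered.tolist())  # [[1], [2], [0]]
```

The comparison works for raw (string) encodings as well, since it only tests equality.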
- easypheno.preprocess.raw_data_functions.get_minor_allele_freq(X)
Compute minor allele frequencies of genotype matrix
- Parameters
X (numpy.array) – genotype matrix in additive encoding
- Returns
array with frequencies
- Return type
numpy.array
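For an additive (0/1/2) encoding of diploid samples, the frequency of the counted allele at a SNP is the column sum divided by twice the number of samples. The sketch below additionally folds the value so it never exceeds 0.5; both the diploid assumption and the folding step are choices made for this example, and minor_allele_freq is a hypothetical name.

```python
import numpy as np


def minor_allele_freq(X):
    """Per-SNP minor allele frequency from a 0/1/2 matrix (hypothetical helper, diploid assumed)."""
    X = np.asarray(X, dtype=float)
    # Each sample carries two alleles, hence the factor 2 in the denominator.
    allele_freq = X.sum(axis=0) / (2 * X.shape[0])
    # Fold so that the rarer of the two alleles is reported.
    return np.minimum(allele_freq, 1 - allele_freq)


X = np.array([[0, 1, 2],
              [0, 1, 2],
              [1, 0, 2],
              [0, 0, 2]])
print(minor_allele_freq(X).tolist())  # [0.125, 0.25, 0.0]
```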
- easypheno.preprocess.raw_data_functions.create_maf_filter(maf, freq)
Create minor allele frequency filter
- Parameters
maf (int) – maf threshold as percentage value
freq (numpy.array) – array containing minor allele frequencies as decimal value
- Returns
array containing indices of SNPs with MAF smaller than specified threshold, i.e. SNPs to delete
- Return type
numpy.array
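Since the threshold is given as a percentage and the frequencies as decimals, the threshold has to be divided by 100 before comparing. A minimal sketch with the hypothetical name maf_filter_indices:

```python
import numpy as np


def maf_filter_indices(maf_percentage, freq):
    """Indices of SNPs whose MAF lies below the percentage threshold (hypothetical helper)."""
    # Convert the percentage (e.g. 5) to a decimal fraction (0.05) before comparing.
    return np.where(np.asarray(freq) < maf_percentage / 100)[0]


freq = np.array([0.125, 0.005, 0.04, 0.0])
print(maf_filter_indices(1, freq).tolist())  # [1, 3]
print(maf_filter_indices(5, freq).tolist())  # [1, 2, 3]
```

The returned indices mark SNPs to delete, so applying the filter to a genotype matrix could look like np.delete(X, indices, axis=1).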
- easypheno.preprocess.raw_data_functions.check_datasplit_user_input(user_datasplit, user_n_outerfolds, user_n_innerfolds, user_test_set_size_percentage, user_val_set_size_percentage, datasplit, param_to_check)
Check if user input of data split parameters differs from standard values. If it does, add input to list of parameters
- Parameters
user_datasplit (str) – datasplit specified by the user
user_n_outerfolds (int) – number of outerfolds relevant for nested-cv specified by the user
user_n_innerfolds (int) – number of folds relevant for nested-cv and cv-test specified by the user
user_test_set_size_percentage (int) – size of the test set relevant for cv-test and train-val-test specified by the user
user_val_set_size_percentage (int) – size of the validation set relevant for train-val-test specified by the user
datasplit (str) – type of data split
param_to_check (list) – standard parameters to compare to
- Returns
adapted list of parameters
- Return type
list
- easypheno.preprocess.raw_data_functions.check_train_test_splits(y, datasplit, datasplit_params)
Create stratified train-test splits. Continuous values will be grouped into bins and stratified according to those.
Datasplit parameters:
nested-cv: [n_outerfolds, n_innerfolds]
cv-test: [n_innerfolds, test_set_size_percentage]
train-val-test: [val_set_size_percentage, train_set_size_percentage]
- easypheno.preprocess.raw_data_functions.make_bins(y, datasplit, datasplit_params)
Create bins of continuous values for stratification.
Datasplit parameters:
nested-cv: [n_outerfolds, n_innerfolds]
cv-test: [n_innerfolds, test_set_size_percentage]
train-val-test: [val_set_size_percentage, train_set_size_percentage]
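Binning continuous targets for stratification can be done in several ways. The sketch below uses equal-frequency (quantile) bins; how make_bins derives the number of bins from datasplit_params is not documented here, so n_bins is passed explicitly, and bin_continuous is a hypothetical helper.

```python
import numpy as np


def bin_continuous(y, n_bins):
    """Assign each value to one of n_bins quantile bins, labelled 0..n_bins-1 (hypothetical)."""
    y = np.asarray(y, dtype=float)
    # Interior quantile edges only; the outermost bins are open-ended.
    edges = np.quantile(y, np.linspace(0, 1, n_bins + 1)[1:-1])
    return np.digitize(y, edges)


y = np.arange(10.0)
bins = bin_continuous(y, n_bins=2)
print(bins.tolist())  # [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
```

The resulting integer labels can then serve as class labels for any stratified splitting routine.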
- easypheno.preprocess.raw_data_functions.make_nested_cv(y, outerfolds, innerfolds)
Create index dictionary for stratified nested cross validation with the following structure:
{
    'outerfold_0_test': test_indices,
    'outerfold_0': {
        'fold_0_train': innerfold_0_train_indices,
        'fold_0_test': innerfold_0_test_indices,
        ...
        'fold_n_train': innerfold_n_train_indices,
        'fold_n_test': innerfold_n_test_indices
    },
    ...
    'outerfold_m_test': test_indices,
    'outerfold_m': {
        'fold_0_train': innerfold_0_train_indices,
        'fold_0_test': innerfold_0_test_indices,
        ...
        'fold_n_train': innerfold_n_train_indices,
        'fold_n_test': innerfold_n_test_indices
    }
}
- easypheno.preprocess.raw_data_functions.make_stratified_cv(x, y, split_number)
Create index dictionary for stratified cross-validation with following structure:
{
    'fold_0_train': fold_0_train_indices,
    'fold_0_test': fold_0_test_indices,
    ...
    'fold_n_train': fold_n_train_indices,
    'fold_n_test': fold_n_test_indices
}
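An index dictionary with this key layout can be produced with a simple stratified fold assignment. The sketch below uses only NumPy and deals the shuffled members of each class round-robin into folds; it is not the library's implementation (scikit-learn's StratifiedKFold would be the usual tool for this), and stratified_cv_indices and its seed argument are hypothetical.

```python
import numpy as np


def stratified_cv_indices(y_binned, n_folds, seed=42):
    """Build {'fold_i_train': ..., 'fold_i_test': ...} with class-balanced folds (hypothetical)."""
    y_binned = np.asarray(y_binned)
    rng = np.random.default_rng(seed)
    fold_of_sample = np.empty(y_binned.size, dtype=int)
    for label in np.unique(y_binned):
        # Shuffle the members of one class, then deal them round-robin into folds.
        members = rng.permutation(np.flatnonzero(y_binned == label))
        fold_of_sample[members] = np.arange(members.size) % n_folds
    index_dict = {}
    for fold in range(n_folds):
        index_dict[f"fold_{fold}_train"] = np.flatnonzero(fold_of_sample != fold)
        index_dict[f"fold_{fold}_test"] = np.flatnonzero(fold_of_sample == fold)
    return index_dict


y_binned = np.array([0] * 6 + [1] * 4)
folds = stratified_cv_indices(y_binned, n_folds=2)
for name, idx in folds.items():
    print(name, np.bincount(y_binned[idx]).tolist())
```

Each test fold in this example holds three samples of class 0 and two of class 1, mirroring the 6:4 class ratio of the full data.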
- easypheno.preprocess.raw_data_functions.make_train_test_split(y, test_size, val_size=None, val=False, random=42)
Create index arrays for stratified train-test or train-val-test splits.
- Parameters
y (numpy.array) – target values grouped in bins for stratification
test_size (int) – size of test set as percentage value
val_size (int) – size of validation set as percentage value, default is None
val (bool) – if True, function returns validation set in addition to train and test set, default is False
random (int) – controls shuffling of data, default is 42
- Returns
either train, val and test index arrays or train and test index arrays and corresponding binned target values
- Return type
(numpy.array, numpy.array, numpy.array)