easypheno.simulate.synthetic_phenotypes

Module Contents

Functions

filter_duplicates(X, snp_ids)

Remove duplicate SNPs, i.e. SNPs that are identical across all samples and therefore add no information.

get_simulation(X, sample_ids, snp_ids, number_of_samples, number_causal_snps, explained_variance, maf, heritability, seed, number_background_snps, distribution, shape)

Simulate phenotypes based on (real) genotypes in an additive setting with normally or gamma distributed noise and normally distributed effect sizes of causal SNPs.

check_sim_id(sim_dir)

Check which ids were already used for simulations.

save_sim_overview(save_dir, sim_names, number_of_samples, number_causal_snps, explained_variance, maf, heritability, seeds, number_background_snps, distribution, shape)

Save an overview file for all simulations; append new simulations if the file already exists.

save_simulation(save_dir, genotype_matrix_name, number_of_sim, X, sample_ids, snp_ids, number_of_samples, number_causal_snps, explained_variance, maf, heritability, seed, number_background_snps, distribution, shape)

Set all variables and generate one or more simulations with the same configuration.

easypheno.simulate.synthetic_phenotypes.filter_duplicates(X, snp_ids)

Remove duplicate SNPs, i.e. SNPs that are identical across all samples and therefore add no information.

Parameters
  • X (numpy.array) – genotype matrix to be filtered

  • snp_ids (numpy.array) – vector containing corresponding SNP ids

Returns

filtered genotype matrix and filtered SNP ids

Return type

(numpy.array, numpy.array)
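The following is a minimal sketch of how such a duplicate filter could work; it is not the library's implementation. It assumes that samples are stored in rows and SNPs in columns, and the helper name `drop_duplicate_snps` is hypothetical.

```python
import numpy as np


def drop_duplicate_snps(X, snp_ids):
    """Keep the first occurrence of every distinct SNP column (illustrative only)."""
    # With axis=1, np.unique treats each column (SNP) as one item;
    # return_index yields the position of the first occurrence of each unique column.
    _, first_idx = np.unique(X, axis=1, return_index=True)
    keep = np.sort(first_idx)  # restore the original SNP order
    return X[:, keep], snp_ids[keep]


# Assumed layout: rows are samples, columns are SNPs.
X = np.array([[0, 0, 2, 1],
              [1, 1, 0, 1],
              [2, 2, 1, 0]])
snp_ids = np.array(["snp_a", "snp_b", "snp_c", "snp_d"])

# snp_b is identical to snp_a for all samples and is dropped.
X_filtered, ids_filtered = drop_duplicate_snps(X, snp_ids)
```

Sorting the returned indices keeps the remaining SNPs in their original order, so the filtered matrix and the filtered id vector stay aligned.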

easypheno.simulate.synthetic_phenotypes.get_simulation(X, sample_ids, snp_ids, number_of_samples, number_causal_snps, explained_variance, maf, heritability, seed, number_background_snps, distribution, shape)

Simulate phenotypes based on (real) genotypes in an additive setting with normally or gamma distributed noise and normally distributed effect sizes of causal SNPs.

Parameters
  • X (numpy.array) – genotype matrix in additive encoding

  • sample_ids (numpy.array) – sample ids of genotype matrix

  • snp_ids (numpy.array) – SNP ids of genotype matrix

  • number_of_samples (int) – number of samples of synthetic phenotype

  • number_causal_snps (int) – number of SNPs used as causal markers in simulation

  • explained_variance (int) – percentage value of how much of the total variance the causal SNPs should explain

  • maf (int) – percentage value used for minor allele frequency (MAF) filtering of genotype matrix

  • heritability (int) – percentage value of how much of the variance should be explained by polygenic background

  • seed (int) – seed for random sampling

  • number_background_snps (int) – number of randomly selected SNPs to simulate the polygenic background

  • distribution (str) – probability distribution used to draw random noise; can be ‘normal’ or ‘gamma’

  • shape (float) – shape of the gamma distribution; only needed if distribution is ‘gamma’

Returns

simulated phenotype with corresponding sample ids, SNP ids of causal SNPs, SNP ids of background SNPs, effect sizes of background SNPs, effect sizes of causal SNPs, used explained variance for each causal SNP

Return type

(numpy.array, numpy.array, numpy.array, numpy.array, numpy.array, numpy.array, numpy.array)
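To make the additive setting more concrete, here is a simplified sketch. It is not easyPheno's code and returns fewer values than get_simulation. The function name `simulate_additive_phenotype`, the way the variance is split (causal share = explained_variance, background share = heritability, remainder = noise) and the rescaling of each component are all assumptions made for illustration.

```python
import numpy as np


def simulate_additive_phenotype(X, n_causal, n_background, explained_variance,
                                heritability, seed, distribution="normal", shape=1.0):
    """Toy additive simulation; the variance split used here is an assumption."""
    rng = np.random.default_rng(seed)
    n_samples, n_snps = X.shape

    # Draw disjoint sets of causal and background SNPs.
    chosen = rng.choice(n_snps, size=n_causal + n_background, replace=False)
    causal_idx, background_idx = chosen[:n_causal], chosen[n_causal:]

    def component(idx, share):
        # Normally distributed effect sizes, rescaled so that the centered
        # component has sample variance equal to `share`.
        betas = rng.normal(size=idx.size)
        values = X[:, idx] @ betas
        factor = np.sqrt(share / values.var())
        return factor * (values - values.mean()), factor * betas

    causal_part, betas_causal = component(causal_idx, explained_variance / 100)
    background_part, betas_background = component(background_idx, heritability / 100)

    # Noise is drawn from the requested distribution, then centered and rescaled.
    if distribution == "normal":
        noise = rng.normal(size=n_samples)
    elif distribution == "gamma":
        noise = rng.gamma(shape, size=n_samples)
    else:
        raise ValueError("distribution must be 'normal' or 'gamma'")
    noise_share = 1 - (explained_variance + heritability) / 100
    noise = np.sqrt(noise_share / noise.var()) * (noise - noise.mean())

    phenotype = causal_part + background_part + noise
    return phenotype, causal_idx, background_idx, betas_causal, betas_background


# Random genotypes in additive encoding (0, 1, 2); rows are samples (assumed layout).
X_demo = np.random.default_rng(0).integers(0, 3, size=(200, 50))
y, causal, background, b_causal, b_background = simulate_additive_phenotype(
    X_demo, n_causal=3, n_background=10,
    explained_variance=30, heritability=40, seed=42)
```

Because the three components are only approximately uncorrelated in a finite sample, the total phenotype variance is close to, but not exactly, 1 in this sketch.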

easypheno.simulate.synthetic_phenotypes.check_sim_id(sim_dir)

Check which ids were already used for simulations.

Parameters

sim_dir (pathlib.Path) – directory containing simulations to check

Returns

last simulation number + 1

Return type

int
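One way to picture this check is sketched below. It assumes that the directory contains files named ‘Simulation_{sim_id}.csv’, where the sim_id is either a single number or a range such as ‘10-15’, and that an empty directory yields 1. The helper name `next_simulation_id` and the scanning logic are hypothetical, not the library's code.

```python
import re
import tempfile
from pathlib import Path


def next_simulation_id(sim_dir: Path) -> int:
    """Return the last used simulation number + 1 (illustrative only)."""
    pattern = re.compile(r"Simulation_(\d+)(?:-(\d+))?\.csv$")
    last = 0
    for path in sim_dir.glob("Simulation_*.csv"):
        match = pattern.match(path.name)
        if match:
            # For a range such as '10-15', the second number is the last one used.
            last = max(last, int(match.group(2) or match.group(1)))
    return last + 1


# Demo in a temporary directory holding simulation 1 and simulations 2 to 4.
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp)
    for name in ("Simulation_1.csv", "Simulation_2-4.csv"):
        (demo_dir / name).touch()
    next_id = next_simulation_id(demo_dir)
```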

easypheno.simulate.synthetic_phenotypes.save_sim_overview(save_dir, sim_names, number_of_samples, number_causal_snps, explained_variance, maf, heritability, seeds, number_background_snps, distribution, shape)

Save an overview file for all simulations; append new simulations if the file already exists.

Parameters
  • save_dir (pathlib.Path) – directory to save overview file to

  • sim_names (list) – list containing simulation name for each simulation

  • number_of_samples (list) – list containing number of samples for each simulation

  • number_causal_snps (list) – list containing number of causal SNPs for each simulation

  • explained_variance (list) – list containing total explained variance of causal SNPs for each simulation

  • maf (list) – list containing used minor allele frequency (MAF) threshold for each simulation

  • heritability (list) – list containing used heritability for each simulation

  • seeds (list) – list containing used seed for each simulation

  • number_background_snps (list) – list containing number of background SNPs for each simulation

  • distribution (list) – list containing used distribution of random noise for each simulation

  • shape (list) – list containing the shape of the gamma distribution for each simulation, or None where the normal distribution was used
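The create-or-append behaviour can be sketched with the standard library as follows. Only the file name ‘Simulations_Overview.csv’ comes from this documentation; the helper name `append_overview`, the column names and the CSV layout are assumptions.

```python
import csv
import tempfile
from pathlib import Path


def append_overview(save_dir: Path, columns: dict) -> Path:
    """Append one row per simulation; `columns` maps column name to a list of values."""
    overview = save_dir / "Simulations_Overview.csv"
    is_new = not overview.exists()
    with overview.open("a", newline="") as handle:
        writer = csv.writer(handle)
        if is_new:
            # Write the header only when the file is created.
            writer.writerow(columns.keys())
        # zip(*...) turns the per-column lists into per-simulation rows.
        writer.writerows(zip(*columns.values()))
    return overview


with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp)
    append_overview(out_dir, {"simulation": ["sim_1"], "seed": [42],
                              "distribution": ["normal"]})
    path = append_overview(out_dir, {"simulation": ["sim_2"], "seed": [43],
                                     "distribution": ["gamma"]})
    overview_lines = path.read_text().splitlines()
```

The second call finds the existing file and only adds a data row, so the header appears once.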

easypheno.simulate.synthetic_phenotypes.save_simulation(save_dir, genotype_matrix_name, number_of_sim, X, sample_ids, snp_ids, number_of_samples, number_causal_snps, explained_variance, maf, heritability, seed, number_background_snps, distribution, shape)

Set all variables and generate one or more simulations with the same configuration. The overview file and the simulated phenotypes are saved to the subfolder ‘genotype_matrix_name’ in save_dir as ‘Simulations_Overview.csv’ and ‘Simulation_{sim_id}.csv’. The SNP ids of the background SNPs, the effect sizes (betas) of the background SNPs, and the configuration info containing the SNP ids and betas of the causal SNPs are saved to the subfolder ‘sim_configs’ within ‘genotype_matrix_name’ as ‘background_{sim_id}.csv’, ‘betas_background_{sim_id}.csv’ and ‘simulation_config_{sim_id}.csv’. If only one phenotype is simulated, the sim_id consists of a single number. If several phenotypes are simulated with the same configuration, the sim_id is the number of the first simulation, a ‘-’, and the number of the last simulation, e.g. ‘10-15’.

Parameters
  • save_dir (str) – directory to save simulations to

  • genotype_matrix_name (str) – name of genotype matrix to be used for simulations, needed to create subfolder in save_dir

  • number_of_sim (int) – number of simulations to create with same configurations

  • X (numpy.array) – genotype matrix in additive encoding

  • sample_ids (numpy.array) – sample ids of genotype matrix

  • snp_ids (numpy.array) – SNP ids of genotype matrix

  • number_of_samples (int) – number of samples of synthetic phenotype

  • number_causal_snps (int) – number of SNPs used as causal markers in simulation

  • explained_variance (int) – percentage value of how much of the total variance the causal SNPs should explain

  • maf (int) – percentage value used for minor allele frequency (MAF) filtering of genotype matrix

  • heritability (int) – percentage value of how much of the variance should be explained by polygenic background

  • seed (int) – seed for random sampling

  • number_background_snps (int) – number of randomly selected SNPs to simulate the polygenic background

  • distribution (str) – probability distribution used to draw random noise; can be ‘normal’ or ‘gamma’

  • shape (float) – shape of the gamma distribution; only needed if distribution is ‘gamma’
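The sim_id rule and the output layout described for save_simulation can be illustrated with two small helpers. The names `build_sim_id` and `output_paths` are hypothetical, and the code only mirrors the file names stated in the description; it does not call or reproduce the library.

```python
from pathlib import Path


def build_sim_id(first_number: int, number_of_sim: int) -> str:
    """Single number for one simulation, otherwise 'first-last'."""
    if number_of_sim == 1:
        return str(first_number)
    return f"{first_number}-{first_number + number_of_sim - 1}"


def output_paths(save_dir: str, genotype_matrix_name: str, sim_id: str) -> dict:
    """Collect the file locations named in the description (illustrative only)."""
    base = Path(save_dir) / genotype_matrix_name
    configs = base / "sim_configs"
    return {
        "overview": base / "Simulations_Overview.csv",
        "phenotype": base / f"Simulation_{sim_id}.csv",
        "background": configs / f"background_{sim_id}.csv",
        "betas_background": configs / f"betas_background_{sim_id}.csv",
        "config": configs / f"simulation_config_{sim_id}.csv",
    }


# Six simulations starting at number 10 give the id '10-15'.
sim_id = build_sim_id(10, 6)
# 'sims' and 'my_matrix' are placeholder names.
paths = output_paths("sims", "my_matrix", sim_id)
```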