HowTo: Summarize prediction results with easyPheno
In the subpackage postprocess, we provide functions to analyze optimization results. Each of these functions has a corresponding script (prefix run_) that can be run with our Docker workflow, which is the focus of this tutorial. If you want to use the functions directly (e.g. with the pip-installed package), check the scripts to see which functions they call.
Optimization results in easyPheno are saved using the following directory structure:
With run_summarize_results.py, you can accumulate all optimization results for a genotype matrix:
python3 -m easypheno.postprocess.run_summarize_results -rd path_at_name_genotype_matrix_level
This leads to the creation of the summary files described in summarize_results_per_phenotype_and_datasplit().
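To check which summary files were written, a small helper like the following can list them. This is only a sketch: the function name find_summary_files is hypothetical and not part of easyPheno, and searching the results directory recursively is an assumption. The file name pattern is the one used for the summary over all phenotypes in this tutorial.

```python
from pathlib import Path


def find_summary_files(results_dir):
    """Return all summary CSV files found below results_dir, sorted by path.

    Hypothetical helper for illustration, not part of easyPheno.
    """
    # File name pattern as mentioned in this tutorial; the recursive search
    # is an assumption of this sketch.
    pattern = "Results_summary_all_phenotypes*.csv"
    return sorted(Path(results_dir).rglob(pattern))


# Example usage (replace the placeholder with your own results directory):
# for csv_path in find_summary_files("path_at_name_genotype_matrix_level"):
#     print(csv_path)
```

Each path returned this way can be passed to the plotting script via its -rsp argument.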
For the Results_summary_all_phenotypes*DATASPLIT-PATTERN*.csv file created by the command above, we provide scripts to visualize the results of several prediction models on different phenotypes:
python3 -m easypheno.postprocess.run_plot_results -rsp path_to_Results_summary_all_phenotypes_XX.csv -sd path_to_save_directory
This creates a heatmap plot, which is stored in the specified save directory. Currently, only heatmaps are implemented, but further plot functions can easily be added.
Additional analysis for simulated phenotypes
For simulated phenotypes, we know the ground truth, i.e. which markers (features) influence the phenotypic value and with which effect size. Based on that, we can compare these effect sizes with the feature importances of a model to analyze how well an algorithm captures the relevant features.
To this end, we conduct a statistical as well as visual analysis, which we further describe in the following publication:
A comparison of classical and machine learning-based phenotype prediction methods on simulated data and three plant species
Maura John, Florian Haselbeck, Rupashree Dass, Christoph Malisi, Patrizia Ricca, Christian Dreischer, Sebastian J. Schultheiss and Dominik G. Grimm
Frontiers in Plant Science, 2022 (currently in press)
The files for this analysis can be generated with the following command:
python3 -m easypheno.simulate.run_results_analysis_synthetic_data -rd path_at_name_genotype_matrix_level -simd path_to_simulation_configs -sd path_to_save_directory
Besides .csv-files with statistical information, a scatter plot visualizing feature importances in comparison with effect sizes is created.
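To illustrate the idea of comparing feature importances with effect sizes, the following sketch computes a Spearman rank correlation between the absolute values of both. This is an assumption chosen for illustration and not necessarily the statistic used by easyPheno or in the publication above. The function names and the toy values are hypothetical.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1 to n; tied values receive the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the group while the next value is tied with the current one
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if x or y is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def spearman_abs(effect_sizes, importances):
    """Spearman correlation between absolute effect sizes and importances."""
    ranks_effect = average_ranks([abs(e) for e in effect_sizes])
    ranks_importance = average_ranks([abs(i) for i in importances])
    return pearson(ranks_effect, ranks_importance)


# Toy values, made up for illustration only (one entry per marker)
toy_effect_sizes = [0.0, 0.8, 0.0, -0.5, 0.3]
toy_importances = [0.01, 0.60, 0.02, 0.25, 0.12]
print(round(spearman_abs(toy_effect_sizes, toy_importances), 3))
```

A value close to 1 indicates that markers with larger simulated effects also tend to receive higher importances from the model. Absolute values are used because an effect can be negative, while many importance measures are unsigned.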