We optimised MutationDistiller's HPO weights on a set of known disease mutations from ClinVar (https://www.ncbi.nlm.nih.gov/clinvar/) linked with HPO terms. We obtained all pathogenic ClinVar entries with at least two HPO terms, a total of 188 cases linked with 142 different genes. Please refer to our web site for this test set. We spiked these mutations into the HG00377 exome from the 1000 Genomes Project (http://www.internationalgenome.org/) and submitted them, together with the associated HPO terms, to MutationDistiller. We then iterated through a range of weight combinations (245 in total) for direct, ancestor and descendant matches and compared the results. Where the disease mutation was found, we examined the distribution of the ranks given to the genes containing the disease mutation across all weight combinations. We only considered the first 100 ranks and treated any case beyond that as not found. Genes with exactly the same score were given the same rank.
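The ranking rule described above (shared ranks for identical scores, cut-off at rank 100) can be sketched as follows. This is a minimal illustration and not MutationDistiller's actual code: the function name `rank_of_gene` is hypothetical, we assume that a higher score is better, and we assume competition ranking (tied genes share a rank and the next rank is skipped), which the text does not specify.

```python
from typing import Dict, Optional

RANK_CAP = 100  # ranks beyond this are treated as "not found"


def rank_of_gene(scores: Dict[str, float], gene: str, cap: int = RANK_CAP) -> Optional[int]:
    """Return the tie-aware rank of `gene`, or None if it is absent or ranked beyond `cap`.

    Assumptions (not taken from the source): higher scores are better, and ties
    are resolved by competition ranking ("1, 2, 2, 4").
    """
    if gene not in scores:
        return None
    target = scores[gene]
    # Rank = 1 + number of genes with a strictly higher score,
    # so genes with exactly the same score receive the same rank.
    rank = 1 + sum(1 for s in scores.values() if s > target)
    return rank if rank <= cap else None


# Illustrative scores with placeholder gene names.
example_scores = {"GENE_A": 0.9, "GENE_B": 0.7, "GENE_C": 0.7, "GENE_D": 0.1}
example_ranks = {g: rank_of_gene(example_scores, g) for g in example_scores}
```

Returning `None` for cases beyond the cap keeps "not found" distinct from any numeric rank in later summaries.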
A VCF file containing all the ClinVar variants we used, together with their HPO symptoms, can be found here.
The results for all iterations through 245 different weight combinations for direct HPO matches, ancestor and descendant term matches can be accessed here.
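A sweep over weight combinations of this kind could be organised as in the sketch below. Everything here is an assumption made for illustration: the grid values are hypothetical and are not the 245 combinations used in the study, and `toy_rank_case` is a placeholder that scores genes by a weighted sum of match counts instead of running MutationDistiller.

```python
from itertools import product
from typing import Callable, Dict, List, Optional, Tuple

Weights = Tuple[float, float, float]  # (direct, ancestor, descendant)


def sweep_weights(
    cases: List[dict],
    grid: Dict[str, List[float]],
    rank_case: Callable[[dict, Weights], Optional[int]],
) -> Dict[Weights, List[Optional[int]]]:
    """Record the disease-gene rank of every case under every weight combination."""
    results: Dict[Weights, List[Optional[int]]] = {}
    for weights in product(grid["direct"], grid["ancestor"], grid["descendant"]):
        results[weights] = [rank_case(case, weights) for case in cases]
    return results


def toy_rank_case(case: dict, weights: Weights, cap: int = 100) -> Optional[int]:
    """Toy stand-in for a prioritisation run (NOT MutationDistiller's scoring).

    Each gene is scored as a weighted sum of its (direct, ancestor, descendant)
    HPO match counts; ties share a rank and ranks beyond `cap` become None.
    """
    scores = {
        gene: sum(w * n for w, n in zip(weights, counts))
        for gene, counts in case["matches"].items()
    }
    target = scores[case["disease_gene"]]
    rank = 1 + sum(1 for s in scores.values() if s > target)
    return rank if rank <= cap else None


# Hypothetical grid and a single made-up case with placeholder gene names.
example_grid = {"direct": [1.0, 2.0], "ancestor": [0.5, 1.0], "descendant": [0.5, 1.0]}
example_cases = [
    {"disease_gene": "GENE_A", "matches": {"GENE_A": (1, 0, 0), "GENE_B": (0, 2, 0)}},
]
results = sweep_weights(example_cases, example_grid, toy_rank_case)
```

The result maps each weight triple to one rank per case, which is the shape needed to compare rank distributions across combinations.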
To validate MutationDistiller's HPO-based prioritisations, we compared it to other tools sharing similar properties:
In our test, we included widely used and freely available functional state-of-the-art tools which do not require any software installation or user login, can analyse single-patient VCF files, and offer HPO-based prioritisations. We found three different algorithms fulfilling these criteria: eXtasy, obtained from GitHub (version 2014-02-19), and the PhenIX and HiPhive algorithms incorporated into Exomiser (version exomiser-cli-10.0.1).
For our analyses, we used default settings for all algorithms, as an untrained user would be expected to do. For each of the algorithms, we had to rely on locally installed versions because the online tools were not reliable or fast enough for our purposes. We tested the software on a set of 101 solved patient cases from the Charité Berlin. These cases of rare, early-onset Mendelian disorders were provided by clinicians and researchers working in the Department of Neuropaediatrics and the Institute of Medical Genetics and Human Genetics. We used newly found disease mutations which were not yet included in ClinVar, together with the HPO symptoms assigned to the patient and information on the expected mode of inheritance (if available). The set included a range of disorders and various types of mutations as well as compound heterozygous cases. We spiked the known causative variant for each case into the same 1000G exome VCF file used for the optimisation of MutationDistiller (HG00377). As eXtasy cannot work with all HPO terms, we removed the terms not found in eXtasy's database from the input for this tool, which limited our set for the eXtasy analysis to 88 cases. Moreover, eXtasy accepts at most 10 HPO symptoms per case. In the 7 cases with more than 10 HPO terms, we therefore randomly removed symptoms until only 10 terms remained.
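The preparation of HPO terms for eXtasy can be sketched as follows. This is an illustration under stated assumptions, not the script used in the study: the function name `prepare_terms` is hypothetical, we assume that unsupported terms are removed before the random reduction to 10 terms, and we assume that a case is dropped when none of its terms is supported. The HPO identifiers in the example are placeholders.

```python
import random
from typing import List, Optional, Set

MAX_TERMS = 10  # eXtasy accepts at most 10 HPO terms per case


def prepare_terms(
    terms: List[str],
    supported: Set[str],
    rng: random.Random,
    max_terms: int = MAX_TERMS,
) -> Optional[List[str]]:
    """Filter a case's HPO terms to the supported set and cap them at `max_terms`.

    Returns None when no supported term remains (assumption: such a case
    cannot be analysed and is excluded).
    """
    kept = [t for t in terms if t in supported]
    if not kept:
        return None
    if len(kept) > max_terms:
        # Randomly keep `max_terms` terms, i.e. randomly remove the surplus.
        kept = rng.sample(kept, max_terms)
    return kept


# Placeholder identifiers standing in for the terms known to the tool.
supported_terms = {f"HP:{i:07d}" for i in range(1, 21)}
case_terms = [f"HP:{i:07d}" for i in range(1, 13)] + ["HP:9999999"]
prepared = prepare_terms(case_terms, supported_terms, random.Random(0))
```

Passing an explicit, seeded `random.Random` makes the random removal reproducible.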
We then submitted the resulting VCF files, the HPO identifiers and the mode of inheritance information provided by the clinicians to the different tools. For MutationDistiller, we used the HPO weight settings determined in the optimisation procedure described above. The tools included in this comparison do not provide a score for known pathogenic variants, which is why we decided not to take MutationDistiller's ClinVar score into account at this stage.
As in the optimisation step, we recorded the ranks allocated to the genes containing the index mutation, capping at rank 100.
For the four tools or algorithms, we then compared the distribution of ranks for the index genes.
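One simple way to compare such rank distributions is to count, per tool, how many cases fall within given rank cut-offs. The sketch below assumes that ranks are stored as integers and that cases not found within the first 100 ranks are stored as `None`; the cut-offs (1, 10, 100) and the example ranks are illustrative choices, not values from the study.

```python
from typing import Dict, List, Optional, Sequence


def summarise_ranks(
    ranks: List[Optional[int]],
    cutoffs: Sequence[int] = (1, 10, 100),
) -> Dict[str, int]:
    """Count cases whose index gene is ranked within each cut-off, plus cases not found."""
    found = [r for r in ranks if r is not None]
    summary = {f"top_{c}": sum(1 for r in found if r <= c) for c in cutoffs}
    summary["not_found"] = len(ranks) - len(found)
    return summary


# Made-up ranks for five cases; None marks a case beyond rank 100.
example_summary = summarise_ranks([1, 3, None, 12, 1])
```

Applying the same function to each tool's list of ranks gives directly comparable summaries.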
Exomiser: To assess the prioritisation of Exomiser, we used its Exomiser gene pheno score, which reflects only the phenotypic assessment of the gene and does not include the variant prediction.
eXtasy: For eXtasy, we had to distinguish between cases in which only one HPO term was used for analysis and cases with more than one term. In cases with a single HPO term, we ranked the results by the result score. In cases with several terms, we ranked them by the statistical score provided by the program, because eXtasy outputs a separate result score for each HPO term.
Below, we provide the raw result files for the different tools, listing the disease genes specified by our users (searchgene) and the associated ranks and scores. n/a signifies that the disease gene was not found within the first 100 ranks. To protect our patients' privacy, we are not able to provide HPO terms or disease variants. Please click on the corresponding links to access the raw result files for the different tools: