Machine Learning of Patient Characteristics to Predict Admission Outcomes in the Undiagnosed Diseases Network

Key Points

Question: Can machine learning algorithms reproduce the performance of clinical experts in determining whether to accept patients to the Undiagnosed Diseases Network for extensive genome-scale evaluation?

Findings: This prognostic study developed a machine learning model using 2421 patient applications and evaluated the model through retrospective and prospective validation. The area under the receiver operating characteristic curve obtained for predicting admission outcomes suggested that the admission process for accepted applications may be accelerated by up to 68% using the developed machine learning model.

Meaning: Findings of this study suggest that the use of machine learning assistance to prioritize the evaluation of patients with undiagnosed diseases is feasible and may increase the number of applications processed in a given time frame.


eTable 1. Normalized Term Frequency of Several Semantic Types in the Referral Letters of Accepted and Not-Accepted Applications
The Unified Medical Language System (UMLS) contains a set of broad subject categories, or semantic types, that provide a consistent categorization of all medical terms/concepts represented in the UMLS Metathesaurus. We counted the frequency of a selected subset of these semantic types in the referral letters of Accepted and Not-accepted applications using our training data. The results are reported in eTable 1.
Results are obtained from the training data. SD indicates standard deviation; t indicates the t statistic of a two-tailed t test, with *P < .05 and ***P < .001.

Semantic Type | Accepted, Mean (SD) | Not-accepted, Mean (SD) | t
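As a minimal sketch of this comparison, assuming each referral letter has already been mapped to UMLS concepts with semantic types (eg, via a concept extraction tool); the letters, labels, and values below are hypothetical:

```python
import numpy as np
from scipy.stats import ttest_ind

# Hypothetical input: for each application, the semantic types of the UMLS
# concepts extracted from its referral letter, plus the admission outcome.
letters = [
    (["Disease or Syndrome", "Gene or Genome", "Sign or Symptom"], "Accepted"),
    (["Disease or Syndrome", "Gene or Genome"], "Accepted"),
    (["Sign or Symptom", "Pharmacologic Substance"], "Not accepted"),
    (["Disease or Syndrome", "Sign or Symptom"], "Not accepted"),
]

def normalized_frequency(semantic_types, target):
    """Fraction of a letter's extracted concepts with the target semantic type."""
    return semantic_types.count(target) / len(semantic_types) if semantic_types else 0.0

target = "Gene or Genome"
accepted = [normalized_frequency(st, target) for st, y in letters if y == "Accepted"]
rejected = [normalized_frequency(st, target) for st, y in letters if y == "Not accepted"]

# Two-tailed t test comparing the mean normalized frequency between the two groups.
t_stat, p_value = ttest_ind(accepted, rejected)
print(f"{target}: Accepted {np.mean(accepted):.3f} (SD {np.std(accepted, ddof=1):.3f}) "
      f"vs Not accepted {np.mean(rejected):.3f} (SD {np.std(rejected, ddof=1):.3f}); "
      f"t = {t_stat:.2f}, P = {p_value:.3g}")
```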

eFigure. Comparison of Different Models in Terms of Their Ranking Performance Illustrated by Precision-Recall Curve
We compared the classification models in terms of their ability to rank Accepted applications above Not-accepted ones. The eFigure shows the precision-recall curves for Walley et al. (2018) [16] and our two BERT-based models (for clarity of illustration, we show only the results of these three models). The no-skill line is a horizontal line at precision = 0.5. The results show that the BERT-based models were considerably more precise (ie, better able to correctly rank Accepted applications) at lower recall values.
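A minimal sketch of how such precision-recall curves can be generated with scikit-learn, assuming each model outputs an admission-likelihood score per application; the labels, model names, and scores below are hypothetical:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve, auc

# Hypothetical data: 1 = Accepted, 0 = Not accepted, plus each model's scores.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
model_scores = {
    "BERT model A": np.array([0.91, 0.40, 0.78, 0.65, 0.30, 0.55, 0.88, 0.20]),
    "BERT model B": np.array([0.85, 0.35, 0.80, 0.70, 0.45, 0.50, 0.90, 0.25]),
}

for name, scores in model_scores.items():
    precision, recall, _ = precision_recall_curve(y_true, scores)
    plt.plot(recall, precision, label=f"{name} (AUC = {auc(recall, precision):.2f})")

# No-skill baseline: precision equals the prevalence of the positive class
# (0.5 for a balanced evaluation set, as in the eFigure).
plt.axhline(y_true.mean(), linestyle="--", label="No skill")
plt.xlabel("Recall"); plt.ylabel("Precision"); plt.legend(); plt.show()
```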

eTable 3. Symptom-Level Performance on Prospective Test Instances
eTable 3 lists symptoms for which our best model succeeded (green) or failed (yellow) in predicting admission outcomes on the prospective test instances. %preval indicates the prevalence of a symptom in the training/test data.

eAppendix 1. Process for Assigning Patient Applications to Review Sessions
Given a ranking heuristic, we used the following process to assign applications to review sessions. Each review session i in the UDN dataset had a meeting date t_i and a budget b_i, which indicates the number of applications that could be reviewed at that session. Taking the review sessions in ascending order of their meeting dates t_i, i = 1, 2, ..., and a ranked list of applications in descending order of their admission likelihood, we assigned the top b_i applications that were submitted before date t_i to the ith review session (and set their review/decision dates to t_i). These top b_i applications were then removed from the ranked list, and this process was repeated for subsequent review sessions until all applications were assigned. This process ensured that applications were assigned to their closest review sessions while respecting the budget/time constraint of each session and the order of applications in the given ranked list. Once all applications were assigned review dates, we computed the average processing time as the mean difference between review and submission dates,

(1/N) ∑_{j=1}^{N} (r_j − s_j),

where s_j and r_j are the submission and assigned review dates of the jth application and N is the total number of applications. Note that information about the meeting date and budget of each session was provided by the UDN.
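A minimal sketch of this assignment procedure follows; the data structures are hypothetical, and dates are represented as integer day offsets for simplicity:

```python
from dataclasses import dataclass

@dataclass
class Session:
    meeting_date: int   # t_i, as a day offset
    budget: int         # b_i, number of applications reviewed at this session

@dataclass
class Application:
    submit_date: int    # s_j, as a day offset
    score: float        # predicted admission likelihood

def assign_and_average(applications, sessions):
    """Assign ranked applications to review sessions and return the average
    processing time, (1/N) * sum of (r_j - s_j) over assigned applications."""
    # Rank application indices by descending predicted admission likelihood.
    ranked = sorted(range(len(applications)),
                    key=lambda j: applications[j].score, reverse=True)
    review_date = {}  # application index -> assigned review date r_j
    # Visit sessions in ascending order of meeting date t_i.
    for s in sorted(sessions, key=lambda x: x.meeting_date):
        eligible = [j for j in ranked
                    if j not in review_date
                    and applications[j].submit_date < s.meeting_date]
        for j in eligible[: s.budget]:  # top b_i eligible applications
            review_date[j] = s.meeting_date  # set r_j := t_i
    diffs = [review_date[j] - applications[j].submit_date for j in review_date]
    return sum(diffs) / len(diffs)
```

Different ranking heuristics can then be compared by calling assign_and_average with the same sessions and submission dates but different score assignments.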

eTable 4. Average Processing Time Across Different Review "Periods" and Number of Applications Reviewed in Each Review Session ("Budgets")
Let d be the review frequency (ie, review sessions occur every d days) and a be the number of applications reviewed in each session (the budget of each session). The combinations of d and a for which the ranking generated by our best classifier and the Accept-First ranking model lead to the same or comparable (ie, a difference of less than 1 week) average processing time are highlighted in eTable 4. For example, for biweekly review sessions (d = 14), at least a = 26 applications should be reviewed at each session for our best classifier to yield a processing time comparable to that of the Accept-First model.
We looked for combinations in which the ranking generated by the classifier and the perfect (Accept-First) ranking lead to the same or comparable average processing time; the highlighted rows show the periods and budgets for which the difference is at most 1 week.
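A sketch of the grid search implied here, reusing the Session, Application, and assign_and_average definitions from the sketch in eAppendix 1; the search ranges, the 1-year horizon, and the Accept-First construction from binary labels are assumptions:

```python
import itertools

def sessions_for(d, a, horizon_days=365):
    """Build review sessions every d days over a hypothetical 1-year horizon,
    each with budget a."""
    return [Session(t, a) for t in range(d, horizon_days + 1, d)]

def find_comparable_settings(applications, classifier_scores, labels,
                             periods=(7, 14, 21, 28), budgets=range(5, 51)):
    # Accept-First: a perfect ranking that scores Accepted applications (label 1)
    # above all Not-accepted ones (label 0).
    accept_first = [Application(app.submit_date, float(y))
                    for app, y in zip(applications, labels)]
    classifier = [Application(app.submit_date, s)
                  for app, s in zip(applications, classifier_scores)]
    comparable = []
    for d, a in itertools.product(periods, budgets):
        sess = sessions_for(d, a)
        gap = abs(assign_and_average(classifier, sess)
                  - assign_and_average(accept_first, sess))
        if gap < 7:  # within 1 week, as in eTable 4
            comparable.append((d, a, gap))
    return comparable
```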