APPRAISE-AI Tool for Quantitative Evaluation of AI Studies for Clinical Decision Support

Key Points
Question: Can quantitative methods be used to evaluate the robustness of artificial intelligence (AI) prediction models and their suitability for clinical decision support?
Findings: In this quality improvement study, the APPRAISE-AI tool was developed to evaluate the methodological and reporting quality of 28 clinical AI studies using a quantitative approach. APPRAISE-AI demonstrated strong interrater and intrarater reliability and correlated well with other validated measures of study quality across a variety of AI studies.
Meaning: These findings suggest that APPRAISE-AI fills a critical gap in the current landscape of AI reporting guidelines and provides a standardized, quantitative tool for evaluating the methodological rigor and clinical utility of AI models.


Intraclass correlation coefficient
A sample size of 28 studies with 2 observations per study achieves 81% power to detect an intraclass correlation coefficient of 0.45 under the alternative hypothesis, when the intraclass correlation coefficient under the null hypothesis is 0, using an F-test with a significance level of 0.05.

ii How was the ground truth determined? Multiple (>1), experts. Ground truth was determined through a combination of multiple means, including 1) x-ray reports, 2) follow-up imaging such as x-rays, CT, or MRI, and 3) operative reports. Since not all patients were surgically validated (i.e., underwent surgery to treat their fracture), this was not scored as "Objective, well-captured ground truth". Given the multiple assessments by clinical experts, this was scored as "Multiple (>1), experts" [page 3].

ii Time-windows for abstracted features are specified (e.g., vital signs recorded within the past 12 hours will be used to predict sepsis) Y All included x-rays were from preoperative hips at initial presentation [pages 2-3].
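The power statement above can be checked against the standard one-way random-effects ANOVA approximation for ICC tests. The sketch below is illustrative only (not the software the authors used): under the null (ICC = 0), F = MSB/MSW follows F(n−1, n(k−1)); under the alternative ICC = ρ, F/τ with τ = (1 + (k−1)ρ)/(1 − ρ) follows that same distribution.

```python
from scipy.stats import f

def icc_power(n=28, k=2, rho=0.45, alpha=0.05):
    """Approximate power of the one-way random-effects ANOVA F-test of
    H0: ICC = 0 against the alternative ICC = rho (one-sided)."""
    df1, df2 = n - 1, n * (k - 1)
    f_crit = f.ppf(1 - alpha, df1, df2)      # rejection threshold under H0
    tau = (1 + (k - 1) * rho) / (1 - rho)    # variance inflation under H1
    return f.sf(f_crit / tau, df1, df2)      # P(F > f_crit | ICC = rho)
```

With the stated inputs (n = 28, k = 2, ρ = 0.45, α = 0.05 one-sided) this approximation returns a power close to the reported 81%.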
iii How was missing data handled?
If there is no missing data, it should be clearly stated that there is no missing data; select Not applicable. If it is unclear whether there is missing data or how it was handled, select Not reported. Not applicable This is not applicable since the deep learning model requires a frontal pelvic x-ray image to make a prediction. Patients without frontal pelvic x-rays were excluded, and this is indicated in Figure 1.

Y
The authors highlight that their deep learning model was generalizable to an international cohort. It outperformed radiologists and the reported performance of a previously developed AI model. They mention the significant drop in sensitivity using the pre-specified operating point, which would limit its clinical utility. They summarize the key errors identified in their algorithmic audit [pages 6-7]. The authors outline limitations including 1) exclusion of patients with surgical hardware, 2) low sample size of the multi-reader, multi-case study, 3) lack of racial or ethnicity information for patient-specific subgroup testing, and 4) findings from the algorithmic audit may not be statistically reliable [page 7].

i Rationale provided for choice of candidate features (e.g., based on prior research, clinical relevance, available data, etc.) Y Plain frontal pelvic radiographs were used since this is the most commonly ordered study at the time of initial presentation in the emergency department [page 2].
Potential application(s) to clinical practice and future directions are discussed Y The authors outline how their deep learning model can be implemented into clinical workflows, along with mitigation strategies to address limitations outlined in their algorithmic audit [page 7, appendix pages 16].

i
of the results is presented, which may include:
- New predictors of the ground truth of interest discovered using AI
- Strengths of the AI model(s) compared to current models in the literature
- Why the AI model(s) performed better/worse than what is currently available
- (Optional) If feature importance rankings were used, describe whether they were aligned with clinical intuition and known prognostic factors
Y The authors highlight the strengths of their Transformer model, including the integration of longitudinal data, prediction of major causes of death, and long-term mortality prediction. They reviewed the most important features for each of their outcomes, which aligned with the existing literature [pages 8-9].

21 Implementation into clinical practice: Describe how the AI model(s) can be applied to clinical practice, with respect to the potential to improve patient care, clinical decision-making, and/or efficiency. [Max score 1] 1
Potential application(s) to clinical practice and future directions are discussed Y The authors outline how their Transformer model can be implemented to predict 1- and 5-year mortality at each follow-up [page 9].

The authors outline limitations including 1) missing data and cause of death in the SRTR dataset, 2) inability to identify causal relationships, 3) lack of external validation, 4) lack of race and ethnicity information in the UHN dataset, 5) exclusion of other causes of death, and 6) shift in patient distribution (overrepresentation of NAFLD patients, change in immunosuppression) [pages 9-10].

© 2023 Kwong JCC et al. JAMA Network Open.
- Rationale provided for choice of candidate features (e.g., based on prior research, clinical relevance, available data, etc.): +1
- Time-windows for abstracted features are specified (e.g., vital signs recorded within the past 12 hours will be used to predict sepsis): +1
Describe how the dataset was obtained (e.g., single/multi-center, local/national database, etc.), and study period. If relevant, the diversity of the dataset is also described (e.g., inclusion of community hospitals, low/middle income populations, and institutions from other countries).
- Transformation/Augmentation: Details provided for how data was altered to change its representation (e.g., normalization, log-transformation, one-hot encoding, image rotation, image translation, adjusting image contrast). If not performed, it should be explicitly stated: +1
- Modification/Cleaning: Details provided for how data was altered in a non-uniform manner (e.g., outlier removal). If not performed, it should be explicitly stated: +1
- Removal of features: Method reported (e.g., clinical judgement, principal component analysis, recursive feature elimination, correlation, or ablation analysis). If not performed, it should be explicitly stated.
Provide rationale for sample size required for model development (e.g., based on power calculation):
- If not reported: 0 for entire item
- Sample size reported: +2
- Number of events reported: +2
- Details provided for sample size calculation (can be in supplementary material): +1
/5
Describe appropriate metrics for readers to understand the risk/benefit trade-offs of using the AI model at the specified decision threshold (e.g., decision curve analysis). Select one of the following:
- No assessment of clinical utility: +0
- Sensitivity and specificity reported for a specified threshold: +2
- Decision curve analysis or impact on clinical outcomes (e.g., overall survival, length of stay, readmission rates): +5
- Nomogram/scoring system/website available to use model: +1
- Trained model available: +1
- Complete source code available: +1
- Executable end-to-end (e.g., dependency file, documentation on how to run the code) available: +2
/10
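APPRAISE-AI items are additive point allocations capped at each item's maximum. A minimal sketch of that scoring logic (the point values mirror the sample-size rubric above; the function itself is illustrative, not part of the published tool):

```python
def score_item(points_awarded, max_score):
    """Sum the sub-item points awarded for one APPRAISE-AI item,
    capped at that item's maximum score."""
    return min(sum(points_awarded), max_score)

# Sample size calculation (max 5): size reported +2, events reported +2,
# calculation details +1
sample_size_score = score_item([2, 2, 1], max_score=5)

# Transparency (max 10): e.g., only decision-curve analysis reported (+5)
clinical_utility_score = score_item([5], max_score=5)
```

Domain scores and the overall score out of 100 are then simple sums of the item scores.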

5 Eligibility criteria: Specify all criteria for inclusion/exclusion of patients and features. Provide appropriate details (e.g., adults, age > 18) and rationale. [Max score 3] 3
For the Royal Adelaide Hospital dataset, all frontal pelvic x-rays were included, regardless of x-ray equipment and imaging parameters. For the Stanford University Medical Center dataset, all lower extremity x-rays were included, of which a random selection of 46 fracture and 100 non-fracture cases were selected [pages 2-3]. For the Royal Adelaide Hospital dataset, cases were excluded if surgical hardware was seen in the x-ray or if there were no frontal pelvic x-rays available. For the Stanford University Medical Center dataset, cases were excluded if surgical hardware was seen in the x-ray or if personal health information was included in the raw image [pages 2-3].

iii Details and rationale for criteria are provided Y The authors explained that cases with surgical hardware were excluded since they represent a different class of hip injury, while the target population for their deep learning model is focused on preoperative patients. X-rays with embedded personal health information were excluded for privacy reasons [pages 2-3].
6 Ground truth: Define the ground truth of interest. Describe how it was collected (e.g., manual annotation by experts) and encoded (e.g., binary, categorical, dichotomized continuous, continuous variable, etc.). [Max score 6] 6

i Ground truth of interest is clearly defined. For unsupervised learning, describe what measure(s) and associated data will be used to assess cluster validity (e.g., correlating disease-specific features with overall survival) Y Presence of a proximal femoral fracture [page 3]

8 Data splitting: Specify how the data was divided into the training, validation, and testing cohorts. [Max score 7] 7
Both a random split at the patient level (Royal Adelaide Hospital dataset) and external validation (Stanford University Medical Center dataset) were used. The option that yielded the highest possible score (external validation) was selected [pages 2-3].
[page 4].

iv Transformation/Augmentation: Details provided for how data was altered to change its representation (e.g., normalization, log-transformation, one-hot encoding, image rotation, image translation, adjusting image contrast). If not performed, it should be clearly stated that it was not performed; select Not applicable. If it is unclear whether it was performed or not explicitly stated, select No. Y Transformation and augmentation procedures included standardizing pixel intensities, image translation, rotations, shears, and histogram matching [appendix pages 8, 10].

Modification/Cleaning: Details provided for how data was altered in a non-uniform manner (e.g., outlier removal). If not performed, it should be clearly stated that it was not performed; select Not applicable. If it is unclear whether it was performed or not explicitly stated, select No. Y Bounding boxes were created to localize and separate the left and right hips [appendix page 9].

v Outline any methods used to remove features (e.g., clinical judgement, principal component analysis, recursive feature elimination, correlation, or ablation analysis), if applicable. If not performed, it should be clearly stated that it was not performed; select Not applicable. If it is unclear whether it was performed or not explicitly stated, select No. Not applicable Not performed

9 Sample size calculation: Provide rationale for sample size required for model development (e.g., based on power calculation). [Max score 5] 0

i Minimum sample size required reported N Not specified. However, the authors state that the number of cases included in the multi-reader, multi-case study maximized the sample size while balancing what can be reasonably expected from clinicians. They compared their sample size against similar studies in the discussion [pages 3, 7].

13 Cohort characteristics: Provide the total cohort size and summary statistics of the training, validation (if used), and testing cohorts, including incidence of the ground truth of interest. [Max score 4] 4
i Total cohort size, number of samples with missing data, and follow-up time (if applicable) are reported Y Table 1 [page 5]

14 Model specification: Present the final AI model(s) and specify the final panel of features included and hyperparameters tuned. Final hyperparameters can be listed in Supplementary Material. [Max score
iii Measure(s) for model calibration is reported (e.g., calibration plots, calibration slope and intercept). If both calibration plot and statistical summary of calibration are provided, select Calibration plot.
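Calibration slope and intercept are commonly estimated by logistic recalibration: regress the observed outcome on the logit of the predicted probability, where a slope near 1 and intercept near 0 indicate good calibration. A self-contained sketch with a hand-rolled Newton-Raphson fit and synthetic data (illustrative only, not any study's code):

```python
import numpy as np

def calibration_slope_intercept(p_pred, y, iters=50):
    """Fit y ~ intercept + slope * logit(p_pred) by Newton-Raphson
    logistic regression and return (intercept, slope)."""
    x = np.log(p_pred / (1 - p_pred))          # logit of predicted risks
    X = np.column_stack([np.ones_like(x), x])
    beta = np.zeros(2)
    for _ in range(iters):
        mu = 1 / (1 + np.exp(-X @ beta))       # current fitted probabilities
        grad = X.T @ (y - mu)                  # score vector
        H = X.T @ (X * (mu * (1 - mu))[:, None])  # Fisher information
        beta = beta + np.linalg.solve(H, grad)
    return beta[0], beta[1]

# Perfectly calibrated synthetic predictions should give slope ~1, intercept ~0
rng = np.random.default_rng(0)
p = rng.uniform(0.05, 0.95, size=5000)
y = (rng.uniform(size=5000) < p).astype(float)
intercept, slope = calibration_slope_intercept(p, y)
```

A slope below 1 would indicate overfitting (predictions too extreme); an intercept away from 0 indicates systematic over- or under-prediction.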

17 Bias assessment: Compare evaluation metrics for the AI model(s) and reference standard when stratified by patient- and task-specific subgroups to identify subgroups that benefit, are not helped at all, or harmed by the models. Patient-specific subgroups may include age group, gender, ethnicity, or socioeconomic status. Task-specific subgroups are disease-specific and may include risk stratification (e.g., low-, intermediate-, and high-risk disease in prostate cancer), or subtyping (e.g., different bacteria in positive blood cultures). [Max score 6]
iv Task-specific: Performance (e.g., AUROC) is evaluated across at least one subgroup Y Type of fracture (subtle, mild, moderate, severe displacement, comminuted), location of fracture (subcapital, cervical, pertrochanteric, subtrochanteric) [page 6].

v Task-specific: Clinical utility (e.g., sensitivity or specificity for a specified threshold) is evaluated across at least one subgroup N Not performed
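Subgroup performance checks of this kind reduce to computing the discrimination metric within each stratum separately. A minimal sketch using a rank-based AUROC (equivalent to the Mann-Whitney U statistic); the subgroup labels here are invented for illustration:

```python
import numpy as np

def auroc(y_true, y_score):
    """AUROC via the Mann-Whitney statistic: the probability that a
    random positive case is scored higher than a random negative case."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    wins = (pos[:, None] > neg[None, :]).sum()   # positive ranked above negative
    ties = (pos[:, None] == neg[None, :]).sum()  # tied scores count half
    return (wins + 0.5 * ties) / (len(pos) * len(neg))

def auroc_by_subgroup(y_true, y_score, groups):
    """Evaluate AUROC separately within each subgroup
    (e.g., fracture type or fracture location)."""
    y_true, y_score, groups = map(np.asarray, (y_true, y_score, groups))
    return {g: auroc(y_true[groups == g], y_score[groups == g])
            for g in np.unique(groups)}
```

Each subgroup needs at least one positive and one negative case for the metric to be defined, which is one reason small subgroups are flagged in bias assessments.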

18 Error analysis: Analyze predictive errors to identify characteristics that are more prone to inaccurate predictions. Determine if there are any surprise errors (e.g., clearly inaccurate predictions based on clinical judgement). [Max score 4]
Two surprise false negatives included 1) a minimally displaced subtrochanteric fracture in a patient with Paget's disease and 2) a heavily displaced subtrochanteric fracture with the fracture elements forming a pseudo-Shenton's line. The one false positive was a case with a severely deformed femoral head, suspected to be due to a childhood injury, which had not progressed due to osteoarthritis [appendix pages 12-14].

Other Information 23 Disclosures: Disclose all financial relationships, sources of funding, and potential conflicts of interest. [Max score 1] 1

24 Transparency: Share the data, data dictionary, source code, or release an application that runs the code. [Max score 10] 2
ii Data availability: How can other researchers access the data used in the study? Data availability needs to be explicitly stated to receive points. Available on request The data sharing statement indicates that the derived data is available upon request to the corresponding author [page 8].

iii Model availability: How can other researchers access the model(s) used in the study? N Not provided

2 Background: Describe the clinical problem and rationale for developing AI models. Review existing relevant literature exploring AI models for the problem being addressed. [Max score 1] 1
i The clinical context and rationale for developing/updating an AI model(s) to address the clinical problem are presented Y The authors describe how long-term life expectancy following liver transplantation may be impacted by graft failure, infections, cardiovascular complications, and cancer. While several risk factors for these long-term complications have been identified, they have not been integrated in a comprehensive and longitudinal manner, which is possible due to the longitudinal follow-up that is standard of care in this patient population. The authors propose the use of a deep learning model utilizing longitudinal data to provide more accurate prognostication of mortality due to graft failure, infection, cancer, or cardiovascular causes [pages 1-2].

ii A synthesis of existing AI models that predict the same outcome is provided. If there are no existing models, this should be stated Y In the research in context, the authors found that no studies have investigated the use of longitudinal data to predict liver transplant outcomes [page 2].

4 Source of Data: Describe how the dataset was obtained (e.g., single/multi-center, local/national database, etc.), and study period. If relevant, the diversity of the dataset is also described (e.g., inclusion of community hospitals, low/middle income populations, and institutions from other countries). [Max score 8]
iv What was the setting(s) of the institutions included in the data or inferred based on their description? If not reported or unknown, select No. Academic institutions Y Liver transplantation is typically performed only at academic institutions. UHN is an academic teaching hospital [pages 2-3]. Institutions from multiple (>1) countries Y Canada and United States [page 2]

Ground truth was determined using International Classification of Diseases codes for the SRTR dataset, and manual chart review for the UHN dataset. Since not all outcomes were determined via diagnostic codes, this was not scored as "Objective, well-captured ground truth". Given the multiple assessments by clinical experts through chart review, this was scored as "Multiple (>1), experts" [page 3].

7 Data abstraction, cleaning, preparation: Describe the methods used to develop the final dataset, with consideration of feature abstraction, handling of missing data, feature engineering, and removal of features. [Max score 7]
Imputation was primarily done through forward-filling. The authors also experimented with median- and mean-filling, and random drawing from the training distribution. They found that the forward-filling approach yielded the best AUROC [page 4, appendix pages 3, 19].
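Forward-filling carries the last observed value of a longitudinal feature forward until a new measurement appears. A minimal pure-Python sketch of the idea (the authors' pipeline is not available; the `fallback` parameter is an assumption standing in for a strategy such as the median-filling alternative they compared):

```python
def forward_fill(values, fallback=None):
    """Replace each missing value (None) with the most recently
    observed value; leading gaps receive the fallback value."""
    filled, last = [], fallback
    for v in values:
        if v is not None:
            last = v          # remember the latest observed measurement
        filled.append(last)   # missing entries reuse the last observation
    return filled
```

For example, a lab value series `[1.0, None, None, 2.0]` becomes `[1.0, 1.0, 1.0, 2.0]`.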
v Outline any methods used to remove features (e.g., clinical judgement, principal component analysis, recursive feature elimination, correlation, or ablation analysis), if applicable. If not performed, it should be clearly stated that it was not performed. N Not specified

Specify how the data was divided into the training, validation, and testing cohorts. [Max score 7] 5
A training/tuning/validation split strategy was used for the SRTR dataset, while a 5-fold stratified cross-validation strategy was used for the UHN dataset. As per Table 1, the option that yields the higher value (cross-validation) was selected [page 4].
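Stratified k-fold cross-validation preserves the outcome prevalence in every fold, which matters when events are rare. A dependency-free sketch of the splitting step (illustrative only; the study does not describe its implementation in code):

```python
from collections import defaultdict

def stratified_kfold(labels, k=5):
    """Assign each sample index to one of k folds, distributing the
    members of each class round-robin so every fold keeps roughly the
    same class mix as the full cohort."""
    by_class = defaultdict(list)
    for idx, lab in enumerate(labels):
        by_class[lab].append(idx)
    folds = [[] for _ in range(k)]
    for members in by_class.values():
        for pos, idx in enumerate(members):
            folds[pos % k].append(idx)   # spread each class evenly
    return folds
```

In practice the indices would be shuffled within each class first; the round-robin core is what guarantees stratification.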

24 Transparency: Share the data, data dictionary, source code, or release an application that runs the code. [Max score 10] 9

APPRAISE-AI score (out of 100): 64
Quality based on overall APPRAISE-AI score: High
Clinical Relevance (out of 4): 4
Data Quality (out of 24): 16
Methodological Conduct (out of 20): 7
Robustness of Results (out of 20): 7
Reporting Quality (out of 12): 12
Reproducibility (out of 20): 18

Introduction

2 Background: Describe the clinical problem and rationale for developing AI models. Review existing relevant literature exploring AI models for the problem being addressed. [Max score 1]
i The clinical context and rationale for developing/updating an AI model(s) to address the clinical problem are presented Y The authors describe how accurate prediction of prostate cancer-specific mortality may help identify patients who would benefit most from treatment. However, current predictive models are limited in that they either predict biochemical recurrence, which is a poor surrogate for survival, or fail to capture complex, non-linear relationships between variables. The authors propose the use of a novel machine learning framework on a large, national dataset to predict 10-year cancer-specific mortality in men with non-metastatic prostate cancer [pages 1-2].

ii A synthesis of existing AI models that predict the same outcome is provided. If there are no existing models, this should be stated Y In the research in context, the authors found that only a few machine learning studies examined prognostication in prostate cancer, and these were primarily based on small, single-ethnicity cohorts [page 2].

7 Data abstraction, cleaning, preparation: Describe the methods used to develop the final dataset, with consideration of feature abstraction, handling of missing data, feature engineering, and removal of features. [Max score 7] 4
Cancer of the Prostate Risk Assessment score, Cambridge Prognostic Groups, National Comprehensive Cancer Care Network, Genitourinary Radiation Oncologists of Canada, American Urological Association, European Association of Urology, National Institute for Health and Care Excellence [page 3]

ii Regression model using same features in AI model used for comparison N Not specified

iii Domain expert (e.g., clinician judgement) or current standard of care (gold standard) used for comparison Y PREDICT Prostate and Memorial Sloan Kettering Cancer Center nomograms are the most widely used models in clinical practice [page 3].
i Rationale provided for choice of candidate features (e.g., based on prior research, clinical relevance, available data, etc.) Y Features were selected based on known predictors of prostate cancer-specific mortality [pages 1-3].

ii Time-windows for abstracted features are specified (e.g., vital signs recorded within the past 12 hours will be used to predict sepsis) Y Features were abstracted at the time of prostate cancer diagnosis [page 3].

List the evaluation metrics used to assess performance and calibration, including the justification for selection. [Max score 5] 5

i Measure(s) for model discrimination is reported (e.g., AUROC, AUPRC, c-index, etc.). If multiple measures of discrimination are provided and at least one includes a measure of statistical significance, select Measure(s) with statistical significance. Time-dependent c-index with 95% confidence intervals determined using 10,000 bootstrap samples (Table 2) [pages 3, 5]

ii Rationale provided for which metric is most clinically relevant for the problem at hand Y Time-dependent c-index was used to assess discrimination at the 10-year timepoint [page 3].
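Percentile-bootstrap confidence intervals of the kind reported here are obtained by resampling patients with replacement and recomputing the metric each time. The sketch below uses a simplified, censoring-ignored concordance index rather than the study's time-dependent estimator, so it only illustrates the resampling idea:

```python
import numpy as np

def concordance(time, score):
    """Plain concordance index, censoring ignored: the fraction of
    patient pairs with different survival times in which the higher
    risk score belongs to the shorter survivor (ties count 0.5)."""
    t = np.asarray(time, dtype=float)
    s = np.asarray(score, dtype=float)
    num = den = 0.0
    for i in range(len(t)):
        for j in range(i + 1, len(t)):
            if t[i] == t[j]:
                continue                      # not a comparable pair
            den += 1
            hi, lo = (i, j) if t[i] < t[j] else (j, i)  # hi died sooner
            if s[hi] > s[lo]:
                num += 1
            elif s[hi] == s[lo]:
                num += 0.5
    return num / den

def bootstrap_ci(time, score, n_boot=200, seed=0):
    """95% percentile-bootstrap CI for the concordance index."""
    rng = np.random.default_rng(seed)
    t = np.asarray(time, dtype=float)
    s = np.asarray(score, dtype=float)
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(t), len(t))  # resample patients
        stats.append(concordance(t[idx], s[idx]))
    return np.percentile(stats, [2.5, 97.5])
```

The study's 10,000 replicates follow the same pattern, just with a time-dependent c-index that accounts for censoring.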

16 Clinical utility assessment: Describe appropriate metrics for readers to understand the risk/benefit trade-offs of using the AI model at the specified decision threshold (e.g., decision curve analysis). [Max score 5] 5 i
Measure(s) of clinical utility is reported. If both sensitivity or specificity for a specified threshold and decision curve analysis are provided, select Decision curve analysis.
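Decision curve analysis plots net benefit against the threshold probability p_t, with net benefit = TP/N − (FP/N) · p_t/(1 − p_t). A minimal sketch of that formula on invented data (not values from any study in this appendix):

```python
import numpy as np

def net_benefit(y_true, p_pred, threshold):
    """Net benefit of treating all patients whose predicted risk is at
    least `threshold`: the true-positive rate minus the false-positive
    rate weighted by the odds of the threshold probability."""
    y = np.asarray(y_true)
    p = np.asarray(p_pred, dtype=float)
    n = len(y)
    treat = p >= threshold                 # treatment decision at this threshold
    tp = np.sum(treat & (y == 1))
    fp = np.sum(treat & (y == 0))
    return tp / n - (fp / n) * threshold / (1 - threshold)
```

Plotting this over a range of thresholds, against the "treat all" and "treat none" strategies, gives the decision curve the rubric awards +5 for.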