Assessment of a Deep Learning Model Based on Electronic Health Record Data to Forecast Clinical Outcomes in Patients With Rheumatoid Arthritis

This prognostic study examines the ability of an artificial intelligence system to predict the state of disease activity in patients with rheumatoid arthritis at their next clinical visit.

Fully connected networks performed better than chance but had the lowest overall performance. GRU, LSTM, and CNN models performed nearly equivalently. Adding a single time-distributed dense (TDD) layer, a dense function applied across all variables within each time point, produced the largest increase in model performance.
The best-performing architecture used a TDD input layer followed by a recurrent layer, with a slight preference for GRUs over LSTMs.

eAppendix. Supplemental Methods and Discussion
Data

Primary Cohort (UCSF)
In order to use real-world longitudinal patient data to build and evaluate our models, we utilized records from the UCSF EHR.

Inclusion Criteria
Of the over 900,000 patients in the UCSF EHR, 3959 had at least 1 RA diagnosis, but only 2452 had 2 diagnoses separated by more than 30 days. Of those 2452 patients, 1417 had been seen by a Rheumatology provider, only 925 had received at least one CDAI score, and only 672 of those patients had received 2 scores. Of those 672 patients, 603 had received at least one inflammatory test, and only 578 of the remaining group had also received a DMARD.

Replication Cohort (ZSFG)
The IRB for ZSFG did not require de-identification of patient records. EHR data were directly accessed using the eCW product "eBO reports" which runs on an IBM Cognos platform.

Variables Utilized in Model
Medication names were standardized by first using the R scripting library MetaMap 1 and then programmatically removing any remaining characters associated with delivery or dosage.
Medication names were then mapped to the list of DMARDs. Steroids were included if their pharmaceutical class was labeled as "glucocorticosteroid" in the EHR and their route of administration was either oral or injection. All patient medications that did not map to either a DMARD or steroid were dropped. Most machine learning libraries, including the TensorFlow 2 library that we planned to use for modeling, do not accept string values within tensors.
Therefore, we encoded medications using a dictionary mapping the drug name to a unique integer value (e.g., Methotrexate= 1) in each patient's record. We chose to include only the first occurrence of each medication given the lack of reliable medication stop dates in the EHR.
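As an illustration of this encoding step, a minimal sketch is shown below; the drug names, helper function names, and example patient record are hypothetical, not taken from the study's code.

```python
# Sketch of the integer-encoding step described above.
# Drug names and the patient record are illustrative examples.
def build_medication_vocab(medications):
    """Map each unique medication name to a unique positive integer."""
    return {name: idx for idx, name in enumerate(sorted(set(medications)), start=1)}

def encode_first_occurrences(patient_meds, vocab):
    """Encode a patient's medication list, keeping only the first
    occurrence of each drug (stop dates are unreliable in the EHR)."""
    seen, encoded = set(), []
    for name in patient_meds:
        if name in vocab and name not in seen:
            seen.add(name)
            encoded.append(vocab[name])
    return encoded

vocab = build_medication_vocab(["Methotrexate", "Prednisone", "Hydroxychloroquine"])
codes = encode_first_occurrences(["Methotrexate", "Prednisone", "Methotrexate"], vocab)
```

Keeping only the first occurrence of each drug mirrors the decision above: without reliable stop dates, repeated prescription records add no trustworthy temporal signal.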

Modeling Input Formats
We considered two different formats for representing a patient's longitudinal trajectory as input for modeling. The first method was a sequential string of events. In this format, each patient's events follow the exact chronology in which they appear within the EHR. As an analogy, in this format a patient's trajectory is like a sentence, and the goal of the model is to predict the final word of the sentence (always a CDAI score category in this case, either controlled or uncontrolled).

Deep learning can be viewed as a hierarchical transformation of the input data, with each layer acting as a distinct function that changes the data in a different way, into the representation of the original input data that makes the predictive task as straightforward as possible. An architecture is an arrangement of layers placed together, representing either the modeler's theory or an experimental finding about which functions will generate the best representation of the input for a given problem.

Overview of Relevant Deep Learning Layer Types
LSTMs and GRUs have slightly different mechanisms for learning sequential representations, which can lead to differences in performance on different data sets. In general, LSTMs are more robust, but as a result they are slower to train.
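To make the GRU's gating mechanism concrete, the following NumPy sketch steps a single GRU cell over a short sequence. The weights are random placeholders rather than trained values, and the dimensions are arbitrary.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h, params):
    """One GRU time step: update gate z, reset gate r, candidate state h_tilde."""
    Wz, Uz, Wr, Ur, Wh, Uh = params
    z = sigmoid(x @ Wz + h @ Uz)               # update gate: how much to refresh
    r = sigmoid(x @ Wr + h @ Ur)               # reset gate: how much history to use
    h_tilde = np.tanh(x @ Wh + (r * h) @ Uh)   # candidate hidden state
    return (1.0 - z) * h + z * h_tilde         # blend old state with candidate

rng = np.random.default_rng(0)
n_in, n_hid = 4, 8
params = [rng.normal(size=s) for s in [(n_in, n_hid), (n_hid, n_hid)] * 3]

h = np.zeros(n_hid)
for x in rng.normal(size=(5, n_in)):  # run the cell over a 5-step sequence
    h = gru_step(x, h, params)
```

The GRU's two gates (versus the LSTM's three gates plus a separate cell state) are why GRUs carry fewer parameters and typically train faster, as noted above.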

Model Training
The UCSF cohort was divided into three sub-cohorts for model building and testing: training, validation, and testing. To ensure that the sub-cohorts were representative of the overall population, we calculated the proportion of patients in each CDAI outcome category (60% were Controlled, 40% were Uncontrolled). We then performed a stratified random split, keeping 20% (n = 116) of the patients aside for testing (these patients' data were never trained on; they were used only to test the final model) and using 80% for model training and development.
We then performed an additional stratified random split on the patients assigned to model development, allocating 80% to training and the remainder to validation. The ZSFG patient cohort was less than half the size of the UCSF cohort, and we chose not to involve it in model selection. The ZSFG cohort was split in two: the test cohort was matched to the size of the UCSF test cohort as closely as possible (n = 117) so that model performance could be evaluated across equally sized patient populations, and a training cohort was created from the remaining patients (n = 125). Membership in the cohorts was assigned through a stratified random split as described above.
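A stratified random split of the kind described can be sketched as follows. The class labels and split fraction below are illustrative stand-ins; the study's actual splits used its CDAI outcome categories and cohort sizes.

```python
import random

def stratified_split(patient_ids, labels, test_frac, seed=0):
    """Split patients so each outcome class appears in the held-out set
    at approximately the same proportion as in the full cohort."""
    rng = random.Random(seed)
    by_class = {}
    for pid, y in zip(patient_ids, labels):
        by_class.setdefault(y, []).append(pid)
    train, test = [], []
    for pids in by_class.values():
        rng.shuffle(pids)                    # randomize within each class
        k = round(len(pids) * test_frac)     # per-class held-out count
        test.extend(pids[:k])
        train.extend(pids[k:])
    return train, test

ids = list(range(100))
labels = ["Controlled"] * 60 + ["Uncontrolled"] * 40  # 60/40, as in the cohort
train, test = stratified_split(ids, labels, test_frac=0.2)
```

Because the split is performed per class, the 60/40 Controlled/Uncontrolled ratio is preserved in both resulting sub-cohorts.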
The goal of any deep learning architecture is to learn a representation of the original input data that maximizes the success rate for the predictive task. In this case, the input data are each individual patient's clinical RA trajectory, and the task is to predict what each patient's disease activity state will be at their next visit. The final representation that the architecture generates is passed to a logistic classifier that makes this prediction.

For models that included both patient variables that changed over time, such as lab values and CDAIs, and static variables that did not change over time, such as demographics, the variables were separated according to whether or not they were time-dependent. These separate inputs for each patient were fed into two independent deep networks: a recurrent network for the time-dependent variables and a purely dense network for the static variables. The two network outputs were concatenated to form a final joint representation, which was sent through a non-linear layer and passed to the logistic classifier. Backpropagation flowed through both networks, allowing joint learning of static and time-dependent representations.
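The two-branch design described above can be sketched in Keras as follows. The layer sizes, feature counts, and function name are illustrative assumptions, not the study's actual hyperparameters.

```python
import tensorflow as tf  # sketch only; layer sizes are illustrative

def build_joint_model(n_steps, n_time_feats, n_static_feats):
    """Two-branch network: a TDD layer plus a GRU for time-dependent
    variables, dense layers for static variables, with the branches
    concatenated and passed through a non-linear layer to a logistic
    (sigmoid) classifier."""
    # Branch 1: time-dependent variables (labs, CDAIs, ...)
    seq_in = tf.keras.Input(shape=(n_steps, n_time_feats))
    x = tf.keras.layers.TimeDistributed(
        tf.keras.layers.Dense(16, activation="relu"))(seq_in)  # TDD layer
    x = tf.keras.layers.GRU(32)(x)                             # recurrent layer

    # Branch 2: static variables (demographics, ...)
    static_in = tf.keras.Input(shape=(n_static_feats,))
    s = tf.keras.layers.Dense(8, activation="relu")(static_in)

    # Joint representation -> non-linear layer -> logistic classifier
    joint = tf.keras.layers.Concatenate()([x, s])
    joint = tf.keras.layers.Dense(16, activation="relu")(joint)
    out = tf.keras.layers.Dense(1, activation="sigmoid")(joint)
    return tf.keras.Model([seq_in, static_in], out)

model = build_joint_model(n_steps=10, n_time_feats=6, n_static_feats=4)
```

Because both branches feed a single output, gradients from the classifier flow back through the recurrent and dense branches simultaneously, which is what allows the joint learning of static and time-dependent representations.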

Model Optimization
There are many different strategies that can be applied for model optimization. The most common methods include experienced intuition, grid searches, random searches, and some form of Sequential Model-based Global Optimization (SMBO).
Since, to our knowledge, no model architectures for multivariable time series deep learning to predict future health outcomes for a chronic disease have been published, there was no data to guide intuition for selecting the optimal hyperparameters. While grid searches are popular because they are easy to conduct, Bergstra and Bengio 7 have shown that it is more efficient to randomly search through values while employing a method to intelligently narrow the search space than it is to loop over a fixed set of hyperparameter values in a grid. SMBOs are algorithms that begin with a random search over the hyperparameter space, and then use the results of the models built with that search to fit one or more surrogate functions that describe the relationship between a set of possible hyperparameters and model generalization. The algorithm then optimizes the surrogate function with the goal of identifying points in the hyperparameter space that will lead to improved model performance on data unseen by the model during training.
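A minimal random-search loop over a hyperparameter space is sketched below. The parameter names, value ranges, and toy scoring function are hypothetical stand-ins for an actual train-and-validate run.

```python
import random

def random_search(space, evaluate, n_trials=20, seed=0):
    """Sample hyperparameter settings at random and keep the best by
    validation score, rather than looping over a fixed grid."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_trials):
        params = {name: rng.choice(choices) for name, choices in space.items()}
        score = evaluate(params)             # stand-in for training + validation
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Hypothetical search space and a toy score standing in for
# "train a model and measure validation performance".
space = {"units": [16, 32, 64, 128], "dropout": [0.0, 0.2, 0.5]}
toy_score = lambda p: p["units"] - 100 * p["dropout"]
best, score = random_search(space, toy_score, n_trials=50)
```

An SMBO method would replace the uniform sampling here with proposals drawn from a surrogate function fit to the `(params, score)` pairs observed so far.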
To account for the slight class imbalance and to further optimize the network, two different approaches to biasing model predictions towards the different classes of patients (Controlled vs Uncontrolled) were tested. The first was the sampling method used for training.
Patients were sampled for training either randomly or at double the rate of their class distribution in the data. Additionally, we tested two methods for penalizing the loss function: either equally for each class or proportionally balanced so that the under-represented class was up-weighted.
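Proportionally balanced class weighting can be computed as in the sketch below, which uses the common inverse-frequency scheme w_c = n_total / (n_classes * n_c); the paper does not specify its exact weighting formula, so this is an illustrative assumption.

```python
def inverse_frequency_weights(labels):
    """Up-weight the under-represented class so each class contributes
    equally to the loss: w_c = n_total / (n_classes * n_c).
    (Common inverse-frequency scheme; assumed, not from the paper.)"""
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    n_total, n_classes = len(labels), len(counts)
    return {c: n_total / (n_classes * n) for c, n in counts.items()}

labels = ["Controlled"] * 60 + ["Uncontrolled"] * 40  # 60/40 imbalance
weights = inverse_frequency_weights(labels)
```

With the cohort's 60/40 split, the minority Uncontrolled class receives a weight of 1.25 versus roughly 0.83 for the Controlled class, so each class contributes equally to the total loss.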

Model Selection
Forecasting performance increases non-linearly with the number of samples available for training. There is a sharp increase in performance between 50 and 100 samples, and the net size of performance gains becomes smaller as the sample size increases. It is important to note that these experiments were conducted post-hyperparameter-optimization. Therefore, they reflect the numbers necessary to train the optimal model but do not reflect the numbers necessary to identify the optimal model.