Development and Validation of an Artificial Intelligence System to Optimize Clinician Review of Patient Records

Key Points Question Would the development of a novel artificial intelligence (AI) system to organize patient health records improve a physician’s ability to extract patient information? Findings This prognostic study of 12 physicians or fellows in an academic gastroenterology department found that first-time physician users of the AI system were able to save a mean of 18% of the time taken to answer clinical questions regarding a patient’s medical history while maintaining accuracy comparable to their performance without AI. Meaning These findings suggest that, without sacrificing accuracy, the AI technology developed helps physicians extract relevant patient information in a shorter time.

This supplementary material has been provided by the authors to give readers additional information about their work.

eMethods. Study Procedures
First, a date extraction algorithm was developed to identify the date from a group of pages in a referral record; we defined this as the most important date for a given group (typically the encounter or result date). Each individual page within the group was searched for dates. To exclude dates that were irrelevant to a patient's visit, only dates within the body of each page were considered, with those in the header or footer excluded. Additionally, dates before the year 2000 or in the future were ignored, as they were assumed to be incorrectly parsed or of marginal clinical utility. After this filtering, the most recent of the remaining dates was output as the predicted date for the group. Second, a lab extraction algorithm was developed to extract a list of lab values from a referral record containing such data. The entire referral record was searched for lab names, such as creatinine or hemoglobin, as well as common abbreviations. The extracted lab values were organized into a table sorted by lab name and date.
To organize the record by content category, first, we developed a content categorization algorithm to classify the type of information in a body of text into one of the following categories: Referral, Note, Lab, Radiology, Procedure, Operative Report, Pathology, Fax Cover Sheet, or Insurance. Guided by physician input, a set of keywords that would best discriminate between these categories was developed by reviewing patient records (eTable 1). Input text was searched for keywords in the set, then classified as the category which best matched the keywords found. Subsequently, we developed a page grouping algorithm to use both visual and textual information to partition a referral record into its constituent documents. To do so, a convolutional neural network was developed to take in an image of a page as input and then predict whether the page was the first page of a document. The neural network's prediction was then combined with textual heuristics, such as the presence of phrases of the form "Page X of Y," as well as information from the content categorization algorithm, to produce a final prediction. Based on these predictions, an input record was partitioned into its constituent documents, with each predicted first page beginning a new document. This partitioning was used to perform date extraction and content categorization at the document level, rather than the page level. This led to greater accuracy for both tasks, since textual information from all the pages in a document could be combined for prediction.
To present the optimized patient information to the clinician, we developed a web interface that displayed the outputs of the system for a given referral record. Displayed on the left side of the interface was a summary containing a list of document categories found in the record, along with hyperlinks to the original full PDF record, which was shown on the right side of the interface (Figure 2). Each category on the left displayed a list of nested items predicted by the system to belong to that category, along with the date of each document. For example, in the radiology category, all CT scans, MRIs, and other radiographic pages would be listed with their dates of completion. Additionally, lab values extracted from the document were displayed in a table organized by date. Social information such as smoking history and allergies was displayed in a separate section. All categorized documents were linked to the original PDF: clicking on any document in the summarized display section on the left of the interface would scroll to the page in the original full PDF record on the right side of the interface from which it was extracted.

In summary, we developed a system to extract relevant clinical information from patient referral records. A referral record consists of many scanned documents, generally concatenated together into one file. First, optical character recognition (OCR) was applied to parse the PDF into text. The pages were then split into groups: sets of consecutive pages that make up a single source document. To do so, a convolutional neural network, augmented by a set of textual heuristics, was applied to each page to predict which pages began a new document. For each group, manual heuristics were applied to extract the date, and a manually tuned linear classifier was applied to predict the document category. Position- and text-based heuristics were run over the entire referral record to extract a list of lab values, as well as information regarding smoking and allergies.
Text Extraction: Extraction of text from referral records was done with PyOCR (0.7.2), an optical character recognition (OCR) tool wrapper that uses the Long Short-Term Memory-based Tesseract 4.1.1 OCR engine. We chose PyOCR for its flexibility in reading all image types, its various output types for text, and its ability to parse only digits within a text, a feature essential to the extraction of specific lab values. Every page was first converted to an image before the algorithm was used to detect all regions of text on the page. The algorithm returned a list of results; each result contained the coordinates of the bounding rectangular box as well as the text contents within it.

Date Extraction:
The date extraction component determines the timestamp of a group of pages. On each page, a set of keywords (eTable 1) was used to select a set of text boxes to be searched for timestamps. To avoid date artifacts from the times faxes were received, all dates whose bounding box had its top left corner positioned in the top or bottom 10% of the page were discarded. All remaining parseable dates were extracted, and the most recent date after the year 2000 was selected.
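The filtering steps above can be sketched as follows. This is a minimal illustration: the `(top_y, text)` box format, the single date pattern, and the fixed reference date are hypothetical stand-ins for the OCR output and date parsing actually used.

```python
from datetime import date, datetime

def extract_group_date(boxes, page_height, today=date(2021, 1, 1)):
    """Pick the most plausible date from OCR text boxes of a page group.

    Each box is (top_y, text); boxes whose top edge falls in the top or
    bottom 10% of the page are skipped to avoid fax header/footer
    timestamps. Dates before 2000 or after `today` are discarded.
    """
    candidates = []
    for top_y, text in boxes:
        # Discard header/footer regions (top or bottom 10% of the page).
        if top_y < 0.1 * page_height or top_y > 0.9 * page_height:
            continue
        for token in text.split():
            try:
                d = datetime.strptime(token, "%m/%d/%Y").date()
            except ValueError:
                continue
            # Ignore dates before the year 2000 or in the future.
            if d.year >= 2000 and d <= today:
                candidates.append(d)
    # Return the most recent surviving date, if any.
    return max(candidates) if candidates else None
```

For example, a fax timestamp in the header and a pre-2000 date of birth are both rejected, leaving the visit date as the group's timestamp.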
Lab Extraction: The lab extraction component uses a list of lab names along with their corresponding numerical ranges to extract a list of lab values from the lab tables in a patient referral record. First, the referral record was searched for lab names or their common abbreviations. Common OCR misreadings were also included in the list of abbreviations. For each instance of a lab name that was found, the area to the right of the lab name on the page was searched for numerical values. Values that fell outside a clinically specified range were excluded; the first remaining numerical value on the line was selected.
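A simplified version of this search can be sketched as below. The lab names and numerical ranges here are invented for illustration; the production system also handles abbreviations and common OCR misreadings.

```python
import re

# Hypothetical clinically plausible ranges used to reject spurious numbers.
LAB_RANGES = {
    "creatinine": (0.1, 20.0),
    "hemoglobin": (3.0, 25.0),
}

def extract_labs(lines):
    """Scan text lines for lab names; for each hit, take the first
    in-range numerical value to the right of the name on that line."""
    results = []
    for line in lines:
        lower = line.lower()
        for name, (lo, hi) in LAB_RANGES.items():
            idx = lower.find(name)
            if idx == -1:
                continue
            # Search only the text to the right of the lab name.
            for token in re.findall(r"\d+\.?\d*", line[idx + len(name):]):
                value = float(token)
                if lo <= value <= hi:  # exclude out-of-range values
                    results.append((name, value))
                    break
    return results
```

Note how the range check also skips fragments of reference ranges or units that the OCR step may emit next to the true value.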

Content Categorization:
The content categorization component categorizes a given page or group of pages as one of a set of predetermined categories: Referral, Note, Lab, Radiology, Procedure, Operative Report, Pathology, Fax Cover Sheet, and Insurance. For each category, a characteristic set of keywords, along with a set of weights indicating how important each keyword is to classification, was developed in consultation with clinicians. When processing a referral record, each group of pages was searched for keywords; the total predicted score for each category was the sum of the weights for all keywords corresponding to that category found in the group. To minimize the impact of incidental mentions of information from other categories in long paragraphs, keywords found in a line at least 5 words long were down-weighted by a penalty factor of 0.3. For a given group of pages, the category with the highest predicted score was selected and displayed to the user.
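The scoring rule can be illustrated with a toy keyword table. The categories are from the list above, but the keywords and weights here are invented for illustration; the actual sets are given in eTable 1.

```python
# Toy keyword weights per category (illustrative only).
KEYWORDS = {
    "Lab": {"specimen": 2.0, "reference range": 3.0},
    "Radiology": {"impression": 2.0, "ct abdomen": 3.0},
}
LONG_LINE_PENALTY = 0.3  # down-weight keywords buried in long prose lines

def categorize(lines):
    """Return the category whose keyword weights sum highest over the text."""
    scores = {cat: 0.0 for cat in KEYWORDS}
    for line in lines:
        lower = line.lower()
        # Keywords on lines of >= 5 words get a multiplicative penalty.
        penalty = LONG_LINE_PENALTY if len(lower.split()) >= 5 else 1.0
        for cat, keywords in KEYWORDS.items():
            for keyword, weight in keywords.items():
                if keyword in lower:
                    scores[cat] += weight * penalty
    return max(scores, key=scores.get)
```

The penalty means a passing mention of "impression" inside a long narrative paragraph contributes far less than the same word on a short heading line.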
The set of keywords and weights was developed alongside a clinician through an iterative procedure using the training records and a separate set of validation records. The clinician read through the documents in the training records and determined the category of each page. Then, for each class, they identified words contained in the training documents that were present in many of the documents of that class, and at the same time mostly absent from documents in other classes. Next, the list of words was fine-tuned on the validation set by testing how well the keywords separated pages into their correct classes when used with the content categorization procedure described above. The list was adjusted until the pages of the validation records were categorized at an acceptable level of accuracy.

Document Partitioning:
The document partitioning component splits a referral record into its constituent documents. Referral records are typically made up of a large number of separate original documents merged into a single PDF; our algorithm recovers these documents by predicting whether each page is the beginning of a separate document. Each page of a referral record was first resized to 216×256 pixels, then converted to black and white. To classify a given page, the previous page, the current page, and the next page were input to a fine-tuned ResNet18 model pre-trained on ImageNet. The model was fine-tuned to classify pages on our training set of referral records for 15 epochs with a batch size of 64, using binary cross-entropy loss and the Adam optimizer with β1 = 0.9, β2 = 0.999, and a constant learning rate of 1e-4.
When processing a new referral record, each page was processed by our model; pages predicted to be a first page with probability greater than 0.5 were labeled as the beginning of a document. In addition, pages whose text contained "Page 1 of" were also classified as the beginning of a document.
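The final first-page decision combines the two signals described above; a minimal sketch (the 0.5 threshold and the "Page 1 of" check are as stated, while the input format is a hypothetical simplification):

```python
def is_first_page(cnn_prob, page_text):
    """A page begins a new document if the CNN is confident enough
    or the text contains an explicit first-page marker."""
    return cnn_prob > 0.5 or "Page 1 of" in page_text

def split_into_documents(pages):
    """Partition (cnn_prob, text) pages into documents; each predicted
    first page starts a new group."""
    documents = []
    for prob, text in pages:
        if is_first_page(prob, text) or not documents:
            documents.append([])  # start a new document
        documents[-1].append(text)
    return documents
```

The textual override catches boundaries the image model misses, since an explicit pagination marker is near-unambiguous evidence of a document start.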
Web Implementation: We developed a web interface for clinicians that organized the output of the system for a given patient record. A main group view was organized into sections corresponding to document category. Each section displayed a list of page groups that were predicted to belong to that category; additionally, each group was rendered with a differently colored label to better distinguish between categories. Within each section, the groups were sorted in original page order and labeled with the predicted date. Any duplicate groups were rendered in grey below their duplicate (Figure K). A lab view displayed a table of labs found in the patient record. Rows corresponded to a given lab type (e.g., calcium); columns corresponded to a specific date. A social history view displayed pages predicted to contain information about allergies and social history. Our interface consisted of a split-screen layout, with the summarized outputs of our algorithm on the left and the original PDF record on the right for cross-reference. Dragging the border between the two panes allowed for resizing them. Clicking on a link in the summary (i.e., a page group, a lab value, or a social history link) scrolled to the appropriate page of the PDF record. Conversely, scrolling in the PDF record scrolled the summary to the appropriate section. A screenshot of the interface is presented in Figure 2. We implemented our system using the jQuery library (version 3.4.1) and pdf.js (version 2.3.2).

Selection of patient records/categories for referral packet organization:
We randomly chose new patient referral packets from a variety of gastroenterology providers at Stanford. These packets are representative of a wide variety of gastroenterology referrals, as they were drawn from multiple subspecialties, including general GI, liver disease, and GI motility.
Our AI system includes all pages of the original referral packet but predicts and categorizes each page of the packet into one of nine categories that are most commonly present in a typical referral packet. The rationale for these nine categories is that patient notes, labs, and radiology, operative, procedure, and pathology reports constitute useful patient information, while referral pages, fax cover sheets, and insurance information are often included in these packets as well. In consultation with physicians in our research group, we determined that the most useful targets for extraction would be dates, in order to chronologically order notes as well as objective data (procedures and labs). In addition, a tabular format for routine labs was thought to be very helpful, as it could provide insight into trends over time, which are often monitored in clinical practice.

Limitations
Our system has several technical limitations. First, when multiple dates were included in a given group and recognized by our date extraction algorithm, the most recent date was selected. However, this may not always correspond to the most clinically relevant date; for example, the day a lab procedure was performed may be more clinically important than the day the results were reported. In addition, our system currently displays all pages in the summary, even those with little information or with significant informational overlap with other pages. Although removing such pages would improve the ease of use of our system, such a modification could lead to the potential omission of important patient information, especially information incorrectly parsed by our OCR component. This could, however, be customized in a future version of this system. Additionally, our current duplicate-removal system, based solely on word-level overlap, is insufficiently sophisticated to identify pages with high semantic overlap. Future work might extend this with neural duplicate detection.

Mixed Effects Model
This study was performed with a multi-reader multi-case (MRMC) strategy. To account for variability among readers and cases, a linear mixed effects regression model was used with random reader-specific and case-specific effects. Two random reader-specific effect terms and two random case-specific effect terms are included in the regression model. The model was fit in SAS as follows:

proc mixed data=work;
    class qid sid;
    model time = assisted acc packetlength / solution cl;
    repeated / subject=qid(sid);
run;

Reading_Time = b0 + b0i + b0j + (bint + bint_i + bint_j)Xint + εij
where:
b0i = reader-level random intercept, assumed to follow a Gaussian distribution
b0j = case-level random intercept, assumed to follow a Gaussian distribution
bint_i = reader-level random slope, assumed to follow a Gaussian distribution
bint_j = case-level random slope, assumed to follow a Gaussian distribution
Xint = AI-optimization (Standard Review = 0, AI-optimization = 1)
εij = Gaussian error

Therefore, from the above model, both overall and reader-specific differences between the two groups can be estimated.
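As an illustration only, a simplified version of this model can be expressed in Python with statsmodels. This sketch keeps just the reader-level random intercept and slope on synthetic data (statsmodels' MixedLM groups on a single factor, so the crossed case-level terms of the full model are omitted here); the variable names and effect sizes are invented.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)

# Synthetic reading-time data: 6 readers x 20 cases, each case read
# once without (0) and once with (1) AI assistance. Illustrative only.
rows = []
for reader in range(6):
    reader_eff = rng.normal(0, 1)  # reader-level intercept shift
    for case in range(20):
        for assisted in (0, 1):
            time = 10 + reader_eff - 2 * assisted + rng.normal(0, 0.5)
            rows.append({"reader": reader, "case": case,
                         "assisted": assisted, "time": time})
df = pd.DataFrame(rows)

# Random intercept and slope for reader; crossed case effects omitted.
model = smf.mixedlm("time ~ assisted", df, groups=df["reader"],
                    re_formula="~assisted")
result = model.fit()
print(result.fe_params)  # fixed effect of assistance, near the true -2
```

The fixed-effect coefficient on `assisted` estimates the overall change in reading time under AI-optimization, while the random slope captures how that change varies by reader.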

eTable 4. Date and Content Categorization Accuracy per Packet
Caption: For each page within a packet, our algorithm output a date and a category classification that were used to sort the packet into relevant groups. In this table, we evaluated the accuracy of these outputs, along with 95% CIs computed via the bootstrap replicate method, on each packet when compared with the ground truth provided by physicians. For dates, we evaluated both the percentage of pages in the packet labeled with the correct date and the percentage labeled within 3 days of the correct date; both of these results are above 72% for all packets used in the experiment. The latter condition allows us to evaluate the accuracy of our model when ignoring small errors immaterial to clinical judgments. We achieve a similar, though slightly lower, accuracy for category classification. The average is weighted so that each page has equal importance.