Prevalence and Sources of Duplicate Information in the Electronic Medical Record

Key Points

Question How much duplicate content is present in electronic medical records, where does it come from, and why is it there?

Findings In this cross-sectional analysis of 104 456 653 routinely generated clinical notes, 16 523 851 210 words (50.1% of the total count of 32 991 489 889 words) were duplicated from prior documentation. Duplicate content was prevalent in notes written by physicians at all levels of training, nurses, and therapists and was evenly divided between intra-author and inter-author duplication.

Meaning The prevalence of information duplication in electronic medical records suggests that it is an adaptive behavior requiring further investigation so that improved documentation systems can be developed.


eAppendix 1. Code
Below is the code for our n-gram sliding window algorithm, as implemented using Python and Spark.
(1) We do not believe sequence alignment methods are well suited to capturing the full scope of the clinical text duplication problem, for a few reasons. Traditional complete pairwise sequence alignment presumes that one sequence (note) is the result of copying and transforming exactly one other, earlier sequence (note). This assumption fails in multiple ways. First, duplication can occur at an intra-note level (e.g., text in the history/physical section of a note is pasted into the assessment/plan of the same note). Second, clinical notes often include text duplicated from multiple sources (e.g., note C is made up of text duplicated from note A and note B). Naive sequence alignment only allows a note's text to be aligned to a single, transformed source and does not capture the phenomenon of multiple sources. The n-gram method, by contrast, detects these real-world phenomena more effectively.
(2) Performing pairwise sequence alignment on such a large corpus (even if the algorithm only aligned a note with previous notes in the same patient's chart) would be prohibitively expensive computationally (O(n²) per word or worse). The n-gram algorithm, on the other hand, is O(n) with respect to words, which is much more tractable. This is because we use a Python dictionary (a data structure with constant-time lookup) to store all previously seen 10-grams, so each 10-gram in a document can simply be looked up to determine whether it has previously been used in the patient's chart. Previous work using sequence alignment has either used much smaller corpora (Wrenn et al.) or relied on assumptions about duplication that may not hold in reality. For instance, Rule et al. "only computed redundancy for notes where the patient's last visit in a particular specialty had been with the same clinician and documented by the same author, providing a similarly constructed note for comparison. We did not compute redundancy for notes where the patient had no prior outpatient encounter in that particular specialty, or the most recent note had been written by another author." We believe from clinical experience (and provide evidence in our study) that duplication across different authors is highly prevalent in the EMR, so our study requires an algorithm that can compare sequences of word tokens to all prior sequences written for that patient. For these reasons, the n-gram algorithm is more suitable for our purposes.
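The dictionary-based 10-gram approach described above can be sketched as follows. This is an illustrative, standalone Python version rather than the published Spark implementation; the function names and whitespace tokenization are simplifying assumptions.

```python
def tokenize(text):
    # Simplified whitespace tokenization (an assumption; the actual
    # preprocessing pipeline is not reproduced here).
    return text.lower().split()

def duplicated_fraction(new_note, prior_notes, n=10):
    """Fraction of words in new_note covered by at least one n-gram that
    already appears in any of the patient's prior notes."""
    # Store every previously seen n-gram in a set for constant-time
    # membership tests, giving O(n) behavior in the number of words.
    seen = set()
    for note in prior_notes:
        toks = tokenize(note)
        for i in range(len(toks) - n + 1):
            seen.add(tuple(toks[i:i + n]))

    toks = tokenize(new_note)
    if len(toks) < n:
        return 0.0
    duplicated = [False] * len(toks)
    for i in range(len(toks) - n + 1):
        if tuple(toks[i:i + n]) in seen:
            # Mark every word inside the matching window as duplicated.
            duplicated[i:i + n] = [True] * n
    return sum(duplicated) / len(toks)
```

Because each lookup is constant time, scanning a new note costs time linear in its length regardless of how many prior notes the patient has, after the one-time cost of indexing those prior notes.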
One trade-off of our algorithm relative to pairwise sequence alignment is that it may flag small segments of text that were not directly duplicated but instead reflect common phrases used in clinical practice or templated text. This may lead to an overestimation of duplicate text content. To evaluate the scope of this problem, we conducted a preliminary analysis using 10-grams to measure how much text our duplication detection would identify by chance (i.e., without intentional duplication behavior). To do this, we selected 1,000 random notes from all patients in our corpus to serve as a "randomly generated chart." We then selected a sample of notes from our corpus to treat as "new notes" and ran the same duplication detection algorithm on each, using the randomly generated chart as the "previous notes" for comparison. In this way, we could measure how much text would appear "duplicated" simply because clinicians use certain turns of phrase, note templates, and text macros when documenting. We used a sample of 10,000 notes for this process (i.e., we treated each of the 10,000 notes as if it were a "new note" written in the same chart as the randomly generated chart, and averaged the duplication metric across all text in these 10,000 notes). We found that 13.6% of the text in a new note would be identified as duplicated by chance against a previous chart of 1,000 notes, and 5.2% against a previous chart of 100 notes. The vast majority of patients in our corpus have fewer than 1,000 notes.
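The chance-duplication baseline described above can be sketched in the same style. The helper below builds the 10-gram index from a randomly chosen chart and averages the duplication metric over sampled "new notes"; the function names and in-memory corpus handling are illustrative assumptions (the study used 1,000 chart notes and 10,000 new notes, and the defaults below mirror that).

```python
import random

def ngrams(text, n=10):
    # All n-grams of a note as a set of word tuples.
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def chance_baseline(corpus, chart_size=1000, sample_size=10000, n=10):
    """Average fraction of words flagged as 'duplicated' when sampled notes
    are compared against a chart of randomly chosen, unrelated notes."""
    chart = random.sample(corpus, chart_size)        # "randomly generated chart"
    seen = set().union(*(ngrams(note, n) for note in chart))
    fractions = []
    for note in random.sample(corpus, sample_size):  # treated as "new notes"
        toks = note.lower().split()
        if len(toks) < n:
            continue
        dup = [False] * len(toks)
        for i in range(len(toks) - n + 1):
            if tuple(toks[i:i + n]) in seen:
                dup[i:i + n] = [True] * n
        fractions.append(sum(dup) / len(toks))
    return sum(fractions) / len(fractions) if fractions else 0.0
```

Any nonzero result from unrelated charts reflects shared phrasing, templates, and macros rather than intentional duplication, which is exactly the overestimation the baseline quantifies.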

eAppendix 3. Discussion of Limitations of Our Methods
Our metric for identifying the "source" of duplicated text (i.e., the prior note that the current author was looking at or thinking about when they duplicated the text) is imperfect. Our metric identifies the source as the most recently written note which includes the duplicated text, which is an assumption that may not always hold in reality (the author may have duplicated the text from an older note, rather than the most recent).
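The most-recent-source heuristic can be illustrated with a small sketch. The `(timestamp, text)` note representation and simple substring matching below are hypothetical simplifications of how notes are actually stored and compared.

```python
def most_recent_source(duplicated_phrase, prior_notes):
    """Attribute duplicated text to the most recently written prior note
    that contains it. This implements the assumption discussed above;
    the true source may in fact be an older note.

    prior_notes: list of (timestamp, text) pairs (illustrative format).
    """
    candidates = [(ts, text) for ts, text in prior_notes
                  if duplicated_phrase.lower() in text.lower()]
    # Most recent timestamp wins; None if no prior note contains the phrase.
    return max(candidates, key=lambda pair: pair[0]) if candidates else None
```

When several prior notes contain the same phrase, this heuristic always credits the latest one, which is precisely the failure mode described above if the author actually copied from an earlier note.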
However, any metric for identifying the true source of duplicated text will be somewhat flawed without direct monitoring of users' duplication actions, which is infeasible for a study of this size. Some EMR software, including Epic, includes functionality to track a limited subset of duplicate text: the EMR records when note text was copy-pasted from another note in the chart, along with the source of the copy-pasted text. However, this functionality only captures directly copy-pasted text (e.g., with Ctrl+C/Ctrl+V) and misses all other forms of duplication, including duplicate text generated by templates, users retyping the same text, and minor paraphrases that nonetheless contain large amounts of redundancy. We therefore felt this feature was too limited for our purposes.
For the purposes of this study, we aim only to show that duplicate text arises both from individual users documenting repeatedly over time (sourcing from past notes they have written) and from multiple users duplicating each other's past work. We believe our metric is sufficient to demonstrate that both intra-author and inter-author duplication are prominent contributors to text duplication.
On the other hand, interesting questions also arise from how the "source" of duplicated text is defined across patients. Identical text in notes for two different patients does not necessarily carry duplicate information: in the Patient A note, the clinician is stating that Patient A has heart failure and should be treated with a certain regimen; in the Patient B note, the clinician is stating the same of Patient B, which is not duplicate information. This may seem an obvious point, but it is important to recognize in the discussion around the utility or appropriateness of duplicating text.
According to our methodology, if a template appears in a single note for patient 1 and a single note for patient 2, neither is counted as duplicate. This is intentional, as this does not necessarily reflect undesirable behavior. Medicine is often rather stereotyped, so a physician who treats the same condition thousands of times may develop a templated set of diagnostics/treatments/patient instructions, and having this template available will save them significant time. Similarly, healthcare systems which seek to enforce high-quality treatment