Peabody JW, Luck J, Glassman P, Dresselhaus TR, Lee M. Comparison of Vignettes, Standardized Patients, and Chart Abstraction: A Prospective Validation Study of 3 Methods for Measuring Quality. JAMA. 2000;283(13):1715-1722. doi:10.1001/jama.283.13.1715
Author Affiliations: San Francisco Veterans Affairs Medical Center and Institute for Global Health, University of California, San Francisco (Dr Peabody); RAND, Santa Monica (Dr Peabody); Veterans Affairs, Greater Los Angeles Healthcare System, West Los Angeles (Drs Peabody, Luck, and Glassman); University of California, Los Angeles, Schools of Medicine and Public Health, Los Angeles (Drs Luck, Peabody, and Lee); Veterans Affairs Center for the Study of Health Care Provider Behavior (Drs Peabody, Luck, Lee, and Glassman); and San Diego Veterans Affairs Medical Center, University of California, San Diego School of Medicine (Dr Dresselhaus).
Context Better health care quality is a universal goal, yet measuring quality
has proven to be difficult and problematic. A central problem has been isolating
physician practices from other effects of the health care system.
Objective To validate clinical vignettes as a method for measuring the competence
of physicians and the quality of their actual practice.
Design Prospective trial conducted in 1997 comparing 3 methods for measuring
the quality of care for 4 common outpatient conditions: (1) structured reports
by standardized patients (SPs), trained actors who presented unannounced to
physicians' clinics (the gold standard); (2) abstraction of medical records
for those same visits; and (3) physicians' responses to clinical vignettes
that exactly corresponded to the SPs' presentations.
Setting Outpatient primary care clinics at 2 Veterans Affairs medical centers.
Participants Ninety-eight (97%) of 101 general internal medicine staff physicians,
faculty, and second- and third-year residents consented to be randomized for
the study. From this group, 10 physicians at each site were randomly selected to see SPs.
Main Outcome Measures A total of 160 quality scores (8 cases × 20 physicians) were generated
for each method using identical explicit criteria based on national guidelines
and local expert panels. Scores were defined as the percentage of process
criteria correctly met and were compared among the 3 methods.
Results The quality of care, as measured by all 3 methods, ranged from 76.2%
(SPs) to 71.0% (vignettes) to 65.6% (chart abstraction). Measuring quality
using vignettes consistently produced scores closer to the gold standard of
SP scores than using chart abstraction. This pattern was robust when the scores
were disaggregated by the 4 conditions (P<.001
to <.05), by case complexity (P<.001), by site
(P<.001), and by level of physician training (P values from <.001 to <.05). The pattern persisted,
although less dominantly, when we assessed the component domains of the clinical
encounter—history, physical examination, diagnosis, and treatment. Vignettes
were responsive to expected directions of variation in quality between sites
and levels of training. The vignette responses did not appear to be sensitive
to physicians' having seen an SP presenting with the same case.
Conclusions Our data indicate that quality of health care can be measured in an
outpatient setting by using clinical vignettes. Vignettes appear to be a valid
and comprehensive method that directly focuses on the process of care provided
in actual clinical practice. Vignettes show promise as an inexpensive case-mix
adjusted method for measuring the quality of care provided by a group of physicians.
Assessing quality must ultimately rely on measures that are inexpensive,
reliable, and able to adequately control for case-mix variation.1-3
Health outcome measures, although a direct assessment of health status, also
reflect a spectrum of confounding events such as comorbidities and the socioeconomic
determinants of health—factors that are generally beyond the control
of a physician's daily practice. As a result, process measures of quality
are increasingly being used.4,5
If linkages between the provision of care and better health status have been
firmly established, there are substantiated benefits to measuring process
over measuring outcomes.6 Processes can be
measured more frequently than outcomes (eg, a death or complication), do not
require a lengthy interval to become manifest,7
and are generally less expensive to monitor.8,9
The most common method for measuring process, which includes both the
competence of the clinician and what the clinician actually does, is chart abstraction.
Chart abstraction primarily has been validated in the inpatient setting, where
care tends to be extensively documented and clinical events are more temporally
circumscribed.13 As care has increasingly shifted
to the outpatient setting, so has reliance on abstraction of outpatient charts
to measure quality of care.14 Despite increased
use of chart abstraction, validity of outpatient process measures has been
systematically evaluated in only a few studies,15,16
and significant problems may exist with chart abstraction in this setting.
For example, abstracted chart data may be subject to recording bias because
of time constraints on outpatient visits. The usefulness of chart abstraction
is further limited because a skilled (and costly) expert must collect the
data.17,18 Perhaps the most important
limitation of chart abstraction is that adjustments for case-mix variation
are insufficient, thereby limiting direct comparisons of quality of care across
different sites or delivery systems.19-21
An alternative in the outpatient setting is to directly observe patient-provider
interactions; this could be a gold standard if physicians were adequately
masked to the measurement method. However, truly double-blind observations,
where neither provider nor patient know they are being observed, are obviously
not possible for ethical and logistical reasons. An extensive medical-education
literature describes the successful use of standardized patients (SPs) as
a practical gold standard22-28
and reports that SPs can capture variation in clinical practice and reproducibly
show how individual physician practices vary over time.23,29,30
However, SPs require even more intrusion into a physician's practice than
chart abstraction, and they cannot assess some aspects of physician observation.31 They are expensive and incur the opportunity cost
of time the physician does not spend with "real" patients.
Thus, alternative methods of measuring process are needed.32
Vignettes or written case simulations have been widely used by educators,
demographers, and health service researchers to measure processes in a wide
range of practice settings.33-35
Vignettes are easily administered, less costly, and can be used in all types
of clinical practices.36 Because they control
for case mix, vignettes hold promise as a way to assess quality of care among
different providers and between organizations that may (or may not) care for
different populations of patients in different systems of care.37-39
But despite the promise of vignettes and their growing use in a variety of
settings, little work has been done to validate them.40,41
This study was performed to assess whether clinical vignettes are a
valid method for measuring process of care compared with actual clinical practice.
We used a prospective sample of a group of physicians to compare 3 measurement
methods—clinical vignettes, chart abstraction (the standard method),
and SPs (the gold standard). Quality scores were generated for 4 common outpatient
conditions. The analyses directly compared all 3 methods, controlling for
possible design effects of level of training, individual physician effects,
site or location disparities, and case severity. We also evaluated quality
scores for different domains of clinical care skills—history taking,
physical examination, radiologic and laboratory testing, diagnostic accuracy,
and clinical treatment or management.
The study was conducted at 2 general internal medicine primary care
outpatient clinics located at the West Los Angeles and the San Diego Veterans
Affairs medical centers, in California. All primary care staff physicians,
faculty, and residents in these clinics except interns were eligible for the
study. Ninety-eight of the 101 eligible providers (approximately 97%) consented
to see SPs "sometime" during their regularly scheduled clinic hours over the
course of the 12-month academic year. We randomly selected 10 physicians at
each site to see SPs. All consenting physicians were asked to notify us if
they suspected that a patient was an SP. The visits were completed over a
6-month period from February through July, 1997.
Each method measured the process of care for 4 common outpatient conditions:
low back pain, diabetes mellitus, chronic obstructive pulmonary disease, and
coronary artery disease (CAD). Two detailed clinical scenarios (cases) were
developed for each of the 4 conditions, 1 simple and 1 complex, for a total
of 8 cases. For each case, a physician both saw an SP and completed a vignette.
The Box below (“Coronary Artery Disease Scenarios”)
contains detailed summaries of the simple and complex CAD cases. (Detailed descriptions of the
vignettes and the scoring forms are available from the authors.)
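The case design described above can be tallied with a short sketch. Only the counts (4 conditions, 2 complexity levels, 2 sites, 10 physicians per site) come from the text; the site labels, physician identifiers, and variable names are hypothetical placeholders, not study data.

```python
from itertools import product

# Conditions and complexity levels as described in the text.
CONDITIONS = [
    "low back pain",
    "diabetes mellitus",
    "chronic obstructive pulmonary disease",
    "coronary artery disease",
]
COMPLEXITY = ["simple", "complex"]

# Hypothetical identifiers: 10 physicians at each of 2 sites.
SITES = ["A", "B"]
physicians = [f"{site}-{n:02d}" for site in SITES for n in range(1, 11)]

# One case per (condition, complexity) pair: 4 x 2 = 8 cases.
cases = list(product(CONDITIONS, COMPLEXITY))

# Every selected physician is paired with every case.
encounters = list(product(physicians, cases))

print(len(cases))       # 8
print(len(physicians))  # 20
print(len(encounters))  # 160
```

Each physician-case pairing is scored once per measurement method, which is why each method yields 160 quality scores.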
Established protocols were followed for SP training and data collection.
Educators running medical school SP programs trained the actors for each case.
Only experienced actors from the SP teaching program were hired. They were
trained to remember and record details of the clinical encounter. After training,
the SPs were enrolled unannounced into the primary care clinics and scheduled
for walk-in or new-patient visits. Their identities as SPs were not revealed
to any of the outpatient staff or the examining physician. Realistic identities,
necessary laboratory findings, and radiographs were all simulated. In all,
10 randomly chosen providers at each of the 2 sites saw 8 cases each, for
a total of 160 visits. To match the vignettes as closely as possible, the
SPs were carefully scripted not to volunteer any information other than the
Case 1. A 65-year-old man, a new patient, comes
to the clinic for follow-up of a myocardial infarction (MI) he had 3 months
ago. In taking the history, the physician should ascertain that the patient
is now free of pain and has no difficulty performing routine activities but
continues to smoke although he has normal blood pressure. After the physician
records what he or she intends to do in the physical examination, the findings
are revealed by the vignettes or by the patient in response to physician questioning,
and the physician then is asked what laboratory tests should be ordered (an
electrocardiogram and cholesterol test), what the diagnosis is (uncomplicated
MI), and how treatment should proceed. The physician should recognize that
the MI is recent and associated with reversible risk factors and that the
patient needs to be taking aspirin and a β-blocker.
Case 2. A 62-year-old new patient presents
with roughly the same story—recent MI with similar risk factors—but
in taking this history, the physician should learn that the patient has difficulty
with routine activities and easily becomes short of breath since running out
of his medication. On examination, the patient is to have slight tachypnea
and slightly elevated blood pressure. When this information is revealed by
the vignettes or by the patient in response to physician questioning, the
physician is expected to order the same tests as in the first case plus a
blood chemistry test and a chest radiograph (or schedule an echocardiogram).
The electrocardiogram confirms that the patient has had an MI in the past.
The physician should recognize that this is an MI complicated by mild heart
failure. The physician should evaluate for potential risk factors (again)
and prescribe aspirin and an angiotensin-converting enzyme inhibitor.
The SPs completed checklists immediately after their visits. An SP quality
score for each visit was generated directly from the checklist responses.
Simultaneously, charts from SP visits were retrieved from the clinic. Data
were abstracted from the charts by a trained nurse abstractor, generating 160 chart abstraction scores.
Several weeks after SPs had been seen in the clinic, vignettes were
given to the same 20 physicians. The vignettes prompted open-ended responses
to questions that were arranged in sections to re-create the sequence of a
typical patient visit: the presenting problem, history, physical examination,
radiologic or laboratory tests ordered, diagnosis, and treatment plan. Each
section began with the presentation of new patient information gained from
answers to questions in the previous section. After answering 1 section and
moving on to the next, physicians could not return to a previous section to
revise their answers. Thus, they could not use the new information to change
(and improve) their previous answers. When the vignettes were completed, the
responses were scored by the same expert nurse abstractor who performed the
chart abstraction, generating another 160 scores. The abstractor, who was
masked to physician identity, reviewed each vignette answer sheet and indicated
on a scoring form those scoring items the physician had successfully completed.
To evaluate whether there was a cuing effect (whether having seen an SP presenting
with the same case might cue physicians to recognize features of the vignette),
we also administered vignettes to 20 matched, randomly selected physicians
who had not seen SPs, generating another 160 scores.
We conceptualized quality as the comprehensive provision of services
in a manner that leads to better outcomes for individuals and populations.42 Thus, we identified candidate criteria to measure
a full range of activities that potentially captured the process of outpatient
primary care. Explicit quality criteria for each of the 8 cases were derived
from national guidelines. We submitted the candidate criteria to local expert
panels of academic and community physicians including both generalists and
specialists for the conditions. Based on their recommendations and group consensus,
we modified and finalized a master criteria list. Criteria for each case included
both necessary care and some care that was either unnecessary or inappropriate
for that condition.
Identical criteria were used in each method as explicit items on which
to score provider responses for each of the 8 cases. Items felt by experts
to be most critical were assigned a weight of 1.0.43,44
Individual items that experts deemed less important, such as multiple physical
examination items that were related to a single clinical construct, were grouped
into categories, implicitly assigning them lower weights, typically 0.50 or
0.33. Scores were generated from SP responses to a closed-ended postinterview
questionnaire that contained the explicit criteria for each case. Chart abstraction
and vignette scores were based on scoring forms that contained the criteria
and were completed by a trained nurse abstractor. The raw item scores for
each method were aggregated into category scores for that method. These weighted
scores, which averaged 21 categories per case, were then totaled and divided
by the total possible score, generating a percentage correct ("quality") score
for each physician-case combination. For the subanalysis, each scoring category
was assigned to 1 of the 5 domains of the encounter—history taking,
physical examination, test ordering, diagnosis, and treatment. Weekly team
meetings were held to review criteria and ensure consistent application of
scoring guidelines. Random audits enhanced the accuracy of the vignettes,
the checklists, and the SPs' scoring. Table
1 lists a summary of scoring criteria for the CAD complex case.
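The percentage score described above can be illustrated with a minimal sketch. It assumes that each scoring category is worth 1 point and that items grouped into a category share that point equally, so 2 grouped items carry 0.50 each and 3 grouped items about 0.33 each. The function name, category names, and example checklist are hypothetical and do not reproduce the study's scoring forms.

```python
from typing import Dict, List


def quality_score(categories: Dict[str, List[bool]]) -> float:
    """Return the percentage of weighted process criteria met.

    Assumption (not from the source): every category is worth 1 point,
    and the items inside a category split that point equally.
    """
    if not categories:
        raise ValueError("at least one scoring category is required")

    earned = 0.0
    for name, items in categories.items():
        if not items:
            raise ValueError(f"category {name!r} has no items")
        # Fraction of items met; a single-item category has weight 1.0.
        earned += sum(items) / len(items)

    total_possible = len(categories)
    return 100.0 * earned / total_possible


# Hypothetical checklist for one physician-case combination.
example = {
    "asks about chest pain": [True],              # critical item, weight 1.0
    "cardiac examination": [True, False],         # 2 items, 0.50 each
    "risk factor history": [True, True, False],   # 3 items, about 0.33 each
    "prescribes aspirin": [False],                # critical item, not met
}

print(round(quality_score(example), 1))  # 54.2
```

The sketch does not model how items for unnecessary or inappropriate care were handled, since the text does not specify that rule.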
Scores for the 3 methods were compared using a 4-way (3-way nested,
1-way crossed) analysis of variance model. The factors were design effects
(site and physician training level) and random effects (quality measurement
method and provider). A site-method interaction term was also included. The
statistical significance of the difference between means for the 3 methods
was determined using an F test; where these differences were statistically
significant, the significance of pairwise comparisons between methods was
measured using the Student-Newman-Keuls test. We used the same statistical
methods to compare scores for individual conditions, acute and chronic diseases,
and 4 of the 5 domains of the clinical encounter: history taking, physical
examination, diagnosis, and treatment. A 2-sample t
test was used to evaluate the significance of the difference in mean vignette
scores between providers who had seen SPs and those who had not.
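For illustration, a pooled-variance 2-sample t statistic can be computed as in the sketch below, which uses only the Python standard library. The function name and the score lists are made up, and the text does not state whether equal variances were assumed, so the pooled form is an assumption.

```python
import math
from statistics import mean, variance
from typing import Sequence, Tuple


def two_sample_t(a: Sequence[float], b: Sequence[float]) -> Tuple[float, int]:
    """Return (t statistic, degrees of freedom) for a pooled-variance t test."""
    n_a, n_b = len(a), len(b)
    if n_a < 2 or n_b < 2:
        raise ValueError("each sample needs at least 2 observations")

    df = n_a + n_b - 2
    # Pooled sample variance, weighting each group by its degrees of freedom.
    pooled = ((n_a - 1) * variance(a) + (n_b - 1) * variance(b)) / df
    std_err = math.sqrt(pooled * (1 / n_a + 1 / n_b))
    if std_err == 0:
        raise ValueError("both samples have zero variance")
    return (mean(a) - mean(b)) / std_err, df


# Made-up vignette scores (percent correct), for illustration only.
saw_sp = [70.0, 72.0, 74.0]
no_sp = [69.0, 71.0, 73.0]

t_stat, df = two_sample_t(saw_sp, no_sp)
print(round(t_stat, 3), df)  # 0.612 4
```

Converting the statistic to a P value requires the t distribution's cumulative distribution function, which a statistics library would normally supply; that step is omitted here.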
The 3-way comparison of the methods—SPs, vignettes, and chart
abstraction—is shown in Figure 1.
Mean percentage scores are listed for all cases and for each of the 4 conditions.
The highest quality scores for all cases combined were from SPs (76.2%),
followed by vignettes (71.0%), and chart abstraction (65.6%). When the overall
scores were disaggregated by each of the 4 conditions, this pattern remained
unchanged: vignette scores were consistently higher than scores obtained from
chart abstraction and consistently produced scores closer to the SP gold standard
than did chart abstraction when measured both in the aggregate and by individual
condition. The differences among mean scores for the 3 methods were statistically
significant in a 3-way comparison for all conditions except CAD (P = .05). The interaction effect expected to be strongest, site by
method, was not significant (P = .14).
We performed subanalyses to assess whether vignettes were sensitive
to case effects, defined as differences between methods across simple vs complex
cases and acute vs chronic diseases. The results (Table 2) were similar to the overall and disease-specific findings
above. For example, in the simple case, vignette quality scores (74.3%) were
closer to the SP gold standard (76.9%) than was chart abstraction (63.9%)
(P<.001). When we grouped the acute cases (low
back pain and chronic obstructive pulmonary disease exacerbation) and compared
them with the 2 more chronic disease conditions, subanalyses displayed a similar
overall pattern: SP scores were higher than vignette scores, which were higher
than chart abstraction scores.
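As a worked illustration of "closer to the gold standard," the distances of vignette and chart abstraction scores from the SP scores can be computed from the means reported in the text. The function and variable names are hypothetical, and this arithmetic is not part of the original analysis.

```python
# Mean quality scores (percent) reported in the text.
reported = {
    "all cases": {"sp": 76.2, "vignette": 71.0, "chart": 65.6},
    "simple cases": {"sp": 76.9, "vignette": 74.3, "chart": 63.9},
}


def gap_to_gold_standard(scores):
    """Return the absolute distance of each method from the SP score."""
    return {
        method: round(abs(scores["sp"] - value), 1)
        for method, value in scores.items()
        if method != "sp"
    }


gaps = {label: gap_to_gold_standard(s) for label, s in reported.items()}
for label, g in gaps.items():
    print(label, g)
# all cases {'vignette': 5.2, 'chart': 10.6}
# simple cases {'vignette': 2.6, 'chart': 13.0}
```

In both comparisons the vignette gap is roughly half the chart abstraction gap or less, in percentage points.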
We tested to see if the 3 methods in general, and vignettes specifically,
would consistently reflect expected differences in quality scores due to design
effects, defined as differences between sites and among provider training
level. Site B consistently scored higher than site A, regardless of method,
with a statistically significant (P<.001) difference
between sites (Table 3). Within
each site, we again found that vignette scores always approximated the SP
scores better than the chart abstraction scores did. We also observed a difference
in the scores between the 2 sites. When we compared individual providers (not shown),
there was variation in quality scores with all 3 methods (P<.01). When we stratified the analysis by physician training level—second-
or third-year residents vs attending physicians—the pattern between
methods was again robust. Third-year residents scored higher than attendings
and typically (but not always) higher than second-year residents.
We measured 4 discrete skill domains of the outpatient visit: history
taking, physical examination, diagnosis, and the treatment or plan. Patterns
were similar to other subanalyses for the domains of history and diagnosis,
but vignette scores were higher than SPs for the physical examination and
lower than chart abstraction for treatment (Table 4).
We were concerned that there might be an ordering effect since we did
not randomly administer the 3 methods (physicians who had completed vignettes
might then look for similar patients in their clinic). Since this left open
the possibility of cuing (answers to the vignettes would be higher because
physicians had been prompted to think about this type of case after seeing
an SP), we gave only vignettes to a second sample of 10 randomly chosen providers
at each site. We compared the vignette scores of the second group with those
of the physicians who had also seen the SPs. The difference in mean scores
for the 2 groups was not statistically significant either overall (P = .37) or when the 4 cases were disaggregated.
We also surveyed physicians as to whether they believed they had seen
any SPs. Of the 160 visits, only 5 SPs (3%) were detected (2 in site A, 3
in site B), comparing favorably with detection rates in similar SP studies.45 Two false-positive detections were also reported.
Valid measures of the competence and practice of physicians are the basis
of efforts to improve quality of care. However, competence and practice have
been difficult to isolate from structural effects. Moreover, the cost of measuring
quality across systems while controlling for case mix has further confounded
efforts to improve physician practice. This study measured quality in an outpatient
setting by using the common method of chart abstraction; a gold standard method
of SPs; and clinical vignettes, which heretofore have not been rigorously validated.
Despite widespread use of vignettes, there is uncertainty and controversy
about whether vignettes reflect actual clinical practice or merely physician
competence. Some investigators argue that vignettes only reflect what providers
are competent or knowledgeable enough to do.47,48
Other studies have found that vignettes predicted use of computed tomographic
or magnetic resonance imaging,49 reflected
variation in quality when vignettes with open-ended responses were used,50 demonstrated poor history-taking skills,47 or showed inadequate use of warfarin in atrial fibrillation.51 This study advances these earlier studies in several
ways: it used a comprehensive set of quality measures pertaining to all aspects
of a clinical visit, quality was scored on explicit criteria based on national
guidelines and expert panels, the vignettes had an open-ended response format,
physicians were prospectively selected into the study, and vignettes were
compared with 2 other quality measurement methods.
Our results suggest that vignettes may be a useful way to measure physician
practice in an outpatient setting. Vignette scores appeared to reflect actual
physician practice as recorded from SP visits, resulting in higher criterion
validity, and consistently measured physician practice more accurately than
did chart abstraction scores, resulting in better content validity. Vignettes
also were more effective than chart abstraction at measuring variations in
quality between the 2 study sites, yielding good face validity. We did not
find a cuing effect for vignettes when physicians had already seen SPs.
We infer from these findings that low quality may be significantly determined
by physician competence and not merely structural effects. If vignette scores
had been much higher than SP scores, for example, it could be argued that
practice deteriorated because of a structural effect such as the organization
or delivery of care. When we initially designed the study, we hypothesized
that we might find vignette scores (measures of competence) to be higher than
those of SPs or chart abstraction. We reasoned that a social desirability
bias in vignette responses and the vignettes' potential to emphasize knowledge
over actual clinical practice would result in higher scores that overestimated
the process of care.52,53 However,
we found that SP scores were consistently higher than vignette scores (which,
in turn, were higher than chart abstraction), implying that practice is better
than competence, at least for vignettes with open-ended responses. A clinically
based explanation is that the dynamic nature of the patient-physician dialogue
may cue the physician's thinking during the visit. The lower chart scores,
we reasoned, are the effect of recording bias—everything that happens
in the clinical encounter is not written down because of time constraints.
In the future, modifying the vignettes or varying the SP presentation may
help disentangle the direct effect of the patient encounter from the indirect
simulation of the vignette.
The face validity of the study and the general variation in quality
scores we observed deserve comment. Based on unquantified proxies such as
competitiveness of the respective residency programs, we expected site B to
score higher than site A. Vignettes were able to capture this effect. We also
observed that third-year residents generally outperformed second-year residents
and attending physicians. Perhaps this is not surprising—it is not unrealistic
to believe that senior residents know more than junior residents and provide
higher-quality care or exhibit a higher degree of assiduousness than faculty.
Despite their promise, vignettes are not a panacea for measuring quality.
Our analyses of disaggregated data revealed a complex story. Vignettes appear
to overestimate the quality of the physical examination and inconsistently
assess the quality of the treatment plan. We surmise that the reason for the
higher physical examination scores is that writing down an examination in
the vignette has little "temporal cost," whereas carefully performing additional
physical examination items on a patient in the clinic takes time away from
other activities such as ordering tests. We believe that the chart may be
more accurate than vignettes for recording treatment plans. The medical record
is often used to convey treatment orders (eg, for a follow-up appointment
or an imaging study). Structural problems may further degrade quality as measured
by charts—for example, when orders that were correctly requested by
the physician are lost or delayed.
We believe vignettes have an important niche in the overall measurement
of quality but that their use should be carefully defined and further studied.
Our study indicates that vignette scores are a valid overall measure of the
process of care provided by groups of physicians for a range of common outpatient
conditions. The measure appears to be responsive to real variations in quality
among sites and robust for individual diseases. Such a measure could be useful
to policymakers, purchasers, and managers as they seek to compare the quality
of care in different settings or evaluate management and policy interventions.
In addition, vignettes are uniquely suited for comparative analyses
because they better control for case-mix variation and reduce the impact of
Properly adjusting aggregate measures of quality for variations in case mix
and in patient populations is essential for valid comparisons of quality between
health care systems.54 Since vignettes in this
study and elsewhere appear to be responsive to changes in quality, they make
comparisons of quality across time possible.39,40,55,56
Specifically, vignettes could be used to measure the impact of organization
reforms or policy changes whose ostensible purpose is to improve the care
that patients receive.
Finally, vignettes directly measure the process of care, which is where
interventions can be targeted to improve overall health care quality. Structural
features, it is sometimes argued, are major determinants of quality in certain
circumstances, but they are difficult to measure or directly influence.57 Vignettes appear to be most useful when the focus
is on measuring the competence and even practices for a group of providers.
The disadvantage of such a focus is that there may be other more important
reasons for poor-quality care. Nevertheless, process measures look directly
at what services are provided, whether they are provided efficiently, and
whether they lead to better health.
Identifying specific deficiencies in the process of care has implications
for how clinical care might be improved. If, for example, specific limitations
are identified for 1 condition, a disease-based approach might be used; if,
instead, the deficiency is in ordering appropriate tests, training might shift
to an analytic approach to diagnostic testing. Identified deficiencies in
process could also be combined with population health issues, such as disease
prevalence or management of underdiagnosed conditions.
The last point implies that when vignettes are used to measure process,
they must be carefully constructed. The criteria should be linked to explicit
outcomes or evidence-based guidelines, and the responses should be open-ended.
As others have shown when measuring quality responses, eliminating disparity
requires that methodical steps be taken to ensure that scoring criteria are
As they are currently developed and validated, vignettes also have 2
important limitations that discourage their use to measure individual provider
performance or individual quality criteria. First, the limited intermethod
agreement, demonstrated by the domain variation from this and other studies,
argues that vignettes should not be used to assess individual-level performance.59-61 Second, it may be
unwise to emphasize measurements of individual criteria or individual provider
performance: poor performance on a single criterion may reflect a rare event
and not indicate a pattern of poor quality; similarly, focusing on an individual
provider fails to foster the type of relationships necessary to improve the
care provided by a group of physicians and associated health care workers.8
If vignettes are to be used appropriately, more prospective evaluation
of their strengths and weaknesses will be needed. Future studies are needed
to extend the range of clinical conditions and practice settings. This study,
for example, is limited to 4 outpatient conditions and new or walk-in patients.
Another limitation is that, although we used 2 sites, they are both academically
affiliated Veterans Affairs medical centers. It is possible that structural
elements, such as organization of care or patient population characteristics,
will affect the way providers answer vignettes.
Until these issues are formally addressed, caution is warranted before
extending vignette responses beyond global-level performance assessments.
While these results indicate that vignettes can measure actual clinical practice
by a group of providers, they should not be used to ascertain the deficiencies
in a single provider's ability to obtain a piece of information, perform a
skill or task, or complete a treatment plan.
Vignettes are likely to prove less expensive than chart abstraction
and are certainly less costly than training SPs. And if other studies substantiate
our findings, vignettes hold promise as a method to measure quality in the
outpatient setting while controlling for case mix and structural effects across
sites. Ultimately, dependable quality measurement—which ensures that
an intervention designed to improve practice actually does so—is central
to health care reform.