Physician Practice Pattern Variations in Common Clinical Scenarios Within 5 US Metropolitan Areas

This cross-sectional study examines within-area physician-level variations in decision-making in common clinical scenarios where guidelines specifying appropriateness or quality of care exist.

This supplemental material has been provided by the authors to give readers additional information about their work.

A. Clinical Scenarios
Through literature review, assessment of current measures in the public domain (e.g. National Quality Forum, Centers for Medicare and Medicaid Services, and Agency for Healthcare Research and Quality), and consultations with clinical subject matter experts, we defined measures of appropriate care in a set of 14 clinical scenarios spanning 7 specialties: cardiologist coronary artery disease care (stress tests in patients with stable chronic coronary artery disease and statin therapy in patients with chronic coronary artery disease), endocrinologist diabetes care (kidney function testing and oral glucose-lowering medication therapy in patients with diabetes), gastroenterologist gastrointestinal care (polyp detection rates on screening colonoscopy and endoscopy in patients with gastroesophageal reflux disease and no alarm symptoms), pulmonologist COPD and asthma care (bronchodilator and spirometry in patients with COPD), obstetrician prenatal and delivery care (appropriate prenatal screening in pregnant patients and Caesarean delivery in patients with low-risk pregnancy), orthopedist joint care (any physical therapy prior to elective hip or knee replacement and arthroscopy in patients with new hip or knee osteoarthritis), and orthopedist or neurosurgeon spine care (spinal fusion in patients with low back pain and physical therapy in patients with cervical spine pain).
In defining these measures, we first selected clinical scenarios that are common. For example, back pain is one of the most common chief complaints presented to physicians in the outpatient setting in the U.S. Similarly, childbirth is a common event in the general population. Second, we focused on clinical scenarios in which specific observable events reflecting appropriateness of care would be plausibly captured in the administrative claims data. Observable events included the presence of a clinical service, test, or prescription drug. We assessed such events in a defined population of patients who were in each clinical scenario (e.g. physical therapy prior to elective hip or knee replacement and arthroscopy in patients with new hip or knee osteoarthritis).
With attention toward measure validity, we defined the measures to be appropriate for specific patient populations observable in the claims data as recommended in practice guidelines, the literature, or established quality measurement definitions. For example, a low-risk pregnancy was defined as the absence of multiple gestation; maternal endocrine, gastrointestinal, or cardiovascular conditions (either pre-existing or gestational); conditions of the placenta, amniotic fluid, and uterus; and fetal conditions including fetal abnormalities, fetal demise, and reduced fetal movement or growth (more details can be found in eFigures 1-7).
To improve measure validity, 9 of the 14 measures largely adhered to specifications from the National Quality Forum, NCQA Healthcare Effectiveness Data and Information Set, Agency for Healthcare Research and Quality, and Centers for Medicare and Medicaid Services. These specifications have gone through rigorous peer review and are commonly used by Medicare and commercial insurers. For measures focused on clinically indicated medications, such as the share of patients with coronary artery disease on a statin or share of patients with diabetes mellitus on an oral glucose-lowering agent, patient-level adherence was measured each month. Aggregated to the physician-level, such a measure reflects the proportion of months a physician's attributed patients were on the indicated medication.
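As an illustration only (not the study's code), the monthly medication measure could be aggregated to the physician level along the following lines. The record layout, the identifiers, and the choice to pool all patient-months equally within a physician are assumptions made for this sketch; the source does not state how patient-months were weighted.

```python
from collections import defaultdict


def physician_medication_share(patient_months):
    """Share of attributed patient-months on the indicated medication.

    patient_months: iterable of (physician_id, patient_id, month, on_medication)
    tuples, one per patient per month (hypothetical layout).
    Assumption: every patient-month counts equally within a physician.
    """
    covered = defaultdict(int)  # months on the medication, per physician
    total = defaultdict(int)    # all observed patient-months, per physician
    for physician_id, _patient_id, _month, on_medication in patient_months:
        total[physician_id] += 1
        if on_medication:
            covered[physician_id] += 1
    return {doc: covered[doc] / total[doc] for doc in total}


# Hypothetical records: two cardiologists, three patients, two months each.
records = [
    ("dr_a", "p1", "2016-01", True),
    ("dr_a", "p1", "2016-02", False),
    ("dr_a", "p2", "2016-01", True),
    ("dr_a", "p2", "2016-02", True),
    ("dr_b", "p3", "2016-01", False),
    ("dr_b", "p3", "2016-02", True),
]
shares = physician_medication_share(records)
print(shares)  # {'dr_a': 0.75, 'dr_b': 0.5}
```

An alternative design would first average within each patient and then across patients, which weights patients equally rather than months.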
Given that we estimated physician-level variations in measure performance, we linked a given measure to the specialty most likely responsible for performance on the measure (e.g. for measures on childbirth, we linked births to the obstetrician on the delivery claim). Some measures, however, such as patients with coronary artery disease on a statin (which we defined as a cardiologist measure), could be influenced by other specialties such as primary care. In the data, we were not able to disentangle how much of a practice pattern was driven by primary care, recommended by specialists and implemented by other clinicians, or driven by specialists. Thus, to the extent that performance was driven by primary care or other clinicians, this study attributes that to the specialist, who would plausibly review those clinical decisions by colleagues and modify them as necessary upon seeing the patient.
We calculated the reliability of each measure for each physician using the signal-to-noise approach, which examines the ratio of the variation between physicians to the total variation within a measure, the latter comprising between-provider plus within-provider variation. A linear random effects model was fit to estimate the clinical event of interest (e.g., the numerator), with the provider as the independent variable, and a random intercept. The covariance parameter estimate serves as the between-provider variation; the within-provider variation was calculated by the sum of the squared residuals from the model, divided by N(N-1). Consistent with prior work, a reliability greater than 0.7 was considered high and a reliability between 0.4 and 0.7 was considered acceptable. To improve the reliability of measures and exclude physicians with low case volumes, we defined a minimum threshold of 10 patient cases conforming to a measure's clinical scenario that a physician must have to be included in the measure. For example, for measures concerning low-risk pregnancy, an obstetrician must have performed at least 10 low-risk deliveries to be eligible. The average and interquartile range of the number of patients per physician for each measure are shown in eMethods 3.
We assigned specialties using the principal specialty code in the claims. Physician specialties were defined systematically using the National Uniform Claim Committee (NUCC) Provider Taxonomy as reported in the Centers for Medicare and Medicaid Services (CMS) National Plan and Provider Enumeration System (NPPES).
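The signal-to-noise calculation described above can be sketched as follows. This is a simplified illustration, not the study's implementation: the between-provider variance is passed in as a given number (in the study it came from the random effects model), residuals are approximated as deviations from the physician's own mean, N is read as that physician's number of cases, and the treatment of the exact boundary values 0.4 and 0.7 is an assumption.

```python
MIN_CASES = 10  # minimum qualifying cases per physician, as stated in the text


def reliability(outcomes, between_var):
    """Signal-to-noise reliability for one physician on one measure.

    outcomes: that physician's patient-level outcomes (e.g. 0/1 events).
    between_var: between-provider variance, assumed to be supplied from a
    separately fitted random effects model.
    """
    n = len(outcomes)
    if n < 2:
        raise ValueError("need at least 2 cases to estimate within-provider variation")
    if between_var < 0:
        raise ValueError("between_var must be non-negative")
    mean = sum(outcomes) / n
    # Assumption: residuals are deviations from the physician's own mean.
    ss_resid = sum((y - mean) ** 2 for y in outcomes)
    within_var = ss_resid / (n * (n - 1))
    total_var = between_var + within_var
    if total_var == 0:
        raise ValueError("no variation: reliability is undefined")
    return between_var / total_var


def classify(r):
    """Label a reliability value using the thresholds quoted in the text."""
    if r > 0.7:
        return "high"
    if r >= 0.4:  # assumption: 0.4 itself counts as acceptable
        return "acceptable"
    return "low"


# Hypothetical physician with exactly 10 cases, 6 of which had the event.
cases = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0]
eligible = len(cases) >= MIN_CASES
r = reliability(cases, between_var=0.04)
print(eligible, round(r, 3), classify(r))
```

Because the within-provider term shrinks roughly in proportion to 1/N, reliability rises with case volume, which is the rationale for the 10-case minimum.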

B. Statistical Analysis
Within each measure, adjusted differences of physician performance were estimated using data at the patient level; in other words, the patient was the unit of analysis. We used a standard model that compared patients attributed to physicians in quintile 1 (best performance) in a given measure to patients attributed to physicians in each of the subsequent quintiles in the same measure. At the patient level, because a given measure's output could be expressed as a binary outcome (e.g. for a patient with new hip or knee osteoarthritis attributed to an orthopedic surgeon, whether the patient received arthroscopic surgery) or a count-based outcome (e.g. for a patient with stable chronic coronary artery disease attributed to a cardiologist, the number of stress tests received), the model used a logit or linear functional form.

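$$Y_{ijk} = \beta_0 + \beta_1\,\mathrm{DxCG}_{i} + \beta_2\,\mathrm{SES}_{i} + \beta_3\,\mathrm{Quintile}_{jk} + \varepsilon_{ijk}$$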
In this model, Yijk denotes performance on a given measure for patient i, who is attributed to physician j, in metropolitan statistical area k. DxCG denotes the DxCG risk score, which uses age, sex, and clinical diagnoses to derive a measure that reflects expected spending and is often used by insurers for risk adjustment. SES denotes a composite score of social determinants of health derived from seven U.S. Census variables based on the patient's ZIP code of residence, according to the methodology of the Agency for Healthcare Research and Quality. 1 These seven variables were: median household income, percent with less than high school graduation among populations 25 years and over, percent with bachelor's degree or higher among populations 25 years and over, percent of families below the poverty level among all people, civilian labor force unemployment rate, median value of owner-occupied units, and percentage of households containing one or more person per room. Quintile denotes a binary variable indicating the quintile of performance of an attributed physician (quintiles 2-5) relative to the reference quintile (quintile 1) for a given measure in a metropolitan statistical area. Within each measure in each metropolitan statistical area, quintile 1 was defined as physicians whose performance reflected the most favorable end of the distribution, denoting more appropriate or guideline-concordant care on average, relative to the subsequent quintiles 2-5. This produced the coefficient of interest, which captured the magnitude of the difference in mean performance between quintile 1 physicians and physicians in one of the subsequent quintiles. Standard errors were clustered at the physician level.
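To make the quintile definition concrete, the following sketch assigns physicians within one metropolitan statistical area to performance quintiles, oriented so that quintile 1 is always the most favorable. It is an illustration under stated assumptions: the rank-based cut points and the handling of ties (broken by physician identifier) are choices made here, not details given in the source, and the physician identifiers are hypothetical.

```python
def assign_quintiles(performance, higher_is_better):
    """Map each physician to a quintile (1 = most favorable, 5 = least).

    performance: dict of physician_id -> measure performance within one MSA.
    higher_is_better: True for measures where a higher proportion is favorable
    (e.g. statin therapy), False where lower is favorable (e.g. endoscopy in
    GERD without alarm symptoms).
    Assumption: rank-based cut points; ties are broken by physician_id.
    """
    sign = -1 if higher_is_better else 1
    # Sort from most favorable to least favorable performance.
    ordered = sorted(performance, key=lambda doc: (sign * performance[doc], doc))
    n = len(ordered)
    return {doc: (rank * 5) // n + 1 for rank, doc in enumerate(ordered)}


# Hypothetical MSA with 10 physicians whose rates run from 0.1 to 1.0.
rates = {f"d{i}": i / 10 for i in range(1, 11)}
q_high = assign_quintiles(rates, higher_is_better=True)
q_low = assign_quintiles(rates, higher_is_better=False)
print(q_high["d10"], q_high["d1"])  # 1 5
print(q_low["d10"], q_low["d1"])    # 5 1
```

Each patient would then inherit the quintile of the attributed physician, and an indicator for a given quintile (2 through 5) versus quintile 1 would enter the patient-level model.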
Due to the large number of comparisons in the study across the 14 measures and 5 metropolitan statistical areas, each of which we took to be of equal importance relative to the others (in other words, we did not have a primary outcome), we did not conduct individual statistical tests for each difference in mean performance. We used 95% confidence intervals to convey the uncertainty around mean differences in performance. eTables 1-7 show results from the base model and results from sensitivity analyses. Model 1 results are the main estimates. Sensitivity analyses tested the robustness of the main estimates to alterations in the model, focusing on the role of clinical risk and socioeconomic status adjustment. Model 2 omitted clustered standard errors. Model 3 omitted the DxCG risk score. Model 4 omitted the SES score. Model 5 omitted both the DxCG and SES scores. Stable estimates among sensitivity analyses relative to Model 1 would suggest that any observable differences in patient age, sex, clinical diagnoses, and socioeconomic status characteristics across physician quintiles contributed minimal bias toward the differences in performance between the quintiles. However, they would not adjust for all potential confounders. The interpretation of the regression coefficients varies by measure (e.g. average differences in utilization or percentage point differences in measure performance between quintiles).
Diabetes is defined as a diagnosis of type 1 or type 2 diabetes within the past 5 years. Numerator: patients in the denominator who filled a prescription for an oral glucose-lowering agent. Oral glucose-lowering agents include metformin, sulfonylureas, thiazolidinediones, biguanides, bile acid agents, newer agents such as the sodium-glucose cotransporter-2 (SGLT-2) inhibitors, and others. A higher proportion denotes more favorable performance. Attribution was to the endocrinologist with whom the patient had the most office visits within 3 years.
* Denominator: patients 50 years of age or older who had a screening colonoscopy and had at least one documented risk factor for colorectal cancer (e.g. family history of colonic polyps, family history of malignant neoplasm of digestive organs, and encounter for screening for malignant neoplasm of colon). A screening colonoscopy is performed for colorectal cancer prevention and is distinct from a diagnostic colonoscopy, which is often performed during a diagnostic workup such as in response to a gastrointestinal bleed or other signs and symptoms. Numerator: patients in the denominator whose colonoscopy revealed a polyp, whether it was benign, an adenoma, or a colorectal cancer. A higher proportion denotes more favorable performance. Attribution was to the gastroenterologist who performed the screening colonoscopy.
* Denominator: patients who had a diagnosis of GERD in the past 5 years and who have had no alarm symptoms. Alarm symptoms include weight loss, anemia, bleeding, dysphagia, and odynophagia and render endoscopy more appropriate in the setting of GERD. Numerator: patients in the denominator who had an upper endoscopy during the measurement period. Exclusions: a diagnosis of GERD greater than 5 years prior; presence of alarm symptoms. Continuous enrollment required. A lower proportion denotes more favorable performance. Attribution was to the gastroenterologist with whom a patient had the most office visits within 3 years. Given the apparent absence of such a measure in the public inventory, this measure is novel. Measure validity relies on the practice guidelines and clinical evidence that indicate, on average, endoscopy in patients with GERD and no alarm symptoms is less appropriate, as noted in Methods S1.
* Denominator: patients who had an elective hip or knee replacement. Elective is defined as an elective surgical admission in the absence of emergency care. Numerator: patients in the denominator who had physical therapy within 4 months before the procedure. A higher proportion denotes more favorable performance. Exclusions: transfers from one hospital to another hospital for surgery. Required 120 days (4 months) of continuous enrollment prior to surgery. Attribution was to the orthopedic surgeon with whom the patient had the most visits, prior to surgery if applicable. This definition and the measure validity were constructed from practice guidelines and the clinical literature, including those from the American Academy of Orthopaedic Surgeons, American Physical Therapy Association, and Agency for Healthcare Research and Quality that advise physical therapy prior to lower extremity joint replacement (see Methods S1 in this document). This measure was not adapted from a measure in the public domain, although its specification followed the model of other procedural measures in the public domain. Its validity is further enhanced by restricting to cases of elective joint replacement, for which the opportunity for at least one session of a physical therapy trial was plausibly available.
Note: Data enabling linkages of physicians to their affiliated organization were available only for the South Central metropolitan statistical area (MSA). In this MSA, we were able to identify 4 health systems, each of which is a physician organization (labeled A through D in the graphs below). This table shows the proportion of specialists in each measure in this MSA who were affiliated with a health system.
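As a worked illustration of the physical therapy measure specified above, the sketch below flags whether each elective joint replacement was preceded by physical therapy within the 120-day window and computes the resulting proportion per surgeon. The record layout and identifiers are hypothetical, and the sketch assumes that exclusions, continuous enrollment, attribution, and the 10-case minimum have already been applied upstream.

```python
from datetime import date, timedelta

PT_WINDOW_DAYS = 120  # 4 months before surgery, per the measure definition


def had_prior_pt(surgery_date, pt_dates, window_days=PT_WINDOW_DAYS):
    """True if any physical therapy visit falls in the window before surgery."""
    window_start = surgery_date - timedelta(days=window_days)
    # Assumption: a visit on the day of surgery itself does not count.
    return any(window_start <= d < surgery_date for d in pt_dates)


def pt_rate_by_surgeon(cases):
    """Proportion of each surgeon's cases with prior physical therapy.

    cases: iterable of (surgeon_id, surgery_date, pt_dates) tuples, one per
    eligible elective hip or knee replacement (hypothetical layout).
    """
    numerator, denominator = {}, {}
    for surgeon_id, surgery_date, pt_dates in cases:
        denominator[surgeon_id] = denominator.get(surgeon_id, 0) + 1
        hit = int(had_prior_pt(surgery_date, pt_dates))
        numerator[surgeon_id] = numerator.get(surgeon_id, 0) + hit
    return {s: numerator[s] / denominator[s] for s in denominator}


example_cases = [
    ("ortho_a", date(2016, 6, 1), [date(2016, 4, 15)]),    # inside the window
    ("ortho_a", date(2016, 6, 1), [date(2015, 12, 1)]),    # too early
    ("ortho_a", date(2016, 9, 1), []),                     # no physical therapy
    ("ortho_a", date(2016, 9, 1), [date(2016, 8, 20), date(2016, 9, 10)]),
    ("ortho_b", date(2016, 3, 1), [date(2016, 3, 5)]),     # only after surgery
]
pt_rates = pt_rate_by_surgeon(example_cases)
print(pt_rates)  # {'ortho_a': 0.5, 'ortho_b': 0.0}
```

For this measure a higher proportion is more favorable, so a surgeon such as the hypothetical ortho_a would rank ahead of ortho_b.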