Estimates of COVID-19 Cases and Deaths Among Nursing Home Residents Not Reported in Federal Data

Key Points
Question: How many COVID-19 cases and deaths at nursing homes were missed in the federal National Healthcare Safety Network (NHSN) reporting system owing to the delayed start in required reporting?
Findings: In this cross-sectional study of 15 307 US nursing homes, approximately 44% of COVID-19 cases and 40% of COVID-19 deaths that occurred before the start of reporting were not reported in the first NHSN submission in sample states, suggesting there were more than 68 000 unreported cases and 16 000 unreported deaths nationally.
Meaning: These findings suggest that federal NHSN data understate total COVID-19 cases and deaths in nursing homes and that using these data without accounting for this issue may result in misleading conclusions about the determinants of nursing home outbreaks.


eAppendix 1. Further Information on State Health Department Data
The table below summarizes the facility-level data we collected from state health departments. The second column notes the date of the state report that we used; all were within 1 week of May 24, 2020. Because some state data includes other types of long-term care facilities or congregate living settings (e.g., assisted living facilities), our first step in cleaning the data was to match each facility in the state data to the CMS Nursing Home Compare database in order to exclude non-nursing homes from our estimates. We performed this matching using an algorithm that gave each potential match a score based on whether the two records returned the same establishment in the Google Maps API, a fuzzy-matching score of the facility names, and whether the two facilities shared the same geographic identifiers, where available (address, city, or county). We hand-checked all matches below a certain score threshold. The third column of Table A1 documents which variables were available in the state data for matching, and whether the state data included non-nursing homes. In some cases, the state data included non-nursing homes but also included a facility type variable, allowing us to simply restrict the data to nursing homes. In other states, non-nursing homes were included and there was no facility type variable. For these states, where possible, we gathered data on the names and locations of licensed assisted living facilities to improve the matching process (we then ran the matching algorithm on the combined set of nursing homes from Nursing Home Compare and the other facilities from the state data).
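The scoring step described above can be sketched as follows. This is only an illustration under stated assumptions: the weights, the review threshold, the record field names, and the use of Python's `difflib` for the fuzzy name score are all hypothetical, since the appendix does not report them. The Google Maps comparison is represented by a precomputed boolean rather than an actual API call.

```python
from difflib import SequenceMatcher

# Hypothetical weights and review threshold; the appendix does not state the values used.
WEIGHTS = {"maps": 0.4, "name": 0.4, "geo": 0.2}
REVIEW_THRESHOLD = 0.8


def normalize(text):
    """Lowercase and collapse whitespace so trivial differences do not affect matching."""
    return " ".join(text.lower().split())


def name_similarity(a, b):
    """Fuzzy name score in [0, 1] from difflib (a stand-in for the authors' matcher)."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def geo_agreement(state_rec, nhc_rec, fields=("address", "city", "county")):
    """Share of geographic identifiers present in both records that agree."""
    shared = [f for f in fields if state_rec.get(f) and nhc_rec.get(f)]
    if not shared:
        return 0.0
    hits = sum(normalize(state_rec[f]) == normalize(nhc_rec[f]) for f in shared)
    return hits / len(shared)


def match_score(state_rec, nhc_rec, same_maps_place):
    """Combine the three signals; same_maps_place is a precomputed boolean."""
    return (
        WEIGHTS["maps"] * (1.0 if same_maps_place else 0.0)
        + WEIGHTS["name"] * name_similarity(state_rec["name"], nhc_rec["name"])
        + WEIGHTS["geo"] * geo_agreement(state_rec, nhc_rec)
    )


def needs_hand_check(score):
    """Flag candidate matches that fall below the review threshold."""
    return score < REVIEW_THRESHOLD
```

A weighted sum is used here only because it is simple to read; any monotone combination of the three signals would serve the same purpose of ranking candidate matches and routing weak ones to manual review.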
Finally, the fourth column notes details about the case and death measures that states reported. We attempt to use the available measure closest to total confirmed and probable cases and deaths among residents; however, there are some notable differences across states in what this measure is. For example, some states report only laboratory-confirmed cases and deaths, others report cases and deaths only at facilities with "outbreaks" (usually defined as a certain number of cases in a given set of days), and some states may be missing cases and deaths for transferred residents or those that occurred before a certain date. The next section discusses implications of these reporting differences for our results. Some states censor their data or provide the data in ranges; in these cases, we use the midpoint of the range. We do not use measures of cases from IL and TN because they include both residents and staff and are thus likely to significantly overstate cases but not deaths (since deaths among residents are overwhelmingly higher than deaths among staff), and we do not use cases from MA because they report cases in very coarse buckets. Two states (MD and TN) remove facilities from the data if a certain number of days has passed since the facility last reported a case. For these states, we pull the entire history of reports and use the last observation of each facility.
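Two of the cleaning rules above, taking the midpoint of a reported range and keeping the last observation per facility, can be illustrated with a short sketch. The "low-high" string format and the record keys `facility_id` and `report_date` are hypothetical; the actual state files may encode ranges and dates differently.

```python
def range_midpoint(value):
    """Return a numeric count; a 'low-high' string is replaced by its midpoint."""
    if isinstance(value, (int, float)):
        return float(value)
    text = str(value).strip()
    if "-" in text:
        low, high = text.split("-", 1)
        return (float(low) + float(high)) / 2
    return float(text)


def last_observation(reports):
    """Keep each facility's most recent report.

    reports: iterable of dicts with hypothetical keys 'facility_id' and
    'report_date' (ISO 8601 date strings, which sort chronologically).
    """
    latest = {}
    for rep in reports:
        fid = rep["facility_id"]
        if fid not in latest or rep["report_date"] > latest[fid]["report_date"]:
            latest[fid] = rep
    return latest
```

Keeping the last observation means a facility that was dropped from a later state report still contributes the counts from the final report in which it appeared.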

eAppendix 2. Further Information on the Extrapolation Method
Our extrapolation method relies on the assumption that the degree of under-reporting in non-sample states was similar to the degree of under-reporting in sample states, conditional on observable characteristics. To extrapolate data from our sample states to the states where we do not have state health department data as of May 24, we estimate a predicted probability of non-reporting for cases and for deaths for each facility, and divide the facility's reported estimate of pre-May 24 cases or deaths by this predicted probability.
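The scaling step can be sketched as below. The paragraph above describes the divisor as a predicted probability; this sketch assumes the divisor is expressed as the predicted share of pre-May 24 counts that a facility did report (that is, one minus the non-reporting probability), because dividing a reported count by that share scales it up to an estimated total. That reading is an assumption and is not confirmed by the text, and the function and key names are hypothetical. The model that produces the predicted share is not shown.

```python
def extrapolate_total(reported_count, predicted_share_reported):
    """Scale a facility's reported count up to an estimated total.

    predicted_share_reported is assumed to lie in (0, 1]; it would come from
    a model fitted on sample states, which is outside this sketch.
    """
    if not 0 < predicted_share_reported <= 1:
        raise ValueError("predicted share must be in (0, 1]")
    return reported_count / predicted_share_reported


def estimated_unreported(facilities):
    """Sum (estimated total - reported) over facilities in non-sample states.

    facilities: iterable of dicts with hypothetical keys 'reported' and 'share'.
    """
    return sum(
        extrapolate_total(f["reported"], f["share"]) - f["reported"]
        for f in facilities
    )
```

For instance, under this reading a facility that reported 6 pre-May 24 cases with a predicted reported share of 0.6 would be assigned an estimated total of about 10 cases, of which about 4 were unreported.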
The figure below shows our estimates for sample states, prior to any extrapolation. Our extrapolation method would be valid under a model of under-reporting where facilities in sample states and non-sample states were equally likely to omit retrospective cases and deaths in their first NHSN submissions. In unreported results, we assessed the reasonableness of this assumption by calculating the under-reporting percentages in each sample state separately. We found that most states appear to fall in a range of 60-100% for both cases and deaths. In addition, Figure 2 in the main text shows that there does not seem to be much systematic variation by other characteristics.

eAppendix 3. Further Information on Differences in State Reporting in State Data Using More Recent State Reports
An important caveat to our results is that state health departments may have differed in what cases and deaths they reported. For example, NY has been criticized for not including resident deaths that occurred outside the facility. Some other states only report data for facilities with a certain number of cases in a given timeframe. In both of these cases, the May 24 state data would thus still understate the true number of nursing home cases and deaths in these states. In this section, we collected later state reports to understand the impact of state reporting differences on our results. If the state and federal reporting requirements were the same, the counts of new cases and deaths after May 24 should match in the state and federal data. Table A2 compares state and federal counts of cases and deaths after May 24 using later state reports. We find that for several states (CA, CO, GA, KY, PA), the state and federal data for cases and deaths align quite well after May 24. In some other states, the state data appears to outpace federal data (CT, FL, MA, NJ, RI), suggesting that the undercount in these states may be somewhat overstated. Finally, in a few states, the state data is notably lower than the federal data (NH, TN, NY). This suggests that the actual undercount in these states is likely higher than implied by Figure 2; in particular, the true toll in New York was likely even higher than what is reported in Figure 2. We note that using this table to understand the true size of the undercount assumes that reporting differences have a constant effect over time, which may not be true depending on the nature of the reporting difference. For example, Kentucky may be missing cases and deaths in their early data because they started reporting on March 7, but their more recent data may be unbiased.
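The comparison behind Table A2 can be sketched as follows. The 10% tolerance, the function names, and the category labels are hypothetical; the appendix describes the comparison qualitatively and does not give a numeric cutoff for "align quite well".

```python
def new_counts_after(cumulative_later, cumulative_may24):
    """New cases or deaths accrued between May 24 and the later report."""
    return cumulative_later - cumulative_may24


def compare_state_to_federal(state_new, federal_new, tolerance=0.10):
    """Classify how state post-May 24 counts relate to federal counts.

    tolerance is a hypothetical cutoff for treating the two series as aligned.
    """
    if federal_new == 0:
        return "undefined"
    ratio = state_new / federal_new
    if abs(ratio - 1) <= tolerance:
        return "aligned"
    # State outpacing federal suggests the undercount may be overstated;
    # state trailing federal suggests the undercount is likely higher.
    return "state higher" if ratio > 1 else "state lower"
```

As the text cautions, this kind of check is only informative about the pre-May 24 undercount if reporting differences have a constant effect over time.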