Individual information has the potential to be individually identifying information. In their article, Na et al1 demonstrate the potential for activity data collected by commercially available smart watches, smartphones, and fitness trackers to contribute to probabilistic reidentification of research participants. Activity tracker data join a long list of previously reported data types that can be reidentified (eg, those described by Erlich and Narayanan2 and Sweeney3). Given this history, the results of Na et al are not surprising; however, these findings are important because they speak to a core value of medicine—confidentiality—in a context of growing relevance: waveform data of the sort used by Na and colleagues are becoming more common with the widespread availability of sensors to generate these data and the potential for remote monitoring reimbursement to speed their clinical adoption.4
Confidentiality and privacy are foundational principles of both medical practice and research, as captured in the Hippocratic oath, Declaration of Geneva, American Medical Association Code of Medical Ethics, and Declaration of Helsinki. More than 30 years ago, Siegler5 recognized the potential for emerging team-based high-technology medical care to substantially alter the foundational confidentiality patients could expect from medicine. Na et al1 build on Siegler’s work and the prior literature on reidentification as a reminder to researchers and physicians that the nature of confidentiality we provide to patients evolves with technology, which frequently changes faster than patient expectations.
The issue is not merely what happens when higher-tech health care is delivered as expected. Health care providers and health plans (terms used in the US Health and Human Services database) often fall short of the goal of perfect confidentiality protections—recent evidence shows a large and growing number of protected health information disclosures.6 Those disclosures of identified health data are of particular concern in the context of the findings of Na and colleagues, as the identified records disclosed in medical record breaches serve as a beachhead of known identity from which to launch efforts at reidentification of other, theoretically deidentified, data. Any breaches of future health records that include remote monitoring and patient-generated activity data will increase the relevance of the study by Na et al.1
Important work remains to be done. Those following Na and colleagues1 will need to square this result with the existing literature on quasi-identifiers to establish the extent to which the reidentification accuracy seen in this article is attributable to the included quasi-identifying demographic microdata,7 and with the literature on statistical disclosure controls to establish the potential to anonymize activity data.8-10 At present, the result of the study by Na et al1 is, to some extent, a worst-case scenario for privacy, that is, a setting especially favorable to reidentification of activity data: their study enforced homogeneity of devices and device carry positions and collected each participant’s data within the same week so that the data were unaffected by long-term seasonal variability. Together, these factors may make reidentification of activity data less difficult. Including diverse devices and device carry positions makes activity recognition more difficult,11 and we expect such diversity might make reidentifying participants more difficult too.
Most importantly, if given a larger cohort of participants, the reidentification accuracy reported by Na et al1 should decrease, not increase, because sparse categorical demographic characteristics, not the activity data themselves, play the most significant role in overall reidentification accuracy. The analysis without activity data highlights the overall result’s reliance on segmentation of each studied cohort by sparse demographic categories (eg, age, sex, and education). From Table 3 in the article, we see that adult reidentification accuracy on demographic characteristics alone is more than 80% for both cohorts (2003-2004 and 2005-2006), whereas the accuracy from activity data alone is less than 7%. The reidentification from demographic categories is achievable with simple matching, not complex learned patterns; as such, if the number of categories is fixed, the accuracy should decrease, not increase, as additional participants are added, because more participants will share the same categorical features (Figure).8 This stands in contrast to a predictive system that learns trends from data, where accuracy typically increases with training set size (up to some plateau).
Figure. This analysis includes the full National Health and Nutrition Examination Survey cohort (rather than splitting it into age-based subpopulations) to give realistic trends in a sample twice the size of that reported by Na et al.1 The lines trace the accuracy profiles of an expanding set of categorical variables for the task of reidentification through simple matching. In the key, each variable’s number of distinct categories is given in parentheses. Reproducible code is available online at https://github.com/michaelchughes/reidentifying_subjects_from_demographics.
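The matching argument can be illustrated with a minimal sketch. The Python below is not the analysis behind the Figure (that code is at the repository linked above); it assumes a synthetic cohort whose categorical features are drawn uniformly at random, and the category counts (2, 8, 5) are hypothetical stand-ins for variables such as sex, age band, and education. It also assumes the matcher picks uniformly at random among records sharing the target's categories, so a participant in a group of size g is found with probability 1/g and the expected accuracy reduces to the number of distinct category combinations divided by the cohort size.

```python
import random
from collections import Counter


def matching_accuracy(records):
    """Expected accuracy of reidentification by exact categorical matching.

    Assumption: the matcher picks uniformly among records that share the
    target's category tuple. Each group of size g contributes g * (1/g) = 1
    correct match in expectation, so accuracy = distinct tuples / cohort size.
    """
    groups = Counter(records)
    return len(groups) / len(records)


def synthetic_cohort(n, category_counts, rng):
    """Draw n participants with uniformly random categories (illustrative only)."""
    return [tuple(rng.randrange(k) for k in category_counts) for _ in range(n)]


rng = random.Random(0)
category_counts = (2, 8, 5)  # hypothetical: sex, age band, education level

# With the number of categories fixed, accuracy is bounded by
# (number of possible combinations) / n, which shrinks as n grows.
results = {}
for n in (50, 200, 800, 3200):
    cohort = synthetic_cohort(n, category_counts, rng)
    results[n] = matching_accuracy(cohort)
    print(n, round(results[n], 3))
```

Because the number of possible combinations is fixed (here 2 × 8 × 5 = 80), the accuracy in this sketch can never exceed 80/n, which is the mechanism by which larger cohorts lower matching accuracy. Real demographic variables are not uniformly distributed, so actual values would differ.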
Independent of the technical threat to confidentiality that Na et al1 report, the potential legal and social consequences of that threat, and the possible solutions to it, are numerous and deserve further attention.12,13 New data sources that are identifying with sufficient effort will continue to be found in medical research and practice as new data types are introduced. Because digital records enable infinite perfect duplicates, reidentification is a threat that outlives the patient; over that potentially infinite time scale, it includes the risk of linkage to many data sets, using many computational methods, that could not have been foreseen at the time of initial data collection or release. Privacy and confidentiality are core values of medicine and require continuous stewardship, application, and interpretation in the context of evolving practice conditions. Emerging technologies will bring improved care and understanding of disease. These technologies will come with new risks too, risks that may never be wholly removed. Physicians have balanced real risks and benefits for millennia by acknowledging and quantifying both; now is not the time to stop.
Published: December 21, 2018. doi:10.1001/jamanetworkopen.2018.6029
Open Access: This is an open access article distributed under the terms of the CC-BY License. © 2018 McCoy TH Jr et al. JAMA Network Open.
Corresponding Author: Thomas H. McCoy Jr, MD, Massachusetts General Hospital, 185 Cambridge St, Simches Research Bldg, Sixth Floor, Boston, MA 02114 (firstname.lastname@example.org).
Conflict of Interest Disclosures: Dr McCoy receives research funding from The Stanley Center at the Broad Institute, the Brain & Behavior Research Foundation, the National Institute on Aging, and Telefonica Alpha. Dr Hughes reported grants from Oracle Labs outside the submitted work. No other disclosures were reported.
McCoy TH, Hughes MC. Preserving Patient Confidentiality as Data Grow: Implications of the Ability to Reidentify Physical Activity Data. JAMA Netw Open. 2018;1(8):e186029. doi:10.1001/jamanetworkopen.2018.6029