December 21, 2018

Preserving Patient Confidentiality as Data Grow: Implications of the Ability to Reidentify Physical Activity Data

Author Affiliations
  • 1Center for Quantitative Health, Massachusetts General Hospital, Boston
  • 2Department of Computer Science, Tufts University, Medford, Massachusetts
JAMA Netw Open. 2018;1(8):e186029. doi:10.1001/jamanetworkopen.2018.6029

Individual information has the potential to be individually identifying information. In their article, Na et al1 demonstrate the potential for activity data collected by commercially available smart watches, smartphones, and fitness trackers to contribute to probabilistic reidentification of research participants. Activity tracker data join a long list of previously reported data types that can be reidentified (eg, those described by Erlich and Narayanan2 and Sweeney3). Given this history, the results of Na et al are not surprising; however, these findings are important because they speak to a core value of medicine—confidentiality—in a context of growing relevance: waveform data of the sort used by Na and colleagues are becoming more common with the widespread availability of sensors to generate these data and the potential for remote monitoring reimbursement to speed their clinical adoption.4

Confidentiality and privacy are foundational principles of both medical practice and research, as captured in the Hippocratic oath, Declaration of Geneva, American Medical Association Code of Medical Ethics, and Declaration of Helsinki. More than 30 years ago, Siegler5 recognized the potential for emerging team-based high-technology medical care to substantially alter the foundational confidentiality patients could expect from medicine. Na et al1 build on Siegler’s work and the prior literature on reidentification as a reminder to researchers and physicians that the nature of confidentiality we provide to patients evolves with technology, which frequently changes faster than patient expectations.

The issue is not merely what happens when higher-tech health care is delivered as expected. Health care providers and health plans (terms used in the US Department of Health and Human Services breach database) often fall short of the goal of perfect confidentiality protections—recent evidence shows a large and growing number of protected health information disclosures.6 Those disclosures of identified health data are of particular concern in the context of the findings of Na and colleagues, as the identified records disclosed in medical record breaches serve as a beachhead of known identity from which to launch efforts at reidentification of other, theoretically deidentified, data. Any breaches of future health records that include remote monitoring and patient-generated activity data will increase the relevance of the study by Na et al.1

Important work remains to be done. Those following Na and colleagues1 will need to square this result with the existing literature on quasi-identifiers to establish the extent to which the reidentification accuracy seen in this article is attributable to the included quasi-identifying demographic microdata7 as well as the literature on statistical disclosure controls to establish the potential to anonymize activity data.8-10 At present, the result of the study by Na et al1 is, to some extent, a worst-case scenario for reidentification of activity data: their study enforced homogeneity of devices and device carry positions and collected each participant’s data within the same week so that the data were unaffected by long-term seasonal variability. Together, these factors may make reidentification of activity data less difficult. Including diverse devices and device carry positions makes activity recognition more difficult,11 and we expect such diversity might make reidentifying participants more difficult too.
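One concrete tool from the statistical disclosure control literature cited above is k-anonymity,8 which requires that every combination of quasi-identifier values be shared by at least k records. As a minimal illustrative sketch (the records and variable names below are hypothetical, not drawn from any study data), the k of a dataset can be computed as the size of its smallest quasi-identifier group:

```python
from collections import Counter

def k_anonymity(records, quasi_identifiers):
    """Return the k-anonymity level of a dataset: the size of the smallest
    group of records sharing identical quasi-identifier values."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values())

# Hypothetical demographic records; the third is unique, so k = 1,
# meaning at least one individual is exactly pinpointed by these fields.
records = [
    {"age_band": "40-49", "sex": "F", "education": "college"},
    {"age_band": "40-49", "sex": "F", "education": "college"},
    {"age_band": "40-49", "sex": "M", "education": "college"},
]
print(k_anonymity(records, ["age_band", "sex", "education"]))  # prints 1
```

Generalization (eg, widening age bands) or suppression of rare combinations raises k at the cost of data utility; applying such controls to activity waveforms is the open question noted above.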

Most importantly, if given a larger cohort of participants, the reidentification accuracy reported by Na et al1 should decrease, not increase, because sparse categorical demographic characteristics, not the activity data itself, play the most significant role in overall reidentification accuracy. The analysis without activity data highlights the overall result’s reliance on segmentation of each studied cohort by sparse demographic categories (eg, age, sex, and education). From Table 3 in the article, we see that adult reidentification accuracy on demographic characteristics alone is more than 80% for both cohorts (2003-2004 and 2005-2006), whereas the accuracy from activity data alone is less than 7%. The reidentification from demographic categories is achievable with simple matching, not complex learned patterns; as such, if the number of categories is fixed, the accuracy should decrease, not increase, as additional participants are available, because more participants will have the same categorical features (Figure).8 This stands in contrast to a predictive system that learns trends from data, where accuracy typically increases with training set size (up to some plateau).
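The dilution argument above can be checked with a small simulation (a sketch under assumed parameters, not the authors' analysis): an adversary who matches on a demographic profile and then guesses uniformly among tied records succeeds, in expectation, with probability (number of distinct profiles observed) / (cohort size), which falls toward zero as the cohort grows past the fixed number of profiles.

```python
import random

def expected_match_accuracy(n, n_profiles, seed=0):
    """Expected reidentification accuracy for exact matching on demographic
    profiles. Each participant draws one of n_profiles profiles uniformly;
    each member of a tie group of size g is recovered with probability 1/g,
    so the expected accuracy is (#distinct profiles observed) / n."""
    rng = random.Random(seed)
    profiles = [rng.randrange(n_profiles) for _ in range(n)]
    return len(set(profiles)) / n

# With the number of profiles fixed (here 500, an assumed value),
# accuracy declines as the cohort grows.
for n in (100, 1_000, 10_000):
    print(n, round(expected_match_accuracy(n, n_profiles=500), 3))
```

The 500-profile figure is illustrative only; the qualitative trend (monotone decline once cohort size exceeds the number of distinct demographic cells) is the point, and it is the opposite of the scaling behavior of a learned predictive model.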

Figure.  Projected Decrease of Reidentification Accuracy in the National Health and Nutrition Examination Survey 2003-2004 Cohort Using Increasing Numbers of Demographic Categories With Increasing Cohort Size

This analysis includes the full National Health and Nutrition Examination Survey cohort (rather than splitting it into age-based subpopulations) to give realistic trends in a sample of twice the size reported by Na et al.1 The lines trace the accuracy profiles of an expanding set of categorical variables for the task of reidentification through simple matching. In the key, each variable’s number of distinct categories is given in parentheses. Reproducible code is available online at https://github.com/michaelchughes/reidentifying_subjects_from_demographics.
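The simple-matching procedure traced in the figure can be sketched in a few lines of Python (using synthetic uniform records with hypothetical variable cardinalities, not the actual National Health and Nutrition Examination Survey data or the linked repository's code): adding a categorical variable can only refine the existing groups, so matching accuracy is non-decreasing in the number of variables at a fixed cohort size.

```python
import random

def match_accuracy(records, variables):
    """Expected accuracy of reidentification by exact matching on the given
    variables: each record in a tie group of size g is recovered with
    probability 1/g, so expected accuracy = (#distinct groups) / (#records)."""
    groups = {tuple(r[v] for v in variables) for r in records}
    return len(groups) / len(records)

rng = random.Random(0)
# Hypothetical cardinalities for four demographic variables.
cardinality = {"age": 60, "sex": 2, "education": 5, "income": 12}
records = [{v: rng.randrange(c) for v, c in cardinality.items()}
           for _ in range(2_000)]
variables = list(cardinality)
for i in range(1, len(variables) + 1):
    print(variables[:i], round(match_accuracy(records, variables[:i]), 3))
```

Run at increasing cohort sizes with the variable set held fixed, the same function reproduces the declining profiles the figure projects.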

Independent of the technical threat to confidentiality that Na et al1 report, its potential legal and social consequences, and solutions to it, are numerous and deserve further attention.12,13 Data sources that are identifying given sufficient effort will continue to be discovered in medical research and practice as new data types are introduced. Because digital records enable infinite perfect duplicates, reidentification is a threat that outlives the patient; over that potentially unbounded time scale, it includes the risk of linkage to many data sets, using many computational methods, that could not have been foreseen at the time of initial data collection or release. Privacy and confidentiality are core values of medicine and require continuous stewardship, application, and interpretation in the context of evolving practice conditions. Emerging technologies will bring improved care and understanding of disease. These technologies will come with new risks too—risks that may never be wholly removed. Physicians have balanced real risks and benefits for millennia by acknowledging and quantifying both; now is not the time to stop.

Article Information

Published: December 21, 2018. doi:10.1001/jamanetworkopen.2018.6029

Open Access: This is an open access article distributed under the terms of the CC-BY License. © 2018 McCoy TH Jr et al. JAMA Network Open.

Corresponding Author: Thomas H. McCoy Jr, MD, Massachusetts General Hospital, 185 Cambridge St, Simches Research Bldg, Sixth Floor, Boston, MA 02114 (thmccoy@partners.org).

Conflict of Interest Disclosures: Dr McCoy receives research funding from The Stanley Center at the Broad Institute, the Brain & Behavior Research Foundation, the National Institute on Aging, and Telefonica Alpha. Dr Hughes reported grants from Oracle Labs outside the submitted work. No other disclosures were reported.

References

1. Na L, Yang C, Lo C-C, Zhao F, Fukuoka Y, Aswani A. Feasibility of reidentifying individuals in large national physical activity data sets from which protected health information has been removed with use of machine learning. JAMA Netw Open. 2018;1(8):e186040. doi:10.1001/jamanetworkopen.2018.6040
2. Erlich Y, Narayanan A. Routes for breaching and protecting genetic privacy. Nat Rev Genet. 2014;15(6):409-421. doi:10.1038/nrg3723
3. Sweeney L. Weaving technology and policy together to maintain confidentiality. J Law Med Ethics. 1997;25(2-3):98-110, 82. doi:10.1111/j.1748-720X.1997.tb01885.x
4. Centers for Medicare & Medicaid Services. Proposed policy, payment, and quality provisions changes to the Medicare physician fee schedule for calendar year 2019. https://www.cms.gov/newsroom/fact-sheets/proposed-policy-payment-and-quality-provisions-changes-medicare-physician-fee-schedule-calendar-year-3. Published July 12, 2018. Accessed November 15, 2018.
5. Siegler M. Sounding boards. Confidentiality in medicine—a decrepit concept. N Engl J Med. 1982;307(24):1518-1521. doi:10.1056/NEJM198212093072411
6. McCoy TH Jr, Perlis RH. Temporal trends and characteristics of reportable health data breaches, 2010-2017. JAMA. 2018;320(12):1282-1284. doi:10.1001/jama.2018.9222
7. Paass G. Disclosure risk and disclosure avoidance for microdata. J Bus Econ Stat. 1988;6(4):487-500. doi:10.1080/07350015.1988.10509697
8. Samarati P, Sweeney L. Protecting privacy when disclosing information: k-anonymity and its enforcement through generalization and suppression. Technical Report SRI-CSL-98-04. Menlo Park, CA: SRI International; 1998.
9. Dobra A, Karr AF, Sanil AP. Preserving confidentiality of high-dimensional tabulated data: statistical and computational issues. Stat Comput. 2003;13(4):363-370. doi:10.1023/A:1025671023941
10. Murphy SN, Gainer V, Mendis M, Churchill S, Kohane I. Strategies for maintaining patient privacy in i2b2. J Am Med Inform Assoc. 2011;18(suppl 1):i103-i108. doi:10.1136/amiajnl-2011-000316
11. Sztyler T, Stuckenschmidt H, Petrich W. Position-aware activity recognition with wearable devices. Pervasive Mobile Comput. 2017;38(pt 2):281-295. doi:10.1016/j.pmcj.2017.01.008
12. Ohm P. Broken promises of privacy: responding to the surprising failure of anonymization. UCLA Law Rev. 2009;57:1701-1777. https://www.uclalawreview.org/pdf/57-6-3.pdf. Accessed November 15, 2018.
13. Wolf LE, Patel MJ, Williams Tarver BA, Austin JL, Dame LA, Beskow LM. Certificates of confidentiality: protecting human subject research data in law and practice. J Law Med Ethics. 2015;43(3):594-609. doi:10.1111/jlme.12302