Sharing full data from clinical trials has been extensively advocated to better understand the harms and benefits of current treatments, generate new hypotheses, and maximize knowledge gained through trial participants’ altruism. Several pharmaceutical companies and the European Medicines Agency, which licenses drugs in Europe, are now sharing clinical trial data.1 An Institute of Medicine report1 presented a framework, and the International Committee of Medical Journal Editors issued a draft proposal for clinical trial data sharing. The National Institutes of Health have expanded requirements for registration of clinical trials, reporting of summary results, and data management plans.2 The Patient-Centered Outcomes Research Institute has been working on a data sharing policy almost since its inception. Beyond clinical trials, researchers can study the effectiveness or safety of therapies via observational data collected in electronic health records within the US Food and Drug Administration (FDA) Sentinel Program, within research-oriented health care systems, and through the Patient-Centered Outcomes Research Network. In various precision medicine initiatives, patients are sharing with researchers data from personal devices and genomic sequencing of biospecimens.
However, there continues to be a lag between data-sharing intentions and the implementation of policies to make sharing happen. Relatively few researchers have requested access to newly available clinical trial data sets. To date, few important results have been published from secondary analyses of shared clinical trial data3 even though examples exist where such availability either could, or did, change conclusions. Identifying and accessing such data sets can be difficult because sponsor platforms are not discoverable, searchable, or interoperable. Resistance to data sharing from clinical trialists has become more apparent,3 mostly based on assertions of data ownership and academic incentives for publishing multiple articles from single studies. Academic incentives should reward data sharing that leads to secondary publications by others and should encourage collaboration between the researchers who generate the data and secondary users.4 Fundamental issues that still need to be addressed include costs, consent, privacy, and data security.
Many have argued for the possible benefits of sharing full data from clinical trials, but the costs of sharing research data have received less attention, even though they are of central concern to both funders and researchers. Deidentification, data curation and storage, and responding to data requests could require resources extending over many years. Pharmaceutical companies that share data from clinical trials currently bear all these costs but have indicated they cannot do so indefinitely.1 A necessary first step is an analysis of the costs of sharing clinical trial data and of the options for sustainable and equitable funding.1 Such information can provide a foundation for discussing how to allocate the costs of data sharing fairly.
Front-end costs could be reduced through use of common data elements and standardized formats for collecting and managing health care and research data.1 This would also make shared data sets more interoperable and useful. A common platform for sponsors and funders to upload data and for data users to request data—or at least consolidating data sharing platforms—would also reduce infrastructure costs, facilitate data searching and access, and thereby increase the potential benefits of data sharing.
Much of the advocacy for data sharing has not grappled with its financial costs or, from the funder perspective, the tradeoff between the benefits of data sharing and the opportunity costs of funding fewer research projects. Experience to date suggests that only a small fraction of studies will have sharing requests, and for even fewer will the request yield an important scientific advance or modification of published claims. Benefits that can only be assessed when sharing is more widespread are the yield from individual-patient data meta-analyses, better assessment of safety across multiple studies and observational data sets, and possibly improved data management and analysis knowing that others might attempt to replicate the analyses. Future research should carefully analyze both the benefits and the costs of data sharing. If the costs are indeed high in comparison with benefits, requirements for sharing might be calibrated, with greater sharing obligations if it would be costly or difficult to obtain similar data, if the trial was directly relevant to clinical practice, or if the stakes for erroneous analysis are judged to be high.
Under US federal regulations, using and sharing deidentified health data for research does not require the consent of patients. The implicit rationale is that if data cannot be reidentified, the risk to persons whose data are shared is no greater than the risks accepted in daily life.5
However, in the big data era, this regulatory exception to consent is outmoded because no data can be accurately characterized as “deidentified.” Identifiability is not an inherent property of a data set but depends on what other data can be combined with it.1,6 As big data grow, reidentification becomes ever more feasible.
Can sharing data without explicit consent be justified without using an outdated concept of deidentification? Physicians and health care organizations have a moral obligation to improve clinical outcomes, and patients also have a moral obligation to allow data in the electronic health records collected during routine care to be used and shared in observational studies and some very low-risk clinical trials, with appropriate privacy safeguards.7
Breaches of personal data held by retailers, websites, financial institutions, government agencies, and health care organizations are everyday news. How can medical and research data that are shared be better protected? First, identifiable health data should have privacy and security safeguards regardless of who holds them. Congress should extend the health privacy and security protections to all parties that collect or hold such data, including Internet service providers, websites, and mobile application and device developers.8
Second, even without federal requirements, data sharing should use state-of-the-art protections, such as 256-bit encryption, virtual private networks, and testing for network security threats. Methods exist whereby data can be made available to authorized secondary users for analysis without allowing them to download the data. Furthermore, distributed or federated networks can aggregate health data held by several institutions more securely than a centralized site holding data from many researchers and institutions.1 The FDA Sentinel project and a confederation of integrated health care systems use distributed data networks in which individual patient-level observational data never leave the site of clinical care.
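The distributed approach described above can be illustrated with a minimal sketch. It is not the FDA Sentinel implementation or any real network's software; the `Site` class, the `summarize` and `pooled_event_rate` functions, and the record fields `drug` and `adverse_event` are all hypothetical names chosen for illustration. The point it shows is only that each site answers a query with aggregate counts, so patient-level rows are never sent to the coordinator.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Site:
    """A hypothetical data partner that keeps patient-level records locally."""
    name: str
    records: List[Dict] = field(repr=False)  # never transmitted off-site

    def summarize(self, drug: str) -> Dict[str, int]:
        """Run the query locally and return only aggregate counts."""
        exposed = [r for r in self.records if r["drug"] == drug]
        events = sum(1 for r in exposed if r["adverse_event"])
        return {"exposed": len(exposed), "events": events}


def pooled_event_rate(sites: List[Site], drug: str) -> Optional[float]:
    """Coordinator: combine per-site counts into one pooled event rate."""
    total_exposed = 0
    total_events = 0
    for site in sites:
        summary = site.summarize(drug)  # only counts cross the site boundary
        total_exposed += summary["exposed"]
        total_events += summary["events"]
    if total_exposed == 0:
        return None  # no exposed patients at any site
    return total_events / total_exposed


# Illustrative, made-up records for two sites.
site_a = Site("A", [
    {"drug": "X", "adverse_event": True},
    {"drug": "X", "adverse_event": False},
    {"drug": "Y", "adverse_event": False},
])
site_b = Site("B", [
    {"drug": "X", "adverse_event": True},
    {"drug": "X", "adverse_event": False},
])
rate = pooled_event_rate([site_a, site_b], "X")
```

In this sketch the coordinator sees two integers per site and nothing else. A real network would also need to handle very small counts, which can themselves be identifying, and to authenticate both the query and the response.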
Third, technical approaches to protecting privacy should be developed and adopted. In differential privacy, some values in a data set are altered so that the data set remains useful for group analyses while better protecting individuals from reidentification. In the altered data set, the risk of reidentification for an individual is essentially no greater whether the individual is included in the data set or excluded, and the usefulness of the data set is reduced by no more than a small prespecified amount.9 This approach has been studied by computer scientists but should be tested and, if the findings are promising, used more broadly on large health data sets.
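As a rough illustration of the idea, the sketch below uses the Laplace mechanism, one widely used way to obtain a differential privacy guarantee for a counting query. It is a swapped-in example, not necessarily the method discussed in reference 9, and it perturbs a released statistic rather than the stored values. The function name `private_count`, the parameter `epsilon` (the privacy budget, where smaller values mean more noise and stronger protection), and the sample records are assumptions made for illustration.

```python
import random
from typing import Callable, Dict, List, Optional


def private_count(
    records: List[Dict],
    predicate: Callable[[Dict], bool],
    epsilon: float,
    rng: Optional[random.Random] = None,
) -> float:
    """Return a noisy count of records matching `predicate`.

    Adding or removing one person changes a count by at most 1
    (sensitivity 1), so Laplace noise with scale 1/epsilon is used.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    rng = rng or random.Random()
    true_count = sum(1 for r in records if predicate(r))
    # The difference of two independent exponential draws with rate
    # `epsilon` follows a Laplace distribution with scale 1/epsilon.
    noise = rng.expovariate(epsilon) - rng.expovariate(epsilon)
    return true_count + noise


# Illustrative, made-up records: 30 of 100 patients have the condition.
patients = [{"has_condition": i < 30} for i in range(100)]
rng = random.Random(0)  # seeded only so the example is repeatable
noisy = private_count(patients, lambda r: r["has_condition"], 1.0, rng)
```

Each released answer differs from the true count of 30 by a random amount, but the noise averages out over groups, which is why group-level analyses remain useful. Repeated queries consume privacy budget, so a real deployment would need to track the total `epsilon` spent.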
Fourth, organizations that collect, store, and use large data sets containing health information for research should appoint a data access and oversight committee that includes patient or public representatives.1,7 The committee should be empowered to identify and address important public concerns.
More needs to be learned about the societal benefits and costs of different data sharing models. The recent contest and prize sponsored by a medical journal and the National Institutes of Health (NIH) for the best secondary analysis of one large NIH-sponsored clinical trial might be broadened to a larger number of trials, similar to XPRIZE competitions.10 These are competitions to solve formidable challenges in diverse areas, such as inventing handheld devices to diagnose disease, developing highly efficient automobiles, and cleaning up oil spills. Medical journals, professional societies, governmental agencies, and nonprofit organizations could provide forums to recommend how to study the societal benefit of funders’ investments in data sharing. Such efforts could help fulfill the promise of data sharing by providing sounder evidence on how to optimize the balance between investing in data sharing, funding new research, and maintaining scientists’ incentives to conduct research requiring primary data collection.
Corresponding Author: Bernard Lo, MD, The Greenwall Foundation, One Penn Plaza, Ste 4726, New York, NY 10019 (firstname.lastname@example.org).
Published Online: July 17, 2017. doi:10.1001/jamainternmed.2017.1926
Conflict of Interest Disclosures: None reported.
Lo B, Goodman SN. Sharing Clinical Research Data—Finding the Right Balance. JAMA Intern Med. Published online July 17, 2017. doi:10.1001/jamainternmed.2017.1926