Small Data Challenges of Studying Rare Diseases | Endocrinology | JAMA Network Open | JAMA Network
Invited Commentary
Diabetes and Endocrinology
March 23, 2020

Small Data Challenges of Studying Rare Diseases

Author Affiliations
  • 1Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, Massachusetts
  • 2Statistical Editor, JAMA Network Open
JAMA Netw Open. 2020;3(3):e201965. doi:10.1001/jamanetworkopen.2020.1965

The age of big data is in full swing, with researchers in both clinical medicine and public health seeking to take advantage of the increasing availability of massive amounts of electronic and administrative health data. In turn, this has led to substantial resources and efforts being poured into the development and teaching of methods for data collection and storage as well as machine learning analytic methods.1

However, big data are not always available. In the study of rare diseases, small sample sizes are inevitable, especially when the primary end point is also uncommon. As an example, Avadhanula et al2 used data from a cohort of 125 patients with alkaptonuria, a rare autosomal recessive disorder. Patients were recruited between 2000 and 2018 as part of a prospective longitudinal study conducted at the National Human Genome Research Institute to investigate the incidence of thyroid dysfunction among patients with alkaptonuria. While this is by no means a generous sample size, the cohort is the largest of its kind for patients with alkaptonuria, according to the authors.

In the US, a rare disease is defined as a health condition that affects fewer than 200 000 individuals.3 This definition was created by Congress as part of the Orphan Drug Act of 1983, which aimed to use financial incentives to motivate pharmaceutical and medical device companies to develop new treatments for patients with rare diseases. Close to 7000 conditions meet this definition. Although a relatively small number of individuals are affected by each rare disease, the estimated total number of individuals living with any rare disease is between 25 million and 30 million.3 Support for rare disease research continues today. In 2016, the US Food and Drug Administration awarded $23 million over 4 years to support research in 21 different rare diseases.4 The Patient-Centered Outcomes Research Institute also has a special advisory board for rare disease research and thus far has funded more than 28 patient-centered comparative effectiveness studies that focus on the treatment and management of rare diseases.5

That the study of rare diseases poses unique challenges has been recognized. From the perspective of study design, researchers investigating rare diseases have many options, including crossover and adaptive trials.6 For observational studies, Whicher et al7 list self-controlled study designs, case-control designs, and prospective inception cohorts as potential designs suitable for rare disease research. Beyond the choice of study design, researchers must also be wary of the analytic challenges that arise from studying rare diseases, including the extent to which the available data can be viewed as representative of the entire population of patients with the condition and whether there is sufficient (statistical) power to draw definitive conclusions (ie, those that could inform decision-making). It is perhaps less well recognized that, when the sample size is small, P values are especially vulnerable to small deviations in the observed number of outcomes. For example, in the study by Avadhanula et al,2 1 patient was diagnosed with hyperthyroidism in the cohort of 125 individuals. Based on the exact test for 1-sample proportion, Avadhanula et al2 found insufficient evidence that the estimated prevalence in the study population (ie, 1 of 125 [0.8%]) was different than that in the general population (ie, 0.5%), with a P value of .88. As a thought experiment, suppose 2 patients instead of 1 had been diagnosed with hyperthyroidism. The same test would yield a P value of .23. Furthermore, if 3 patients were diagnosed, the resulting P value would then be .04. Thus, by hypothetically observing just 2 more cases, there is a dramatic change in the P value, a change that would likely alter decision-making.
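The thought experiment above can be explored with a short calculation. The following is a minimal sketch, not the analysis code used by Avadhanula et al: it assumes a binomial model with n = 125 and a null prevalence of 0.5%, and it reports both an upper-tail P value and a 2-sided P value under one common convention (summing the probabilities of all outcomes no more likely than the one observed). The function names are hypothetical, and because exact tests differ in how they define the second tail, the printed values need not match those quoted in the text. The point is the qualitative pattern: each additional case changes the P value substantially.

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k events in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def upper_tail_p(x, n, p0):
    """One-sided exact P value: P(X >= x) under the null prevalence p0."""
    return sum(binom_pmf(k, n, p0) for k in range(x, n + 1))

def two_sided_p(x, n, p0):
    """Two-sided exact P value (assumed convention): total probability of
    all outcomes that are no more likely than the observed count x."""
    p_obs = binom_pmf(x, n, p0)
    # Small relative tolerance so ties are not lost to rounding error.
    return sum(
        binom_pmf(k, n, p0)
        for k in range(n + 1)
        if binom_pmf(k, n, p0) <= p_obs * (1 + 1e-7)
    )

N_PATIENTS = 125   # cohort size reported in the text
NULL_PREV = 0.005  # general-population prevalence reported in the text

results = {
    cases: (upper_tail_p(cases, N_PATIENTS, NULL_PREV),
            two_sided_p(cases, N_PATIENTS, NULL_PREV))
    for cases in (1, 2, 3)
}

for cases, (p_upper, p_two) in results.items():
    print(f"{cases} case(s): upper-tail P = {p_upper:.3f}, two-sided P = {p_two:.3f}")
```

With an expected count of less than 1 case under the null, a single additional diagnosis shifts a large share of the probability mass, which is why the P value is so unstable at this sample size.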

This is all the more important to acknowledge when it is placed against the backdrop of a 2019 editorial by the American Statistical Association that called on researchers to move away from using the term statistical significance to describe results with a P value of less than .05.8 As part of the editorial, the American Statistical Association solicited suggestions for alternative paradigms. One interesting proposal was that journals adopt a so-called results-blind review process in which study results are omitted from the initial manuscript submission. In doing so, the central criteria for publication would be whether the study objective is relevant and interesting from either a clinical or public health perspective and whether the study design and methods are appropriate. Rare disease research that lacks statistical power or fails to achieve the conventional levels of statistical significance may especially benefit from this type of review process. More publications and dissemination of knowledge of rare disease research would increase awareness and possibly foster new collaborations among different institutions that could lead to small data becoming bigger.

In a 2019 study, Rees et al9 reported on the completion and publication status of 659 clinical trials for rare diseases registered between January 2010 and December 2012. They found that, as of December 2014, 199 trials (30.2%) were discontinued, with insufficient patient accrual as the most cited reason. Furthermore, among those completed, more than half (306 [66.5%]) remained unpublished at 2 years and nearly one-third (142 [31.5%]) remained unpublished at 4 years. Although the authors were unable to ascertain whether sample size and statistical significance factored into whether a study was published, it seems highly plausible that they did in many instances.10

Currently, JAMA Network Open does not use a results-blind review process. However, although not explicitly stated in the Instructions to Authors, statistical significance is not considered a criterion for publication. Driven by the desire to publish important science, JAMA Network Open is open to publishing high-quality studies with an important research question, a sound study design, appropriate methodology, and conclusions that are a reasonable and accurate reflection of the nature and strength of the evidence. Consequently, this journal represents an important venue for the publication of studies of rare diseases and embraces the challenges that arise from studying diseases that are often overlooked. After all, do we not hope that every disease will become rare in the future?

Article Information

Published: March 23, 2020. doi:10.1001/jamanetworkopen.2020.1965

Open Access: This is an open access article distributed under the terms of the CC-BY License. © 2020 Mitani AA et al. JAMA Network Open.

Corresponding Author: Sebastien Haneuse, PhD, Department of Biostatistics, Harvard T.H. Chan School of Public Health, 655 Huntington Ave, Bldg II, Room 407, Boston, MA 02115.

Conflict of Interest Disclosures: None reported.

References

1. Beam AL, Kohane IS. Big data and machine learning in health care. JAMA. 2018;319(13):1317-1318. doi:10.1001/jama.2017.18391
2. Avadhanula S, Introne WJ, Auh S, et al. Assessment of thyroid function in patients with alkaptonuria. JAMA Netw Open. 2020;3(3):e201357. doi:10.1001/jamanetworkopen.2020.1357
3. US Food and Drug Administration. Developing products for rare diseases and conditions. Accessed February 6, 2020.
4. Voelker R. Shot in the arm for rare diseases. JAMA. 2016;316(23):2474. doi:10.1001/jama.2016.17235
5. Patient-Centered Outcomes Research Institute. Rare diseases. Accessed February 6, 2020.
6. Gagne JJ, Thompson L, O’Keefe K, Kesselheim AS. Innovative research methods for studying treatments for rare diseases: methodological review. BMJ. 2014;349:g6802. doi:10.1136/bmj.g6802
7. Whicher D, Philbin S, Aronson N. An overview of the impact of rare disease characteristics on research methodology. Orphanet J Rare Dis. 2018;13(1):14. doi:10.1186/s13023-017-0755-5
8. Wasserstein RL, Schirm AL, Lazar NA. Moving to a world beyond “p < 0.05.” Am Stat. 2019;73(suppl 1):1-19. doi:10.1080/00031305.2019.1583913
9. Rees CA, Pica N, Monuteaux MC, Bourgeois FT. Noncompletion and nonpublication of trials studying rare diseases: a cross-sectional analysis. PLoS Med. 2019;16(11):e1002966. doi:10.1371/journal.pmed.1002966
10. Mlinarić A, Horvat M, Šupak Smolčić V. Dealing with the positive publication bias: why you should really publish your negative results. Biochem Med (Zagreb). 2017;27(3):030201. doi:10.11613/BM.2017.030201