June 2018

Tips for Analyzing Large Data Sets From the JAMA Surgery Statistical Editors

Author Affiliations
  • 1Harbor–University of California Los Angeles Medical Center, Torrance, California
  • 2Northwestern University, Chicago, Illinois
  • 3Duke University, Durham, North Carolina
  • 4Statistical Editor, JAMA Surgery
JAMA Surg. 2018;153(6):508-509. doi:10.1001/jamasurg.2018.0647

With the advent of administrative databases and patient registries, big data is increasingly accessible to researchers. The large sample sizes of these data sets make the study of rare outcomes easier and provide the potential to determine national estimates and regional variations. As such, the JAMA Surgery editors and reviewers have seen more submissions using big data to answer clinical and policy-related questions. However, no database is completely free of bias and measurement error. With bigger data, random signals may reach statistical significance, and precision may be incorrectly inferred from narrow confidence intervals. While many principles apply to all studies, the importance of these methodological issues is amplified in large, complex data sets.

Study Population Considerations

It is important for the reader to understand how the investigator arrived at the study population. Usually, it is drawn from a larger source population to which inclusion criteria have been applied. A flowchart of the included and excluded participants, with the number excluded and reasons why, should be clearly delineated. Similarly, if the study is longitudinal, loss to follow-up should be reported. This will help readers understand any selection bias present.

Methodological and Sample Size Considerations

The objective and outcome(s) of the study should have been defined prior to data collection and analysis. If an author is looking for a difference in some variable between 2 cohorts, this difference and its confidence intervals should also be preplanned. The difference in the effect estimate should be reported as a patient-centered, clinically meaningful, and interpretable difference1 in addition to the statistical result (eg, regression coefficient, P value). Unfortunately, mining large data sets without preplanning can lead to unintentional, often mistaken conclusions. Statistical significance is related to sample size, and with a large enough sample, statistical significance between groups may occur with very small differences that are not clinically meaningful.

When reporting the results of observational studies, authors should consider following the Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) reporting guidelines. The study design should be clearly described and be consistent with how the data were collected and analyzed, and the study results should be presented in a concise yet complete manner. There should be some statement that the study was performed after institutional review board approval or exemption was obtained. Authors should also describe whether any interim analyses were performed and if there were any protocol violations. Limitations should be reported to promote scientific integrity and validity of conclusions, which should be fully supported by the data analysis. Interpretations of observational studies should only lead to descriptions of associations between variables, not to conclusions of causality.

Although insufficient power would not seem to be a problem with large databases, this is simply not true. Study samples may be inadequate to answer questions about rare outcomes. Thus, regardless of the size of the database, the sample size and power analysis should have been calculated a priori. A power analysis is particularly helpful in interpreting the study findings when statistically significant effects are not found.2 If a post hoc subgroup or power analysis is performed, then this should be stated in the Methods section of the resulting article. Consideration should also be given to adjusting for multiple comparisons and/or multiple testing, especially if these were not preplanned. If 20 or more tests are performed at the .05 level, then by chance alone, at least 1 is expected to be statistically significant. One strategy is to employ methods of correction (eg, Bonferroni correction, Hochberg sequential procedure) when the number of tests or comparisons exceeds 20.3
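As a minimal illustration of one such correction, the Bonferroni procedure simply divides the significance threshold by the number of tests performed. The sketch below uses hypothetical P values:

```python
# Bonferroni correction: divide the significance threshold by the
# number of tests performed (the P values below are hypothetical).
p_values = [0.001, 0.012, 0.034, 0.049]  # 4 hypothetical comparisons
alpha = 0.05
adjusted_alpha = alpha / len(p_values)   # 0.05 / 4 = 0.0125

significant = [p for p in p_values if p < adjusted_alpha]
print(significant)  # [0.001, 0.012] survive the correction
```

Under the uncorrected .05 threshold all 4 results would be declared significant; after correction only 2 remain, which is the point of adjusting when many comparisons are made.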

Data Elements and Presentation

Authors should present their data with sufficient detail that a reader could calculate and reproduce the results. Rather than simply reporting summary data or proportions, it is preferred that authors present granular, raw data. The proportion of missing data for the variables and outcomes of interest should be clearly described.4 When there is a large proportion of missing data (>30%), the authors should describe the pattern of missingness, and consideration should be given to using techniques such as multiple imputation. In addition to reducing power, analyses that include only participants with complete data may result in bias unless the data are missing completely at random. For example, if income data are more likely to be missing for those who do not have insurance and are sicker, analyses of the effect of socioeconomic status on surgical outcomes will potentially be biased.
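Quantifying missingness per variable is the first step before choosing between complete-case analysis and imputation. A minimal Python sketch, using hypothetical records with None marking a missing value:

```python
# Proportion of missing data per variable; the records are hypothetical
# and None marks a missing value.
records = [
    {"age": 54, "income": None, "insured": True},
    {"age": 61, "income": 42000, "insured": True},
    {"age": None, "income": None, "insured": False},
    {"age": 47, "income": 35000, "insured": True},
]

variables = ["age", "income", "insured"]
missing = {
    v: sum(r[v] is None for r in records) / len(records) for v in variables
}
print(missing)  # {'age': 0.25, 'income': 0.5, 'insured': 0.0}
```

Here income exceeds the 30% threshold discussed above, so its pattern of missingness (eg, whether it depends on insurance status) would need to be examined before any complete-case analysis.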

Given the observational nature of registry data, one consideration is to create a directed acyclic graph,5 which will allow the reader to understand the role of potential confounders and intermediates. When there is a large number of tables ancillary to the primary study objective, submitting them as online supplementary files should be considered. If the data can be depicted in a table or figure, there is no need to repeat the results in the manuscript text. Pie charts and bar graphs add very little to what is already stated in the text, unless there are multiple, complex bins. If bar graphs are used, the 95% CI bars (sometimes called whiskers) should also be denoted.

If medical record abstraction was used, the methods for medical record review should be detailed, such as describing who the abstractors were, their background, how they were trained, and whether there was a standardized data collection instrument.6 Ideally, medical record abstractors should be blinded to the study hypothesis and objectives, and there should be at least 2 independent medical record abstractors. The inter-rater reliability (eg, κ) of the abstractors should also be described.
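Inter-rater reliability for categorical abstraction is commonly summarized with Cohen's κ, which compares observed agreement with the agreement expected by chance. A self-contained sketch with hypothetical ratings from 2 abstractors:

```python
from collections import Counter

def cohens_kappa(rater1, rater2):
    """Cohen's kappa for two raters assigning categorical labels."""
    n = len(rater1)
    observed = sum(a == b for a, b in zip(rater1, rater2)) / n
    c1, c2 = Counter(rater1), Counter(rater2)
    # Expected agreement if the two raters labeled independently.
    expected = sum(c1[k] * c2[k] for k in c1) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical abstraction of 10 charts ("yes"/"no" for an outcome).
r1 = ["yes", "yes", "no", "no", "yes", "no", "yes", "yes", "no", "no"]
r2 = ["yes", "no", "no", "no", "yes", "no", "yes", "yes", "yes", "no"]
print(round(cohens_kappa(r1, r2), 2))  # 0.6
```

The raters agree on 8 of 10 charts (80%), but because 50% agreement is expected by chance with these label frequencies, κ is 0.6, a moderate rather than excellent level of reliability.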

Analytic and Statistical Considerations

Because studies based on secondary analyses of large data sets are by definition observational, less emphasis should be placed on statistical hypothesis testing and the reporting of P values. As per the American Statistical Association,7 a description of effect estimates (odds ratios, risk ratios, etc) and their 95% CIs is more informative than reporting P values. If 2 cohorts are being compared on a continuous variable (eg, duration of operation, length of hospital stay), then the difference in means (or medians, if the data are not normally distributed) and its 95% CI should be reported.
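For illustration, a normal-approximation 95% CI for a mean difference is diff ± 1.96 × SE. The lengths of stay below are hypothetical, and for samples this small a t-distribution critical value would be more appropriate than z = 1.96; the sketch only shows the arithmetic:

```python
import math
from statistics import mean, stdev

def mean_diff_ci(a, b, z=1.96):
    """Mean difference between 2 cohorts with a normal-approximation 95% CI."""
    diff = mean(a) - mean(b)
    se = math.sqrt(stdev(a) ** 2 / len(a) + stdev(b) ** 2 / len(b))
    return diff, (diff - z * se, diff + z * se)

# Hypothetical lengths of stay (days) for 2 cohorts of 8 patients each.
open_repair = [5, 7, 6, 8, 9, 7, 6, 8]
lap_repair = [4, 5, 5, 6, 4, 5, 6, 5]

diff, (lo, hi) = mean_diff_ci(open_repair, lap_repair)
print(f"mean difference {diff:.1f} days (95% CI, {lo:.1f}-{hi:.1f})")
```

Reporting "2.0 days (95% CI, 1.0-3.0)" conveys both the size of the effect and its precision, which a bare P value does not.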

When presenting a multivariable model, the theoretical basis of the model should be described. The type of model (eg, logistic, linear, Poisson) and the assumptions on which it is based should be clear (eg, the model assumed linearity or normality of the distribution of the data). The authors should demonstrate that model assumptions were not violated, thereby supporting the validity of the model. Additionally, a description of why certain predictor variables and which variables were chosen for the model should be clearly stated. Ideally, a model with its predictors will not be selected simply using criteria for statistical significance. Rather, the predictor variables should be chosen based on background literature and/or biological and clinical plausibility. If model selection is performed purely based on statistical significance, then the model should be presented as hypothesis-generating, rather than conclusive.8 For the purposes of sample size calculation for multivariable logistic regression analysis, for each additional included predictor, there should be at least 10 to 15 participants with the outcome of interest. Thus, if there are 20 deaths in a study sample, it would only be possible to assess 2 variables as predictors at most, such as age and diabetes in a multivariable model.
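The events-per-variable rule of thumb above is simple arithmetic; a minimal sketch using the 10-per-variable lower bound cited in the text:

```python
# Rule of thumb for logistic regression: at least 10 to 15 outcome
# events per candidate predictor variable.
def max_predictors(n_events, events_per_variable=10):
    """Upper bound on candidate predictors given the outcome event count."""
    return n_events // events_per_variable

print(max_predictors(20))      # 2 -> eg, age and diabetes
print(max_predictors(20, 15))  # 1 under the stricter 15-per-variable rule
```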

One other consideration in multivariable modeling is the potential for correlations within a cluster of participants. As an example, if one is assessing regional differences in postoperative wound infections after hernia repair, one would expect outcomes to be correlated by surgeon. In this case, an analysis that accounts for clustering, such as generalized estimating equations, should be used. Similarly, if a study evaluates repeated measures of a variable over time in the same patient (eg, quality-of-life scores at 3 months and 9 months after surgery), then a mixed-model approach should be used.9 Finally, in presenting the model, authors should describe how they assessed model fit, multicollinearity, and effect modification.10


Large data sets have many unique strengths, including broad representation, efficient sampling design, and often consistency in data structure. However, large data sets are not free from bias and measurement error, and it is important to respect and acknowledge the limitations of the data. The challenge with big data is that it requires a carefully thought-out research question and a transparent analytic strategy. The resulting article should have sufficient information demonstrating that appropriate design and statistical methods were used. Yet there needs to be a balance between the amount of information provided and journal space limits; thus, relevant methodologic information can be placed in a supplement if needed.

As editors of JAMA Surgery, we encourage researchers to continue to ask these critical research questions that are best answered with big data sets. When completing a request for revision and resubmittal, we encourage researchers to respond to each comment, whether or not the requested changes were implemented. For comments the research team chooses not to implement, it is helpful to include a detailed reason why a change is considered inappropriate.

We appreciate the work of JAMA Surgery authors. We sincerely hope that the considerations in this article, and the accompanying Editorial, “A Checklist to Elevate the Science of Surgical Database Research,”11 are helpful.

Article Information

Corresponding Author: Amy H. Kaji, MD, PhD, Harbor–University of California Los Angeles Medical Center, 1000 W Carson St, Ste 21, Torrance, CA 90509 (akaji@emedharbor.edu).

Published Online: April 4, 2018. doi:10.1001/jamasurg.2018.0647

Conflict of Interest Disclosures: None reported.

References

1. McGlothlin AE, Lewis RJ. Minimal clinically important difference: defining what really matters to patients. JAMA. 2014;312(13):1342-1343.
2. Stokes L. Sample size calculation for a hypothesis test. JAMA. 2014;312(2):180-181.
3. Cao J, Zhang S. Multiple comparison procedures. JAMA. 2014;312(5):543-544.
4. Newgard CD, Lewis RJ. Missing data: how to best account for what is not known. JAMA. 2015;314(9):940-941.
5. Shrier I, Platt RW. Reducing bias through directed acyclic graphs. BMC Med Res Methodol. 2008;8:70.
6. Kaji AH, Schriger D, Green S. Looking through the retrospectoscope: reducing bias in emergency medicine chart review studies. Ann Emerg Med. 2014;64(3):292-298.
7. American Statistical Association. The ASA's statement on P values: context, process and purpose. Am Stat. 2016;70:129-133.
8. Meurer WJ, Tolles J. Logistic regression diagnostics: understanding how well a model predicts outcomes. JAMA. 2017;317(10):1068-1069.
9. Detry MA, Ma Y. Analyzing repeated measurements using mixed models. JAMA. 2016;315(4):407-408.
10. Tolles J, Meurer WJ. Logistic regression: relating patient characteristics to outcomes. JAMA. 2016;316(5):533-534.
11. Haider AH, Bilimoria KY, Kibbe MR. A checklist to elevate the science of surgical database research [published online April 4, 2018]. JAMA Surg. doi:10.1001/jamasurg.2018.0628
12. Strengthening the Reporting of Observational Studies in Epidemiology Group. Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) statement. https://www.strobe-statement.org/index.php?id=strobe-home. Published 2009. Accessed March 7, 2018.
1 Comment for this article
IRB approval or determination exemption usually does not apply
Mark Schreiner, MD. | The Children's Hospital of Philadelphia, Perelman School of Medicine at the University of Pennsylvania
The authors state that "There should be some statement that the study was performed after institutional review board approval or exemption was obtained." However, for much if not most research involving national database registries, IRB review would not be required, nor would a determination of exemption.

The definition of human subjects research at 45 CFR 46.102(f) requires that the investigator obtain private information such that the subjects would be readily identifiable.

"Private information must be individually identifiable (i.e., the identity of the subject is or may readily be ascertained by the investigator or associated with the information) in order for obtaining the information to constitute research involving human subjects."

Most of the time, the data provided to investigators by national registries will not be individually identifiable. When this is the case, the research would not meet the definition of human subjects research, IRB review of the proposed research would not be required, nor would the research require a determination of exemption. Only when the research meets the definition of human subjects research is IRB review or a determination of exemption required.