Flowchart of screening reports of randomized controlled trials (RCTs) describing harm in medical journals with high impact factors.
Pitrou I, Boutron I, Ahmad N, Ravaud P. Reporting of Safety Results in Published Reports of Randomized Controlled Trials. Arch Intern Med. 2009;169(19):1756-1761. doi:10.1001/archinternmed.2009.306
Copyright 2009 American Medical Association. All Rights Reserved. Applicable FARS/DFARS Restrictions Apply to Government Use.
Reports of clinical trials usually emphasize efficacy results, especially when results are statistically significant. Poor safety reporting can lead to misinterpretation and inadequate conclusions about the interventions assessed. Our aim was to describe the reporting of harm-related results from randomized controlled trials (RCTs).
We searched the MEDLINE database for reports of RCTs published from January 1, 2006, through January 1, 2007, in 6 general medical journals with a high impact factor. Data were extracted by use of a standardized form to appraise the presentation of safety results in text and tables.
Adverse events were mentioned in 88.7% of the 133 reports. No information on severe adverse events and withdrawal of patients owing to an adverse event was given in 27.1% and 47.4% of articles, respectively. Restrictions in the reporting of harm-related data were noted in 43 articles (32.3%) with a description of the most common adverse events only (n = 17), severe adverse events only (n = 16), statistically significant events only (n = 5), and a combination of restrictions (n = 5). The population considered for safety analysis was clearly reported in 65.6% of articles.
Our review reveals important heterogeneity and variability in the reporting of harm-related results in publications of RCTs.
The reporting of harm is as important as the reporting of efficacy in publications of clinical trials. Both are essential in estimating the ratio of benefit to harm of medical interventions. However, harm is frequently insufficiently reported1-4: reports usually emphasize benefits, especially when results for efficacy are statistically significant. A previous review of randomized controlled trials (RCTs) published from 1997 through 1998 by Ioannidis and Lau1 showed that safety reporting varied widely in reports of trials and was in general largely inadequate. These results were confirmed by reviews of specific medical or surgical areas.5-7 However, these works focused mainly on the reporting of indispensable parameters, such as reporting harm with numbers instead of generic statements, reporting severe adverse events per study arm, or reporting patient withdrawal owing to severe adverse events. These works did not focus on the potential influence of safety reporting on readers' interpretation of the trial results. As an example, the reporting of the safety of rofecoxib in the VIGOR study was highly criticized.8,9 Krumholz et al10 stated that the presentation of results obscured the cardiovascular risk associated with rofecoxib by reporting the hazard of myocardial infarction as if naproxen was the intervention group (relative risk 0.2; range, 0.1-0.7) and was protecting against myocardial infarction. The authors did not report the absolute number of cardiovascular events, even though all other results were presented appropriately with rofecoxib as the intervention group.
Several guidelines aim to improve the quality of the reporting of safety in RCTs. The extension of the Consolidated Standards of Reporting Trials (CONSORT) statement for harm provides specific guidelines for reporting harm-related results of clinical trials.11-13 Particularly, the extension cautions against common mistakes and gives examples of inadequate reporting of harm-related data (eg, reporting qualitative instead of quantitative data, global reporting instead of reporting per study arm).
Our aim was to appraise the reporting of harm in reports of RCTs published in general medical journals with high impact factors and to focus particularly on the presentation of harm-related results. For this purpose, we systematically reviewed the reporting of safety data in a sample of recent results in reports of RCTs published in 6 core medical journals.
We searched MEDLINE via PubMed to identify all reports of RCTs published from January 1, 2006, through January 1, 2007, in 6 general medical journals with a high impact factor: New England Journal of Medicine (NEJM), Lancet, JAMA, BMJ, Annals of Internal Medicine (Ann Intern Med), and PLoS Medicine (PLoS Med). We restricted our search to these journals because a high impact factor is usually considered a good predictor of quality in methodology and reporting of results.14
Retrieved articles were assessed by 1 reviewer (I.P.) who screened all titles and abstracts to identify the relevant studies. Articles were included if the study was identified as an RCT with 2 parallel arms. We excluded observational studies (ie, cohort and case-control studies), case reports, editorials, letters, and RCTs with the following designs: factorial plan, crossover, cluster RCT, multiple-arm trial, and equivalence and noninferiority trial. Articles for which only an abstract was available were excluded. We also excluded reports of RCTs if safety was not a major issue (eg, public health intervention, assessment of diagnostic or prognostic tests and screening procedures, and medico-economic analysis of RCTs). We excluded RCTs if safety and efficacy outcomes were identical (eg, if the primary outcome was “death” in the assessment of stents or revascularization in cardiology).
A standardized data extraction form was developed on the basis of a review of the literature1 and the CONSORT statement guidelines.13 (The data extraction form is available from the authors on request.) Before data extraction, as a calibration exercise, the standardized form was tested on a separate set of 20 reports. A single reviewer (I.P.) extracted data from all articles on the basis of their full text. Supplementary material was retrieved if available. The reviewer was not blinded to the journal or the authors' names. In addition, a random sample of 30 articles was reviewed by one of us (N.A.) for quality assurance.
Data were collected on general characteristics of reports, including journal, country of first author, funding source (public, private, or both), medical area of the trial (eg, cardiology, neurology), intervention classification (pharmacological or nonpharmacological treatment), number of centers, and sample size (ie, number of subjects randomized).
For each report, we checked the title for the term “safety” or a term related to safety (eg, toxicity, or adverse effects or events). We determined whether the following indispensable safety parameters identified by Ioannidis and Lau1 were reported: (1) harm with numerical data for each trial group instead of generic statements, (2) severity grades, (3) severe adverse events per trial group, and (4) patient withdrawal owing to adverse events per trial group with the description of the events.
In a second step, we focused on the results section in each report (text, figures, and tables). We determined whether restrictions were applied in reporting harm-related data, such as reporting only the adverse events observed at a certain frequency or rate (eg, >3% or >10% of participants), reporting adverse events that reached a P value threshold in the comparison of randomized groups (eg, P < .05) or reporting only the most severe events. If reporting was restricted to common events only, we collected the threshold values (eg, adverse events occurring in >20% patients). We also checked whether (1) safety data were described per event (eg, number of cases of unstable angina) or per patient (eg, number of patients with an episode of unstable angina), (2) data for different adverse events were combined per organ (eg, neurological, digestive events) into 1 composite outcome measure, (3) information on timing of adverse events was given, and (4) expected and unexpected adverse events were reported separately. We counted the total number of statistical tests reported for safety analysis and noted whether statistical comparisons were reported for each adverse event per arm, the totality of adverse events per arm, for patients with at least 1 adverse event, for organs or for severe adverse events.
In a third step, we focused on harm-related results reported in tables and figures. Tables and figures are a powerful means of conveying investigational results.15 We counted the number of figures and tables allocated to safety reporting. We checked whether safety data in tables or figures were displayed per event or per patient, whether and how many statistical comparisons were displayed with a P value, and whether the safety analyses reported in the tables and figures were intention-to-treat analyses (ie, data for all patients randomized were analyzed for the allocated groups).
Statistical analysis involved use of SAS statistical software (version 9.1; SAS Institute Inc, Cary, North Carolina). Descriptive statistics (mean [SD], median and interquartile range [IQR], extreme values) were used for continuous variables. Categorical variables were described with frequencies and percentages. We determined the degree of agreement between reviewers by the κ coefficient for categorical variables or the intraclass correlation coefficient for continuous variables.
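As an illustration of the agreement statistic named above, the following is a minimal sketch of Cohen's κ for two reviewers rating the same items. It is not the SAS procedure used in the study; the function name `cohen_kappa` and the ratings are hypothetical and chosen only to make the arithmetic easy to follow.

```python
from collections import Counter


def cohen_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two raters' categorical ratings of the same items."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(ratings_a)
    # Observed agreement: fraction of items both raters put in the same category.
    p_observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Chance agreement: sum over categories of the product of marginal proportions.
    counts_a = Counter(ratings_a)
    counts_b = Counter(ratings_b)
    p_chance = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if p_chance == 1.0:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (p_observed - p_chance) / (1 - p_chance)


# Hypothetical ratings (not study data): was an item reported, yes or no?
reviewer_1 = ["yes"] * 5 + ["no"] * 5
reviewer_2 = ["yes"] * 4 + ["no"] * 6
kappa = cohen_kappa(reviewer_1, reviewer_2)
print(f"kappa = {kappa:.2f}")
```

In this hypothetical example the reviewers agree on 9 of 10 items and chance agreement is 0.5, so κ is (0.9 - 0.5) / (1 - 0.5) = 0.8.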
The electronic search identified 325 citations. From this list, we selected 186 articles after screening titles and abstracts (Figure). Finally, 133 articles were reviewed on the basis of their full text. (See the eAppendix [http://www.archinternmed.com] for the list of reports reviewed.)
Table 1 describes the main characteristics of trials reviewed. Nearly half of the reports (n = 59 [44.4%]) were published in NEJM, 26 (19.5%) in Lancet, 21 (15.8%) in JAMA, 19 (14.3%) in BMJ, and 8 (6.0%) in Ann Intern Med. No report reviewed was published in PLoS Med. Most reports reviewed assessed pharmacological interventions (n = 110 [82.7%]). The trial was multicentric in 96 (72.2%) of the reports, and the median [IQR] sample size was 462 [185-1001]. The funding source was partially or completely for-profit funding in 73 reports (54.9%).
Table 2 describes the main characteristics of the reporting of harm-related results. Interobserver reproducibility was good for the main items assessed, with κ coefficients of 0.57 to 0.90 (for 3 main items, the κ coefficient was not calculated owing to the κ paradox) and interrater agreement of 0.80 to 0.97.
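The κ paradox mentioned above is commonly understood as the situation in which raw agreement is high but κ is low because nearly all items fall into one category. The sketch below illustrates this with a hypothetical 2 × 2 agreement table for 30 articles (the size of the quality assurance sample); the cell counts are invented for illustration and are not the study's data, and `kappa_from_2x2` is a hypothetical helper name.

```python
def kappa_from_2x2(both_yes, only_a_yes, only_b_yes, both_no):
    """Return (observed agreement, Cohen's kappa) for a 2x2 agreement table."""
    n = both_yes + only_a_yes + only_b_yes + both_no
    if n == 0:
        raise ValueError("table is empty")
    # Observed agreement: the two diagonal cells.
    p_observed = (both_yes + both_no) / n
    # Marginal "yes" totals for each reviewer.
    a_yes = both_yes + only_a_yes
    b_yes = both_yes + only_b_yes
    # Chance agreement from the product of the marginals.
    p_chance = (a_yes * b_yes + (n - a_yes) * (n - b_yes)) / (n * n)
    if p_chance == 1.0:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return p_observed, (p_observed - p_chance) / (1 - p_chance)


# Hypothetical table: both reviewers say "yes" for 28 of 30 articles,
# each reviewer says "yes" alone once, and they never both say "no".
agreement, kappa = kappa_from_2x2(both_yes=28, only_a_yes=1, only_b_yes=1, both_no=0)
print(f"observed agreement = {agreement:.3f}, kappa = {kappa:.3f}")
```

Here the reviewers agree on 28 of 30 articles (93%), yet chance agreement is 842/900, so κ is slightly negative (-2/58). Under these assumed counts κ would misrepresent the reviewers' concordance, which is one plausible reason for not reporting it for such items.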
Adverse events were mentioned in the abstract of 95 articles (71.4%; range, 47.4% for BMJ to 84.8% for NEJM). Adverse events were reported in 118 articles (88.7%), and numerical data were provided in 112 (84.2%; range, 57.9% for BMJ to 91.5% for NEJM).
No information on severity of adverse events was given in 36 articles (27.1%), and 16 (12.0%) reported only generic statements. Severity grades were described in 21 reports (15.8%). The name of the scale used was given in 18 of these. The main scale used was the Common Terminology Criteria for Adverse Events toxicity scale (n = 10). In 63 reports (47.4%), no data were given on withdrawals owing to adverse events. The description of adverse events leading to withdrawals was given in only 17 articles (12.8%).
In total, 43 articles (32.3%) exhibited some restrictions in the reporting of harm-related data. The adverse events reported were the most common adverse events only (n = 17), severe adverse events only (n = 16), statistically significant adverse events only (n = 5), and a combination of restrictions (n = 5). In reports that described only common adverse events, the median threshold rate was 5% (minimum = 2%; maximum = 20%). Safety results were reported at the level of the event (n = 114 [85.7%]) or the patient (ie, patients with ≥1 adverse event; n = 76 [57.1%]) or as a composite outcome combining adverse events per organ (n = 32 [24.1%]).
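The counts and percentages in this paragraph can be checked with a few lines of arithmetic. The sketch below uses only the numbers reported above, with 133 reviewed reports as the denominator; the variable and function names are illustrative.

```python
TOTAL_REPORTS = 133  # reports reviewed on the basis of their full text


def percent(count, total=TOTAL_REPORTS):
    """Percentage of `total`, rounded to 1 decimal place as in the text."""
    return round(100 * count / total, 1)


# Types of restriction in the reporting of harm-related data.
restriction_counts = {
    "most common adverse events only": 17,
    "severe adverse events only": 16,
    "statistically significant adverse events only": 5,
    "combination of restrictions": 5,
}
restricted_total = sum(restriction_counts.values())

# Level at which safety results were reported.
reporting_level_counts = {"event": 114, "patient": 76, "composite per organ": 32}
reporting_level_percents = {
    level: percent(count) for level, count in reporting_level_counts.items()
}

print(restricted_total, percent(restricted_total))
print(reporting_level_percents)
```

The four restriction categories sum to the 43 restricted articles (32.3%). The three reporting levels sum to more than 133, which suggests that they are not mutually exclusive and that a single report could present results at more than one level.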
Sixty-three articles (47.4%) described the use of at least 1 statistical test to compare safety data among groups. The median number of statistical tests reported for comparison of safety data was 5 (IQR, 2-12; range, 1-75). Thirty-seven articles (27.8%) gave the statistical tests for each type of event described. Fifty-one articles (38.3%) described global statistical comparisons, for the total number of adverse events per group (n = 5 [9.8%]), the number of patients with at least 1 adverse event per group (n = 8 [15.7%]), the total number of severe adverse events per group (n = 12 [23.5%]), and the number of adverse events combined by organ per group (n = 5 [9.8%]).
Table 3 describes the main characteristics of tables and figures related to the reporting of harm-related results. In total, 43 articles (32.3%) contained no table or figure describing safety. Numerical data in tables were described at the level of events in 25 of 90 reports (27.8%), the patient in 31 of 90 (34.4%), and both in 8 (8.9%). No distinction between events and patients was discernable in 26 reports (28.9%). Among the 90 articles with at least 1 table or figure dedicated to safety, statistical tests were given in 56 (62.2%). The population considered for safety analysis was clearly reported in tables of 59 of 90 reports (65.6%), in which 32 described the analysis as intention-to-treat.
We assessed the reporting of harm-related results in reports of RCTs published in 6 core medical journals in 2006. These results highlight the finding that despite the publication of a CONSORT statement13 extension for harm-related data, the reporting of harm remains inadequate: descriptions of adverse events with numerical data in each trial arm were missing in 18% of the reports, and information related to the severity of adverse events and the withdrawal of patients owing to adverse events was lacking in 27.1% and 47.4% of reports, respectively. We also found important heterogeneity and variability in the reporting of safety results. About one-third of the reports restricted the description of adverse events to the most common events or to severe or statistically significant results. This reporting of safety data goes against the recommendations of the CONSORT statement13 extension checklist for reporting harm. The restrictions in reporting harm-related results in published reports may obscure some important rare and potentially severe adverse events.
For statistical comparisons, a median of 5 statistical tests was reported for safety analyses. We observed important variability, ranging from 1 to 75, in the number of statistical tests reported. About one-third of the reports described at least 10 statistical tests. Randomized controlled trials are well known to have insufficient statistical power to assess safety outcomes. Tsang et al16 showed that in a sample of RCTs, the power to detect a statistically significant difference in serious adverse events yielded values ranging from 0.07 to 0.37. Consequently, as previously highlighted by Ioannidis et al17(p299): “We must no longer accept confusing lists of noncomparable percentages of adverse events for clinical or scientific purposes. These lists can needlessly alarm patients and physicians or invite dismissal of real medication hazards.” Jonville-Béra et al18 also emphasized that statements about “no differences” in adverse events between groups on the basis of nonsignificant P values are almost always inappropriate. Composite safety outcomes may be a method to deal with this inadequate statistical power19; however, only 32 reports described adverse events according to the organs involved, and 5 gave statistical comparisons per organ.
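To make the point about statistical power concrete, the following sketch approximates the power of a 2-sided test comparing adverse event proportions between 2 arms, using the usual normal approximation. It is an illustration only and does not reproduce the method of Tsang et al. The event rates (2% vs 4%) are hypothetical, and the arm size of 231 assumes that the median sample size of 462 in this review was split equally between arms.

```python
from math import sqrt
from statistics import NormalDist


def power_two_proportions(p_control, p_treated, n_per_arm, alpha=0.05):
    """Approximate power of a 2-sided z test for two proportions (equal arms)."""
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)
    # Standard error under the null hypothesis (pooled proportion).
    p_pooled = (p_control + p_treated) / 2
    se_null = sqrt(2 * p_pooled * (1 - p_pooled) / n_per_arm)
    # Standard error under the alternative (each arm has its own proportion).
    se_alt = sqrt(
        p_control * (1 - p_control) / n_per_arm
        + p_treated * (1 - p_treated) / n_per_arm
    )
    effect = abs(p_treated - p_control)
    # Probability of exceeding the critical value; the negligible
    # opposite-tail rejection probability is ignored.
    return std_normal.cdf((effect - z_alpha * se_null) / se_alt)


# Hypothetical scenario: adverse event rate of 2% vs 4%, 231 patients per arm.
power = power_two_proportions(p_control=0.02, p_treated=0.04, n_per_arm=231)
print(f"approximate power = {power:.2f}")
```

Under these assumptions, a trial of median size would detect a doubling of a 2% adverse event rate only about one time in four, which is why a nonsignificant P value for a safety comparison is weak evidence of no difference.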
When safety is reported, the study population analyzed is of great importance, because any exclusion of treated patients could influence the validity of harm-related results. Nevertheless, the population considered for safety analyses was not clearly reported in one-third of the reports. To our knowledge, no previous review has described with this precision the reporting of harm-related results from clinical trials. We identified few reviews that assessed the general reporting of harm-related results in clinical trials. Ioannidis and Lau1 assessed the completeness of safety reporting in RCTs published from 1997 through 1998. This review was restricted to 7 areas of drug therapy. Loke and Derry4 assessed the reporting of adverse drug reactions in reports of RCTs in 7 core medical journals published in 1997 before the publication of the CONSORT statement extension for harm-related data, whereas our review assessed reports published in 2006 after the publication of the extension. Ethgen et al7 assessed the reporting of harm-related results but compared pharmacological and nonpharmacological interventions in terms of harm.
Our study has some limitations. We considered reports of RCTs published in 6 core medical journals with a high impact factor. We did not consider specialized medical journals or journals with lower impact factors. Consequently, the generalizability of our results is limited. We excluded specific designs, such as factorial plans, crossover designs, cluster RCTs, and multiple-arm trials, because results of such trials are difficult and complicated to interpret and report. Consequently, our results cannot be generalized to studies of such designs. Finally, a single reviewer extracted all the data; however, the quality assurance procedure gave good reproducibility.
In conclusion, our results highlight the need for improving the reporting of harm-related results from clinical trials. Despite the CONSORT statement extension for harm-related data, efforts should still be made to describe safety results with accuracy in reports of RCTs and to standardize practices for reporting.
Correspondence: Philippe Ravaud, MD, PhD, Département d'Epidémiologie, Biostatistique, et Recherche Clinique, Groupe Hospitalier Bichat-Claude Bernard, 46 rue Henri Huchard, 75877 Paris, CEDEX 18, France.
Accepted for Publication: February 2, 2009.
Author Contributions: All authors had full access to all of the data (including statistical reports and tables) in the study and take responsibility for the integrity of the data and the accuracy of the data analysis. Study concept and design: Pitrou and Ravaud. Acquisition of data: Pitrou and Ahmad. Analysis and interpretation of data: Pitrou, Boutron, Ahmad, and Ravaud. Drafting of the manuscript: Pitrou, Ahmad, and Ravaud. Critical revision of the manuscript for important intellectual content: Boutron and Ravaud. Obtained funding: Ravaud. Administrative, technical, and material support: Ravaud. Study supervision: Ravaud.
Financial Disclosure: None reported.
Funding/Support: Dr Pitrou was funded by a grant from the Ministry of Higher Education and Research, France. The researchers were independent from the funding source.