Boutron I, Dutton S, Ravaud P, Altman DG. Reporting and Interpretation of Randomized Controlled Trials With Statistically Nonsignificant Results for Primary Outcomes. JAMA. 2010;303(20):2058–2064. doi:10.1001/jama.2010.651
Author Affiliations: Centre for Statistics in Medicine, University of Oxford, Oxford, United Kingdom (Drs Boutron and Altman and Ms Dutton); INSERM, U738, Paris, France (Drs Boutron and Ravaud); Assistance Publique des Hôpitaux de Paris, Hôpital Hôtel Dieu, Centre d’Épidémiologie Clinique, Paris (Drs Boutron and Ravaud); and Université Paris Descartes, Faculté de Médecine, Paris (Drs Boutron and Ravaud).
Context Previous studies indicate that the interpretation of trial results can be distorted by authors of published reports.
Objective To identify the nature and frequency of distorted presentation or “spin” (ie, specific reporting strategies, whatever their motive, to highlight that the experimental treatment is beneficial, despite a statistically nonsignificant difference for the primary outcome, or to distract the reader from statistically nonsignificant results) in published reports of randomized controlled trials (RCTs) with statistically nonsignificant results for primary outcomes.
Data Sources March 2007 search of MEDLINE via PubMed using the Cochrane Highly Sensitive Search Strategy to identify reports of RCTs published in December 2006.
Study Selection Articles were included if they were parallel-group RCTs with a clearly identified primary outcome showing statistically nonsignificant results (ie, P ≥ .05).
Data Extraction Two readers appraised each selected article using a pretested, standardized data abstraction form developed in a pilot test.
Results From the 616 published reports of RCTs examined, 72 were eligible and appraised. The title was reported with spin in 13 articles (18.0%; 95% confidence interval [CI], 10.0%-28.9%). Spin was identified in the Results and Conclusions sections of the abstracts of 27 (37.5%; 95% CI, 26.4%-49.7%) and 42 (58.3%; 95% CI, 46.1%-69.8%) reports, respectively, with the conclusions of 17 (23.6%; 95% CI, 14.4%-35.1%) focusing only on treatment effectiveness. Spin was identified in the main-text Results, Discussion, and Conclusions sections of 21 (29.2%; 95% CI, 19.0%-41.1%), 31 (43.1%; 95% CI, 31.4%-55.3%), and 36 (50.0%; 95% CI, 38.0%-62.0%) reports, respectively. More than 40% of the reports had spin in at least 2 of these sections in the main text.
Conclusion In this representative sample of RCTs published in 2006 with statistically nonsignificant primary outcomes, the reporting and interpretation of findings was frequently inconsistent with the results.
Accurate presentation of the results of a randomized controlled trial (RCT) is the cornerstone of the dissemination of the results and their implementation in clinical practice. The Declaration of Helsinki states that “Authors have a duty to make publicly available the results of their research on human subjects and are accountable for the completeness and accuracy of their reports.” To help enforce this principle, trial registration is required,1 and reporting guidelines are available.2 However, investigators usually have broad latitude in writing their articles3; they can choose which data to report and how to report them.
Consequently, scientific articles are not simply reports of facts, and authors have many opportunities to consciously or subconsciously shape the impression of their results for readers, that is, to add “spin” to their scientific report.4 Spin can be defined as specific reporting that could distort the interpretation of results and mislead readers.3,5,6 The use of spin in scientific writing can result from ignorance of the scientific issue, unconscious bias, or willful intent to deceive.3 Such distorted presentation and interpretation of trial results in published articles has been highlighted in letters to editors criticizing the interpretation of results7 and in methodological reviews evaluating misleading claims in published reports of RCTs8,9 or systematic reviews.10 However, to our knowledge, the strategies used to create spin in published articles have never been systematically assessed.
We aimed to identify spin in reports of parallel-group RCTs with statistically nonsignificant results for the primary outcome and to develop a scheme for classification of spin strategies. We focused on trials with statistically nonsignificant primary outcomes because the interpretation of these results is more likely to be affected by a preconceived notion of effectiveness, resulting in a biased interpretation.9
The articles were screened from a representative cohort of articles of RCTs indexed in PubMed. The search strategy and eligibility criteria for this cohort have been described elsewhere.11 Randomized controlled trials were defined as prospective studies assessing health care interventions in human participants randomly allocated to study groups. Reports of cost-effectiveness studies, reports of diagnostic test accuracy, and non–English-language reports were excluded.
In brief, the Cochrane Highly Sensitive Search Strategy,12 performed in PubMed to identify primary reports of RCTs published in December 2006 and indexed in PubMed by March 22, 2007, yielded 1735 PubMed citations. After reading the titles and abstracts of retrieved citations, reports of obviously noneligible trials were excluded, and the full-text article and any online appendices were obtained and evaluated for 879 selected citations. Of these, 263 citations were excluded after the full text was read; the remaining 616 were included in this representative sample of RCTs.
From this sample, we selected parallel-group RCTs with clearly identified primary outcomes. We excluded equivalence or noninferiority trials, crossover trials, cluster trials, factorial and split-body designs, trials with more than 2 groups, and phase 2 trials. Primary outcomes were those explicitly reported as such in the published article. If none was explicitly reported, we considered the outcomes stated in the sample size estimation; if outcomes were not stated in the sample size estimation, we took the outcomes in the primary study objectives, if available. If no primary outcome was clearly identified (ie, explicitly specified in the article, in a sample size calculation, or in the primary study objectives), the article was excluded.
One reviewer (I.B.) screened the full-text articles and determined results for all primary outcomes according to statistical significance: results statistically significant (ie, P < .05), results that did not reach statistical significance (ie, P ≥ .05), or unclear results. We included only trials with nonsignificant results (ie, P ≥ .05) for all primary outcomes. When no formal statistical analyses were reported for the primary outcomes, we attempted to calculate the effect size and confidence interval for the primary outcomes, and the article was included if the estimated treatment effect was not statistically significant. If we could not calculate the effect size using the published data, the article was excluded.
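The inclusion rule described above can be sketched in code. This is only an illustration, not the authors' procedure, since screening was done by a reviewer reading each full-text article. The function name `is_eligible` and the use of `None` to encode a primary outcome with no usable P value are assumptions introduced here. The sketch also does not cover the fallback step in which the effect size and confidence interval were recalculated from published data.

```python
from typing import Optional, Sequence

ALPHA = 0.05  # significance threshold stated in the report (P >= .05 is nonsignificant)


def is_eligible(p_values: Sequence[Optional[float]]) -> bool:
    """Return True only if every primary outcome has a reported P >= .05.

    `None` is a hypothetical encoding for an outcome whose result is unclear.
    """
    if not p_values:
        # No clearly identified primary outcome: the article was excluded.
        return False
    return all(p is not None and p >= ALPHA for p in p_values)


if __name__ == "__main__":
    print(is_eligible([0.21]))        # single nonsignificant primary outcome
    print(is_eligible([0.21, 0.03]))  # one of two primary outcomes is significant
    print(is_eligible([None]))        # unclear result
```

Requiring all primary outcomes to be nonsignificant means that a trial with mixed results is never counted, which matches the restriction to "nonsignificant results for all primary outcomes."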
For each selected article, 2 readers (I.B., S.D.) independently read the title, abstract, and Methods, Results, Discussion, and Conclusions sections, as well as online appendices referenced in the articles, when available. The reviewers independently appraised the content of the article using a pretested and standardized data abstraction form; then they met to compare results. All discrepancies were discussed to obtain consensus; if needed, the article was discussed with a third reader (D.G.A.). The reproducibility was moderate, with a κ of 0.47 (95% confidence interval [CI], 0.27-0.67) for presence of spin in the abstract Conclusions and of 0.64 (95% CI, 0.47-0.82) for spin in the article Conclusions.
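For readers unfamiliar with the κ statistic, the following sketch shows how agreement between 2 readers can be computed. It assumes the unweighted Cohen κ, which the report does not name explicitly, and the ratings in the usage example are hypothetical (the report does not publish the readers' individual ratings). The function name `cohen_kappa` is introduced here for illustration.

```python
from typing import Sequence


def cohen_kappa(rater_a: Sequence[int], rater_b: Sequence[int]) -> float:
    """Unweighted Cohen kappa for two raters rating the same items."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("both raters must rate the same, nonempty set of items")
    n = len(rater_a)
    # Observed agreement: share of items on which the raters agree.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the raters' marginal proportions per category.
    categories = set(rater_a) | set(rater_b)
    expected = sum(
        (list(rater_a).count(c) / n) * (list(rater_b).count(c) / n)
        for c in categories
    )
    if expected == 1.0:
        return 1.0  # both raters used one identical category for every item
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    # Hypothetical ratings: 1 = spin present, 0 = spin absent.
    reader_1 = [1, 1, 0, 0]
    reader_2 = [1, 0, 0, 0]
    print(cohen_kappa(reader_1, reader_2))
```

Because κ corrects the raw agreement for agreement expected by chance, it can be moderate even when the readers agree on most articles.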
For each selected article, we recorded the funding source (ie, for-profit, nonprofit, or both; not reported, no funding), 2007 journal impact factor, number of citations in 2008, the experimental intervention, comparator, sample size, and type of primary outcomes (safety, efficacy, both).
We checked whether the primary outcomes were clearly identified in the abstract. We also recorded the reporting of results for the primary outcomes both in the abstract and in the article (ie, reporting of estimated effect size with or without precision and reporting of summary statistics [eg, proportion of event, mean] for each group with or without precision).
In the context of a trial with statistically nonsignificant primary outcomes, spin was defined as use of specific reporting strategies, from whatever motive, to highlight that the experimental treatment is beneficial, despite a statistically nonsignificant difference for the primary outcome, or to distract the reader from statistically nonsignificant results.
All of the authors participated in the development of a classification scheme to standardize the collection of the strategies used for spin in the selected reports. For this purpose, in a first step, we reviewed the literature published on this topic.3,6,13-22 We also contacted by e-mail all the members of the Cochrane Statistical Method Group and invited them to send us any examples of published RCTs with spin, in any medical field, and with any publication date. Lastly, we reviewed a sample of trials with statistically nonsignificant results published in general medical journals with high impact factors or in specialist journals.23 The classification scheme was developed following discussion and agreement among the authors.
Using the developed classification scheme, we searched for spin in each section of the manuscript in our sample, ie, abstract Results; abstract Conclusions; and main-text Results, Discussion, and Conclusions (ie, last paragraph of the manuscript when this paragraph summarized the results) sections. We then determined whether authors had used a spin strategy. The strategies of spin considered were (1) a focus on statistically significant results (within-group comparison, secondary outcomes, subgroup analyses, modified population of analyses); (2) interpreting statistically nonsignificant results for the primary outcomes as showing treatment equivalence or comparable effectiveness; and (3) claiming or emphasizing the beneficial effect of the treatment despite statistically nonsignificant results. All other spin strategies that could not be classified according to this scheme were systematically recorded and secondarily classified.
We determined the extent of spin across the whole report, defined as the number of sections with spin in the abstract (spin in the Results section only, in the Conclusions section only, or in both sections) and in the main text (spin in one section other than the Conclusions section, in the Conclusions section only, in 2 sections, or in all 3 sections). The assessment of the extent of spin is exploratory and should not be considered a scoring system. This classification scheme was developed by consensus among the authors for a pragmatic purpose: to be able to capture the diversity of spin in terms of volume (ie, whether spin concerned only a small part or most of the article).
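The extent categories described above can be written as a small lookup. This is a sketch under stated assumptions: the function names `abstract_extent` and `main_text_extent`, the Boolean inputs, and the "no spin" label are introduced here for illustration, and, as the text notes, the categories are descriptive and not a scoring system.

```python
def abstract_extent(results: bool, conclusions: bool) -> str:
    """Extent of spin in the abstract, from per-section spin flags."""
    if results and conclusions:
        return "both sections"
    if results:
        return "Results section only"
    if conclusions:
        return "Conclusions section only"
    return "no spin"


def main_text_extent(results: bool, discussion: bool, conclusions: bool) -> str:
    """Extent of spin in the main text, from per-section spin flags."""
    count = int(results) + int(discussion) + int(conclusions)
    if count == 0:
        return "no spin"
    if count == 3:
        return "all 3 sections"
    if count == 2:
        return "2 sections"
    # Exactly one section has spin: distinguish Conclusions from the others.
    if conclusions:
        return "Conclusions section only"
    return "one section other than the Conclusions section"


if __name__ == "__main__":
    print(abstract_extent(results=True, conclusions=False))
    print(main_text_extent(results=False, discussion=True, conclusions=True))
```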
We also classified the level of spin in the Conclusions sections of the abstract and the main text as follows. High spin was defined as no uncertainty in the framing, no recommendations for further trials, and no acknowledgment of the statistically nonsignificant results for the primary outcomes; in addition, when the Conclusions section reported recommendations to use the treatment in clinical practice, we classified this section as having a high level of spin. Moderate spin was defined as some uncertainty in the framing or recommendations for further trials but no acknowledgment of the statistically nonsignificant results for the primary outcomes. Low spin was defined as uncertainty in the framing and recommendations for further trials or acknowledgment of the statistically nonsignificant results for the primary outcomes. This classification of the level of spin is exploratory and not validated and should not be considered a scoring system. The level of spin was used to explore the heterogeneity of spin in the reporting of conclusions.
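The definitions of high, moderate, and low spin can be expressed as a decision rule. The sketch below is one possible reading, not a validated instrument: the written definitions of moderate and low spin overlap when a section shows both uncertainty and a recommendation for further trials without acknowledging the nonsignificant result, and this sketch assumes that case is low. The function name `spin_level` and its Boolean parameters are hypothetical, and the rule is meant to apply only to Conclusions sections already judged to contain spin.

```python
def spin_level(
    uncertainty: bool,
    further_trials: bool,
    acknowledges_nonsignificance: bool,
    recommends_clinical_use: bool,
) -> str:
    """Classify a Conclusions section with spin as 'high', 'moderate', or 'low'."""
    # A recommendation to use the treatment in practice is classified as high.
    if recommends_clinical_use:
        return "high"
    # Low: acknowledgment of the nonsignificant primary outcome, or
    # (assumed reading) uncertainty together with a call for further trials.
    if acknowledges_nonsignificance or (uncertainty and further_trials):
        return "low"
    # Moderate: some uncertainty or a call for further trials, no acknowledgment.
    if uncertainty or further_trials:
        return "moderate"
    # High: no uncertainty, no call for further trials, no acknowledgment.
    return "high"


if __name__ == "__main__":
    print(spin_level(False, False, False, False))
    print(spin_level(True, False, False, False))
    print(spin_level(False, False, True, False))
```

Checking the clinical-use recommendation first reflects the statement that such sections were classified as high; how the authors ordered the remaining criteria in practice is not described.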
Medians and interquartile ranges for continuous variables and number (%) of articles for categorical variables were calculated. Statistical analyses were performed using SAS version 9.1 (SAS Institute Inc, Cary, North Carolina).
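The proportions in this report are given with 95% confidence intervals, but the method used to compute them is not stated. The sketch below assumes an exact binomial (Clopper-Pearson) interval, which is an assumption and not a documented detail of the analysis; the function names `binom_cdf` and `exact_ci` are introduced here. It uses only the Python standard library and finds each limit by bisection.

```python
from math import comb
from typing import Tuple


def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1.0 - p) ** (n - i) for i in range(k + 1))


def exact_ci(k: int, n: int, alpha: float = 0.05) -> Tuple[float, float]:
    """Clopper-Pearson interval for k events out of n, found by bisection."""
    if not 0 <= k <= n or n == 0:
        raise ValueError("need 0 <= k <= n and n > 0")

    # Lower limit: the p at which P(X >= k) equals alpha / 2.
    if k == 0:
        lower = 0.0
    else:
        lo, hi = 0.0, 1.0
        for _ in range(60):
            mid = (lo + hi) / 2.0
            if 1.0 - binom_cdf(k - 1, n, mid) < alpha / 2.0:
                lo = mid  # upper tail still too small, so p must increase
            else:
                hi = mid
        lower = (lo + hi) / 2.0

    # Upper limit: the p at which P(X <= k) equals alpha / 2.
    if k == n:
        upper = 1.0
    else:
        lo, hi = 0.0, 1.0
        for _ in range(60):
            mid = (lo + hi) / 2.0
            if binom_cdf(k, n, mid) > alpha / 2.0:
                lo = mid  # lower tail still too large, so p must increase
            else:
                hi = mid
        upper = (lo + hi) / 2.0

    return lower, upper


if __name__ == "__main__":
    # 13 of 72 titles were reported with spin.
    low, high = exact_ci(13, 72)
    print(f"{13 / 72:.1%} (95% CI, {low:.1%}-{high:.1%})")
```

The printed interval can be compared with the one reported for titles with spin (13 of 72 articles) to judge whether the exact-method assumption is plausible.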
Of the 616 PubMed citations retrieved, 205 reports of parallel-group RCTs were identified. Among these reports, we identified and appraised 72 reports with statistically nonsignificant results for the primary outcomes (Figure). Characteristics of the included reports are presented in Table 1. Most reports evaluated efficacy (n = 63 [87.5%; 95% CI, 77.6%-94.1%]), and half evaluated pharmacological treatments. The funding source was for-profit (only or with a nonprofit source) in one-third of the reports and was not stated in 27 (37.5%).
Primary outcomes were clearly identified in 44 of the 72 report abstracts (61.1%; 95% CI, 48.9%-72.4%). In 3 abstracts (4.2%; 95% CI, 0.9%-11.7%), a secondary outcome was reported as being the primary outcome. Only 9 abstracts (12.5%; 95% CI, 5.9%-22.4%) reported the effect size and 95% confidence interval, and 28 (38.9%; 95% CI, 27.6%-51.1%) did not report any numerical results for primary outcomes. In only 16 articles (22.2%; 95% CI, 13.3%-33.6%) did the main text describe the effect size and its precision for primary outcomes; in 21 (29.2%; 95% CI, 19.0%-41.1%), the main text reported only summary statistics for each group, without precision.
The strategies of spin in each article section are shown in Table 2. The title was reported with spin in 13 of the 72 articles (18.0%; 95% CI, 10.0%-28.9%). Spin was identified in 27 (37.5%; 95% CI, 26.4%-49.7%) and 42 (58.3%; 95% CI, 46.1%-69.8%) of the abstract Results and Conclusions sections, respectively. We identified spin in 21 (29.2%; 95% CI, 19.0%-41.1%), 31 (43.1%; 95% CI, 31.4%-55.3%), and 36 (50.0%; 95% CI, 38.0%-62.0%) of the main-text Results, Discussion, and Conclusions sections, respectively.
The strategies of spin were also diverse (Table 2). Examples are provided in eTable 1. In abstracts, spin consisted mainly of focusing on within-group comparison and subgroup analyses in the Results section. One-quarter of the abstract Conclusions sections focused on only the beneficial effect of treatment, claiming equivalence or comparable effectiveness (n = 10 [13.9%; 95% CI, 6.9%-24.1%]), claiming efficacy (n = 4 [5.6%; 95% CI, 1.5%-13.6%]), or focusing on only statistically significant results such as within-group, secondary outcome, or subgroup analyses (n = 3 [4.2%; 95% CI, 0.9%-11.7%]). Furthermore, 9 abstract Conclusions sections (12.5%; 95% CI, 5.9%-22.4%) acknowledged statistically nonsignificant primary outcomes but focused on or emphasized statistically significant results.
Other specific strategies of spin were identified. In some reports in which primary outcomes concerned safety, authors interpreted statistically nonsignificant results as demonstrating lack of any difference in adverse events. As an example, the authors of one study concluded that “we have demonstrated (for the first time) that [with the treatment], embryo implantation is unaltered.” Some reports focused on an overall within-group comparison as if the planned trial were a before-after study, concluding, for example, that “the mean improvement . . . was clinically relevant in both treatment groups.”
Some authors focused on another objective to distract the reader from the statistically nonsignificant results, such as identifying a genetic prognostic factor of improvement.
As shown in Table 3, the extent of spin varied. In total, 49 of the 72 abstracts (68.1%; 95% CI, 56.0%-78.6%) and 44 main texts (61.1%; 95% CI, 48.9%-72.4%) were classified as having spin in at least 1 section. More than 40% of the articles had spin in at least 2 sections of the main text. Spin was identified in all sections of 20 abstracts (27.8%; 95% CI, 17.9%-39.6%) and 14 articles (19.4%; 95% CI, 11.1%-30.5%).
The level of spin in Conclusions sections is illustrated in Table 3, and examples are provided in eTable 2. We identified spin in more than half of the Conclusions sections; the level of spin was high (ie, no uncertainty in the framing, no recommendations for further trials, and no acknowledgment of the statistically nonsignificant results for the primary outcomes or recommendations to use the treatment in clinical practice) in 24 abstract Conclusions sections (33.3%; 95% CI, 22.7%-45.4%) and 19 main-text Conclusions sections (26.4%; 95% CI, 16.7%-38.1%).
Examples of spin identified are presented in the eAppendix.
This study appraised the strategies of spin used in reports of RCTs with statistically nonsignificant results for primary outcomes. We evaluated 72 reports selected from all reports of RCTs published in December 2006.11 Spin used in the articles and their abstracts was common, but strategies used for spin varied. Furthermore, spin seemed more prevalent in article abstracts than in the main texts of articles.
Our results are consistent with those of other related studies showing a positive relation between financial ties and favorable conclusions stated in trial reports.24,25 Other studies assessed discrepancies between results and their interpretation in the Conclusions sections.10,26 Yank and colleagues10 found that for-profit funding of meta-analyses was associated with favorable conclusions but not favorable results. Other studies have shown that the Discussion sections of articles often lacked a discussion of limitations.27
Our results add to these previous methodological reviews10,24-26 in that ours was a systematic study of the use of inappropriate presentation in published trial reports, for which we propose a classification of the strategies authors use for spin in their reports. Furthermore, unlike other studies10,24-26 that investigated a specific category of journals, medical area, or category of treatment, ours took a representative sample.
We identified many strategies of spin. The most familiar and common approach was to focus on statistically significant results for other analyses, such as within-group comparisons, secondary outcomes, or subgroup analyses. Another common strategy was to interpret P > .05 as demonstrating a similar effect when the study was not designed to assess equivalence or noninferiority (such trials require specific design and conduct, as well as a larger sample size, than superiority trials). This dubious interpretation was used only when the comparator was an active treatment.
Some authors interpreted the trial results as being from a before-after study; they focused on within-group comparisons that were statistically significant for the experimental treatment but not for the comparator, which they incorrectly interpreted as demonstrating the beneficial effect of the treatment.28 Some authors reported that they had demonstrated the beneficial effect of both treatments when the results showed a statistically significant change from baseline for each group or for both groups combined.20,29 Some reports of safety trials provided an inadequate interpretation of the nonsignificant results by concluding lack of harm of the experimental treatment. Other methods relied on masking the nonsignificant results by focusing on other objectives. In one report, the authors statistically compared the experimental group not with the comparator in that trial but rather with the placebo group of another trial to conclude that the treatment was better than placebo.
Lastly, our results highlight the high prevalence of spin in the abstract as compared with the main text of an article. These results have important implications, because readers often base their initial assessment of a trial on the information reported in an abstract. They may then use this information to decide whether to read the full report, if available. Furthermore, abstracts are freely available, and in some situations, clinical decisions might be made on the basis of the abstract alone.30
Our study has several limitations. First, the assessment of spin necessarily involved some subjectivity, because the strategies used for spin were highly variable and interpretation depended on the context. Interpretation of trial results is not a straightforward process, and some disagreement may arise, even among authors.31 We attempted to limit this subjectivity by having 2 reviewers extract the data independently using a standardized data abstraction form, with any disagreements resolved by consensus. However, to our knowledge, no objective measure exists for the subjective component of interpretation.32 Consequently, to be completely transparent, a detailed summary of all the examples of spin we classified is available in the eAppendix.
We dichotomized trial findings as positive or negative using an arbitrary value (P = .05) as a significance threshold. However, we acknowledge that the interpretation of RCTs should not be based solely on the arbitrary P value of .05 dichotomizing findings as positive or negative.
We focused on spin only in trials for which the primary outcomes were clearly defined and results for the primary outcomes were not statistically significant. This focus implies that the strategies identified may not be applicable to all reports of RCTs and that other strategies of spin may not have been identified. Furthermore, when the results of an RCT are not statistically significant, the risk of spin may be increased. Trialists and sponsors are rarely neutral regarding the results of their trial. They may have invested considerable time, energy, and money in developing the experimental intervention and expended much effort in planning and conducting the trial. Therefore, they may have a strong preconception about the beneficial effect of the experimental intervention. Furthermore, the results of the trial could have important implications at different levels, eg, for the publication of the trial results in terms of delay and type of journal33; for the use of the experimental treatment in clinical practice; and, consequently, for future career advancement or profit.34,35 A trial with statistically nonsignificant results will thus frequently be a disappointment and could lead to subconscious or even deliberate intent to mislead the reader when presenting and interpreting the trial results.32,36 Few authors have studied this phenomenon, but Hewitt and colleagues reviewed a panel of 17 trial reports with nonsignificant results published in BMJ. They found that, despite evidence that the treatment might be ineffective, in 3 trials the authors seemed to support the experimental intervention.9
We focused on only some categories of spin, and other forms of spin may not have been identified. For example, we did not consider some specific strategies of spin, such as authors obscuring the risk associated with the experimental treatment, as reported in the published Vioxx GI Outcomes Research (VIGOR) study. That report concealed the cardiovascular risk by presenting the hazard of myocardial infarction as if the comparator (ie, naproxen) were the intervention group, concluding on the protective effect of the comparator (relative risk, 0.2; 95% CI, 0.1-0.7)37 instead of the harmful effect of the experimental treatment (ie, rofecoxib) (relative risk, 5.00; 95% CI, 1.68-20.13).38
We cannot say to what extent the spin we identified might have been deliberately misleading, the result of lack of knowledge, or both. Nor are we able to draw conclusions about the possible effect of the spin on peer reviewers' and readers' interpretations. Studies evaluating the effect of framing on clinical practice have focused on the reporting of treatment-effect estimates and showed inconsistent results.39,40
Our study has identified many different strategies that authors use to provide a biased interpretation of results of RCTs with statistically nonsignificant results for primary outcomes. Peer and editorial reviewers must be aware of the different strategies of spin used to temper the article text. The choice of analyses reported (statistically significant analyses such as subgroup analyses or within-group analyses) and the terms used to report and interpret results are important in a scientific article. Special attention should be paid to inadequate interpretation of the trial results, particularly when authors conclude on efficacy from secondary outcomes, subgroup analyses, or within-group comparisons or when the authors inadequately interpret lack of difference as demonstrating equivalence in terms of safety or efficacy. The publication process in biomedical research tends to favor statistically significant results and to be responsible for “optimism bias” (ie, unwarranted belief in the efficacy of a new therapy).41 Reports of RCTs with statistically significant results for outcomes are published more often and more rapidly than are those of trials with statistically nonsignificant results.34,42 Good evidence exists of selective reporting of statistically significant results for outcomes in published articles.33,43-46
In conclusion, in this representative sample of RCTs indexed in PubMed and published in December 2006 with statistically nonsignificant primary outcomes, the reporting and interpretation of findings was frequently inconsistent with the results. However, this work is only a first step, and future research is needed. Determining which category and level of spin affect readers' interpretation is important. Future research on the reasons for and the mechanisms of spin would also be useful. We hope that highlighting this issue may lead to more vigilance by peer reviewers and editors to reduce the use of these questionable strategies, which can distort the interpretation of research findings.
Corresponding Author: Isabelle Boutron, MD, PhD, Centre d’Épidémiologie Clinique, Hôpital Hôtel Dieu, 1, Place du Parvis Notre-Dame, 75181 Paris CEDEX 4, France (email@example.com).
Author Contributions: Dr Boutron had full access to all of the data in the study and takes responsibility for the integrity of the data and the accuracy of the data analysis.
Study concept and design: Boutron, Dutton, Ravaud, Altman.
Acquisition of data: Boutron, Dutton.
Analysis and interpretation of data: Boutron, Dutton, Ravaud, Altman.
Drafting of the manuscript: Boutron.
Critical revision of the manuscript for important intellectual content: Dutton, Ravaud, Altman.
Statistical analysis: Boutron.
Financial Disclosures: None reported.
Funding/Support: Dr Boutron was supported by a grant from the Societe Francaise de Rhumatologie (SFR) and the Lavoisier Program (Ministère des Affaires étrangères et européennes).
Role of the Sponsors: The SFR and the Lavoisier Program (Ministère des Affaires étrangères et européennes) had no role in the design and conduct of the study; the collection, management, analysis, and interpretation of the data; or the preparation, review, or approval of the manuscript.
Additional Contributions: We are very grateful to Ly-Mee Yu, MSc (Centre for Statistics in Medicine, University of Oxford, Oxford, United Kingdom), for her important work in developing the database of the representative reports of randomized controlled trials indexed in PubMed. Ms Yu received no compensation for her contributions.