Figure. Correlation between the observed minus expected number of “positive” study data sets and the between-study heterogeneity (I2). Different colors are shown for each condition, and brain structures are also labeled on the plot. A indicates amygdala; ACC, anterior cingulate cortex; CN, caudate nucleus; GM, gray matter; GP, globus pallidus; H, hippocampus; l, left; LV, lateral ventricles; OC, orbitofrontal cortex; P, putamen; PF, prefrontal cortex; PTSD, posttraumatic stress disorder; r, right; T, thalamus; TL, temporal lobe; TV, third ventricle; VL1, vermal lobules I-IV; VL2, vermal lobules VI-VII; and WM, white matter.
Ioannidis JPA. Excess Significance Bias in the Literature on Brain Volume Abnormalities. Arch Gen Psychiatry. 2011;68(8):773-780. doi:10.1001/archgenpsychiatry.2011.28
Author Affiliations: Clinical and Molecular Epidemiology Unit, Department of Hygiene and Epidemiology, University of Ioannina School of Medicine, and Biomedical Research Institute, Foundation for Research and Technology–Hellas, Ioannina, Greece; Institute for Clinical Research and Health Policy Studies, Tufts Medical Center, and Department of Medicine, Tufts University School of Medicine, and Department of Epidemiology, Harvard School of Public Health, Boston, Massachusetts; and Stanford Prevention Research Center, Stanford University School of Medicine, Stanford, California.
Context Many studies report volume abnormalities in diverse brain structures in patients with various mental health conditions.
Objective To evaluate whether there is evidence for an excess number of statistically significant results in studies of brain volume abnormalities that suggest the presence of bias in the literature.
Data Sources PubMed (articles published from January 2006 to December 2009).
Study Selection Recent meta-analyses of brain volume abnormalities in participants with various mental health conditions vs control participants with 6 or more data sets included, excluding voxel-based morphometry.
Data Extraction Standardized effect sizes were extracted in each data set, and it was noted whether the results were “positive” (P < .05) or not. For each data set in each meta-analysis, I estimated the power to detect at α = .05 an effect equal to the summary effect of the respective meta-analysis. The sum of the power estimates gives the number of expected positive data sets. The expected number of positive data sets can then be compared against the observed number.
Data Synthesis From 8 articles, 41 meta-analyses with 461 data sets were evaluated (median, 10 data sets per meta-analysis) pertaining to 7 conditions. Twenty-one of the 41 meta-analyses had found statistically significant associations, and 142 of 461 (31%) data sets had positive results. Even if the summary effect sizes of the meta-analyses were unbiased, the expected number of positive results would have been only 78.5 compared with the observed number of 142 (P < .001).
Conclusion There are too many studies with statistically significant results in the literature on brain volume abnormalities. This pattern suggests strong biases in the literature, with selective outcome reporting and selective analyses reporting being possible explanations.
Brain volume abnormalities have been associated with a large variety of mental health diseases and conditions1-12 and have typically been a key topic in the discussion of the pathophysiology of mental disorders for the past 25 years.13-15 The literature on brain volume abnormalities is rapidly expanding, with hundreds of studies published to date. A considerable number of meta-analyses have already been published that try to summarize the results from these studies of brain volume abnormalities.1-12 These meta-analyses identify significant associations for specific brain volumes and structures for almost any disease and condition assessed, including schizophrenia, major depression, bipolar disorder, posttraumatic stress disorder, obsessive-compulsive disorder, autism, and personality disorders.1-8
The large number of statistically significant associations could have several explanations. One possibility is that all major mental conditions have genuine correlates with brain volumes. Some associations may indicate specific conditions, whereas others may be seen in very diverse diseases. Another possibility is that reporting bias is operating in the literature.16,17 Reporting bias could include the following mechanisms: (1) study publication bias, in which the results of statistically nonsignificant (“negative”) studies are left unpublished; (2) selective outcome reporting bias, in which results of outcomes (in this case, the volume of specific brain structures) that are negative are left unpublished, whereas the “positive” associations with other brain volumes are published; and (3) selective analysis reporting bias, in which data on the volume of a particular brain structure are analyzed with different methods and in which positive results are preferentially published over negative results. The common denominator of all these mechanisms is that the published literature has an excess of statistically significant results (ie, excess significance bias).18
Detecting these biases is not a straightforward process. Many meta-analyses in this field have applied asymmetry (funnel plot) tests that determine whether small studies give different results from larger studies.19 If so, such small-study effects may be due to publication bias or other reporting bias. However, these tests are neither sensitive nor specific for detecting reporting biases.20 In the literature on brain volume abnormalities, they may be particularly unsuitable because all studies have limited sample sizes, thus the range of “large” vs “small” studies is too narrow.21 Moreover, there are few studies that report data on a particular brain structure, and asymmetry tests do not work well when there are fewer than 10 to 20 studies.20,21
A more appropriate alternative is to apply an excess significance test that specifically evaluates whether there are too many reported studies that have statistically significant results.18 This test has the additional advantage that it can evaluate the excess of significant studies not only in a single meta-analysis but across many meta-analyses in a given field. This could include all meta-analyses of brain volumes for a given condition or all meta-analyses of brain volumes for many different conditions. Herein, I have applied such a test to evaluate whether the literature on brain volume abnormalities is subject to excess significance bias.
Data were collected from recent comprehensive meta-analyses of studies comparing participants with specific mental health conditions vs control participants for differences in brain volumes of specific brain structures. I focused on volumetric studies and did not consider meta-analyses that used measures of gray matter density derived from voxel-based morphometry. After perusing PubMed, I selected articles using the following search strategy: brain volume AND meta-analysis. I limited the search to human studies (last search December 20, 2009) and focused only on meta-analyses published from 2006 to 2009, because earlier articles would be likely to miss a large number of recent studies. The full text of potentially eligible articles was scrutinized. Of those, articles were retained if they included at least 1 meta-analysis for volume differences in a brain structure in which information was provided or could be calculated per study on the number of participants in each of the 2 compared groups (those with the condition of interest and controls) and the standardized effect size (expressed as Cohen's d, Hedges' g, or other similar standardized metrics) for the comparison. When more than 1 article on the same condition was eligible and contained usable data, complementary information from more than 1 article was used, if the usable data pertained to volumes of different brain structures; conversely, only the most recent article was retained, if the usable data pertained to the volumes of the same brain structures.
Herein, each study data set corresponds to a separate estimate of effect. The evaluation did not consider total brain and total intracranial volumes because these are very nonspecific measures and are often treated as covariates of no or limited interest. It also excluded meta-analyses with fewer than 6 study data sets; the cutoff decision was made a priori because for most meta-analyses with fewer study data sets, it would have been unlikely to make solid conclusions about the presence or absence of excess significance, given the limited evidence. Meta-analyses were accepted regardless of whether they analyzed separately left/right structures or just the total volume of both sides. When meta-analyses on both the total volume and left/right volumes were presented, either the total or the left/right side meta-analyses were kept, depending on which had a larger sample size; when the sample size was the same, the separate left/right data were preferred. Data were eligible regardless of what imaging technique and technical parameters thereof had been used.
I used a previously developed test for excess significance.18,22 In brief, the test evaluates whether the number of single-study data sets that report nominally statistically significant results (P < .05) among those included in a meta-analysis is too large based on the power that these data sets have to detect plausible effects at α = .05. In each meta-analysis, I calculated the power of each study data set. The sum of the power estimates gives the expected number of positive study data sets (those with nominally statistically significant results). As previously presented in detail,18,22 the observed number, O, of positive study data sets in each meta-analysis can be compared against the expected number, E, of positive study data sets with a χ2 test or with a binomial test, and the results are practically equivalent. The O vs E comparison is extended to many meta-analyses, by summing the O and E numbers from each meta-analysis. If there is no excess significance bias, then O = E. The greater the difference between O and E, the greater is the extent of excess significance bias.
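The O vs E comparison can be sketched in a few lines of Python. This is an illustrative sketch only, not the original analysis code: the function names `excess_significance_chi2` and `excess_significance_binomial` are hypothetical, the χ2 statistic is assumed to take the 1-degree-of-freedom form (O − E)²/E + (O − E)²/(n − E), and the binomial version is assumed to be a one-sided exact test with success probability E/n.

```python
import math


def excess_significance_chi2(observed, powers):
    """Chi-square comparison of O (observed positives) with E (sum of power).

    Assumes 0 < E < n, where n is the number of data sets.
    Returns (E, chi-square statistic, P value on 1 degree of freedom).
    """
    n = len(powers)
    expected = sum(powers)  # E: expected number of positive data sets
    diff_sq = (observed - expected) ** 2
    chi2 = diff_sq / expected + diff_sq / (n - expected)
    # Upper tail of a chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return expected, chi2, p_value


def excess_significance_binomial(observed, powers):
    """One-sided exact binomial P value for observing >= O positives.

    Assumption: every data set is treated as having the same
    success probability E / n.
    """
    n = len(powers)
    p = sum(powers) / n
    return sum(
        math.comb(n, x) * p ** x * (1.0 - p) ** (n - x)
        for x in range(observed, n + 1)
    )


if __name__ == "__main__":
    # Hypothetical meta-analysis: 10 data sets, each with power 0.2,
    # of which 5 reported P < .05.
    powers = [0.2] * 10
    print(excess_significance_chi2(5, powers))
    print(excess_significance_binomial(5, powers))
```

To extend the comparison across many meta-analyses, the same functions can be applied to the pooled list of power estimates and the pooled count of positive data sets.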
The estimated power of each study data set depends on what is the plausible effect size.18 The true effect size for any meta-analysis cannot be known. In the absence of bias, one would expect the observed (estimated) summary effect size to be a good representation of the true effect sizes, allowing simply for estimation or random error. In the presence of bias, one would expect the observed effect size to be larger than the true effect size, and the divergence would be expected to become larger with an increased level of bias. Thus, one has to consider a range of values for the plausible true effect size that may not be the same as the observed one. Herein, I considered an optimistic scenario, in which the true effect is assumed to be equal to the observed effect, and a more pessimistic scenario, in which the true effect is assumed to be equal to half the observed effect.23
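The two scenarios can be illustrated with a simple power calculation. The sketch below uses a normal approximation to the two-sided, two-group comparison of means; this is an assumption made for brevity and will differ slightly from a t-test-based calculation, especially for small samples. The function name `power_two_group`, the effect size, and the sample sizes are all hypothetical.

```python
from statistics import NormalDist


def power_two_group(d, n1, n2, alpha=0.05):
    """Approximate power to detect a standardized mean difference d.

    Normal approximation for a two-sided test comparing two independent
    groups of sizes n1 and n2 at significance level alpha.
    """
    norm = NormalDist()
    z_crit = norm.inv_cdf(1.0 - alpha / 2.0)
    # Expected value of the test statistic under the assumed effect
    shift = abs(d) * (n1 * n2 / (n1 + n2)) ** 0.5
    # Probability of landing in either rejection region
    return norm.cdf(shift - z_crit) + norm.cdf(-shift - z_crit)


if __name__ == "__main__":
    # Hypothetical data set: 25 cases, 25 controls, summary effect d = 0.4
    summary_effect = 0.4
    optimistic = power_two_group(summary_effect, 25, 25)         # true = observed
    pessimistic = power_two_group(summary_effect / 2.0, 25, 25)  # true = half observed
    print(f"optimistic scenario power:  {optimistic:.3f}")
    print(f"pessimistic scenario power: {pessimistic:.3f}")
```

Halving the assumed effect lowers the power of every data set, so the expected number of positive data sets E is always smaller under the pessimistic scenario.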
All effect estimates were expressed as standardized mean differences for the 2 compared groups, with the metrics chosen by the authors of each original meta-analysis. Effect size computation in each study in each meta-analysis takes into account the mean volume and the variance of the volume in cases and controls, and variances can be different in cases and controls. Summary effects in meta-analyses are based on random-effects calculations. When the standardized mean difference and variance thereof were not given, they were calculated from the provided sample size n, mean values m, and standard deviations SD of the absolute measurements per each group (1 and 2) using standard formulas for the standardized mean difference and its variance.24
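One standard formulation is Cohen's d with a pooled standard deviation. It is given here as an assumption, because the formulas themselves do not appear in this text and the cited source may use a different variant (eg, Hedges' small-sample correction):

```latex
\[
d = \frac{m_1 - m_2}{SD_{\mathrm{pooled}}},
\qquad
SD_{\mathrm{pooled}} = \sqrt{\frac{(n_1 - 1)\,SD_1^2 + (n_2 - 1)\,SD_2^2}{n_1 + n_2 - 2}}
\]
\[
\operatorname{Var}(d) \approx \frac{n_1 + n_2}{n_1 n_2} + \frac{d^2}{2\,(n_1 + n_2)}
\]
```

Here the subscripts 1 and 2 denote the 2 compared groups, and the variance expression is the usual large-sample approximation.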
For each meta-analysis, the P value of the χ2-based Q test and the I2 metric of inconsistency were recorded. The Cochran Q test25 is obtained by the weighted sum of the squared differences of the observed effect in each study minus the fixed summary effect. The I2 metric26 is the ratio of the between-study variance over the sum of the within- and between-study variance. When the I2 was not given, it was calculated from the formula I2 = (Q − k + 1)/Q, where k is the number of studies.27 The Q test is considered significant here at P < .05, but it should be interpreted with caution owing to the possibility of both false positives and false negatives.28,29 For I2, as a rough guide, values exceeding 50% are considered as large heterogeneity beyond chance, and values exceeding 75% are considered very large heterogeneity.27 However, with a limited number of study data sets, the uncertainty in the estimates of I2 can be substantial, and thus these inferences should be made with great caution.30
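The I2 formula above translates directly into code. In this minimal sketch, the function name `i_squared` is hypothetical, and truncating negative values to 0 (which arises when Q is smaller than its degrees of freedom, k − 1) is a common convention assumed here rather than something stated in the text.

```python
def i_squared(q, k):
    """Return I2 = (Q - k + 1) / Q as a proportion between 0 and 1.

    q: Cochran Q statistic; k: number of studies in the meta-analysis.
    Negative values are truncated to 0 (assumed convention).
    """
    if q <= 0:
        return 0.0
    return max(0.0, (q - k + 1) / q)


if __name__ == "__main__":
    # Hypothetical meta-analysis with 11 studies and Q = 20
    print(f"I2 = {100 * i_squared(20.0, 11):.0f}%")
```

With 11 studies, Q has 10 degrees of freedom, so Q = 20 means that half of the total variability is attributed to between-study heterogeneity.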
The PS: Power and Sample Size Calculation program31 was used to estimate the power of each study. Excess significance is claimed at P < .05, and results are also presented with Bonferroni correction for the number of examined conditions and brain structures.
The search yielded 41 items, of which 22 were excluded after perusing the title and abstract. Of the remaining 19 articles, 3 were voxel-based meta-analyses, and 5 did not provide sufficient details of results per data set. Of the 11 articles with detailed results per data set, 3 addressed conditions and brain areas that had been examined in more recent meta-analyses and were thus excluded to avoid data overlap. Therefore, data from 8 articles were eligible for the analysis, including data on meta-analyses of brain volume abnormalities in major depressive disorder,1,2 bipolar disorder,3 obsessive-compulsive disorder,4 posttraumatic stress disorder,5 autism,6 first episode of schizophrenia,7 and relatives of patients with schizophrenia.8
Table 1 summarizes the evaluated data from the 8 eligible meta-analysis publications.1-8 All meta-analyses include only magnetic resonance imaging studies, except for that of Kempton et al,3 who also allowed the inclusion of computed tomographic (CT) scan studies. A total of 461 data sets had been included in 41 meta-analyses on brain volumes for 7 different conditions. All studies included in the relevant meta-analyses had been published in peer-reviewed journals, except for 1 study (2 data sets included in the calculations) that had been published as an abstract. There were 6 to 31 data sets per meta-analysis (median, 10 data sets). These were typically small data sets, and cumulatively no meta-analysis had a sample size (cases and controls combined) exceeding 1000, with the exception of the meta-analyses of hippocampus volume in major depressive disorder and in relatives of patients with schizophrenia. In 14 meta-analyses, brain volumes were nominally statistically significantly larger among cases; in 7 meta-analyses, they were nominally statistically significantly larger among controls; and in 20 meta-analyses, there were no significant differences between the 2 groups. Only 5 effect sizes had an absolute magnitude exceeding 0.50 (anterior cingulate cortex in major depressive disorder as well as left hippocampus, right hippocampus, left lateral ventricle, and third ventricle in first-episode schizophrenia). Of the 41 effects, none have large point estimates (>0.8 in absolute magnitude), and the 95% confidence intervals exclude large effects for 40 of 41 meta-analyses. The 95% confidence intervals also exclude moderate effects (>0.5 in absolute magnitude) in 27 of the 41 meta-analyses. There was nominally statistically significant heterogeneity in 24 of 41 meta-analyses. I2 values exceeding 50% were noted in 19 of the 41 meta-analyses, and 5 of those had values exceeding 75%.
Table 1 also shows the observed and expected number of positive study data sets in each meta-analysis, assuming the plausible effect to be the summary effect of the meta-analysis or half of this effect. For most meta-analyses (29/41), the observed is larger than the expected, and in only 10 meta-analyses is the observed smaller than the expected (2 meta-analyses had an equal observed and expected number of positive study data sets), even if we assume that the summary effect seen in the meta-analysis is the most plausible estimate of the true effect. If the plausible effect is assumed to be half of what is seen in each meta-analysis, then O is larger than E for 36 meta-analyses, whereas the opposite occurs only in 5 meta-analyses with few studies, all of which have O = 0 and E = 0.3 to 0.6.
If we assume that the summary effect seen in the meta-analysis is the most plausible estimate of the true effect, then 16 of the 41 meta-analyses show evidence (P < .05) for an excess O over E and 7 of them show evidence for an excess O over E, even after correcting for the number of tested conditions and brain structures. If the plausible effect is assumed to be half of what is seen in each meta-analysis, then the respective numbers of meta-analyses are 27 and 14.
Meta-analyses that have larger estimates of heterogeneity (as expressed by I2) tend to also have large differences between O and E (Spearman correlation coefficient ρ = 0.63 when plausible effects are considered to be those observed in the summary of the meta-analyses [Figure] and ρ = 0.53 when plausible effects are considered to be half of those observed).
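For readers unfamiliar with the Spearman coefficient, the sketch below shows how such a rank correlation between I2 and the O − E difference could be computed. The helper names `average_ranks` and `spearman_rho` and the numbers in the usage example are hypothetical; they are not the data in the Figure.

```python
def average_ranks(values):
    """Return 1-based ranks, assigning tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole group of tied values
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for idx in order[i:j + 1]:
            ranks[idx] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman rho: the Pearson correlation of the ranks of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


if __name__ == "__main__":
    # Hypothetical values for 6 meta-analyses
    i2_percent = [0, 20, 35, 55, 60, 80]
    o_minus_e = [0.1, -0.4, 1.2, 0.9, 3.0, 4.5]
    print(f"rho = {spearman_rho(i2_percent, o_minus_e):.2f}")
```

Because the coefficient depends only on ranks, it is not driven by the few meta-analyses with very large O − E differences.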
Table 2 shows the composite data from all meta-analyses for each mental health condition. The observed number of positive study data sets is always larger than the expected, regardless of the assumptions of what the plausible effect should be in each meta-analysis. The difference between E and O is beyond chance (P < .05) for 5 of the 7 conditions when the plausible effect is assumed to be the same as the summary effect in each meta-analysis (all except first-episode schizophrenia and relatives of patients with schizophrenia) and for all 7 conditions when the plausible effect is assumed to be half of that magnitude. With Bonferroni correction, the difference between E and O is statistically significant for 4 and 5 of the 7 conditions, respectively.
Table 2 also groups meta-analyses per brain structure. Of the 15 brain structures evaluated, 8 showed a difference between E and O beyond chance (P < .05), when the plausible effect is assumed to be the same as the summary effect in each meta-analysis, and this increases to 12 when the plausible effect is assumed to be half of that magnitude. With Bonferroni correction, the difference between E and O remains statistically significant for 5 and 9 of the 15 brain structures, respectively. The largest fold deviation of O from E was seen for amygdala under either assumption.
When data are combined from all 41 meta-analyses, there are 142 observed data sets with nominally statistically significant results among the 461 (31%), whereas the expected number would be 78.5 and 37.1 under the 2 effect assumptions, respectively (P < .001 for comparison with the observed for both analyses).
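As a rough check on these pooled figures, the χ2 comparison can be worked through by hand. The calculation below assumes the 1-degree-of-freedom form of the statistic, with n = 461 data sets and O = 142; it is an illustration and may not match the original computation exactly.

```latex
\[
\chi^2 = \frac{(O - E)^2}{E} + \frac{(O - E)^2}{n - E}
\]
\[
E = 78.5:\quad
\frac{63.5^2}{78.5} + \frac{63.5^2}{382.5} \approx 51.37 + 10.54 = 61.91
\]
\[
E = 37.1:\quad
\frac{104.9^2}{37.1} + \frac{104.9^2}{423.9} \approx 296.60 + 25.96 = 322.56
\]
```

Both values far exceed 10.83, the critical value for P = .001 with 1 degree of freedom, which is consistent with the reported P < .001 under either effect assumption.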
This evaluation of 461 data sets in 41 meta-analyses of brain volumes in diverse conditions shows that, in the literature, the number of positive results is far too large to be true. Even if the effect sizes observed in the meta-analyses are accurate, the number of positive results (n = 142) is almost double what would have been expected based on power calculations for the included samples. If the true effect sizes are only half of those observed in the meta-analyses, then the number of positive results is about 4 times the expected number thereof. Bias may be present in meta-analyses of all 7 examined conditions and in most of the examined brain structures. Such bias threatens the validity of the overall literature on brain volume abnormalities.
The excess significance may be due to unpublished negative results, or it may be due to negative results having been turned into positive results through selective exploratory analyses. If all the excess significance is due to negative results not being published, then this means that only slightly more than 1 in 2 or 1 in 4 negative results have been published, depending on what plausible effect size is assumed. This would correspond to approximately 600 to 1200 unpublished negative results, besides the 319 that have been published. Conversely, the excess significance may be due to negative results becoming positive: given that the expected positive results are 79 or 37, with the 2 analyses, then one can estimate that a negative-to-positive conversion of 64 or 105 results, respectively, among the 142 observed positive ones would suffice to cause this excess of significance. Possibly both mechanisms contribute.
First, bias against the publication of negative results, the traditional form of publication bias,17 may exist. Some of the prior meta-analyses tried to investigate small-study effects (whether small studies give more prominent results than larger studies), which may signal publication bias. However, this association is nonspecific,20 and most studies on brain volumes are small anyhow, so differentiating between small and large makes little sense. Moreover, the typical investigation of brain volumes is likely to measure by default the volume of multiple brain structures. Because of multiple comparisons, most investigations may have at least 1 positive result to report, even if this is only a chance finding due to an inflated type I error. This suggests that bias is more likely to occur at the level of outcome reporting (ie, with only a subset of the brain regions, among the many evaluated, being reported in the published article, rather than the whole study remaining unpublished). The most suggestive evidence for this type of outcome reporting bias in the literature comes from the mere juxtaposition of the availability of information for different brain regions in studies addressing the same condition. Some brain regions have data reported from far more studies than others. For example, although there are 31 reported results on hippocampus volume in studies of major depression, only 6 of them report on putamen volumes, and only 7 report on orbitofrontal cortex or prefrontal cortex volumes.1 To some extent, this difference may reflect the fact that investigators genuinely focused only on the hippocampus in some studies or that interest in hippocampus abnormalities preceded interest in the study of other volumes. However, it is possible that many studies did measure comprehensively all or many of the major brain areas and reported selectively on a few, with reporting guided in part by the significance of the results.
Second, one suspects that some analyses that were negative have been presented as positive. This type of bias has been best documented in randomized trials (eg, trials of antidepressants).31 As noted earlier, selective analysis reporting bias can have a very influential effect on the results: one must convert relatively few studies from negative to positive to achieve the same bias as if 10 times more negative studies were entirely suppressed. Selective analysis reporting bias emerges when there are many analyses that can be performed and only one of them, the one with the “best” results, is presented.23,32 It is also facilitated when there are many steps in the analysis process that are subject to choices and measurements that can be biased. Some biases may also be facilitated when the assessors are not blinded33 to whether each brain scan is coming from a case with the condition of interest or a control participant. Information about such quality safeguards is often not reported in the literature on brain volume abnormalities.
Although brain volume measurements are sophisticated, there is room for error. Magnetic resonance imaging measurements have average errors in the range of ±1.5%,34,35 whereas changes of 5% may be introduced by scanner hardware or software.34 Recognized sources of possible error include voxel misclassification during brain segmentation,36,37 the partial volume problem (when volume is less than a voxel),36,38,39 inconspicuousness of tissue edges, and head tilt. Nonsystematic errors will tend to dilute the observed effect sizes, if they are nondifferential. However, when errors are differential (eg, measurements are performed by observers who may favor a certain direction of the results), this can lead to inflated effects and spurious statistically significant associations. Moreover, nonsystematic mistakes may also occur during the analytic process.40 Most studies that we assessed used magnetic resonance imaging. However, at least 1 meta-analysis3 also included data on CT scans, and brain volume differences tended to be larger with CT measurements; this may be a manifestation of higher error rates in CT studies.3 Finally, some structures are often possible to measure using various anatomical definitions.2 Bias would be introduced if one assesses many different definitions and reports only the ones with the most significant results.
Brain volume differences may be confounded by drug treatment, inpatient vs outpatient status, differences in age and sex between the compared groups, severity of disease, and even disease definition per se. It is very difficult to control efficiently for all of these parameters. Selective analysis reporting may be introduced if investigators perform different analyses to adjust for various sets of confounders or exclude participants based on confounder values and then selectively report analyses based on the statistical significance of the results. There are many other aspects in the design, conduct, methods, analysis, and population characteristics of imaging studies that may affect the accuracy and reliability of the results. Case-control studies frequently report biased results because the cases are not representative of those affected in a population and/or because the controls are not representative of those not affected in the same population. It would be useful for experts in the field to adopt more unbiased designs, such as 2-stage sampling techniques.
The current evaluation did not consider voxel-based morphometry studies for which meta-analyses have started to appear recently.41-43 Meta-analyses of such studies aim to reveal differences in gray matter density at specific brain coordinates rather than differences in volumes of prespecified regions of interest. These are whole-brain methods, and thus, in theory, they may avoid the selective reporting of selected regions of interest. However, the technique of activation likelihood estimation that is used for meta-analysis of voxel-based morphometry41 has the disadvantage that it can use only data from studies that have significant differences between cases and controls. This strengthens the potential for significance chasing bias. A more recent method, signed differential mapping,43 allows for the consideration of null findings and mitigates the excessive influence of single-study data sets. However, even if signed differential mapping allows for the incorporation of null findings, this will happen only if null findings are published so as to be available for meta-analysis.
Some limitations should be acknowledged in my study. First, several of the meta-analyses had substantial between-study heterogeneity, and the difference between O and E was larger in those meta-analyses with larger I2 estimates. Heterogeneity may be a manifestation of bias affecting differentially the constituent data sets, but it may also reflect genuine differences across studies. It is possible that some of these effects are genuinely heterogeneous. However, with genuine heterogeneity, one would not necessarily expect that single-study data sets would pass a “desired” threshold of significance (P < .05) and yield so many statistically significant results. Empirical evidence from other fields where many associations are evaluated (eg, candidate gene associations) suggests that heterogeneity is often a marker of bias.22 A better understanding of these genuine sources of diversity would require that we accumulate more unbiased data in the literature.
Second, even though the overall analysis suggests the presence of considerable bias, one cannot assume that all meta-analyses are equally affected. Probably the most useful application of the excess significance test is to give an overall impression about the average level of bias afflicting the field of brain volume abnormalities. The test can also be used to interpret separately the results of single meta-analyses, but here the interpretation should be more cautious. As shown in my results, some meta-analyses show strong evidence that they are affected by excess significance bias; some others seem spared of this bias, and their results can be considered to be reliable in this regard; and still others are difficult to interpret, mostly because of limited evidence. A negative test for excess significance in a single meta-analysis, especially one with few studies, does not exclude the potential for bias.
Third, the evaluation relied on effect sizes that had already been estimated in published meta-analyses because it would have been very inconvenient to perform 41 meta-analyses again from scratch. The data were obtained from meta-analyses published in high-profile peer-reviewed journals, but it is possible that some mistakes in the data may have been made, even in the best meta-analyses. However, there is no reason to believe that these would favor the presence or absence of excess significance bias. No effort was made to update these 41 meta-analyses with studies that appeared after the publication of each of the included meta-analyses. However, there is no reason to believe that these most recent studies would be much different in terms of susceptibility to bias. Moreover, the evaluation focused on meta-analyses published recently (ie, in the last 4 years).
Fourth, the exact estimation of excess significance can be influenced by the choice of plausible effect size, the potential miscalculation of the P values in the original data sets, and/or the miscalculation of power. This is why I have examined the influence of different effect sizes on the difference between O and E. Miscalculation of P values would require access to the raw data of each data set. For example, some of the excess significance may be in part due to P values in single data sets being reported as nominally significant owing to inappropriate assumptions (eg, equal variances). Power estimates with different assumptions (such as deviation from normality) and different software may diverge, but they are unlikely to change the big picture that almost all of these studies are small and largely underpowered and that the E is substantially smaller than the O.
In conclusion, the literature on brain volume differences is probably subject to considerable bias. This does not mean that none of the observed associations in the literature are true. It should be acknowledged that some meta-analyses may be more affected by bias than others and that some may be totally unbiased. However, the average level of bias is probably large, and steps should be taken to remedy the situation. Such steps could include, besides the use of newer technologies, the adoption of large multicenter studies, the standardization of definitions, outcomes, and analyses, and the registration of prespecified protocols for these studies. Large multicenter studies should be feasible, and it would be natural for such studies to use commonly agreed outcomes, definitions, and analyses in prespecified protocols that would be widely visible to all participating investigators and beyond. For most of the examined brain structures, definitions should be consistent, and this applies also to their subfields, which may yield additional insights if properly assessed.44 Significance testing should not be used as a criterion for publication,45,46 and journal editors can emphasize the need to make the full data (correlation matrices) and protocols available, as in other research fields.47-49 After more than 25 years of research in this field, further progress requires stronger guarantees of reliability for the ensuing results.
Correspondence: John P. A. Ioannidis, MD, DSc, Stanford Prevention Research Center, Stanford University School of Medicine, MSOB X306, 251 Campus Dr, Stanford, CA 94305 (firstname.lastname@example.org).
Submitted for Publication: December 12, 2010; final revision received January 22, 2011; accepted January 28, 2011.
Published Online: April 4, 2011. doi:10.1001/archgenpsychiatry.2011.28
Author Contributions: Dr Ioannidis had full access to all the data in the study and takes responsibility for the integrity of the data and the accuracy of the data analysis.
Financial Disclosure: None reported.