In a recent analysis of thousands of randomized controlled trials (RCTs) in eight journals, a simple method was offered that might help skeptical scientists identify data fabrication. John B. Carlisle of Torbay Hospital, UK, an editor of the journal Anaesthesia, looked at baseline differences of means in more than 5000 randomized controlled trials, mainly in the field of anesthesiology, but also more than 500 published in JAMA and more than 900 published in the New England Journal of Medicine. His study went online earlier this week. The analyzed articles were published between 2000 and 2015. In brief, if randomization was successful, baseline differences should be small. Giving p-values for baseline differences (in order to indicate successful randomization) is actually discouraged since they are not really interpretable, but Carlisle calculated them anyway. If the null hypothesis is true, p-values have a uniform distribution, so every value between 0 and 1 is equally likely.
Carlisle assessed the p-values of the differences in baseline parameters between the groups in each trial. Nick Brown took the trouble to work through some examples on his blog:
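The uniformity claim can be checked with a small simulation. The sketch below is only an illustration and not Carlisle's procedure: it assumes normally distributed baseline values (mean 50, SD 10), 100 patients per arm, and a large-sample z-approximation in place of a t-test. All names and parameters are hypothetical.

```python
import random
from statistics import NormalDist, mean, stdev

def baseline_p(n, rng):
    """Two-sided p-value for the difference in means of two arms
    drawn from the SAME population (so the null hypothesis is true)."""
    arm_a = [rng.gauss(50, 10) for _ in range(n)]
    arm_b = [rng.gauss(50, 10) for _ in range(n)]
    # Standard error of the difference in means
    se = (stdev(arm_a) ** 2 / n + stdev(arm_b) ** 2 / n) ** 0.5
    z = (mean(arm_a) - mean(arm_b)) / se
    # Normal approximation; reasonable for n = 100 per arm
    return 2 * (1 - NormalDist().cdf(abs(z)))

rng = random.Random(2017)  # fixed seed so the run is repeatable
p_values = [baseline_p(100, rng) for _ in range(2000)]

# Count how many p-values fall into each tenth of the interval [0, 1]
counts = [0] * 10
for p in p_values:
    counts[min(int(p * 10), 9)] += 1
print(counts)  # each bin should hold roughly 200 of the 2000 p-values
```

With properly randomized groups, no tenth of the interval is favoured; a pile-up near 0 or near 1 is what would look suspicious.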
Carlisle’s idea is that, if the results have been fabricated (for example, in an extreme case, if the entire RCT never actually took place), then the fakers probably didn’t pay too much attention to the p values of the baseline comparisons. After all, the main reason for presenting these statistics in the article is to show the reader that your randomisation worked and that there were no differences between the groups on any obvious confounders. So most people will just look at, say, the ages of the participants, and see that in the experimental condition the mean was 43.31 with an SD of 8.71, and in the control condition it was 42.74 with an SD of 8.52, and think “that looks pretty much the same”. With 100 people in each condition, the p value for this difference is about .64, but we don’t normally worry about that very much; indeed, as noted above, many authors wouldn’t even provide a p value here.
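The p-value of about .64 in the age example can be approximately reproduced from the summary statistics alone. This is a minimal sketch that assumes a two-sided test and uses a normal approximation instead of the t distribution (with 100 people per group the difference is negligible); the function name is made up for illustration.

```python
from math import sqrt
from statistics import NormalDist

def p_from_summary(mean1, sd1, n1, mean2, sd2, n2):
    """Approximate two-sided p-value for a difference in means,
    computed only from each group's mean, SD and size."""
    se_diff = sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)
    z = (mean1 - mean2) / se_diff
    # Normal approximation to the t-test (an assumption of this sketch)
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Ages from the example: 43.31 (SD 8.71) vs 42.74 (SD 8.52), 100 per group
p_age = p_from_summary(43.31, 8.71, 100, 42.74, 8.52, 100)
print(round(p_age, 2))  # 0.64
```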
Now consider what happens when you have ten baseline statistics, all of them fabricated. People are not very good at making up random numbers, and the fakers here probably won’t even realise that as well as inventing means and SDs, they are also making up p values that ought to be randomly distributed. So it is quite possible that they will make up mean/SD combinations that imply differences between groups that are either collectively too small (giving large p values) or too large (giving small p values).
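What "collectively too large" means can be made concrete by combining a trial's baseline p-values into a single test. The sketch below assumes Stouffer's Z method, which is one standard way to combine p-values; it should not be read as a description of Carlisle's exact procedure. The five input values are hypothetical, chosen so that most lie above 0.5.

```python
from math import sqrt
from statistics import NormalDist

def stouffer_upper(p_values):
    """Combine p-values with Stouffer's Z method and return the
    one-sided probability of a set this large (or larger) under uniformity.
    Each p-value must lie strictly between 0 and 1."""
    std_normal = NormalDist()
    # Map each p-value to a z-score; p > 0.5 gives a positive z
    z_scores = [std_normal.inv_cdf(p) for p in p_values]
    z_combined = sum(z_scores) / sqrt(len(z_scores))
    return 1 - std_normal.cdf(z_combined)

hypothetical_p = [0.95, 0.84, 0.43, 0.75, 0.92]
combined = stouffer_upper(hypothetical_p)
print(f"{combined:.3f}")  # about 0.021
```

None of the five values is remarkable on its own, yet taken together they are larger than chance would usually produce.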
For example, a hypothetical vector of baseline p-values of 0.95, 0.84, 0.43, 0.75, and 0.92 (most of them greater than 0.5) would yield a p-value of about 0.02 for the test that these p-values arose by chance. Carlisle (2017) used a much stricter threshold of p < 0.00001 and identified a number of suspicious RCTs. It turned out that one of them (NEJM 2007; 356: 911) is in periodontology. In the respective spreadsheet in Appendix S1 of Carlisle’s article, the study can be identified in row 6. A one-sided p-value of 4.11 × 10⁻¹³ is given for 20 baseline variables. Carlisle (2017) offers a possible explanation in his discussion. Authors or editors may have labelled some standard deviations as standard errors, although “a single solution does not explain extreme p values in [that particular] paper.”
For instance, NEJM 2007; 356: 911 reported mean (SE) tissue plasminogen activator concentrations of 4.5 (0.6) and 3.2 (0.4) in groups of 59 and 61, respectively: conversion of the SE to SD (4.6 and 3.1) resulted in a p value of 0.92, which is not particularly near 1 (or 0). However, conversion of the ‘SE’ for the 19 other variables resulted in p values averaging 0.02 and a composite trial p value of data corruption if one posited that the SD of 19 variables were incorrectly labelled SE, whereas the SE for tissue plasminogen activator concentration were correct.
In Table 1 (baseline characteristics of patients) of NEJM 2007; 356: 911, the authors state that the values are means ± SE, but as Carlisle (2017) suspects, that was probably true only for the concentration of tissue plasminogen activator. For the other variables, standard deviations would make more sense.
The paper has been cited 929 times (according to Google Scholar) and is one of the cornerstones of the so-called perio-systemic link. Based on their results, the authors concluded:
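The SE-to-SD conversion mentioned in the quoted passage follows from SE = SD / √n. A minimal sketch using the tissue plasminogen activator figures quoted above (the helper name is hypothetical):

```python
from math import sqrt

def se_to_sd(se, n):
    """Convert a standard error of the mean back to a standard deviation."""
    return se * sqrt(n)

# Reported mean (SE): 4.5 (0.6) with n = 59 and 3.2 (0.4) with n = 61
sd_group_1 = se_to_sd(0.6, 59)
sd_group_2 = se_to_sd(0.4, 61)
print(round(sd_group_1, 1), round(sd_group_2, 1))  # 4.6 3.1
```

With about 60 patients per group the conversion multiplies the reported value by roughly 8. If a column labelled SE in fact already contains SDs, converting it again would inflate the apparent spread by that factor and make the group means look far more similar than chance allows.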
Intensive periodontal treatment resulted in acute, short-term systemic inflammation and endothelial dysfunction. However, 6 months after therapy, the benefits in oral health were associated with improvement in endothelial function.
If it turns out that Carlisle’s suspicion is right, one might expect at least a correction.
 Carlisle JB. Data fabrication and other reasons for non-random sampling in 5087 randomised, controlled trials in anaesthetic and general medical journals. Anaesthesia 2017; doi:10.1111/anae.13938.
 Tonetti MS, D’Aiuto FD, Nibali L, Donald A, Storry C et al. Treatment of periodontitis and endothelial function, N Engl J Med 2007; 356: 911-920.
11 June 2017 @ 5:57pm.
Last modified June 11, 2017.