# Superiority, Equivalence, Non-inferiority, Sample Size Calculation

A recent paper [1] addressed, among other things, the following question: Is a 22% alcoholic solution used as a mouthwash inferior to Listerine in terms of plaque and gingivitis? The authors’ answer was a determined “No!” But a look at the data reveals that this might be overhasty.

Classic Listerine is an alcoholic (about 22-27%) solution of a mixture of essential oils (thymol, menthol, methyl salicylate and eucalyptol) which is used as an antiseptic mouthwash. Systematic reviews of randomized clinical trials lasting at least 6 months (Stoeken et al. 2007 [2], Gunsolley 2010 [3]) have shown that it has strong anti-plaque and anti-gingivitis efficacy. A meta-analysis of studies of at least 4 weeks’ duration indicated that Listerine mouthwash yielded slightly inferior plaque control compared with chlorhexidine mouthwash at 0.1 or 0.2%, whereas the difference in reduction of gingival inflammation was not significant (van Leeuwen et al. 2011 [4]).

As a novelty among the numerous Listerine studies, the authors compared, in a modified experimental gingivitis trial, Listerine mouthwash with a 22% hydro-alcohol solution (the negative control) and a 0.2% chlorhexidine solution (the positive control). The upper right quadrant of the dentition (protected by an individual plastic tooth guard from mechanical plaque removal during toothbrushing) received mouthwash only, whereas the upper left quadrant was subjected to both toothbrushing and mouthwash. The experiment lasted for 21 days. Each group comprised 15 dental hygienists or medical or dental students [5].

It cannot be assumed that the sample size was calculated before the study was initiated. Suppose one wants to show that a difference of 0.4 in plaque index scores between the chlorhexidine and the Listerine groups at the study’s termination after 21 days is significant, and assumes a homogeneous standard deviation of 0.4 scores in both groups (the authors mention these figures in their paper). Then the null hypothesis of no difference can be rejected with a type I error alpha of 0.05 and a type II error beta of 0.2 (two-sided test) when 16 subjects are enrolled in each arm [6].

Since the authors stress that Listerine mouthwash should be tested against an alcohol vehicle as negative control (which has never been done before), another hypothesis is whether 22% hydro-alcohol as mouthwash would be non-inferior to Listerine mouthwash. Here, the null hypothesis is not “the two interventions yield the same effects.” Instead, one wants to reject the hypothesis that the negative control intervention (alcohol vehicle mouthwash) is “inferior” to the test intervention (Listerine mouthwash). As Schumi and Wittes (2011) provocatively note, “[t]rials to show superiority generally penalize the sloppy investigator […]. By contrast, non-inferiority trials tend to reward the careless. The less rigorously conducted the trial, the easier it can be to show non-inferiority.” [7]

In the picture below (from Schumi and Wittes 2011), the role of a specified, clinically relevant margin, delta, in superiority, equivalence and non-inferiority trials is illustrated. It is clear that “non-significance” of any differences in plaque index between Listerine and the alcoholic vehicle, as emphasized by the authors, does not mean that both mouthwashes performed in a similar way. In fact, one might ask whether the 22% alcoholic solution was inferior, in particular when noticing that the overall difference was about 0.1 after 1 week, 0.2 after 2 weeks, and more than 0.3 after 3 weeks. In order to reject the (null) hypothesis of inferiority, one has to calculate a different minimum sample size.

Given a reasonable margin delta of -0.2, again a standard deviation of the outcome of 0.4, a one-sided type I error alpha of 0.05 and a power (1-beta) of 0.8, each group has to consist of 50 participants, see here. One might then conclude:

“If there is truly no difference between the standard and experimental treatment, then 100 patients are required to be 80% sure that the lower limit of a one-sided 95% confidence interval (or equivalently a 90% two-sided confidence interval) will be above the non-inferiority limit of -0.2.”
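The figure of 50 per group can be checked with the usual normal-approximation formula. What follows is only a minimal sketch and not necessarily the method used by the online calculator: the function name `n_per_group_noninferiority` and its parameters are hypothetical, and the formula assumes equal group sizes, a common and known standard deviation, and (by default) a true difference of zero.

```python
from math import ceil
from statistics import NormalDist


def n_per_group_noninferiority(margin, sd, alpha=0.05, power=0.80, true_diff=0.0):
    """Approximate per-arm sample size for a non-inferiority test of two means.

    Assumptions (not taken from the paper): normal approximation,
    equal group sizes, common known SD, one-sided alpha.
    """
    z = NormalDist().inv_cdf
    z_alpha = z(1 - alpha)   # one-sided type I error
    z_beta = z(power)        # power = 1 - beta
    # Distance between the assumed true difference and the margin
    distance = true_diff - margin
    return ceil(2 * (sd / distance) ** 2 * (z_alpha + z_beta) ** 2)


n = n_per_group_noninferiority(margin=-0.2, sd=0.4)
print(n, "per group,", 2 * n, "in total")
```

With the inputs from the text (margin -0.2, standard deviation 0.4) this sketch yields 50 per group, or 100 in total. Calculators that use the t distribution rather than the normal approximation may return a slightly larger number.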

Since the authors claim equivalence, the respective minimum sample size might be calculated as well, see here, which yields an even higher minimum of 69 subjects in each group. So, if there is truly no difference between the standard and experimental treatment, then 138 patients are required to be 80% sure that the limits of a two-sided 90% confidence interval will exclude a difference in means of more than 0.2 [8].
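The equivalence figure can be checked in the same way. The sketch below assumes two one-sided tests at alpha 0.05 each (matching the two-sided 90% confidence interval), a true difference of zero, equal group sizes and a common known standard deviation. Under these assumptions the type II error is split between the two margins, which is why the result exceeds the non-inferiority number. The function name `n_per_group_equivalence` is hypothetical.

```python
from math import ceil
from statistics import NormalDist


def n_per_group_equivalence(margin, sd, alpha=0.05, power=0.80):
    """Approximate per-arm sample size for an equivalence test of two means.

    Assumptions (not taken from the paper): normal approximation,
    true difference of zero, symmetric margins of +/- margin,
    equal group sizes, common known SD.
    """
    z = NormalDist().inv_cdf
    z_alpha = z(1 - alpha)            # each one-sided test at level alpha
    z_beta = z(1 - (1 - power) / 2)   # beta is shared by the two margins
    return ceil(2 * (sd / abs(margin)) ** 2 * (z_alpha + z_beta) ** 2)


n = n_per_group_equivalence(margin=0.2, sd=0.4)
print(n, "per group,", 2 * n, "in total")
```

For a margin of 0.2 and a standard deviation of 0.4 this gives 69 per group, or 138 in total.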

**Notes**

[1] Preus HR, Koldsland OC, Aass AM, Sandvik L, Hansen BF. The plaque- and gingivitis-inhibiting capacity of a commercially available essential oil product. A parallel, split-mouth, single blind randomized, placebo-controlled clinical study. *Acta Odontol Scand* 2013; **71**: 1613-1619.

[2] Stoeken JE, Paraskevas S, van der Weijden GA. The long-term effect of a mouthrinse containing essential oils on dental plaque and gingivitis: a systematic review. *J Periodontol* 2007; **78**: 1218-1228.

[3] Gunsolley JC. Clinical efficacy of antimicrobial mouthrinses. *J Dent* 2010; **38**(Suppl1): S6-10.

[4] Van Leeuwen MP, Slot DE, Van der Weijden GA. Essential oils compared to chlorhexidine with respect to plaque and parameters of gingival inflammation: a systematic review. *J Periodontol* 2011; **82**: 174-194.

[5] Since RCTs are intended to allow for inferences as to the “real world”, the choice of participants who are highly aware of oral hygiene is somewhat unfortunate, as is the modified experimental gingivitis model. It won’t come as a surprise that plaque levels in the second quadrant, where toothbrushing and rinsing with experimental and control mouthwash were allowed, were uniformly low. On the other hand, the gingival bleeding index did differ, at least after three weeks. In the chlorhexidine mouthwash group the mean gingival bleeding index was 0.15 with a standard deviation of 0.17, while in the alcohol mouthwash group it was 0.56 (0.50). When calculating 95% confidence intervals, one gets 0.09-0.25 for the former and 0.24-0.63 for the latter. The overlap of confidence intervals is minimal, suggesting in fact somewhat more gingival inflammation in the alcohol mouthwash group than in the chlorhexidine mouthwash group when toothbrushing is allowed.

[6] The minimum sample size under certain assumptions may easily be calculated online, see, for example, here. So, basically, the study seems to be underpowered even for the comparison between chlorhexidine and Listerine mouthwashes. Since the differences in the means were somewhat higher than 0.4 scores and/or the standard deviations a bit lower than 0.4 scores, the authors could show that the differences were statistically significant with *p*<0.05.
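For readers who prefer to check the number offline, the superiority calculation can be sketched with the standard normal-approximation formula. This is an illustration only and not the online calculator's own code: the function name `n_per_group_superiority` is hypothetical, and equal group sizes and a common known standard deviation are assumed.

```python
from math import ceil
from statistics import NormalDist


def n_per_group_superiority(diff, sd, alpha=0.05, power=0.80):
    """Approximate per-arm sample size for a two-sided comparison of two means.

    Assumptions (not taken from the paper): normal approximation,
    equal group sizes, common known SD.
    """
    z = NormalDist().inv_cdf
    z_alpha = z(1 - alpha / 2)   # two-sided type I error
    z_beta = z(power)            # power = 1 - beta
    return ceil(2 * (sd / diff) ** 2 * (z_alpha + z_beta) ** 2)


print(n_per_group_superiority(diff=0.4, sd=0.4))
```

With a difference of 0.4 and a standard deviation of 0.4 the sketch returns 16 per arm, one more than the 15 participants actually enrolled in each group.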

[7] Schumi J, Wittes JT. Through the looking glass: understanding non-inferiority. *Trials* 2011; **12**: 106.

[8] For those who want to really dive into the subject, the paper by Julious SA. Sample sizes for clinical trials with Normal data. *Stat Med* 2004; **23**: 1921-1986 is a good starting point.

12 September 2013 @ 12:33 pm.

Last modified September 3, 2015.