
One of the challenges of trying to get people to improve their statistical inferences is access to good software. After 32 years, SPSS still does not give a Cohen’s d effect size when researchers perform a t-test.

In the latest exuberant celebration of how Bayes factors will save science, van Ravenzwaaij and Ioannidis write: “our study offers through simulations yet another demonstration of the unfortunate effect of p-values on statistical inferences.” Uh oh – what have these evil p-values been up to this time?

Greenland and colleagues (Greenland et al., 2016) published a list of 25 common misinterpretations of statistical concepts such as power, confidence intervals, and, in points 1-10, p-values. Here I’ll explain how 50% of these problems are resolved by using equivalence tests in addition to null-hypothesis significance tests.

In a previous post, I compared equivalence tests to Bayes factors and pointed out several benefits of equivalence tests. But a much more logical comparison, and one I have not given enough attention to so far, is with the ROPE procedure using Bayesian estimation. I’d like to thank John Kruschke for feedback on a draft of this blog post.
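As a rough illustration of the ROPE decision rule: check whether the 95% highest density interval of the posterior falls entirely inside a region of practical equivalence. The sketch below is my own deliberately simplified version in base R, assuming a flat prior and a normal posterior (where the equal-tailed interval coincides with the HDI), not the BEST or brms workflow.

```r
# Minimal sketch of the ROPE decision rule: simplified conjugate model,
# flat prior, known-variance approximation, purely for illustration.
set.seed(42)
x <- rnorm(100, mean = 0.05, sd = 1)  # hypothetical data, true effect near zero

# Posterior of the mean under a flat prior: Normal(mean(x), sd(x)/sqrt(n))
post_mean <- mean(x)
post_sd   <- sd(x) / sqrt(length(x))

# 95% credible interval (equal-tailed; for a symmetric normal posterior
# this coincides with the HDI)
ci <- qnorm(c(0.025, 0.975), post_mean, post_sd)

rope <- c(-0.1, 0.1)  # region of practical equivalence, set by the researcher
if (ci[1] > rope[1] && ci[2] < rope[2]) {
  cat("95% HDI inside ROPE: accept the null region\n")
} else if (ci[2] < rope[1] || ci[1] > rope[2]) {
  cat("95% HDI outside ROPE: reject the null region\n")
} else {
  cat("HDI and ROPE overlap: remain undecided\n")
}
```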

In this blog, I’ll compare two ways of interpreting non-significant effects: Bayes factors and TOST equivalence tests. I’ll explain why it makes sense to report more than just a Bayes factor, and highlight some benefits of equivalence testing over Bayes factors.

The goal of collecting data is to provide evidence for or against a hypothesis. After performing a study, you can correctly conclude there is an effect, or correctly conclude there is no effect, but you can also incorrectly conclude there is an effect (a false positive, or Type 1 error, which occurs with probability alpha) or incorrectly conclude there is no effect (a false negative, or Type 2 error, which occurs with probability beta).
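A toy simulation (mine, not from the original post) makes the Type 1 error rate concrete: when the true effect is zero, a t-test at alpha = .05 will still (incorrectly) declare an effect in about 5% of studies.

```r
# Simulate 10,000 studies where the null is true (no difference between groups)
# and count how often the t-test is nevertheless significant at alpha = .05.
set.seed(1)
p <- replicate(10000, t.test(rnorm(50), rnorm(50))$p.value)
mean(p < .05)  # ~0.05: the false positive (Type 1) error rate
```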

I’m happy to announce my first R package, ‘TOSTER’, for equivalence tests (but don’t worry, there is an old-fashioned spreadsheet as well). In an earlier blog post I talked about equivalence tests. Sometimes you perform a study where you expect the effect to be zero or very small.
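To show the logic rather than the package’s own interface, here is a minimal TOST sketch in base R (not the TOSTER API), with hypothetical data and equivalence bounds in raw units: declare equivalence if the effect is significantly above the lower bound and significantly below the upper bound.

```r
# Two one-sided tests (TOST) against equivalence bounds of -0.3 and 0.3.
set.seed(123)
x <- rnorm(100, mean = 0.02, sd = 1)  # hypothetical group 1
y <- rnorm(100, mean = 0.00, sd = 1)  # hypothetical group 2
low  <- -0.3  # lower equivalence bound (smallest effect size of interest)
high <-  0.3  # upper equivalence bound

p_lower <- t.test(x, y, mu = low,  alternative = "greater")$p.value
p_upper <- t.test(x, y, mu = high, alternative = "less")$p.value

# Equivalence is declared when both one-sided tests are significant;
# the TOST p-value is the larger of the two.
max(p_lower, p_upper)
```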

One widely recommended approach to increase power is using a within-subjects design. Indeed, you need fewer participants to detect a mean difference between two conditions in a within-subjects design (in a dependent t-test) than in a between-subjects design (in an independent t-test). The reason is straightforward, but not always explained, and even less often expressed in the easy equation below.
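For two conditions with a correlation ρ between the measurements, the relation between the total sample sizes is (approximately):

N_within = N_between × (1 − ρ) / 2

A quick sanity check with base R’s power.t.test, using assumed values of my own choosing (d = 0.5, alpha = .05, power = .80, ρ = .5, so that the SD of the difference scores equals the SD of the raw scores, since sd_diff = sd × √(2(1 − ρ))):

```r
# Compare total sample sizes for between- vs within-subjects designs.
rho <- 0.5
between <- power.t.test(delta = 0.5, sd = 1, power = 0.80)                  # n per group
within  <- power.t.test(delta = 0.5, sd = 1, power = 0.80, type = "paired") # n pairs

2 * between$n                  # total N, between-subjects: ~128
within$n                       # total N, within-subjects:  ~33
2 * between$n * (1 - rho) / 2  # prediction from the equation: ~32
```

The small discrepancy between the simulated ~33 and the predicted ~32 comes from the different degrees of freedom of the two tests.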
I’m really excited to be able to announce my “Improving Your Statistical Inferences” Coursera course. It’s a free massive open online course (MOOC) consisting of 22 videos, 10 assignments, 7 weekly exams, and a final exam. All course materials are freely available, and you can start whenever you want. In this course, I try to teach all the stuff I wish I had learned when I was a student.
I think it was somewhere near the end of 2012 when my co-authors and I received an e-mail from Greg Francis pointing out that a study we published on the relationship between physical weight and importance was ‘too good to be true’. This was a stressful event. We were extremely uncertain about what this meant, but we realized it couldn’t be good. For me, it was the first article I had ever published. What did we do wrong?

You might have seen the ‘Dance of the p-values’ video by Geoff Cumming (if not, watch it here). Because p-values and the default Bayes factors (Rouder, Speckman, Sun, Morey, & Iverson, 2009) are both calculated directly from t-values and sample sizes, we might expect there is also a Dance of the Bayes factors. And indeed, there is. Bayes factors can vary widely over identical studies, just due to random variation.
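A toy version of that dance, assuming the BayesFactor package with its default two-sample JZS test: simulate identical studies (same true effect, same n) and watch the Bayes factor jump around across replications.

```r
# Simulate 20 identical studies and compute the default Bayes factor for each.
library(BayesFactor)
set.seed(2)
bfs <- replicate(20, {
  x <- rnorm(50, mean = 0.4)  # group 1, true d = 0.4
  y <- rnorm(50, mean = 0.0)  # group 2
  extractBF(ttestBF(x = x, y = y))$bf
})
round(range(bfs), 2)  # BFs for identical studies can span an order of magnitude
```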