This is an intermediate-level course. If you know what a two-sample t-test is, but you never tried a three-, four-, six- or nine-sample t-test, then this course may be for you. If you have ever concluded that a factor had no effect, just on the basis of a non-significant p-value, then this course is definitely for you. If you ever wondered how evidence in favour of the null hypothesis can be collected (whereas significance testing can only reject null hypotheses), then this course is for you.

## Preparation

I will assume you can work with the statistical software R, especially in the user interface provided by RStudio. If you have worked with SPSS instead, this is a good time to learn R. You can find many introductory R courses on the internet, including my own Methoden en Technieken course in Dutch (which is also an introductory statistics course) and Daniel Navarro's course in English. So please have both R and RStudio installed on the laptop computer that you bring to the lectures.

And here is the program:

## Monday 15 June: How to compare

Have you ever written something like "The Dutch subjects improved significantly during the training (p = 0.01), whereas the English subjects did not improve (p = 0.60)"? Then this course is for you. The example just mentioned is a simple case of the most common fallacy in published work in our field, statistical inference from comparing p-values: conference proceedings are full of it, but the fallacy also abounds in journal articles by the leaders of the field. If you don't know what is wrong with it, then today you will learn; if you know it's wrong but think you have to do it because everybody does it, then today you'll learn that not everybody does it and that you can avoid it too; if you know it's wrong but think you have to do it because otherwise your results cannot be published, then you're thinking in the same way as some of our leaders, but this week you will learn many ways to publish your results without cheating with statistical inference.
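To see why the two p-values themselves cannot be compared, consider a small sketch in Python (the course itself uses R; the improvement scores below are invented purely for illustration): each group's improvement is first tested against zero, and the groups are then compared directly.

```python
import numpy as np
from scipy import stats

# Hypothetical improvement scores per subject (made-up numbers):
dutch = np.array([1.2, 0.8, 1.5, 0.9, 1.1, 1.3, 0.7, 1.0])
english = np.array([0.9, -0.8, 1.8, 0.2, 1.4, -1.0, 2.0, -0.5])

# Within-group tests against zero improvement:
p_dutch = stats.ttest_1samp(dutch, 0).pvalue      # significant
p_english = stats.ttest_1samp(english, 0).pvalue  # not significant

# The fallacious conclusion would be: "the groups differ, because one
# p-value is below 0.05 and the other is above it."  The correct test
# compares the two groups directly:
p_diff = stats.ttest_ind(dutch, english).pvalue   # not significant
```

The within-group tests suggest a contrast ("significant" versus "not significant"), but the direct between-group test, which is the test that actually answers the question, shows no significant difference.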

For a review of how often this problem occurs in psychology, read "Erroneous analyses of interactions in neuroscience: a problem of significance" by Sander Nieuwenhuis, Birte Forstmann & Eric-Jan Wagenmakers (2011, Nature Neuroscience).

You can read the presentation of today's lecture about comparing p-values here.

## Tuesday 16 June: How to choose a sensitive design

One obstacle to obtaining good p-values is the low number of participants in many studies in linguistics. There are, however, ways to obtain better p-values by choosing a sensitive design: many of you will be familiar with repeated-measures designs, which keep a major source of variability, namely the participant, constant across many measurements. One can also obtain better p-values by choosing a sensitive analysis method. Much data in linguistics is of a discrete nature (e.g. correctness scores) rather than of a continuous nature (e.g. durations), and for discrete data the workhorse analysis method should not be a straightforward analysis of variance (e.g. a linear model on the scores), but logistic regression with participant as a random factor. After this course you will find this an easy concept.
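As a sketch of the core idea, here is a logistic regression fitted by maximum likelihood in Python (the correctness scores are made up, and the participant random factor is omitted for brevity; in R you would use lme4's glmer with family = binomial).

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical correctness scores (0/1), made up for illustration:
# 6/8 correct in the trained condition, 2/8 in the control condition.
correct = np.array([1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0], dtype=float)
condition = np.array([1] * 8 + [0] * 8, dtype=float)

def neg_log_lik(beta):
    """Negative log-likelihood of a logistic regression."""
    eta = beta[0] + beta[1] * condition          # linear predictor
    p = 1.0 / (1.0 + np.exp(-eta))               # probability of a correct answer
    return -np.sum(correct * np.log(p) + (1 - correct) * np.log(1 - p))

fit = minimize(neg_log_lik, x0=[0.0, 0.0])
slope = fit.x[1]   # the log odds ratio of being correct: trained vs. control
```

For this 2-by-2 example the fitted slope equals the sample log odds ratio, log((6/2)/(2/6)) = log 9, which is exactly the quantity a linear model on the raw scores fails to respect.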

For an account of tricks for nudging p-values below 0.05, and how to avoid them, read "False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant" by Joseph P. Simmons, Leif D. Nelson & Uri Simonsohn (2011, Psychological Science).

You can read the presentation of today's lecture about confidence intervals here.

## Wednesday 17 June: How not to pool or bin

Did you ever split up your participants into a "young" and an "old" group, using their median age as a criterion? This is just one of many ways to convert continuous data into discrete data, and it is dubious. The method does allow you to use an ANOVA with age as a binary factor, but it also raises suspicions that your binning of the age data might have been meant to improve your p-value. It is better to keep age in the model as a continuous factor, and this is no more difficult than binning.
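A sketch of the comparison in Python (the ages and scores below are invented): a regression on the continuous ages versus a t-test after a median split.

```python
import numpy as np
from scipy import stats

# Hypothetical data: scores that gradually increase with age (made up).
age = np.array([18, 22, 25, 31, 36, 40, 45, 52, 58, 63, 69, 74])
score = np.array([2.1, 2.6, 2.3, 3.0, 3.2, 2.9, 3.6, 3.8, 3.5, 4.2, 4.0, 4.5])

# Continuous analysis: linear regression of score on age.
r_cont = stats.linregress(age, score)

# Binned analysis: median split into "young" vs. "old", then a t-test.
old = age > np.median(age)
r_bin = stats.ttest_ind(score[old], score[~old])
```

Both analyses detect the trend in this example, but the continuous analysis uses all the age information instead of throwing most of it away, and accordingly yields the better p-value.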

For a proposal to perform more honest kinds of research, read "An agenda for purely confirmatory research" by Eric-Jan Wagenmakers, Ruud Wetzels, Denny Borsboom, Han van der Maas & Rogier Kievit (2012, Perspectives on Psychological Science).

You can read the presentations of today's lecture, about testing until significant and about binning, here.

## Thursday 18 June: How to update the reader’s world view

If you would like to be able to accept a null hypothesis as probably true, you cannot use p-value testing, because with p-value testing you can only accept the alternative hypothesis (if p < 0.05) or fail to reject the null hypothesis (if p > 0.05). Instead, you need a method that takes two hypotheses equally seriously and compares their likelihoods given the data. Today you'll learn the jargon of Bayesian inference, and how to apply these methods to otherwise hopeless experimental results.
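As a minimal illustration of collecting evidence in favour of a null hypothesis, here is a Bayes factor computed in Python for made-up data: 11 correct answers out of 20 trials, comparing H0 (success probability exactly 1/2) against H1 (success probability uniform on [0, 1]).

```python
from math import comb

k, n = 11, 20  # hypothetical data: 11 correct answers out of 20 trials

# Marginal likelihood of the data under each hypothesis:
m0 = comb(n, k) * 0.5 ** n   # H0: success probability is exactly 1/2
m1 = 1 / (n + 1)             # H1: uniform prior; integrating
                             # C(n,k) p^k (1-p)^(n-k) over [0,1] gives 1/(n+1)

bf01 = m0 / m1   # Bayes factor in favour of the null hypothesis
```

Here bf01 comes out at about 3.4: the data shift belief towards the null hypothesis by a factor of about 3.4, something that a p-value (which for these data lies far above 0.05) can never express.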

You can read the presentation of today's lecture about Bayesian statistics here.

## Friday 19 June: Horizons on your data

If you follow all of the above advice for your gigantic dataset, you'll often find that you end up creating a giant generalized linear mixed-effects model. You build this model overnight, only to find that the parameters "fail to converge" (in R) or come out as all zeroes (in SPSS). In such cases, your research questions can guide you toward simplification. For instance, if your research question is about the difference between two populations of speakers, you can typically collapse many cells of your data table into one value per speaker, computed in any interesting way that matches your specific research question. This technique, which subsumes, but is not limited to, "contrasts" for repeated measures, has pleasingly wide validity.
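One way such a collapse might look in Python (the speaker labels, group assignments and numbers are all invented): average the trials within each speaker, then compare the speakers across the two populations.

```python
import numpy as np
from scipy import stats

# Hypothetical raw data: several trials per speaker, two populations.
trials = {
    "s1": [0.61, 0.55, 0.58, 0.64],   # population A speakers
    "s2": [0.52, 0.49, 0.57, 0.50],
    "s3": [0.66, 0.60, 0.63, 0.59],
    "s4": [0.41, 0.38, 0.45, 0.40],   # population B speakers
    "s5": [0.36, 0.44, 0.39, 0.42],
    "s6": [0.47, 0.43, 0.46, 0.44],
}
group_a = ["s1", "s2", "s3"]
group_b = ["s4", "s5", "s6"]

# Collapse: one value per speaker (here the mean; any summary that
# matches the research question would do, e.g. a slope or a median).
a = [np.mean(trials[s]) for s in group_a]
b = [np.mean(trials[s]) for s in group_b]

# Speakers, not trials, are now the unit of analysis.
result = stats.ttest_ind(a, b)
```

The giant mixed-effects model has shrunk to a two-sample t-test on six numbers, with the speaker, rather than the individual trial, as the unit of analysis.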