Statistical Inference in Linguistics

This publication is in the INCUBATOR section.

Roland Schäfer (in preparation). Statistical Inference in Linguistics. To be submitted to Language Science Press when it’s done.

The Git repository is here (Roland Schäfer, Statistical Inference in Linguistics, Git repository), but there isn’t much going on at the moment. The document is still mostly empty.

In this book, linguists are introduced to the basics of statistical inference for corpus-based and experimental work. Instead of presenting recipe-like introductions to single methods and furthering the mechanical application of simple statistical tests, I argue for careful empirical work under a probative approach (following Deborah Mayo’s work on her concept of Severe Testing; Mayo 2018), as part of which Fisherian and Neyman-Pearson (N-P) tests can be used under appropriate circumstances and with the right (modest) interpretation. The differences between classical frequentist frameworks (Fisher and N-P), Likelihoodism, Bayesianism, and Mayo’s Probativism are discussed only briefly, mostly in order to encourage readers to dig deeper and read Mayo’s works, since I am not a philosopher of science and statistics and do not intend to pretend to be one. Standard statistical tests used in linguistics are introduced (such as Fisher, Barnard, χ², Binomial, z, t, U, ANOVA with extensions, H), including pre-experiment power calculations and thorough checking of the tests’ assumptions. Furthermore, linear models, generalised linear models, and their multilevel generalisations are introduced. However, instead of just being taught to run the tests in some statistics software, readers are encouraged to think about the tests and the (usually limited) inferences they warrant. Also, what I call Stefan Gries’ Modeling Everything Approach (MEA) to multilevel modeling is critically evaluated.

There is no introduction to any specific statistics software in the book, especially as such hands-on textbooks for linguistics have already flooded the market, and because they further the toxic recipe-like application of statistical methods without thinking. Instead, a series of Creative Commons-licensed videos will be created (and made available on streaming platforms) that show how to work with the methods discussed in the book in R and RStudio.

The book will be licensed under a Creative Commons license and will be submitted to Language Science Press (series Textbooks in Language Sciences). I intend to insist on an open review process. The book is written in XeLaTeX, R, and knitr, using RStudio Server as an IDE. Since my views on statistical thinking have changed significantly several times over the past five years, I have had to restart from scratch several times. Consequently, I will not be able to finish the book before late 2019. This project was previously called Modellierung grammatischer Alternationen and was planned to be written in German first (and later in English). Another title I considered for some time was Statistical Modelling in Linguistics. The scope has widened, and the book is now being written in English first. Also, the title now reflects the more holistic approach to statistical inference. If the book is a success, a German translation will be published.