Bildhauer & Schäfer: Creation, Use, and Analysis of Linguistically Annotated Resources (DGfS-CL Fall School 2017)

Felix Bildhauer and Roland Schäfer. Creation, Use, and Analysis of Linguistically Annotated Resources [work-in-progress slides available by clicking the title]. Course at the DGfS-CL Fall School 2017, 11–22 September 2017, Düsseldorf.

We are planning to turn most of this material (and then some) into a Creative Commons-licensed book with the working title Many things many linguists should know about the creation, evaluation, and use of corpora* (* but sometimes don’t bother to ask), or The Corpus Book for short.

This course covers a range of topics from corpus creation through corpus evaluation to practical issues of corpus usage.
The first part addresses basic questions of document sampling and introduces a number of tools for (automatic) annotation on diverse linguistic levels. We also discuss adaptations of both tools and pre-trained models for dealing with the kind of non-standard writing frequently found in informal communicative settings. This part of the course is relevant not only to students planning to build their own corpora but also to corpus users, as it offers insights into how much the final product depends on design and processing decisions.
Next, we address a variety of topics in corpus comparison and evaluation. Among other things, we cover the automatic generation of document metadata and its usefulness in linguistic and computational-linguistic research settings. We explore thematic corpus composition using topic modelling techniques, and we discuss task-based evaluation of corpora on the basis of distributional semantic models.
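To give a flavour of the last point, here is a minimal, self-contained sketch of a count-based distributional semantic model: co-occurrence vectors from a tiny toy corpus compared by cosine similarity. This is purely illustrative (the toy sentences and the window size are our own assumptions, not course material); in a task-based evaluation, such vectors would be trained on the corpora under comparison and scored against, e.g., a word-similarity benchmark.

```python
from collections import Counter, defaultdict
from math import sqrt

# Toy corpus (an assumption for illustration); a real evaluation would use
# vectors trained on the large corpora being compared.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "stocks fell on the market",
    "the market saw stocks rise",
]

WINDOW = 2  # symmetric co-occurrence window around each target token

# Collect sparse co-occurrence count vectors for every word.
vectors = defaultdict(Counter)
for sentence in corpus:
    tokens = sentence.split()
    for i, tok in enumerate(tokens):
        for j in range(max(0, i - WINDOW), min(len(tokens), i + WINDOW + 1)):
            if i != j:
                vectors[tok][tokens[j]] += 1

def cosine(u, v):
    """Cosine similarity of two sparse count vectors."""
    dot = sum(u[w] * v[w] for w in u)
    norm = sqrt(sum(c * c for c in u.values())) * sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0

# Words occurring in similar contexts ("cat"/"dog") come out more similar
# than distributionally unrelated pairs ("cat"/"stocks").
print(cosine(vectors["cat"], vectors["dog"]))
print(cosine(vectors["cat"], vectors["stocks"]))
```

Real evaluations would of course use PPMI weighting, dimensionality reduction, or predictive models rather than raw counts, but the intuition is the same: the quality of a corpus is reflected in how well the vectors derived from it capture semantic relatedness.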
Finally, we turn to typical usage scenarios in corpus linguistics and computational linguistics and introduce several corpus processing/querying engines. We show how these can be integrated with software for statistical analysis (Python, R) and, in a number of small linguistic case studies, we illustrate an efficient workflow for querying large annotated corpora and post-processing the concordances thus generated.
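As a small illustration of what such post-processing can look like, the sketch below parses a hypothetical tab-separated KWIC export (left context, match, right context) and counts the first token to the right of each match. Both the export format and the example data are our own assumptions; actual formats depend on the query engine used, but the Counter-based pattern transfers directly.

```python
import csv
import io
from collections import Counter

# Hypothetical tab-separated KWIC export (left context, match, right context).
# The concrete format depends on the corpus query engine; this is illustrative.
kwic_data = """weil er das Buch\tgelesen\that und dann
weil sie den Brief\tgelesen\thatte bevor sie
dass er den Roman\tgelesen\thaben muss
"""

# Count the first token right of the match -- a typical post-processing step
# before handing frequency tables to R or Python for statistical analysis.
reader = csv.reader(io.StringIO(kwic_data), delimiter="\t")
right_collocates = Counter()
for row in reader:
    if len(row) != 3:
        continue  # skip malformed or blank lines
    left, match, right = row
    right_collocates[right.split()[0]] += 1

print(right_collocates.most_common())
```

In practice the same table could be written to disk with `csv.writer` and read into R with `read.delim` for further modelling, which is the kind of hand-off between query engine and statistics software the case studies demonstrate.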
Much of the software we use is written in Python, so some familiarity with Python will be a plus, but it is by no means required. Likewise, some familiarity with the R statistical computing environment will make it easier to benefit fully from the last part of the course.