Bildhauer & Schäfer: Working with web corpora (Grammar and Corpora, IDS Mannheim, 8 November 2016)

Talk & tutorial at Grammar and Corpora 2016, IDS Mannheim.

Web corpora (huge, post-processed collections of web pages) are an increasingly important source of data for linguistic research, thanks to their size, content, and availability. The last decade has seen important developments in the construction of web corpora, and the current generation surpasses its predecessors in cleanliness, in the level and quality of linguistic annotation, and in enrichment with metadata. At the same time, web corpora have peculiarities (such as sampling biases, duplication, non-standard orthography and language, and the lack of some metadata) that may discourage linguists from using them. Linguists working with web corpora should be aware of these limitations at all times.

This workshop will start with a brief introduction to the making of web corpora, discussing some of the most important questions of design and processing, including linguistic annotation. The main focus of the workshop, however, is on practical questions that frequently arise from a linguist’s perspective. In particular, we will discuss what web corpora can (and cannot) do for linguists in their daily corpus-linguistic work, regarding such issues as the reliability of annotation, the availability of metadata, data integrity and representativeness, and the practical limitations of typical query engines. Much of the workshop will consist of hands-on examples and exercises, and we will introduce practical solutions and workarounds for a number of frequently encountered problems. For maximal benefit, participants should bring their own laptop.

Roland Schäfer and Felix Bildhauer have been involved in building corpora from the web since 2011. They have created some of the world’s largest web corpora for a variety of languages, including German.