Linguistic research with large annotated web corpora (2013). Pre-conference tutorial, The 20th International Conference on Head-Driven Phrase Structure Grammar, Berlin, August 26, 2013, 9:30 – 16:00
The world wide web most likely constitutes the largest existing source of texts written in a great variety of languages. A feasible and sound way of exploiting this data for linguistic research is to compile a static corpus for a given language. For example, we have created linguistically annotated giga-token web corpora for various languages (Dutch 2.5 GT, English 3.9 GT, French 4.3 GT, German 9.1 GT, Spanish 1.6 GT, Swedish 2.3 GT) and are still in the process of creating new corpora (Danish, Japanese, Portuguese, etc.), as well as improving the old ones.
However, anyone who needs to do serious work with web corpora should be aware of the characteristics (and limitations) of such corpora, which depend to a considerable extent on a number of decisions taken in their making. The first aim of this tutorial is to illustrate the various steps that lead from data collection on the web to the final, linguistically annotated corpus, highlighting the stages where crucial decisions have to be made and how these may be reflected in the corpus.
The second part of this tutorial is a hands-on introduction to the use of the Open Corpus Workbench (a piece of software well suited to storing and querying very large corpora), with special attention to its integration with the R statistics environment. We use our own web corpora for the demonstration.
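To give a flavor of the kind of query the hands-on session covers, a query in CQP (the query processor of the Open Corpus Workbench) might look like the following sketch. The attribute names `pos` and `lemma` and the Penn-style tag values are common CWB conventions and are assumed here, not necessarily the annotation scheme of the tutorial corpora:

```
# match an adjective followed by a noun (Penn-style tags assumed)
A = [pos = "JJ.*"] [pos = "NN.*"];

# inspect the matches
cat A;
```

Named query results such as `A` can then be counted, sorted, or exported for further statistical analysis, which is where the integration with R comes into play.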
Pictures taken by Stefan Müller