This is a one-day hands-on tutorial for advanced COW users, held on 15 February 2017 at the invitation of the Department of English Linguistics at Goethe University, Frankfurt/Main.
In this tutorial, participants are briefly introduced to important advantages and disadvantages of using large web-crawled corpora for linguistic research. Then, the state-of-the-art DECOW16 and ENCOW16 corpora are introduced. These web corpora, with 10 to 20 billion tokens, contain a large number of annotations, such as POS, lemma, NER, dependency, constituency, morphosyntactic features (case, tense, number, etc.), and, for German only, base lemma and nominal compound analysis.
In the practical sessions, participants first learn how to use the NoSketchEngine interface (at www.webcorpora.org). NoSkE (with its underlying Manatee engine) is the only web interface that can deal efficiently with both complex annotations and very large corpora (>10 billion tokens). It is versatile enough for most studies in corpus linguistics, and it is indispensable for data exploration and study design. To unleash the full power of the Manatee engine and make optimal use of the COW16 corpora, users can also call Manatee as a library from various programming languages. Part 4 of this tutorial introduces participants to the basics of using Manatee from Python.
Basic knowledge of regular expressions is recommended for parts 3 and 4. Knowledge of Python (ideally at this level) is a prerequisite for part 4. Participants are invited to “bring their own research questions”. Participants should already have an account at www.webcorpora.org.
- What every linguist who wants to use web corpora needs to know about web corpora. Really, you are not allowed to use them unless you know this!
- Architecture of and annotations in DECOW16A, ENCOW16A
- [Practical] CQL at the command line and in NoSketchEngine
- [Practical] Using Manatee from Python
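As a small taste of how CQL and Python fit together in parts 3 and 4, the sketch below composes a CQL query string in plain Python. It does not call Manatee or NoSketchEngine. The helper names `cql_token` and `cql_sequence` are hypothetical, and the attribute names `tag` and `lemma` are assumptions that should be checked against the attribute list of the corpus actually being queried.

```python
def cql_token(**constraints):
    """Build one CQL token expression from attribute/regex pairs.

    The attribute names passed in (e.g. tag, lemma) are assumptions;
    check the corpus documentation for the names it really uses.
    """
    if not constraints:
        return "[]"  # an empty pair of brackets matches any token
    parts = ['{}="{}"'.format(attr, regex) for attr, regex in constraints.items()]
    return "[" + " & ".join(parts) + "]"


def cql_sequence(*tokens):
    """Join token expressions into a query for adjacent tokens."""
    return " ".join(tokens)


# An adjective followed by a noun with the lemma "corpus".
query = cql_sequence(
    cql_token(tag="JJ.*"),
    cql_token(lemma="corpus", tag="NN.*"),
)
print(query)
```

The resulting string can be pasted into the CQL field of the web interface. Building queries programmatically like this mainly pays off when many variants of one pattern are needed, for example one query per lemma in a word list.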