Probabilistic German Morphosyntax is a sequence of papers with a methodological introduction, constituting my kumulative Habilitation (the cumulative variant of the second thesis in German-speaking academic systems). On the basis of this work, I was awarded the venia legendi for German and General Linguistics by the Faculty of Language and Literature (Sprach- und literaturwissenschaftliche Fakultät) at Humboldt-Universität zu Berlin on 10 April 2019.
Work on this project at Freie Universität Berlin, German Grammar Group / German and Dutch Philology, is supported by the German Research Foundation (Deutsche Forschungsgemeinschaft, DFG) under grant SCHA1916/1-1.
Publications as of January 2019:
- Schäfer & Bildhauer (in prep.) on web characterisation and corpus comparison
- Bildhauer & Schäfer (in prep.) on COReX and its usability in corpus studies
- Schäfer (2018) on corpora and cognitive representativity
- Schäfer & Pankratz (2018) on corpora and cognitive representativity
- Bildhauer & Schäfer (2017) on topic annotation
- Schäfer & Bildhauer (2016) on topic annotation
- Schäfer (2016b) on the ClaraX crawler
- Schäfer (2016a) on boilerplate detection
Software/data releases as of January 2019:
- DECOW16B corpus
- RanDECOW17 corpus
- COReX18 databases
- COWTek18 with COReX software
- ClaraX random walker with texrex
Principal investigator: Roland Schäfer
Funding amount: €286,100
Runtime: January 2015 – June 2018 (interrupted April–September 2016)
Staff:
- Kim Maser, Humboldt-Universität zu Berlin (2015–2017)
- Luise Rißmann, Freie Universität Berlin (2015–2018)
Officially collaborating institutions:
- Institut für Deutsche Sprache, Mannheim, Abteilung Lexik
- Institut für Deutsche Sprache, Mannheim, Abteilung Grammatik
- Institut für Maschinelle Sprachverarbeitung, Stuttgart
COW (COrpora from the Web) is a collection of linguistically processed gigatoken web corpora in major European languages (Dutch, English, French, German, Spanish, Swedish). The corpora currently range from 1 billion to over 10 billion tokens. The third-generation COW2014 corpora are all larger than their predecessors, some containing 10 billion tokens or more. We also focus on corpus quality in all areas (data collection as well as post-processing and linguistic annotation), not just on larger corpus sizes. To avoid legal problems with copyright claims, the published corpora are distributed as sentence shuffles.
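The idea behind a sentence shuffle is simply to randomly permute the sentences of a corpus so that original documents cannot be reconstructed, while sentence-level linguistic information is preserved. A minimal sketch in Python (the function name and the toy corpus are illustrative, not part of the actual COW toolchain):

```python
import random

def sentence_shuffle(sentences, seed=42):
    """Return a randomly permuted copy of a list of sentences.

    Shuffling at the sentence level destroys document context,
    which is what makes distributing web corpora as 'sentence
    shuffles' a way to sidestep copyright claims.
    """
    shuffled = list(sentences)           # copy; leave the input untouched
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    return shuffled

# Toy example (hypothetical data, not from a COW corpus)
corpus = ["Das ist Satz eins.", "Hier kommt Satz zwei.", "Und noch ein Satz."]
print(sentence_shuffle(corpus))
```

In practice the permutation is of course applied to the entire token-annotated corpus rather than an in-memory list, but the principle is the same: the multiset of sentences is unchanged, only their order and document grouping is lost.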
Please go to the web page of COW (COrpora from the Web).