Category Archives: Projects

Probabilistic German Morphosyntax (Habilitation)

Probabilistic German Morphosyntax is a sequence of papers for my kumulative Habilitation (cumulative version of the second thesis in the German/Austrian system) to be submitted to the Faculty of Languages and Letters of the Humboldt University Berlin. The official process (Habilitationsverfahren) is expected to begin in March 2018.

As of December 2017, three papers have been published and one has been accepted with revisions. Paper 4 (Schäfer & Sayatz 2014) has been translated into English, and all papers are currently being edited into a book with added introductions to probabilistic grammar, web corpora as a source of data (based on many of my other publications), and statistical methods.

  1. Roland Schäfer (2017, accepted with revisions) Competing Constructions for German Measure Noun Phrases: from Usage Data to Experiment.
  2. Roland Schäfer & Ulrike Sayatz (2016) Punctuation and Syntactic Structure in “obwohl” and “weil” Clauses in Nonstandard Written German. Written Language and Literacy (WLL) 19(2), 215–248.
  3. Roland Schäfer (2016, ahead of print) Prototype-driven Alternations: The Case of German Weak Nouns. Corpus Linguistics and Linguistic Theory (CLLT).
  4. Roland Schäfer & Ulrike Sayatz (2014) Die Kurzformen des Indefinitartikels im Deutschen (Cliticization of the Indefinite Article in German). Zeitschrift für Sprachwissenschaft (ZS) 33(2).

Linguistic web characterization and web corpus creation (DFG)

Work on this project at Freie Universität Berlin, German Grammar Group, German and Dutch Philology, is supported by the German Research Council (Deutsche Forschungsgemeinschaft, DFG) grant SCHA1916/1-1.

Project leader: Roland Schäfer

Funding amount: 286,100€

Duration: January 2015 – June 2018 (interrupted April–September 2016)

Student assistants:

  • Kim Maser, Humboldt-Universität Berlin (2015–2017)
  • Luise Rißmann, Freie Universität Berlin (2015–)

Officially collaborating institutions:


COW web corpus initiative

COW (COrpora from the Web) is a collection of linguistically processed gigatoken web corpora. We have corpora in major European languages (Dutch, English, French, German, Spanish, Swedish). Currently, the corpora range in size from 1 billion to 10 billion tokens. The third-generation corpora (COW2014) are all larger than their predecessors, some containing 10 billion tokens or more. We are also focusing on corpus quality in all areas (data collection as well as post-processing and linguistic annotation), not just on larger corpus sizes. To avoid legal problems with copyright claims, the published corpora are sentence shuffles, that is, the sentences are released in randomized order rather than as intact documents.
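
The idea of a sentence shuffle can be sketched in a few lines of Python. This is only a minimal illustration and not the actual COW pipeline: the function name `sentence_shuffle`, the fixed seed, and the toy input documents are all assumptions made for this example.

```python
import random


def sentence_shuffle(documents, seed=0):
    """Pool the sentences of all documents and return them in random order.

    `documents` is a list of documents, each given as a list of sentence
    strings. Hypothetical helper for illustration only.
    """
    # Flatten: after this step, document boundaries are no longer recorded.
    pooled = [sentence for doc in documents for sentence in doc]
    # Use a local random generator so the global random state is untouched.
    rng = random.Random(seed)
    rng.shuffle(pooled)
    return pooled


# Toy input (made up for this sketch): two tiny "documents".
docs = [
    ["Das ist ein Satz.", "Hier steht noch einer."],
    ["This is a sentence.", "Here is another one.", "And a third."],
]

shuffled = sentence_shuffle(docs)
for sentence in shuffled:
    print(sentence)
```

Every sentence is still available for linguistic analysis, but the original texts cannot be read in their original order, which is the point of publishing a shuffle instead of full documents.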

For more information, please go to the web page of COW (Corpora from the Web).