Putting web corpora to good use: DECOW and non-standard ‘weil’ and ‘obwohl’ (Potsdam 2017)

Roland Schäfer. Putting web corpora to good use: the next-generation DECOW corpus and non-standard ‘weil’ and ‘obwohl’ clauses in German. Potsdam University Computational Linguistics Colloquium, 19 June 2017.

In this talk, I first introduce the next-generation COW16 web corpora (joint work with Felix Bildhauer, IDS Mannheim). They contain many more levels of linguistic annotation than the COW14 generation and most other English, French, German, and Spanish corpora; for the German DECOW16, these include improved lemmatisation, morphological tagging, base lemma, full nominal compound analyses, topological parses, and dependency parses, among others. I also briefly talk about our new approaches to metadata generation, primarily the COReCo topic classification using Latent Dirichlet Allocation and the COReX framework for extracting grammatical features for document-level annotation. I very briefly introduce the available options for using COW, most prominently NoSketchEngine and direct access from Python and R on our server (including an RStudio Server installation).
About half of our users are computational linguists, but COW was targeted specifically at theoretical and corpus linguists. Therefore, in the main part of the talk, I present a DECOW corpus study of non-standard uses of ‘obwohl’ and ‘weil’ clauses with embedded verb-second constituent order (V2) in German (joint work with Ulrike Sayatz, FU Berlin). In this study, which could not have been conducted using any other available corpus, we take an exclusively graphemic look at the phenomenon. We first argue for ‘Usage-Based Graphemics’, a usage-based prototype approach to the syntax-graphemics interface. Within this framework, we show that ‘weil’ and ‘obwohl’ are used with characteristic patterns of punctuation in non-standard texts, and that these patterns are a clear indicator that writers use punctuation to mark different levels of sentential integration of the respective clauses. Our study corroborates existing theories which assume that ‘obwohl’ with V2 has a different status from ‘weil’ with V2, and that the ‘obwohl’ clauses in question have a much lower degree of sentential integration.