Category Archives: Research

Academic CV, publications, list of courses taught in PDF form

Updated: 1 December 2019

Probabilistic German Morphosyntax (Habilitationsschrift)

Probabilistic German Morphosyntax is a sequence of papers with a methodological introduction. Together, they constitute my kumulative Habilitation (the cumulative, paper-based form of the second thesis required in German-speaking academic systems). On this basis, I obtained the venia legendi for German and General Linguistics from the Faculty of Language Sciences (Sprach- und literaturwissenschaftliche Fakultät) at Humboldt-Universität zu Berlin on 10 April 2019.

Download: Roland Schäfer (2018) Probabilistic German Morphosyntax. General introduction, overview, and wrap-up (Rahmentext der kumulativen Habilitationsschrift).

Linguistic web characterization and web corpus creation (DFG)

Work on this project at Freie Universität Berlin, German Grammar Group / German and Dutch Philology, is supported by the German Research Council (Deutsche Forschungsgemeinschaft, DFG) grant SCHA1916/1-1.

Publications as of January 2019:

Schäfer & Bildhauer (in prep.) on web characterisation and corpus comparison
Bildhauer & Schäfer (in prep.) on COReX and its usability in corpus studies
Schäfer (2018) on corpora and cognitive representativity
Schäfer & Pankratz (2018) on corpora and cognitive representativity
Bildhauer & Schäfer (2017) on topic annotation
Schäfer & Bildhauer (2016) on topic annotation
Schäfer (2016b) on the ClaraX crawler
Schäfer (2016a) on boilerplate detection

Software/data releases as of January 2019:

Principal investigator: Roland Schäfer

Funding amount: 286,100€

Runtime: January 2015 – June 2018 (interrupted April–September 2016)

Student assistants:

Kim Maser, Humboldt-Universität Berlin (2015–2017)
Luise Rißmann, Freie Universität Berlin (2015–2018)

Officially collaborating institutions:


ClaraX random walk crawler

Currently bundled with texrex on GitHub.

ClaraX (funded by the German Research Council through grant SCHA1916/1-1 Linguistic web characterization) is the companion of the planned (but delayed) HeidiX (Heidi is a crawler system) software. It performs parametrized random walk crawls in the web graph and integrates texrex’s full web page cleaning functionality. It is purely experimental in the sense that it is designed for experiments and fundamental research, and it is in no way suitable for large-scale production crawling. It is released under a permissive 2-clause BSD license.
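To illustrate what a parametrized random walk in a web graph looks like in general, here is a minimal Python sketch. It is not ClaraX code and does not reflect its actual parameters. The function name `random_walk`, the `restart_prob` parameter, and the toy in-memory graph are all assumptions made for this example; a real crawler would fetch pages and extract links instead of looking them up in a dictionary.

```python
import random


def random_walk(graph, start, steps, restart_prob=0.15, seed=0):
    """Return the list of pages visited by a random walk with restarts.

    graph: dict mapping a page to the list of pages it links to (toy stand-in
           for the web graph; hypothetical, not a ClaraX data structure).
    restart_prob: probability of jumping back to the start page at each step.
    """
    rng = random.Random(seed)  # seeded so that a walk can be reproduced
    current = start
    visited = [current]
    for _ in range(steps):
        links = graph.get(current, [])
        # Jump back to the start on dead ends, or at random with restart_prob.
        if not links or rng.random() < restart_prob:
            current = start
        else:
            current = rng.choice(links)
        visited.append(current)
    return visited


# Toy graph with made-up page names; "c.example/" is a dead end.
toy_web = {
    "a.example/": ["b.example/", "c.example/"],
    "b.example/": ["a.example/", "c.example/"],
    "c.example/": [],
}

if __name__ == "__main__":
    print(random_walk(toy_web, "a.example/", steps=10))
```

Varying `restart_prob`, `steps`, and the seed is the kind of parametrization that makes such walks useful for sampling experiments, as opposed to exhaustive crawling.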

texrex web page cleaning system

Moved to GitHub as of 1 May 2016 (from SourceForge rev. 622).

This is the workhorse web page cleaning system behind the COW corpora. It turns crawled HTML documents into clean XML corpus documents. It is released under a permissive 2-clause BSD license.

Colibri² corpus portal

Because none of the available web interfaces to the IMS Open Corpus Workbench was right for hosting the COW web corpora, I started working on a bespoke interface called Colibri². It is really a spare-time project, and I do not release the code because I consider it trivialware.


COW web corpus initiative

COW (COrpora from the Web) is a collection of linguistically processed gigatoken web corpora. We have corpora in major European languages (Dutch, English, French, German, Spanish, Swedish). Currently, the corpora are between 1 billion and 10 billion tokens in size. The third-generation corpora COW2014 are all larger than their predecessors, some containing 10 billion tokens or more. We also focus on corpus quality in all areas (data collection as well as post-processing and linguistic annotation), not just on larger corpus sizes. To avoid legal problems with copyright claims, the published corpora are sentence shuffles.
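As a rough illustration of what a sentence shuffle means, the following Python sketch pools the sentences of all documents and randomizes their order, so that no original document can be read in sequence. This is an assumption-laden toy, not the COW pipeline: the function name `sentence_shuffle` and the input format (a list of documents, each a list of sentence strings) are hypothetical.

```python
import random


def sentence_shuffle(documents, seed=0):
    """Pool all sentences from all documents and return them in random order.

    documents: list of documents, each given as a list of sentence strings
               (hypothetical input format chosen for this sketch).
    """
    # Flatten: document boundaries are deliberately discarded.
    sentences = [sentence for doc in documents for sentence in doc]
    rng = random.Random(seed)  # seeded only to make the example reproducible
    rng.shuffle(sentences)
    return sentences


# Made-up example documents.
docs = [
    ["This is the first sentence.", "This is the second sentence."],
    ["Another document starts here.", "It also ends here."],
]

if __name__ == "__main__":
    for sentence in sentence_shuffle(docs):
        print(sentence)
```

Each sentence remains intact for linguistic analysis, while the document context above the sentence level is no longer recoverable from the published order.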

Please go to the web page of COW (Corpora from the Web).