- Roland Schäfer’s academic CV (German)
- Roland Schäfer’s publications PDF
- List of courses taught by Roland Schäfer (German)
Updated: 1 December 2019
Currently bundled with texrex on GitHub.
ClaraX (funded by the German Research Council through grant SCHA1916/1-1, Linguistic web characterization) is the companion of the planned (but delayed) HeidiX software (Heidi is a crawler system). It performs parametrized random walk crawls of the web graph and integrates the full web page cleaning functionality of texrex. It is purely experimental: it is designed for experiments and fundamental research, and it is in no way suitable for large-scale production crawling. It is released under a permissive 2-clause BSD license.
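To give a rough idea of what a parametrized random walk over a link graph looks like, here is a minimal Python sketch on a toy graph. It is not ClaraX code and makes no claim about how ClaraX is implemented: the function `random_walk`, the `restart_prob` parameter, and the `toy_web` graph are all hypothetical and only serve as illustration.

```python
import random


def random_walk(graph, start, steps, restart_prob=0.15, seed=None):
    """Walk a directed graph given as {node: [successor, ...]}.

    Hypothetical illustration only; the parameter names are assumptions.
    """
    rng = random.Random(seed)
    current = start
    visited = [current]
    for _ in range(steps):
        successors = graph.get(current, [])
        # Jump back to the start node at a dead end,
        # or with probability restart_prob.
        if not successors or rng.random() < restart_prob:
            current = start
        else:
            current = rng.choice(successors)
        visited.append(current)
    return visited


# A made-up four-host "web"; d.example is not linked from anywhere.
toy_web = {
    "a.example": ["b.example", "c.example"],
    "b.example": ["c.example"],
    "c.example": ["a.example"],
    "d.example": [],
}

walk = random_walk(toy_web, "a.example", steps=10, restart_prob=0.2, seed=1)
print(walk)
```

The restart probability is one example of a parameter such a walk can expose; varying it changes how far the walk tends to drift from its seed node.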
Moved to GitHub as of 1 May 2016 (from SourceForge rev. 622).
This is the workhorse web page cleaning system behind the COW. It turns crawled HTML documents into clean XML corpus documents. It is released under a permissive 2-clause BSD license.
Because none of the available web interfaces to the IMS Open Corpus Workbench was right for hosting the COW web corpora, I started working on a bespoke interface called Colibri². It is really a spare-time project, and I do not release the code because I consider it trivialware.