Roland Schäfer

Professor of Linguistics | German Grammar | Jena

Tag Archives: Parallel computing

CommonCOW: Massively Huge Web Corpora from CommonCrawl Data and a Method to Distribute them Freely under Restrictive EU Copyright Laws (Proc LREC)

Roland Schäfer (2016). CommonCOW: massively huge web corpora from CommonCrawl data and a method to distribute them freely under restrictive EU copyright laws. In Proceedings of LREC 2016. 4500–4504. [BibTeX]

texrex web page cleaning system

Moved to GitHub as of 1 May 2016 (from SourceForge rev. 622).

This is the workhorse web page cleaning system behind the COW. It turns crawled HTML documents into clean XML corpus documents. It is released under a permissive 2-clause BSD license.

Processing and Querying Large Web Corpora with the COW14 Architecture (Proc CMLC)

Roland Schäfer. Processing and Querying Large Web Corpora with the COW14 Architecture. In Proceedings of Challenges in the Management of Large Corpora (CMLC-3) (IDS publication server). 28–34. [BibTeX]

Scalable Construction of High-quality Web Corpora (JLTCL)

Chris Biemann, Felix Bildhauer, Stefan Evert, Dirk Goldhahn, Uwe Quasthoff, Roland Schäfer, Johannes Simon, Leonard Swiezinski & Torsten Zesch (2013). Scalable Construction of High-quality Web Corpora. In Journal for Language Technology and Computational Linguistics 18. 23–60. [BibTeX]

My Einführung in die grammatische Beschreibung has been downloaded 97,764 times and is the second most downloaded monograph of LangSci Press (as of 3 June 2025). The fourth edition will be out before the 100,000th download and the 10th anniversary in 2025. [Information and Errata]

Office Address

Prof. Dr. Roland Schäfer
Germanistische Sprachwissenschaft
Fürstengraben 30
07743 Jena

Secretary: Nadin Friebe