Roland Schäfer (to appear) Accurate and Efficient General-Purpose Boilerplate Detection for Crawled Web Corpora. LREV. Available online first (paywall), Springer Nature SharedIt (crippled free). [BibTeX]
CommonCOW: Massively Huge Web Corpora from CommonCrawl Data and a Method to Distribute them Freely under Restrictive EU Copyright Laws (2016)
Processing and Querying Large Web Corpora with the COW14 Architecture (2015)
Roland Schäfer (2015) Processing and Querying Large Web Corpora with the COW14 Architecture. In Proceedings of Challenges in the Management of Large Corpora (CMLC-3) (IDS publication server). 28–34. [BibTeX]
Die Kurzformen des Indefinitartikels im Deutschen (2014)
Roland Schäfer & Ulrike Sayatz (2014) Die Kurzformen des Indefinitartikels im Deutschen (Cliticization of the indefinite article in German). Zeitschrift für Sprachwissenschaft (ZS) 33(2). [BibTeX]
Focused Web Corpus Crawling (2014)
Proceedings of the 9th Web as Corpus Workshop (2014)
Felix Bildhauer & Roland Schäfer (eds.) (2014) Proceedings of the 9th Web as Corpus Workshop (WAC-9). ACL: Stroudsburg. [BibTeX]
The Good, the Bad, and the Hazy: Design Decisions in Web Corpus Construction (2013)
Web Corpus Construction (2013)
Roland Schäfer & Felix Bildhauer (2013) Web Corpus Construction. Morgan & Claypool. [BibTeX]
Websites: Morgan & Claypool (official), Companion web site (additional information, errata, etc.)
Reviews: Serge Sharoff in Computational Linguistics 41(1) (2015), Mats Wirén in Nordic Journal of Linguistics 37(3) (2014)
Scalable Construction of High-quality Web Corpora (2013)
Building Large Corpora from the Web Using a New Efficient Tool Chain (2012)
Roland Schäfer & Felix Bildhauer (2012) Building Large Corpora from the Web Using a New Efficient Tool Chain. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC’12). 486–493. [BibTeX]
Please cite this paper if you use the COW corpora up to version COW16.