Felix Bildhauer & Roland Schäfer: Token-level noise in large Web corpora and non-destructive normalization for linguistic applications. Corpus Analysis with Noise in the Signal (CANS 2013). Corpus Linguistics 2013, Lancaster.
The Good, the Bad, and the Hazy: Design Decisions in Web Corpus Construction (Proc WAC)
Web Corpus Construction (Morgan & Claypool)
Roland Schäfer & Felix Bildhauer (2013) Web Corpus Construction. Morgan & Claypool.
Websites: Morgan & Claypool (official), companion website (additional information, errata, etc.)
Reviews: Serge Sharoff in Computational Linguistics 41(1) (2015), Mats Wirén in Nordic Journal of Linguistics 37(3) (2014)
Scalable Construction of High-quality Web Corpora (JLTCL)
Building Large Corpora from the Web Using a New Efficient Tool Chain (Proc LREC)
Roland Schäfer & Felix Bildhauer (2012) Building Large Corpora from the Web Using a New Efficient Tool Chain. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC’12). 486–493.
Please cite this paper if you use the COW corpora up to version COW16.
Building large corpora from the web (ESSLLI 2012)
Building large corpora from the web: foundational course at the European Summer School in Logic, Language and Information (ESSLLI) 2012, Opole
Building large corpora from the web (for printing)
Building large corpora from the web (for screen reading)