Roland Schäfer (to appear) Accurate and Efficient General-Purpose Boilerplate Detection for Crawled Web Corpora. LREV. Available online first (paywall), Springer Nature SharedIt (crippled free). [BibTeX]
Roland Schäfer. Processing and Querying Large Web Corpora with the COW14 Architecture. In Proceedings of Challenges in the Management of Large Corpora (CMLC-3) (IDS publication server). 28–34. [BibTeX]
COW (COrpora from the Web) is a collection of linguistically processed gigatoken web corpora. The corpora cover major European languages (Dutch, English, French, German, Spanish, Swedish). Currently, they range from 1 billion to 10 billion tokens in size. The third-generation corpora (COW2014) are all larger than their predecessors, some containing 10 billion tokens or more. Beyond larger corpus sizes, we also focus on corpus quality in all areas: data collection, post-processing, and linguistic annotation. To avoid legal problems with copyright claims, the published corpora are sentence shuffles, i.e. the sentences are released in randomized order rather than as continuous documents.
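To illustrate what a sentence shuffle means in practice, here is a minimal Python sketch. It is not the COW tool chain: the function name `sentence_shuffle`, the toy documents, and the fixed seed are all assumptions made for illustration. The sketch pools the sentences of all documents and randomizes their order, so that no original document can be read as continuous text.

```python
import random


def sentence_shuffle(documents, seed=0):
    """Pool the sentences of all documents and return them in random order.

    `documents` is a list of documents, each given as a list of sentence
    strings. Hypothetical helper for illustration only.
    """
    # Flatten: document boundaries are discarded on purpose.
    sentences = [sentence for doc in documents for sentence in doc]
    # A seeded generator makes the shuffle reproducible.
    rng = random.Random(seed)
    rng.shuffle(sentences)
    return sentences


# Toy input: two short "documents", already split into sentences.
docs = [
    ["This is the first sentence.", "Here is the second."],
    ["Another document starts here.", "It ends here."],
]

shuffled = sentence_shuffle(docs)
for sentence in shuffled:
    print(sentence)
```

Every sentence is retained, so sentence-level linguistic queries remain possible, while anything that depends on document context is lost.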
Please go to the web page of COW (COrpora from the Web).
Chris Biemann, Felix Bildhauer, Stefan Evert, Dirk Goldhahn, Uwe Quasthoff, Roland Schäfer, Johannes Simon, Leonard Swiezinski & Torsten Zesch (2013) Scalable Construction of High-quality Web Corpora. In Journal for Language Technology and Computational Linguistics 18. 23–60. [BibTeX]
Roland Schäfer & Felix Bildhauer (2012) Building Large Corpora from the Web Using a New Efficient Tool Chain. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC’12). 486–493. [BibTeX]
Please cite this paper if you use the COW corpora up to version COW16.
Building large corpora from the web, Foundational course at the European Summer School in Logic, Language and Information 2012, Opole