Tag Archives: Boilerplate

Accurate and Efficient General-Purpose Boilerplate Detection for Crawled Web Corpora (2016)

Roland Schäfer (to appear) Accurate and Efficient General-Purpose Boilerplate Detection for Crawled Web Corpora. LREV. Available online first (paywall), Springer Nature SharedIt (crippled free). [BibTeX]

Full data set and scripts on GitHub.

texrex web page cleaning system

Moved to GitHub as of 1 May 2016 (from SourceForge rev. 622).

This is the workhorse web page cleaning system behind the COW corpora. It turns crawled HTML documents into clean XML corpus documents. It is released under a permissive 2-clause BSD license.

Focused Web Corpus Crawling (2014)

Roland Schäfer, Adrien Barbaresi & Felix Bildhauer (2014) Focused Web Corpus Crawling. In Proceedings of the 9th Web as Corpus workshop (WAC-9). [BibTeX]

Web Corpus Construction (2013)

Roland Schäfer & Felix Bildhauer (2013) Web Corpus Construction. Morgan and Claypool. [BibTeX]

Websites: Morgan & Claypool (official), Companion web site (additional information, errata, etc.)

Reviews: Serge Sharoff in Computational Linguistics 41(1) (2015), Mats Wirén in Nordic Journal of Linguistics 37(3) (2014)

Scalable Construction of High-quality Web Corpora (2013)

Chris Biemann, Felix Bildhauer, Stefan Evert, Dirk Goldhahn, Uwe Quasthoff, Roland Schäfer, Johannes Simon, Leonard Swiezinski & Torsten Zesch (2013) Scalable Construction of High-quality Web Corpora. In Journal for Language Technology and Computational Linguistics 18. 23–60. [BibTeX]

Building Large Corpora from the Web Using a New Efficient Tool Chain (2012)

Roland Schäfer & Felix Bildhauer (2012) Building Large Corpora from the Web Using a New Efficient Tool Chain. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC’12). 486–493. [BibTeX]

Please cite this paper if you use the COW corpora up to version COW16.

texrex software on GitHub
COW annotation tool chain on GitHub.

Building large corpora from the web (ESSLLI 2012)

Building large corpora from the web, Foundational course at the European Summer School in Logic, Language and Information 2012, Opole

Building large corpora from the web (for printing)
Building large corpora from the web (for screen reading)
