Tag Archives: Corpus construction

Accurate and Efficient General-Purpose Boilerplate Detection for Crawled Web Corpora (2016)

Roland Schäfer (to appear). Accurate and Efficient General-Purpose Boilerplate Detection for Crawled Web Corpora. LREV. Available online first (paywall), Springer Nature SharedIt (crippled free). [BibTeX]

Full data set and scripts on GitHub.


CommonCOW: Massively Huge Web Corpora from CommonCrawl Data and a Method to Distribute them Freely under Restrictive EU Copyright Laws (2016)

Roland Schäfer (2016). CommonCOW: massively huge web corpora from CommonCrawl data and a method to distribute them freely under restrictive EU copyright laws. In Proceedings of LREC 2016. 4500–4504. [BibTeX]


texrex web page cleaning system

Moved to GitHub as of 1 May 2016 (from SourceForge rev. 622).

This is the workhorse web page cleaning system behind the COW corpora. It turns crawled HTML documents into clean XML corpus documents. It is released under a permissive 2-clause BSD license.

Processing and Querying Large Web Corpora with the COW14 Architecture (2015)

Roland Schäfer. Processing and Querying Large Web Corpora with the COW14 Architecture. In Proceedings of Challenges in the Management of Large Corpora (CMLC-3) (IDS publication server). 28–34. [BibTeX]


Bildhauer & Schäfer: Working with web corpora (Corpus Linguistics 2015 workshop)

Details and registration at the Corpus Linguistics 2015 workshop web site


Colibri² corpus portal

Because none of the available web interfaces to the IMS Open Corpus Workbench was right for hosting the COW web corpora, I started working on a bespoke interface called Colibri². It is really a spare-time project, and I do not release the code because I consider it trivialware.


Sehr große Webkorpora – Aufbau, Zusammensetzung und Anwendung (2014)

Felix Bildhauer & Roland Schäfer: Sehr große Webkorpora – Aufbau, Zusammensetzung und Anwendung (“Very large web corpora – construction, composition, and application”). Invited talk at Institut für Deutsche Sprache (IDS), Mannheim.

9th Web as Corpus Workshop (WAC-9)

Endorsed by ACL SIGWAC, co-located with EACL 2014, April 26, 2014 (Gothenburg, Sweden).

Organized by Felix Bildhauer and Roland Schäfer.

Visit the official WAC-9 homepage for details. Visit the WAC-9 proceedings page.

Web Data as a Challenge for Theoretical Linguistics and Corpus Design (DGfS 2014)

Date:		March 5–6, 2014
Hosting event:		36th Annual Conference of the German Linguistic Society 2014
Location:		Marburg University (Marburg/Lahn, Germany)
Organizers:		Felix Bildhauer (COW/German Grammar, FU Berlin/SFB632)
		Roland Schäfer (COW/German Grammar, FU Berlin)
Invited speaker:		Stefan Evert, FAU Erlangen


COW web corpus initiative

COW (COrpora from the Web) is a collection of linguistically processed gigatoken web corpora. We have corpora in major European languages (Dutch, English, French, German, Spanish, Swedish). Currently, the corpora range in size from 1 billion to 10 billion tokens. The third-generation COW2014 corpora are all larger than their predecessors, some containing 10 billion tokens or more. Beyond larger corpus sizes, we are also focusing on corpus quality in all areas (data collection as well as post-processing and linguistic annotation). To avoid legal problems with copyright claims, the published corpora are sentence shuffles.
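
As a rough illustration of what a sentence shuffle means, the sketch below pools the sentences of several documents and randomizes their order, so that no original document can be read back from the result. This is not the actual COW pipeline code: the function name `sentence_shuffle`, the fixed seed, and the assumption that documents are already split into sentences are all hypothetical choices made for the example.

```python
import random


def sentence_shuffle(documents, seed=0):
    """Pool the sentences of all documents and return them in random order.

    `documents` is assumed to be a list of documents, each already split
    into a list of sentence strings (sentence splitting is out of scope here).
    """
    # Flatten: document boundaries are deliberately discarded.
    sentences = [sentence for doc in documents for sentence in doc]
    # A seeded generator keeps the illustration reproducible.
    rng = random.Random(seed)
    rng.shuffle(sentences)
    return sentences


# Toy input: two "documents" with made-up sentences.
docs = [
    ["Doc A, sentence 1.", "Doc A, sentence 2."],
    ["Doc B, sentence 1.", "Doc B, sentence 2.", "Doc B, sentence 3."],
]
shuffled = sentence_shuffle(docs)
```

Every sentence is still available for sentence-level linguistic queries, but the document context around it is gone, which is the trade-off such a distribution format accepts.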

Please go to the web page of COW (Corpora from the Web).