Parallel computing | Roland Schäfer

Moved to GitHub as of 1 May 2016 (from SourceForge rev. 622).

This is the workhorse web page cleaning system behind the COW. It turns crawled HTML documents into clean XML corpus documents. It is released under a permissive 2-clause BSD license.

Roland Schäfer

Professor of Linguistics | German Grammar

Tag Archives: Parallel computing

CommonCOW: Massively Huge Web Corpora from CommonCrawl Data and a Method to Distribute them Freely under Restrictive EU Copyright Laws (2016)

texrex web page cleaning system

Processing and Querying Large Web Corpora with the COW14 Architecture (2015)

Scalable Construction of High-quality Web Corpora (2013)