Moved to GitHub as of 1 May 2016 (from SourceForge rev. 622).
This is the workhorse web page cleaning system behind the COW. It turns crawled HTML documents into clean XML corpus documents. It is released under the permissive 2-clause BSD license.