texrex web page cleaning system

Moved to GitHub as of 1 May 2016 (from SourceForge rev. 622).

This is the workhorse web page cleaning system behind the COW. It turns crawled HTML documents into clean XML corpus documents. It is released under a permissive 2-clause BSD license.

The current version is called texrex-behindthecow (following the established texrex naming scheme). The new features are:

  • multi-lingual input for CommonCrawl data (TrTextAssessment extension)
  • WARC reader (and cleaner reader pool implementation)
  • improvements in cleanup of broken encodings
  • Unicode normalization to NFC
  • macOS support
  • bug fixes/code cleanups
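
To illustrate what normalization to NFC means in practice, here is a minimal Python sketch. It is not texrex code (texrex is written in FreePascal and uses ICU); it uses Python's standard unicodedata module, and the helper name to_nfc is made up for this example.

```python
import unicodedata


def to_nfc(text):
    """Return the text in Unicode Normalization Form C (canonical composition)."""
    return unicodedata.normalize("NFC", text)


# "e" followed by U+0301 COMBINING ACUTE ACCENT: two codepoints for one character.
decomposed = "Cafe\u0301"

# After NFC normalization the pair is composed into the single codepoint U+00E9.
composed = to_nfc(decomposed)
```

Both strings look identical when rendered, but only the normalized form compares equal to text that was typed with the precomposed character, which matters for later steps such as duplicate detection and frequency counts.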

texrex is free software for processing data files from crawls and turning them into a corpus of web documents. Currently, it is limited to reading ARC files, but other input modules can be developed quickly. It performs the following processing steps:

  • read ARC files document by document
  • filter perfect duplicates using a Bloom filter
  • strip HTML, scripts, stylesheets
  • extract meta information from crawl headers
  • normalize encodings to UTF-8 (using ICU), optionally treating all ISO-8859-1 input as Win-1252
  • convert all HTML entities to appropriate codepoints (including rogue Win-1252)
  • detect, remove, and/or annotate boilerplate blocks using a Multi-Layer Perceptron trained on 38 features (in our evaluations, this method makes well over 90% correct decisions and thus clearly outperforms the previous state of the art; to be published)
  • assess the text quality of the documents by looking at the frequencies of short, frequent words (requires language-specific models)
  • create w-shingling document fingerprints and filter near-duplicate documents
  • perform in-document deduplication (remove repeated paragraphs, insert a backreference to first copy)
  • perform additional normalization (e.g., reduce diverse Unicode dashes and hyphens to the basic codepoint)
  • write standard-compliant XML output
  • add server IP geolocation meta information (country, region, city – using GeoLite)
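
The two deduplication steps in the list above can be sketched roughly as follows. This is an illustrative Python sketch, not the texrex implementation: the class and function names, the Bloom filter size, the shingle width w=3, and the similarity threshold are all assumptions chosen for the example, and the near-duplicate check compares full shingle sets pairwise rather than using compact fingerprints.

```python
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter: may give false positives, never false negatives."""

    def __init__(self, num_bits=1 << 20, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, data):
        # Derive the bit positions from salted SHA-256 digests of the input.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(bytes([i]) + data).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, data):
        for pos in self._positions(data):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, data):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(data)
        )


def shingles(text, w=3):
    """Return the set of w-word shingles (contiguous word n-grams) of a text."""
    words = text.lower().split()
    if len(words) < w:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}


def resemblance(a, b, w=3):
    """Jaccard similarity of the shingle sets of two texts."""
    sa, sb = shingles(a, w), shingles(b, w)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def filter_documents(docs, threshold=0.8):
    """Drop perfect duplicates (Bloom filter), then near-duplicates (shingles)."""
    seen = BloomFilter()
    kept = []
    for doc in docs:
        raw = doc.encode("utf-8")
        if raw in seen:
            continue  # byte-identical to an earlier document (or a false positive)
        seen.add(raw)
        if any(resemblance(doc, other) >= threshold for other in kept):
            continue  # too similar to a document that was already kept
        kept.append(doc)
    return kept
```

The Bloom filter keeps memory use constant no matter how many documents pass through, at the cost of occasionally discarding a document that was not actually seen before. The pairwise comparison in filter_documents is quadratic in the number of kept documents, so it only serves to show the idea; a crawl-scale tool would compare compact fingerprints instead.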

Technologically, the main features of texrex are:

    1. written in FreePascal (Object FPC mode)
    2. licensed under LGPL (Pascal units) and GPL (Pascal programs), as well as the licenses used by ICU and FANN for the header translations of those libraries
    3. uses multi-threading for single-machine parallelization
    4. uses simple INI files to configure processing jobs for the main tool
    5. can be run in the background, using an included IPC client to control the process
    6. depends only on two additional libraries: ICU and FANN

New tools included since texrex-neuedimensionen (June 2014):

    1. HyDRA hard hyphenation remover
    2. rofl tool to fix run-together sentences