Building large corpora from the web, Foundational course at the European Summer School in Logic, Language and Information 2012, Opole
The World Wide Web is most likely the largest existing source of texts written in a great variety of languages. A feasible and sound way of exploiting this data for linguistic research is to compile a static corpus for a given language. This has several advantages: (i) it obviates the problems encountered when using internet search engines in quantitative linguistic research, such as non-transparent ranking algorithms; (ii) creating a corpus from web data is free; (iii) the size of corpora compiled from the web may exceed by several orders of magnitude the size of (usually expensive) language resources offered elsewhere; (iv) the data is locally available to users, who can post-process and query it linguistically with the tools of their choice. We will address a number of theoretical and practical issues in the steps of creating a web corpus of up to giga-token size, namely:
- collection of raw data: using search engine results, different ways of crawling the web, and combinations of both (a minimal crawler sketch follows this list)
- post-processing:
  - stripping of markup and code, and conversion between character encodings (sketched below)
  - diverse approaches to boilerplate recognition and removal, including their evaluation (sketched below)
  - document filtering: removing documents not written in the target language and texts not suitable for the intended kind of corpus (sketched below)
  - recognition and removal of (near-)duplicate documents (sketched below)
- linguistic issues: limitations of semi-automatically compiled corpora with respect to specific areas of linguistic research
- evaluation: comparison of web corpora (e.g., in terms of lexical coverage and distribution) to other corpora of the same language (e.g., balanced corpora); a coverage sketch follows this list
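To make the data-collection step concrete, here is a minimal breadth-first crawler sketch in Python. It is a sketch under simplifying assumptions, not a reference implementation from the course: the seed URLs, page limit, and politeness delay are placeholders, and a real crawler would also honour robots.txt and restrict itself to promising hosts. In the search-engine-based (BootCaT-style) approach, the seeds would come from querying a search engine for tuples of mid-frequency words.

```python
import time
import urllib.parse
import urllib.request
from collections import deque
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect the href targets of <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seeds, max_pages=100, delay=1.0):
    """Breadth-first crawl from seed URLs; returns {url: raw page bytes}."""
    queue, seen, pages = deque(seeds), set(seeds), {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            with urllib.request.urlopen(url, timeout=10) as response:
                raw = response.read()
        except Exception:
            continue  # skip unreachable or malformed pages
        pages[url] = raw  # keep raw bytes; proper decoding comes later
        parser = LinkExtractor()
        parser.feed(raw.decode("utf-8", errors="replace"))  # crude decode for link extraction only
        for link in parser.links:
            absolute = urllib.parse.urljoin(url, link)
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
        time.sleep(delay)  # politeness delay between requests
    return pages
```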
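The markup-stripping and encoding-conversion step might look as follows. This is a deliberately naive sketch: the regular expressions stand in for a real HTML parser, and the fallback order of charsets is an assumption; production pipelines typically rely on libraries such as lxml for parsing and chardet for encoding detection.

```python
import re

def decode_page(raw, declared=None):
    """Decode raw bytes, trying the declared and <meta> charsets, then fallbacks."""
    candidates = [declared] if declared else []
    match = re.search(rb'charset=["\']?([A-Za-z0-9_-]+)', raw[:2048])
    if match:
        candidates.append(match.group(1).decode("ascii"))
    candidates += ["utf-8", "latin-1"]  # latin-1 never fails, so this terminates
    for encoding in candidates:
        try:
            return raw.decode(encoding)
        except (LookupError, UnicodeDecodeError):
            continue

def strip_markup(html):
    """Remove scripts, styles, and tags; collapse whitespace."""
    html = re.sub(r"(?is)<(script|style)\b.*?</\1>", " ", html)  # drop embedded code
    html = re.sub(r"(?s)<[^>]+>", " ", html)                     # drop remaining tags
    return re.sub(r"\s+", " ", html).strip()
```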
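A toy illustration of density-based boilerplate removal in the spirit of tools such as jusText or Boilerpipe: blocks (e.g., the contents of <p> or <div> elements) are kept only if they contain enough words and few of those words sit inside links. Navigation menus and footers tend to be short and link-dense, while body text is long and link-poor. Both thresholds below are illustrative, not empirically tuned.

```python
import re

TAG = re.compile(r"(?s)<[^>]+>")
ANCHOR = re.compile(r"(?is)<a\b[^>]*>(.*?)</a>")

def link_density(block_html):
    """Share of a block's words that sit inside <a> elements."""
    words = TAG.sub(" ", block_html).split()
    link_words = TAG.sub(" ", " ".join(ANCHOR.findall(block_html))).split()
    return len(link_words) / len(words) if words else 1.0

def keep_block(block_html, min_words=10, max_link_density=0.3):
    """Density heuristic: long, link-poor blocks are likely body text."""
    words = TAG.sub(" ", block_html).split()
    return len(words) >= min_words and link_density(block_html) <= max_link_density
```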
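For language filtering, a simple stopword-ratio heuristic already goes a long way: documents in which too few tokens come from a list of very frequent function words of the target language are discarded. The German word list and the threshold below are illustrative assumptions; character n-gram classifiers are the more robust standard technique.

```python
# Frequent German function words; a realistic list would be much longer.
GERMAN_STOPWORDS = {
    "der", "die", "das", "und", "ist", "nicht", "ein", "eine",
    "zu", "in", "den", "von", "mit", "auf", "sich", "es",
}

def looks_like_target_language(text, stopwords=GERMAN_STOPWORDS, threshold=0.08):
    """Accept a document if enough of its tokens are target-language stopwords."""
    tokens = text.lower().split()
    if not tokens:
        return False
    hits = sum(1 for token in tokens if token in stopwords)
    return hits / len(tokens) >= threshold
```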
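Near-duplicate detection is commonly based on comparing documents as sets of word n-grams (shingles). The sketch below computes exact Jaccard similarities over all pairs, which is only feasible for small collections; the similarity threshold is an illustrative choice.

```python
def shingles(text, n=5):
    """Set of word n-grams ('shingles') of a document."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def jaccard(a, b):
    """Jaccard similarity of two shingle sets."""
    if not a and not b:
        return 1.0  # two empty documents count as duplicates
    return len(a & b) / len(a | b)

def near_duplicate_pairs(docs, threshold=0.8):
    """All-pairs comparison; fine for small collections only."""
    sets = [shingles(doc) for doc in docs]
    return [(i, j)
            for i in range(len(sets))
            for j in range(i + 1, len(sets))
            if jaccard(sets[i], sets[j]) >= threshold]
```

At web scale, the all-pairs comparison is replaced by fingerprinting schemes such as MinHash or SimHash, which approximate the same similarity without comparing every pair of documents.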
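Finally, one simple instance of the evaluation step: measuring what share of the most frequent word types of a reference (e.g., balanced) corpus also occurs in the web corpus. The cut-off of 10,000 types is an arbitrary illustrative choice, and lexical coverage is only one of the comparison measures mentioned above.

```python
from collections import Counter

def lexical_coverage(web_tokens, reference_tokens, top_n=10000):
    """Share of the reference corpus' top word types found in the web corpus."""
    reference_top = {word for word, _ in
                     Counter(reference_tokens).most_common(top_n)}
    if not reference_top:
        return 0.0
    web_vocabulary = set(web_tokens)
    return len(reference_top & web_vocabulary) / len(reference_top)
```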