Linguistic web characterization and web corpus creation (DFG)

Work on this project at Freie Universität Berlin, German Grammar Group / German and Dutch Philology, is supported by the German Research Foundation (Deutsche Forschungsgemeinschaft, DFG) grant SCHA1916/1-1.

Publications as of January 2019:

Schäfer & Bildhauer (in prep.) on web characterisation and corpus comparison
Bildhauer & Schäfer (in prep.) on COReX and its usability in corpus studies
Schäfer (2018) on corpora and cognitive representativity
Schäfer & Pankratz (2018) on corpora and cognitive representativity
Bildhauer & Schäfer (2017) on topic annotation
Schäfer & Bildhauer (2016) on topic annotation
Schäfer (2016b) on the ClaraX crawler
Schäfer (2016a) on boilerplate detection

Software/data releases as of January 2019:

Principal investigator: Roland Schäfer

Funding amount: 286,100€

Runtime: January 2015 – June 2018 (interrupted April–September 2016)

Student assistants:

Kim Maser, Humboldt-Universität Berlin (2015–2017)
Luise Rißmann, Freie Universität Berlin (2015–2018)

Officially collaborating institutions:

Large corpora constructed from web data usually contain several billion tokens and are uniquely suited for many kinds of linguistic research. Because they are so large, they allow researchers to work on very rare phenomena, and they contain a great amount of linguistic variation. However, their immense size means that they have to be collected by an unsupervised search procedure (“crawling”) on the web. As a result, the documents in web corpora usually lack the kind of metadata expected by most corpus linguists. Even the overall composition of large web corpora (in terms of text types or registers) is unknown. Furthermore, the crawling methods used so far produce provably biased, i.e., distorted, samples. Finally, web corpora always undergo fully automated cleaning and normalization procedures (such as boilerplate removal and duplicate removal), and corpus users usually know next to nothing about the precision and the effects of those procedures.

This project addresses this situation through fundamental methodological research on the German web. First, in addition to conventional biased crawling procedures, unbiased crawling algorithms are used to generate random samples which are truly representative of the population of web documents. Second, available methods for the classification of text type, register, subject/topic domain, etc. are compiled and applied to the samples in order to generate the necessary metadata. The primary goal of the project is not the creation of resources but the development of a classification scheme that can be applied to large collections of web texts with satisfactory precision. In order to achieve the required precision, classical methods (e.g., Biber’s Multidimensional Analysis) are combined with more recent methods of document clustering and document classification from Information Retrieval (e.g., Latent Semantic Analysis and Topic Modeling).
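To illustrate why conventional crawls are biased and what an unbiased alternative can look like, here is a minimal Python sketch. It is an illustration only and not the project's actual crawling method: the toy link graph, the function names (simple_walk, mh_walk, shares) and the choice of a Metropolis-Hastings random walk are assumptions made for this example, and links are treated as undirected, which real web links are not.

```python
import random
from collections import Counter

# Hypothetical toy link graph (undirected); "hub" is linked far more often
# than the other pages. All names here are made up for illustration.
GRAPH = {
    "hub": ["a", "b", "c", "d"],
    "a": ["hub", "b"],
    "b": ["hub", "a"],
    "c": ["hub"],
    "d": ["hub"],
}


def simple_walk(graph, start, steps, rng):
    """Plain random walk: in the long run, pages are visited in
    proportion to their number of links (degree), i.e. with a bias."""
    node, visits = start, Counter()
    for _ in range(steps):
        node = rng.choice(graph[node])
        visits[node] += 1
    return visits


def mh_walk(graph, start, steps, rng):
    """Metropolis-Hastings random walk that targets a uniform
    distribution over pages (assumed technique, see lead-in)."""
    node, visits = start, Counter()
    for _ in range(steps):
        candidate = rng.choice(graph[node])
        # Accept the move with probability min(1, deg(node) / deg(candidate));
        # otherwise stay put and count the current page again.
        if rng.random() < len(graph[node]) / len(graph[candidate]):
            node = candidate
        visits[node] += 1
    return visits


def shares(visits):
    """Convert visit counts into relative frequencies per page."""
    total = sum(visits.values())
    return {page: count / total for page, count in sorted(visits.items())}


rng = random.Random(0)  # fixed seed so that runs are repeatable
biased = shares(simple_walk(GRAPH, "a", 100_000, rng))
unbiased = shares(mh_walk(GRAPH, "a", 100_000, rng))

print("plain walk:", {p: round(s, 2) for p, s in biased.items()})
print("MH walk:   ", {p: round(s, 2) for p, s in unbiased.items()})
```

On this toy graph, the plain walk should spend roughly 40% of its steps on the hub (its share of all link endpoints), whereas the corrected walk should visit all five pages about equally often. The price of the correction is that many proposed moves are rejected, so more steps are needed per distinct page.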

Once the metadata are available, the composition of large crawled web corpora can be specified for the very first time, in particular as a function of the chosen crawling method. Furthermore, it will be known how the usual cleaning and normalization procedures affect the composition of web corpora. Finally and most fundamentally, the availability of unbiased, representative web document samples annotated with rich linguistic metadata allows for a deep linguistic characterization of the German web. For example, it will be known what the register composition of the German web is for documents of a certain length. Such knowledge puts corpus linguists in a position to make informed decisions about the suitability of web data and web corpora for their research questions.