Linguistic web characterization and web corpus creation (DFG)

Work on this project at Freie Universität Berlin, German Grammar Group, German and Dutch Philology, is supported by the German Research Foundation (Deutsche Forschungsgemeinschaft, DFG) under grant SCHA1916/1-1.

Project leader: Roland Schäfer

Funding amount: 286,100€

Runtime: January 2015 – June 2018 (interrupted April–September 2016)

Student assistants:

  • Kim Maser, Humboldt-Universität Berlin (2015–2017)
  • Luise Rißmann, Freie Universität Berlin (2015–)

Officially collaborating institutions:

Large corpora constructed from web data usually contain several billion tokens and are uniquely suited for many kinds of linguistic research. Because they are so large, they allow researchers to work on very rare phenomena, and they contain a great amount of linguistic variation. However, the immense size of the corpora necessitates their collection by an unsupervised search procedure (“crawling”) on the web. This usually means that the documents within web corpora lack the kind of metadata expected by most corpus linguists. Even the overall composition of large web corpora (in terms of text types or registers) is unknown. Furthermore, the crawling methods used so far produce provably biased, i.e., distorted, samples. Finally, web corpora always undergo fully automated cleaning and normalization procedures (such as removal of boilerplate text elements and duplicate removal), and corpus users usually know next to nothing about the precision and the effect of those procedures.

This project remedies this situation by performing fundamental methodological research on the German web. First, in addition to conventional biased crawling procedures, unbiased crawling algorithms are used to generate random samples which are truly representative of the population of web documents. Second, available methods for the classification of text type, register, subject/topic domain, etc. are compiled and applied to the samples in order to generate the necessary metadata. The primary goal of this project is not the creation of resources but the development of a classification scheme that can be applied to large collections of web texts with satisfactory precision. In order to achieve the required precision, classical methods (e.g., Biber’s Multidimensional Analysis) are combined with more recent methods of document clustering and document classification from Information Retrieval (e.g., Latent Semantic Analysis and Topic Modeling).
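As an illustration of what an unbiased crawling procedure can look like, here is a minimal sketch of a Metropolis-Hastings random walk, a standard technique for sampling the nodes of a graph uniformly. The project description does not name the algorithms actually used, so the choice of technique, the function name metropolis_hastings_walk, and the toy link graph are assumptions made for illustration only. A plain random walk visits pages in proportion to their number of links and thus favors heavily linked pages; the acceptance step in the sketch cancels that preference.

```python
import random
from collections import Counter


def metropolis_hastings_walk(graph, start, steps, rng):
    """Random walk whose stationary distribution is uniform over nodes.

    `graph` maps each node to a list of neighbours. It is assumed to be
    undirected and connected (a simplification: real web links are directed).
    All names here are hypothetical and chosen for illustration.
    """
    current = start
    visits = Counter()
    for _ in range(steps):
        candidate = rng.choice(graph[current])
        # Accept with probability min(1, deg(current) / deg(candidate)).
        # This cancels the plain walk's preference for high-degree nodes.
        accept = min(1.0, len(graph[current]) / len(graph[candidate]))
        if rng.random() < accept:
            current = candidate
        # A rejected move counts as another visit to the current node.
        visits[current] += 1
    return visits


# Toy link graph: one hub page "a" connected to three leaf pages.
toy_graph = {
    "a": ["b", "c", "d"],
    "b": ["a"],
    "c": ["a"],
    "d": ["a"],
}

steps = 40_000
visits = metropolis_hastings_walk(toy_graph, "a", steps, random.Random(0))
shares = {node: count / steps for node, count in sorted(visits.items())}
print(shares)
```

On this toy graph each of the four pages should receive roughly a quarter of the visits, whereas a plain random walk would spend about half of its time on the hub. Real web graphs are directed and only partially known, so practical crawlers need further corrections that this sketch omits.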

Once the metadata are available, the composition of large crawled web corpora can be specified for the very first time, most importantly as a function of the chosen crawling method. Furthermore, it will be known how the usual cleaning and normalization procedures affect the composition of web corpora. Finally and most fundamentally, the availability of unbiased representative web document samples annotated with rich linguistic metadata allows for a deep linguistic characterization of the German web. For example, it will be known what the register composition of the German web is for documents of a given length. Such knowledge puts corpus linguists in a position to make informed decisions about the suitability of web data and web corpora for their research questions.