Web Data as a Challenge for Theoretical Linguistics and Corpus Design (DGfS 2014)

Date: March 5–6, 2014
Hosting event: 36th Annual Conference of the German Linguistic Society 2014
Location: Marburg University (Marburg/Lahn, Germany)
Organizers: Felix Bildhauer (COW/German Grammar, FU Berlin/SFB632)
Roland Schäfer (COW/German Grammar, FU Berlin)
Invited speaker: Stefan Evert, FAU Erlangen

Program

Wednesday, March 5, 2014

14:00 Felix Bildhauer & Roland Schäfer (Freie Universität Berlin)
Web data as a challenge for theoretical linguistics and corpus design (Introduction)
14:30 Sonja Müller (Universität Bielefeld)
How webdata can challenge traditional generalizations: a case study of the order of modal particles in German
15:00 Susanne Flach (Freie Universität Berlin)
Solving the rare phenomenon problem? ‘Quasi-serial’ verb constructions in English
15:30 CANCELLED Dirk Goldhahn & Uwe Quasthoff (Universität Leipzig)
Using corpus-based statistics for linguistic typology
16:00 Coffee break
16:30 Adrien Barbaresi (ENS Lyon)
For a few points more: improving decision processes in web corpus construction
17:00 Lea Helmers (Freie Universität Berlin)
Named entity recognition on German web corpora
17:30 Vladimír Benko (Slovak Academy of Sciences)
Near-duplicate data in web corpora
18:00 Jack Grieve (Aston University), Costanza Asnaghi (Università Cattolica del Sacro Cuore) & Tom Ruette (Humboldt-Universität zu Berlin)
Googleology is good science

Thursday, March 6, 2014

9:00 Stefan Evert (invited)
An NLP approach to the evaluation of web corpora
10:00 Ines Rehbein (Universität Potsdam)
Using Twitter for linguistic purposes – three case studies
10:30 Kazuya Abe (Atomi University)
Twitter corpus and collection of German phrases
11:00 Coffee break
11:30 Tom Ruette (Humboldt-Universität zu Berlin) & Jack Grieve (Aston University)
Cognitive sociolinguistics with Twitter: why do the Dutch swear with diseases?
12:00 CANCELLED Peter Grube (Martin-Luther-Universität Halle-Wittenberg)
A diachronic corpus of personal weblogs: possibilities and current constraints

Program Committee

  • Chris Biemann
  • Stefan Evert
  • Matthias Hüning
  • Anke Lüdeling
  • Alexander Mehler
  • Uwe Quasthoff
  • Amir Zeldes
  • Torsten Zesch
  • Arne Zeschel

Workshop Description (and Call for Papers)

The huge amounts of linguistic data on the web offer exciting new possibilities for empirically based theoretical linguistics. Web-derived linguistic resources can contain more variation, as well as more non-standard grammar and writing, than traditionally compiled corpora. In addition, entirely new registers and genres have been described as emerging on the web. Like spoken language, although clearly distinct from it, the language found on the web can thus challenge linguistic theories that are based mainly on standard written language, as well as the categories assumed within these theories. At the same time, such non-standard features make the data harder to process for computational linguists, and additional care is required when deciding whether to label material as “noise”, because some linguists might consider it valuable data.

This workshop aims to bring together researchers working in Theoretical Linguistics and Corpus Linguistics with those who create resources from web data. The primary question of the workshop is: Which new linguistic insights can we derive from web data? Secondarily, we ask how web data is (and how it should be) processed to produce easily accessible high-quality resources and thus facilitate this kind of innovative linguistic research.

Possible subjects for talks include (but are by no means restricted to):

  • theoretically motivated empirical studies of linguistic phenomena in web data,
  • work on problems with established linguistic categories specific to certain types of web data (problems with traditional part-of-speech classification, syntactic categories, register and genre classification, etc.),
  • problems of working with web corpora from the user’s perspective in concrete studies (low quality of tokenization, POS tagging, named entity recognition, etc.; availability and lack of metadata),
  • assessments and improvements of the quality of available and newly designed tools and models to process or classify web data,
  • approaches to normalization of web data and evaluations of the acceptability of such normalizations from a linguistic perspective,
  • sampling of web data (e.g., stratified vs. randomly compiled corpora, linguistic web characterization).

We invite submissions for 30-minute talks (20 minutes plus 10 minutes of discussion) about completed or ongoing original research that uses web data or that concerns the creation and/or evaluation of web data resources. The scope of the workshop is restricted neither to resources of a specific size or nature nor to any specific language(s). Submitted abstracts will be reviewed anonymously by at least two reviewers. We hope to offer authors of accepted talks the opportunity to publish an extended version of their talk in a special issue of a peer-reviewed corpus linguistics journal.