Tag Archives: Digital Humanities

Linguistic web characterization and web corpus creation (DFG)

Work on this project at Freie Universität Berlin (German Grammar Group, German and Dutch Philology) is supported by the German Research Foundation (Deutsche Forschungsgemeinschaft, DFG) under grant SCHA1916/1-1.

Publications as of January 2019:

Schäfer & Bildhauer (in prep.) on web characterisation and corpus comparison
Bildhauer & Schäfer (in prep.) on COReX and its usability in corpus studies
Schäfer (2018) on corpora and cognitive representativity
Schäfer & Pankratz (2018) on corpora and cognitive representativity
Bildhauer & Schäfer (2017) on topic annotation
Schäfer & Bildhauer (2016) on topic annotation
Schäfer (2016b) on the ClaraX crawler
Schäfer (2016a) on boilerplate detection

Software/data releases as of January 2019:

Principal investigator: Roland Schäfer

Funding amount: 286,100€

Runtime: January 2015 – June 2018 (interrupted April–September 2016)

Student assistants:

Kim Maser, Humboldt-Universität Berlin (2015–2017)
Luise Rißmann, Freie Universität Berlin (2015–2018)

Officially collaborating institutions:

Continue reading →

Induktive Topikmodellierung und extrinsische Topikdomänen (IDS Jahrestagung 2016)

Felix Bildhauer & Roland Schäfer. Induktive Topikmodellierung und extrinsische Topikdomänen. Short talk and poster. Annual conference of the Institut für Deutsche Sprache (IDS), Mannheim. 9 March 2016.

Processing and Querying Large Web Corpora with the COW14 Architecture (2015)

Roland Schäfer. Processing and Querying Large Web Corpora with the COW14 Architecture. In Proceedings of Challenges in the Management of Large Corpora (CMLC-3) (IDS publication server). 28–34. [BibTeX]


Bildhauer & Schäfer: Working with web corpora (Corpus Linguistics 2015 workshop)

Details and registration at the Corpus Linguistics 2015 workshop web site


Colibri² corpus portal

Because none of the available web interfaces to the IMS Open Corpus Workbench was right for hosting the COW web corpora, I started working on a bespoke interface called Colibri². It is really a spare-time project, and I do not release the code because I consider it trivialware.


COW web corpus initiative

COW (COrpora from the Web) is a collection of linguistically processed gigatoken web corpora in major European languages (Dutch, English, French, German, Spanish, Swedish). The corpora currently range from 1 billion to 10 billion tokens. The third-generation corpora (COW2014) are all larger than their predecessors, and some contain 10 billion tokens or more. Beyond corpus size, we focus on corpus quality in all areas: data collection, post-processing, and linguistic annotation. To avoid legal problems with copyright claims, the published corpora are sentence shuffles.

Please go to the web page of COW (Corpora from the Web).
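
For readers unfamiliar with the term, a sentence shuffle pools the sentences of all documents and releases them in randomized order, so that the original documents cannot be read in sequence. The following is only a minimal sketch of that idea, not the COW pipeline itself: the function name `sentence_shuffle`, the list-of-lists input format, and the toy data are all assumptions made for illustration.

```python
import random


def sentence_shuffle(documents, seed=0):
    """Pool the sentences of all documents and return them in random order.

    `documents` is assumed (hypothetically) to be a list of documents,
    each given as a list of already-segmented sentence strings.
    """
    # Flatten the input: document boundaries are deliberately discarded.
    sentences = [sentence for doc in documents for sentence in doc]
    # A fixed seed is used here only to make the sketch reproducible.
    rng = random.Random(seed)
    rng.shuffle(sentences)
    return sentences


# Toy data standing in for crawled and cleaned documents.
docs = [
    ["Sentence A1.", "Sentence A2."],
    ["Sentence B1.", "Sentence B2.", "Sentence B3."],
]
shuffled = sentence_shuffle(docs)
```

The output contains exactly the same sentences as the input, only without document boundaries or original order. Queries that stay within a sentence still work, while anything that depends on context across sentences is lost.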

The Good, the Bad, and the Hazy: Design Decisions in Web Corpus Construction (2013)

Roland Schäfer, Adrien Barbaresi & Felix Bildhauer (2013) The Good, the Bad, and the Hazy: Design Decisions in Web Corpus Construction. In Proceedings of the 8th Web as Corpus Workshop (WAC-8). [BibTeX]


Web Corpus Construction (2013)

Roland Schäfer & Felix Bildhauer (2013) Web Corpus Construction. Morgan and Claypool. [BibTeX]

Websites: Morgan & Claypool (official), Companion web site (additional information, errata, etc.)

Reviews: Serge Sharoff in Computational Linguistics 41(1) (2015), Mats Wirén in Nordic Journal of Linguistics 37(3) (2014)


Scalable Construction of High-quality Web Corpora (2013)

Chris Biemann, Felix Bildhauer, Stefan Evert, Dirk Goldhahn, Uwe Quasthoff, Roland Schäfer, Johannes Simon, Leonard Swiezinski & Torsten Zesch (2013) Scalable Construction of High-quality Web Corpora. In Journal for Language Technology and Computational Linguistics 18. 23–60. [BibTeX]
