Sparv: Språkbanken’s corpus annotation pipeline infrastructure (SLTC 2016)

Lars Borin, Markus Forsberg, Martin Hammarstedt, Dan Rosén, Roland Schäfer & Anne Schumacher (2016): Sparv: Språkbanken’s corpus annotation pipeline infrastructure. Swedish Language Technology Conference (SLTC), Umeå.

Sparv is Språkbanken’s corpus annotation pipeline infrastructure. The easiest way to use the pipeline is through its web interface, which accepts a plain text document. The pipeline applies in-house and external tools to the text to segment it into sentences and paragraphs, tokenise it, tag parts of speech, look up words in dictionaries and analyse compounds. The pipeline can also be run through a web API that returns its results as XML, and it is run locally at Språkbanken to prepare the documents in Korp, our corpus search tool. Although support is most sophisticated for modern Swedish, the pipeline supports 15 languages.
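
To make the first stages of such a pipeline concrete, the following is a minimal sketch in Python of paragraph segmentation, sentence segmentation and tokenisation, with the result wrapped in XML. It is not Sparv's implementation and does not call Sparv's web API: the regular expressions are naive stand-ins for the real tools, and the function names and the element names (text, paragraph, sentence, w) are assumptions chosen for illustration only. Part-of-speech tagging, dictionary lookup and compound analysis are left out.

```python
import re
from xml.etree import ElementTree as ET


def split_paragraphs(text):
    """Split on blank lines (naive stand-in for a real paragraph segmenter)."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def split_sentences(paragraph):
    """Split after '.', '!' or '?' followed by whitespace (naive heuristic)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", paragraph) if s]


def tokenise(sentence):
    """Return runs of word characters and single punctuation marks as tokens."""
    return re.findall(r"\w+|[^\w\s]", sentence)


def annotate(text):
    """Build an XML tree: text > paragraph > sentence > w (hypothetical names)."""
    root = ET.Element("text")
    for para in split_paragraphs(text):
        p_elem = ET.SubElement(root, "paragraph")
        for sent in split_sentences(para):
            s_elem = ET.SubElement(p_elem, "sentence")
            for token in tokenise(sent):
                w_elem = ET.SubElement(s_elem, "w")
                w_elem.text = token
    return root


if __name__ == "__main__":
    tree = annotate("Hej världen. Hur mår du?\n\nBra!")
    print(ET.tostring(tree, encoding="unicode"))
```

A real pipeline would attach further annotations, such as a part-of-speech tag, as attributes on each token element. The nested layout here only illustrates how segmentation output can be represented as XML.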