Linguistic research with large annotated web corpora

Linguistic research with large annotated web corpora (2013). Pre-conference tutorial, The 20th International Conference on Head-Driven Phrase Structure Grammar, Berlin, August 26, 2013, 9:30 – 16:00

COW Tutorial: Scripts
COW Tutorial: Slides
COW Tutorial: Worksheet

The World Wide Web is most likely the largest existing source of texts written in a great variety of languages. A feasible and sound way of exploiting this data for linguistic research is to compile a static corpus for a given language. For example, we have created linguistically annotated giga-token web corpora for various languages (Dutch 2.5 GT, English 3.9 GT, French 4.3 GT, German 9.1 GT, Spanish 1.6 GT, Swedish 2.3 GT) and are still in the process of creating new corpora (Danish, Japanese, Portuguese, etc.), as well as improving the old ones.

However, anyone who needs to do serious work with web corpora should be aware of the characteristics (and limitations) of such corpora, which depend to a considerable extent on a number of decisions taken in their construction. The first aim of this tutorial is to illustrate the various steps that lead from data collection on the web to the final, linguistically annotated corpus, highlighting the stages where crucial decisions have to be made and how these may be reflected in the corpus.
The second part of this tutorial is a hands-on introduction to the use of the Open Corpus Workbench (a piece of software well suited to storing and querying very large corpora), with special attention to its integration with the R statistics environment. We use our own web corpora for the demonstration.
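
To give a rough idea of the kind of data involved, here is a minimal sketch in Python, assuming the one-token-per-line, tab-separated "vertical" text format that the Open Corpus Workbench takes as input. The sample sentences, the column order (word, part of speech, lemma) and the function name lemma_frequencies are made up for illustration and are not taken from the tutorial materials or from our corpora.

```python
from collections import Counter

# Hypothetical sample in vertical format: one token per line,
# columns word<TAB>pos<TAB>lemma, with XML-like tags marking sentences.
SAMPLE_VRT = """<s>
The\tDT\tthe
corpora\tNNS\tcorpus
are\tVBP\tbe
large\tJJ\tlarge
</s>
<s>
The\tDT\tthe
corpus\tNN\tcorpus
grows\tVBZ\tgrow
</s>
"""


def lemma_frequencies(vrt_text, lemma_column=2):
    """Count lemmas in a vertical-format text.

    Assumption: the lemma sits in column `lemma_column` (0-based).
    Lines starting with "<" are treated as structural tags and skipped,
    which would also skip a token that is literally "<".
    """
    counts = Counter()
    for line in vrt_text.splitlines():
        if not line or line.startswith("<"):
            continue
        fields = line.split("\t")
        if len(fields) > lemma_column:
            counts[fields[lemma_column]] += 1
    return counts


freqs = lemma_frequencies(SAMPLE_VRT)
for lemma, count in freqs.most_common():
    print(lemma, count)
```

A script like this only works for toy samples held in memory. For gigatoken corpora, an indexed query engine such as the Open Corpus Workbench is the appropriate tool, and the resulting frequency tables can then be analysed in R.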

Pictures taken by Stefan Müller (external link)