Third Workshop on NLP for Similar Languages, Varieties and Dialects
December 12th, 2016 - Osaka, Japan
September 29th, 2016
October 21st, 2016
October 31st, 2016
DSL Shared Task
In the DSL shared task, participants are asked to train systems to discriminate between similar languages, language varieties, and dialects.
This year we divided the DSL Shared Task into two sub-tasks.
For sub-task 1, we released a new version of the DSL corpus collection (DSLCC). The corpus contains 20,000 instances per country (18,000 training + 2,000 development). Each instance is an excerpt extracted from journalistic texts and labeled with the country of origin of the text.
The languages and varieties included in this year's edition grouped by similarity are:
For sub-task 1, two test sets (A and B) were released. Each of them contains 1,000 unidentified instances of each language, to be classified according to the country of origin.
This year, for the first time, the DSL shared task included a sub-task on Arabic dialects.
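As a rough illustration of what a system for sub-task 1 might look like, the following is a minimal sketch of a character n-gram Naive Bayes classifier. It is not a baseline provided by the organizers; the class name `NgramNaiveBayes`, the helper `char_ngrams`, the labels `pt-BR` and `pt-PT`, and the toy sentences are all assumptions made for this example and are not taken from the DSLCC.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=3):
    """Return the overlapping character n-grams of a padded, lowercased text."""
    padded = " " + text.lower() + " "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NgramNaiveBayes:
    """Multinomial Naive Bayes over character n-grams with add-one smoothing."""

    def __init__(self, n=3):
        self.n = n
        self.counts = defaultdict(Counter)  # label -> n-gram counts
        self.docs = Counter()               # label -> number of training instances
        self.vocab = set()                  # all n-grams seen in training

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            grams = char_ngrams(text, self.n)
            self.counts[label].update(grams)
            self.docs[label] += 1
            self.vocab.update(grams)
        return self

    def predict(self, text):
        grams = char_ngrams(text, self.n)
        total_docs = sum(self.docs.values())
        best_label, best_score = None, float("-inf")
        for label in self.docs:
            total = sum(self.counts[label].values())
            # Log prior plus smoothed log likelihood of each n-gram.
            score = math.log(self.docs[label] / total_docs)
            for gram in grams:
                score += math.log(
                    (self.counts[label][gram] + 1) / (total + len(self.vocab))
                )
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented toy sentences with hypothetical labels (not DSLCC data).
train_texts = [
    "o ônibus chegou atrasado ao ponto",
    "o autocarro chegou atrasado à paragem",
    "peguei o trem para o trabalho",
    "apanhei o comboio para o trabalho",
]
train_labels = ["pt-BR", "pt-PT", "pt-BR", "pt-PT"]

model = NgramNaiveBayes(n=3).fit(train_texts, train_labels)
prediction = model.predict("o autocarro para a paragem")
```

With real data, the same `fit` and `predict` calls would be applied to the 18,000 training instances per country and to the unlabeled test instances, and the n-gram order would be tuned on the development portion.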
As dialects are mostly used in conversational speech, in sub-task 2 we provided a dataset containing ASR transcripts.
We released training and testing data for the following Arabic dialects: Egyptian, Gulf, Levantine, and North-African, as well as for Modern Standard Arabic (MSA).
We considered two types of submission:
A total of six submissions (three closed and three open) were allowed for each training set (A, B, C).
After the shared task, participants were invited to submit a paper to the VarDial workshop describing their findings (8 pages + 2 for references). Submissions should be formatted according to the COLING template.
We are happy to announce the results of the 2016 DSL shared task, and we thank all teams for their participation. The results can be downloaded here.