Third Workshop on NLP for Similar Languages, Varieties and Dialects
December 12th, 2016 - Osaka, Japan
Submission deadline
September 29th, 2016
Acceptance notification
October 21st, 2016
Camera-ready deadline
October 31st, 2016
VarDial
vardialworkshop@gmail.com
DSL Shared Task
dsl.sharedtask@gmail.com
Following the success of the first two editions (held in 2014 and 2015), in VarDial 2016 we organized the third edition of the DSL shared task featuring two sub-tasks.
In the DSL shared task, participants are asked to train systems that discriminate between similar languages, language varieties, and dialects.
The tables containing the results are available here.
The shared task report can be found here, and its BibTeX entry is as follows:
@InProceedings{malmasi-EtAl:2016:VarDial3,
author = {Malmasi, Shervin and Zampieri, Marcos and Ljube\v{s}i\'{c}, Nikola and Nakov, Preslav and Ali, Ahmed and Tiedemann, J\"{o}rg},
title = {Discriminating between Similar Languages and Arabic Dialect Identification: A Report on the Third DSL Shared Task},
booktitle = {Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial3)},
month = {December},
year = {2016},
address = {Osaka, Japan},
pages = {1--14}
}
This year we divided the DSL Shared Task into two sub-tasks.
For sub-task 1, we released a new version of the DSL corpus collection (DSLCC). The corpus contains 20,000 instances per country (18,000 for training and 2,000 for development). Each instance is an excerpt extracted from journalistic texts and labeled with the text's country of origin.
The languages and varieties included in this year's edition grouped by similarity are:
For sub-task 1, two test sets (A and B) were released. Each of them contains 1,000 unidentified instances of each language, to be classified according to the country of origin.
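As a rough illustration of the kind of system sub-task 1 asks for, the sketch below trains a character trigram naive Bayes classifier on labeled excerpts and predicts a variety label for unseen text. It is not the official baseline or any participant's system. The class name `NgramNaiveBayes`, the helper `char_ngrams`, the labels, and the four training sentences are all invented for the example and are not taken from the DSLCC.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=3):
    """Return the overlapping character n-grams of a padded, lowercased string."""
    padded = " " + text.lower() + " "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NgramNaiveBayes:
    """Multinomial naive Bayes over character n-grams with add-one smoothing.

    Hypothetical helper class written for this illustration only.
    """

    def __init__(self, n=3):
        self.n = n
        self.ngram_counts = defaultdict(Counter)  # label -> n-gram counts
        self.label_counts = Counter()             # label -> number of training instances
        self.vocab = set()                        # all n-grams seen in training

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            grams = char_ngrams(text, self.n)
            self.ngram_counts[label].update(grams)
            self.label_counts[label] += 1
            self.vocab.update(grams)
        return self

    def predict(self, text):
        grams = char_ngrams(text, self.n)
        total_docs = sum(self.label_counts.values())
        vocab_size = len(self.vocab)
        best_label, best_score = None, float("-inf")
        for label in sorted(self.label_counts):
            counts = self.ngram_counts[label]
            denom = sum(counts.values()) + vocab_size
            # Log prior plus smoothed log likelihood of every n-gram in the text.
            score = math.log(self.label_counts[label] / total_docs)
            for gram in grams:
                score += math.log((counts[gram] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training data (invented; real DSLCC instances are news excerpts).
train_texts = [
    "the colour of the harbour",
    "my favourite theatre",
    "the color of the harbor",
    "my favorite theater",
]
train_labels = ["en-GB", "en-GB", "en-US", "en-US"]

clf = NgramNaiveBayes(n=3).fit(train_texts, train_labels)
prediction = clf.predict("a colourful neighbourhood")
```

With 18,000 training instances per country, the same interface would be fitted on the training split, tuned on the 2,000 development instances, and then applied to the unlabeled test instances.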
This year, for the first time, the DSL shared task included a sub-task on Arabic dialects.
As dialects are mostly used in conversational speech, in sub-task 2 we provided a dataset containing automatic speech recognition (ASR) transcripts.
We released training and testing data for the following Arabic dialects: Egyptian, Gulf, Levantine, and North-African, as well as for Modern Standard Arabic (MSA).
We considered two types of submission, closed and open.
A total of six submissions (three closed and three open) were allowed for each test set (A, B, and C).
After the shared task, participants were invited to submit a paper to the VarDial workshop describing their findings (8 pages, plus 2 for references). Submissions were to be formatted according to the COLING template.