Site menu:

DSL Shared Task 2015

Jointly with the LT4VarDial workshop, we will be organizing the 2nd edition of the DSL Shared Task.


Discriminating between Similar Languages 2015 (Finished)

Discriminating between similar languages and language varieties is one of the bottlenecks of language identification systems. This aspect has been topic of a number of papers published in the last years as well as of the first Discriminating between Similar Languages (DSL) shared task held in 2014.

Results

The tables containing the results are available in this link.

The shared task report can be found here and the bib entry is the following:

@InProceedings{zampieri-EtAl:2015:LT4VarDial1,
     author = {Zampieri, Marcos and Tan, Liling and Ljube\v{s}i\'{c}, Nikola and Tiedemann, J\"{o}rg and Nakov, Preslav},
     title = {Overview of the DSL Shared Task 2015},
     booktitle = {Proceedings of the Joint Workshop on Language Technology for Closely Related Languages, Varieties and Dialects},
     month = {September},
     year = {2015},
     address = {Hissar, Bulgaria},
     publisher = {Association for Computational Linguistics},
     pages = {1--9}
}

Data

For the DSL shared task 2015 edition, we are releasing two new versions of the DSL corpus collection (DSLCC), the version 2.0 and 2.1. The version 2.0 is the standard shared task training material whereas the version 2.1 can be used for the unshared task track or as additional training material.

This year apart from the similar languages and varieties we are also releasing texts from other languages to emulate a real-world language identification scenario. The two versions who will be release are the following:

The corpus contains 20,000 instances per language (18,000 training + 2,000 development). Each instance is an excerpt extracted from journalistic texts containing 20 to 100 tokens and tagged with the country of origin of the text. A language tree with all the languages is here.

The Shared Task

Twenty-five days after the release of the training set (see dates below), two test sets (A and B) will be released. Each of them will contain 1,000 unidentified instances of each language to be classified according to the country of origin.

Participants should return their results in up to 2 days after the release of the test sets. Scores will be calculated according to the systems' accuracy in identifying the country of origin of the text. We allow two kinds of submissions (please indicate this when you fill out your registration form):

Teams are allowed to submit up to six runs to each submission (three for each test set (A and B)).

The best systems will be invited to submit a paper describing their findings (8 pages + 2 for references). Submissions should be formatted according to the RANLP template

The Unshared Task Track

The DSLCC was compiled to serve as a standardise dataset for the discrimination of similar languages and related tasks. For this reason, along with the shared task, this edition we are proposing an unshared task track inspired by the unshared task in PoliInformatics from 2014.

To participate in the unshared task track, teams should use any of the versions of the DSLCC to investigate differences between similar languages and language varieties using NLP methods. The fundamental question is: What are the most important differences between these similar languages and language varieties?

Some questions that should be answered are:

The unshared task track is not a competition but an interesting linguistic exercise. Interested participants should fill out the registration to receive the data and submit their papers until July 20th, 2015. Submissions should be formatted according to the RANLP template. More information here.

Dates

The shared task dates for 2015 are:


DSL 2014: In the 2014 edition we provided a training and a test set of 13 different languages organized in 6 groups (more information here). The dataset contained journalistic texts and it is entitled DSL Corpus Collection. For more information, you may wish to take a look at the DSL 2014 Shared Task report and the dataset description paper.