DSL Corpus Collection (DSLCC)

This is the repository for the DSL Corpus Collection (DSLCC).

The DSLCC is a multilingual collection of short excerpts of journalistic texts. It has been used as the main dataset for the DSL shared tasks organized within the scope of the Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial). For more information, please check the DSL shared task reports (links below) or the websites of past editions of the VarDial workshop: VarDial 2017 at EACL, VarDial 2016 at COLING, LT4VarDial 2015 at RANLP, and VarDial 2014 at COLING.

So far, five versions of the DSLCC have been released. The table below lists the languages and varieties included in each version, grouped by similarity. Click on the respective version to download the dataset.

Language/Variety	DSLCC v1.0	DSLCC v2.0	DSLCC v2.1	DSLCC v3.0	DSLCC v4.0
Bosnian	X	X	X	X	X
Croatian	X	X	X	X	X
Serbian	X	X	X	X	X
Czech	X	X	X
Slovak	X	X	X
Indonesian	X	X	X	X	X
Malay	X	X	X	X	X
Brazilian Portuguese	X	X	X	X	X
European Portuguese	X	X	X	X	X
Macanese Portuguese			X
Argentine Spanish	X	X	X	X	X
Mexican Spanish			X	X
Peninsular Spanish	X	X	X	X	X
Peruvian Spanish					X
Bulgarian		X	X
Macedonian		X	X
Canadian French				X	X
Hexagonal French				X	X
American English	X
British English	X
Persian					X
Dari					X
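
The table above can also be read programmatically. The following is a minimal sketch, not part of the DSLCC distribution: it assumes the rows are tab-separated, that the columns follow the header order (v1.0 to v4.0), and that trailing empty cells are simply omitted. The names VERSIONS, ROWS, parse_availability and languages_in are hypothetical, and only a few rows are copied from the table for illustration.

```python
from typing import Dict, List

# Column order taken from the table header.
VERSIONS = ["v1.0", "v2.0", "v2.1", "v3.0", "v4.0"]

# A few rows copied from the table above. Tabs separate the columns;
# trailing empty cells are omitted (an assumption about the layout).
ROWS = (
    "Bosnian\tX\tX\tX\tX\tX\n"
    "Czech\tX\tX\tX\n"
    "Macanese Portuguese\t\t\tX\n"
    "Mexican Spanish\t\t\tX\tX\n"
    "Canadian French\t\t\t\tX\tX\n"
    "Persian\t\t\t\t\tX\n"
)


def parse_availability(rows: str) -> Dict[str, List[str]]:
    """Map each language/variety to the versions marked with an X."""
    table: Dict[str, List[str]] = {}
    for line in rows.splitlines():
        if not line.strip():
            continue
        name, *cells = line.split("\t")
        # Pad rows whose trailing empty cells were dropped.
        cells += [""] * (len(VERSIONS) - len(cells))
        table[name] = [v for v, c in zip(VERSIONS, cells) if c.strip() == "X"]
    return table


def languages_in(version: str, table: Dict[str, List[str]]) -> List[str]:
    """List the languages/varieties included in a given version."""
    return [lang for lang, versions in table.items() if version in versions]


availability = parse_availability(ROWS)
print(languages_in("v4.0", availability))
```

With the six sample rows, the query for v4.0 returns Bosnian, Canadian French and Persian; extending ROWS with the remaining lines of the table gives the full picture for each version.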

Citing the Dataset

If you use the dataset, we kindly ask you to cite the corpus description paper, where you can also find more information about the DSLCC:

Liling Tan, Marcos Zampieri, Nikola Ljubešić, Jörg Tiedemann (2014) Merging Comparable Data Sources for the Discrimination of Similar Languages: The DSL Corpus Collection. Proceedings of the 7th Workshop on Building and Using Comparable Corpora (BUCC). pp. 6-10. Reykjavik, Iceland. [pdf] [bib]

The DSL Reports

For the results obtained by the participants of the four editions of the DSL shared task, please see the shared task reports below. In 2017, the DSL shared task was part of the VarDial evaluation campaign.

2017 - Marcos Zampieri, Shervin Malmasi, Nikola Ljubešić, Preslav Nakov, Ahmed Ali, Jörg Tiedemann, Yves Scherrer, Noëmi Aepli (2017) Findings of the VarDial Evaluation Campaign 2017. Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial). pp. 1-15. Valencia, Spain. [pdf] [bib]

2016 - Shervin Malmasi, Marcos Zampieri, Nikola Ljubešić, Preslav Nakov, Ahmed Ali, Jörg Tiedemann (2016) Discriminating between Similar Languages and Arabic Dialect Identification: A Report on the Third DSL Shared Task. Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial). pp. 1-14. Osaka, Japan. [pdf] [bib]

2015 - Marcos Zampieri, Liling Tan, Nikola Ljubešić, Jörg Tiedemann, Preslav Nakov (2015) Overview of the DSL Shared Task 2015. Proceedings of the Joint Workshop on Language Technology for Closely Related Languages, Varieties and Dialects (LT4VarDial). pp. 1-9. Hissar, Bulgaria. [pdf] [bib]

2014 - Marcos Zampieri, Liling Tan, Nikola Ljubešić, Jörg Tiedemann (2014) A Report on the DSL Shared Task 2014. Proceedings of the 1st Workshop on Applying NLP Tools to Similar Languages, Varieties and Dialects (VarDial). pp. 58-67. Dublin, Ireland. [pdf] [bib]

Additional Datasets

The following datasets have been used in other shared tasks organized within the scope of the VarDial workshop.

Arabic Dialect Identification (ADI): A dataset containing four Arabic dialects (Egyptian, Gulf, Levantine, and North African) and Modern Standard Arabic (MSA).

German Dialect Identification (GDI): The ArchiMob corpus containing Swiss German dialects from Basel, Bern, Lucerne, and Zurich.

Cross-lingual Parsing (CLP): Datasets for parsing similar languages: Croatian - Slovenian, Slovak - Czech, and Norwegian - Danish/Swedish.