FIRE '13 Dataset


This webpage hosts the development and test sets that were originally part of the query labelling subtask in the FIRE 2013 Track on Transliterated Search.

All the data available on this website must be used for non-commercial and research purposes only.


Dataset Description

We release 500 labelled English-Hindi queries as the development set. These contain 1056 distinct word transliteration pairs. Because the data set is small, we recommend using it not for training, but as a development set for tuning model parameters and for analyzing word transliteration pairs. We also release 406 queries as a test set, language labelled at the word level.


Format Description

The development set can be found here. An example labelled query is shown below. Each word is followed by a backslash and its language tag ('H' : Hindi; 'E' : English), and each Hindi word is additionally followed by '=' and its transliteration in the Devanagari script.

banarasi\H=बनारसी silk\E sarees\E
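A minimal Python sketch of how a line in this format might be read is given here for illustration. It assumes that tokens are separated by whitespace, that the first backslash in a token separates the word from its tag, and that '=' introduces the Devanagari form; these assumptions are inferred from the single example above, and the names DevToken and parse_dev_query are hypothetical, not part of the released data or any official tool.

```python
from typing import List, NamedTuple, Optional


class DevToken(NamedTuple):
    """One word of a development-set query (hypothetical helper type)."""
    word: str               # Roman-script word as typed in the query
    lang: str               # 'H' (Hindi) or 'E' (English)
    native: Optional[str]   # Devanagari form for Hindi words, else None


def parse_dev_query(line: str) -> List[DevToken]:
    """Split one labelled query into tokens.

    Assumes the layout word\\TAG or word\\TAG=devanagari, with tokens
    separated by whitespace, as in the example on this page.
    """
    tokens = []
    for item in line.split():
        word, _, rest = item.partition("\\")   # word before the backslash
        lang, _, native = rest.partition("=")  # tag, then optional native form
        tokens.append(DevToken(word, lang, native or None))
    return tokens


example = r"banarasi\H=बनारसी silk\E sarees\E"
for token in parse_dev_query(example):
    print(token)
```

Collecting the (word, native) pairs of all tokens tagged 'H' across the file would give the word transliteration pairs mentioned in the dataset description, provided the assumptions above hold for every line.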

The test set can be found here. An example labelled query is shown below. Each word is followed by a backslash and its tag ('hi' : Hindi; 'en' : English; 'NE*' : Named Entity).

bharat\hi ka\hi bharosa\hi dravid\NE zimbabwe\NE tour\en
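As an illustration, a test-set line could be read and its tags counted with a short Python sketch like the one below. It assumes whitespace-separated tokens with the first backslash separating word and tag, as in the example above; the function names parse_test_query and tag_counts are hypothetical and not part of the released data.

```python
from collections import Counter
from typing import List, Tuple


def parse_test_query(line: str) -> List[Tuple[str, str]]:
    """Return (word, tag) pairs for one test-set query.

    Assumes each token looks like word\\tag; a token without a
    backslash is returned with an empty tag.
    """
    pairs = []
    for item in line.split():
        word, _, tag = item.partition("\\")
        pairs.append((word, tag))
    return pairs


def tag_counts(line: str) -> Counter:
    """Count how often each tag occurs in one query."""
    return Counter(tag for _, tag in parse_test_query(line))


example = r"bharat\hi ka\hi bharosa\hi dravid\NE zimbabwe\NE tour\en"
print(parse_test_query(example))
print(tag_counts(example))
```

Since the tags are kept as plain strings, any variants of the named-entity tag that appear in the file are preserved as written rather than normalized.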