Datasets for FIRE 2013 Track on Transliterated Search


This webpage contains datasets and tools that will be required for participating in the FIRE 2013 Track on Transliterated Search. These datasets or tools are either: (a) directly downloadable from this site, (b) downloadable from external sites, or (c) available on email request.

All the data available on this website must be used for non-commercial and research purposes only. Wherever applicable, relevant papers must be cited in your FIRE 2013 working notes. Distribution of these datasets is not permitted under any circumstances.




General datasets

These datasets are generally useful for transliteration tasks. They include word frequency lists, word transliteration pairs, and miscellaneous tools and corpora.


Datasets specific to Subtask 1

We will provide 500, 100 and 150 labelled queries for English, Bangla and Gujarati respectively, containing 1056, 298 and 546 distinct word transliteration pairs. Because the data is small, we do not recommend using it to train your algorithms; rather, use it as a development set for tuning model parameters and for understanding and analysing word transliteration pairs.

Open these text files with UTF-8 encoding to view their contents properly. You may use Notepad++ (Encoding -> Encode in UTF-8).
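
Equivalently, if you process the files with a script, read them with an explicit UTF-8 encoding. A minimal Python sketch (the file name follows the naming pattern described below and is only illustrative):

# Minimal sketch: read a dev file as UTF-8 (file name is illustrative).
with open("Bangla - Dev.txt", encoding="utf-8") as f:
    lines = [line.rstrip("\n") for line in f]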

To obtain these datasets, please send an email with subject "Request for FIRE 2013 Transliteration Track Datasets (Subtask 1)" to monojitc [AT] microsoft [DOT] com with cc to rishiraj [DOT] saharoy [AT] gmail [DOT] com. The email should contain your full name, affiliation, email address, and complete contact address with mobile number (along with the same details for your team members if you are participating as a team). The sample input files are named "<language> - Dev.txt" and the output files are named "<language> - Annotated - Dev.txt".


Datasets specific to Subtask 2


Test data release and submission dates:

For logistical reasons, we would like each team to submit its output within exactly two days of the release of the test data. However, each team will be allowed to choose its preferred date (within a stipulated period) for test data release. The data will be sent to you by email on your chosen date, and we expect your runs to be submitted within 48 hours of the time the data was sent. If we do not receive your runs within 48 hours, we cannot guarantee that your submission will be evaluated.

Your preferred date for test data release has to be between 1st and 15th October 2013. No test data will be released before 1st or after 15th October 2013. Also note that we will not distribute the development-cum-training datasets (which are currently being distributed) beyond 13th October 2013.

The evaluation report will be sent to the teams by 30th October 2013.

Input/Output Formats:

For Subtask 1, the output format is exactly the same as that of the annotated dev data. Words have to be marked as \E or \L=<word in native script>, where L is H, B or G depending on whether the language concerned is Hindi, Bangla or Gujarati. For example,

beetein\H=बीतें lamhein\H=लम्हें video\E download\E
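As a rough illustration of producing this annotation programmatically, here is a minimal Python sketch; the lookup table, language tag and function name are assumptions made for the example, not part of the provided data:

# Minimal sketch: annotate a Romanized query in the Subtask 1 output format.
# The transliteration lookup below is a stand-in for whatever your system produces.
translit = {"beetein": "बीतें", "lamhein": "लम्हें"}  # hypothetical mapping

def annotate(query, lang_tag="H"):
    # lang_tag is "H", "B", or "G" for Hindi, Bangla, or Gujarati.
    out = []
    for word in query.split():
        if word in translit:
            out.append(f"{word}\\{lang_tag}={translit[word]}")
        else:
            out.append(f"{word}\\E")  # here, unknown words are treated as English
    return " ".join(out)

print(annotate("beetein lamhein video download"))
# beetein\H=बीतें lamhein\H=लम्हें video\E download\E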

For Subtask 2, the output format is the same as that of the qrels dev file, but without the relevance judgments. You will be provided query IDs, and the output (top ten documents only) should look like the following, for example:

query-id-34
doc-id-32181
doc-id-32369
doc-id-32191
doc-id-56082
doc-id-35605
doc-id-49650
doc-id-15865
doc-id-36242
doc-id-14931
doc-id-44854

query-id-29
doc-id-13332
doc-id-55515
doc-id-34214
doc-id-34682
doc-id-24655
doc-id-16832
doc-id-46317
doc-id-57516
doc-id-34711
doc-id-9462

There should be a newline after each query-id and doc-id, and the last doc-id for each query must be followed by two newlines.
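
A minimal Python sketch of writing a run file in this format, assuming your retrieval results are already available as a mapping from query IDs to ranked document IDs (the variable names and data below are illustrative):

# Minimal sketch: write a Subtask 2 run file.
# `results` maps each query ID to its ranked list of document IDs (illustrative data).
results = {
    "query-id-34": ["doc-id-32181", "doc-id-32369", "doc-id-32191"],
    "query-id-29": ["doc-id-13332", "doc-id-55515", "doc-id-34214"],
}

with open("subtask2run1.txt", "w", encoding="utf-8") as f:
    for query_id, doc_ids in results.items():
        f.write(query_id + "\n")
        for doc_id in doc_ids[:10]:   # top ten documents only
            f.write(doc_id + "\n")
        f.write("\n")                 # blank line: last doc-id is followed by two newlines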

Number of Runs allowed:

A "run" is the output of your system on our test data in the prescribed format. If you want to test different approaches to solve the same subtask, you can do so by submitting multiple runs for the subtask. A team is allowed to submit at most 3 runs per subtask.

Submission process:

Step 1: Send an email addressed to monojitc [AT] microsoft [DOT] com with cc to rishiraj [DOT] saharoy [AT] gmail [DOT] com by 4th October 2013 with subject "Test data release date". Clearly identify yourself so that we know which team you represent, specify your preferred date for test data release, and state the subtasks (Subtask 1, Subtask 2 or both) for which you would like to receive the test data. Also specify the email address(es) at which you would like to receive the test data.

Step 2: Read and familiarize yourself with the input/output format of the test data. Make sure that the output file(s) generated by your system conform to the specified format exactly; otherwise we may not be able to evaluate your submission, or the evaluation results may be incorrect.

Step 3: On your chosen date, you will receive the test data by 10am IST.

Step 4: Download the test set(s) and run your system(s) on these files. If you wish to submit multiple runs for a subtask, name your output text files as subtask<#>run<#>.txt (e.g., subtask1run2.txt). For Subtask 1, append the language name with an underscore (_); for example, subtask1run3_gujarati.txt. Put all your output files in a single zipped file (see the packaging sketch after these steps). If you wish, you can add a Readme file to the zipped archive with any notes you want us to read before evaluating your submission.

Step 5: Email us this zipped file before 10am IST on the third day from your chosen date for test data release (e.g., if you had chosen 7th October 2013 as the test data release, you will receive the test data on 7th Oct by 10am IST, and you have to send us back the output files by 10am IST on 9th Oct).
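
A minimal sketch of packaging the run files into a zip archive with Python's standard zipfile module; the file names below follow the naming convention above but are otherwise illustrative:

# Minimal sketch: zip the run files (and an optional Readme) for submission.
import zipfile

files = ["subtask1run1_gujarati.txt", "subtask2run1.txt", "Readme.txt"]  # illustrative names
with zipfile.ZipFile("runs.zip", "w", zipfile.ZIP_DEFLATED) as zf:
    for name in files:
        zf.write(name)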



Date modified: October 01, 2013