Datasets for FIRE 2013 Track on Transliterated Search


This webpage contains datasets and tools that will be required for participating in the FIRE 2013 Track on Transliterated Search. These datasets or tools are either: (a) directly downloadable from this site, (b) downloadable from external sites, or (c) available on email request.

All the data available on this website must be used for non-commercial and research purposes only. Wherever applicable, relevant papers must be cited in your FIRE 2013 working notes. Distribution of these datasets is not permitted under any circumstances.




General datasets

These datasets are generally useful for transliteration tasks. They include word frequency lists, word transliteration pairs, and miscellaneous tools and corpora.


Datasets specific to Subtask 1

We will provide 500, 100 and 150 labelled queries for English, Bangla and Gujarati respectively, containing 1056, 298 and 546 distinct word transliteration pairs. Because the data is small, we do not recommend using it to train your algorithms; rather, use it as a development set for tuning model parameters and for understanding and analysing word transliteration pairs.

Open these text files with UTF-8 encoding to view their contents properly. You may use Notepad++ (Encoding -> Encode in UTF-8).
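
Equivalently, if you process the files with a script, read them with an explicit UTF-8 encoding. A minimal Python sketch (the file name follows the naming pattern described below and is only illustrative):

# Minimal sketch: read a dev file as UTF-8 (file name is illustrative).
with open("Bangla - Dev.txt", encoding="utf-8") as f:
    lines = [line.rstrip("\n") for line in f]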

To obtain these datasets, please send an email with subject "Request for FIRE 2013 Transliteration Track Datasets (Subtask 1)" to monojitc [AT] microsoft [DOT] com with cc to rishiraj [DOT] saharoy [AT] gmail [DOT] com. The email should contain your full name, affiliation, email address, and complete contact address with mobile number (along with the same details for your team members if you are participating as a team). The sample input files are named "<language> - Dev.txt" and the output files are named "<language> - Annotated - Dev.txt".


Datasets specific to Subtask 2


Test data release and submission dates:

For logistical reasons, we would like each team to submit its output within exactly two days of the release of the test data. However, each team will be allowed to choose its preferred date (within a stipulated period) for test data release. The data will be sent to you by email on your chosen date, and we expect your runs to be submitted within 48 hours of the time the data was sent. If we do not receive your runs within 48 hours, we cannot guarantee that your submission will be evaluated.

Your preferred date for test data release has to be between 1st and 15th October 2013. No test data will be released before 1st or after 15th October 2013. Also note that we will not distribute the development-cum-training datasets (which are currently being distributed) beyond 13th October 2013.

The evaluation report will be sent to the teams by 30th October 2013.

Input/Output Formats:

For Subtask 1, the output format is exactly the same as that of the annotated dev data. Words have to be marked as \E or \L=<word in native script>, where L is H, B or G depending on whether the language concerned is Hindi, Bangla or Gujarati. For example,

beetein\H=बीतें lamhein\H=लम्हें video\E download\E
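As a rough illustration of producing this annotation programmatically, here is a minimal Python sketch; the lookup table, language tag and function name are assumptions made for the example, not part of the provided data:

# Minimal sketch: annotate a Romanized query in the Subtask 1 output format.
# The transliteration lookup below is a stand-in for whatever your system produces.
translit = {"beetein": "बीतें", "lamhein": "लम्हें"}  # hypothetical mapping

def annotate(query, lang_tag="H"):
    # lang_tag is "H", "B", or "G" for Hindi, Bangla, or Gujarati.
    out = []
    for word in query.split():
        if word in translit:
            out.append(f"{word}\\{lang_tag}={translit[word]}")
        else:
            out.append(f"{word}\\E")  # here, unknown words are treated as English
    return " ".join(out)

print(annotate("beetein lamhein video download"))
# beetein\H=बीतें lamhein\H=लम्हें video\E download\E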

For Subtask 2, the output format is the same as that of the qrels dev file, but without the relevance judgments. You will be provided query IDs, and the output (top ten documents only) should look like the following, for example:

query-id-34
doc-id-32181
doc-id-32369
doc-id-32191
doc-id-56082
doc-id-35605
doc-id-49650
doc-id-15865
doc-id-36242
doc-id-14931
doc-id-44854

query-id-29
doc-id-13332
doc-id-55515
doc-id-34214
doc-id-34682
doc-id-24655
doc-id-16832
doc-id-46317
doc-id-57516
doc-id-34711
doc-id-9462

There should be a newline after each query-id and doc-id, and the last doc-id for each query must be followed by two newlines.
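
A minimal Python sketch of writing a run file in this format, assuming your retrieval results are already available as a mapping from query IDs to ranked document IDs (the variable names and data below are illustrative):

# Minimal sketch: write a Subtask 2 run file.
# `results` maps each query ID to its ranked list of document IDs (illustrative data).
results = {
    "query-id-34": ["doc-id-32181", "doc-id-32369", "doc-id-32191"],
    "query-id-29": ["doc-id-13332", "doc-id-55515", "doc-id-34214"],
}

with open("subtask2run1.txt", "w", encoding="utf-8") as f:
    for query_id, doc_ids in results.items():
        f.write(query_id + "\n")
        for doc_id in doc_ids[:10]:   # top ten documents only
            f.write(doc_id + "\n")
        f.write("\n")                 # blank line: last doc-id is followed by two newlines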

Number of Runs allowed:

A "run" is the output of your system on our test data in the prescribed format. If you want to test different approaches to solve the same subtask, you can do so by submitting multiple runs for the subtask. A team is allowed to submit at most 3 runs per subtask.

Submission process:

Step 1: Send an email addressed to monojitc [AT] microsoft [DOT] com with cc to rishiraj [DOT] saharoy [AT] gmail [DOT] com by 4th October 2013 with subject "Test data release date". Clearly identify yourself so that we know which team you represent, specify your preferred date for test data release, and state the subtasks (Subtask 1, Subtask 2 or both) for which you would like to receive the test data. Also specify the email address(es) at which you would like to receive the test data.

Step 2: Read and familiarize yourself with the input/output format of the test data. Make sure that the output file(s) generated by your system conform to the specified format exactly; otherwise we may not be able to evaluate your submission, or the evaluation results may be incorrect.

Step 3: On your chosen date, you will receive the test data by 10am IST.

Step 4: Download the test set(s) and run your system(s) on these files. If you wish to submit multiple runs for a subtask, name your output text files as subtask<#>run<#>.txt (e.g., subtask1run2.txt). For Subtask 1, append the language name with an underscore (_); for example, subtask1run3_gujarati.txt. Put all your output files in a single zipped file (see the packaging sketch after these steps). If you wish, you can add a Readme file to the zipped archive with any notes you want us to read before evaluating your submission.

Step 5: Email us this zipped file before 10am IST on the third day from your chosen date for test data release (e.g., if you had chosen 7th October 2013 as the test data release, you will receive the test data on 7th Oct by 10am IST, and you have to send us back the output files by 10am IST on 9th Oct).
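
A minimal sketch of packaging the run files into a zip archive with Python's standard zipfile module; the file names below follow the naming convention above but are otherwise illustrative:

# Minimal sketch: zip the run files (and an optional Readme) for submission.
import zipfile

files = ["subtask1run1_gujarati.txt", "subtask2run1.txt", "Readme.txt"]  # illustrative names
with zipfile.ZipFile("runs.zip", "w", zipfile.ZIP_DEFLATED) as zf:
    for name in files:
        zf.write(name)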



Date modified: October 01, 2013