This dataset comprises 500 Bing Australia queries (relatively rare queries, with a query frequency between five and fifteen in the original Bing Australia log of March 2010), a list of numbered Web URLs, relevance judgment sets (qrels) for each query, query segmentations according to four algorithms and three human annotators, and the best quoted query versions (found through brute-force exploration). For details on how this dataset was constructed, please refer to:
SIGIR 2012 paper to be cited when using this dataset
Cite as: R. Saha Roy, N. Ganguly, M. Choudhury and S. Laxman. An IR-based Evaluation Framework for Web Search Query Segmentation. In Proceedings of the 35th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '12), Portland, USA, pages 881-890, ACM, 2012. [Bibtex]
This dataset (queries, segmentations and qrels) is freely available upon an email request sent to the authors. All senders will be treated as having read and accepted the terms of the Microsoft Research License Agreement (MSR-LA).
Click here to send the request.
The accompanying processed corpus (case folded, HTML tags removed, each sentence fragment between any two punctuation marks placed on a new line, and extra whitespace and newlines removed) can be downloaded from here [74.4 MB archive, 13959 files inside].
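For readers who want to apply comparable preprocessing to their own documents, the steps listed above can be sketched roughly as follows. This is only an illustrative approximation and not the script used to build the corpus: the function name preprocess, the regex-based tag stripping, and the use of ASCII punctuation (Python's string.punctuation) as fragment boundaries are all assumptions made for this example.

```python
import html
import re
import string

# Assumption: a simple regex is enough to strip tags; it does not
# remove the contents of <script> or <style> elements.
TAG_RE = re.compile(r"<[^>]+>")
# Assumption: "punctuation" means ASCII punctuation characters.
PUNCT_RE = re.compile("[" + re.escape(string.punctuation) + "]+")
WS_RE = re.compile(r"\s+")


def preprocess(raw_html: str) -> str:
    """Hypothetical helper approximating the corpus preprocessing."""
    text = TAG_RE.sub(" ", raw_html)      # remove HTML tags
    text = html.unescape(text)            # decode entities such as &amp;
    text = text.lower()                   # case folding
    fragments = PUNCT_RE.split(text)      # break at punctuation marks
    # Collapse runs of whitespace/newlines and drop empty fragments.
    lines = [WS_RE.sub(" ", frag).strip() for frag in fragments]
    return "\n".join(line for line in lines if line)


if __name__ == "__main__":
    print(preprocess("<p>Hello, World!  This is   a <b>test</b>.</p>"))
```

Each output line is one lowercase fragment with single spaces between words. Documents containing non-ASCII punctuation or embedded scripts would need additional handling that this sketch does not attempt.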
Address all queries regarding the dataset to any one of:
monojitc [AT] microsoft [DOT] com
niloy [AT] cse [DOT] iitkgp [DOT] ernet [DOT] in
rishiraj [DOT] saharoy [AT] gmail [DOT] com
srivatsan [DOT] laxman [AT] gmail [DOT] com
Last updated: October 15, 2013.