Welcome to the publicly available dataset for Web search Query Segmentation

This dataset comprises 500 Bing Australia queries (relatively rare queries, each with a query frequency between five and fifteen in the original Bing Australia log of March 2010), a list of numbered Web URLs, relevance judgment sets (qrels) for each query, query segmentations produced by four algorithms and three human annotators, and the best quoted version of each query (found through brute-force exploration). For details on how this dataset was constructed, please refer to:

SIGIR 2012 paper to be cited when using this dataset

Cite as: R. Saha Roy, N. Ganguly, M. Choudhury and S. Laxman. An IR-based Evaluation Framework for Web Search Query Segmentation. In Proceedings of the 35th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '12), Portland, USA, pages 881-890, ACM, 2012. [Bibtex]

This dataset (queries, segmentations and qrels) is freely available upon an email request sent to the authors. All senders will be treated as having read and accepted the terms of the Microsoft Research License Agreement (MSR-LA).

Click here to send the request.

The accompanying processed corpus (case-folded, HTML tags removed, each sentence fragment between two consecutive punctuation marks placed on its own line, and extra whitespace and newlines removed) can be downloaded from here [74.4 MB archive, 13959 files inside].
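
For readers who want to process additional pages in a comparable way, the steps listed above can be approximated with the following minimal sketch. It is not the script used to build the corpus. The function name preprocess, the regular expressions, and the choice to treat every run of non-alphanumeric, non-space characters as punctuation are assumptions made for illustration only.

```python
import html
import re

# Assumption: a simple regex is enough to drop tags; it does not remove
# the contents of <script> or <style> elements.
TAG_RE = re.compile(r"<[^>]+>")
# Assumption: "punctuation" means any run of characters that are neither
# word characters nor whitespace (so apostrophes and hyphens also split).
PUNCT_RE = re.compile(r"[^\w\s]+")
WS_RE = re.compile(r"\s+")


def preprocess(raw_html: str) -> str:
    """Return lowercased text with one punctuation-delimited fragment per line."""
    text = TAG_RE.sub(" ", raw_html)   # remove HTML tags
    text = html.unescape(text)         # decode entities such as &amp;
    text = text.lower()                # case folding
    fragments = PUNCT_RE.split(text)   # break at punctuation
    # Collapse extra whitespace and newlines inside each fragment.
    cleaned = (WS_RE.sub(" ", fragment).strip() for fragment in fragments)
    # Drop fragments that are empty after cleaning.
    return "\n".join(fragment for fragment in cleaned if fragment)


if __name__ == "__main__":
    print(preprocess("<p>Hello, World!  This is <b>a</b> test.</p>"))
```

Each output line holds a single fragment, which should roughly match the layout of the files in the archive. The exact punctuation set used for the released corpus is not stated here, so results may differ in detail.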

Address all questions regarding the dataset to any one of:

monojitc [AT] microsoft [DOT] com

niloy [AT] cse [DOT] iitkgp [DOT] ernet [DOT] in

rishiraj [DOT] saharoy [AT] gmail [DOT] com

srivatsan [DOT] laxman [AT] gmail [DOT] com


Last updated: October 15, 2013.
