I am a Ph.D. student (Microsoft Research India Ph.D. Fellow) in the Department of Computer Science and Engineering, IIT Kharagpur, since December 2009. My Ph.D. is jointly advised by Prof. Niloy Ganguly (IIT Kharagpur) and Dr. Monojit Choudhury (Microsoft Research India). I am a part of the Complex Networks Research Group (CNeRG) at IIT Kharagpur. Even though I am very passionate about search and my Ph.D. is on Web search query analysis, my general research interests include Information Retrieval, Text Mining, Machine Learning, Natural Language Processing, Complex Networks, and Linguistics. I joined Adobe Research Labs India, Bangalore, on 1st April 2014, as a Computer Scientist.
What's New?
[02 August 2014] Delivered invited talk on Experiment Design and Evaluation for Information Retrieval at the Pre-FIRE Workshop on Information Retrieval for Transliterated Queries 2014, Microsoft Research India, Bangalore.
[29 July 2014] Journal paper accepted for publication in Web Semantics: Science, Services and Agents on the World Wide Web, Elsevier.
[18 July 2014] Delivered invited talk on Query and Document Understanding at PESIT, Bangalore.
[07 July 2014] Submitted my Ph.D. Thesis at IIT Kharagpur.
My Ph.D. research focuses on various aspects of query analysis, applied to the domain of Web search. The thesis explores the proposition that search queries have evolved into a distinct language as a result of continuous two-way interactions between users and the search engine. Our results quantify similarities and differences between queries and their parent natural language (English in our case). Throughout, however, the focus has been on solving practical open problems in information retrieval. Specifically, I have worked on flat and hierarchical query segmentation (algorithms and evaluation), intent analysis, language modelling, complex network modelling, and cognitive experiments conducted through crowdsourcing.
General philosophy
The co-evolution of the Web and commercial search engines, and the inability of such search engines to process natural language (NL) questions, have resulted in search queries being formulated in a syntax which is more complex than a bag-of-words model, but more flexibly structured than sentences conforming to NL grammar. In this thesis, we take the first steps to understand this unique syntactic structure of Web search queries in an unsupervised setup, and apply the acquired knowledge to make important contributions to Information Retrieval (IR). First, we develop a query segmentation algorithm that uses query logs to discover syntactic units in queries. We find that our algorithm detects several syntactic constructs that differ from NL phrases. We proceed to augment our method with Wikipedia titles for identifying long named entities. Next, we develop an IR-based evaluation framework for query segmentation which is superior to previously employed evaluation schemes against human annotations. Here, we show that substantial IR improvements are possible due to query segmentation. We then develop an algorithm that uses only query logs to generate a nested query segmentation, where segments can be embedded inside bigger segments. Importantly, we also devise a technique for directly applying nested segmentation to improve document ranking. Subsequently, we use segment co-occurrence statistics computed from query logs to find that query segments broadly fall into two classes - content and intent. While content units must match exactly in the documents, intent units can be used in more intelligent ways to improve the quality of search results. More generally, the relationship between content and intent segments within the query is vital to query understanding. Finally, we generate large volumes of artificial query logs constrained by n-gram model probabilities estimated from real query logs. 
We perform corpus-level and query-level comparisons of model-generated logs with the real query log based on complex network statistics and (crowdsourced) user intuition of real query syntax, respectively. Together, the two approaches provide a holistic view of the syntactic complexity of Web queries, which is more complex than what n-grams can capture, yet more predictable than NL.
PUBLICATIONS (LONG PAPERS) [Google Scholar Profile]
Rishiraj Saha Roy, Rahul Katare, Niloy Ganguly, Srivatsan Laxman and Monojit Choudhury, “Discovering and understanding word level user intent in Web search queries”, in Web Semantics: Science, Services and Agents on the World Wide Web, Elsevier, 2014 (in press).
Rishiraj Saha Roy, Rahul Katare, Niloy Ganguly and Monojit Choudhury, “Automatic Discovery of Adposition Typology”, in Proceedings of the 25th International Conference on Computational Linguistics (Coling ’14), 23 – 29 August 2014, Dublin, Ireland, pages 1037 - 1046. [Poster] [BibTeX] [Preprint]
Rishiraj Saha Roy, M. Dastagiri Reddy, Niloy Ganguly and Monojit Choudhury, "Understanding the Linguistic Structure and Evolution of Web Search Queries", in Proceedings of the 10th International Conference on the Evolution of Language (Evolang X), 14 - 17 April 2014, Vienna, Austria, pages 286 – 293. [Slides] [BibTeX] [Preprint]
Rishiraj Saha Roy, Monojit Choudhury, Prasenjit Majumder and Komal Agarwal, "Overview and Datasets of FIRE 2013 Track on Transliterated Search", in Proceedings of the Fifth Forum for Information Retrieval Evaluation 2013 (FIRE '13), 04 - 06 December 2013, New Delhi, India. [Slides] [BibTeX] [Data]
Rohan Ramanath, Monojit Choudhury, Kalika Bali and Rishiraj Saha Roy, "Crowd Prefers the Middle Path: A New IAA Metric for Crowdsourcing Reveals Turker Biases in Query Segmentation", in Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (ACL '13), 4 - 9 August 2013, Sofia, Bulgaria, pages 1713 - 1722. [Slides] [Supplementary material] [BibTeX]
Rishiraj Saha Roy, Niloy Ganguly, Monojit Choudhury and Srivatsan Laxman, "An IR-based Evaluation Framework for Web Search Query Segmentation", in Proceedings of the 35th Annual ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '12), 12 - 16 August 2012, Portland, USA, pages 881 – 890. [Slides] [Data] [BibTeX]
Rishiraj Saha Roy, Monojit Choudhury and Kalika Bali, "Are Web Search Queries an Evolving Protolanguage?", in Proceedings of the 9th International Conference on the Evolution of Language (Evolang IX), 13 - 16 March 2012, Kyoto, Japan, pages 304 – 311. [BEST RESEARCH POSTER AWARD] [Poster] [BibTeX] [Preprint]
Durga Toshniwal and Rishiraj Saha Roy, “Shape pattern matching: A tool to cluster unstructured text documents”, in Journal of Computational Methods in Science and Engineering (JCMSE), Volume 10, Supplement 1, IOS Press, September 2010, pages 73 – 84. [BibTeX]
Rishiraj Saha Roy and Durga Toshniwal, “Fuzzy Clustering of Text Documents Using Naïve Bayesian Concept”, in Proceedings of the 1st International Conference on Recent Trends in Information, Telecommunication and Computing 2010 (ITC '10), 12 – 13 March 2010, Kochi, India, pages 55 – 59. [BibTeX]
Durga Toshniwal and Rishiraj Saha Roy, “Clustering Unstructured Text Documents Using Naïve Bayesian Concept and Shape Pattern Matching”, in International Journal of Advancements in Computing Technology (IJACT), Volume 1, Number 1, AICIT (Advanced Institute of Convergence Information Technology) Press, September 2009, pages 52 - 63. [BibTeX]
Durga Toshniwal and Rishiraj Saha Roy, “A Hierarchical Clustering Scheme for Unstructured Text Data”, in Proceedings of the International Conference on Information and Knowledge Engineering 2009 (IKE '09), 13 – 16 July 2009, Las Vegas, USA, pages 640 – 646. [BibTeX]
Rishiraj Saha Roy, Anusha Suresh, Niloy Ganguly and Monojit Choudhury, "Place Value: Word Position Shifts Vital to Search Dynamics", in Posters of the 22nd International World Wide Web Conference 2013 (WWW '13), 13 - 17 May 2013, Rio de Janeiro, Brazil, pages 153 – 154 (companion). [Poster] [BibTeX]
Rishiraj Saha Roy, "Analyzing Linguistic Structure of Web Search Queries", in Doctoral Consortium of the 22nd International World Wide Web Conference (WWW '13), 13 - 17 May 2013, Rio de Janeiro, Brazil, pages 395 – 399 (companion). [Slides] [BibTeX]
Rishiraj Saha Roy, Niloy Ganguly, Monojit Choudhury and Naveen Kumar Singh, “Complex Network Analysis Reveals Kernel-Periphery Structure in Web Search Queries”, in Proceedings of the 2nd International ACM SIGIR (Association for Computing Machinery Special Interest Group on Information Retrieval) Workshop on Query Representation and Understanding 2011 (QRU '11), 28 July 2011, Beijing, China, pages 5 – 8. [Slides] [Poster] [BibTeX]
Nikita Mishra, Rishiraj Saha Roy, Niloy Ganguly, Srivatsan Laxman and Monojit Choudhury, "Unsupervised Query Segmentation Using only Query Logs", in Posters of the 20th International World Wide Web Conference 2011 (WWW '11), 28 March - 1 April 2011, Hyderabad, India, pages 91 – 92 (companion). [Poster] [BibTeX]
ACADEMICS
Curriculum vitae [as of 13 December 2014]
Ph.D. Student in Computer Science and Engineering (CSE), IIT Kharagpur [2009 - present]
M.Tech. in Information Technology from IIT Roorkee with CGPA 7.98 [2007 - 2009]
B.E. in Information Technology from Jadavpur University, Kolkata, with CGPA 8.76 [2003 - 2007]
Schooling from Bhavan's Gangabux Kanoria Vidyamandir (BGKV, also known as "Bharatiya Vidya Bhavan"), Kolkata (CBSE Class X = 89.80%, CBSE Class XII = 89.40%) [1989 - 2003]
Internships and visits
Visited Microsoft Research India, Bangalore (June 2013 - July 2013)
Visited Microsoft Research India, Bangalore (March 2012 - April 2012)
Internship at Microsoft Research India, Bangalore (October 2011 - January 2012)
Visited Microsoft Research India, Bangalore (November 2010 - December 2010)
Internship at Microsoft Research India, Bangalore (May 2010 - July 2010)
Awards, Fellowships and Grants
ACM SIGIR Student Travel Support for attending the 35th Annual SIGIR Conference on Research and Development in Information Retrieval (SIGIR '12) at Portland, USA (August 2012)
Best Research Poster Award at Evolang IX (2012), Kyoto, Japan (sponsored by the John Benjamins Publishing Co.) (March 2012)
Evolang student grant for attending the 9th International Conference on the Evolution of Language (Evolang IX) at Kyoto, Japan (March 2012)
Microsoft Research India Ph.D. Fellowship for three years (August 2011)
NIXI Fellowship for attending the 20th International World Wide Web Conference 2011 (WWW 2011) at Hyderabad, India (March 2011)
Teaching Assistantships
Autumn 2013: Information Retrieval, CSE, IIT Kharagpur
Spring 2013: Complex Networks, CSE, IIT Kharagpur
Autumn 2012: Information Retrieval, CSE, IIT Kharagpur
Spring 2012: Complex Networks, CSE, IIT Kharagpur
Autumn 2011: Information Retrieval, CSE, IIT Kharagpur
Spring 2011: Programming and Data Structures Laboratory, CSE, IIT Kharagpur
Autumn 2010: Programming and Data Structures Laboratory, CSE, IIT Kharagpur
Spring 2009: Data Mining, IT, IIT Roorkee
Autumn 2008: Object Oriented Programming, IT, IIT Roorkee
Clustering Unstructured Text Documents Using Naïve Bayesian Concept and Shape Pattern Matching (July 2008 – June 2009, Indian Institute of Technology Roorkee, Roorkee) [M.TECH. PROJECT] (Advisor: Prof. Durga Toshniwal)
Text Document Classification and a Search Tool for News Articles (July 2008 – October 2008, Indian Institute of Technology Roorkee, Roorkee) (Advisor: Prof. Durga Toshniwal, Collaborator: Shweta Modi)
Analysis and Enhancement of TCP Performance in Mobile Ad Hoc Networks (June 2006 – May 2007, Jadavpur University, Kolkata) [B.E. PROJECT] (Advisor: Prof. Uttam Kumar Roy)
My webpage as maintained by IIT Kharagpur.
Invited speaker on "Experiment Design and Evaluation for Information Retrieval" at the Pre-FIRE Workshop on Information Retrieval for Transliterated Queries 2014, Microsoft Research India, Bangalore.
Invited speaker on "Query and Document Understanding" at PESIT Bangalore.
Task coordinator for the Transliterated Search track, with Dr. Monojit Choudhury (Microsoft Research India), Prof. Prasenjit Majumder (DAIICT) and Komal Agarwal (DAIICT), at FIRE 2013, 4 - 6 December 2013, New Delhi, India. [Overview PDF] [Slides] [BibTeX] [Track data]
Invited speaker on "Query and Document Understanding" [Slides] and student volunteer at MSR India and IRSI Pre-FIRE Workshop on Multilingual Information Retrieval, 15 - 17 June 2013, Bengaluru, India.
Student volunteer for WWW 2013, 13 - 17 May 2013, Rio de Janeiro, Brazil.
Invited poster on "Are Web Search Queries Evolving into a Language of their Own?" (with Prof. Niloy Ganguly, IIT Kharagpur and Dr. Monojit Choudhury, Microsoft Research India) at FIRE 2012, 17 - 19 December 2012, Kolkata, India [Poster].
CONTACT ME
Permanent Address: AH - 8, Sector - 2, Salt Lake, Kolkata - 700091, West Bengal, India.
Department Address: Room No. 208, Computer Science and Engineering, IIT Kharagpur, Kharagpur - 721302, West Bengal, India.
Hostel Address: C - 122, JCB Hall, IIT Kharagpur, Kharagpur - 721302, West Bengal, India.
Email addresses: rishiraj [DOT] saharoy [AT] gmail [DOT] com; 10cs9401 [AT] iitkgp [DOT] ac [DOT] in.
Date modified: 13 December 2014