Suman Kalyan Maity
IBM PhD Fellow (2016 - present)

Microsoft Research India PhD Fellow (2014-2016)

My Fav Quotes

"Here's to the crazy ones. The misfits. The rebels. The troublemakers. The round pegs in the square holes. The ones who see things differently. They're not fond of rules. And they have no respect for the status quo. You can quote them, disagree with them, glorify or vilify them. About the only thing you can't do is ignore them. Because they change things. They push the human race forward. And while some may see them as the crazy ones, we see genius. Because the people who are crazy enough to think they can change the world, are the ones who do."

- Apple Inc.

Research Activities

Answerability in Quora

Quora is one of the most popular community Q&A sites of recent times. However, with the number of question posts increasing over time and the posts covering a wide range of topics (unlike focused Q&A sites such as Stack Overflow), many of them never get answered. Measuring answerability (i.e., whether a question will get answered or not) traditionally involves collecting expensive human judgment data that can differentiate the characteristics of an answered question from an unanswered (aka open) one. Factors used to judge whether a question will remain open include whether it is subjective, open-ended, vague/imprecise, ill-formed, off-topic, ambiguous, etc. The major problem with this method is that it is not scalable, as it is difficult to collect such judgments for thousands of questions. In this work, we quantify for the first time (i) user-level and (ii) question-level linguistic activities that correspond closely to many of the judgment factors noted above, can be easily measured for each question post, and appropriately discriminate an answered question from an unanswered one. Our central finding is that the way users use language while writing the question text can be a very effective means of characterizing answerability. This characterization further helps us predict early, with high accuracy, whether a question that has remained unanswered for a specific time period t will eventually be answered. Notably, features representing the language use patterns of the users are the most discriminative.
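As an illustration, question-level linguistic signals of the kind described above can be extracted as simple surface features. The feature names and rules below are hypothetical assumptions for the sketch, not the actual feature set used in the work:

```python
import re

# Toy question-level surface features in the spirit described above.
# These names and rules are illustrative only.
def question_features(text):
    words = text.split()
    return {
        "n_words": len(words),
        # does the question open with a wh-word?
        "wh_start": int(bool(re.match(r"(?i)(what|why|how|when|where|who|which)\b", text))),
        # first-person pronoun usage (a rough subjectivity cue)
        "first_person": sum(w.lower().strip(".,?!") in {"i", "me", "my", "we", "our"} for w in words),
        "question_marks": text.count("?"),
    }
```

Feature vectors like this could then feed any standard classifier that separates answered from open questions.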

Early prediction of popular hashtag compounds

The hashtag is the new "paralanguage" of Twitter. What started as a way for people to connect with others, organize similar tweets together, propagate ideas, and promote specific people or topics has now grown into a language of its own. As hashtags are created by users themselves, any new event or topic can be referred to by a variety of hashtags. This linguistic innovation in the form of hashtags is a very special feature of Twitter that has become immensely popular, has been widely adopted in various other social media such as Facebook and Google+, and has been studied extensively by researchers to analyze competition dynamics, adoption rates and popularity scores. One interesting and prevalent linguistic phenomenon in today's world of brief expressions and chats is hashtag compounding, where new hashtags are formed by combining two or more hashtags while the form of the individual hashtags remains intact. For example, #PeoplesChoice and #Awards together form #PeoplesChoiceAwards; #KellyRipa and #CelebrationMonth make #KellyRipaCelebrationMonth; #WikipediaBlackout is formed from #Wikipedia and #Blackout; #OregonBelieveMovieMeetup is formed from #Oregon, #BelieveMovie and #Meetup; #Educational, #Ipad and #Apps together make #EducationalIpadApps. Hashtag compounds serve marketing strategies and communicative intents (affective expression, political persuasion, humor, etc.), and also arise spontaneously. For example, the e-commerce company Amazon used #AmazonPrimeDay to promote the discounted sale of its products. This hashtag is a compound of #Amazon and #PrimeDay, whereas the individual hashtag #PrimeDay was also popular. So there is a trade-off between using a hashtag compound and its uncompounded constituents. Similarly, consider another scenario where an event is taking place, say the premiere of the movie 'The Imitation Game'.
Here one can use both the hashtags #TheImitationGame and #Premiere, or the hashtag compound #TheImitationGamePremiere. In this context, one needs to identify which version to use so that the hashtag gains a higher frequency of usage in the near future. #CSCW2016 is used to tag activities related to the 2016 CSCW conference. This is also a compound hashtag, made of #CSCW and #2016, where #CSCW refers to all CSCW conferences and #2016 refers to all events/activities taking place in 2016. The hashtag #CSCW2016 serves a more focused purpose, referring only to the 2016 edition of the conference, whereas #CSCW could also have served the purpose. Hashtag compounds also serve communicative intents, as in political campaign hashtags (#PresidentTrump = #President + #Trump: a hashtag showing support for Donald Trump in the 2016 US Presidential election). Hashtag compounding also happens spontaneously; such hashtags are generally conversational or personal-themed, like #TheBestFeelingInARelationship (#TheBestFeeling + #InARelationship), #ThrowbackThursday (#Throwback + #Thursday) and #ComeOnNowDontLie (#ComeOnNow + #DontLie). In this work, we identify for the first time that while some of these compounds gain a high frequency of usage over time (even higher than their individual constituents), many of them are soon lost into oblivion. We investigate in detail the reasons behind these observations and propose a prediction model that can identify with 77.07% accuracy whether a pair of hashtags compounding in the near future (i.e., 2 months after compounding) will become popular. At longer times T = 6 and 10 months, the accuracies are 77.52% and 79.13% respectively. This technique has strong implications for trending hashtag recommendation, since newly formed hashtag compounds can be recommended early, even before the compounding has taken place.
As an additional contribution, we asked human subjects to guess from the structural information of the hashtags whether a hashtag compound will become popular. Humans predicted compounds with an overall accuracy of only 48.7%. Notably, while humans can discriminate the relatively easier cases, the automatic framework succeeds in classifying the relatively harder ones.
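The basic compounding check underlying this analysis, whether a new hashtag is the concatenation of two already-seen hashtags, can be sketched as a simple two-way split (a minimal, case-insensitive sketch; the actual pipeline handles multi-way compounds and much richer features):

```python
def find_compound(hashtag, known):
    """Check whether `hashtag` splits into two hashtags already in `known`
    (case-insensitive concatenation check). Returns the pair, or None."""
    h = hashtag.lstrip("#").lower()
    vocab = {k.lstrip("#").lower() for k in known}
    # try every split point; accept the first where both halves are known
    for i in range(1, len(h)):
        left, right = h[:i], h[i:]
        if left in vocab and right in vocab:
            return ("#" + left, "#" + right)
    return None
```

For example, given a vocabulary containing #PeoplesChoice and #Awards, the check recovers the two constituents of #PeoplesChoiceAwards.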

Characterizing Out-of-Vocabulary words in Twitter

Language in social media is largely driven by new words and spellings that constantly enter the lexicon, polluting it and causing high deviation from the formal written variety. The primary entities of such language are out-of-vocabulary (OOV) words. In this work, we study various sociolinguistic properties of OOV words and propose a classification model to categorize them into at least six categories. Our proposed classification framework achieves a high accuracy of 81.26% with high precision and recall. We observe that the content features are the most discriminative, alone contributing an accuracy of 77.83%. We believe that such a categorization is the first step toward a deeper understanding of the semantics of OOV words and would also be very effective in classifying previously unseen OOV words, which are introduced on social media platforms very frequently.
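To give a flavor of what categorizing OOV words involves, a few categories can be approximated by surface heuristics. The rules and category names below are illustrative assumptions, not the six categories or the trained model from the work:

```python
import re

# Rough surface heuristics for a few plausible OOV categories.
# Rules and names are illustrative only.
def rough_oov_category(word):
    if re.search(r"(.)\1{2,}", word):       # "soooo" -> letter elongation
        return "elongation"
    if word.isupper() and len(word) <= 5:   # "LOL"   -> acronym/abbreviation
        return "abbreviation"
    if any(ch.isdigit() for ch in word):    # "gr8"   -> number homophone
        return "number_homophone"
    return "other"
```

A real classifier would of course combine such cues with the content and context features discussed above.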

Out of vocabulary words decrease, running texts prevail and hashtags coalesce: Twitter as an evolving sociolinguistic system

Twitter is one of the most popular social media platforms and, due to the ease of data availability, is used extensively for research. Twitter is known to have evolved in many aspects from what it was at its birth; nevertheless, how its linguistic style has evolved is still relatively unknown. In this work, we study the evolution of various sociolinguistic aspects of Twitter over large time scales. To the best of our knowledge, this is the first comprehensive study of the evolution of such aspects of this OSN. We perform quantitative analysis both at the word level and on hashtags, since the hashtag is perhaps one of the most important linguistic units of this social media platform. We study the (in)formality of linguistic style in Twitter and find that it is neither fully formal nor completely informal: on one hand, out-of-vocabulary words are decreasing over time (pointing to a formal style); on the other hand, whitespace usage is being reduced, with a huge prevalence of run-together texts (pointing to an informal style). We also observe that Twitter texts follow Zipf's law, like natural language, and have a strong core-periphery structure, with words in the core hardly migrating over time. We perform similar linguistic studies on hashtags and observe that the core-periphery structure, Zipf's law and other linguistic quantities show similar behavior. We also analyze and propose quantitative reasons for the repetition and coalescing of hashtags in Twitter. Though repetition of the same word in natural language text is uncommon, hashtag repetition in Twitter is not so uncommon. People tend to repeat hashtags when expressing a strong opinion on some issue or event, and also to express excitement or happiness.
For example, we come across a tweet containing only the hashtag #snow, mainly expressing the user's strong feeling about, possibly, the current weather. In etymology, we come across words formed from other words sampled from the same or a different language. This linguistic phenomenon of word coalescing is not new, and we find many instances of it over the history of any language's evolution. For example, in English, 'milkman' is formed from 'milk' and 'man', and 'walkman' is the combination of 'walk' and 'man', with the meanings of the constituent words slightly modified by the coalescing. In Twitter too, we observe such merging phenomena, far more prevalent than in standard texts. Further, such mergings happen over very short timescales, compared to years or centuries in the case of natural languages. For example, #peopleschoice and #awards together form #peopleschoiceawards; #journals, #justinbieber and #book form #justinbieberjournalsbook; #oregonbelievemoviemeetup is formed from #oregon, #believemovie and #meetup.
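The Zipf's law observation above can be checked with a simple rank-frequency slope estimate: sort word frequencies by rank and fit a line in log-log space, where a slope near -1 suggests Zipf-like behavior. This is a minimal sketch, not the measurement procedure actually used in the study:

```python
import math
from collections import Counter

def zipf_slope(tokens):
    """Least-squares slope of log(frequency) vs log(rank).
    A slope near -1 indicates Zipf-like rank-frequency behavior."""
    freqs = sorted(Counter(tokens).values(), reverse=True)
    xs = [math.log(r) for r in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den
```

On a corpus whose frequencies fall off exactly as 1/rank, the estimated slope is -1.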

Analysis and prediction of question topic popularity in Quora

In this work, we consider a dataset spanning more than four years and analyze the dynamics of topical growth over time, as well as how various factors affect the popularity of a topic and its acceptance in the Q&A community. We propose a regression model to predict the popularity of topics and discuss the important discriminating features. We achieve high prediction accuracy (correlation coefficient ~0.773) with low root mean square error (~1.065). We further categorize the topics into a few broad classes by applying a simple Latent Dirichlet Allocation (LDA) model to the question texts associated with the topics. In comparison to the data sample with no categorization, this stratification of the topics enhances the prediction accuracies for several categories. However, for certain categories there seems to be a slight decrease in accuracy, and we present an in-depth discussion analyzing the cause and pointing out potential ways for improvement.
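The two evaluation metrics quoted above, the correlation coefficient between predicted and actual popularity and the root mean square error, are standard. For reference, a minimal implementation:

```python
import math

def pearson_r(pred, actual):
    """Pearson correlation coefficient between predictions and ground truth."""
    n = len(pred)
    mp, ma = sum(pred) / n, sum(actual) / n
    cov = sum((p - mp) * (a - ma) for p, a in zip(pred, actual))
    sp = math.sqrt(sum((p - mp) ** 2 for p in pred))
    sa = math.sqrt(sum((a - ma) ** 2 for a in actual))
    return cov / (sp * sa)

def rmse(pred, actual):
    """Root mean square error of the predictions."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(pred, actual)) / len(pred))
```

A correlation near 1 with a low RMSE indicates that the regression model tracks topic popularity closely.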

A stratified learning approach for predicting the popularity of Twitter Idioms

Twitter Idioms (#10ThingsAboutMe, #4WordsAfterABreakup, #ThingsMyMamaDo, #ICantForgetAboutYou, etc.) are one of the important types of hashtags that spread in Twitter. In this work, we propose a classifier that can separate the Idioms from other kinds of hashtags with 86.93% accuracy and high precision and recall. We then learn regression models on the stratified samples (Idioms and non-Idioms) separately to predict the popularity of the Idioms. This stratification not only allows us to make more accurate predictions by itself but also makes it possible to include Idiom-specific features that separately improve the accuracy for the Idioms. Experimental results show that such stratification during the training phase, followed by the inclusion of Idiom-specific features, leads to an overall improvement of ~19% in correlation coefficient over the baseline method.
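The stratified scheme described above, a classifier that routes each hashtag to a stratum-specific regression model, can be sketched as follows. All interfaces here are hypothetical placeholders for the actual trained models:

```python
class StratifiedPredictor:
    """Two-stage sketch of stratified learning: a classifier decides the
    stratum (Idiom vs. non-Idiom), then a stratum-specific regression
    model predicts popularity. Interfaces are hypothetical."""

    def __init__(self, is_idiom, idiom_model, other_model):
        self.is_idiom = is_idiom        # callable: hashtag -> bool
        self.idiom_model = idiom_model  # callable: hashtag -> popularity score
        self.other_model = other_model  # callable: hashtag -> popularity score

    def predict(self, hashtag):
        model = self.idiom_model if self.is_idiom(hashtag) else self.other_model
        return model(hashtag)
```

Training the two regressors on separate strata is what allows Idiom-specific features to enter only the Idiom model.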

Other projects

I have worked on a couple of inter-related computational linguistics problems. One of them is developing microscopic models that capture the essential features of language variation in society; on this project, I collaborated with Dr. Tyll Krueger, University of Bielefeld, Germany. The other is developing computational models to understand how word-meaning associations form in multi-agent scenarios and which factors influence this phenomenon. This work was done in active collaboration with Dr. Francesca Tria and Dr. Vittorio Loreto, ISI Foundation, Turin, Italy. Earlier, I worked for more than two years on opinion formation in social networks and its various aspects, which was the central theme of my MS research.

Part-time projects

  • Title : Developed a browser plug-in for Search Trail Suggestion at Yahoo Hack U '13. Duration : March '13.

  • Title : Analyzing the impact of group performance in Test Cricket (Term Project). Duration : February '13 - April '13.

  • Title : Developed a small-scale search engine capable of retrieving information for domain queries (Term Project). Duration : August '11 - October '11.