Students - ongoing (PhD / MS)

PhD Students


Pradeep R

Area of Research : Music Signal Processing

Research objective:

Email : pradeeprengaswamy@gmail.com


Tanumay Mandal

Area of Research : Biomedical Signal Processing

Research objective:

Email : tanum.dets@gmail.com


Sutapa Bhattacharya

Area of Research : Speech Processing

Research objective: Human speech is a non-linear and non-stationary signal. Apart from the content of the spoken message, the speech signal is characterized by various speaker-dependent features such as age, gender, accent, emotion, and the presence of ailments. Moreover, speech produced by humans can be corrupted by noise present in the ambience as well as in the recording setup. Current state-of-the-art systems treat the speech signal as linear and stationary over small intervals and extract features representing each of these short frames. MFCC, LPC and PLP are the most widely used features in speech processing tasks such as speech recognition, speaker identification and emotion recognition. However, conventional feature extraction methods do not always provide the best discriminating features for these recognition tasks. In the recent past, time-frequency domain features of the speech signal have been explored, both in lieu of and alongside conventional features, to obtain the best performance in speech processing systems. Time-frequency analysis involves decomposing the speech signal into a finite number of narrowband components that add up to the original signal. My research objective is to explore various decomposition methods in order to retain the narrowband components that mostly carry the spoken content of the message and to discard the components containing mostly noise and speaker-dependent information. Speech decomposed in this way is desirable for speaker-independent, noise-robust speech recognition.
Two methods used extensively in the recent past for time-frequency analysis of non-linear, non-stationary signals are empirical mode decomposition (EMD) and variational mode decomposition (VMD). EMD is a completely data-dependent algorithm, whereas VMD needs certain parameters to be supplied by the user prior to decomposition, so the user needs some prior knowledge of the nature of the signal to decompose it into physically significant components. On the other hand, EMD is not well defined mathematically, is less noise-robust than VMD, and suffers from a problem known as mode mixing. Along with using VMD for meaningful decomposition of speech, my goal is to modify its algorithm so that it becomes completely data-dependent like EMD and does not rely on the user providing the correct input arguments.
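
A minimal Python sketch of the two decompositions, assuming the third-party packages PyEMD and vmdpy; the signal, parameter values and frame length are illustrative choices, not part of the thesis work:

import numpy as np
from PyEMD import EMD
from vmdpy import VMD

fs = 16000
t = np.arange(3200) / fs                      # a 200 ms frame
# Toy stand-in for a speech frame: two tones plus noise.
x = np.sin(2 * np.pi * 200 * t) + 0.5 * np.sin(2 * np.pi * 1800 * t)
x = x + 0.1 * np.random.randn(t.size)

# EMD: fully data-driven, nothing for the user to choose.
imfs = EMD().emd(x)
print("EMD produced", imfs.shape[0], "IMFs")

# VMD: the user must fix K (number of modes) and alpha (bandwidth
# penalty) beforehand; this is the prior knowledge referred to above.
K, alpha, tau, DC, init, tol = 3, 2000, 0.0, 0, 1, 1e-7
u, u_hat, omega = VMD(x, alpha, tau, K, DC, init, tol)
print("VMD produced", u.shape[0], "modes with normalized center",
      "frequencies", omega[-1])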

Email : bsutapaece@gmail.com


Kiran Reddy M

Area of Research : Speech Processing

Research objective:

Email : kiran.reddy889@gmail.com


Pradeep R

Area of Research : Speech Processing

Research objective:

Email : rangan.pradeep@gmail.com


Saikat Biswas

Area of Research : Bioinformatics

Research objective:

Email : s.sin443@gmail.com


Kishore Kumar R

Area of Research : Speech Processing

Research objective: A massive volume of audio data piles up every day from sources such as news channels, entertainment and education. Organizing these data and retrieving the audio content relevant to a spoken query remains a challenging task. My objective is to segregate an entire speech corpus into meaningful groups at a broader semantic level. Matches of standard domain keywords between pairs of speech utterances are discovered in order to cluster the utterances. Given the clusters, retrieval is carried out by detecting the keywords that match the queried audio, and the speech utterances associated with those keywords are retrieved. The final clusters may represent broad classes of information such as politics, sports and weather.
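
A toy Python sketch of the idea, with hard-coded keyword sets standing in for the output of a spoken-term detection front end; all utterance IDs, keywords and the threshold are hypothetical:

from collections import defaultdict

utterances = {                       # hypothetical detected keywords
    "utt1": {"election", "parliament", "minister"},
    "utt2": {"cricket", "wicket", "innings"},
    "utt3": {"minister", "policy", "parliament"},
    "utt4": {"rainfall", "temperature", "forecast"},
}

def jaccard(a, b):
    return len(a & b) / len(a | b)

# Single-link clustering: connect utterances whose keyword overlap
# exceeds a threshold, then take connected components (union-find).
THRESHOLD = 0.3
parent = {u: u for u in utterances}
def find(u):
    while parent[u] != u:
        u = parent[u]
    return u
for u in utterances:
    for v in utterances:
        if u < v and jaccard(utterances[u], utterances[v]) >= THRESHOLD:
            parent[find(v)] = find(u)

clusters = defaultdict(list)
for u in utterances:
    clusters[find(u)].append(u)
print(dict(clusters))                # utt1 and utt3 group together

# Retrieval: inverted index from keyword to utterances.
index = defaultdict(set)
for u, kws in utterances.items():
    for kw in kws:
        index[kw].add(u)
print(index["parliament"])           # -> {'utt1', 'utt3'}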

Email : rskishorekumar@gmail.com


Kumud Tripathi

Area of Research : Speech Processing

Research objective: Speech is the natural mode of communication for humans. Therefore, researchers have explored automatic speech recognition (ASR) systems to provide natural interaction between humans and machines. As the demand for speech recognition in multiple languages grows, the development of multilingual ASR systems, which combine the phonetic units of all the languages to be recognized into a single global acoustic model set, is of increasing importance. Traditional ASR systems are developed for the read mode of speech. However, speech can be broadly classified into three modes: read, extempore and conversation. The performance of ASR degrades when the input utterance belongs to a different mode, because the acoustic and linguistic characteristics of speech signals vary across modes. Therefore, my research objective is to develop a framework for automatically recognizing the phonetic units present in a speech utterance of any language spoken in any mode.

Email : kumudtripathi.cs@gmail.com


Debopriyo Banerjee

Area of Research : Data Analytics

Research objective:

Email : deb.ban89@gmail.com


Abhijit Debnath

Area of Research : Multi-modal lecture video analysis

Research objective: With the advent of internet technologies and the popularity of Massive Open Online Courses (MOOCs), a large number of educational videos in the form of e-learning courses are hosted at web portals such as edX, Coursera, Udemy and NPTEL, where courses from diverse domains are available on the internet. This has truly democratised learning and has extended quality education to the masses beyond the ambit of traditional classrooms. However, in most video sharing platforms such as YouTube, Vimeo and Dailymotion, video search is based on textual metadata. This metadata is limited and is generally entered by humans; such manual entry is error-prone, laborious and expensive. Moreover, finding a specific point of interest requires a search engine that can analyze the content of a video and automatically identify important semantic segments and keywords. It is therefore desirable to generate metadata automatically for the indexing and retrieval of lecture videos.
The overall objective of the work is to perform automatic semantic segmentation of lecture videos and spoken word recognition by exploiting audio-visual features from the video file. With the video lectures semantically segmented and the spoken words recognized from the audio tracks, a system is proposed that facilitates indexing and retrieval of video lectures, with a utility to perform text-based search at two levels: segment level and word level. To address these issues, deep learning techniques are used for the semantic segmentation of video lectures, and forced alignment techniques are explored for spoken word recognition. Following this multi-modal analysis, a web-based system is proposed that performs the indexing and retrieval.
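
A small Python sketch of such a two-level (segment and word) search index, assuming the segmentation and forced-alignment stages have already produced their outputs; all titles, words and timestamps below are hypothetical:

from collections import defaultdict

segments = [  # (video_id, segment_title, start_sec, end_sec)
    ("lec01", "fourier transform basics", 0, 540),
    ("lec01", "sampling theorem", 540, 1310),
]
word_alignments = [  # (video_id, word, time_sec) from forced alignment
    ("lec01", "fourier", 12.4), ("lec01", "aliasing", 602.8),
]

seg_index, word_index = defaultdict(list), defaultdict(list)
for vid, title, start, end in segments:
    for tok in title.split():
        seg_index[tok].append((vid, start, end))
for vid, word, t in word_alignments:
    word_index[word].append((vid, t))

def search(query):
    """Return segment-level hits first, then exact word-level hits."""
    q = query.lower()
    return {"segments": seg_index.get(q, []), "words": word_index.get(q, [])}

print(search("fourier"))   # segment hit plus a word hit with its timestamp
print(search("aliasing"))  # word-level hit only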

Email : dnabhijit@gmail.com


Gurunath Reddy M

Area of Research : Music Signal Processing

Research objective:

Email : mgurunathreddy@gmail.com


Haque Arijul

Area of Research : Speech Processing

Research objective: In the evolving field of human–computer interaction (HCI), there are many modes by which humans communicate with computers; speech, text, and GUI-based interaction using a mouse or touchscreen are the most common. Among these, speech is one of the most intuitive and natural modes of communication. Embedded in speech is a very important aspect of the message one intends to convey: emotion. For more natural HCI through speech, this emotional aspect needs to be incorporated into machines. Computers should be able both to understand the emotion in speech conveyed by a human and to generate speech in response that matches the message and the identified emotion. This requires machines to be able to recognize emotions from human speech.
My work focuses on this aspect, i.e., identifying emotions automatically from human speech. There has been much previous work in this area, and it is still an active topic of research. Many common signal processing techniques have been explored to extract features from emotional speech, and many pattern recognition algorithms have been tried out. However, the use of deep neural networks (DNNs), which have the potential to derive features directly from raw speech itself, has not yet been adequately explored in this area. This is a relatively new paradigm that needs further exploration. Therefore, I am employing different types of DNNs with various configurations to identify emotions from speech, as illustrated below. Also, speech is not always noise-free, so the robustness of these techniques to noise will also be investigated. In addition, an analysis of some important emotions (such as happiness, anger and sadness) will be attempted to find out which features of speech contribute most to each emotion.
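
An illustrative PyTorch sketch of the raw-waveform idea: a small 1-D convolutional network mapping a fixed-length speech clip directly to emotion classes, learning its own features instead of using hand-crafted ones. The architecture, layer sizes and emotion set are assumptions for illustration, not a reported model:

import torch
import torch.nn as nn

class RawSpeechEmotionCNN(nn.Module):
    def __init__(self, n_emotions=4):          # e.g. happy/angry/sad/neutral
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=80, stride=4), nn.ReLU(),
            nn.MaxPool1d(4),
            nn.Conv1d(16, 32, kernel_size=3), nn.ReLU(),
            nn.MaxPool1d(4),
            nn.AdaptiveAvgPool1d(1),            # global pooling -> fixed size
        )
        self.classifier = nn.Linear(32, n_emotions)

    def forward(self, x):                       # x: (batch, 1, samples)
        h = self.features(x).squeeze(-1)
        return self.classifier(h)               # emotion logits

model = RawSpeechEmotionCNN()
wave = torch.randn(8, 1, 16000)                 # eight fake 1 s clips at 16 kHz
print(model(wave).shape)                        # torch.Size([8, 4])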

Email : rjlhq05@gmail.com


Priya Dharshini G

Area of Research : Speech Processing

Research objective:

Email : priyagdarshi@gmail.com


Soumya Majumdar

Area of Research : Speech Processing

Research objective: Video is a visual multimedia source that combines a sequence of images into a moving picture. There are different types of videos: TV series, movies, music, educational, sports, etc. Nowadays people have less time on their hands and want an abridged video, so research on video summarization has gained momentum, with both academia and industry actively working on it. Video summarization has two kinds of approaches: static video summarization (the key-frame based approach) and dynamic video summarization (video skimming). The static approach produces a set of key-frames as a summary of the given video, identifying distinct frames and using them to prepare the summary. The dynamic approach produces a short video containing the key events of the given video, using semantic content such as color, texture and motion.
My objective is to develop a model that automatically generates a dynamic summary of any video. Developing this model requires a few steps: shot detection (sketched below), detection of the semantic connections between shots based on video text and audio, evaluation of the semantic importance of each shot within the video, and assembly of the summary from shots selected by semantic importance while preserving the semantic connections between them.
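
A minimal Python sketch of the shot-detection step, assuming OpenCV: consecutive frames are compared by color-histogram correlation, and a sharp drop marks a shot boundary. The threshold value and the file name "input.mp4" are illustrative:

import cv2

def detect_shots(path, threshold=0.6):
    cap = cv2.VideoCapture(path)
    boundaries, prev_hist, frame_no = [], None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
        hist = cv2.calcHist([hsv], [0, 1], None, [50, 60],
                            [0, 180, 0, 256])
        cv2.normalize(hist, hist)
        if prev_hist is not None:
            sim = cv2.compareHist(prev_hist, hist, cv2.HISTCMP_CORREL)
            if sim < threshold:              # abrupt change in content
                boundaries.append(frame_no)
        prev_hist, frame_no = hist, frame_no + 1
    cap.release()
    return boundaries

print(detect_shots("input.mp4"))             # frame indices of the cuts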

Email : soumya.majumdar92@gmail.com


Aravinda Reddy PN

Area of Research : Speech Processing

Research objective:

Email : aravindareddy.27@gmail.com

Madhu Keerthana Y

Area of Research : Speech Processing

Research objective:

Email : madhu.keerthu@gmail.com

Sudhakar Pandiarajan

Area of Research : Speech Processing

Research objective: Speech is one of the most natural and easiest modes of communication between human beings. Although speech communication is natural, the devices that can respond to a spoken query are limited. Many researchers are working on Automatic Speech Recognition (ASR) and Text-to-Speech (TTS) synthesis systems to bridge this communication gap. As part of the speech processing group, my focus is on developing TTS systems for Indian languages (low-resource languages) that seamlessly produce intelligible, human-like speech from input text.

Email : sudhakar.asp@gmail.com