SHRUTI Bengali Continuous ASR Speech Corpus


Version I

Download Corpus

(Distributed free through the Society for Natural Language Technology Research)


SHRUTI is a read speech corpus designed to provide speech data for the acquisition of acoustic-phonetic knowledge and for the development and evaluation of automatic speech recognition systems. The text corpus was designed by the Communication Empowerment Lab, Indian Institute of Technology, Kharagpur (IITKGP), in collaboration with Media Lab Asia. The speech was recorded, transcribed, and verified at IITKGP, where it is also maintained.


1. Corpus Speaker Distribution

SHRUTI contains a total of 7383 unique sentences. The sentences are spoken by 34 speakers from the region of West Bengal where standard colloquial Bengali is spoken. Male and female speakers make up 75% and 25% of the speakers, respectively. Speaker ages in the corpus range from 20 to 40 years. A speaker's dialect region is the geographical area of West Bengal, India, where they lived during their childhood years.



Speaker       Levels          Male    Female

Age/gender    16-30 (y)       22      5
              31-40 (y)       4       3

Education     Undergraduate   2       0
              Graduate        22      8

2. Corpus Details

The text material in the SHRUTI prompts (found in the file "shruti_train.transcription") consists of sentences designed at IITKGP. The text was collected from Anandabazar Patrika and from story books. Articles from four major news domains were collected to cover the phonetic variation of the Bengali language: sports, politics, general news, and geographical news. To make the corpus usable for a general ASR system, the most commonly spoken sentences were also collected and recorded. In total, the sentences were recorded by 34 speakers in two sessions. The phonetically compact sentences were designed to cover most of the frequently spoken words in Bengali. Each speaker read a variable number of sentences. Sentences were collected in ITRANS format. The corpus details are as follows:

Sentences        7383
Phonemes         49
Total Words      22012
Total Duration   21.64 hours
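
As a rough sanity check on the figures above, the following Python sketch derives two averages from them. It assumes that each of the 7383 sentences corresponds to exactly one recording and that the 21.64 hours are spread over all 34 speakers; neither assumption is stated in the corpus description, and the function names are illustrative only.

```python
# Figures copied from the corpus details above.
TOTAL_SENTENCES = 7383
TOTAL_DURATION_HOURS = 21.64
NUM_SPEAKERS = 34


def mean_utterance_seconds(total_hours, n_utterances):
    """Average recording length in seconds, assuming one recording per sentence."""
    return total_hours * 3600.0 / n_utterances


def mean_speaker_minutes(total_hours, n_speakers):
    """Average amount of speech per speaker in minutes."""
    return total_hours * 60.0 / n_speakers


if __name__ == "__main__":
    per_utt = mean_utterance_seconds(TOTAL_DURATION_HOURS, TOTAL_SENTENCES)
    per_spk = mean_speaker_minutes(TOTAL_DURATION_HOURS, NUM_SPEAKERS)
    print(f"mean utterance length: {per_utt:.2f} s")
    print(f"mean speech per speaker: {per_spk:.1f} min")
```

Because each speaker read a variable number of sentences, the per-speaker figure is only an average and says nothing about how the speech is distributed across individual speakers.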


3. Directory and File Structure


The corpus has been used with the CMU Sphinx speech recognition toolkit. The speech and associated data are organized according to the following hierarchy:

/<CORPUS>/<SEX><SPEAKER>/<SPEAKER_ID>/<SENTENCE_ID>.<FILE_TYPE>

Directory wav:

Contents:

      1. Sound files of male and female speakers

      2. Text file for each speaker

Structure:


Where CORPUS          :== SHRUTI
      SEX             :== MALE/FEMALE
      SPEAKER         :== Speaker name
      SPEAKER_ID      :== <First name><Sentence number>
      SENTENCE_ID     :== <SENTENCE_NUMBER>
      FILE_TYPE       :== .WAV
      SENTENCE_NUMBER :== 0 ... 2342

Examples:
     /SHRUTI/WAV/MALE/bd/Bd_0.wav
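
The example path above can be split into its components with a few lines of Python. This is a minimal sketch that assumes every sound file follows the layout of that single example (corpus, a WAV directory, sex, speaker directory, then `<name>_<number>.wav`); the function name `parse_wav_path` and the returned keys are hypothetical and not part of the corpus distribution.

```python
from pathlib import PurePosixPath


def parse_wav_path(path):
    """Split a path shaped like /SHRUTI/WAV/MALE/bd/Bd_0.wav into named parts."""
    p = PurePosixPath(path)
    parts = p.parts
    if len(parts) < 5 or p.suffix.lower() != ".wav":
        raise ValueError(f"unexpected path layout: {path}")
    # The last five components are corpus, wav directory, sex, speaker, file name.
    corpus, _wav_dir, sex, speaker = parts[-5:-1]
    # The file stem is assumed to be <name>_<sentence number>, as in "Bd_0".
    name, _, number = p.stem.rpartition("_")
    if not name or not number.isdigit():
        raise ValueError(f"unexpected file name: {p.name}")
    return {
        "corpus": corpus,
        "sex": sex,
        "speaker": speaker,
        "utterance": p.stem,
        "sentence_number": int(number),
    }


if __name__ == "__main__":
    print(parse_wav_path("/SHRUTI/WAV/MALE/bd/Bd_0.wav"))
```

Raising an error on unexpected layouts, rather than guessing, makes it easier to spot files that do not follow the convention of the example.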

4. File Types

The SHRUTI corpus includes several files associated with each utterance. In addition to a speech waveform file (.wav), one associated transcription file (.txt) exists.

File Type    Description
.wav         MS WAV speech waveform file.
.txt         Associated transcription of the words the person said.
.dic         The phonetic dictionary of the corresponding words.

Directory etc:

Contents:
	1. shruti.dic  -  pronunciation dictionary
	2. shruti_train.transcription  -  transcript file
	3. shruti.phoneme  -  phoneme list
	4. shruti_train.fileids  -  control file containing the list of sound files
	5. sphinx_train.cfg  -  configuration file for Sphinx training
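
In SphinxTrain, a fileids control file conventionally lists one utterance per line as a path relative to the wav directory, without the file extension. The Python sketch below derives such entries from sound file paths. Whether the shipped shruti_train.fileids uses exactly this layout is an assumption, and the helper name `to_fileid` and the wav root `/SHRUTI/WAV` (taken from the example path in section 3) are illustrative.

```python
from pathlib import PurePosixPath


def to_fileid(wav_path, wav_root):
    """Return the path relative to wav_root with the extension removed."""
    rel = PurePosixPath(wav_path).relative_to(wav_root)
    return str(rel.with_suffix(""))


def build_fileids(wav_paths, wav_root):
    """Produce sorted control-file lines, one per sound file."""
    return sorted(to_fileid(p, wav_root) for p in wav_paths)


if __name__ == "__main__":
    # Hypothetical input; only the first path appears in the corpus description.
    paths = ["/SHRUTI/WAV/MALE/bd/Bd_0.wav", "/SHRUTI/WAV/MALE/bd/Bd_1.wav"]
    print("\n".join(build_fileids(paths, "/SHRUTI/WAV")))
```

The utterance IDs in the transcription file need to line up with these entries, so it is worth comparing the generated list against the distributed control file before training.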

Examples of shruti.dic:

	Abola	A b o l
	Abu	A b u
	abyAhata	aa b b A h aa tt o
	abyAhata(2)	o b b A h aa tt o
	abyAhati	aa b b A h o tt i
	abyAhati(2)	o b b A h o tt I
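
The entries above follow the usual CMU Sphinx dictionary convention: a word, whitespace, then its phone sequence, with alternate pronunciations marked by a numeric suffix such as (2). A minimal Python sketch for reading such lines is given below; the function name `parse_dic` is hypothetical, and the sample lines are copied from the excerpt above.

```python
import re
from collections import defaultdict

# Matches an alternate-pronunciation marker such as "abyAhata(2)".
_ALT = re.compile(r"^(?P<word>.+?)\((?P<n>\d+)\)$")


def parse_dic(lines):
    """Map each word to the list of its pronunciations (each a list of phones)."""
    prons = defaultdict(list)
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        head, phones = fields[0], fields[1:]
        match = _ALT.match(head)
        word = match.group("word") if match else head
        prons[word].append(phones)
    return dict(prons)


SAMPLE = [
    "Abola\tA b o l",
    "Abu\tA b u",
    "abyAhata\taa b b A h aa tt o",
    "abyAhata(2)\to b b A h aa tt o",
]

if __name__ == "__main__":
    for word, variants in parse_dic(SAMPLE).items():
        print(word, len(variants), "pronunciation(s)")
```

Grouping the variants under the base word keeps lookups simple when checking that every transcribed word has at least one pronunciation.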


Examples of shruti_train.transcription:

<s> Ami satyi satyii oi jagadIsha jeThamAlAni bhadralokera kono khabarA`` khabara` rAkhi nA </s> (Bd_00)

<s> bombAi dillI mAdrAja yekhAnei tini thAkuna tAte AmAra` kichhu yAYa Ase nA </s> (Bd_01)

<s> sulekhA balalo ora chokhera maNi bhAganeTira ki halo </s> (Bd_02)

<s> bhadraloka to Age khuba YyAkaTibha chhilena </s> (Bd_03)

<s> AmAke ebAra satya kathA` sbIkAra karate halo </s> (Bd_04)
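
Each transcription line above wraps the ITRANS text in `<s> ... </s>` and ends with the utterance ID in parentheses. The following Python sketch splits such a line into the ID and its word list. It is an illustration rather than the toolkit's own parser, and the name `parse_transcription_line` is hypothetical; the pattern also tolerates a missing space before the parenthesis.

```python
import re

# <s> words ... </s> (utterance_id), with optional whitespace before the ID.
_LINE = re.compile(r"^<s>\s*(?P<text>.*?)\s*</s>\s*\((?P<utt>[^()\s]+)\)$")


def parse_transcription_line(line):
    """Return (utterance_id, list_of_words) for one transcription line."""
    match = _LINE.match(line.strip())
    if match is None:
        raise ValueError(f"not a transcription line: {line!r}")
    return match.group("utt"), match.group("text").split()


if __name__ == "__main__":
    utt, words = parse_transcription_line(
        "<s> bhadraloka to Age khuba YyAkaTibha chhilena </s> (Bd_03)"
    )
    print(utt, len(words), words)
```

The word lists returned here can be compared against the dictionary entries to find words that lack a pronunciation.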


Publications Related to the Corpus:

1. Biswajit Das, Sandipan Mandal, Pabitra Mitra, "Bengali speech corpus for continuous automatic speech recognition system," Proc. Conf. Speech Database and Assessments (Oriental COCOSDA), pp. 51-55, Taiwan, 2011.

2. Sandipan Mandal, Biswajit Das, Pabitra Mitra, Anupam Basu, "Developing Bengali Speech Corpus for Phone Recognizer Using Optimum Text Selection Technique," Proc. Conf. Asian Language Processing (IALP), pp. 268-271, 2011.


Contributors:

Biswajit Das (Contact)
Communication Empowerment Lab.
Department of Computer Science and Engineering
Indian Institute of Technology, Kharagpur
email: biswajit.net@gmail.com 

Sandipan Mandal
Communication Empowerment Lab.
Department of Computer Science and Engineering
Indian Institute of Technology Kharagpur
email: mandal.sandipan@gmail.com

Kajal Das
Communication Empowerment Lab.
Indian Institute of Technology Kharagpur
email: kdas690@gmail.com

Pabitra Mitra (Contact)
Department of Computer Science and Engineering
Indian Institute of Technology, Kharagpur
email: pabitra@gmail.com

Anupam Basu
Department of Computer Science and Engineering
Indian Institute of Technology, Kharagpur
email: anupambas@gmail.com

Acknowledgements:

Suparna Das
Center for Artificial Intelligence and Robotics
Bangalore, India
email: sparnakdas@gmail.com

Anupam Mandal
Center for Artificial Intelligence and Robotics
Bangalore, India
email: anupam_405@yahoo.com