SHRUTI contains a total of 7383 unique sentences. The sentences are spoken by 34 speakers from the region of West Bengal where the standard colloquial Bengali language is spoken. Male and female speakers make up 75% and 25% of the speakers, respectively. Speaker ages in the corpus range from 20 to 40 years. A speaker's dialect region is the geographical area of West Bengal, India, where they lived during their childhood years.
Speaker | Levels | Male | Female
Age/gender | 16-30 (y) | 22 | 5
Age/gender | 31-40 (y) | 4 | 3
Education | Undergraduate | 2 | 0
Education | Graduate | 22 | 8
2. Corpus Details
The text material in the SHRUTI prompts (found in the file "shruti_train.transcription") consists of sentences designed at IITKGP. The text was collected from Anandabazar Patrika and story books. Articles from four major news domains (sports, politics, general news and geographical news) were collected to cover the phonetic variations of the Bengali language. The corpus was designed to be usable for a general ASR system; for this reason, the most commonly spoken sentences were also collected and recorded. All sentences were recorded by 34 speakers over two sessions. The phonetically compact sentences were designed to cover most of the frequently spoken words in the Bengali language. Each speaker read a variable number of sentences. The sentences were collected in ITRANS format. The corpus details are as follows:
Sentences 7383
Phoneme 49
Total Words 22012
Total Duration 21.64 hours
3. Directory and File Structure
We have used the corpus with the CMU SPHINX speech recognition toolkit. The speech and associated data are organized according to the following hierarchy:
/<CORPUS>/<SEX><SPEAKER>/<SPEAKER_ID>/<SENTENCE_ID>.<FILE_TYPE>
Directory “wav”:
Contents:
Sound files of the male and female speakers
Text file of each speaker
Structure:
Where
CORPUS :== SHRUTI
SEX :== MALE/FEMALE
SPEAKER :== Speaker name
SPEAKER_ID :== <First name><Sentence number>
SENTENCE_ID :== <SENTENCE_NUMBER>
FILE_TYPE :== .WAV
SENTENCE_NUMBER :== 0 ... 2342
Example: /SHRUTI/WAV/MALE/bd/Bd_0.wav
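As an illustration of the naming scheme, the following minimal Python sketch splits a path into its fields. It assumes the layout of the example path /SHRUTI/WAV/MALE/bd/Bd_0.wav (corpus, data directory, sex, speaker directory, file name); the function name parse_shruti_path and the returned key names are hypothetical and are not part of the corpus distribution.

```python
from pathlib import PurePosixPath


def parse_shruti_path(path):
    """Split a path shaped like /SHRUTI/WAV/MALE/bd/Bd_0.wav into its fields.

    Assumption: the last five components are corpus, data directory, sex,
    speaker directory and file name, as in the example path.
    """
    parts = PurePosixPath(path).parts
    if len(parts) < 5:
        raise ValueError(f"unexpected path layout: {path!r}")
    corpus, data_dir, sex, speaker, filename = parts[-5:]

    # "Bd_0.wav" -> stem "Bd_0", extension "wav"
    stem, dot, extension = filename.rpartition(".")
    # "Bd_0" -> name "Bd", sentence number "0"
    name, underscore, number = stem.rpartition("_")
    if not dot or not underscore or not number.isdigit():
        raise ValueError(f"unexpected file name: {filename!r}")

    return {
        "corpus": corpus,
        "data_dir": data_dir,
        "sex": sex,
        "speaker": speaker,
        "utterance_id": stem,
        "sentence_number": int(number),
        "file_type": extension.lower(),
    }


info = parse_shruti_path("/SHRUTI/WAV/MALE/bd/Bd_0.wav")
print(info["sex"], info["speaker"], info["sentence_number"])
```

If the actual directory layout differs from the example path, only the unpacking of the last five components needs to change.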
4. File Types
The SHRUTI corpus includes several files associated with each utterance. In addition to a speech waveform file (.wav), one associated transcription file (.txt) exists.
File Type | Description
.wav | MS WAV speech waveform file.
.txt | Associated transcription of the words the person said.
.dic | The phonetic dictionary of the corresponding words.
Directory “etc”:
Contents:
1. shruti.dic - Pronunciation dictionary
2. shruti_train.transcription - transcript file
3. shruti.phoneme - phoneme list
4. shruti_train.fileids - control file containing the list of sound files
5. sphinx_train.cfg - configuration file for SPHINX training
Examples of “shruti.dic”:
Abola A b o l
Abu A b u
abyAhata aa b b A h aa tt o
abyAhata(2) o b b A h aa tt o
abyAhati aa b b A h o tt i
abyAhati(2) o b b A h o tt I
Examples of “shruti_train.transcription”:
<s> Ami satyi satyii oi jagadIsha jeThamAlAni bhadralokera kono khabarA`` khabara` rAkhi nA </s> (Bd_00)
<s> bombAi dillI mAdrAja yekhAnei tini thAkuna tAte AmAra` kichhu yAYa Ase nA </s> (Bd_01)
<s> sulekhA balalo ora chokhera maNi bhAganeTira ki halo </s> (Bd_02)
<s> bhadraloka to Age khuba YyAkaTibha chhilena </s> (Bd_03)
<s> AmAke ebAra satya kathA` sbIkAra karate halo </s> (Bd_04)
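The dictionary and transcription entries shown above can be read with a few lines of Python. This is a minimal sketch that assumes one entry per line, with alternate pronunciations marked by a numeric suffix such as (2) and the utterance ID given in parentheses after </s>; the function names parse_dic_line and parse_transcription_line are hypothetical helpers, not part of the corpus or of the SPHINX toolkit.

```python
import re

# "<s> word word ... </s> (utterance_id)"
_TRANSCRIPT_RE = re.compile(r"^<s>\s*(.*?)\s*</s>\s*\(([^)]+)\)\s*$")
# "word(2)" marks an alternate pronunciation of "word"
_VARIANT_RE = re.compile(r"^(.*)\((\d+)\)$")


def parse_transcription_line(line):
    """Return (utterance_id, list_of_words) for one transcription line."""
    match = _TRANSCRIPT_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a transcription line: {line!r}")
    text, utterance_id = match.groups()
    return utterance_id, text.split()


def parse_dic_line(line):
    """Return (word, variant_number, list_of_phonemes) for one dictionary line."""
    fields = line.split()
    if len(fields) < 2:
        raise ValueError(f"not a dictionary line: {line!r}")
    entry, phonemes = fields[0], fields[1:]
    match = _VARIANT_RE.match(entry)
    if match:
        word, variant = match.group(1), int(match.group(2))
    else:
        # Entries without a suffix are treated as the first pronunciation.
        word, variant = entry, 1
    return word, variant, phonemes


utt_id, words = parse_transcription_line(
    "<s> sulekhA balalo ora chokhera maNi bhAganeTira ki halo </s> (Bd_02)"
)
word, variant, phonemes = parse_dic_line("abyAhata(2) o b b A h aa tt o")
print(utt_id, len(words), word, variant, phonemes)
```

Combining the two parsers makes it possible to check, for example, that every word in a transcription has at least one entry in the dictionary.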
Publications Related to the Corpus:
1. Biswajit Das; Sandipan Mandal; Pabitra Mitra, "Bengali speech corpus for continuous automatic speech recognition system," Proc. Conf. Speech Database and Assessments (Oriental COCOSDA), pp. 51-55, Taiwan, 2011
2. Sandipan Mandal; Biswajit Das; Pabitra Mitra; Anupam Basu, "Developing Bengali Speech Corpus for Phone Recognizer Using Optimum Text Selection Technique," Proc. Conf. Asian Language Processing (IALP), pp. 268-271, 2011
Contributors:
Biswajit Das (Contact)
Communication Empowerment Lab.
Department of Computer Science and Engineering
Indian Institute of Technology, Kharagpur
email: biswajit.net@gmail.com
Sandipan Mandal
Communication Empowerment Lab.
Department of Computer Science and Engineering
Indian Institute of Technology Kharagpur
email: mandal.sandipan@gmail.com
Kajal Das
Communication Empowerment Lab.
Indian Institute of Technology Kharagpur
email: kdas690@gmail.com
Pabitra Mitra (Contact)
Department of Computer Science and Engineering
Indian Institute of Technology, Kharagpur
email: pabitra@gmail.com
Anupam Basu
Department of Computer Science and Engineering
Indian Institute of Technology, Kharagpur
email: anupambas@gmail.com
Acknowledgements:
Suparna Das
Center for Artificial Intelligence and Robotics
Bangalore, India
email: sparnakdas@gmail.com
Anupam Mandal
Center for Artificial Intelligence and Robotics
Bangalore, India
email: anupam_405@yahoo.com