- Open Access
- Total Downloads : 269
- Authors : Prashanth Kannadaguli, Vidya Bhat
- Paper ID : IJERTV3IS100597
- Volume & Issue : Volume 03, Issue 10 (October 2014)
- Published (First Online): 27-10-2014
- ISSN (Online) : 2278-0181
- Publisher Name : IJERT
- License: This work is licensed under a Creative Commons Attribution 4.0 International License
Multivariate Gaussian Mixture Model based Automatic Phoneme Recognizer for Kannada
Prashanth Kannadaguli1, Vidya Bhat2
Department of Electronics and Communication Engineering, Manipal Institute of Technology, Manipal, India
Abstract—We build an automatic phoneme recognition system based on Gaussian Mixture Modeling (GMM), which is a static modeling scheme. Models were built using stochastic pattern recognition and acoustic phonetic schemes to recognize phonemes. Since our native language is Kannada, a rich South Indian language, we have used 15 Kannada phonemes to train and test these models. Mel Frequency Cepstral Coefficients (MFCC) are well-known acoustic features of speech [3][4]; hence we have used them for speech feature extraction. Finally, performance analysis of the models in terms of Phoneme Error Rate (PER) justifies the fact that, though static modeling yields good results, improvement is necessary in order to use it in developing Automatic Speech Recognition systems.
Keywords—Phoneme Modeling; GMM; Pattern Recognition; MFCC; PER; Kannada
I. INTRODUCTION
The Automatic Speech Recognition (ASR) system of any language must be able to recognize the spoken sentences, words, syllables and phonemes of that particular language [3]. Sentences consist of many utterances of different words, words are made up of many syllables, and each syllable is a meaningful utterance of phonemes. Hence it is clear that the phoneme is the smallest unit of speech, and it is necessary to first build a phoneme recognition system, which can later be used for syllable or word recognition, which in turn can be used for recognizing sentences, leading to a language model that works in controlled environments. Keeping this in mind, and in order to build a language model for Kannada, this work is our first approach to building a phoneme recognition system.

For phoneme recognition, several signal processing techniques have been proposed [1][2][5][6][9], which yield PERs in the range of 5% to 30%. The most successful results are for HMMs that use MFCC as speech features [5]. Since speech is a pseudo-random signal of quasi-periodic nature, we can also use stochastic analysis of its features for pattern recognition. Hence we have used Gaussian Mixture Modeling, which uses the Bayesian decision rule, also known as Maximum a Posteriori (MAP).

To demonstrate these concepts, we have built a database of 15 Kannada phonemes. Each phoneme is recorded 500 times for training and 200 times for testing, with a sampling rate of 8 kHz. While recording the phonemes, we recorded the same phoneme under different background noise, but using the same microphone and software tool. Hence we have 7500 phonemes in the training database and 3000 phonemes in the testing database. The training phase uses the mean and covariance of the MFCC of each phoneme to generate a probability density function using multivariate modeling. Given this model, in the testing phase we can estimate the likelihood that a test sample belongs to each of the 15 classes, and the class that gives the highest likelihood is the recognized phoneme.

II. WORKING OF GMM

The basic idea here is to develop a model that produces the most probable phoneme Q* when an acoustic observation sequence S is given as input. If Q_i is the i-th possible phoneme sequence, the conditional probability is evaluated over all possible phonemes, and Θ represents the parameters used to estimate the probability distribution, then the Bayesian or MAP decision rule can be given by [7]

    Q* = argmax_i P(Q_i | S, Θ)    (1)

Since each phoneme Q* can be realized in an infinite number of possible acoustic ways, it can be represented by its model, which yields

    M* = argmax_i P(M_i | S)    (2)

Here M* is the model of the phoneme data sequence that represents the linguistic message in the speech input S, M_i is the i-th possible phoneme data sequence, P(M_i | S) is the posterior probability of the phoneme data sequence given the acoustic input S, and the maximum is evaluated over all possible models. Now we can apply Bayes' rule as follows:

    P(M_i | S) = P(S | M_i) P(M_i) / P(S)    (3)
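In practice, since P(S) in (3) is common to all classes, the MAP rule reduces to comparing P(S | M_i) P(M_i) across classes, usually in log space for numerical stability. A minimal sketch (the scores and class count are illustrative, not from the paper):

```python
import numpy as np

def map_decision(log_likelihoods, log_priors):
    """MAP rule of Eqs. (1)-(3): pick the class i maximizing
    P(M_i | S) ∝ P(S | M_i) P(M_i). P(S) is a common normalizer
    and can be ignored for the argmax."""
    log_posteriors = np.asarray(log_likelihoods) + np.asarray(log_priors)
    return int(np.argmax(log_posteriors))

# Toy example: 3 candidate phoneme models with equal priors.
log_lik = [-12.3, -9.8, -15.1]           # log P(S | M_i) from each model
log_prior = np.log([1/3, 1/3, 1/3])      # log P(M_i)
print(map_decision(log_lik, log_prior))  # → 1
```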
III. METHODOLOGY
There are two phases in our work: training and testing.
A. Construction of Database
Though the ultimate goal is to develop a speaker-independent system, to start with we have decided to build a speaker-dependent system. So all the samples were recorded from the same native Kannada speaker, both for training and testing. Details of the database are shown in Table 1.
TABLE 1: DETAILS OF PHONEME DATABASE

| Unicode | Kannada Character | Number of Training Samples | Number of Testing Samples |
|---|---|---|---|
| 0C85 | ಅ | 500 | 200 |
| 0C87 | ಇ | 500 | 200 |
| 0C89 | ಉ | 500 | 200 |
| 0C8E | ಎ | 500 | 200 |
| 0C92 | ಒ | 500 | 200 |
| 0C950CBD | ಕಽ | 500 | 200 |
| 0C950CBF | ಕಿ | 500 | 200 |
| 0C950CC1 | ಕು | 500 | 200 |
| 0C950CC6 | ಕೆ | 500 | 200 |
| 0C950CCA | ಕೊ | 500 | 200 |
| 0C970CBD | ಗಽ | 500 | 200 |
| 0C970CBF | ಗಿ | 500 | 200 |
| 0C970CC1 | ಗು | 500 | 200 |
| 0C970CC6 | ಗೆ | 500 | 200 |
| 0C970CCA | ಗೊ | 500 | 200 |
B. Pre-processing
Since the recordings of speech samples were made in normal conditions with different background noise, it is absolutely necessary to isolate speech from noise, including end point detection of speech. We have used the method proposed in [8] for noise removal. Our database has different folders arranged by phoneme Unicode, inside which all corresponding phonemes are saved after pre-processing in .wav format.
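As a rough illustration only (the paper uses the statistical method of [8], not reproduced here), endpoint detection can be approximated with a short-time-energy threshold; all parameter values below are illustrative defaults:

```python
import numpy as np

def trim_silence(signal, fs, frame_ms=25, hop_ms=10, threshold_db=-35.0):
    """Simplified energy-based endpoint detection: keep the region
    between the first and last frame whose short-time energy exceeds
    a threshold relative to the loudest frame."""
    frame = int(fs * frame_ms / 1000)
    hop = int(fs * hop_ms / 1000)
    n_frames = max(1, 1 + (len(signal) - frame) // hop)
    energy = np.array([np.sum(signal[i * hop:i * hop + frame] ** 2)
                       for i in range(n_frames)])
    # Energy of each frame in dB relative to the loudest frame.
    energy_db = 10 * np.log10(energy / (energy.max() + 1e-12) + 1e-12)
    voiced = np.where(energy_db > threshold_db)[0]
    if len(voiced) == 0:
        return signal               # nothing above threshold; keep all
    start, end = voiced[0] * hop, voiced[-1] * hop + frame
    return signal[start:end]
```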
C. Feature Extraction
Mel Frequency Cepstral Coefficients were used as the acoustic phonetic features. MFCC extraction includes pre-emphasis, framing, windowing, computation of the Fast Fourier Transform (FFT), mel frequency warping, taking its logarithm, and finally computation of the Discrete Cosine Transform (DCT), as explained in [4][9]. The output of the DCT is 12-dimensional. For pictorial representation of phonemes, we have used the first two dimensions of the MFCC data. Such a plot for four phonemes is shown in Fig. 1, and it can be observed that the phonemes overlap seriously in the 2D vector space.
Fig.1: 2D scatter plot for four phonemes
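The extraction chain just described (pre-emphasis through DCT) can be sketched as follows; the frame length, hop, FFT size and filter count are typical defaults, not necessarily the paper's exact settings:

```python
import numpy as np
from scipy.fftpack import dct

def mfcc(signal, fs=8000, n_fft=256, frame_ms=25, hop_ms=10,
         n_mels=26, n_ceps=12):
    """MFCC sketch: pre-emphasis, framing, windowing, FFT,
    mel warping, log, DCT; keeps the first 12 coefficients."""
    # 1. Pre-emphasis: boost high frequencies.
    sig = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    # 2-3. Framing and Hamming windowing.
    frame, hop = int(fs * frame_ms / 1000), int(fs * hop_ms / 1000)
    n_frames = 1 + (len(sig) - frame) // hop
    frames = np.stack([sig[i * hop:i * hop + frame] for i in range(n_frames)])
    frames *= np.hamming(frame)
    # 4. Magnitude spectrum via FFT.
    spec = np.abs(np.fft.rfft(frames, n_fft))
    # 5. Mel filterbank: triangular filters equally spaced on the mel scale.
    mel = lambda f: 2595 * np.log10(1 + f / 700.0)
    imel = lambda m: 700 * (10 ** (m / 2595.0) - 1)
    pts = imel(np.linspace(mel(0), mel(fs / 2), n_mels + 2))
    bins = np.floor((n_fft + 1) * pts / fs).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    # 6-7. Log of mel energies, then DCT.
    logmel = np.log(spec @ fbank.T + 1e-10)
    return dct(logmel, type=2, axis=1, norm='ortho')[:, :n_ceps]
```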
D. Phoneme Recognition using GMM
To recognize an unknown phoneme from our testing database given its MFCC, we fit a multivariate Gaussian model for each class by calculating the mean and covariance matrix of the corresponding phoneme sequences. The mean and standard deviation ellipses of the multivariate processes shown in Fig. 1 are plotted in Fig. 2.
Fig.2: Mean and Standard deviation ellipse for the multivariate process in Fig.1.
Later we estimate the likelihood of the given test feature vector under the multivariate model of each class. We have used the standard Gaussian probability density function, which implicitly assumes that the MFCC vectors in each class have a uni-modal normal distribution; the fit returns estimates of the mean and covariance matrix of the Gaussian multivariate data samples. 2D plots of data samples of four phonemes using the first two MFCC values, and the equivalent 3D plots using Gaussian Mixture Modeling, are shown in Fig. 3.
Fig.3: 2D scatterplots and 3D Gaussian PDF plots for phonemes 0C85, 0C87, 0C89 and 0C8E
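The per-class fit described above amounts to one multivariate normal per phoneme. A toy sketch follows, using synthetic 2-D "MFCC" data and two Unicode class labels from Table 1 purely for illustration (not the paper's actual data):

```python
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical per-class training vectors (2-D stand-ins for MFCC).
rng = np.random.default_rng(0)
train = {
    "0C85": rng.normal([0, 0], 1.0, size=(500, 2)),
    "0C87": rng.normal([5, 5], 1.0, size=(500, 2)),
}

# One Gaussian per class, fit by sample mean and covariance.
models = {c: multivariate_normal(np.mean(X, axis=0),
                                 np.cov(X, rowvar=False))
          for c, X in train.items()}

def classify(x):
    # Pick the class whose Gaussian gives the highest log-likelihood.
    return max(models, key=lambda c: models[c].logpdf(x))

print(classify([4.8, 5.2]))  # → 0C87
```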
The Expectation Maximization (EM) algorithm is used as the main training function in GMM. EM tries to maximize the likelihood of the data with respect to the GMM parameters, such as the means and covariances. The estimation step is purely soft classification, wherein for each feature vector it calculates the probability of each class given that feature vector. In the maximization step, the mean and covariance of each class are updated using all features and their weights. The algorithm iterates over both steps as long as the total likelihood of the training data increases. During testing, we use the features of the unknown signal to estimate the likelihood of the sequences in the feature vector and obtain the posterior probabilities.
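The two EM steps can be sketched as follows. This is a minimal full-covariance implementation for illustration, with a simple deterministic initialization; it is not the authors' training code:

```python
import numpy as np

def gauss_pdf(X, mean, cov):
    """Multivariate normal density evaluated at each row of X."""
    d = X.shape[1]
    diff = X - mean
    inv = np.linalg.inv(cov)
    norm = np.sqrt(((2 * np.pi) ** d) * np.linalg.det(cov))
    return np.exp(-0.5 * np.einsum('ij,jk,ik->i', diff, inv, diff)) / norm

def em_gmm(X, k, n_iter=50):
    """Minimal EM for a k-component GMM on (n, d) data X."""
    n, d = X.shape
    means = X[np.linspace(0, n - 1, k).astype(int)].copy()  # spread init
    covs = np.stack([np.cov(X, rowvar=False)] * k)          # shared init
    weights = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        # E-step: soft assignment, resp[i, j] = P(class j | x_i).
        resp = np.stack([weights[j] * gauss_pdf(X, means[j], covs[j])
                         for j in range(k)], axis=1)
        resp /= np.maximum(resp.sum(axis=1, keepdims=True), 1e-300)
        # M-step: re-estimate weights, means, covariances.
        nk = resp.sum(axis=0)
        weights = nk / n
        means = (resp.T @ X) / nk[:, None]
        for j in range(k):
            diff = X - means[j]
            covs[j] = (resp[:, j, None] * diff).T @ diff / nk[j]
            covs[j] += 1e-6 * np.eye(d)                     # regularize
    return weights, means, covs
```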
IV. RESULTS AND DISCUSSIONS
The result analysis was done using the Phoneme Error Rate (PER), which is defined as the ratio of the number of phonemes misclassified to the total number of phonemes used for testing. The PER calculations are shown in Table 2.
TABLE 2: PER CALCULATIONS

| Unicode | Kannada Character | PER (%) |
|---|---|---|
| 0C85 | ಅ | 0 |
| 0C87 | ಇ | 0 |
| 0C89 | ಉ | 0 |
| 0C8E | ಎ | 0 |
| 0C92 | ಒ | 0 |
| 0C950CBD | ಕಽ | 1.6 |
| 0C950CBF | ಕಿ | 0 |
| 0C950CC1 | ಕು | 0.4 |
| 0C950CC6 | ಕೆ | 2.4 |
| 0C950CCA | ಕೊ | 5.6 |
| 0C970CBD | ಗಽ | 0 |
| 0C970CBF | ಗಿ | 2.2 |
| 0C970CC1 | ಗು | 6.2 |
| 0C970CC6 | ಗೆ | 0 |
| 0C970CCA | ಗೊ | 0 |
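The PER values in Table 2 follow the definition given earlier and are straightforward to compute; a small sketch with illustrative labels:

```python
def per(predicted, actual):
    """Phoneme Error Rate: misclassified phonemes as a percentage
    of the total number of test phonemes."""
    assert len(predicted) == len(actual)
    errors = sum(p != a for p, a in zip(predicted, actual))
    return 100.0 * errors / len(actual)

# e.g. 14 misclassifications out of 250 test samples:
print(per(["a"] * 236 + ["b"] * 14, ["a"] * 250))  # → 5.6
```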
Histograms of the outputs of the GMM based phoneme recognizer are shown in Fig. 4. It is clear that phonemes having similar PDFs or similar pronunciations lead to misclassification. The above results are consistent with the results obtained from other traditional methods [5], and they are better than those of the Bayesian phoneme recognizer [10].
Fig. 4: Histogram of the outputs of the GMM based phoneme recognizer, for samples of each of the fifteen possible input phonemes. The integer values along the x-axis refer to the index of the phoneme.
V. CONCLUSION
In this work, we presented Gaussian Mixture Modeling for phoneme recognition, a novel approach different from traditional methods. The results reveal that this method is suitable for building automatic phoneme recognition systems. This work can be further extended by including various acoustic phonetic features and by using Hidden Markov Modeling as a different approach to automatic phoneme recognition for the Kannada language.
ACKNOWLEDGMENTS
We thank everyone who supported us with valuable suggestions during this work.
REFERENCES

[1] R. K. Aggarwal and M. Dave, "Using Gaussian mixture for Hindi speech recognition system," International Journal of Speech Processing, Image Processing and Pattern Recognition, vol. 4, no. 4, December 2011.
[2] C. H. Lee, J. L. Gauvain, R. Pieraccini and L. R. Rabiner, "Large vocabulary speech recognition using subword units," Speech Communication, vol. 13, pp. 263-279, 1993.
[3] L. R. Rabiner and B. H. Juang, Fundamentals of Speech Recognition, 2nd Indian Reprint, Pearson Education, Delhi, 1993, pp. 103-455.
[4] Y. Lee and K. W. Hwang, "Selecting good speech features for recognition," ETRI, vol. 18, April 1996.
[5] S. Young, "The general use of tying in phoneme-based HMM speech recognition," in Proceedings of ICASSP, 1992, pp. 569-572.
[6] S. A. Zahorian, P. Silsbee, and X. Wang, "Phone classification with segmental features and a binary-pair partitioned neural network classifier," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP-97), 1997, pp. 1011-1014.
[7] T. Dutoit and F. Marques, Applied Signal Processing, Springer, 2008.
[8] G. Saha and Sandipan, "A New Silence Removal and Endpoint Detection Algorithm for Speech and Speaker Recognition Applications," Department of Electronics and Electrical Communication Engineering, Indian Institute of Technology Kharagpur, Kharagpur, India.
[9] M. A. Anusuya and S. K. Katti, "Mel Frequency Discrete Wavelet Coefficients for Kannada Speech Recognition using PCA," in Proceedings of the International Conference on Advances in Computer Science, 2010.
[10] P. Kannadaguli and V. Bhat, "Phoneme modeling for speech recognition in Kannada using Bayesian multivariate modeling," unpublished.