- Open Access
- Authors : Anjan Kumar Sahu, Gyana Ranjan Mati
- Paper ID : IJERTV5IS080466
- Volume & Issue : Volume 05, Issue 08 (August 2016)
- DOI : http://dx.doi.org/10.17577/IJERTV5IS080466
- Published (First Online): 27-08-2016
- ISSN (Online) : 2278-0181
- Publisher Name : IJERT
- License: This work is licensed under a Creative Commons Attribution 4.0 International License
Odia Isolated Word Recognition using DTW
Anjan Kumar Sahu
Dept. of ECE
Centre for Advanced Post Graduate Studies, Rourkela, INDIA
Gyana Ranjan Mati
Dept. of ECE
Centre for Advanced Post Graduate Studies, Rourkela, INDIA
Abstract: Speech recognition is the process of communication between a human and a computer. It has a wide range of applications in security systems, healthcare, the military, telephony systems, and equipment designed for people with disabilities. Speech is the vocalized form of communication, based on the syntactic combination of lexical items and names drawn from very large vocabularies (usually about 1,000 different words). Each spoken word is created from the phonetic combination of a limited set of vowel and consonant speech sound units, so a proper algorithm is required for the feature extraction and recognition processes. For feature extraction we have used MFCC (Mel Frequency Cepstral Coefficients), and for recognition we have used DTW (Dynamic Time Warping). The entire speech recognition system has been implemented in MATLAB 2012b.
Keywords: Speech Recognition, MFCC, DTW, FFT, Isolated word
INTRODUCTION
Speech is an ancient way of expressing ourselves. Speech recognition is the process of training a computer or other machine to identify the isolated words spoken by a person using the information in the speech signal. In the first (training) phase, a large amount of data has to be stored for the recognition process. In the second phase, the stored model is used for classification. When a person speaks, the speech signal is captured, and acoustic analysis is carried out to identify the speech information. First, features are extracted from the data. The extracted features are then given to a classifier, which classifies the isolated spoken word by matching it against the database; the utterance with the best match is recognized as the spoken word.
Broadly, a speech recognition problem can focus on identifying either the speech or the speaker who uttered it. The proposed model aims to provide an isolated word recognizer that identifies the word spoken by a person.
Fig.1 Proposed Word Recognition Model
LITERATURE REVIEW
Various methods have been proposed for isolated word recognition over the past decade. Among them, the Hidden Markov Model (HMM) has been used extensively for large-vocabulary databases because of its high reliability [10]. The Artificial Neural Network (ANN) is another classifier for speech recognition with acceptable accuracy [11], and the Support Vector Machine (SVM) has been used to classify speech patterns using linear and non-linear discrimination models [12]. For simple isolated word detection, the combination of MFCC and DTW is sufficient and efficient [5]. However, if continuous speech detection with speaker discrimination is needed, MFCC alone is not sufficient, and a combination of various classifiers is required for high reliability. For simple word recognition with a small database, the MFCC and DTW approach is simpler than HMM, ANN, and SVM [5]. In this paper we have therefore adopted MFCC and DTW for our proposed model.
METHODOLOGY
Recognition Module
Isolated word recognition follows two stages: the first is the feature extraction model and the second is the feature matching model. For feature extraction we have used MFCC. Before feature extraction, we calculate the energy, the spectrogram, and the PSD (Power Spectral Density). We first recorded the speech at a sampling frequency of 16 kHz and then carried out the pre-processing steps. In pre-processing, we first calculated the energy; the energy spectrum is shown in Fig. 2.
The energy of the input speech is calculated to know how much energy is present in the signal. Next, the PSD is calculated to know the amount of power in the speech signal, as shown in Fig. 3. From the power spectrum we can tell which part is the desired signal and which part is noise or silence. After calculating the power, we carry out spectrogram analysis of the input speech signal using wideband and narrowband spectrograms, as shown in Fig. 4.
A spectrogram is a visual representation of the spectrum of frequencies in a sound or other signal as they vary with time or some other variable. Spectrograms can be used to identify spoken words phonetically and to analyze the various calls of humans.
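The energy calculation described above can be illustrated with a minimal sketch. It assumes that "energy" means the sum of squared samples in each short frame, and it uses a hypothetical frame length of 400 samples (25 ms at 16 kHz) and a hypothetical relative threshold of 10 percent of the peak frame energy to separate speech from silence. None of these values is given in the paper, and the function names are illustrative only.

```python
import math

def short_time_energy(signal, frame_len, hop):
    """Return the energy (sum of squared samples) of each frame."""
    energies = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        energies.append(sum(s * s for s in frame))
    return energies

def speech_frames(energies, ratio=0.1):
    """Flag frames whose energy exceeds `ratio` times the peak frame energy.

    The 10 percent ratio is an assumed value, not one taken from the paper.
    """
    if not energies:
        return []
    peak = max(energies)
    return [e > ratio * peak for e in energies]

# Synthetic test signal: 0.1 s of silence followed by 0.1 s of a 200 Hz tone,
# sampled at 16 kHz (the sampling frequency used in the paper).
fs = 16000
silence = [0.0] * 1600
tone = [0.5 * math.sin(2 * math.pi * 200 * n / fs) for n in range(1600)]
signal = silence + tone

energies = short_time_energy(signal, frame_len=400, hop=400)
flags = speech_frames(energies)
```

In this synthetic case the first four frames are flagged as silence and the last four as signal, which mirrors how the energy plot is used to tell the desired signal from the silent part.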
Fig.2 Energy of the given word
Fig. 3 Power Spectral Density of the Speech
Fig.4 Spectrogram of the speech
Feature Extraction
After the pre-processing calculations, we carry out MFCC analysis of the spoken word. First the input voice is pre-processed in three steps: pre-emphasis, normalization, and mean subtraction [1]. The high frequencies of the speech signal formed in the vocal tract are attenuated as the sound passes through the lips, so in pre-emphasis an FIR high-pass filter with the transfer function $H(z) = 1 - 0.98z^{-1}$ is used to flatten the signal spectrum. Normalization is then done to reduce the amplitude variation between the speech samples of all the words: all the samples of the signal are divided by the highest-amplitude sample value in the signal. Mean subtraction is done to remove the DC offset introduced by the microphone and other effects at the time of recording [4]. Next come framing and windowing: the pre-emphasized signal is divided into short frame blocks and a Hamming window is applied to each frame. The assumption is that the signal is stationary within such a short frame, so features are calculated for each frame. The purpose of the window is to limit the time interval to be analyzed so that the properties of the waveform do not change appreciably.
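The three pre-processing steps and the framing step can be sketched as follows. This is a minimal illustration rather than the authors' MATLAB code: the pre-emphasis coefficient 0.98 comes from the transfer function above, while the function names, the handling of the first sample, and the frame length and hop used in the example are assumptions.

```python
def pre_emphasis(signal, alpha=0.98):
    """Apply the FIR high-pass filter y[n] = x[n] - alpha * x[n-1].

    The first output sample is taken as x[0], i.e. x[-1] is assumed to be 0.
    """
    if not signal:
        return []
    out = [signal[0]]
    for n in range(1, len(signal)):
        out.append(signal[n] - alpha * signal[n - 1])
    return out

def normalize(signal):
    """Divide every sample by the largest absolute sample value."""
    peak = max(abs(s) for s in signal)
    return [s / peak for s in signal] if peak > 0 else list(signal)

def remove_mean(signal):
    """Subtract the mean to remove any DC offset."""
    mean = sum(signal) / len(signal)
    return [s - mean for s in signal]

def split_frames(signal, frame_len, hop):
    """Cut the signal into frames of frame_len samples, hop samples apart."""
    return [signal[i:i + frame_len]
            for i in range(0, len(signal) - frame_len + 1, hop)]

# Small worked example on a four-sample ramp.
x = [1.0, 2.0, 3.0, 4.0]
emphasized = pre_emphasis(x)          # approximately [1.0, 1.02, 1.04, 1.06]
normalized = normalize(emphasized)    # largest absolute value becomes 1.0
centered = remove_mean(normalized)    # samples now sum to zero
frames = split_frames(list(range(10)), frame_len=4, hop=2)
```

The steps are applied in the order in which the text lists them; a different order would give slightly different sample values.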
Fig.5 Hamming Window
Windowing also serves to remove the signal discontinuities at the beginning and end of each frame. The Hamming window is used for this purpose since it provides a smoother spectrum. The Hamming window is given by Eq. (1):
$$W(n) = 0.54 - 0.46\cos\left(\frac{2\pi n}{N-1}\right), \quad 0 \le n \le N-1 \qquad (1)$$
where N is the number of samples in a single frame.
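As an illustration of Eq. (1), the sketch below computes the Hamming window coefficients and applies them to one frame. The function names are illustrative, and the single-sample case (N = 1) is handled by an assumed convention of returning a window of 1.0.

```python
import math

def hamming(N):
    """Hamming window of length N: 0.54 - 0.46*cos(2*pi*n/(N-1)), 0 <= n <= N-1."""
    if N == 1:
        return [1.0]  # assumed convention; Eq. (1) is undefined for N = 1
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (N - 1)) for n in range(N)]

def apply_window(frame):
    """Multiply each sample of a frame by the matching window coefficient."""
    window = hamming(len(frame))
    return [x * w for x, w in zip(frame, window)]

w = hamming(5)
windowed = apply_window([1.0, 1.0, 1.0, 1.0, 1.0])
```

The window tapers to 0.08 at both ends and reaches 1.0 at the centre, which is what suppresses the discontinuities at the frame boundaries.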
Next we apply the FFT (Fast Fourier Transform), which calculates the DFT of an input signal in an efficient manner, saving processing power and reducing computation time.
The FFT is characterized by the following equation:
$$X(k) = \sum_{j=1}^{N} x(j)\, W_N^{(j-1)(k-1)} \qquad (2)$$
where x(j) is the jth sample and $W_N = e^{-2\pi i/N}$. This gives the spectral coefficients of the windowed frame.
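To make Eq. (2) concrete, the following sketch evaluates the DFT sum directly with the same 1-based indices j and k. It assumes the usual convention that W_N is the complex exponential exp(-2*pi*i/N). This direct evaluation costs O(N^2) operations; an FFT routine returns the same coefficients with far fewer operations, so the sketch is for illustration only.

```python
import cmath
import math

def dft(x):
    """Direct evaluation of X(k) = sum_{j=1..N} x(j) * W_N^((j-1)(k-1))."""
    N = len(x)
    W = cmath.exp(-2j * cmath.pi / N)  # assumed definition of W_N
    return [
        sum(x[j - 1] * W ** ((j - 1) * (k - 1)) for j in range(1, N + 1))
        for k in range(1, N + 1)
    ]

# One full cosine cycle over 8 samples: the energy appears in the second
# coefficient (k = 2) and in its mirror image (k = 8).
samples = [math.cos(2 * math.pi * n / 8) for n in range(8)]
spectrum = dft(samples)
magnitudes = [abs(c) for c in spectrum]
```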
Next comes Mel filtering, which converts the frequency-domain signal into the corresponding Mel domain. The process of obtaining Mel-cepstral coefficients involves the use of a Mel-scale filter bank. The Mel scale is a logarithmic scale resembling the way the human ear perceives sound.
Fig.6 MFCC Feature Extraction Process
The filter bank is composed of 20 triangular filters that are non-uniformly placed in frequency, such that the filters have a linear frequency response at low frequencies (up to 1 kHz) and a logarithmic one at high frequencies, as shown in Fig. 7. The Mel scale is represented by the following equation:
$$\mathrm{Mel}(f) = 2595 \log_{10}\left(1 + \frac{f}{700}\right) \qquad (3)$$
where f is the frequency in Hz.
The spectral coefficients of the frames are multiplied by the filter gains to obtain the Mel filter output, which is shown in Fig. 7 below.
Fig. 7 Mel filter bank output
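The Mel filtering step can be sketched as below, using Eq. (3) for the Mel scale and 20 triangular filters as stated in the text. The FFT size of 512, the frequency range of 0 Hz to half the sampling frequency, and the rule used to map filter edges to FFT bins are assumptions made for illustration, as are all function names.

```python
import math

def hz_to_mel(f):
    """Eq. (3): Mel(f) = 2595 * log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    """Inverse of Eq. (3)."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(num_filters, n_fft, fs):
    """Build triangular filters whose edges are equally spaced on the Mel scale.

    Returns num_filters lists, each holding n_fft // 2 + 1 filter gains.
    """
    low, high = hz_to_mel(0.0), hz_to_mel(fs / 2.0)
    step = (high - low) / (num_filters + 1)
    mel_points = [low + i * step for i in range(num_filters + 2)]
    # Assumed bin-mapping rule: floor((n_fft + 1) * f / fs).
    bins = [int(math.floor((n_fft + 1) * mel_to_hz(m) / fs)) for m in mel_points]
    bank = []
    for i in range(1, num_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        gains = [0.0] * (n_fft // 2 + 1)
        for k in range(left, center):       # rising edge of the triangle
            gains[k] = (k - left) / (center - left)
        for k in range(center, right):      # peak and falling edge
            gains[k] = (right - k) / (right - center)
        bank.append(gains)
    return bank

def apply_filterbank(power_spectrum, bank):
    """Weight the spectrum by each filter's gains and sum, one value per filter."""
    return [sum(g * p for g, p in zip(gains, power_spectrum)) for gains in bank]

bank = mel_filterbank(num_filters=20, n_fft=512, fs=16000)
flat_spectrum = [1.0] * (512 // 2 + 1)
mel_energies = apply_filterbank(flat_spectrum, bank)
```

Because the edges are equally spaced in Mel, the triangles are narrow at low frequencies and grow wider towards high frequencies, which is the non-uniform placement described in the text.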
After Mel filtering, the logarithm (LOG) and the discrete cosine transform (DCT) are applied, which convert the Mel-domain parameters into cepstral coefficients (a time-like domain). The logarithm is given by
(4)
where m is the Mel-domain parameter; the output of LOG is shown in figure 10. The DCT is calculated by
(5)
where n = 1, 2, 3, 4, ... and k indexes the FFT coefficients.
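The LOG and DCT stage can be sketched as follows. The sketch assumes a natural logarithm and the common DCT-II form, which may differ in scaling or indexing from the expressions intended in Eqs. (4) and (5); the choice of 13 output coefficients and the small floor that guards against log(0) are also assumptions.

```python
import math

def log_energies(mel_energies, floor=1e-10):
    """Natural logarithm of each Mel filter output (floored to avoid log(0))."""
    return [math.log(max(e, floor)) for e in mel_energies]

def dct_ii(values, num_coeffs):
    """Unnormalized DCT-II: c[n] = sum_k values[k] * cos(pi * n * (k + 0.5) / K).

    Indices n and k start at 0 here; the text counts n from 1.
    """
    K = len(values)
    return [
        sum(values[k] * math.cos(math.pi * n * (k + 0.5) / K) for k in range(K))
        for n in range(num_coeffs)
    ]

def mfcc_from_mel(mel_energies, num_coeffs=13):
    """Chain LOG and DCT to turn one frame's Mel energies into MFCCs."""
    return dct_ii(log_energies(mel_energies), num_coeffs)

# Hypothetical Mel energies for one frame (20 filters, as in the text).
example_energies = [1.0 + 0.5 * i for i in range(20)]
coeffs = mfcc_from_mel(example_energies)
```

One coefficient vector of this kind is produced per frame, and the sequence of vectors for a word forms the feature sequence used in the matching stage.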
Fig. 9 Steps in converting frequency domain to time domain (block diagram: voice sample after FFT → Mel filter → LOG → DCT → MFCC)
Fig.8 Mel Vs Frequency plot
The figure above shows the relationship between Mel and frequency; the process of converting Mel to frequency is shown in figure 9 below.
Fig. 9 LOG (Mel) output
Feature Matching
In this stage, the features of the word calculated in the previous step are compared with those in the database to determine the exact spoken word. The DTW algorithm is implemented to calculate the least distance between the features of the word utterance and the reference templates [5]. The detected word is the one whose template gives the least value among the scores calculated for each template. How can we find the optimum mapping path in DTW? To compute the smallest Euclidean distance along the path, the obvious choice is the forward distance path, which can be summarized in the three simple steps below:
Figure 10. DTW distance mapping (cell D(i, j) and its neighbours D(i-1, j), D(i-1, j-1), and D(i, j-1))
1. Optimum value function: Define D(i, j) as the DTW distance between t(1:i) and r(1:j), with the mapping path starting from (1, 1) and ending at (i, j).
2. Recursion:
   $$D(i, j) = |t(i) - r(j)| + \min\{D(i-1, j),\; D(i-1, j-1),\; D(i, j-1)\} \qquad (6)$$
   with the initial condition
   $$D(1, 1) = |t(1) - r(1)| \qquad (7)$$
3. Final answer: D(m, n)
In practice, we first construct a matrix D of dimensions m×n and fill in the value of D(1, 1) using the initial condition. Then, using the recursive formula, we fill the whole matrix one element at a time, in column-by-column or row-by-row order. The final answer is available as D(m, n), with a computational complexity of O(mn). The template corresponding to the least distance is the detected word.
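The matrix-filling procedure described above can be sketched as follows. The sketch assumes that the local cost between two frames is the Euclidean distance between their feature vectors and that each step may come from the three neighbouring cells D(i-1, j), D(i-1, j-1), and D(i, j-1). The function names and the toy one-dimensional templates are illustrative only and are not taken from the paper.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dtw_distance(t, r):
    """DTW distance between feature sequences t (m frames) and r (n frames)."""
    m, n = len(t), len(r)
    inf = float("inf")
    D = [[inf] * n for _ in range(m)]
    D[0][0] = euclidean(t[0], r[0])            # initial condition
    for i in range(m):                         # fill the matrix row by row
        for j in range(n):
            if i == 0 and j == 0:
                continue
            best_previous = min(
                D[i - 1][j] if i > 0 else inf,
                D[i - 1][j - 1] if i > 0 and j > 0 else inf,
                D[i][j - 1] if j > 0 else inf,
            )
            D[i][j] = euclidean(t[i], r[j]) + best_previous
    return D[m - 1][n - 1]                     # final answer

def recognize(query, templates):
    """Return the word whose stored template is closest to the query."""
    best_word, best_dist = None, float("inf")
    for word, examples in templates.items():
        for example in examples:
            d = dtw_distance(query, example)
            if d < best_dist:
                best_word, best_dist = word, d
    return best_word, best_dist

# Toy one-dimensional "features"; real templates would hold MFCC vectors.
templates = {
    "EKA": [[[0.0], [1.0], [2.0]]],
    "DUI": [[[5.0], [5.0], [5.0]]],
}
query = [[0.0], [1.0], [1.0], [2.0]]   # same shape as EKA, spoken more slowly
word, distance = recognize(query, templates)
```

The query has one more frame than the EKA template, yet the warping path absorbs the repeated frame and the distance stays at zero, which is the property that makes DTW suitable for utterances of different durations.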
RESULTS AND DISCUSSION
The distances obtained when comparing different words and similar words are shown in Table 2 and Table 3, respectively. For similar words the distance is below 150, while for different words it is more than 300 (for example, 397.9128 for EKA and DUI). Thus a threshold of 200, accepting distances of 200 or less, can separate a given word from the set of saved templates. As DTW calculates the possible alignment between two vector paths, the distance obtained when two identical sequences are compared should be 0.
Table 1 Pronunciation of English to Odia words

| No. pronounced in English | No. pronounced in Odia |
|---|---|
| ONE | EKA |
| TWO | DUI |
| THREE | TINI |
| FOUR | CHARI |
| FIVE | PANCHA |
Table 2 Comparison Between Different Words

| Word 1 | Word 2 | DTW distance |
|---|---|---|
| EKA | DUI | 397.913 |
| EKA | TINI | 447.869 |
| EKA | CHARI | 338.086 |
| EKA | PANCHA | 323.089 |
| DUI | TINI | 412.192 |
| DUI | CHARI | 377.377 |
| DUI | PANCHA | 522.532 |
| TINI | CHARI | 350.23 |
| TINI | PANCHA | 587.635 |
| CHARI | PANCHA | 430.281 |
Table 3 Comparison Between Same Words

| Similar words | DTW distance |
|---|---|
| EKA | 130.695 |
| DUI | 112.619 |
| TINI | 125.9196 |
| CHARI | 120.129 |
| PANCHA | 101.722 |
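The threshold rule can be checked against the reported distances with a short sketch. The numbers below are copied from Table 2 and Table 3; the dictionary layout and the function name is_match are illustrative assumptions, and the rule "accept when the distance is 200 or less" is one reading of the threshold described in the text.

```python
# DTW distances between different words (Table 2).
different_words = {
    ("EKA", "DUI"): 397.913, ("EKA", "TINI"): 447.869,
    ("EKA", "CHARI"): 338.086, ("EKA", "PANCHA"): 323.089,
    ("DUI", "TINI"): 412.192, ("DUI", "CHARI"): 377.377,
    ("DUI", "PANCHA"): 522.532, ("TINI", "CHARI"): 350.23,
    ("TINI", "PANCHA"): 587.635, ("CHARI", "PANCHA"): 430.281,
}

# DTW distances between two utterances of the same word (Table 3).
same_words = {
    "EKA": 130.695, "DUI": 112.619, "TINI": 125.9196,
    "CHARI": 120.129, "PANCHA": 101.722,
}

THRESHOLD = 200.0

def is_match(distance, threshold=THRESHOLD):
    """Accept an utterance as the template's word if the distance is small enough."""
    return distance <= threshold

accepted_same = all(is_match(d) for d in same_words.values())
rejected_different = not any(is_match(d) for d in different_words.values())
largest_same = max(same_words.values())            # 130.695
smallest_different = min(different_words.values()) # 323.089
```

With these values every same-word comparison is accepted and every different-word comparison is rejected, and the threshold of 200 sits between the largest same-word distance and the smallest different-word distance.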
CONCLUSION
Using MFCC and DTW, an isolated word detection system has been built in the MATLAB 2012b environment. The system is trained by saving templates of five separate words. The results showed that saving ten templates for each word in the training phase gives better results than saving five templates. The efficiency in detecting isolated words is 100 percent for two-syllable words compared with one-syllable words. From the results above, we can infer that the DTW distance between identical words is less than 150 and that between different words is more than 300. So, by setting a threshold of 200, we can easily separate the word uttered by the user from the other words whose templates are saved in the training phase. As future scope, LPC can be used instead of MFCC to compare the results of the two, and HMM can be used instead of DTW for sentence recognition and for converting a sentence into its corresponding English sentence.
REFERENCES
[1] Lawrence Rabiner and Biing-Hwang Juang, Fundamentals of Speech Recognition, Englewood Cliffs, NJ: Prentice Hall, pp. 333-352 and 434-450, 1993.
[2] L. R. Rabiner and R. W. Schafer, Digital Processing of Speech Signals, Englewood Cliffs, NJ: Prentice-Hall, 1978.
[3] Urmila Shrawankar and Vilas Thakare, Techniques for Feature Extraction in Speech Recognition System: A Comparative Study.
[4] Koustav Chakraborty, Asmita Talele, and Savitha Upadhya, Voice Recognition Using MFCC Algorithm, International Journal of Innovative Research in Advanced Engineering (IJIRAE), ISSN: 2349-2163, Volume 1, Issue 10, November 2014.
[5] Lindasalwa Muda, Mumtaj Begam, and I. Elamvazuthi, Voice Recognition Algorithms using Mel Frequency Cepstral Coefficient (MFCC) and Dynamic Time Warping (DTW) Techniques, Journal of Computing, ISSN: 2151-9617, Volume 2, Issue 3, March 2010.
[6] Kashyap Patel and R. K. Prasad, Speech Recognition and Verification Using MFCC & VQ, International Journal of Emerging Science and Engineering (IJESE), ISSN: 2319-6378, Volume 1, Issue 7, May 2013.
[7] Vikas C. Raykar and S. R. Mahadeva, Speaker localization using extraction source information in speech, IEEE Transactions on Audio and Speech Processing, vol. 13, September 2005.
[8] Bishnu S. Atal and L. R. Rabiner, A Pattern Recognition Approach to Voiced-Unvoiced-Silence Classification with Applications to Speech Recognition, IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. ASSP-24, no. 3, June 1976.
[9] H. K. Palo and Mihir Narayan Mohanty, Classification of Emotional Speech of Children Using Probabilistic Neural Network, International Journal of Electrical and Computer Engineering (IJECE), ISSN: 2088-8708, vol. 5, no. 2, pp. 311-317, April 2015.
[10] L. Rabiner, A tutorial on Hidden Markov Model and selected applications in Speech Recognition, Proceedings of the IEEE, vol. 77, no. 2, pp. 257-286, 1989.
[11] Garima Vyas and Malay Kishore Dutta, An Integrated Spoken Language Recognition System Using Support Vector Machines, conference paper, DOI: 10.1109/IC3.2014.6897156.
[12] B. P. Das and R. Parekh, Recognition of Isolated Words using features based on LPC, MFCC, ZCR and STE, with Neural Network Classifiers, International Journal of Modern Engineering Research, vol. 2, no. 3, pp. 854-858, June 2012.