Emotion Recognition from Hindi Speech using MFCC and Sparse DTW

DOI : 10.17577/IJERTV4IS060003


Er. VipinKumar R. Pawar

Head of Research & Analysis Department, Anitas MicroSystems (INDIA), Nashik, India

Ms. Nupur Patel

B.E. (Electronics & Telecommunication Engineering), St. Francis Institute of Technology, Mumbai

Abstract: Recently, increasing attention has been directed to the study of the emotional content of speech signals, and many systems have been proposed to identify the emotional content of a spoken utterance. This project on Emotion Recognition from Hindi Speech addresses three main aspects of a speech recognition system, the first of which is the choice of suitable features for speech representation. Using Sparse DTW for feature recognition improves space efficiency and time complexity. The automatic emotion recognition system, implemented in MATLAB, provides an accuracy of over 75% for 5 emotions, namely happy, sad, surprise, anger and neutral, over a database containing a large variety of speakers.

  1. INTRODUCTION

    Speech is a vocalized form of human communication. Emotions exert an incredibly powerful force on human behaviour and play an important role in a person's approach to a particular situation at a particular time. An inability to understand a person's emotion in a given situation may cause a failure of communication, so recognising the emotion becomes an important task. This project aims to classify 5 emotions, namely sad, happy, anger, surprise and neutral. The input signal is divided into frames of 20 ms and features are extracted from each frame using MFCC. Sparse DTW is then used for classification of the emotions.

  2. LITERATURE SURVEY

    An experimental study on vocal emotion expression and recognition, and the development of a computer agent for emotion recognition, has been reported; the RELIEF-F algorithm was used for feature selection and the total average accuracy was about 70% [1].

  3. EMOTION RECOGNITION

      1. BLOCK DIAGRAM

        Fig. 1. Block diagram of the project

        The speech signal is recorded for a duration of 2 s at a sampling frequency of 8000 Hz. It is given as input to the end-point detection algorithm (Section 3.4.1), which removes the silent parts. The output is then given to SOLAFS, which reduces the length of the signal without affecting its pitch. The compressed signal is then pre-emphasised to boost the high frequencies, divided into 20 ms frames and windowed to reduce the discontinuities. The MFCC features are then extracted and given to the classifier to evaluate the emotion. During the training phase the output of the classifier is stored in the emotion-model database; during the testing phase the classifier uses this database to identify the correct emotion.
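        As a rough illustration of this pipeline, the sketch below (Python with NumPy, not the authors' MATLAB code) runs a stand-in 2 s, 8 kHz signal through crude end-point detection, pre-emphasis, framing and Hamming windowing; SOLAFS and the classifier stage are omitted and the thresholds are illustrative assumptions.

# Illustrative pipeline sketch (Python/NumPy, not the authors' MATLAB code).
import numpy as np

fs = 8000                                     # sampling frequency (Hz)
x = np.random.randn(2 * fs) * 0.01            # stand-in for a 2 s recorded utterance
frame_len = int(0.020 * fs)                   # 20 ms -> 160 samples

# End-point detection: keep frames whose short-time energy exceeds a threshold.
frames = x[: len(x) // frame_len * frame_len].reshape(-1, frame_len)
energy = (frames ** 2).sum(axis=1)
speech = frames[energy > 0.1 * energy.max()].ravel()

# Pre-emphasis: y(n) = x(n) - 0.95 * x(n-1) boosts the high frequencies.
speech = np.append(speech[0], speech[1:] - 0.95 * speech[:-1])

# Framing and Hamming windowing; MFCC extraction and matching against the
# stored emotion models would follow (see the sketches in later sections).
windowed = (speech[: len(speech) // frame_len * frame_len]
            .reshape(-1, frame_len) * np.hamming(frame_len))
print("frames ready for MFCC:", windowed.shape)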

        Speech emotion recognition has also been performed using continuous hidden Markov models; two methods are proposed and compared, and in the first a global statistics framework of an utterance is classified by Gaussian mixture models [2]. The Mel Frequency Cepstrum Coefficient (MFCC) feature has also been used for designing a text-independent speaker identification system; the goal of that work was to create a speaker recognition system and apply it to the speech of an unknown speaker, and it also suggests some modifications to existing MFCC feature-extraction techniques to improve the speaker recognition efficiency [3].

      2. SPEECH

        Speech is the expression of thoughts and feelings using spoken language. Each spoken word is created out of the phonetic combination of a limited set of vowel and consonant speech sound units. Humans use more than their ears while listening: they use the knowledge they have about the speaker and the subject, and there is a grammatical structure and redundancy that they exploit to predict words not yet spoken. In computer science, speech recognition is the translation of spoken words into text; it is also known as Automatic Speech Recognition (ASR). In ASR only the speech signal is available, which makes it difficult to accurately comprehend human speech.

      3. EMOTION

        Emotion is a mental state that arises spontaneously rather than through conscious effort and is often accompanied by physiological changes; examples are the emotions of joy, sorrow, hate and love. The word emotion covers a wide range of observable behaviour, expressed feelings and changes in the body state. Emotion is recognized not only through body language but also through speech. There is a huge range of human emotions that we are capable of experiencing; however, we often experience only a limited number of them, for example happiness, love, stress and relaxation.

      4. PRE-PROCESSING OF SPEECH

    3.4.1 END POINT DETECTION

    An important problem in speech processing is to detect the presence of speech in a background of noise. This problem is often referred to as the endpoint detection problem. In this project an End Point Detection (EPD) algorithm is used for pre-processing the speech. An advantage of a good endpoint detection algorithm is that proper location of the regions of speech can substantially reduce the amount of processing required for the intended application.

    This improves the performance of the decision-making block and makes the system memory efficient, because the templates produced by the feature extraction stage correspond to the detected speech only [2]. There are various methods for endpoint detection, such as voice activity detection (VAD) [3], an algorithm based on the Mahalanobis distance [3], and an algorithm based on an energy threshold [3]. For example, for the word four it is not important to include the entire initial unvoiced interval; in fact, experience has shown that 30 ms to 50 ms of unvoiced energy is sufficient for most word recognition purposes [1].
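    A minimal energy-threshold end-point detector in the spirit of the methods listed above is sketched below (Python/NumPy); the frame size and the relative threshold of 0.05 are illustrative values, not the exact ones used in this project.

# Minimal energy-threshold end-point detection sketch.
import numpy as np

def detect_endpoints(x, fs=8000, frame_ms=20, rel_threshold=0.05):
    """Return (start, end) sample indices of the detected speech region."""
    n = int(frame_ms * fs / 1000)
    frames = x[: len(x) // n * n].reshape(-1, n)
    energy = (frames ** 2).sum(axis=1)
    active = np.where(energy > rel_threshold * energy.max())[0]
    if active.size == 0:                      # no speech found: keep everything
        return 0, len(x)
    return active[0] * n, (active[-1] + 1) * n

# Example: silence - tone burst - silence.
fs = 8000
x = np.concatenate([np.zeros(4000),
                    np.sin(2 * np.pi * 440 * np.arange(8000) / fs),
                    np.zeros(4000)])
start, end = detect_endpoints(x, fs)
print(start, end)   # roughly 4000 and 12000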

  4. FEATURE EXTRACTION AND RECOGNITION

    When the input data to an algorithm is too large to be processed and is suspected to be highly redundant, it is transformed into a reduced representative set of features. Transforming the input data into this set of features is called feature extraction. In order to find statistically relevant information in the incoming data, it is important to have mechanisms for reducing the information of each segment of the audio signal to a relatively small number of parameters, or features [4].

    Fig. 2. End Point detection

    Any emotion in a speaker's speech is represented by a large number of parameters contained in the speech, and changes in these parameters result in corresponding changes in the perceived emotion. Different features are relevant to emotion, such as pitch, energy, duration, formants, Mel Frequency Cepstral Coefficients (MFCC) [5] and Linear Prediction Cepstral Coefficients (LPCC). When the emotional state changes, there is a corresponding change in the speech rate, energy and spectrum.

      1. MEL FREQUENCY CEPSTRAL COEFFICIENTS(MFCC)

        Fig. 3. MFCC block diagram: INPUT → PRE-EMPHASIS → FRAMING & OVERLAP → WINDOWING → FFT → MEL-FILTER BANK → LOGARITHMIC COMPRESSION → DCT → MFCC

        1. INPUT AND PRE-EMPHASIS

          Pre-emphasis is a preprocessing step which increases, within a band of frequencies, the magnitude of the higher frequencies with respect to the magnitude of the lower frequencies. In speech processing, the original signal usually has most of its energy at low frequencies, so processing the signal to emphasize the higher-frequency energy is necessary. The filter transfer function is given by:

          H(z) = 1 − a·z^(−1)

          where a is between 0.9 and 1.

          The spectrum of a single frame (after taking the FFT) before and after pre-emphasis is shown in Fig. 4.

          Fig. 4: Spectrum of a frame before pre-emphasis and after pre-emphasis
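          A minimal sketch of this filter as the difference equation y(n) = x(n) − a·x(n−1) is given below (Python/NumPy); a = 0.95 is a typical value from the stated 0.9–1.0 range, not necessarily the value used here.

# Pre-emphasis sketch implementing H(z) = 1 - a*z^(-1).
import numpy as np

def pre_emphasis(x, a=0.95):
    # y(n) = x(n) - a * x(n-1); the first sample is passed through unchanged.
    return np.append(x[0], x[1:] - a * x[:-1])

x = np.sin(2 * np.pi * 100 * np.arange(160) / 8000)   # low-frequency test tone
print(pre_emphasis(x)[:5])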

        2. FRAMING AND OVERLAPPING

    Normally a speech signal is not stationary, but seen from a short-time point of view it can be treated as stationary. Typically, a speech signal is stationary over windows of 20 ms. Therefore the signal is divided into frames of 20 ms, which corresponds to n samples:

    n = ts × fs

    where ts is the frame duration and fs is the sampling frequency (here 0.020 s × 8000 Hz = 160 samples).

    When the signal is framed it is necessary to consider how to treat the edges of the frames, and it is therefore expedient to use a window to tone down the edges. As a consequence the samples are not assigned the same weight in the subsequent computations, and for this reason it is prudent to use an overlap between frames.

    Fig. 6: Illustration of framing: the speech is divided into four frames

    4.1.3 WINDOWING

    A window function is a mathematical function that is zero-valued outside of some chosen interval. In order to reduce the discontinuities of the speech signal at the edges of each frame, a tapered window is applied to each frame, namely the Hamming window, defined as:

    w(n) = 0.54 − 0.46·cos(2πn / (N − 1)), 0 ≤ n ≤ N − 1

    where N is the frame length in samples.

    Fig. 7: Hamming window

    Fig. 8: Normalized frequency plot of the Hamming window

    From the normalized frequency plot it can be seen that the amplitude of the side lobes is smaller than the amplitude of the main lobe. Thus, windowing minimizes the spectral distortion and the discontinuities at the beginning and end of each frame.

    4.1.4 MEL FILTER BANK

    The Mel filtering approximates the non-linear frequency resolution of the human auditory system. The output is an array of filtered values, typically called the mel-spectrum, each value corresponding to the result of filtering the input spectrum through an individual filter. Therefore, the length of the output array is equal to the number of filters created. Linear frequency is mapped to the Mel scale by:

    MelFrequency = 2595 × log10(1 + lf / 700)

    where lf is the linear frequency in Hz.

    Fig. 4.7: Mel filter bank
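    The sketch below (Python/NumPy) illustrates framing with overlap, Hamming windowing, the FFT power spectrum and a triangular Mel filter bank. The 50% overlap, the 256-point FFT and the test signal are assumptions made for illustration, while the 20 ms frames, 8 kHz sampling rate and 33 filters follow the values stated in this paper.

# Framing, windowing and Mel filtering sketch.
import numpy as np

def frame_signal(x, frame_len=160, hop=80):
    """Split x into overlapping frames of frame_len samples."""
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]

def mel_filterbank(n_filters=33, n_fft=256, fs=8000):
    """Triangular filters spaced uniformly on the Mel scale."""
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    inv_mel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_points = np.linspace(mel(0), mel(fs / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * inv_mel(mel_points) / fs).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        fb[i - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)   # rising edge
        fb[i - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)   # falling edge
    return fb

fs, frame_len = 8000, 160
x = np.sin(2 * np.pi * 440 * np.arange(fs) / fs)          # 1 s test tone
frames = frame_signal(x, frame_len, hop=frame_len // 2) * np.hamming(frame_len)
spectrum = np.abs(np.fft.rfft(frames, n=256)) ** 2         # power spectrum
mel_spectrum = spectrum @ mel_filterbank().T               # one row per frame
print(mel_spectrum.shape)                                   # (n_frames, 33)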

        1. DISCRETE COSINE TRANSFORM

          The DCT [6] is applied to the log Mel filter bank output coefficients to obtain the Mel-scale cepstral coefficients. The DCT is given as:

          c(n) = Σ(m=0..M−1) s(m) · cos( πn(m + 0.5) / M ),  n = 1, 2, …, L

          where s(m) is the log output of the mth filter and M is the number of triangular band-pass filters. Here, M = 33 and L = 12.
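          A minimal sketch of this DCT step is given below (Python/NumPy); the unnormalised DCT-II basis follows the equation above, although the scaling convention varies between MFCC implementations.

# DCT step: log Mel spectrum -> cepstral coefficients.
import numpy as np

def mfcc_from_log_mel(log_mel, L=12):
    """log_mel: array of shape (n_frames, M) of log filter-bank outputs."""
    M = log_mel.shape[1]
    m = np.arange(M)
    n = np.arange(1, L + 1)[:, None]                 # c_1 ... c_L
    basis = np.cos(np.pi * n * (m + 0.5) / M)        # (L, M) DCT-II basis
    return log_mel @ basis.T                         # (n_frames, L)

log_mel = np.log(np.random.rand(5, 33) + 1e-8)        # toy log Mel spectrum
print(mfcc_from_log_mel(log_mel).shape)                # (5, 12)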

        2. LOG ENERGY

          Log energy is obtained from the signal to increase the accuracy of the system, as it is an important feature. We therefore add the log energy as the 13th coefficient alongside the MFCC coefficients. It is obtained using the following equation:

          log E = log( Σ(n=1..N) x(n)² )

          where N is the number of samples in the frame.
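          A one-line version of this computation is sketched below (Python/NumPy); the small constant added inside the logarithm is only a guard against log(0).

# Log energy of one frame, used as the 13th coefficient.
import numpy as np

def log_energy(frame):
    return np.log(np.sum(frame ** 2) + 1e-12)

frame = np.hamming(160) * np.sin(2 * np.pi * 440 * np.arange(160) / 8000)
print(log_energy(frame))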

        3. LIFTERING

          Liftering is a signal processing technique in which undesirable spectral measurement variations can be partially controlled (i.e., reduced in the level of variation). In particular, a band pass liftering process [7] reduces the variability of the statistical components of MFCC-based spectral measurements and hence it is desirable to use such a liftering process in a speech recognizer.
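          A common choice is the sinusoidal (band-pass) lifter sketched below (Python/NumPy); the lifter length Q = 22 is a conventional value and an assumption here, since the exact lifter used in this project is not specified.

# Sinusoidal (band-pass) liftering of cepstral coefficients.
import numpy as np

def lifter(cepstra, Q=22):
    """cepstra: (n_frames, n_coeffs) array of cepstral features."""
    n = np.arange(cepstra.shape[1])
    w = 1 + (Q / 2.0) * np.sin(np.pi * n / Q)    # raises mid-order coefficients
    return cepstra * w

cepstra = np.random.randn(5, 13)                 # toy (energy + 12 MFCCs) frames
print(lifter(cepstra).shape)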

        4. DELTA MFCC COEFFICIENT

          Delta is the time derivative of any signal. We perform delta function i.e. time derivative of the output from the previous stage (energy +MFCC coefficients). Thus it gives us the velocity and acceleration out of the obtained 13 coefficients. Delta provides us with 26 coefficients (13 normal and 13 derivatives). Delta MFCC can be obtained by using equation:

          . It can be represented in much less than n × m space, where n and m are the lengths of the time series S and Q, respectively.

          SM(i, j) = EucDist(S(i), Q(j)) if S(i) and Q(j) Bk

          B otherwise

          Unblocking (opening) the cells that reflect the similarity between points in both sequences, the SM entries are

          D D shown in Figure 9.

          c n = i. (c n + i c(n i)) / i2

          i=1

        5. DOUBLE DELTA MFCC COEFFICIENT

    i=1

    Double Delta MFCC coefficients are the time derivatives of the Delta MFCC coefficients. Double Delta provides us with 39 coefficients (13 normal, 13 derivatives amd 13 double derivatives).
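          A sketch of the delta and double-delta computation is given below (Python/NumPy); the window width D = 2 and the edge padding are assumptions, and some implementations also include a factor of 2 in the denominator of the equation above.

# Delta and double-delta features from per-frame coefficients.
import numpy as np

def delta(features, D=2):
    """features: (n_frames, n_coeffs); returns time derivatives of same shape."""
    padded = np.pad(features, ((D, D), (0, 0)), mode="edge")
    num = sum(i * (padded[D + i:len(features) + D + i] -
                   padded[D - i:len(features) + D - i]) for i in range(1, D + 1))
    return num / sum(i ** 2 for i in range(1, D + 1))

mfcc = np.random.randn(100, 13)                  # energy + 12 MFCCs per frame
d1 = delta(mfcc)                                 # velocity
d2 = delta(d1)                                   # acceleration
features = np.hstack([mfcc, d1, d2])             # 39 coefficients per frame
print(features.shape)                            # (100, 39)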

  5. FEATURE RECOGNITION

5.1 SPARSE DTW

The main principle of using Sparse DTW is to reduce time and space complexity. In order to reduce space usage while avoiding any recomputations we consider the following facts:

  1. Quantizing the input time series to exploit the similarity between the points in the two series.

  2. Using a sparse matrix SM of size k; in the worst case, k = m×n.

    If the two sequences are similar, k << m×n.

  3. The warping matrix is calculated using dynamic programming and sparse matrix indexing.

    5.1.1 SPARSE APPROACH

    Let us consider two sequences:

    S = [3, 4, 5, 3, 3] and Q = [1, 2, 2, 1, 0].

    First, quantize the sequences into the range [0, 1] using the equation

    QuantizedSeq_k(i) = ( s_k(i) − min(s_k) ) / ( max(s_k) − min(s_k) )

    where s_k(i) denotes the ith element of the kth time series. This yields the following sequences:

    S = [0, 0.5, 1.0, 0.0, 0.0] and Q = [0.5, 1.0, 1.0, 0.5, 0.0]

    Next, create overlapping bins governed by two parameters: the bin width and the overlap width. For this particular example the bin width is 0.5, and the 4 bins obtained are shown in the table below:

    Bin Number (Bk) | Bin Bounds  | Indices of S | Indices of Q
    1               | 0.0 – 0.5   | 1, 2, 4, 5   | 1, 4, 5
    2               | 0.25 – 0.75 | 2            | 1, 4
    3               | 0.5 – 1.0   | 2, 3         | 1, 2, 3, 4
    4               | 0.75 – 1.25 | 3            | 2, 3

    Table: Bin bounds, where Bk is the kth bin.

    The sparse matrix SM stores the local distances of only those cell pairs whose points fall into a common bin, so it can be represented in much less than n × m space, where n and m are the lengths of the time series S and Q, respectively:

    SM(i, j) = EucDist(S(i), Q(j)) if S(i) and Q(j) lie in a common bin Bk, and blocked otherwise.

    Unblocking (opening) the cells that reflect the similarity between points in the two sequences, the SM entries are as shown in Figure 9.

    Fig. 8: SM initially blocked (B)

    Fig. 9: SM after unblocking the optimal cells

    Fig. 10: Unblocking the upper neighbour (shaded cell)

    Then calculate the warping cost for each open cell c ∈ SM (c is the index of the cell in the linear ordering of SM's cells) by finding the minimum of the costs of its lower neighbours, which are [c − 1, c − n, c − (n+1)] (the black arrows in Figure 11 show the lower neighbours of every open cell). This cost is then added to the local distance of cell c. This step is similar to DTW; however, new cells may have to be opened if the upper neighbours of a given open cell c ∈ SM are blocked. The indices of the upper neighbours are [c + 1, c + n, c + n + 1], where n is the length of sequence S (i.e., the number of rows of SM). If none of the upper neighbours of a particular cell is open, its upper neighbours are unblocked. This is very useful when the algorithm traverses SM in reverse to find the final optimal path; in other words, unblocking allows the path to be connected. For example, cell SM(5) has one upper neighbour, cell SM(10), which is blocked (Figure 9); therefore this cell is unblocked by calculating EucDist(S(5), Q(2)). The value is added to SM, which means that cell SM(10) is now an entry in SM (Figure 10). Although unblocking adds cells to SM, so the number of open cells increases, the overlap of the bin boundaries keeps most of the unblocked cells of SM connected, which means that fewer unblocking operations are needed.

    Figure 11 shows the final entries of SM after calculating the warping cost of all open cells.

    Fig. 11: Constructing SM

    hop initially represents the linear index of the (m, n) entry of SM, that is, the bottom-right corner of SM in Figure 12. Starting from hop = n×m, choose the neighbour among [hop − n, hop − 1, hop − (n+1)] with the minimum warping cost and proceed recursively until the first entry of SM is reached, namely SM(1), i.e. hop = 1. While calculating the warping path, only the open cells are considered, which may be fewer than 3 in number. The filled cells show the optimal warping path, which crosses the grid from the top-left corner to the bottom-right corner. The distance between the two time series is calculated as the sum of the local costs along this path:

    D(S, Q) = Σ(k=1..K) SM(p_k)

    where p_1, …, p_K are the cells on the optimal warping path.

    Fig. 12: Final optimal path using (I) SparseDTW and (II) DTW
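    The sketch below (Python/NumPy) is a compact, illustrative implementation of the SparseDTW steps described above: quantization, overlapping bins, a dictionary used as the sparse matrix SM, dynamic programming over the open cells with unblocking of upper neighbours, and the backward trace of the warping path. The squared difference used as the local distance, the 50% bin overlap and the traversal details are assumptions of this sketch rather than the exact choices of the original algorithm.

# Illustrative SparseDTW sketch; a dict is used as the sparse matrix SM,
# keyed by cell coordinates (i, j).
import numpy as np

def quantize(s):
    s = np.asarray(s, dtype=float)
    rng = s.max() - s.min()
    return (s - s.min()) / (rng if rng else 1.0)

def sparse_dtw(S, Q, res=0.5):
    n, m = len(S), len(Q)
    qS, qQ = quantize(S), quantize(Q)

    # 1. Open the cells whose points fall into a common overlapping bin
    #    (bin width res, 50% overlap; squared difference as local distance).
    SM = {}
    lower = 0.0
    while lower < 1.0:
        upper = lower + res
        for i in np.where((qS >= lower) & (qS <= upper))[0]:
            for j in np.where((qQ >= lower) & (qQ <= upper))[0]:
                SM.setdefault((int(i), int(j)), (S[i] - Q[j]) ** 2)
        lower += res / 2.0

    # 2. Dynamic programming over the open cells: each cell adds the minimum
    #    cost of its lower neighbours; if none of a cell's upper neighbours is
    #    open, they are unblocked so the warping path stays connected.
    D = {}
    for i in range(n):
        for j in range(m):
            if (i, j) not in SM:
                continue
            prev = [D[c] for c in ((i - 1, j), (i, j - 1), (i - 1, j - 1)) if c in D]
            D[(i, j)] = SM[(i, j)] + (min(prev) if prev else 0.0)
            uppers = [(i + 1, j), (i, j + 1), (i + 1, j + 1)]
            if not any(c in SM for c in uppers):
                for a, b in uppers:
                    if a < n and b < m:
                        SM[(a, b)] = (S[a] - Q[b]) ** 2

    # 3. Trace the optimal warping path back from (n-1, m-1) to (0, 0).
    path, cell = [(n - 1, m - 1)], (n - 1, m - 1)
    while cell != (0, 0):
        i, j = cell
        cell = min((c for c in ((i - 1, j - 1), (i - 1, j), (i, j - 1)) if c in D),
                   key=D.get)
        path.append(cell)
    return D[(n - 1, m - 1)], path[::-1]

# Worked example from the text: S = [3, 4, 5, 3, 3], Q = [1, 2, 2, 1, 0].
cost, path = sparse_dtw([3, 4, 5, 3, 3], [1, 2, 2, 1, 0])
print(cost, path)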

    1. DATABASE DESCRIPTION

      The database includes speech samples from around 50 people, collected manually for five different emotions, namely Sad, Happy, Anger, Neutral and Surprise. These samples were collected using the Audacity software and also by writing a MATLAB program that uses the Data Acquisition Toolbox. The phrases were recorded using a microphone.

      The sentences used for collecting the database are as follows (English glosses in parentheses):

      • Happy emotion: Ajj me bhot khush hu. (I am very happy today.)

      • Sad emotion: Mujhe koi yaad hi nahi karta. (Nobody ever remembers me.)

      • Neutral emotion: Me ghar ja raha hu. (I am going home.)

      • Anger emotion: Mujhe gussa maat dilao. (Don't make me angry.)

      • Surprise emotion: Kya baat kar rahe ho! (What are you saying!)
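      For reference, a minimal recording sketch is shown below (Python, using the sounddevice and SciPy packages as assumptions; the authors used Audacity and the MATLAB Data Acquisition Toolbox). The output file name is hypothetical.

# Record one 2 s sample at 8000 Hz and save it as a WAV file.
import sounddevice as sd
from scipy.io import wavfile

fs, duration = 8000, 2                         # values used in this project
print("Speak now...")
x = sd.rec(int(duration * fs), samplerate=fs, channels=1)
sd.wait()                                      # block until recording finishes
wavfile.write("happy_speaker01.wav", fs, x)    # hypothetical file name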

    2. CONCLUSION

      This implementation of an automatic emotion recognition system (in MATLAB) provides an accuracy of over 75% for 5 emotions, namely happy, sad, surprise, anger and neutral. In this project Mel Frequency Cepstral Coefficients (MFCC) are used for feature extraction and Sparse DTW is used for feature recognition. Using SparseDTW helped to improve the space efficiency and reduce the time complexity. Progress in this area relies heavily on the development of appropriate databases. The problems that usually occur while collecting a database are variations in the surroundings (noise), varying speaker characteristics and acoustic confusability.

    3. FUTURE SCOPE

      This project is an initiative to identify, explore and develop possible alternatives to improve overall human-computer interaction. The scope for this project includes using HMM to further increase the accuracy and reduce the time complexity. The database can also be made multilingual (covering several languages together). Further, this project can be converted into an Android application.

    4. REFERENCES

  1. L. R. Rabiner, M. R. Sambur, An algorithm for determining the endpoints of isolated utterances, Bell System Technical Journal, 54, p. 297-315, Feb. 1975.

  2. T. B. Amin, I. Mahmood, Speech Recognition Using Dynamic Time Warping, 2nd International Conference on Advances in Space Technologies, Proceedings of ICAST, vol. 2, pp. 74-79, November, 2008.

  3. K. Yamamoto, F. Jabloun, K. Reinhard and A. Kawamura, Robust method for end point detection using discriminative feature extraction, IEEE Proceedings, Europe, 2006.

  4. K. R. Aida-Zade, C. Ardil and S. S. Rustamov, "Investigation of Combined Use of MFCC and LPC Features in Speech Recognition Systems", World Academy of Science, Engineering and Technology, 2006.

  5. Digital Signal Processing Mini-Project, An Automatic Speaker Recognition System, Minh N. Do, Audio Visual Communications Laboratory, Swiss Federal Institute of Technology, Lausanne, Switzerland.

  6. L. Rabiner and B. H. Juang, Fundamentals of Speech Recognition, Prentice Hall PTR, New Jersey, ISBN 0-13-015157-2.

  7. F. Dellaert, T. Polzin and A. Waibel, Recognizing emotion in speech, IEEE International Conference on Emotion and Signal Processing, pp. 1970-1973, 2004.
