Comparison Between SVM & Other Classifiers For SER

DOI : 10.17577/IJERTV2IS1457



Pranita N. Kulkarni, Student of M.E., M.I.T. College, Aurangabad

Prof. D. L. Gadhe, Assistant Professor, M.I.T. College, Aurangabad

        Abstract

        In this paper we approach speech emotion recognition using the support vector machine. Speech emotion recognition plays an important role in the field of human-computer interaction, with a wide range of applications. Many systems have been developed to recognize emotion from the audio signal. The different classifiers used for speech emotion recognition are reviewed in this paper. Emotion recognition depends on both the speaker & the utterance (phrase). The support vector machine is used as a classifier to classify different emotions such as anger, happiness, sadness, neutral & fear. The features extracted from the speech are the energy, pitch, linear prediction cepstrum coefficients (LPCC) & Mel frequency cepstrum coefficients (MFCC). This paper examines the effectiveness, accuracy & simplicity of the SVM classifier for designing a real-time speech emotion recognition system.

        1. Introduction

          There are many ways of communicating with one another, such as speech, facial expression, eye contact, body language & so on [4]. Speech is one of the fastest & most efficient methods of communication between humans, and during communication, emotion plays an extremely important role in human life. Emotions [1][2] are nothing but the psychological description of one's feelings. There are some general emotions such as anger, happiness, sadness, neutral & fear. Speech emotion recognition helps to improve human-computer interaction. In our day-to-day life, computers play a very important role, & this method helps in communication between humans & computers. To make human-computer interaction more efficient & natural, it would be beneficial to give computers the ability to recognize the emotional state of the speaker.

          Emotion recognition first classifies speech into categories which are directly related to the psychological state of the user. Emotional speech classification is not an easy task; it requires a set of successive operations such as voice activity detection, feature extraction, training & classification.

          There are various methods used in the field of speech emotion recognition for classification, such as the linear discriminant classifier (LDC), K-nearest neighbor (KNN), Gaussian mixture model (GMM), hidden Markov model (HMM), neural network (NN) & support vector machine (SVM).

          The support vector machine is a learning algorithm that addresses the general problem of learning to discriminate between positive & negative members of a given set of n-dimensional vectors. The SVM is used for both classification & regression purposes. The main idea of SVM classification is to transform the original input set into a high-dimensional feature space by using a kernel function.

          There are various applications of speech emotion recognition, such as emotion recognition software for call centers (where detecting the caller's emotional state provides feedback to an operator or a supervisor), lie detection, intelligent toys, psychiatric diagnosis & e-learning environments.

        2. Literature Review

          In speech emotion recognition, the classifier recognizes the emotion in the speech. Various types of classifiers have been proposed, each with advantages & disadvantages over the others. Generally, two main types of information source can be used to identify the emotion of a speaker: the word content of the utterance & the acoustic features. The linear discriminant classifier (LDC) is one of the simplest, most well-known & widely used classifiers; it finds a hyperplane that partitions the feature space into two decision regions. The LDC attains 70.5 % accuracy.

          The K-nearest neighbor (K-NN) is amongst the simplest of all classification algorithms. In this technique, an object is classified by a majority vote of its neighbors, with the object being assigned to the class most common amongst its K nearest neighbors (k is a positive integer, typically small). The classifier can classify all the utterances in the design set properly if k equals 1; however, its performance on the test set will be reduced. The K-NN classifier utilizes the information of pitch & energy and attains an accuracy of up to 64 % for a four-emotion set [11].
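          As an illustration of the majority-vote rule described above, a minimal K-NN classifier over utterance-level feature vectors might look as follows; the feature matrix, labels and k = 5 are placeholder assumptions, not values from the paper:

```python
# Minimal K-NN sketch (illustrative only): classify utterance-level
# pitch/energy feature vectors by majority vote of the k nearest neighbours.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# X: hypothetical (n_utterances, n_features) array of pitch/energy statistics
# y: hypothetical integer emotion labels for a four-emotion set
X = np.random.rand(200, 6)
y = np.random.randint(0, 4, 200)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5)   # k is typically a small positive integer
knn.fit(X_train, y_train)
print("K-NN test accuracy:", knn.score(X_test, y_test))
```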

          The Gaussian Mixture Model (GMM) for speech emotion recognition is more suitable only when global features are extracted from the training utterances. It provides a good approximation of the probability density function by a mixture of weighted Gaussians. The Expectation-Maximization (EM) algorithm is used to compute the mixture coefficients. Each emotion is modelled by one GMM [7], and the decision is made for the maximum-likelihood model. The main advantage of utilizing a GMM is the assumption that any signal by default has a Gaussian shape & forms a bell-shaped curve; because of its infinite range & its symmetry, the Gaussian distribution is also assumed to work well for speech recognition in the presence of noise & for modelling speech spectra. The GMM attains 78.77 % accuracy.
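          The per-emotion GMM decision rule described above (one GMM per emotion, decision by maximum likelihood) can be sketched as follows; the frame-level features, number of mixture components and diagonal covariances are illustrative assumptions:

```python
# One GMM per emotion; an utterance is assigned to the emotion whose GMM
# gives the highest total log-likelihood for its frame-level feature vectors.
import numpy as np
from sklearn.mixture import GaussianMixture

def train_emotion_gmms(features_per_emotion, n_components=8):
    """features_per_emotion: dict emotion -> (n_frames, n_features) array."""
    gmms = {}
    for emotion, feats in features_per_emotion.items():
        gmm = GaussianMixture(n_components=n_components, covariance_type="diag")
        gmm.fit(feats)                       # mixture weights estimated via EM
        gmms[emotion] = gmm
    return gmms

def classify(gmms, utterance_feats):
    # Sum of per-frame log-likelihoods under each emotion model.
    scores = {e: g.score_samples(utterance_feats).sum() for e, g in gmms.items()}
    return max(scores, key=scores.get)

# Placeholder data: random frame-level features for two emotions.
data = {"anger": np.random.rand(500, 13), "neutral": np.random.rand(500, 13)}
models = train_emotion_gmms(data)
print(classify(models, np.random.rand(120, 13)))
```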

          The Hidden Markov Model (HMM) has a physical relation with the speech production mechanism & is therefore widely used in speech emotion recognition for isolating words & emotions from speech. The HMM [9] is a doubly embedded stochastic process, which is not directly observable but has the capability of effectively modelling statistical variations in spectral features. The HMM models not only the underlying speech sounds but also the temporal sequencing among the sounds, and this temporal modelling is advantageous for emotion recognition. The main limitation in building an HMM-based recognition model is the feature selection process, because it is not enough that the features carry information about the emotional states; they must fit the HMM structure as well. The HMM has better classification accuracy than the other classifiers.

          The Neural Network (NN) has a long history in pattern classification, due to its non-linear transfer functions, self-contained feature weighting capabilities and discriminative training. The neural network classification technique is more suitable for classifying the emotions anger & neutral [5].

          The Support Vector Machine (SVM) plays an interesting role in the field of classification, because it transforms the original feature set into a high-dimensional feature space by using a kernel function. Linear, polynomial & radial basis function (RBF) kernels are the kernel functions most widely used in SVM models. SVM models show a high generalization capability due to their structural-risk-minimization oriented training. The SVM classifier is generally used in speech emotion recognition because of its suitability for pattern recognition & classification problems. The SVM classifier achieves correct classification rates of 89.4 %, 93.6 % & 88.9 % for the male, female & gender-independent cases respectively [6].

        3. System Modeling

          Dealing with the speaker's emotion is one of the latest challenges in speech technologies. Speech emotion recognition involves three different tasks, namely speech recognition, synthesis of emotional speech &, finally, emotion recognition. As we know, speech is a time-varying signal which carries the underlying patterns of emotions. To capture these time-varying patterns of emotions, the SVM can be effectively used [10]. The speech emotion recognition system mainly contains the following five modules:

          • Speech Input (Audio signal)

          • Feature Extraction

          • Feature Selection & Labeling

          • SVM Classifier

          • Recognized Emotional Output

          Figure 3.1 Implementation of Speech Emotion Recognition System

          The main aim of speech emotion recognition is to automatically identify the emotional state of a human being from his or her voice. It is based on an in-depth analysis of the generation mechanism of the input speech signal. The level of naturalness of the input speech signal depends on the database used. The database given as input to the speech emotion recognition system may contain real-world emotions or acted ones. Generally, the Berlin database is used for emotion recognition. The Berlin database is in the German language and consists of 535 emotional utterances recorded from 5 female & 5 male actors. Each speaker speaks at most 10 different sentences in 5 different emotions such as anger, happiness, sadness, neutral & fear.

            1. Feature Extraction

              Any emotion in the speaker's speech is characterized by a large number of parameters, & changes in these parameters result in corresponding changes in emotion. Several features are extracted from speech, such as energy, pitch, formant frequencies, Mel Frequency Cepstrum Coefficients (MFCC), Mel Energy Spectrum Dynamic Coefficients (MEDC), etc. These are called prosodic features & are regarded as primary indicators of the speaker's emotional state. With different emotional states, corresponding changes occur in the speaking rate, pitch, energy & spectrum.

              1. Energy Features

                One of the most important speech features indicating emotion is energy. To obtain the statistics of the energy feature, we use a short-term function which extracts the value of the energy in each speech frame. Then we can calculate statistics of the energy over the whole speech sample, such as the mean value, maximum value, variation range & energy contour.
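                A minimal sketch of the short-term energy statistics described above; the frame length of 400 samples and hop of 160 samples (25 ms / 10 ms at 16 kHz) are illustrative assumptions:

```python
# Short-term energy per frame, then utterance-level statistics.
import numpy as np

def energy_features(signal, frame_len=400, hop=160):
    # Split the signal into overlapping frames and compute the energy of each frame.
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop)
    energies = np.array([
        np.sum(signal[i * hop : i * hop + frame_len] ** 2)
        for i in range(n_frames)
    ])
    # Utterance-level statistics: mean, max, variation range; the full contour is kept too.
    return {
        "mean": energies.mean(),
        "max": energies.max(),
        "range": energies.max() - energies.min(),
        "contour": energies,
    }

stats = energy_features(np.random.randn(16000))   # 1 s of placeholder audio at 16 kHz
print(stats["mean"], stats["max"], stats["range"])
```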

              2. Pitch Features

                Pitch, often referred to as the fundamental frequency (F0), is one of the most important features for determining emotion in speech. The pitch signal, also called the glottal waveform, depends on the tension of the vocal folds & the subglottal air pressure, & it is produced by the vibration rate of the vocal cords. The pitch signal has two characteristics: the pitch frequency & the glottal air velocity at the vocal-fold opening instant. The pitch frequency directly determines the number of harmonics present in the spectrum.
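                As an illustrative sketch (not the paper's implementation), F0 can be estimated for a voiced frame with the autocorrelation method; the 50-400 Hz search range is an assumption:

```python
# Estimate F0 of one voiced frame via the autocorrelation method.
import numpy as np

def estimate_f0(frame, sr, fmin=50.0, fmax=400.0):
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]  # non-negative lags
    lag_min = int(sr / fmax)          # smallest lag = highest admissible pitch
    lag_max = int(sr / fmin)          # largest lag = lowest admissible pitch
    lag = lag_min + np.argmax(ac[lag_min:lag_max])
    return sr / lag

sr = 16000
t = np.arange(0, 0.03, 1.0 / sr)                  # 30 ms synthetic frame
frame = np.sin(2 * np.pi * 150 * t)               # 150 Hz test tone
print(estimate_f0(frame, sr))                     # ~150 Hz expected
```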

              3. Formants Features

                Spectral peaks of the sound spectrum |P(f)| of the voice are called formants. The term formant is also used to mean an acoustic resonance of the human vocal tract. A formant is often measured as an amplitude peak in the frequency spectrum of the sound. Formants are measured by using a spectrogram or a spectrum analyzer [9]. The use of Linear Predictive Coding (LPC) to model formants is widespread in speech synthesis. The formant feature vector is 48-dimensional.
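                A rough sketch of LPC-based formant estimation, taking the angles of the LPC polynomial roots as formant candidates; the LPC order of 12 and the use of librosa.lpc are assumptions for illustration (in practice the frame would be pre-emphasized and windowed first):

```python
# Rough formant estimation from the roots of an LPC polynomial.
import numpy as np
import librosa

def formants_lpc(frame, sr, order=12):
    a = librosa.lpc(frame, order=order)         # [1, a1, ..., ap] prediction-error polynomial
    roots = np.roots(a)
    roots = roots[np.imag(roots) > 0]           # keep one root of each conjugate pair
    freqs = np.angle(roots) * sr / (2 * np.pi)  # pole angles -> frequencies in Hz
    return np.sort(freqs)

sr = 16000
frame = np.random.randn(400).astype(np.float64)   # placeholder 25 ms frame
print(formants_lpc(frame, sr)[:3])                # first few candidate formants
```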

              4. Mel Frequency Cepstrum Coefficients (MFCC)

                Figure 3.1.4.1 Structure of MFCC: Speech → Pre-Emphasis → Framing → Windowing → FFT → Mel Filter Bank & Frequency Warping → Log → DCT → MFCC

                • Pre-Emphasis

                  In this first step of MFCC, the speech signal is passed through a filter which emphasizes the higher frequencies. This process increases the energy of the signal at the higher frequencies:

                  Y[n] = X[n] − 0.95 X[n−1]

                  Here a = 0.95, which means that 95 % of any one sample is presumed to originate from the previous sample. (A code sketch of the complete MFCC pipeline is given after the DCT step below.)

                • Framing

                  Framing is the process of segmenting the speech signal into small frames with a time length in the range of 20-40 ms. The speech signal is first obtained by analog-to-digital conversion (ADC). The voice signal is divided into frames of N samples, with adjacent frames separated by M samples (M < N); typical values are M = 100 and N = 256. Framing turns the non-stationary speech signal into quasi-stationary frames and enables Fourier transformation of the speech signal.

                • Windowing

                Windowing gives a window shape to each individual frame, in order to minimize signal discontinuities at the beginning and the end of each frame.

                If the window is defined as W(n), 0 ≤ n ≤ N−1, where N is the number of samples in each frame, Y(n) is the output signal, X(n) is the input signal and W(n) is the Hamming window, then the result of windowing the signal is:

                Y(n) = X(n) × W(n)

                W(n) = 0.54 − 0.46 cos(2πn / (N − 1)), 0 ≤ n ≤ N−1

                • Fast Fourier Transform

                  The Fast Fourier Transform is used to convert each frame of N samples from the time domain to the frequency domain, and this algorithm is used for evaluating the frequency spectrum of the speech signal. The Fourier transform converts the convolution of the glottal pulse U[n] and the vocal tract impulse response H[n] in the time domain into a multiplication in the frequency domain.

                • Mel Filter Bank & Frequency Wrapping

                  The Mel filter bank consists of overlapping triangular filters, with the cut-off frequencies determined by the centre frequencies of the two adjacent filters. Filtering with the bank of filters spaced according to the Mel scale, as shown in the figure, is then performed.

                  Figure 3.1.4.2 Mel scale filter bank.

                  This figure shows a set of triangular filters that are used to compute a weighted sum of filter spectral components, so that the output of the process approximates a Mel scale. Each filter's magnitude frequency response is triangular in shape, equal to unity at the centre frequency and decreasing linearly to zero at the centre frequencies of the two adjacent filters [7, 8]. Each filter output is then the sum of its filtered spectral components.

                • Logarithm

                  The logarithm is basically used to convert multiplication into addition. In this step, it simply converts the multiplication of magnitudes in the Fourier transform into an addition.

                  • Discrete Cosine Transform

                The Discrete Cosine Transform (DCT) converts the log Mel spectrum back into the time domain. The result of the conversion is called the Mel Frequency Cepstrum Coefficients. The set of coefficients is called an acoustic vector; therefore, each input utterance is transformed into a sequence of acoustic vectors.
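                The whole chain above (pre-emphasis → framing → windowing → FFT → Mel filter bank → log → DCT) can be sketched from scratch as follows; the frame length of 256 and hop of 100 follow the text, while the 20 Mel filters and 13 retained coefficients are illustrative assumptions:

```python
# From-scratch MFCC sketch following the steps described above.
import numpy as np
import librosa
from scipy.fftpack import dct

def mfcc(signal, sr, frame_len=256, hop=100, n_mels=20, n_ceps=13, alpha=0.95):
    # 1. Pre-emphasis: y[n] = x[n] - 0.95 * x[n-1]
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])
    # 2. Framing into overlapping frames of N samples separated by M samples.
    n_frames = 1 + max(0, (len(emphasized) - frame_len) // hop)
    frames = np.stack([emphasized[i * hop : i * hop + frame_len] for i in range(n_frames)])
    # 3. Windowing with a Hamming window: w(n) = 0.54 - 0.46 cos(2*pi*n/(N-1))
    frames = frames * np.hamming(frame_len)
    # 4. FFT -> power spectrum of each frame.
    power = np.abs(np.fft.rfft(frames, n=frame_len)) ** 2
    # 5. Mel filter bank & frequency warping.
    fbank = librosa.filters.mel(sr=sr, n_fft=frame_len, n_mels=n_mels)
    mel_energies = power @ fbank.T
    # 6. Logarithm (multiplications in the spectrum become additions).
    log_mel = np.log(mel_energies + 1e-10)
    # 7. DCT of the log Mel spectrum -> MFCC acoustic vectors.
    return dct(log_mel, type=2, axis=1, norm="ortho")[:, :n_ceps]

coeffs = mfcc(np.random.randn(16000), sr=16000)   # placeholder 1 s utterance
print(coeffs.shape)                               # (n_frames, n_ceps)
```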

              5. Mel Energy Spectrum Dynamic Coefficient (MEDC)

                The MEDC feature extraction process contains the following steps, as shown in the figure below:

                Figure 3.1.5 Structure of MEDC: Speech → Pre-Emphasis → Framing → Windowing → FFT → Mel Filter Bank & Frequency Warping → Mean Log Energies → DCT → MEDC

                • The pre-emphasis, framing, windowing, FFT & Mel filter bank & frequency warping steps of MEDC feature extraction are the same as for MFCC feature extraction. The magnitude spectrum of each speech utterance is determined using the FFT and then input to a bank of 12 filters equally spaced on the Mel frequency scale.

                • Mean Log Energies

                In this step, the mean log energy of each filter output is calculated as En(i), i = 1, …, N. Then the first & second differences of En(i) are calculated.
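                A sketch of the mean-log-energy computation and its first & second differences; the framing parameters are the same illustrative ones as in the MFCC sketch, while the 12-filter Mel bank follows the text above:

```python
# MEDC sketch: mean log Mel filter-bank energies En(i), i = 1..N,
# followed by their first and second differences.
import numpy as np
import librosa

def medc(signal, sr, frame_len=256, hop=100, n_mels=12):
    # Framing + windowing + FFT, as in the MFCC sketch.
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop)
    frames = np.stack([signal[i * hop : i * hop + frame_len] for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames * np.hamming(frame_len), n=frame_len)) ** 2
    # Bank of 12 filters equally spaced on the Mel scale.
    fbank = librosa.filters.mel(sr=sr, n_fft=frame_len, n_mels=n_mels)
    log_e = np.log(power @ fbank.T + 1e-10)
    # Mean log energy per Mel band: En(i), i = 1..N.
    en = log_e.mean(axis=0)
    # First and second differences of En(i).
    d1 = np.diff(en, n=1)
    d2 = np.diff(en, n=2)
    return np.concatenate([en, d1, d2])

print(medc(np.random.randn(16000), sr=16000).shape)   # (12 + 11 + 10,) = (33,)
```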

              6. Linear Prediction Cepstrum Coefficients (LPCC)

                Linear prediction cepstrum coefficients, which come from Linear Predictive Coding (LPC) analysis, are another alternative to filter-bank analysis for representing the short-time spectrum. LPC analysis is an effective method for estimating the main parameters of speech signals.
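                As a hedged illustration, LPCC can be obtained by first computing LPC coefficients and then applying the standard LPC-to-cepstrum recursion; the sign convention below assumes a prediction-error polynomial A(z) = 1 + a1·z^-1 + … + ap·z^-p as returned by librosa.lpc (some textbooks use the opposite sign), and the order of 12 is an assumption:

```python
# LPCC sketch: LPC analysis followed by the LPC-to-cepstrum recursion.
import numpy as np
import librosa

def lpcc(frame, order=12):
    a = librosa.lpc(frame, order=order)   # [1, a1, ..., ap] for A(z) = 1 + sum a_k z^-k
    c = np.zeros(order)
    for n in range(1, order + 1):
        # c_n = -a_n - (1/n) * sum_{k=1}^{n-1} k * c_k * a_{n-k}
        acc = sum(k * c[k - 1] * a[n - k] for k in range(1, n))
        c[n - 1] = -a[n] - acc / n
    return c

frame = np.random.randn(400).astype(np.float64)   # placeholder 25 ms frame at 16 kHz
print(lpcc(frame))
```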

            2. Feature Labeling

              In feature labelling, each extracted feature is stored in a database along with its class label. Here the SVM is used as a classifier to classify multiple classes. Each feature vector is associated with its class label, e.g. angry, happy, sad, neutral, fear.

            3. SVM Classification

          The SVM is a very efficient & simple classification algorithm which is widely used for pattern recognition, and it often achieves better classification performance than other classifiers; thus we use it as the classifier in this paper. LIBSVM is the most widely used tool for SVM classification & regression. The main aim of an SVM classifier is to obtain a function f(x) which determines the decision boundary or hyperplane. This hyperplane optimally separates the two classes of input data points. The SVM performs a non-linear mapping from the input space to a high-dimensional space through a kernel, which is an important component of SVM learning.

          Figure 3.3 Linear Classification using SVM

          In this work, we investigated three SVM kernels:

          1. Linear Kernel

          2. Polynomial Kernel

          3. Radial Basis Function (RBF) Kernel

          Here we used the Radial Basis Function kernel in the training phase. The advantage of using the RBF kernel is that it restricts the training data to lie within specified boundaries, and it has fewer numerical difficulties than the polynomial & linear kernels.
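          A minimal training & prediction sketch with an RBF-kernel SVM is given below; scikit-learn's SVC (which wraps LIBSVM) is used, and the placeholder features, labels and hyper-parameters C and gamma are assumptions, not values from the paper:

```python
# RBF-kernel SVM sketch for emotion classification over utterance-level features.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

emotions = ["anger", "happiness", "sadness", "neutral", "fear"]

# Placeholder data: one row of features (energy, pitch, MFCC, MEDC, ...) per utterance.
X = np.random.rand(500, 39)
y = np.random.randint(0, len(emotions), 500)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Feature scaling plus RBF-kernel SVM (C and gamma would normally be tuned).
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=10.0, gamma="scale"))
clf.fit(X_train, y_train)

print("Accuracy:", clf.score(X_test, y_test))
print("Predicted emotion:", emotions[clf.predict(X_test[:1])[0]])
```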

        4. Conclusion

Speech emotion recognition (SER) has become an area of active research interest in recent years. SER plays an important role in building more anthropomorphic human-computer interfaces. As a machine learning task, successful SER requires both reliable machine learning techniques and emotion-discriminating features. While various learning algorithms have been proposed for SER, constructing powerful features, specifically spectral features, remains an open challenge. Emotion recognition is an important step toward implementing an emotional speech recognition system. The type and number of emotional states, the extracted features, the feature selection algorithm, and the type of classifier are important factors in the accuracy of emotion recognition systems. In this paper, the support vector machine was studied for a speech emotion recognition system. Speech features such as energy, pitch, formants, MFCC & MEDC were extracted from the emotional speech samples. Automatic speech emotion recognition is attracting growing interest nowadays because it results in better interaction between human & computer.

References

  [1] C. J. C. Burges, "A Tutorial on Support Vector Machines for Pattern Recognition," Data Mining and Knowledge Discovery, 2(2):955-974, Kluwer Academic Publishers, Boston, 1998.

  [2] C.-W. Hsu, C.-C. Chang, and C.-J. Lin, "A Practical Guide to Support Vector Classification," Technical Report, Department of Computer Science & Information Engineering, National Taiwan University, Taiwan.

  [3] S. Emerich, E. Lupu, and A. Apatean, "Emotions Recognition by Speech and Facial Expressions Analysis," 17th European Signal Processing Conference, 2009.

  [4] L. R. Rabiner and B. H. Juang, Fundamentals of Speech Recognition, Englewood Cliffs, NJ: Prentice-Hall, 1993.

  [5] D. Morrison, Ruili Wang, and L. C. Silva, "Spoken Affect Classification Using Neural Networks," IEEE International Conference on Granular Computing, 2, 2005, pp. 583-586.

  [6] Corinna Cortes and V. Vapnik, "Support-Vector Networks," Machine Learning, 20, 1995.

  [7] Xuan-Hung et al., "Speaker-Dependent Emotion Recognition for Audio Document Indexing," International Conference on Electronics, Information, and Communications, 2004.

  [8] L. R. Rabiner and B. H. Juang, Fundamentals of Speech Recognition, Upper Saddle River, NJ: Prentice-Hall, 1993.

  [9] E. M. Albornoz, M. B. Crolla, and D. H. Milone, "Recognition of Emotions in Speech," Proceedings of the 17th European Signal Processing Conference, 2009.

  [10] M. E. Ayadi, M. S. Kamel, and F. Karray, "Survey on Speech Emotion Recognition: Features, Classification Schemes, and Databases," Pattern Recognition 44, pp. 572-587, 2011.

  [11] D. Ververidis and C. Kotropoulos, "Emotional Speech Recognition: Resources, Features and Methods," Elsevier Speech Communication, vol. 48, no. 9, pp. 1162-1181, September 2006.

  [12] I. Chiriacescu, "Automatic Emotion Analysis."

  [13] Ying Wang, Shoufu Du, and Yongzhao Zhan, "Adaptive and Optimal Classification of Speech Emotion Recognition," Fourth International Conference on Natural Computation.

  [14] O. Khalifa, S. Khan, M. R. Islam, M. Faizal, and D. Dol, "Text Independent Automatic Speaker Recognition," 3rd International Conference on Electrical & Computer Engineering, Dhaka, Bangladesh, 2004, pp. 561-564.

  [15] Y. L. Lin and G. Wei, "Speech Emotion Recognition Based on HMM and SVM," Proceedings of the Fourth International Conference on Machine Learning and Cybernetics, Guangzhou, 18-21 August 2005.

  [16] M. D. Skowronski and J. G. Harris, "Increased MFCC Filter Bandwidth for Noise-Robust Phoneme Recognition," Proc. ICASSP-02, Florida, May 2002.

  [17] F. Burkhardt, A. Paeschke, M. Rolfes, W. F. Sendlmeier, and B. Weiss, "A Database of German Emotional Speech," Proceedings of Interspeech, Lisbon, Portugal, 2005.
