Statistical Analysis Of Emotion Detection Using Fundamental Frequency

DOI : 10.17577/IJERTV1IS4095


A. Vasavi, Assistant Professor, J. Leela Mahendra, Assistant Professor, and Shaik Abdul Rahim, Assistant Professor

Department of Electronics & Instrumentation Engineering

RGMCET, Nandyal

Keywords: Emotional speech analysis, emotional speech recognition, expressive speech, intonation, pitch contour analysis.

  1. INTRODUCTION

    EMOTION plays a crucial role in day-to-day interpersonal human interactions. Recent findings have suggested that emotion is integral to our rational and intelligent decisions. It helps us to relate with each other by expressing our feelings and providing feedback. This important aspect of human interaction needs to be considered in the design of human-machine interfaces (HMIs).

    Speech prosody is one of the important communicative channels that is influenced by and enriched with emotional modulation. The goal of this paper is twofold. The first is to study which aspects of the pitch contour are manipulated during expressive speech (e.g., curvature, contour shape, and dynamics). For this purpose, we present a novel framework based on the Kullback-Leibler divergence (KLD) and logistic regression models to identify, quantify, and rank the most emotionally salient aspects of the F0 contour. First, the symmetric Kullback-Leibler distance is used to compare the distributions of different pitch statistics (e.g., mean, maximum) between emotional speech and reference neutral speech. Then, a logistic regression analysis is implemented to discriminate emotional speech from neutral speech using the pitch statistics as input. These experiments provide insights about the aspects of pitch that are modulated to convey emotional goals. The second goal is to use these emotionally salient features to build robust prosody speech models to detect emotional speech. Gaussian mixture models (GMMs) are trained using the most discriminative aspects of the pitch contour, following the analysis results presented in this paper.

    In this paper, we have implemented a method for statistical analysis of emotion detection using the fundamental frequency. The methodology is discussed in Section II. In Section III, comparisons using the symmetric Kullback-Leibler distance are described. Section IV covers the logistic regression analysis, Section V covers the emotional discrimination results using neutral models, and Section VI concludes the study.

  2. METHODOLOGY

    1. Overview:

      The fundamental frequency or F0 contour (pitch), which is a prosodic feature, provides the tonal and rhythmic properties of the speech.

      The fundamental frequency is also a supra-segmental speech feature, where information is conveyed over longer time scales than other segmental speech correlates such as spectral envelope features. Therefore, rather than using the pitch value itself, it is commonly accepted to estimate global statistics of the pitch contour over an entire utterance or sentence (sentence-level), such as the mean, maximum, and standard deviation.
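      As an illustration of such sentence-level statistics, the sketch below (an addition of this text, not part of the original work) computes a few of them with NumPy from an F0 contour in which unvoiced frames are marked as NaN; the statistic names (Smean, Smax, ...) loosely follow the paper's naming and are assumptions.

```python
import numpy as np

def sentence_level_stats(f0):
    """Global statistics of an F0 contour over one utterance.

    f0: 1-D array of pitch values in Hz, with unvoiced frames set to NaN.
    Returns a dict of sentence-level statistics (mean, max, min, std,
    range, and inter-quartile range), as discussed above.
    """
    voiced = f0[~np.isnan(f0)]          # keep voiced frames only
    q75, q25 = np.percentile(voiced, [75, 25])
    return {
        "Smean":  float(np.mean(voiced)),
        "Smax":   float(np.max(voiced)),
        "Smin":   float(np.min(voiced)),
        "Sstd":   float(np.std(voiced)),
        "Srange": float(np.max(voiced) - np.min(voiced)),
        "Siqr":   float(q75 - q25),
    }
```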

    2. Databases:

      In this paper, five databases are considered: one non-emotional corpus used as a neutral speech reference, and four acted emotional databases with different properties.

      In the first step, the speech files are scaled such that the average RMS energy of the neutral reference database (E_ref) and the neutral subset of the emotional databases (E_neu^s) are the same for each speaker s, as given in (1). This normalization is applied separately for each subject in each database. The goal of this normalization is to compensate for different recording settings among the databases.

      s_Energy^s = E_ref / E_neu^s    ...(1)

      For the analysis and the training of the models (Sections III-V), three emotional corpora were considered.

      These emotional databases were chosen to span different emotional categories, speakers, genders, and even languages, with the purpose of including, to some extent, the variability found in the pitch. The first database was collected at the University of Southern California (USC) using an electromagnetic articulography (EMA) system. In this database, which will be referred to from here on as EMA, one male and two female subjects (two of them with formal theatrical vocal training) read ten sentences five times, portraying the emotions sadness, anger, and happiness, in addition to the neutral state. Although this database contains articulatory information, only the acoustic signals are analyzed in this study. The second emotional corpus corresponds to the Emotional Prosody Speech and Transcripts database (EPSAT). This database was collected at the University of Pennsylvania and is comprised of recordings from eight professional actors (five female and three male) who were asked to read short semantically neutral utterances corresponding to dates and numbers, expressing 14 emotional categories in addition to the neutral state.

      The third emotional corpus is the Database of German Emotional Speech (GES), which was collected at the Technical University of Berlin. This database was recorded from ten participants, five female and five male, who were selected based on the naturalness and the emotional quality of their performance in audition sessions. The emotional categories considered in the database are anger, happiness, sadness, boredom, disgust, and fear, in addition to the neutral state.

    3. Speaker Dependent Normalization:

      Normalization is a critical step in emotion recognition. The goal is to eliminate speaker and recording variability while keeping the emotional discrimination. For this analysis, a two-step approach is proposed: 1) energy normalization and 2) pitch normalization.

      In the second step, the pitch contour is normalized for each subject (speaker-dependent normalization). The average pitch across speakers in the neutral reference database, F0_ref, is estimated. Then, the average pitch value over the neutral set of the emotional databases, F0_neu^s, is estimated for each speaker. Finally, a scaling factor (s_F0) is obtained by taking the ratio between F0_ref and F0_neu^s, as shown in (2). Therefore, the neutral samples of each speaker in the databases will have a similar F0 mean value.

      s_F0^s = F0_ref / F0_neu^s    ...(2)

      One assumption made in this two-step approach is that neutral speech will be available for each speaker. For real-life applications, this assumption is reasonable when either the speakers are known or a few seconds of their neutral speech can be pre-recorded.
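      A minimal sketch of this two-step normalization, following (1) and (2), is given below. It assumes per-speaker neutral recordings and F0 contours (NaN for unvoiced frames) are available; the function names and data layout are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def rms_energy(x):
    """Root-mean-square energy of a waveform."""
    return np.sqrt(np.mean(np.asarray(x, dtype=float) ** 2))

def energy_scale(neutral_ref_wavs, speaker_neutral_wavs):
    """Step 1 (Eq. 1): scale factor so the speaker's neutral RMS energy
    matches the neutral reference database."""
    e_ref = np.mean([rms_energy(w) for w in neutral_ref_wavs])
    e_neu = np.mean([rms_energy(w) for w in speaker_neutral_wavs])
    return e_ref / e_neu

def pitch_scale(neutral_ref_f0, speaker_neutral_f0):
    """Step 2 (Eq. 2): scale factor so the speaker's neutral mean F0
    matches the neutral reference database."""
    f0_ref = np.nanmean(np.concatenate(neutral_ref_f0))
    f0_neu = np.nanmean(np.concatenate(speaker_neutral_f0))
    return f0_ref / f0_neu

# Usage: every waveform / F0 contour of speaker s is multiplied by the
# corresponding factor before feature extraction, e.g.
#   x_norm  = energy_scale(ref_wavs, neu_wavs) * x
#   f0_norm = pitch_scale(ref_f0s, neu_f0s) * f0
```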

    4. Pitch Features:

    The pitch contour was extracted with the Praat speech processing software, using an autocorrelation method. The analysis window was set to 40 ms with an overlap of 30 ms, producing 100 frames per second. The pitch was smoothed to remove any spurious spikes by using the corresponding option provided by the Praat software.
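    The paper relies on Praat's autocorrelation pitch tracker; the sketch below is not that algorithm, only a simplified frame-based autocorrelation estimator with the same 40 ms window and 10 ms hop (100 frames per second). The pitch range and voicing threshold are illustrative assumptions.

```python
import numpy as np

def f0_autocorr(x, sr, frame_ms=40, hop_ms=10, fmin=75, fmax=500, vthresh=0.3):
    """Very simplified frame-based autocorrelation F0 estimator.

    Returns one F0 value per frame (NaN for frames judged unvoiced).
    A 40 ms window with a 10 ms hop gives 100 frames per second,
    matching the setup described above.
    """
    frame, hop = int(sr * frame_ms / 1000), int(sr * hop_ms / 1000)
    lag_min, lag_max = int(sr / fmax), int(sr / fmin)
    f0 = []
    for start in range(0, len(x) - frame, hop):
        w = x[start:start + frame] - np.mean(x[start:start + frame])
        ac = np.correlate(w, w, mode="full")[frame - 1:]      # lags 0..frame-1
        if ac[0] <= 0:
            f0.append(np.nan)
            continue
        ac /= ac[0]                                           # normalize by frame energy
        lag = lag_min + int(np.argmax(ac[lag_min:lag_max]))   # best lag in pitch range
        f0.append(sr / lag if ac[lag] > vthresh else np.nan)  # simple voicing decision
    return np.array(f0)
```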

    Describing the pitch shape for emotional modulation analysis is a challenging problem, and different approaches have been proposed. The Tones and Break Indices System (ToBI) is a well-known technique to transcribe prosody (or intonation). Although progress has been made toward automatic ToBI transcription [30], an accurate and more complete prosodic transcription requires hand labeling. Furthermore, linguistic models of intonation may not be the most appropriate labels to describe the emotions. Taylor has proposed an alternative pitch contour parameterization called the Tilt Intonation Model. In this approach, the pitch contour needs to be pre-segmented into intonation events. However, there is no straightforward or readily available system to estimate these segments. Given these limitations, we follow an approach similar to the one presented by Grabe et al. The voiced regions, which are automatically segmented from the pitch values, are parameterized using polynomials. This parameterization captures the local shape of the F0 contour with few parameters, which provides a clear physical interpretation of the curves. Here, the slope (a1), curvature (b2), and inflexion (c3) are estimated to capture the local shape of the pitch contour by fitting a first-, second-, and third-order polynomial to each voiced region segment:

    y = a1·x + a0    ...(3)

    y = b2·x² + b1·x + b0    ...(4)

    y = c3·x³ + c2·x² + c1·x + c0    ...(5)

    These statistics provide insights about the local dynamics of the pitch contour. For example, while the pitch range at the sentence level (Srange) gives the extreme-value distance of the pitch contour over the entire sentence, SVmeanRange, the mean of the range of the voiced regions, indicates whether the voiced regions have a flat or inflected shape.
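    A minimal sketch of the voiced-region parameterization in (3)-(5) is shown below, assuming the F0 contour marks unvoiced frames as NaN. The minimum segment length and the function names are assumptions for illustration only.

```python
import numpy as np

def voiced_segments(f0):
    """Split an F0 contour (NaN = unvoiced) into voiced regions."""
    segs, cur = [], []
    for v in f0:
        if np.isnan(v):
            if len(cur) > 3:          # ignore very short regions (illustrative threshold)
                segs.append(np.array(cur))
            cur = []
        else:
            cur.append(v)
    if len(cur) > 3:
        segs.append(np.array(cur))
    return segs

def shape_params(segment):
    """Slope a1, curvature b2, and inflexion c3 of one voiced region,
    taken as the leading coefficients of first-, second-, and
    third-order fits (Eqs. 3-5)."""
    t = np.arange(len(segment), dtype=float)
    a1 = np.polyfit(t, segment, 1)[0]   # leading coefficient of the linear fit
    b2 = np.polyfit(t, segment, 2)[0]   # leading coefficient of the quadratic fit
    c3 = np.polyfit(t, segment, 3)[0]   # leading coefficient of the cubic fit
    return a1, b2, c3
```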

  3. EXPERIMENT 1: COMPARISONS USING SYMMETRIC KULLBACK-LEIBLER DISTANCE

    This section presents our approach to identifying and quantifying the pitch features with higher levels of emotional modulation. Instead of comparing just the mean, the distributions of the pitch features extracted from the emotional databases are compared with the distributions of the pitch features extracted from the neutral reference corpus using the KLD. The KLD provides a measure of the distance between two distributions. It is an appealing approach to robustly estimate the differences between the distributions of two random variables.

    Since the KLD is not a symmetric metric, we propose the use of the symmetric Kullback-Leibler distance, or J-divergence, which is defined as

    J(q, p) = [ D(q || p) + D(p || q) ] / 2    ...(6)

    where D(q || p) is the conventional KLD:

    D(q || p) = Σ_{x ∈ X} q(x) · log( q(x) / p(x) )    ...(7)

    The first step is to estimate the distribution of the pitch features for each database, including the neutral reference corpus. For this purpose, we propose the use of the K-means clustering algorithm to estimate the bins. This nonparametric approach was preferred since the KLD is sensitive to the bins' estimation. To compare the symmetric KLD in terms of features and emotional categories, the number of bins k was set constant for each distribution (k = 40, empirically chosen). Notice that these feature-dependent nonuniform bins were estimated considering all the databases, to include the entire range spanned by the features. After the bins were calculated, the distribution p_f^(d,e) of each pitch feature f was estimated for each database d and for each emotional category e. Therefore, the true feature distribution for each subset is approximated by counting the number of samples assigned to each bin. The same procedure was used to estimate the distribution of the pitch features in the reference neutral corpus, q_f^ref.

    The next step is to compute the symmetric KLD between the distributions of the emotional databases and the distribution estimated from the reference database, J_f^(d,e)( p_f^(d,e), q_f^ref ). This procedure is repeated for each database and for each emotional category.

    A good pitch feature for emotion discrimination would ideally have J_f^(d,neutral) close to zero (the neutral speech of database d is similar to the reference corpus) and a high value for J_f^(d,e), where e is any emotional category except the neutral state. Notice that if both J_f^(d,neutral) and J_f^(d,e) have high values, this test would indicate that the speech from the emotional database is different from the reference database (how neutral is the neutral speech?). Likewise, if both values were similar, this feature would not be relevant for emotion discrimination. Therefore, instead of directly comparing the symmetric KLD, we propose to estimate the ratio between J_f^(d,e) and J_f^(d,neutral), as defined in (8). That is, after matching the feature distributions with the reference feature distributions, the emotional speech is directly compared with the neutral set of the same emotional database by taking the ratio. High values of this ratio indicate that the pitch features of emotional speech are different from their neutral counterparts, and therefore are relevant for discriminating emotional speech from neutral speech.

    r_f^(d,e) = J_f^(d,e) / J_f^(d,neutral)    ...(8)

    The pitch features with the highest ratios are SVmeanMin, SVmeanMax, Sdiqr, and Smean among the sentence-level features, and Vrange, Vstd, Vdrange, and Vdiqr among the voiced-level features.
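    A rough sketch of the histogram and divergence computations in (6)-(8) is given below, assuming each subset's feature samples are available as 1-D arrays. The use of scikit-learn's KMeans for the bins and the small probability floor are implementation assumptions not specified in the paper.

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_bin_edges(all_samples, k=40, seed=0):
    """Feature-dependent nonuniform bins from K-means centroids,
    estimated over all databases pooled together (k = 40 as in the paper)."""
    centers = np.sort(
        KMeans(n_clusters=k, n_init=10, random_state=seed)
        .fit(np.asarray(all_samples).reshape(-1, 1))
        .cluster_centers_.ravel())
    edges = (centers[:-1] + centers[1:]) / 2          # midpoints between centroids
    return np.concatenate(([-np.inf], edges, [np.inf]))

def histogram(samples, edges, eps=1e-10):
    """Empirical distribution over the shared bins (floored to avoid log(0))."""
    counts, _ = np.histogram(samples, bins=edges)
    p = counts.astype(float) + eps
    return p / p.sum()

def symmetric_kld(q, p):
    """J-divergence of Eq. (6); each term is the conventional KLD of Eq. (7)."""
    d_qp = np.sum(q * np.log(q / p))
    d_pq = np.sum(p * np.log(p / q))
    return 0.5 * (d_qp + d_pq)

# Ratio of Eq. (8) for one feature f in database d and emotion e:
#   r = symmetric_kld(hist_emo, hist_ref) / symmetric_kld(hist_neu, hist_ref)
```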

  4. EXPERIMENT 2: LOGISTIC REGRESSION ANALYSIS

    Logistic regression is a well-known technique to model binary or dichotomous variables. In this technique, the conditional expectation of the variable given the input variables is modeled with the specific form described in (9). After applying the logit transformation (10), the regression problem becomes linear in its parameters (β). A nice property of this technique is that the significance of the coefficients can be measured using the log-likelihood ratio test between two nested models (the input variables of one model are included in the other model). This procedure provides estimates of the discriminative power of each input feature.

    E(Y | f1, f2, ..., fn) = π(x) = e^(β0 + β1·x1 + ... + βn·xn) / ( 1 + e^(β0 + β1·x1 + ... + βn·xn) )    ...(9)

    g(x) = ln[ π(x) / (1 − π(x)) ] = β0 + β1·x1 + ... + βn·xn    ...(10)

    Logistic regression analysis is used with forward feature selection (FFS) to discriminate between each emotional category and the neutral state (e.g., neutral versus anger).
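    For illustration only (not the authors' implementation), the sketch below performs forward feature selection driven by likelihood-ratio tests between nested logistic regression models (9)-(10). The use of statsmodels and the stopping rule (alpha = 0.05) are assumptions of this sketch.

```python
import numpy as np
import statsmodels.api as sm
from scipy.stats import chi2

def forward_selection(X, y, feature_names, alpha=0.05):
    """Greedy forward feature selection for logistic regression,
    adding at each step the feature whose likelihood-ratio test against
    the current nested model is most significant."""
    selected, remaining = [], list(range(X.shape[1]))
    base_ll = sm.Logit(y, np.ones((len(y), 1))).fit(disp=0).llf   # intercept-only model
    while remaining:
        best = None
        for j in remaining:
            cols = selected + [j]
            ll = sm.Logit(y, sm.add_constant(X[:, cols])).fit(disp=0).llf
            lr = 2 * (ll - base_ll)            # likelihood-ratio statistic
            p = chi2.sf(lr, df=1)              # one extra parameter per step
            if best is None or p < best[2]:
                best = (j, ll, p)
        if best is None or best[2] > alpha:
            break                              # no remaining feature is significant
        selected.append(best[0])
        remaining.remove(best[0])
        base_ll = best[1]
    return [feature_names[j] for j in selected]
```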

  5. EMOTIONAL DISCRIMINATION RESULTS USING NEUTRAL MODELS

    In [6], expressive speech was recognized using the acoustic likelihood scores obtained from hidden Markov models (HMMs). The models were trained with neutral (non-emotional) speech using spectral features. In this section, these ideas are extended to build neutral models for the selected sentence- and voiced-level pitch features.

    1. Motivation and Proposed Approach:

      Automatic emotion recognition in real-life applications is a nontrivial problem due to the inherent inter-speaker variability of expressive speech. Furthermore, the emotional descriptors are not clearly established. The feature selection and the models are trained for specific databases, with the risk of sparseness in the feature space and overfitting. It is also fairly difficult, if not infeasible, to collect enough emotional speech data so that one can train robust and universal acoustic models of individual emotions. Therefore, it is not surprising that the models built with these individual databases (usually offline) do not easily generalize to different databases or online recognition tasks in which blending of emotions is observed.

      In the first step, neutral models are built to measure the degree of similarity between the input speech and the reference neutral speech. The output of this block is a fitness measure of the input speech. In the second step, these measures are used as features to infer whether the input speech is emotional or neutral. If the features from the expressive speech differ in any aspect from their neutral counterparts, the fitness measure will decrease. Therefore, we hypothesize that setting thresholds over these fitness measures is easier and more robust than setting thresholds over the features themselves.

      The F0 contour is assumed to be largely independent of the specific lexical content, in contrast to spectral speech features. Therefore, a single lexical-independent model is adequate to model the selected pitch features. For this task, we propose the use of a univariate GMM for each pitch feature, defined in (12). The maximum likelihood estimates of the parameters of the GMM, Θ, are computed using the expectation-maximization (EM) algorithm.

      For a given input speech, the likelihoods of the models, F_f(X_f = x_f | Θ), are used as fitness measures. In the second step, a linear discriminant classifier (LDC) was implemented to discriminate between neutral and expressive speech.

      F_f(X_f = x_f) = Σ_{j=1}^{K} ( α_j / (√(2π)·σ_j) ) · exp( −(x_f − μ_j)² / (2·σ_j²) )    ...(12)

      with Θ = { α_j, μ_j, σ_j }_{j=1}^{K}, 0 ≤ α_j ≤ 1 for j = 1, ..., K, and Σ_{j=1}^{K} α_j = 1.
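      The sketch below outlines this two-step detector under stated assumptions: univariate GMM neutral models fit by EM (scikit-learn's GaussianMixture, corresponding to (12)), per-utterance log-likelihoods as fitness measures, and a linear discriminant classifier on top. The number of mixtures and all names are illustrative.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def train_neutral_models(neutral_features, n_mixtures=4):
    """One univariate GMM per pitch feature, trained on neutral speech
    only; the parameters are estimated with the EM algorithm."""
    models = {}
    for name, values in neutral_features.items():     # name -> 1-D array of samples
        gmm = GaussianMixture(n_components=n_mixtures, covariance_type="diag")
        gmm.fit(np.asarray(values).reshape(-1, 1))
        models[name] = gmm
    return models

def fitness_vector(models, utterance_features):
    """Average log-likelihood of each feature under its neutral model;
    expressive speech is expected to score lower than neutral speech."""
    return np.array([
        models[name].score(np.asarray(utterance_features[name]).reshape(-1, 1))
        for name in sorted(models)])

# Second step: a linear discriminant classifier over the fitness measures.
#   X = np.vstack([fitness_vector(models, u) for u in utterances])
#   clf = LinearDiscriminantAnalysis().fit(X, labels)   # labels: neutral vs. emotional
```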

    2. Results:

    The recognition results presented in this section are the average values over 400 realizations. Since the emotional categories are grouped together, the number of emotional samples is higher than the number of neutral samples.

    An important parameter of the GMM is the number of mixtures. The performance of the GMM-based pitch neutral models was evaluated for different numbers of mixtures, and the results show that the proposed approach is not sensitive to this parameter.

    Fig. 1. Statistical analysis for neutral and emotional speech.

  6. CONCLUSIONS

    This paper presented an analysis of different expressive pitch contour statistics with the goal of finding the emotionally salient aspects of the F0 contour (pitch). For this purpose, two experiments were proposed.

    In the first experiment, the distribution of different pitch features was compared with the distribution of the features derived from neutral speech using the symmetric KLD. In the second experiment, the emotional discriminative power of the pitch features was quantified within a logistic regression framework. Both experiments indicate that dynamic statistics such as the mean, maximum, minimum, and range of the pitch are the most salient aspects of the expressive pitch contour. The statistics were computed at the sentence and voiced-region levels. The results indicate that the system based on sentence-level features outperforms the one based on voiced-level statistics in both accuracy and robustness, which facilitates turn-by-turn processing in emotion detection.

REFERENCES

  1. R. W. Picard, "Affective Computing," MIT Media Laboratory Perceptual Computing Section, Cambridge, MA, USA, Tech. Rep. 321, Nov. 1995.

  2. R. Cowie, E. Douglas-Cowie, N. Tsapatsoulis, G. Votsis, S. Kollias, W. Fellenz, and J. Taylor, "Emotion recognition in human-computer interaction," IEEE Signal Process. Mag., vol. 18, no. 1, pp. 32-80, Jan. 2001.

  3. A. Alvarez, I. Cearreta, J. Lopez, A. Arruti, E. Lazkano, B. Sierra, and N. Garay, "Feature subset selection based on evolutionary algorithms for automatic emotion recognition in spoken Spanish and standard Basque language," in Proc. 9th Int. Conf. Text, Speech and Dialogue (TSD 2006), Brno, Czech Republic, Sep. 2006, pp. 565-572.

  4. D. Ververidis and C. Kotropoulos, "Fast sequential floating forward selection applied to emotional speech features estimated on DES and SUSAS data collections," in Proc. XIV Eur. Signal Process. Conf. (EUSIPCO 2006), Florence, Italy, Sep. 2006, pp. 929-932.

  5. M. Sedaaghi, C. Kotropoulos, and D. Ververidis, "Using adaptive genetic algorithms to improve speech emotion recognition," in Proc. Int. Workshop Multimedia Signal Process. (MMSP 2007), Chania, Crete, Greece, Oct. 2007, pp. 461-464.

  6. C. Busso, S. Lee, and S. Narayanan, "Using neutral speech models for emotional speech analysis," in Proc. Interspeech 2007 - Eurospeech, Antwerp, Belgium, Aug. 2007, pp. 2225-2228.
