A Comparative Study Of LPCC And MFCC Features For The Recognition Of Assamese Phonemes

DOI : 10.17577/IJERTV2IS1421


Utpal Bhattacharjee

Department of Computer Science and Engineering, Rajiv Gandhi University, Rono Hills, Doimukh, Arunachal Pradesh, India, Pin-791112

Abstract

In this paper, two popular feature extraction techniques, Linear Predictive Cepstral Coefficients (LPCC) and Mel Frequency Cepstral Coefficients (MFCC), have been investigated and their performances evaluated for the recognition of Assamese phonemes. A multilayer perceptron based baseline phoneme recognizer has been built, and all the experiments have been carried out using that recognizer. In the present study, an attempt has been made to evaluate the performance of the speech recognition system with the different feature sets in a quiet environment as well as at different levels of noise. It has been observed that in a noise-free operating environment, when the same speakers are used for training and testing, the system gives 100% recognition accuracy for Assamese phonemes with both feature sets. However, the performance of the system degrades considerably with increasing environmental noise. The performance of the LPCC based system degrades more rapidly than that of the MFCC based system under environmental noise, whereas under speaker variability LPCC shows relative robustness compared to MFCC, though the performance of both systems degrades considerably.

Key Terms: Speech Recognition, LPCC, MFCC, MLP

  1. Introduction

Automatic speech recognition is the task of recognizing the spoken word from the speech signal. Surveys of the robustness issues associated with automatic speech recognition have been reported by several workers [1, 2]. In the present study, the difficulties due to speaker variability and environmental factors are considered.

A word may be uttered differently by the same speaker because of differences in emotional state, health, surrounding environment (noise or quietness), etc. Utterances of the same word also vary with gender, age, dialect, the influence of other languages on the speaker, and so on. Another layer of variation is introduced by the acoustical environment in which the speech recognizer operates; these variations are due to background noise, the microphone, the transmission channel, reverberation, etc. In this paper we evaluate the performance of LPCC and MFCC feature vectors as the front-end of a speech recognizer under environmental variability and speaker variability conditions.

In the present study a multilayer perceptron based baseline system has been built for the recognition of Assamese phonemes. To categorize the related features into different classes and remove repeated data, a Self-Organizing Map (SOM) has been used. The feature trajectory obtained from the phoneme signal has been reduced to six cluster centres, and the reduced feature vector has been fed to the MLP based phoneme recognizer.

Assamese is the major language of the North-Eastern part of India, with its own unique identity and culture, though its origins trace back to the Indo-European family of languages. Assamese is the easternmost member of the New Indo-Aryan (NIA) subfamily of languages, spoken in Assam and many parts of North-Eastern India. The Assamese phonemic inventory consists of eight oral vowel phonemes, three nasalized vowel phonemes and twenty-two consonant phonemes. The phonemes of the Assamese language are given below [4]:

Table 1(a): Vowels of the Assamese language

    Vowel Type          Position     Front    Central    Back
    Oral Vowel          High         i                   u
                        High-mid
                        Mid          e                   o
                        Low-mid
                        Low                   a
    Nasalized Vowel     High         ĩ                   ũ
                        Low                   ã

Table 1(b): Consonants of the Assamese language

    Phoneme Type             Labial    Alveolar    Velar    Glottal
    Voiceless stops          p  pʰ     t  tʰ       k  kʰ
    Voiced stops             b  bʰ     d  dʰ
    Voiceless fricatives               s           x        h
    Voiced fricatives                  z
    Nasals                   m         n
    Approximants             w
    Lateral                            l

The paper is organized as follows. Section 2 discusses the LPCC and MFCC methods for speech parameterization in detail. The baseline speech recognition system is described in Section 3. In Section 4 we describe the experimental setup and the database used. Section 5 is dedicated to the description of the experiments and the results obtained. The paper is concluded in Section 6.

  2. The LPCC and MFCC methods for Speech Parameterization

In this paper two methods of speech parameterization, namely Linear Predictive Cepstral Coefficients (LPCC) and Mel Frequency Cepstral Coefficients (MFCC), have been used as front-end feature extractors. The details of both methods are given below.

1. Linear Predictive Cepstral Coefficients (LPCC)

   Linear Predictive analysis is based on the assumption that the shape of the vocal tract governs the nature of the sound being produced. To study this property quantitatively, the vocal tract is modeled by a digital all-pole filter [5]. The transfer function in the z-domain is given by

   V(z) = \frac{G}{1 - \sum_{k=1}^{p} a_k z^{-k}}   — (1)

   where V(z) is the vocal tract transfer function, G is the gain of the filter and \{a_k\} is the set of coefficients called Linear Prediction Coefficients (LPC). The upper limit of the summation, p, is the order of the all-pole filter. The set of LPC determines the characteristics of the vocal tract transfer function.

   The autocorrelation method [5] is an efficient method for evaluating the LPC set and the filter gain. It involves solving a system of simultaneous equations built from the autocorrelation of the windowed speech frames. The matrix equation that needs to be solved is

   \begin{bmatrix} R[0] & R[1] & \cdots & R[p-1] \\ R[1] & R[0] & \cdots & R[p-2] \\ \vdots & \vdots & \ddots & \vdots \\ R[p-1] & R[p-2] & \cdots & R[0] \end{bmatrix} \begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_p \end{bmatrix} = \begin{bmatrix} R[1] \\ R[2] \\ \vdots \\ R[p] \end{bmatrix}   — (2)

   where R[n] is the autocorrelation function of the windowed speech signal. The gain of the all-pole filter can be found by solving the following equation:

   G^2 = R[0] - \sum_{k=1}^{p} a_k R[k]   — (3)

   Since the matrix on the left of Eq. (2) is a Toeplitz matrix, a recursive algorithm can be used to solve it. The Levinson-Durbin recursive procedure [5] has been applied, which is given below:

   E_0 = R[0]
   k_i = \frac{R[i] - \sum_{j=1}^{i-1} a_j^{(i-1)} R[i-j]}{E_{i-1}}, \quad 1 \le i \le p
   a_i^{(i)} = k_i
   a_j^{(i)} = a_j^{(i-1)} - k_i \, a_{i-j}^{(i-1)}, \quad 1 \le j \le i-1
   E_i = (1 - k_i^2) E_{i-1}   — (4)

   The above equations are solved recursively for i = 1, 2, 3, ..., p. When i reaches the p-th iteration, the set of LPC is obtained as

   a_j = a_j^{(p)}, \quad 1 \le j \le p   — (5)

   and the gain of the all-pole filter model, G, is given by

   G^2 = E_p   — (6)

   Cepstral analysis refers to the process of finding the cepstrum of a speech sequence. Cepstral coefficients can be calculated from the LPC via a set of recursive relations [5]; the coefficients obtained in this way are called Linear Predictive Cepstral Coefficients (LPCC). The recursion is given below:

   c_0 = \ln(G^2)
   c_m = a_m + \sum_{k=1}^{m-1} \frac{k}{m} c_k a_{m-k}, \quad 1 \le m \le p
   c_m = \sum_{k=m-p}^{m-1} \frac{k}{m} c_k a_{m-k}, \quad m > p   — (7)
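   As an illustration of the procedure above, the following Python sketch computes LPCC features for a single windowed frame using the autocorrelation method, the Levinson-Durbin recursion of Eq. (4) and the cepstral recursion of Eq. (7). It assumes NumPy and a 12th-order predictor, as used later in this paper; the function and variable names are illustrative only.

```python
import numpy as np

def lpcc_from_frame(frame, order=12, n_ceps=12):
    """Sketch of LPCC extraction for one windowed frame (Eqs. 1-7)."""
    # Autocorrelation R[0..p] of the windowed frame
    r = np.array([np.dot(frame[:len(frame) - k], frame[k:]) for k in range(order + 1)])

    # Levinson-Durbin recursion (Eq. 4) solving the Toeplitz system of Eq. (2)
    a = np.zeros(order + 1)        # a[0] is unused; a[1..p] are the LPCs
    e = r[0]                       # prediction error E_0 = R[0]
    for i in range(1, order + 1):
        k = (r[i] - np.dot(a[1:i], r[i - 1:0:-1])) / e   # reflection coefficient k_i
        a_prev = a.copy()
        a[i] = k
        for j in range(1, i):
            a[j] = a_prev[j] - k * a_prev[i - j]
        e *= (1.0 - k * k)         # E_i = (1 - k_i^2) E_{i-1}

    gain_sq = e                    # G^2 = E_p (Eq. 6)

    # Cepstral recursion (Eq. 7): LPC -> LPCC
    c = np.zeros(n_ceps + 1)
    c[0] = np.log(gain_sq)
    for m in range(1, n_ceps + 1):
        acc = a[m] if m <= order else 0.0
        for k in range(1, m):
            if m - k <= order:
                acc += (k / m) * c[k] * a[m - k]
        c[m] = acc
    return c[1:]                   # 12 LPCC coefficients, c_0 (energy term) dropped
```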

2. Mel Frequency Cepstral Coefficients (MFCC)

   Mel Frequency Cepstral Coefficients (MFCC) are among the most commonly used features in speech recognition. The technique is FFT based, which means that the feature vectors are extracted from the frequency spectra of the windowed speech frames.

   The Mel frequency filter bank is a series of triangular bandpass filters. The filter bank is based on a non-linear frequency scale called the mel scale. According to Stevens et al. [6], a 1000 Hz tone is defined as having a pitch of 1000 mel. Below 1000 Hz, the mel scale is approximately linear with respect to the linear frequency scale. Above the 1000 Hz reference point, the relationship between the mel scale and the linear frequency scale is non-linear and approximately logarithmic. The following equation describes the mathematical relationship between the two scales:

   f_{mel} = 1127.01 \, \ln\!\left(\frac{f}{700} + 1\right)   — (8)

   The Mel frequency filter bank consists of triangular bandpass filters arranged in such a way that the lower boundary of one filter is situated at the centre frequency of the previous filter and the upper boundary at the centre frequency of the next filter. A fixed frequency resolution in the mel scale, corresponding to a logarithmic scaling of the linear frequency, is computed as \Delta f_{mel} = (f_H^{mel} - f_L^{mel}) / (M + 1), where f_H^{mel} is the highest frequency of the filter bank on the mel scale, computed from f_{max} using equation (8), f_L^{mel} is the lowest frequency on the mel scale, with corresponding f_{min}, and M is the number of filters in the bank. The values considered for these parameters in the present study are f_{max} = 8 kHz, f_{min} = 0 Hz and M = 20. The centre frequencies on the mel scale are given by

   f_{c_m}^{mel} = f_L^{mel} + m \, \Delta f_{mel}, \quad 1 \le m \le M   — (9)

   The centre frequencies in Hertz are given by

   f_{c_m} = 700 \left( e^{\, f_{c_m}^{mel} / 1127.01} - 1 \right)   — (10)

   Equation (10) is inserted into equation (8) to give the Mel filter bank. Finally, the MFCCs are obtained by computing the discrete cosine transform of the log filter-bank outputs X(m) using

   c(l) = \sum_{m=1}^{M} X(m) \cos\!\left( \frac{l\pi}{M} \left( m - \frac{1}{2} \right) \right)   — (11)

   for l = 1, 2, 3, ..., M, where c(l) is the l-th MFCC.

   The time derivative is approximated by a linear regression coefficient over a finite window, defined as

   \Delta c_t(l) = \sum_{k=-K}^{K} k \, c_{t+k}(l) \cdot G, \quad 1 \le l \le M   — (12)

   where c_t(l) is the l-th cepstral coefficient at time t and G is a constant used to make the variances of the derivative terms equal to those of the original cepstral coefficients.
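   The sketch below illustrates Eqs. (8)-(12) in Python: a 20-filter mel filter bank, the DCT of the log filter-bank outputs, and a regression-based delta. The FFT size, the use of the power spectrum, and the regression normalization in the delta function are assumptions made for the sake of a runnable example, not details taken from the paper.

```python
import numpy as np

def mel(f):
    """Linear frequency (Hz) to mel scale, Eq. (8)."""
    return 1127.01 * np.log(1.0 + f / 700.0)

def mel_inv(m):
    """Mel scale back to Hz, Eq. (10)."""
    return 700.0 * (np.exp(m / 1127.01) - 1.0)

def mfcc_from_frame(frame, fs=8000, n_fft=256, n_filters=20, n_ceps=12):
    """Sketch of MFCC extraction for one windowed frame (Eqs. 8-11)."""
    spectrum = np.abs(np.fft.rfft(frame, n_fft)) ** 2            # power spectrum (assumed)

    # Filter boundaries equally spaced on the mel scale (Eq. 9)
    mel_edges = np.linspace(mel(0.0), mel(fs / 2.0), n_filters + 2)
    hz_edges = mel_inv(mel_edges)
    bins = np.floor((n_fft + 1) * hz_edges / fs).astype(int)

    # Triangular filters: each rises from the previous centre and falls to the next
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        lo, ctr, hi = bins[m - 1], bins[m], bins[m + 1]
        for k in range(lo, ctr):
            fbank[m - 1, k] = (k - lo) / max(ctr - lo, 1)
        for k in range(ctr, hi):
            fbank[m - 1, k] = (hi - k) / max(hi - ctr, 1)

    log_energy = np.log(fbank @ spectrum + 1e-10)                # X(m), log filter outputs

    # DCT of the log filter-bank outputs (Eq. 11); drop c_0, keep the next 12
    l = np.arange(1, n_ceps + 1)[:, None]
    m_idx = np.arange(1, n_filters + 1)[None, :]
    dct = np.cos(l * np.pi / n_filters * (m_idx - 0.5))
    return dct @ log_energy                                      # 12 MFCCs

def delta(ceps_sequence, K=2):
    """First-order derivative over a +/-K frame window, in the spirit of Eq. (12).

    ceps_sequence: array of shape (n_frames, n_ceps). The standard regression
    normalization is used here in place of the paper's constant G.
    """
    pad = np.pad(ceps_sequence, ((K, K), (0, 0)), mode="edge")
    weights = np.arange(-K, K + 1)
    norm = np.sum(weights ** 2)
    return np.stack([pad[t:t + 2 * K + 1].T @ weights / norm
                     for t in range(len(ceps_sequence))])
```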

  3. Baseline Speech Recognition System

A baseline speech recognition system was developed during the present work using a Multilayer Perceptron to recognize the phonemes of the Assamese language. To reduce the feature vector, a Self-Organizing Map has been used. The details of the Self-Organizing Map and the Multilayer Perceptron are given below.

    1. Self-Organizing Map (SOM)

       Kohonen [7] proposed a Neural Network (NN) architecture which can automatically develop self-organization properties during an unsupervised learning process, namely the Self-Organizing Map (SOM). All the input vectors of the utterances are presented to the network sequentially in time without specifying the desired output. After enough input vectors have been presented, the weight vectors from the input to the output nodes specify cluster or vector centres that sample the input space, such that the point density function of the vector centres tends to approximate the probability density function of the input vectors. In addition, the weight vectors become organized such that topologically close nodes are sensitive to inputs that are physically similar in Euclidean distance. Kohonen has proposed an efficient learning algorithm for practical applications, and this learning algorithm has been used in the proposed system.

       Using the fact that the SOM is a Vector Quantization (VQ) scheme that preserves some of the topology of the original space [8], the basic idea behind the approach proposed in this work is to use the output of a SOM, trained with the output of the speech processing block, to obtain a reduced feature vector (binary matrix) that preserves some of the behaviour of the original feature vector. The problem is thereby reduced to finding the correct number of neurons (the dimension of the SOM). Based on the ideas stated above, the optimal dimension of the SOM has to be searched for, in order to ensure that the SOM has enough neurons to reduce the dimensionality of the feature vector while keeping enough information to achieve high recognition accuracy [9].

       1. SOM Architecture

          The SOM consists of only one real layer of neurons, arranged in a 2-D lattice. This architecture implements the similarity measure using Euclidean distance; in effect, it measures the cosine of the angle between normalized input and weight vectors. Since the SOM algorithm uses the Euclidean metric to measure distances between data vectors, scaling of the variables was deemed an important step, and all input vectors have been normalized to unit length. The input vector is normalized between -1 and +1 before it is fed into the network. The output of the network is given by the most active neuron, the winning neuron.

       2. Learning Algorithm

          The objective of the learning algorithm in SOM neural networks is the formation of the feature map, which captures the essential characteristics of the p-dimensional input data and maps them onto a typically 2-D feature space. The learning algorithm captures two essential aspects of map formation, namely competition and cooperation between neurons of the output lattice.

          Assume M_{ij}(t) = \{ m_{1ij}(t), m_{2ij}(t), \ldots, m_{Nij}(t) \} is the weight vector of node (i, j) of the feature map at time instance t, where i, j = 1, ..., M are the horizontal and vertical indices of the square grid of output nodes and N is the dimension of the input vector. Denoting the input vector at time t as X(t), the learning algorithm can be summarized as follows [8]:

          1. Initializing the weights

             Prior to training, each node's weights must be initialized. Typically these are set to small standardized random values. The weights of the SOM in this work are initialized so that 0 < weight < 1.

          2. Calculating the winner node – Best Matching Unit (BMU)

             To determine the BMU, one method is to iterate through all the nodes and calculate the Euclidean distance between each node's weight vector and the current input vector. The node with the weight vector closest to the input vector is tagged as the BMU. The Euclidean distance is given as

             Dist = \sqrt{ \sum_{i=0}^{n} \left( X_i(t) - M_{ij}(t) \right)^2 }   — (13)

             The node with the minimum Euclidean distance to the input vector X(t) is selected as

             \| X(t) - M_{i_c j_c}(t) \| = \min_{i,j} \| X(t) - M_{ij}(t) \|   — (14)

          3. Determining the Best Matching Unit's local neighbourhood

             For each iteration, after the BMU has been determined, the next step is to calculate which of the other nodes are within the BMU's neighbourhood. The radius of the neighbourhood is calculated, and the area of the neighbourhood shrinks over time using the exponential decay function

             \sigma(t) = \sigma_0 \exp\!\left( -\frac{t}{\lambda} \right), \quad t = 1, 2, 3, \ldots   — (15)

             where \sigma_0 denotes the width of the lattice at time t = 0, t is the current time-step and \lambda is a time constant. If a node is found to be within the neighbourhood, its weight vector is adjusted as shown in the next step.

          4. Adjusting the weights

             Every node within the BMU's neighbourhood, including the BMU (i_c, j_c), has its weight vector adjusted according to the following equation:

             M_{ij}(t+1) = M_{ij}(t) + L(t) \left( X(t) - M_{ij}(t) \right) for nodes within the neighbourhood,
             M_{ij}(t+1) = M_{ij}(t) for all other indices (i, j)   — (16)

             where t represents the time-step and L(t) is a small value called the learning rate, which decreases with time. Basically, this means that the new adjusted weight for the node is equal to the old weight plus a fraction of the difference between the old weight M_{ij} and the input vector X. The decay of the learning rate is calculated at each iteration using the following equation:

             L(t) = L_0 \exp\!\left( -\frac{t}{\lambda} \right), \quad t = 1, 2, 3, \ldots   — (17)

             Ideally, the amount of learning should also fade over distance, similar to a Gaussian decay. So an adjustment is made to equation (16), shown in the equation below:

             M_{ij}(t+1) = M_{ij}(t) + \Theta(t) \, L(t) \left( X(t) - M_{ij}(t) \right)   — (18)

             where \Theta(t) represents the amount of influence a node's distance from the BMU has on its learning, and is given by

             \Theta(t) = \exp\!\left( -\frac{dist^2}{2 \sigma^2(t)} \right), \quad t = 1, 2, 3, \ldots   — (19)

             where dist is the distance of a node from the BMU and \sigma is the width of the neighbourhood function as calculated by equation (15). Hence \Theta also decays over time.

          5. Update the time, t = t + 1, present a new input vector and go to Step 2.

          6. Continue until L(t) approaches a certain pre-defined value or t reaches the maximum number of iterations.
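          A minimal Python sketch of the learning loop above is given below. The grid size, decay constant and learning rate are illustrative values, not parameters reported in the paper; in the present work the trained map is further used to reduce each feature trajectory to six cluster centres.

```python
import numpy as np

def train_som(data, grid_size=10, n_iter=1000, sigma0=5.0, lr0=0.1, tau=1000.0):
    """Minimal sketch of the SOM learning loop (Eqs. 13-19).

    data: array of shape (n_samples, dim). grid_size, sigma0, lr0 and tau are
    illustrative values only.
    """
    rng = np.random.default_rng(0)
    weights = rng.random((grid_size, grid_size, data.shape[1]))        # step 1: init in (0, 1)
    # (row, col) coordinates of every node, used for lattice distances
    coords = np.stack(np.meshgrid(np.arange(grid_size), np.arange(grid_size),
                                  indexing="ij"), axis=-1).astype(float)

    for t in range(1, n_iter + 1):
        x = data[rng.integers(len(data))]                              # present one input vector

        # Step 2: best matching unit by Euclidean distance (Eqs. 13-14)
        dists = np.linalg.norm(weights - x, axis=-1)
        bmu = np.unravel_index(np.argmin(dists), dists.shape)

        # Step 3: neighbourhood radius and learning rate decay (Eqs. 15, 17)
        sigma = sigma0 * np.exp(-t / tau)
        lr = lr0 * np.exp(-t / tau)

        # Step 4: Gaussian neighbourhood influence (Eq. 19) and weight update (Eq. 18)
        lattice_d2 = np.sum((coords - np.array(bmu, dtype=float)) ** 2, axis=-1)
        theta = np.exp(-lattice_d2 / (2.0 * sigma ** 2))
        weights += lr * theta[..., None] * (x - weights)

    return weights
```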

    2. Multilayer Perceptron based Phoneme Recognizer

       In the present study a Multilayer Perceptron (MLP) has been used to design the speech recognizer for the phonemes of the Assamese language. The MLP consists of an input layer, an output layer and three hidden layers. To train the MLP, a modified version of the well-known Back Propagation Algorithm [10] has been used. To avoid oscillations at local minima, a momentum constant has been introduced, which provides optimization in the weight updating process. The algorithm is detailed below:

       1. Initialization

          The weights of each layer have been initialized to random numbers lying between -1 and 1.

       2. Forward computation

          In the forward pass the synaptic weights remain unaltered throughout the network, and the functional signal of the network is computed on a neuron-by-neuron basis. The induced local field v_j^{(l)}(n) for neuron j in layer l, which is due to the functional signals produced by the neurons of layer l-1, is given by [11]

          v_j^{(l)}(n) = \sum_{i=0}^{m_0} w_{ji}^{(l)}(n) \, y_i^{(l-1)}(n)   — (20)

          where m_0 is the total number of inputs, excluding the bias, applied to neuron j. The synaptic weight w_{j0} corresponds to the fixed input y_0 = +1 and equals the bias b_j applied to neuron j. Hence the functional signal appearing at the output of neuron j of layer l is expressed as

          y_j^{(l)}(n) = \varphi_j\!\left( v_j^{(l)}(n) \right)   — (21)

          If neuron j is in the first hidden layer,

          y_j^{(0)}(n) = x_j(n)   — (22)

          where x_j(n) is the j-th element of the input vector. If, on the other hand, neuron j is in the output layer of the network, and L is the depth of the network, then

          y_j^{(L)}(n) = o_j(n)   — (23)

          where o_j(n) is the j-th element of the output vector. The output is compared with the desired response d_j(n) to obtain the error signal e_j(n) for the j-th output neuron:

          e_j(n) = d_j(n) - o_j(n)   — (24)

       3. Backward computation

          The backward pass starts at the output layer by passing the error signal leftward through the network, layer by layer, and recursively computing \delta (i.e. the local gradient) for each neuron:

          \delta_j^{(L)}(n) = e_j^{(L)}(n) \, \varphi_j'\!\left( v_j^{(L)}(n) \right) for neuron j in the output layer L,
          \delta_j^{(l)}(n) = \varphi_j'\!\left( v_j^{(l)}(n) \right) \sum_{k} \delta_k^{(l+1)}(n) \, w_{kj}^{(l+1)}(n) for neuron j in hidden layer l   — (25)

          The weight update takes place in accordance with the following rule:

          w_{ji}^{(l)}(n+1) = w_{ji}^{(l)}(n) + \alpha \left[ \Delta w_{ji}^{(l)}(n-1) \right] + \eta \, \delta_j^{(l)}(n) \, y_i^{(l-1)}(n)   — (26)

          where \eta is the learning rate and \alpha is the momentum constant.

       It has been observed that the MLP based speech recognizer works better if the inputs and outputs lie between 0 and 1. Therefore, the input vector has been normalized with respect to its maximum and minimum values. A momentum constant has been used to avoid oscillation at local minima. The learning rate parameter has been changed gradually with the epoch number, as expressed by the equation given below:

       \eta(\text{epochNumber}) = \eta_0 \exp\!\left( -\frac{\text{epochNumber}}{100} \right)   — (27)

       where \eta_0 is the initial learning rate parameter.
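       The two update rules that distinguish this training procedure, the momentum-based weight update of Eq. (26) and the epoch-wise learning-rate decay of Eq. (27), can be written compactly as in the following sketch. The numerical values of the initial learning rate and the momentum constant are assumptions, since the paper does not report them.

```python
import numpy as np

def lr_schedule(epoch, lr0=0.5):
    """Learning-rate decay per epoch, Eq. (27); lr0 is an illustrative initial value."""
    return lr0 * np.exp(-epoch / 100.0)

def update_weights(w, delta_prev, local_grad, y_prev, lr, momentum=0.9):
    """One application of the momentum update rule of Eq. (26).

    w          : weight matrix of a layer, shape (n_out, n_in)
    delta_prev : previous weight change, same shape as w
    local_grad : local gradients delta_j of this layer, shape (n_out,)
    y_prev     : outputs y_i of the previous layer, shape (n_in,)
    momentum   : illustrative value for the momentum constant alpha
    """
    delta_w = momentum * delta_prev + lr * np.outer(local_grad, y_prev)
    return w + delta_w, delta_w
```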

  4. Experimental Setup and Database Used

      1. Experimental Setup

        The baseline speech recognition process has the following steps:

        1. Digitizing the speech that is to be recognized

        2. Computing the features of the speech signal

        3. Reducing the feature set using a Self-Organizing Map (SOM)

        4. Classifying each reduced feature set with the MLP based phoneme classifier to obtain the corresponding phoneme.

Speech is first filtered to a bandwidth of 4 kHz and then digitized at an 8 kHz sampling rate. The digitized speech is then pre-emphasized using a simple first-order digital filter with transfer function H(z) = 1 - 0.95 z^{-1}. The pre-emphasized speech is then blocked into frames of 256 samples. The objective is to block the speech signal into frames of 30 milliseconds, which would contain 240 samples; however, to make the FFT efficient, the frame length is made a power of 2. The frame rate is 100 Hz. In order to remove leakage effects and to smooth the edges, each frame is multiplied by a Hamming window as defined by

w(n) = 0.54 - 0.46 \cos\!\left( \frac{2\pi n}{N - 1} \right), \quad 0 \le n \le N-1, \; N = 256   — (28)
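A sketch of this front-end, assuming NumPy and a frame shift of 80 samples (implied by the 100 Hz frame rate at 8 kHz sampling, but not stated explicitly in the paper), is given below.

```python
import numpy as np

def preprocess(speech, frame_len=256, frame_shift=80):
    """Sketch of the front-end preprocessing: pre-emphasis, framing, Hamming windowing."""
    # Pre-emphasis with H(z) = 1 - 0.95 z^-1
    emphasized = np.append(speech[0], speech[1:] - 0.95 * speech[:-1])

    # Hamming window, Eq. (28)
    n = np.arange(frame_len)
    window = 0.54 - 0.46 * np.cos(2.0 * np.pi * n / (frame_len - 1))

    # Blocking into overlapping 256-sample frames and windowing each frame
    n_frames = 1 + (len(emphasized) - frame_len) // frame_shift
    frames = np.stack([emphasized[i * frame_shift: i * frame_shift + frame_len] * window
                       for i in range(n_frames)])
    return frames
```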

From each windowed speech frame two types of features were extracted: LPCC and MFCC. To obtain the LPCC, a 12th-order predictor is used and 12 LPCC coefficients are obtained by applying the method described in Section 2. Similarly, each windowed frame is passed through a bank of 20 triangular bandpass filters constrained to the frequency band 300-3400 Hz. The 0th cepstral coefficient has not been considered, as it corresponds to the energy of the whole frame; to reduce the computational load, only the next 12 coefficients have been used in the present study. To capture the time-varying nature of the speech signal, the first-order derivatives of the LPCC and MFCC features are appended to the original feature set for each frame. Thus, we get two distinct sets of 24-dimensional feature vectors.

In order to reduce the volume of data without losing the topological information, we use the self-organizing map (SOM) to cluster the feature vectors of each utterance into six clusters, whose centroids are detected dynamically. Thus both the LPCC and the MFCC feature vectors are reduced to 6 cluster centres each. To carry out the recognition task, an MLP-based recognizer is designed with 144 input nodes (6 cluster centres of 24 dimensions each), 3 hidden layers with different numbers of nodes, and an output layer with 33 nodes corresponding to the 33 phonemes of the Assamese language. Experimentally, the numbers of nodes in the three hidden layers have been fixed at 99, 68 and 47 respectively, and the same configuration has been used in all the experiments.
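The arithmetic behind the 144 input nodes (6 cluster centres times 24 coefficients) can be made explicit with a small sketch; the function below is purely illustrative and stands in for the trained SOM.

```python
import numpy as np

def reduced_feature_vector(frame_features, cluster_fn):
    """Sketch of the SOM-based reduction described above.

    frame_features : (n_frames, 24) array of static + delta coefficients
    cluster_fn     : any function returning 6 cluster centres of shape (6, 24);
                     in the paper this role is played by the trained SOM.
    Returns a 144-dimensional vector (6 x 24) fed to the MLP.
    """
    centres = cluster_fn(frame_features)           # shape (6, 24)
    return centres.reshape(-1)                     # 144-dimensional MLP input
```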

      2. Databases

The datasets used in the present study are described below:

Dataset-I (Clean): The dataset contains 20 utterances of each phoneme for each speaker. Speech data have been collected from 50 speakers, 27 male and 23 female. To collect the phoneme utterances, recordings were made of isolated words, selected in such a way that they include every phoneme at least 20 times. The recording was done using a headphone microphone at 8 kHz sampling rate with 16-bit mono resolution. The isolated words so recorded have been manually segmented into phonemes using the Praat and EasyAlign tools.

Dataset-II (20dB SNR): Dataset-II is a noisy version of Dataset-I. Simulated Gaussian noise has been added digitally to the samples of Dataset-I to obtain an SNR of 20 dB.

Dataset-III (15dB SNR): Dataset-III is similar to Dataset-II except that the SNR of the simulated Gaussian noise added to the clean speech is 15 dB.

        Dataset-IV (10dB SNR): The SNR of the simulated Gaussian noise added to the clean speech is 10 dB.
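The paper does not give the exact recipe used to simulate the noisy datasets; a common approach, shown below purely as an assumption, is to scale white Gaussian noise to the target SNR and add it to the clean samples.

```python
import numpy as np

def add_noise(clean, snr_db, seed=0):
    """Add white Gaussian noise to a clean signal at a target SNR (assumed procedure)."""
    rng = np.random.default_rng(seed)
    signal_power = np.mean(clean ** 2)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=clean.shape)
    return clean + noise
```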

  5. Experiment

The recognizer is trained with clean speech (Dataset-I) using the modified version of the back-propagation algorithm described in Section 3. 100 occurrences of each phoneme, collected from 10 speakers (5 male and 5 female), have been considered for training the system. Once the system has converged, it is tested with the remaining phoneme occurrences of the same dataset. Testing has been done to evaluate the performance of the system both when the training and testing speakers are the same and when they are different. The results of the experiments are given in Table 3.

Table 3: Phoneme recognition accuracy using clean speech

    Feature Set    Recognition Accuracy (%)
                   Same Speaker    Different Speaker
    LPCC           100             94.23
    MFCC           100             89.14

In the next experiments, the same speech recognizer has been tested on speech data with different levels of noise, i.e., with Datasets II, III and IV. The experiments were carried out using speech data from the same group of speakers used for training the system. The performance of the speech recognition system is reported in Table 4.

Table 4: Performance of the speech recognition system at different levels of noise

    SNR      Feature Set    Recognition Accuracy (%)
    20 dB    LPCC           73.27
             MFCC           97.03
    15 dB    LPCC           59.41
             MFCC           85.15
    10 dB    LPCC           47.52
             MFCC           68.32

  6. Conclusion

From the above experiments, it has been observed that both MFCC and LPCC, along with their first-order derivatives, can serve as efficient parameterizations of the speech signal for the recognition of the phonemes of the Assamese language using an MLP based recognizer. However, the performance of the system degrades considerably when the training and testing conditions change. It has been observed that, under the same environmental conditions, when different sets of speakers are used for training and testing the MLP based recognizer, the LPCC feature vector gives a recognition accuracy of 94.23% whereas for MFCC the recognition accuracy is 89.14%. Thus LPCC appears to give a better representation of the speaker-independent content of the speech signal, whereas MFCC captures some of the speaker-dependent properties of the speech signal along with the speech content. In noisy conditions, however, the MFCC based system gives a relatively robust performance compared to the LPCC based system. At a 20 dB SNR the MFCC based system gives 97.03% recognition accuracy whereas, under the same conditions, the recognition accuracy of the LPCC based system is 73.27%, i.e. a difference of nearly 24% in recognition accuracy. The same trend has been observed at the other two noise levels. With increasing noise the performance of the MFCC based system also degrades, but the degradation of the LPCC based system is sharply greater than that of the MFCC based system.

References

[1]. Picheny, M.; Nahamoo, D.; Goel, V.; Kingsbury, B.; Ramabhadran, B.; Rennie, S. J.; Saon, G., "Trends and advances in speech recognition," IBM Journal of Research and Development, vol. 55, no. 5, pp. 2:1-2:18, Sept.-Oct. 2011.

[2]. Mitra, V.; Hosung Nam; Espy-Wilson, C. Y.; Saltzman, E.; Goldstein, L., "Articulatory Information for Noise Robust Speech Recognition," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 7, pp. 1913-1924, Sept. 2011.

[3]. Lippmann, R.; Martin, E. and Paul, D.: Multi-Style Training for Robust Isolated-Word Speech Recognition, IEEE International Conference on Acoustics, Speech and Signal Processing, 705-708, April 1987.

[4]. Technology Development for Indian Language, Department of Information Technology, http://tdil.mit.gov.in.

[5]. Rabiner, L. and Schafer, R., Digital Processing of Speech Signals. Prentice Hall, Inc., Englewood Cliffs, New Jersey, 1978.

[6]. Stevens, S., Volkmann, J., and Newman, E., A Scale for the Measurement of the Psychological Magnitude Pitch. Journal of the Acoustical Society of America 8: 185-190, 1937.

[7]. Kohonen, T., Self-Organizing Neural Networks – Recent Advances and Applications(Studies in Fuzziness and Soft Computing),Physica-Verlag HD , 2002.

[8]. Moosavi, SeyedVahid, and Qin Rongjun. "A New Automated Hierarchical Clustering Algorithm Based on Emergent Self Organizing Maps." Information Visualisation (IV), 2012 16th International Conference on. IEEE, 2012.

[9]. Gavat, I., Valsan, Z. and Sabac, B., Combining Self-Organizing Map and Multilayer Perceptron in a Neural System for Improved Isolated Word Recognition. Communication '98, 245-255, 1998.

[10]. Gelenbe, E. (Ed.): Neural Networks: Advances and Applications, North-Holland, New York, 1991.

[11]. Zeidenberg, M.: Neural Network Models in Artificial Intelligence, E.Horwood, London, 1990.
