Application Of Different Filters In Mel Frequency Cepstral Coefficients Feature Extraction And Fuzzy Vector Quantization Approach In Speaker Recognition
*Satyanand Singh
Associate Professor
**Dr. E.G. Rajan
Director
Abstract
The front-end, or feature extractor, is the first component in an automatic speaker recognition system. Feature extraction transforms the raw speech signal into a compact but effective representation that is more stable and discriminative than the original signal. Since the front-end is the first component in the chain, the quality of the later components is strongly determined by its quality. Over the years, Mel-Frequency Cepstral Coefficients (MFCC), modeled on the human auditory system, have been used as a standard acoustic feature set for speech related applications. In this paper it is shown that the inverted Mel-Frequency Cepstral Coefficients (IMFCC) form a performance enhancing parameterization for speaker recognition, since they carry complementary information from the high frequency region; the paper also introduces the Gaussian shaped filter (GF) and the Tukey filter for calculating MFCC and inverted MFCC in place of the traditional triangular shaped bins. The main idea is to introduce a higher amount of correlation between subband outputs. The performance of both MFCC and inverted MFCC improves with the GF and the Tukey filter over the traditional triangular filter (TF) based implementation, individually as well as in combination. In this study, Fuzzy Vector Quantization (FVQ) is used for speaker modeling. Fuzzy clustering methods allow objects to belong to several clusters simultaneously, with different degrees of membership. The performance of the proposed GF and Tukey filter based MFCC and IMFCC, in individual and merged mode, has been verified on two standard databases, POLYCOST (telephone speech) and TIMIT, each of which has more than 130 speakers, as well as on self-collected voice data from 90 speakers.
-
Introduction
A speaker recognition system mainly consists of two modules: a speaker specific feature extractor as the front end, followed by a speaker modelling technique for a generalized representation of the extracted features [1, 2]. MFCC has long been considered a reliable front end for speaker recognition applications because its coefficients represent audio based on perception [3, 4]. In MFCC the frequency bands are positioned logarithmically, which approximates the human auditory system's response more closely than the linearly spaced frequency bands of the FFT or DCT. An illustrative speaker recognition system is shown in figure 1. State of the art speaker recognition research primarily investigates speaker specific complementary information relative to MFCC. It has been observed that the performance of speaker recognition improves significantly when complementary information is merged with MFCC, either at the feature level by simple concatenation or by combining model scores. The main sources of complementary information are pitch [5], residual phase [6], prosody [7], dialectal features [8], etc. These features are related to vocal cord vibration, and it is very difficult to extract speaker specific information from them. It has been shown that complementary information can be captured easily from the high frequency part of the energy spectrum of a speech frame via a reversed filter bank methodology [9]. In this paper, an inverted MFCC is proposed to capture speaker features that tend to be present in the high frequency part of the spectrum and are generally ignored by MFCC. The complementary information captured by the inverted MFCC is modelled by the Fuzzy Vector Quantization (FVQ) [10] technique. In many real situations, fuzzy clustering is more natural than hard clustering, as objects on the boundaries between several classes are not forced to fully belong to one of the classes, but rather are assigned membership degrees between 0 and 1 indicating their partial memberships. The present study was therefore undertaken with the objective of finding components that improve speaker recognition efficiency.
-
Methodology
In the present investigation, GF and Tukey filters were used as the averaging bins, instead of triangular filters, for calculating MFCC as well as inverted MFCC in a typical speaker recognition application [11, 12]. There are three main motivations for using the GF and the Tukey filter. First, both filters provide a much smoother transition from one subband to the next, preserving most of the correlation between them. Second, their means and variances can be chosen independently, giving control over the amount of overlap with neighbouring subbands. Third, the design parameters of the GF and the Tukey filter can be calculated very easily from the mid- and end-points located at the base of the original TF used for MFCC and inverted MFCC. In this investigation, both the MFCC and the inverted MFCC filter banks are realized using a moderate variance, so that a GF or Tukey filter covers its subband while the correlation with neighbouring subbands is kept balanced. Results show that GF and Tukey based MFCC and inverted MFCC individually perform better than the conventional TF based MFCC and inverted MFCC. Results are also better when the model scores of GF and Tukey based MFCC and inverted MFCC are merged, in comparison with the results obtained by combining MFCC and inverted MFCC feature sets realized using the traditional TF [13]. All the implementations have been done with FVQ [14].
-
Mel Frequency and Their Calculation
-
Mel-Frequency Cepstral Coefficients using triangular filters
According to psychophysical studies, human perception of the frequency content of sounds follows a subjectively defined nonlinear scale called the Mel scale [15, 16]. MFCC is the most commonly used acoustic feature for speech and speaker recognition. It is the only acoustic approach that takes human perceptual sensitivity to frequency (the physiology and behavioural aspects of the voice production organs) into consideration, and is therefore well suited for speaker recognition. The Mel scale is defined as

$$f_{mel} = 2595 \log_{10}\left(1 + \frac{f}{700}\right) \qquad (1)$$

where $f_{mel}$ is the subjective pitch in Mels corresponding to f, the actual frequency in Hz.
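As a concrete illustration of eqn. (1) and its inverse, the following minimal Python sketch (NumPy only; the function names are ours, not the paper's) converts between Hz and Mel:

```python
import numpy as np

def hz_to_mel(f_hz):
    """Eqn. (1): subjective pitch in Mels for a frequency in Hz."""
    return 2595.0 * np.log10(1.0 + np.asarray(f_hz, dtype=float) / 700.0)

def mel_to_hz(f_mel):
    """Inverse mapping from Mels back to Hz (eqn. (5) below)."""
    return 700.0 * (10.0 ** (np.asarray(f_mel, dtype=float) / 2595.0) - 1.0)

if __name__ == "__main__":
    # With Fs = 8 kHz, f_low = 31.25 Hz and f_high = 4 kHz as used in this work.
    print(hz_to_mel([31.25, 1000.0, 4000.0]))   # approx. [49.2, 1000.0, 2146.1]
    print(mel_to_hz(hz_to_mel(4000.0)))         # approx. 4000.0 (round trip)
```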
Figure 1: Speaker recognition system (training mode: speech signal → feature extraction → speaker modeling for speakers 1…N; recognition mode: feature extraction → pattern matching against the speaker models → decision logic → decision)

This leads to the definition of MFCC, a baseline acoustic feature for speech and speaker recognition applications, which can be calculated as follows [17]. Let $\{y(n)\}_{n=1}^{N_s}$ represent a frame of speech that is pre-emphasized and Hamming-windowed.
First, y(n) is converted to the frequency domain by an $M_s$-point DFT, which leads to the energy spectrum

$$\left|Y(k)\right|^2 = \left| \sum_{n=1}^{N_s} y(n)\, e^{-j\frac{2\pi}{M_s}nk} \right|^2 \qquad (2)$$

where $1 \le k \le M_s$. This is followed by the construction of a filter bank with Q unity-height TFs, uniformly spaced in the Mel scale of eqn. (1). The filter response $\psi_i(k)$ of the i-th filter in the bank (figure 2) is defined as

$$\psi_i(k) = \begin{cases} 0 & \text{for } k < k_{b_{i-1}} \\[2pt] \dfrac{k - k_{b_{i-1}}}{k_{b_i} - k_{b_{i-1}}} & \text{for } k_{b_{i-1}} \le k \le k_{b_i} \\[2pt] \dfrac{k_{b_{i+1}} - k}{k_{b_{i+1}} - k_{b_i}} & \text{for } k_{b_i} \le k \le k_{b_{i+1}} \\[2pt] 0 & \text{for } k > k_{b_{i+1}} \end{cases} \qquad (3)$$

where $1 \le i \le Q$, Q is the number of filters in the bank, $\{k_{b_i}\}_{i=0}^{Q+1}$ are the boundary points of the filters, and k denotes the coefficient index of the $M_s$-point DFT. The boundary points are equally spaced in the Mel scale, which satisfies the definition

$$k_{b_i} = \left(\frac{M_s}{F_s}\right) f_{mel}^{-1}\!\left( f_{mel}(f_{low}) + \frac{i\,\big[f_{mel}(f_{high}) - f_{mel}(f_{low})\big]}{Q+1} \right) \qquad (4)$$

where the function $f_{mel}(\cdot)$ is defined in eqn. (1), $M_s$ is the number of points in the DFT of eqn. (2), $F_s$ is the sampling frequency, $f_{low}$ and $f_{high}$ are the low and high frequency boundaries of the filter bank, and $f_{mel}^{-1}$ is the inverse of the transformation in eqn. (1), defined as

$$f_{mel}^{-1}(f_{mel}) = 700\left(10^{\,f_{mel}/2595} - 1\right) \qquad (5)$$

The sampling frequency $F_s$ and the frequencies $f_{low}$, $f_{high}$ are in Hz, while $f_{mel}$ is in Mels. In this work, $F_s$ is 8 kHz, $M_s$ is taken as 256, $f_{low} = F_s/M_s = 31.25$ Hz and $f_{high} = F_s/2 = 4$ kHz. Next, this filter bank is imposed on the spectrum calculated in eqn. (2). The outputs $\{e(i)\}_{i=1}^{Q}$ of the Mel-scaled band-pass filters can be calculated by a weighted summation of the respective filter response $\psi_i(k)$ and the energy spectrum $|Y(k)|^2$ as

$$e(i) = \sum_{k=1}^{M_s/2} \psi_i(k)\, \left|Y(k)\right|^2 \qquad (6)$$

Finally, a DCT is taken on the log filter bank energies $\{\log_{10}(e(i))\}_{i=1}^{Q}$ and the final MFCC coefficients $C_m$ can be written as

$$C_m = \sqrt{\frac{2}{Q}}\, \sum_{i=1}^{Q} \log_{10}\big(e(i)\big)\, \cos\!\left( \frac{(2i-1)}{2}\cdot\frac{m\pi}{Q} \right) \qquad (7)$$

where $0 \le m \le R-1$ and R is the desired number of cepstral features.

Figure 2: Response of a typical Mel scale filter
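To make eqns. (2)-(7) concrete, the sketch below builds the Q triangular filters on Mel-spaced boundary bins and converts one pre-emphasized, Hamming-windowed frame into R cepstral coefficients. It is a minimal, self-contained illustration under the parameters quoted above (Fs = 8 kHz, Ms = 256, Q = 22, R = 25); helper names such as `triangular_filterbank` are ours, not the paper's.

```python
import numpy as np

hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)    # eqn. (1)
mel_to_hz = lambda m: 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)  # eqn. (5)

def triangular_filterbank(fs=8000, n_fft=256, n_filt=22, f_low=31.25, f_high=4000.0):
    """Eqns. (3)-(5): Q unity-height triangular filters on Mel-spaced boundary bins."""
    mel_pts = np.linspace(hz_to_mel(f_low), hz_to_mel(f_high), n_filt + 2)
    k_b = np.floor((n_fft / fs) * mel_to_hz(mel_pts)).astype(int)   # boundary bins, eqn. (4)
    fbank = np.zeros((n_filt, n_fft // 2))
    for i in range(1, n_filt + 1):
        lo, mid, hi = k_b[i - 1], k_b[i], k_b[i + 1]
        for k in range(lo, mid):
            fbank[i - 1, k] = (k - lo) / max(mid - lo, 1)           # rising edge of eqn. (3)
        for k in range(mid, min(hi, n_fft // 2)):
            fbank[i - 1, k] = (hi - k) / max(hi - mid, 1)           # falling edge of eqn. (3)
    return fbank, k_b

def mfcc_frame(frame, fbank, n_ceps=25):
    """Eqns. (2), (6), (7): energy spectrum -> filter energies -> DCT."""
    n_half = fbank.shape[1]
    spec = np.abs(np.fft.fft(frame, n=2 * n_half)) ** 2             # |Y(k)|^2 over Ms points
    e = fbank @ spec[:n_half] + 1e-12                               # e(i), eqn. (6)
    i = np.arange(1, fbank.shape[0] + 1)
    m = np.arange(n_ceps)[:, None]
    basis = np.cos(np.pi * m * (2 * i - 1) / (2 * fbank.shape[0]))  # cosine kernel of eqn. (7)
    return np.sqrt(2.0 / fbank.shape[0]) * (basis @ np.log10(e))    # C_m, m = 0..R-1
```

The same two functions are reused in the later sketches, with only the filter-shaping step replaced.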
-
Mel-Frequency Cepstrum Coefficients using Gaussian filters
The transfer function of the TF is asymmetric and tapered, and the filter does not provide any weight outside the subband that it covers. As a result, the correlation between a subband and its nearby spectral components from adjacent subbands is lost. In this investigation the GF and the Tukey filter are proposed, which produce symmetric, gradually decaying weights at both ends, compensating for the possible loss of correlation. Referring to eqn. (3), the expression for the GF can be written as [18]

$$\psi_i^{G}(k) = e^{-\frac{(k - k_{b_i})^2}{2\sigma_i^2}} \qquad (8)$$

where $k_{b_i}$, the point between the i-th filter's boundaries located at its base, is considered here as the mean of the i-th GF, while $\sigma_i$ is the standard deviation, defined as

$$\sigma_i = \frac{k_{b_{i+1}} - k_{b_i}}{\alpha} \qquad (9)$$

where α is the variance-controlling parameter. In eqn. (8) the conventional denominator, i.e. $\sqrt{2\pi}\,\sigma_i$, is dropped, as its presence only ensures that the area under the Gaussian curve is unity [19]. Moreover, omitting the term helps the GF achieve unity as its highest value at its mean, similar to the unity-height triangular filter used for conventional MFCC. Note that a TF becomes non-isosceles when mapped from the Mel scale to the linear frequency scale: the distances from its centre $k_{b_i}$ to the two end-points of its base become unequal. For the i-th MFCC filter the relation becomes

$$k_{b_{i+1}} - k_{b_i} > k_{b_i} - k_{b_{i-1}} \qquad (10)$$

We took the maximum spread of these two distances, i.e. $k_{b_{i+1}} - k_{b_i}$, to evaluate $\sigma_i$, ensuring full coverage of the subband by the GF.

Figure 3: Response of various shaped filters (TF and GF for $\sigma_i = k_{b_{i+1}} - k_{b_i}$, $(k_{b_{i+1}} - k_{b_i})/2$ and $(k_{b_{i+1}} - k_{b_i})/3$)

Figure 3 shows the plots of the TF and the GF for different values of σ. The figure clearly depicts that a triangular window gives some tapering at both of its ends but offers no weight outside its coverage. Figure 4 shows the standard deviation for different values of α; $k_{b_i}$ is the centre point of each filter, after which the transfer function decays gradually. A Gaussian with higher variance shows larger correlation with nearby frequency components, so the selection of α is critical for setting the variances of the GFs. In the present study the value α = 2 is used, and eqn. (9) can then be written as

$$\sigma_i = \frac{k_{b_{i+1}} - k_{b_i}}{2} \qquad (11)$$

Figure 4: Standard deviation of the GF for different values of α

Table 1. Summary of sigma and its coverage
α | % within the curve | % outside the curve
2 | 95.4499736 | 4.5500264
3 | 99.7300204 | 0.2699796

Table 1 shows the different values of α and their coverage within and outside the curve. With α = 2, 95% of the subband is covered, since Probability$\left(|k - k_{b_i}| \le 2\sigma_i\right) \approx 0.95$. Therefore α = 2 provides better correlation with nearby subbands in comparison to α = 3. In this study we have chosen α = 2 to design the filters for the MFCC filter bank; thus a balance is achieved where significant coverage of a particular subband is ensured while allowing moderate correlation between that subband and neighbouring ones. The cepstral vector using GFs can be calculated from the filter response of eqn. (8) as follows:

$$e_G(i) = \sum_{k=1}^{M_s/2} \psi_i^{G}(k)\, \left|Y(k)\right|^2 \qquad (12)$$

and

$$C_m^{G} = \sqrt{\frac{2}{Q}}\, \sum_{i=1}^{Q} \log_{10}\big(e_G(i)\big)\, \cos\!\left( \frac{(2i-1)}{2}\cdot\frac{m\pi}{Q} \right) \qquad (13)$$

Here the last 20 coefficients from both models are used, and the values Q = 22 and R = 25 are taken.
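A sketch of how the Gaussian weights of eqns. (8), (9) and (11) could replace the triangular weights, reusing the boundary bins k_b produced by the earlier sketch (the helper name `gaussian_filterbank` is ours):

```python
import numpy as np

def gaussian_filterbank(k_b, n_bins, alpha=2.0):
    """Eqns. (8)-(11): Gaussian filters centred at k_bi with sigma_i = (k_b(i+1) - k_bi) / alpha."""
    n_filt = len(k_b) - 2
    k = np.arange(n_bins)
    fbank = np.zeros((n_filt, n_bins))
    for i in range(1, n_filt + 1):
        sigma = max(k_b[i + 1] - k_b[i], 1) / alpha                       # widest half of the TF base
        fbank[i - 1] = np.exp(-((k - k_b[i]) ** 2) / (2.0 * sigma ** 2))  # eqn. (8), peak value = 1
    return fbank
```

As in the text, the 1/(√(2π)σ_i) normalisation is deliberately omitted so that every filter peaks at unity, like the TF.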
-
Mel-Frequency Cepstrum Coefficients using Tukey Filter
The Tukey filter is a combination of the rectangular window and the Hann window [20]. It is in fact a cosine-tapered window and, for the i-th subband, is defined as follows:

$$\psi_i^{T}(k) = \begin{cases} \frac{1}{2}\left\{1 + \cos\!\left[\pi\!\left(\frac{2\,(k - k_{b_{i-1}})}{r\,(k_{b_{i+1}} - k_{b_{i-1}})} - 1\right)\right]\right\} & \text{for } k_{b_{i-1}} \le k < k_{b_{i-1}} + \frac{r}{2}(k_{b_{i+1}} - k_{b_{i-1}}) \\[4pt] 1 & \text{for } k_{b_{i-1}} + \frac{r}{2}(k_{b_{i+1}} - k_{b_{i-1}}) \le k \le k_{b_{i+1}} - \frac{r}{2}(k_{b_{i+1}} - k_{b_{i-1}}) \\[4pt] \frac{1}{2}\left\{1 + \cos\!\left[\pi\!\left(\frac{2\,(k - k_{b_{i+1}})}{r\,(k_{b_{i+1}} - k_{b_{i-1}})} + 1\right)\right]\right\} & \text{for } k_{b_{i+1}} - \frac{r}{2}(k_{b_{i+1}} - k_{b_{i-1}}) < k \le k_{b_{i+1}} \\[4pt] 0 & \text{otherwise} \end{cases} \qquad (14)$$
where k = 0, 1, …, N−1 is the discrete frequency index, $0 \le k_{b_{i-1}} < k_{b_i} < k_{b_{i+1}}$ are the basic filter frequencies, the filter has unit amplitude, and the length of the i-th filter is N samples. $N_{Hann}$, the number of samples in the tapered (Hann) sections, is defined as

$$N_{Hann} = \left\lceil r\,(N-1) \right\rceil + 1 \qquad (15)$$

where r is the ratio of the taper to the constant section and $0 \le r \le 1$. When r = 0 the filter corresponds to a rectangular filter; when r = 1 it corresponds to a Hann filter. $N_{Rect}$, the number of samples in the constant section of eqn. (14), is the complement of $N_{Hann}$ and is defined as

$$N_{Rect} = \left\lfloor (1 - r)\,(N-1) \right\rfloor + 1 \qquad (16)$$

Figure 5: Response of Tukey filters for different values of r

Figure 5 shows the plot of the Tukey filter for different values of r. The figure clearly depicts that a Tukey window gives some tapering at both of its ends. For the Tukey filter bank, the filter energies and cepstral coefficients are then computed exactly as in eqns. (6) and (7), with $\psi_i^{T}(k)$ in place of $\psi_i(k)$.
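A sketch of how the cosine-tapered weights of eqn. (14) could be generated over each subband (written directly from the piecewise form above rather than any library routine; the helper name `tukey_filterbank` and the default r = 0.5 are our own choices):

```python
import numpy as np

def tukey_filterbank(k_b, n_bins, r=0.5):
    """Eqn. (14): Tukey (cosine-tapered) filters over each subband [k_b(i-1), k_b(i+1)]."""
    n_filt = len(k_b) - 2
    fbank = np.zeros((n_filt, n_bins))
    for i in range(1, n_filt + 1):
        lo, hi = k_b[i - 1], k_b[i + 1]
        width = max(hi - lo, 1)
        taper = r * width / 2.0                             # length of each tapered section
        for k in range(lo, min(hi + 1, n_bins)):
            if k < lo + taper:                              # rising cosine taper
                fbank[i - 1, k] = 0.5 * (1 + np.cos(np.pi * (2 * (k - lo) / (r * width) - 1)))
            elif k > hi - taper:                            # falling cosine taper
                fbank[i - 1, k] = 0.5 * (1 + np.cos(np.pi * (2 * (k - hi) / (r * width) + 1)))
            else:                                           # flat (rectangular) section
                fbank[i - 1, k] = 1.0
    return fbank
```

The weights rise from 0 at the lower boundary to 1, stay flat over the constant section, and decay back to 0 at the upper boundary, so neighbouring subbands overlap smoothly.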
-
Inverted Mel Frequency Cepstral Coefficients Calculation
-
Inverted Mel-Frequency Cepstral Coefficients using triangular filters
The main objective is to capture the information that has been missed by the original MFCC [21]. In this study the new filter bank structure is obtained simply by flipping the original filter bank around the point f = 2 kHz, which is precisely the mid-point of the frequency range considered for speaker recognition applications. This flip-over is expressed mathematically as

$$\hat{\psi}_i(k) = \psi_{Q+1-i}\!\left(\frac{M_s}{2} + 1 - k\right) \qquad (17)$$

where $\hat{\psi}_i(k)$ is the inverted Mel scale filter response, $\psi_i(k)$ is the response of the original MFCC filter bank, $1 \le i \le Q$, and Q is the number of filters in the bank. From eqn. (17) we can derive an expression for $\hat{\psi}_i(k)$ analogous to eqn. (3) for the original MFCC filter bank:

$$\hat{\psi}_i(k) = \begin{cases} 0 & \text{for } k < \hat{k}_{b_{i-1}} \\[2pt] \dfrac{k - \hat{k}_{b_{i-1}}}{\hat{k}_{b_i} - \hat{k}_{b_{i-1}}} & \text{for } \hat{k}_{b_{i-1}} \le k \le \hat{k}_{b_i} \\[2pt] \dfrac{\hat{k}_{b_{i+1}} - k}{\hat{k}_{b_{i+1}} - \hat{k}_{b_i}} & \text{for } \hat{k}_{b_i} \le k \le \hat{k}_{b_{i+1}} \\[2pt] 0 & \text{for } k > \hat{k}_{b_{i+1}} \end{cases} \qquad (18)$$

where $1 \le k \le M_s$ and $\{\hat{k}_{b_i}\}_{i=0}^{Q+1}$ are the boundary points of the flipped filters. The inverted Mel scale is defined as

$$\hat{f}_{mel}(f) = 2195.2860 - 2595\log_{10}\!\left(1 + \frac{4031.25 - f}{700}\right) \qquad (19)$$

where $\hat{f}_{mel}$ is the subjective pitch in the new scale corresponding to f, the actual frequency in Hz. The filter outputs $\{\hat{e}(i)\}_{i=1}^{Q}$ are computed in the same way as for MFCC, from the same energy spectrum $|Y(k)|^2$, as

$$\hat{e}(i) = \sum_{k=1}^{M_s/2} \hat{\psi}_i(k)\, \left|Y(k)\right|^2 \qquad (20)$$

A DCT is taken on the log filter bank energies $\{\log_{10}(\hat{e}(i))\}_{i=1}^{Q}$ and the final inverted MFCC coefficients $\{\hat{C}_m\}_{m=1}^{R}$ can be written as

$$\hat{C}_m = \sqrt{\frac{2}{Q}}\, \sum_{i=1}^{Q} \log_{10}\big(\hat{e}(i)\big)\, \cos\!\left( \frac{(2i-1)}{2}\cdot\frac{m\pi}{Q} \right) \qquad (21)$$

where $0 \le m \le R-1$.
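Eqn. (17) amounts to flipping the original filter bank in both the filter index and the frequency index. A one-line NumPy sketch of that operation (assuming the filter bank is stored as a Q × Ms/2 matrix, as in the earlier sketches):

```python
import numpy as np

def invert_filterbank(fbank):
    """Eqn. (17): psi_hat_i(k) = psi_{Q+1-i}(Ms/2 + 1 - k) -- flip filters and frequency bins."""
    return fbank[::-1, ::-1]

# Usage sketch: the inverted filter energies then follow eqns. (20)-(21) exactly as for MFCC,
# e.g. e_hat = invert_filterbank(fbank) @ spec[: fbank.shape[1]].
```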
-
Inverted Mel-Frequency Cepstral Coefficients using Gaussian filters
It is expected that introducing correlation between subband outputs in the inverted Mel-scaled filter bank makes it more complementary than what was realized using the TF. Flipping the original triangular filter bank around 2 kHz also inverts the relation mentioned in eqn. (10), which gives

$$\hat{k}_{b_i} - \hat{k}_{b_{i-1}} > \hat{k}_{b_{i+1}} - \hat{k}_{b_i} \qquad (22)$$

Here $\hat{k}_{b_i}$ is the mean of the i-th GF and the standard deviation can be calculated as

$$\hat{\sigma}_i = \frac{\hat{k}_{b_i} - \hat{k}_{b_{i-1}}}{\alpha} \qquad (23)$$

Here, too, the value α = 2 is chosen. The response of the GF for the inverted MFCC filter bank and the corresponding cepstral parameters can be calculated as follows:

$$\hat{\psi}_i^{G}(k) = e^{-\frac{(k - \hat{k}_{b_i})^2}{2\hat{\sigma}_i^2}} \qquad (24)$$

$$\hat{e}_G(i) = \sum_{k=1}^{M_s/2} \hat{\psi}_i^{G}(k)\, \left|Y(k)\right|^2 \qquad (25)$$

and

$$\hat{C}_m^{G} = \sqrt{\frac{2}{Q}}\, \sum_{i=1}^{Q} \log_{10}\big(\hat{e}_G(i)\big)\, \cos\!\left( \frac{(2i-1)}{2}\cdot\frac{m\pi}{Q} \right) \qquad (26)$$

-
Inverted Mel-Frequency Cepstral Coefficients using Tukey Filter
Here, too, the objective is to capture the information missed by the original MFCC. The new filter bank structure is obtained by flipping the original Tukey filter bank around the point f = 2 kHz, the mid-point of the frequency range considered for speaker recognition applications. This flip-over is expressed mathematically, analogously to eqn. (14), as

$$\hat{\psi}_i^{T}(k) = \begin{cases} \frac{1}{2}\left\{1 + \cos\!\left[\pi\!\left(\frac{2\,(k - \hat{k}_{b_{i-1}})}{r\,(\hat{k}_{b_{i+1}} - \hat{k}_{b_{i-1}})} - 1\right)\right]\right\} & \text{for } \hat{k}_{b_{i-1}} \le k < \hat{k}_{b_{i-1}} + \frac{r}{2}(\hat{k}_{b_{i+1}} - \hat{k}_{b_{i-1}}) \\[4pt] 1 & \text{for } \hat{k}_{b_{i-1}} + \frac{r}{2}(\hat{k}_{b_{i+1}} - \hat{k}_{b_{i-1}}) \le k \le \hat{k}_{b_{i+1}} - \frac{r}{2}(\hat{k}_{b_{i+1}} - \hat{k}_{b_{i-1}}) \\[4pt] \frac{1}{2}\left\{1 + \cos\!\left[\pi\!\left(\frac{2\,(k - \hat{k}_{b_{i+1}})}{r\,(\hat{k}_{b_{i+1}} - \hat{k}_{b_{i-1}})} + 1\right)\right]\right\} & \text{for } \hat{k}_{b_{i+1}} - \frac{r}{2}(\hat{k}_{b_{i+1}} - \hat{k}_{b_{i-1}}) < k \le \hat{k}_{b_{i+1}} \\[4pt] 0 & \text{otherwise} \end{cases} \qquad (27)$$

where $\{\hat{k}_{b_i}\}$ are the inverted boundary points; the corresponding filter energies and cepstral coefficients then follow eqns. (20) and (21).

-
Synthesis of MFCC and IMFCC
The idea of combining classifiers to enhance the decision-making process has been successful in many pattern classification problems, including speaker identification. According to the available literature, a combination of two or more classifiers performs better if the classifiers are supplied with information that is complementary in nature. Adopting this idea in our work, we supplied the MFCC and IMFCC feature vectors, which are complementary in information content, to two classifiers respectively and finally fused their decisions in order to obtain improved identification accuracy. The same principle has been adopted for the GF and Tukey based MFCC and IMFCC as well. In this context, it should be noted that our computation of complementary information from IMFCC involves a comparably lower computational complexity than higher-level features.

The MFCC and IMFCC feature vectors, containing complementary information about the speakers, were supplied to the given classifiers independently, and the classification results for the MFCC features and the IMFCC features were fused in order to obtain the optimum decision in the speaker recognition process. A uniformly weighted sum rule was adopted to fuse the scores from the two classifiers. If $X_m^{MFCC}$ denotes the classification score based on the MFCC and $X_m^{IMFCC}$ denotes the classification score based on the IMFCC, then the combined score for the m-th speaker was given as

$$X_m^{COM} = w\,X_m^{MFCC} + (1 - w)\,X_m^{IMFCC} \qquad (28)$$

The constant value w = 0.5 was used in all cases. The speaker was determined as

$$S = \arg\max_{m}\, X_m^{COM} \qquad (29)$$

-
Theoretical Background of VQ
In the VQ-based approach the speaker models are formed by clustering the speaker's feature vectors into K non-overlapping clusters. Each cluster is represented by a code vector $C_i$, which is its centroid [22]. The resulting set of code vectors $\{C_1, C_2, C_3, \dots, C_K\}$ is called a codebook, and it serves as the model of the speaker. The model size (number of code vectors) is significantly smaller than the training set, while the distribution of the code vectors follows the same underlying distribution as the training vectors. Thus, the codebook effectively reduces the amount of data while preserving the essential information of the original distribution. K-means is an iterative approach; in each successive iteration it redistributes the vectors in order to minimize the distortion. The procedure is outlined below (a minimal sketch follows the list):
- Initialize random centroids as the means of the M clusters.
- Associate each data point with the nearest centroid.
- Move the centroids to the centre of their respective clusters.
- Repeat steps b and c until a suitable level of convergence has been reached, i.e. the distortion is minimized.
- When the distortion is minimized, redistribution does not result in any movement of vectors among the clusters; this can be used as an indicator to terminate the algorithm. Upon convergence, the total distortion does not change as a result of redistribution.
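A minimal sketch of this codebook-training loop, under the assumption that the feature vectors are rows of a NumPy array (the helper name `train_codebook` and the random initialisation are ours):

```python
import numpy as np

def train_codebook(features, n_codes=16, n_iter=50, seed=0):
    """K-means style VQ codebook: features is an (n_vectors, dim) array of cepstral vectors."""
    features = np.asarray(features, dtype=float)
    rng = np.random.default_rng(seed)
    codebook = features[rng.choice(len(features), n_codes, replace=False)].copy()  # random centroids
    for _ in range(n_iter):
        d = np.linalg.norm(features[:, None, :] - codebook[None, :, :], axis=2)
        nearest = d.argmin(axis=1)                       # assign each vector to its nearest code
        for j in range(n_codes):
            if np.any(nearest == j):
                codebook[j] = features[nearest == j].mean(axis=0)  # move centroid to cluster mean
    return codebook
```

In practice the loop would also stop early once the assignments (and hence the total distortion) stop changing, as described in the last step above.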
-
Linde, Buzo and Gray Clustering Technique
The acoustic vectors extracted from the input speech of a speaker provide a set of training vectors. As described above, the next important step is to build a speaker-specific VQ codebook for this speaker using those training vectors. There is a well-known algorithm, namely the LBG algorithm, for clustering a set of L training vectors into a set of M codebook vectors. The LBG VQ design algorithm is an iterative algorithm which alternately solves the two optimality criteria. The algorithm requires an initial codebook C0, which is obtained by the splitting method. In this method, an initial code vector is set as the average of the entire training sequence. This code vector is then split into two, and the iterative algorithm is run with these two vectors as the initial codebook. The final two code vectors are split into four, and the process is repeated until the desired number of code vectors is obtained.
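A hedged sketch of the splitting procedure just described (the perturbation factor eps = 0.01 and the helper name `lbg_codebook` are our own choices; the desired codebook size is assumed to be a power of two):

```python
import numpy as np

def lbg_codebook(features, n_codes=16, eps=0.01, n_iter=20):
    """LBG: start from the global mean, repeatedly split each code vector and refine."""
    features = np.asarray(features, dtype=float)
    codebook = features.mean(axis=0, keepdims=True)          # initial code vector C0
    while len(codebook) < n_codes:                            # n_codes should be a power of two
        codebook = np.vstack([codebook * (1 + eps), codebook * (1 - eps)])  # split every code
        for _ in range(n_iter):                               # k-means style refinement
            d = np.linalg.norm(features[:, None, :] - codebook[None, :, :], axis=2)
            nearest = d.argmin(axis=1)
            for j in range(len(codebook)):
                if np.any(nearest == j):
                    codebook[j] = features[nearest == j].mean(axis=0)
    return codebook
```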
-
K-means Clustering Technique
The standard k-means algorithm is a typical clustering algorithm used in data mining and is widely used for clustering large data sets. MacQueen first proposed the k-means algorithm in 1967; it is one of the simplest unsupervised learning algorithms applied to the well-known clustering problem. It is a partitioning clustering algorithm that classifies the given data objects into k different clusters iteratively, converging to a local minimum, so that the resulting clusters are compact and independent. The algorithm consists of two separate phases. The first phase selects k centres randomly, where the value k is fixed in advance. The next phase assigns each data object to the nearest centre. The Euclidean distance is generally used to determine the distance between each data object and the cluster centres. When all data objects have been assigned to a cluster, the first step is complete and an early grouping is done. This process is repeated until the criterion function reaches its minimum. Supposing that the target object is x and $\bar{x}_i$ indicates the mean of cluster $c_i$, the criterion function is defined in eqn. (30):

$$E = \sum_{i=1}^{k} \sum_{x \in c_i} \left\| x - \bar{x}_i \right\|^2 \qquad (30)$$

E is the sum of the squared error over all objects in the database. The distance used in the criterion function is the Euclidean distance, which determines the nearest distance between each data object and the cluster centres. For one vector $x = (x_1, x_2, \dots, x_n)$ and another vector $y = (y_1, y_2, \dots, y_n)$, the Euclidean distance $d(x, y)$ can be obtained as shown in eqn. (31):

$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} \qquad (31)$$

-
Fuzzy C-means Clustering
In speech-based pattern recognition, VQ is a widely used feature modeling and classification algorithm, since it is a simple and computationally very efficient technique. FVQ reduces the disadvantages of classical vector quantization. Unlike the Linde-Buzo-Gray (LBG) and k-means algorithms, the FVQ technique follows the principle that a feature vector located between clusters should not be assigned to only one cluster; therefore, in FVQ each feature vector has an association with all clusters [23]. The discrete nature of hard partitioning also causes analytical and algorithmic intractability of algorithms based on such membership values, since the function values are not differentiable. Fuzzy c-means is a clustering technique that permits one piece of data to belong to more than one cluster at the same time. It aims at minimizing the objective function defined by eqn. (32):

$$J_m(U, V) = \sum_{i=1}^{C} \sum_{j=1}^{N} u_{ij}^{\,m} \left\| x_j - c_i \right\|^2, \qquad 1 < m < \infty \qquad (32)$$

where C is the number of clusters, N is the number of data elements, $x_j$ is a column vector of X, $c_i$ is the centroid of the i-th cluster, and $u_{ij}$ is an element of U denoting the membership of data element j in the i-th cluster, subject to the constraints $u_{ij} \in [0, 1]$ and $\sum_{i=1}^{C} u_{ij} = 1$ for all j. m is a free parameter which plays a central role in adjusting the blending degree of the different clusters; in the limit m → 1, J reduces to a sum-of-squared-error criterion and $u_{ij}$ becomes a Boolean membership value (either 0 or 1). The norm $\|\cdot\|$ can be any norm expressing similarity [24]. Fuzzy partitioning is carried out through an iterative optimization of the objective function, updating the membership $u_{ij}$ and the cluster centre $c_i$ using eqns. (33) and (34):

$$u_{ij} = \frac{1}{\displaystyle\sum_{k=1}^{C} \left( \frac{\|x_j - c_i\|}{\|x_j - c_k\|} \right)^{\frac{2}{m-1}}} \qquad (33)$$

$$c_i = \frac{\displaystyle\sum_{j=1}^{N} u_{ij}^{\,m}\, x_j}{\displaystyle\sum_{j=1}^{N} u_{ij}^{\,m}} \qquad (34)$$

The iteration stops when $\max_{ij} \left| u_{ij}^{(K+1)} - u_{ij}^{(K)} \right| < \varepsilon$, where ε is the termination criterion.
The algorithm for Fuzzy c-means clustering includes the following steps (a minimal sketch follows the list):
- Initialize C, N, m and U.
- Repeat:
- Minimize J by computing the memberships $u_{ij}$ using eqn. (33).
- Normalize $u_{ij}$ so that $\sum_{i=1}^{C} u_{ij} = 1$.
- Compute the centroids $c_i$ using eqn. (34).
- Until there is only a slight change in U and V.
- End
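A minimal NumPy sketch of the update loop in eqns. (32)-(34) and the steps above (`fuzzy_cmeans`, its defaults and the random initialisation are ours, not the authors' implementation):

```python
import numpy as np

def fuzzy_cmeans(x, n_clusters, m=2.0, eps=1e-4, max_iter=100, seed=0):
    """Fuzzy c-means: x is (N, dim); returns centroids V (C, dim) and memberships U (C, N)."""
    x = np.asarray(x, dtype=float)
    rng = np.random.default_rng(seed)
    u = rng.random((n_clusters, len(x)))
    u /= u.sum(axis=0, keepdims=True)                      # enforce sum_i u_ij = 1
    for _ in range(max_iter):
        um = u ** m
        v = (um @ x) / um.sum(axis=1, keepdims=True)       # centroid update, eqn. (34)
        d = np.linalg.norm(x[None, :, :] - v[:, None, :], axis=2) + 1e-12
        u_new = 1.0 / np.sum((d[:, None, :] / d[None, :, :]) ** (2.0 / (m - 1.0)), axis=1)
        if np.max(np.abs(u_new - u)) < eps:                # stopping rule: max|u^(K+1) - u^(K)| < eps
            u = u_new
            break
        u = u_new
    return v, u
```

The membership update of eqn. (33) already yields columns that sum to one, so the normalisation step is satisfied implicitly.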
-
A schematic description of this scheme for the parallel combination of classifiers is given in figure 6.

Fig. 6 Parallel classifier based SI system (in training and testing, the pre-processed speech is converted into MFCC and IMFCC feature streams, each matched against its own FVQ speaker models; the two matching scores are summed with weights W and 1−W)
-
Experimental setup
-
Pre-Processing Stage
In this work, each frame of speech is pre-processed as follows:
- Silence removal and end-point detection using an energy threshold criterion.
- Pre-emphasis with a 0.97 pre-emphasis factor.
- Frame blocking with a 20 ms frame length, i.e. $N_s = 160$ samples/frame with 50% overlap, and finally Hamming windowing.
The MFCC and IMFCC feature sets using the triangular, Gaussian and Tukey filters are then calculated.
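A minimal sketch of this pre-processing chain (NumPy only; the silence-removal threshold, set relative to the maximum frame energy, is our own illustrative choice since the paper only names an energy criterion):

```python
import numpy as np

def preprocess(signal, frame_len=160, overlap=0.5, pre_emph=0.97, energy_thresh=0.01):
    """Pre-emphasis, 20 ms / 50% overlap framing at 8 kHz, energy-based silence removal,
    and Hamming windowing of each remaining frame."""
    signal = np.asarray(signal, dtype=float)
    emphasized = np.append(signal[0], signal[1:] - pre_emph * signal[:-1])  # y(n) - 0.97*y(n-1)
    hop = int(frame_len * (1.0 - overlap))
    frames = np.array([emphasized[s:s + frame_len]
                       for s in range(0, len(emphasized) - frame_len + 1, hop)])
    energy = (frames ** 2).sum(axis=1)
    frames = frames[energy > energy_thresh * energy.max()]                  # drop silent frames
    return frames * np.hamming(frame_len)                                   # window each frame
```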
-
POLYCOST Database
The database was collected through the European telephone network. The recordings were made with ISDN cards on two XTL SUN platforms at an 8 kHz sampling rate. In this work, a closed set text independent speaker identification problem is addressed where only the mother tongue (MOT) files are used. The specified guideline [25] for conducting closed set speaker identification experiments is adhered to, i.e. MOT02 files from the first four sessions are used to build a speaker model while MOT01 files from session five onwards are taken for testing. In the POLYCOST corpus, the English prompts are fully annotated in terms of word boundaries, while the mother tongue prompts are only labelled at the word level with no segmentation. In both cases, the SpeechDat recommendations were used while performing the annotation.
-
Self Collected Voice Database
The voice corpus was collected in an uncontrolled environment using Microsoft Sound Recorder and a good quality headphone; the speakers belong to different parts of India. The average duration of the training samples was 6 seconds per speaker, and out of twenty utterances one was used for training. For matching purposes the remaining 19 utterances of length 6 seconds were used, each further divided into subsequences of lengths 6 s (100%), 3 s (50%), 2 s (33%), 1 s (16%) and 0.5 s (8%). Therefore, for 70 speakers we put 70 × 19 × 5 = 6650 utterances under test and evaluated the identification efficiency.
-
TIMIT Database
The TIMIT speech corpus consists of 630 speakers (438 male and 192 female). For each speaker only one recording session was used. The speech data was recorded in a sound booth and contains fixed text sentences read by speakers and recorded over a fixed wideband channel. TIMIT contains a total of 6300 sentences, 10 sentences spoken by each of 630 speakers from 8 major dialect regions of the United States. The speakers used American English. The main limitation of the TIMIT corpus is that the speech is recorded only during one session for each speaker, therefore the data does not reflect time related variations in speech characteristics. Moreover, the clean wideband speech environment in TIMIT has an ideal character and does not simulate the real world condition appearing in typical speaker recognition applications.
-
Score Calculation
For any closed-set speaker identification problem, speaker identification accuracy is defined as follows, and we have used the same definition:
Percentage of Identification Accuracy (PIA) = (number of utterances correctly identified / total number of utterances under test) × 100.
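A small sketch of how the fused decision of eqns. (28)-(29) and this accuracy measure could be computed (we assume one score per enrolled speaker per test utterance, with higher scores meaning a better match):

```python
import numpy as np

def identify(mfcc_scores, imfcc_scores, w=0.5):
    """Eqns. (28)-(29): fuse per-speaker scores with weight w and pick the best speaker index."""
    combined = w * np.asarray(mfcc_scores, float) + (1.0 - w) * np.asarray(imfcc_scores, float)
    return int(np.argmax(combined))

def percentage_identification_accuracy(predicted_ids, true_ids):
    """PIA = correctly identified utterances / total utterances, as a percentage."""
    predicted_ids, true_ids = np.asarray(predicted_ids), np.asarray(true_ids)
    return 100.0 * np.mean(predicted_ids == true_ids)
```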
-
Experimental Results
For each database, we evaluated the performance of an MFCC based classifier and an IMFCC based classifier, where each feature set was implemented using the TF, the GF and the Tukey filter.
-
Results for POLYCOST Database
Table II describes the identification results for various model orders of fuzzy c-means clustering VQ with the TF based MFCC and IMFCC feature sets. The last column in the table depicts the identification accuracies for the combined scheme. The combined scheme shows significant improvements over the MFCC based SI system for different model orders. Further, even the independent performance of the IMFCC based classifier is comparable to that of the MFCC based classifier. Table III presents the PIA of the individual MFCC, IMFCC and fused schemes when GFs are used. It is evident from the table that the individual performance of each feature set improves when compared against the conventional TF based MFCC and IMFCC. The fused scheme also outperforms the GF based single-stream MFCC as well as the earlier combined scheme using TFs, which in turn shows the enhancement of complementary information obtained by applying the GF to realize the filter bank. Table IV presents the PIA of the individual MFCC, IMFCC and fused schemes when Tukey filters are used. It is evident from the table that the individual performance of each feature set improves when compared against the conventional TF based MFCC and IMFCC. The fused scheme also outperforms the Tukey based single-stream MFCC as well as the earlier combined scheme using TFs, which in turn shows the enhancement of complementary information obtained by applying the Tukey filter to realize the filter bank.
Table II: Results (PIA) for POLYCOST database using TF based MFCC & IMFCC
No. of Utterances | MFCC | IMFCC | Combined Systems
1300 | 77.4515 | 76.2599 | 83.0345
650 | 79.2349 | 78.0557 | 84.1631

Table III: Results (PIA) for POLYCOST database using GF based MFCC & IMFCC
No. of Utterances | MFCC | IMFCC | Combined Systems
1300 | 78.8472 | 77.6599 | 84.0955
650 | 80.9019 | 79.5862 | 85.7586

Table IV: Results (PIA) for POLYCOST database using Tukey filter based MFCC & IMFCC
No. of Utterances | MFCC | IMFCC | Combined Systems
1300 | 78.5472 | 77.4599 | 83.3955
650 | 79.8019 | 79.3862 | 84.7586
Results show that the complementary information supplied helps to improve the performance of MFCC in the parallel classifier to a great extent for both types of filters. Thus it can be said that, compared to a single MFCC based classifier, a speaker can be modeled with the same accuracy but at a comparatively lower model order by an MFCC-IMFCC parallel classifier. It can further be concluded that GF based IMFCC provides better complementary information than TF and Tukey based IMFCC. Figure 7 shows the graphical presentation of the percentage of identification accuracy for the POLYCOST database.
Figure 7: Graphical presentation of PIA for the POLYCOST database (MFCC, IMFCC and combined systems using TF, GF and Tukey filters, for 1300 and 650 utterances)
-
Results for Self Collected Voice Database
Tables V, VI and VII show the identification accuracies for the self collected voice database for the TF, GF and Tukey based filters respectively. The PIA obtained using the GF based filter bank improves for the individual feature sets and the combined scheme over various model orders. As the results show, it can be observed from these tables that the combined scheme gives a significant improvement over the baseline MFCC based system irrespective of the filter type.
Table V: Results (PIA) for Self Collected Voice database using TF based MFCC & IMFCC
No. of Utterances | MFCC | IMFCC | Combined Systems
6650 | 81.9515 | 80.3259 | 85.2345
3325 | 83.8515 | 82.8557 | 86.6631

Table VI: Results (PIA) for Self Collected Voice database using GF based MFCC & IMFCC
No. of Utterances | MFCC | IMFCC | Combined Systems
6650 | 82.9515 | 81.8734 | 88.2445
3325 | 84.8515 | 83.8557 | 89.6731

Table VII: Results (PIA) for Self Collected Voice database using Tukey based MFCC & IMFCC
No. of Utterances | MFCC | IMFCC | Combined Systems
6650 | 82.5324 | 81.9934 | 87.5443
3325 | 84.3747 | 83.3145 | 88.4534
Figure 8 shows the graphical presentation of the percentage of identification accuracy for the self collected voice database. The GF performs the best among all the above mentioned filters.
-
Results for TIMIT
Tables VIII, IX and X show the identification accuracies for the TIMIT database for the TF, GF and Tukey based filters respectively.
Table VIII: Results (PIA) for TIMIT database using TF based MFCC & IMFCC
No. of Utterances | MFCC | IMFCC | Combined Systems
6300 | 80.9545 | 79.2389 | 83.6234
3150 | 80.8978 | 79.7695 | 83.6598

Table IX: Results (PIA) for TIMIT database using GF based MFCC & IMFCC
No. of Utterances | MFCC | IMFCC | Combined Systems
6300 | 81.8976 | 79.9876 | 85.2386
3150 | 82.8734 | 81.5623 | 85.2457
Figure 8: Graphical presentation of PIA for the Self Collected Voice Database (MFCC, IMFCC and combined systems using TF, GF and Tukey filters, for 6650 and 3325 utterances)
Table X: Results (PIA) for TIMIT database using Tukey based MFCC & IMFCC
No. of Utterances | MFCC | IMFCC | Combined Systems
6300 | 80.9356 | 79.3563 | 84.9823
3150 | 81.9576 | 80.886 | 84.8967
It could be further concluded that GF based IMFCC provides better complementary information than TF and Tukey based IMFCC.
Figure 9: Graphical presentation of PIA for the TIMIT database (MFCC, IMFCC and combined systems using TF, GF and Tukey filters, for 6300 and 3150 utterances)
Figure 9 shows the graphical presentation of the percentage of identification accuracy for the TIMIT database. As mentioned above, the GF performs the best among all the above mentioned filters.
-
Conclusion
A Gaussian filter based Mel and inverted Mel scaled filter bank is proposed in this paper, after it yielded promising accuracy in comparison with the TF and the Tukey filter. A uniform variance is used to design the filter banks, which maintains a good balance between a filter's coverage area and the amount of correlation. In both scales, cepstral vectors are obtained and modeled separately by the fuzzy c-means clustering VQ method. Performance is found to be superior when the individual performance of each newly proposed feature set is compared with its corresponding baseline. Results are shown for the individual cases as well as for the combined feature set on three speech databases, each of which contains a substantial number of speakers. The GF and the Tukey filter show better identification accuracy compared to the TF.
-
References
-
D. Gatica-Perez, G. Lathoud, J.-M. Odobez and I. McCowan, Audiovisual probabilistic tracking of multiple speakers in meetings, IEEE Transactions on Speech and Audio Processing, 15(2), 2007, pp. 601-616.
-
J. P. Campbell, Jr., Speaker Recognition: A Tutorial, Proceedings of the IEEE, 85(9), 1997, pp. 1437-1462.
-
Faundez-Zanuy M. and Monte-Moreno E., State-of-the-art in speaker recognition, IEEE Aerospace and Electronic Systems Magazine, 20(5), 2005, pp. 7-12.
-
K. Saeed and M. K. Nammous, Heuristic method of Arabic speech recognition, in Proc. IEEE 7th Int. Conf. DSPA, Moscow, Russia, 2005, pp. 528-530.
-
D. Olguin, P.A.Goor, and A. Pentland, Capturing individual and group behavior with wearable sensors, in Proceedings of AAAI Spring Symposium on Human Behavior Modeling 2009.
-
S. B. Davis and P. Mermelstein, Comparison of Parametric Representation for Monosyllabic Word Recognition in Continuously Spoken Sentences IEEE Trans. On ASSP, 28(4), 1980, pp. 357-365.
-
R. Vergin, D. O'Shaughnessy and A. Farhat, Generalized Mel frequency cepstral coefficients for large-vocabulary speaker-independent continuous-speech recognition, IEEE Trans. on ASSP, 7(5), 1999, pp. 525-532.
-
Chakroborty, S., Roy, A. and Saha, G, Improved Closed set Text- Independent Speaker Identification by Combining MFCC with Evidence from Flipped Filter Bank International Journal of Signal Processing, 4(2), 2007, pp. 114-122.
-
S. Singh and E. G. Rajan, A Vector Quantization Approach Using MFCC for Speaker Recognition, International Conference on Systemics, Cybernetics and Informatics (ICSCI), Pentagram Research Centre, Hyderabad, 2007, pp. 786-790.
-
K. Sri Rama Murty and B. Yegnanarayana, Combining evidence from residual phase and MFCC features for speaker recognition IEEE Signal Processing Letters, 13(1),2006, pp. 52-55.
-
Yegnanarayana B., Prasanna S.R.M., Zachariah J.M. and Gupta C. S, Combining evidence from source suprasegmental and spectral features for a fixed-text speaker verification system, IEEE Trans. Speech and Audio Processing, 13(4), 2005, pp. 575-582.
-
J. Kittler, M. Hatef, R. Duin, J. Mataz, On combining classifiers IEEE Trans, Pattern Anal. Mach. Intell, 20(3), 1998, pp. 226-239.
-
He, J., Liu, L., Palm, G, A Discriminative Training Algorithm for VQ-based Speaker Identification , IEEE Transactions on Speech and Audio Processing, 7(3), 1999, pp. 353-356.
-
Laurent Besacier and Jean-Francois Bonastre, Subband architecture for automatic speaker recognition, Signal Processing, 80, 2000, pp. 1245-1259.
-
Zheng F., Zhang, G. and Song, Z, Comparison of different implementations of MFCC , J. Computer Science & Technology 16(6), 2001, pp. 582-589.
-
Ganchev, T., Fakotakis, N., and Kokkinakis, G. Comparative Evaluation of Various MFCC Implementations on the Speaker Verification Task Proc. of SPECOM Patras, Greece, 2005, pp. 1191-194.
-
Zhen B., Wu X., Liu Z., Chi H, On the use of band pass filtering in speaker recognition, Proc. 6th Int. Conf. of Spoken Lang. Processing (ICSLP), Beijing,
China, 2000
-
S. Singh, E. G. Rajan, P. Sivakumar, M. Bhoopathy and V. Subha, Text Dependent Speaker Recognition System in Presence Monitoring, International Conference on Systemics, Cybernetics and Informatics (ICSCI), Pentagram Research Centre, Hyderabad, 2008, pp. 550-554.
-
A. Papoulis and S. U. Pillai, Probability, Random variables and Stochastic Processes, Tata McGraw-Hill Edition, Fourth Edition, Chap. 4, 2002, pp. 72-122.
-
Oppenheim, A.V., Schafer, R.W., Buck, J.R, Discrete-Time Signal Processing, 2nd ed., Upper Saddle River,NJ, Prentice Hall, 1999
-
Yegnanarayana B., Prasanna S.R.M., Zachariah J.M. and Gupta C. S, Combining evidence from source, suprasegmental and spectral features for a fixed-text speaker verification system, IEEE Trans. Speech and Audio Processing, Vol. 13, No. 4, 2005, pp. 575-582.
-
S.R. Mahadeva Prasanna, Cheedella S. Gupta, B. Yegnanarayana, Extraction of speaker-specific excitation information from linear prediction residual of speech, Speech Communication, 48(10), 2006, pp. 1243- 1261.
-
H. S. Jayanna and S. R. M. Prasanna, Fuzzy vector quantization for speaker recognition under limited data conditions, TENCON - IEEE Region 10 Conference, 2008, pp. 1-4.
-
Haipeng Wang, Xiang Zhang, Hongbin Suo, Qingwei Zhao and Y. Yan, "A novel fuzzy-based automatic speaker clustering algorithm," ISNN, 2009, pp. 639-646.
-
H. Melin and J. Lindberg, Guidelines for experiments on the polycost database, In Proceedings of a COST 250 workshop on Application of Speaker Recognition Techniques in Telephony, 1996, pp. 59- 69, Vigo, Spain.