Application Of Different Filters In Mel Frequency Cepstral Coefficients Feature Extraction And Fuzzy Vector Quantization Approach In Speaker Recognition
*Satyanand Singh
Associate Professor
**Dr. E.G. Rajan
Director
Abstract
The front-end, or feature extractor, is the first component in an automatic speaker recognition system. Feature extraction transforms the raw speech signal into a compact but effective representation that is more stable and discriminative than the original signal. Since the front-end is the first component in the chain, the quality of the later components is strongly determined by its quality. Over the years, Mel-Frequency Cepstral Coefficients (MFCC), modeled on the human auditory system, have been used as a standard acoustic feature set for speech related applications. In this paper it is shown that the inverted Mel-Frequency Cepstral Coefficients (IMFCC) form a performance enhancing parameterization for speaker recognition, since they carry complementary information from the high frequency region; the paper also introduces the Gaussian shaped filter (GF) and the Tukey filter for calculating MFCC and inverted MFCC in place of the traditional triangular shaped bins. The main idea is to introduce a higher amount of correlation between subband outputs. The performance of both MFCC and inverted MFCC improves with the GF and the Tukey filter over the traditional triangular filter (TF) based implementation, individually as well as in combination. In this study, Fuzzy Vector Quantization (FVQ) is used for speaker modeling. Fuzzy clustering methods allow objects to belong to several clusters simultaneously, with different degrees of membership. The performance of the proposed GF and Tukey filter based MFCC and IMFCC, in individual and merged mode, has been verified on two standard databases, POLYCOST (telephone speech) and TIMIT, each of which has more than 130 speakers, as well as on self-collected voice data from 90 speakers.
-
Introduction
A speaker recognition system mainly consists of two modules: a speaker specific feature extractor as the front end, followed by a speaker modelling technique for a generalized representation of the extracted features [1, 2]. MFCC has long been considered a reliable front end for speaker recognition applications because its coefficients represent audio based on perception [3, 4]. In MFCC the frequency bands are positioned logarithmically, which approximates the human auditory system's response more closely than the linearly spaced frequency bands of the FFT or DCT. An illustrative speaker recognition system is shown in figure 1. State of the art speaker recognition research primarily investigates speaker specific complementary information relative to MFCC. It has been observed that the performance of speaker recognition improves significantly when complementary information is merged with MFCC, either at the feature level by simple concatenation or by combining model scores. The main sources of complementary information are pitch [5], residual phase [6], prosody [7], dialectal features [8], etc. These features are related to vocal cord vibration, and it is very difficult to extract speaker specific information from them. It has been shown that complementary information can be captured easily from the high frequency part of the energy spectrum of a speech frame via a reversed filter bank methodology [9]. In this paper, an inverted MFCC is proposed to capture speaker features that tend to be present in the high frequency part of the spectrum and are generally ignored by MFCC. The complementary information captured by the inverted MFCC is modelled by the Fuzzy Vector Quantization (FVQ) [10] technique. In many real situations, fuzzy clustering is more natural than hard clustering, as objects on the boundaries between several classes are not forced to fully belong to one of the classes, but rather are assigned membership degrees between 0 and 1 indicating their partial memberships. The present study was therefore undertaken with the objective of finding components that improve speaker recognition efficiency.
-
Methodology
In the present investigation, GF and Tukey filters were used as the averaging bins, instead of triangular filters, for calculating MFCC as well as inverted MFCC in a typical speaker recognition application [11, 12]. There are three main motivations for using the GF and the Tukey filter. First, both filters provide a much smoother transition from one subband to the next, preserving most of the correlation between them. Second, their means and variances can be chosen independently, giving control over the amount of overlap with neighbouring subbands. Third, the design parameters of the GF and the Tukey filter can be calculated very easily from the mid- and end-points located at the base of the original TF used for MFCC and inverted MFCC. In this investigation, both the MFCC and the inverted MFCC filter banks are realized using a moderate variance, so that a GF or Tukey filter covers its subband while the correlation with neighbouring subbands is kept balanced. Results show that GF and Tukey based MFCC and inverted MFCC individually perform better than the conventional TF based MFCC and inverted MFCC. Results are also better when the model scores of GF and Tukey based MFCC and inverted MFCC are merged, in comparison with the results obtained by combining MFCC and inverted MFCC feature sets realized using the traditional TF [13]. All the implementations have been done with FVQ [14].
-
Mel Frequency and Their Calculation
-
Mel-Frequency Cepstral Coefficients using triangular filters
According to psychophysical studies, human perception of the frequency content of sounds follows a subjectively defined nonlinear scale called the Mel scale [15, 16]. MFCC is the most commonly used acoustic feature for speech and speaker recognition. It is the only acoustic approach that takes human perceptual sensitivity to frequency (the physiology and behavioural aspects of the voice production organs) into consideration, and is therefore well suited for speaker recognition. The Mel scale is defined as

$$f_{mel} = 2595 \log_{10}\left(1 + \frac{f}{700}\right) \qquad (1)$$

where $f_{mel}$ is the subjective pitch in Mels corresponding to f, the actual frequency in Hz.
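As a concrete illustration of eqn. (1) and its inverse, the following minimal Python sketch (NumPy only; the function names are ours, not the paper's) converts between Hz and Mel:

```python
import numpy as np

def hz_to_mel(f_hz):
    """Eqn. (1): subjective pitch in Mels for a frequency in Hz."""
    return 2595.0 * np.log10(1.0 + np.asarray(f_hz, dtype=float) / 700.0)

def mel_to_hz(f_mel):
    """Inverse mapping from Mels back to Hz (eqn. (5) below)."""
    return 700.0 * (10.0 ** (np.asarray(f_mel, dtype=float) / 2595.0) - 1.0)

if __name__ == "__main__":
    # With Fs = 8 kHz, f_low = 31.25 Hz and f_high = 4 kHz as used in this work.
    print(hz_to_mel([31.25, 1000.0, 4000.0]))   # approx. [49.2, 1000.0, 2146.1]
    print(mel_to_hz(hz_to_mel(4000.0)))         # approx. 4000.0 (round trip)
```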
Figure 1: Speaker recognition system (training mode: speech signal → feature extraction → speaker modeling for speakers 1…N; recognition mode: feature extraction → pattern matching against the speaker models → decision logic → decision)

This leads to the definition of MFCC, a baseline acoustic feature for speech and speaker recognition applications, which can be calculated as follows [17]. Let $\{y(n)\}_{n=1}^{N_s}$ represent a frame of speech that is pre-emphasized and Hamming-windowed.
First, y(n) is converted to the frequency domain by an $M_s$-point DFT, which leads to the energy spectrum

$$\left|Y(k)\right|^2 = \left| \sum_{n=1}^{N_s} y(n)\, e^{-j\frac{2\pi}{M_s}nk} \right|^2 \qquad (2)$$

where $1 \le k \le M_s$. This is followed by the construction of a filter bank with Q unity-height TFs, uniformly spaced in the Mel scale of eqn. (1). The filter response $\psi_i(k)$ of the i-th filter in the bank (figure 2) is defined as

$$\psi_i(k) = \begin{cases} 0 & \text{for } k < k_{b_{i-1}} \\[2pt] \dfrac{k - k_{b_{i-1}}}{k_{b_i} - k_{b_{i-1}}} & \text{for } k_{b_{i-1}} \le k \le k_{b_i} \\[2pt] \dfrac{k_{b_{i+1}} - k}{k_{b_{i+1}} - k_{b_i}} & \text{for } k_{b_i} \le k \le k_{b_{i+1}} \\[2pt] 0 & \text{for } k > k_{b_{i+1}} \end{cases} \qquad (3)$$

where $1 \le i \le Q$, Q is the number of filters in the bank, $\{k_{b_i}\}_{i=0}^{Q+1}$ are the boundary points of the filters, and k denotes the coefficient index of the $M_s$-point DFT. The boundary points are equally spaced in the Mel scale, which satisfies the definition

$$k_{b_i} = \left(\frac{M_s}{F_s}\right) f_{mel}^{-1}\!\left( f_{mel}(f_{low}) + \frac{i\,\big[f_{mel}(f_{high}) - f_{mel}(f_{low})\big]}{Q+1} \right) \qquad (4)$$

where the function $f_{mel}(\cdot)$ is defined in eqn. (1), $M_s$ is the number of points in the DFT of eqn. (2), $F_s$ is the sampling frequency, $f_{low}$ and $f_{high}$ are the low and high frequency boundaries of the filter bank, and $f_{mel}^{-1}$ is the inverse of the transformation in eqn. (1), defined as

$$f_{mel}^{-1}(f_{mel}) = 700\left(10^{\,f_{mel}/2595} - 1\right) \qquad (5)$$

The sampling frequency $F_s$ and the frequencies $f_{low}$, $f_{high}$ are in Hz, while $f_{mel}$ is in Mels. In this work, $F_s$ is 8 kHz, $M_s$ is taken as 256, $f_{low} = F_s/M_s = 31.25$ Hz and $f_{high} = F_s/2 = 4$ kHz. Next, this filter bank is imposed on the spectrum calculated in eqn. (2). The outputs $\{e(i)\}_{i=1}^{Q}$ of the Mel-scaled band-pass filters can be calculated by a weighted summation of the respective filter response $\psi_i(k)$ and the energy spectrum $|Y(k)|^2$ as

$$e(i) = \sum_{k=1}^{M_s/2} \psi_i(k)\, \left|Y(k)\right|^2 \qquad (6)$$

Finally, a DCT is taken on the log filter bank energies $\{\log_{10}(e(i))\}_{i=1}^{Q}$ and the final MFCC coefficients $C_m$ can be written as

$$C_m = \sqrt{\frac{2}{Q}}\, \sum_{i=1}^{Q} \log_{10}\big(e(i)\big)\, \cos\!\left( \frac{(2i-1)}{2}\cdot\frac{m\pi}{Q} \right) \qquad (7)$$

where $0 \le m \le R-1$ and R is the desired number of cepstral features.

Figure 2: Response of a typical Mel scale filter
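To make eqns. (2)-(7) concrete, the sketch below builds the Q triangular filters on Mel-spaced boundary bins and converts one pre-emphasized, Hamming-windowed frame into R cepstral coefficients. It is a minimal, self-contained illustration under the parameters quoted above (Fs = 8 kHz, Ms = 256, Q = 22, R = 25); helper names such as `triangular_filterbank` are ours, not the paper's.

```python
import numpy as np

hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)    # eqn. (1)
mel_to_hz = lambda m: 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)  # eqn. (5)

def triangular_filterbank(fs=8000, n_fft=256, n_filt=22, f_low=31.25, f_high=4000.0):
    """Eqns. (3)-(5): Q unity-height triangular filters on Mel-spaced boundary bins."""
    mel_pts = np.linspace(hz_to_mel(f_low), hz_to_mel(f_high), n_filt + 2)
    k_b = np.floor((n_fft / fs) * mel_to_hz(mel_pts)).astype(int)   # boundary bins, eqn. (4)
    fbank = np.zeros((n_filt, n_fft // 2))
    for i in range(1, n_filt + 1):
        lo, mid, hi = k_b[i - 1], k_b[i], k_b[i + 1]
        for k in range(lo, mid):
            fbank[i - 1, k] = (k - lo) / max(mid - lo, 1)           # rising edge of eqn. (3)
        for k in range(mid, min(hi, n_fft // 2)):
            fbank[i - 1, k] = (hi - k) / max(hi - mid, 1)           # falling edge of eqn. (3)
    return fbank, k_b

def mfcc_frame(frame, fbank, n_ceps=25):
    """Eqns. (2), (6), (7): energy spectrum -> filter energies -> DCT."""
    n_half = fbank.shape[1]
    spec = np.abs(np.fft.fft(frame, n=2 * n_half)) ** 2             # |Y(k)|^2 over Ms points
    e = fbank @ spec[:n_half] + 1e-12                               # e(i), eqn. (6)
    i = np.arange(1, fbank.shape[0] + 1)
    m = np.arange(n_ceps)[:, None]
    basis = np.cos(np.pi * m * (2 * i - 1) / (2 * fbank.shape[0]))  # cosine kernel of eqn. (7)
    return np.sqrt(2.0 / fbank.shape[0]) * (basis @ np.log10(e))    # C_m, m = 0..R-1
```

The same two functions are reused in the later sketches, with only the filter-shaping step replaced.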
-
Mel-Frequency Cepstrum Coefficients using Gaussian filters
The transfer function of the TF is asymmetric and tapered, and the filter does not provide any weight outside the subband that it covers. As a result, the correlation between a subband and its nearby spectral components from adjacent subbands is lost. In this investigation the GF and the Tukey filter are proposed, which produce symmetric, gradually decaying weights at both ends, compensating for the possible loss of correlation. Referring to eqn. (3), the expression for the GF can be written as [18]

$$\psi_i^{G}(k) = e^{-\frac{(k - k_{b_i})^2}{2\sigma_i^2}} \qquad (8)$$

where $k_{b_i}$, the point between the i-th filter's boundaries located at its base, is considered here as the mean of the i-th GF, while $\sigma_i$ is the standard deviation, defined as

$$\sigma_i = \frac{k_{b_{i+1}} - k_{b_i}}{\alpha} \qquad (9)$$

where α is the variance-controlling parameter. In eqn. (8) the conventional denominator, i.e. $\sqrt{2\pi}\,\sigma_i$, is dropped, as its presence only ensures that the area under the Gaussian curve is unity [19]. Moreover, omitting the term helps the GF achieve unity as its highest value at its mean, similar to the unity-height triangular filter used for conventional MFCC. Note that a TF becomes non-isosceles when mapped from the Mel scale to the linear frequency scale: the distances from its centre $k_{b_i}$ to the two end-points of its base become unequal. For the i-th MFCC filter the relation becomes

$$k_{b_{i+1}} - k_{b_i} > k_{b_i} - k_{b_{i-1}} \qquad (10)$$

We took the maximum spread of these two distances, i.e. $k_{b_{i+1}} - k_{b_i}$, to evaluate $\sigma_i$, ensuring full coverage of the subband by the GF.

Figure 3: Response of various shaped filters (TF and GF for $\sigma_i = k_{b_{i+1}} - k_{b_i}$, $(k_{b_{i+1}} - k_{b_i})/2$ and $(k_{b_{i+1}} - k_{b_i})/3$)

Figure 3 shows the plots of the TF and the GF for different values of σ. The figure clearly depicts that a triangular window gives some tapering at both of its ends but offers no weight outside its coverage. Figure 4 shows the standard deviation for different values of α; $k_{b_i}$ is the centre point of each filter, after which the transfer function decays gradually. A Gaussian with higher variance shows larger correlation with nearby frequency components, so the selection of α is critical for setting the variances of the GFs. In the present study the value α = 2 is used, and eqn. (9) can then be written as

$$\sigma_i = \frac{k_{b_{i+1}} - k_{b_i}}{2} \qquad (11)$$

Figure 4: Standard deviation of the GF for different values of α

Table 1. Summary of sigma and its coverage
α | % within the curve | % outside the curve
2 | 95.4499736 | 4.5500264
3 | 99.7300204 | 0.2699796

Table 1 shows the different values of α and their coverage within and outside the curve. With α = 2, 95% of the subband is covered, since Probability$\left(|k - k_{b_i}| \le 2\sigma_i\right) \approx 0.95$. Therefore α = 2 provides better correlation with nearby subbands in comparison to α = 3. In this study we have chosen α = 2 to design the filters for the MFCC filter bank; thus a balance is achieved where significant coverage of a particular subband is ensured while allowing moderate correlation between that subband and neighbouring ones. The cepstral vector using GFs can be calculated from the filter response of eqn. (8) as follows:

$$e_G(i) = \sum_{k=1}^{M_s/2} \psi_i^{G}(k)\, \left|Y(k)\right|^2 \qquad (12)$$

and

$$C_m^{G} = \sqrt{\frac{2}{Q}}\, \sum_{i=1}^{Q} \log_{10}\big(e_G(i)\big)\, \cos\!\left( \frac{(2i-1)}{2}\cdot\frac{m\pi}{Q} \right) \qquad (13)$$

Here the last 20 coefficients from both models are used, and the values Q = 22 and R = 25 are taken.
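A sketch of how the Gaussian weights of eqns. (8), (9) and (11) could replace the triangular weights, reusing the boundary bins k_b produced by the earlier sketch (the helper name `gaussian_filterbank` is ours):

```python
import numpy as np

def gaussian_filterbank(k_b, n_bins, alpha=2.0):
    """Eqns. (8)-(11): Gaussian filters centred at k_bi with sigma_i = (k_b(i+1) - k_bi) / alpha."""
    n_filt = len(k_b) - 2
    k = np.arange(n_bins)
    fbank = np.zeros((n_filt, n_bins))
    for i in range(1, n_filt + 1):
        sigma = max(k_b[i + 1] - k_b[i], 1) / alpha                       # widest half of the TF base
        fbank[i - 1] = np.exp(-((k - k_b[i]) ** 2) / (2.0 * sigma ** 2))  # eqn. (8), peak value = 1
    return fbank
```

As in the text, the 1/(√(2π)σ_i) normalisation is deliberately omitted so that every filter peaks at unity, like the TF.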
-
Mel-Frequency Cepstrum Coefficients using Tukey Filter
The Tukey filter is a combination of the rectangular window and the Hann window [20]. It is in fact a cosine-tapered window and, for the i-th subband, is defined as follows:

$$\psi_i^{T}(k) = \begin{cases} \frac{1}{2}\left\{1 + \cos\!\left[\pi\!\left(\frac{2\,(k - k_{b_{i-1}})}{r\,(k_{b_{i+1}} - k_{b_{i-1}})} - 1\right)\right]\right\} & \text{for } k_{b_{i-1}} \le k < k_{b_{i-1}} + \frac{r}{2}(k_{b_{i+1}} - k_{b_{i-1}}) \\[4pt] 1 & \text{for } k_{b_{i-1}} + \frac{r}{2}(k_{b_{i+1}} - k_{b_{i-1}}) \le k \le k_{b_{i+1}} - \frac{r}{2}(k_{b_{i+1}} - k_{b_{i-1}}) \\[4pt] \frac{1}{2}\left\{1 + \cos\!\left[\pi\!\left(\frac{2\,(k - k_{b_{i+1}})}{r\,(k_{b_{i+1}} - k_{b_{i-1}})} + 1\right)\right]\right\} & \text{for } k_{b_{i+1}} - \frac{r}{2}(k_{b_{i+1}} - k_{b_{i-1}}) < k \le k_{b_{i+1}} \\[4pt] 0 & \text{otherwise} \end{cases} \qquad (14)$$
where k = 0, 1, …, N−1 is the discrete frequency index, $0 \le k_{b_{i-1}} < k_{b_i} < k_{b_{i+1}}$ are the basic filter frequencies, the filter has unit amplitude, and the length of the i-th filter is N samples. $N_{Hann}$, the number of samples in the tapered (Hann) sections, is defined as

$$N_{Hann} = \left\lceil r\,(N-1) \right\rceil + 1 \qquad (15)$$

where r is the ratio of the taper to the constant section and $0 \le r \le 1$. When r = 0 the filter corresponds to a rectangular filter; when r = 1 it corresponds to a Hann filter. $N_{Rect}$, the number of samples in the constant section of eqn. (14), is the complement of $N_{Hann}$ and is defined as

$$N_{Rect} = \left\lfloor (1 - r)\,(N-1) \right\rfloor + 1 \qquad (16)$$

Figure 5: Response of Tukey filters for different values of r

Figure 5 shows the plot of the Tukey filter for different values of r. The figure clearly depicts that a Tukey window gives some tapering at both of its ends. For the Tukey filter bank, the filter energies and cepstral coefficients are then computed exactly as in eqns. (6) and (7), with $\psi_i^{T}(k)$ in place of $\psi_i(k)$.
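A sketch of how the cosine-tapered weights of eqn. (14) could be generated over each subband (written directly from the piecewise form above rather than any library routine; the helper name `tukey_filterbank` and the default r = 0.5 are our own choices):

```python
import numpy as np

def tukey_filterbank(k_b, n_bins, r=0.5):
    """Eqn. (14): Tukey (cosine-tapered) filters over each subband [k_b(i-1), k_b(i+1)]."""
    n_filt = len(k_b) - 2
    fbank = np.zeros((n_filt, n_bins))
    for i in range(1, n_filt + 1):
        lo, hi = k_b[i - 1], k_b[i + 1]
        width = max(hi - lo, 1)
        taper = r * width / 2.0                             # length of each tapered section
        for k in range(lo, min(hi + 1, n_bins)):
            if k < lo + taper:                              # rising cosine taper
                fbank[i - 1, k] = 0.5 * (1 + np.cos(np.pi * (2 * (k - lo) / (r * width) - 1)))
            elif k > hi - taper:                            # falling cosine taper
                fbank[i - 1, k] = 0.5 * (1 + np.cos(np.pi * (2 * (k - hi) / (r * width) + 1)))
            else:                                           # flat (rectangular) section
                fbank[i - 1, k] = 1.0
    return fbank
```

The weights rise from 0 at the lower boundary to 1, stay flat over the constant section, and decay back to 0 at the upper boundary, so neighbouring subbands overlap smoothly.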
-
Inverted Mel Frequency Cepstral Coefficients Calculation
-
Inverted Mel-Frequency Cepstral Coefficients using triangular filters
The main objective is to capture the information that has been missed by the original MFCC [21]. In this study the new filter bank structure is obtained simply by flipping the original filter bank around the point f = 2 kHz, which is precisely the mid-point of the frequency range considered for speaker recognition applications. This flip-over is expressed mathematically as

$$\hat{\psi}_i(k) = \psi_{Q+1-i}\!\left(\frac{M_s}{2} + 1 - k\right) \qquad (17)$$

where $\hat{\psi}_i(k)$ is the inverted Mel scale filter response, $\psi_i(k)$ is the response of the original MFCC filter bank, $1 \le i \le Q$, and Q is the number of filters in the bank. From eqn. (17) we can derive an expression for $\hat{\psi}_i(k)$ analogous to eqn. (3) for the original MFCC filter bank:

$$\hat{\psi}_i(k) = \begin{cases} 0 & \text{for } k < \hat{k}_{b_{i-1}} \\[2pt] \dfrac{k - \hat{k}_{b_{i-1}}}{\hat{k}_{b_i} - \hat{k}_{b_{i-1}}} & \text{for } \hat{k}_{b_{i-1}} \le k \le \hat{k}_{b_i} \\[2pt] \dfrac{\hat{k}_{b_{i+1}} - k}{\hat{k}_{b_{i+1}} - \hat{k}_{b_i}} & \text{for } \hat{k}_{b_i} \le k \le \hat{k}_{b_{i+1}} \\[2pt] 0 & \text{for } k > \hat{k}_{b_{i+1}} \end{cases} \qquad (18)$$

where $1 \le k \le M_s$ and $\{\hat{k}_{b_i}\}_{i=0}^{Q+1}$ are the boundary points of the flipped filters. The inverted Mel scale is defined as

$$\hat{f}_{mel}(f) = 2195.2860 - 2595\log_{10}\!\left(1 + \frac{4031.25 - f}{700}\right) \qquad (19)$$

where $\hat{f}_{mel}$ is the subjective pitch in the new scale corresponding to f, the actual frequency in Hz. The filter outputs $\{\hat{e}(i)\}_{i=1}^{Q}$ are computed in the same way as for MFCC, from the same energy spectrum $|Y(k)|^2$, as

$$\hat{e}(i) = \sum_{k=1}^{M_s/2} \hat{\psi}_i(k)\, \left|Y(k)\right|^2 \qquad (20)$$

A DCT is taken on the log filter bank energies $\{\log_{10}(\hat{e}(i))\}_{i=1}^{Q}$ and the final inverted MFCC coefficients $\{\hat{C}_m\}_{m=1}^{R}$ can be written as

$$\hat{C}_m = \sqrt{\frac{2}{Q}}\, \sum_{i=1}^{Q} \log_{10}\big(\hat{e}(i)\big)\, \cos\!\left( \frac{(2i-1)}{2}\cdot\frac{m\pi}{Q} \right) \qquad (21)$$

where $0 \le m \le R-1$.
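Eqn. (17) amounts to flipping the original filter bank in both the filter index and the frequency index. A one-line NumPy sketch of that operation (assuming the filter bank is stored as a Q × Ms/2 matrix, as in the earlier sketches):

```python
import numpy as np

def invert_filterbank(fbank):
    """Eqn. (17): psi_hat_i(k) = psi_{Q+1-i}(Ms/2 + 1 - k) -- flip filters and frequency bins."""
    return fbank[::-1, ::-1]

# Usage sketch: the inverted filter energies then follow eqns. (20)-(21) exactly as for MFCC,
# e.g. e_hat = invert_filterbank(fbank) @ spec[: fbank.shape[1]].
```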
-
Inverted Mel-Frequency Cepstral Coefficients using Gaussian filters
It is expected that introducing correlation between subband outputs in the inverted Mel-scaled filter bank makes it more complementary than what was realized using the TF. Flipping the original triangular filter bank around 2 kHz also inverts the relation mentioned in eqn. (10), which gives

$$\hat{k}_{b_i} - \hat{k}_{b_{i-1}} > \hat{k}_{b_{i+1}} - \hat{k}_{b_i} \qquad (22)$$

Here $\hat{k}_{b_i}$ is the mean of the i-th GF and the standard deviation can be calculated as

$$\hat{\sigma}_i = \frac{\hat{k}_{b_i} - \hat{k}_{b_{i-1}}}{\alpha} \qquad (23)$$

Here, too, the value α = 2 is chosen. The response of the GF for the inverted MFCC filter bank and the corresponding cepstral parameters can be calculated as follows:

$$\hat{\psi}_i^{G}(k) = e^{-\frac{(k - \hat{k}_{b_i})^2}{2\hat{\sigma}_i^2}} \qquad (24)$$

$$\hat{e}_G(i) = \sum_{k=1}^{M_s/2} \hat{\psi}_i^{G}(k)\, \left|Y(k)\right|^2 \qquad (25)$$

and

$$\hat{C}_m^{G} = \sqrt{\frac{2}{Q}}\, \sum_{i=1}^{Q} \log_{10}\big(\hat{e}_G(i)\big)\, \cos\!\left( \frac{(2i-1)}{2}\cdot\frac{m\pi}{Q} \right) \qquad (26)$$

-
Inverted Mel-Frequency Cepstral Coefficients using Tukey Filter
Here, too, the objective is to capture the information missed by the original MFCC. The new filter bank structure is obtained by flipping the original Tukey filter bank around the point f = 2 kHz, the mid-point of the frequency range considered for speaker recognition applications. This flip-over is expressed mathematically, analogously to eqn. (14), as

$$\hat{\psi}_i^{T}(k) = \begin{cases} \frac{1}{2}\left\{1 + \cos\!\left[\pi\!\left(\frac{2\,(k - \hat{k}_{b_{i-1}})}{r\,(\hat{k}_{b_{i+1}} - \hat{k}_{b_{i-1}})} - 1\right)\right]\right\} & \text{for } \hat{k}_{b_{i-1}} \le k < \hat{k}_{b_{i-1}} + \frac{r}{2}(\hat{k}_{b_{i+1}} - \hat{k}_{b_{i-1}}) \\[4pt] 1 & \text{for } \hat{k}_{b_{i-1}} + \frac{r}{2}(\hat{k}_{b_{i+1}} - \hat{k}_{b_{i-1}}) \le k \le \hat{k}_{b_{i+1}} - \frac{r}{2}(\hat{k}_{b_{i+1}} - \hat{k}_{b_{i-1}}) \\[4pt] \frac{1}{2}\left\{1 + \cos\!\left[\pi\!\left(\frac{2\,(k - \hat{k}_{b_{i+1}})}{r\,(\hat{k}_{b_{i+1}} - \hat{k}_{b_{i-1}})} + 1\right)\right]\right\} & \text{for } \hat{k}_{b_{i+1}} - \frac{r}{2}(\hat{k}_{b_{i+1}} - \hat{k}_{b_{i-1}}) < k \le \hat{k}_{b_{i+1}} \\[4pt] 0 & \text{otherwise} \end{cases} \qquad (27)$$

where $\{\hat{k}_{b_i}\}$ are the inverted boundary points; the corresponding filter energies and cepstral coefficients then follow eqns. (20) and (21).

-
Synthesis of MFCC and IMFCC
The idea of combining classifiers to enhance the decision-making process has been successful in many pattern classification problems, including speaker identification. According to the available literature, a combination of two or more classifiers performs better if the classifiers are supplied with information that is complementary in nature. Adopting this idea in our work, we supplied the MFCC and IMFCC feature vectors, which are complementary in information content, to two classifiers respectively and finally fused their decisions in order to obtain improved identification accuracy. The same principle has been adopted for the GF and Tukey based MFCC and IMFCC as well. In this context, it should be noted that our computation of complementary information from IMFCC involves a comparably lower computational complexity than higher-level features.

The MFCC and IMFCC feature vectors, containing complementary information about the speakers, were supplied to the given classifiers independently, and the classification results for the MFCC features and the IMFCC features were fused in order to obtain the optimum decision in the speaker recognition process. A uniformly weighted sum rule was adopted to fuse the scores from the two classifiers. If $X_m^{MFCC}$ denotes the classification score based on the MFCC and $X_m^{IMFCC}$ denotes the classification score based on the IMFCC, then the combined score for the m-th speaker was given as

$$X_m^{COM} = w\,X_m^{MFCC} + (1 - w)\,X_m^{IMFCC} \qquad (28)$$

The constant value w = 0.5 was used in all cases. The speaker was determined as

$$S = \arg\max_{m}\, X_m^{COM} \qquad (29)$$

-
Theoretical Background of VQ
In the VQ-based approach the speaker models are formed by clustering the speaker's feature vectors into K non-overlapping clusters. Each cluster is represented by a code vector $C_i$, which is its centroid [22]. The resulting set of code vectors $\{C_1, C_2, C_3, \dots, C_K\}$ is called a codebook, and it serves as the model of the speaker. The model size (number of code vectors) is significantly smaller than the training set, while the distribution of the code vectors follows the same underlying distribution as the training vectors. Thus, the codebook effectively reduces the amount of data while preserving the essential information of the original distribution. K-means is an iterative approach; in each successive iteration it redistributes the vectors in order to minimize the distortion. The procedure is outlined below (a minimal sketch follows the list):
- Initialize random centroids as the means of the M clusters.
- Associate each data point with the nearest centroid.
- Move the centroids to the centre of their respective clusters.
- Repeat steps b and c until a suitable level of convergence has been reached, i.e. the distortion is minimized.
- When the distortion is minimized, redistribution does not result in any movement of vectors among the clusters; this can be used as an indicator to terminate the algorithm. Upon convergence, the total distortion does not change as a result of redistribution.
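A minimal sketch of this codebook-training loop, under the assumption that the feature vectors are rows of a NumPy array (the helper name `train_codebook` and the random initialisation are ours):

```python
import numpy as np

def train_codebook(features, n_codes=16, n_iter=50, seed=0):
    """K-means style VQ codebook: features is an (n_vectors, dim) array of cepstral vectors."""
    features = np.asarray(features, dtype=float)
    rng = np.random.default_rng(seed)
    codebook = features[rng.choice(len(features), n_codes, replace=False)].copy()  # random centroids
    for _ in range(n_iter):
        d = np.linalg.norm(features[:, None, :] - codebook[None, :, :], axis=2)
        nearest = d.argmin(axis=1)                       # assign each vector to its nearest code
        for j in range(n_codes):
            if np.any(nearest == j):
                codebook[j] = features[nearest == j].mean(axis=0)  # move centroid to cluster mean
    return codebook
```

In practice the loop would also stop early once the assignments (and hence the total distortion) stop changing, as described in the last step above.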
-
Linde, Buzo and Gray Clustering Technique
The acoustic vectors extracted from the input speech of a speaker provide a set of training vectors. As described above, the next important step is to build a speaker-specific VQ codebook for this speaker using those training vectors. There is a well-known algorithm, namely the LBG algorithm, for clustering a set of L training vectors into a set of M codebook vectors. The LBG VQ design algorithm is an iterative algorithm which alternately solves the two optimality criteria. The algorithm requires an initial codebook C0, which is obtained by the splitting method. In this method, an initial code vector is set as the average of the entire training sequence. This code vector is then split into two, and the iterative algorithm is run with these two vectors as the initial codebook. The final two code vectors are split into four, and the process is repeated until the desired number of code vectors is obtained.
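A hedged sketch of the splitting procedure just described (the perturbation factor eps = 0.01 and the helper name `lbg_codebook` are our own choices; the desired codebook size is assumed to be a power of two):

```python
import numpy as np

def lbg_codebook(features, n_codes=16, eps=0.01, n_iter=20):
    """LBG: start from the global mean, repeatedly split each code vector and refine."""
    features = np.asarray(features, dtype=float)
    codebook = features.mean(axis=0, keepdims=True)          # initial code vector C0
    while len(codebook) < n_codes:                            # n_codes should be a power of two
        codebook = np.vstack([codebook * (1 + eps), codebook * (1 - eps)])  # split every code
        for _ in range(n_iter):                               # k-means style refinement
            d = np.linalg.norm(features[:, None, :] - codebook[None, :, :], axis=2)
            nearest = d.argmin(axis=1)
            for j in range(len(codebook)):
                if np.any(nearest == j):
                    codebook[j] = features[nearest == j].mean(axis=0)
    return codebook
```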
-
K-means Clustering Technique
The standard k-means algorithm is a typical clustering algorithm used in data mining and is widely used for clustering large data sets. MacQueen first proposed the k-means algorithm in 1967; it is one of the simplest unsupervised learning algorithms applied to the well-known clustering problem. It is a partitioning clustering algorithm that classifies the given data objects into k different clusters iteratively, converging to a local minimum, so that the resulting clusters are compact and independent. The algorithm consists of two separate phases. The first phase selects k centres randomly, where the value k is fixed in advance. The next phase assigns each data object to the nearest centre. The Euclidean distance is generally used to determine the distance between each data object and the cluster centres. When all data objects have been assigned to a cluster, the first step is complete and an early grouping is done. This process is repeated until the criterion function reaches its minimum. Supposing that the target object is x and $\bar{x}_i$ indicates the mean of cluster $c_i$, the criterion function is defined in eqn. (30):

$$E = \sum_{i=1}^{k} \sum_{x \in c_i} \left\| x - \bar{x}_i \right\|^2 \qquad (30)$$

E is the sum of the squared error over all objects in the database. The distance used in the criterion function is the Euclidean distance, which determines the nearest distance between each data object and the cluster centres. For one vector $x = (x_1, x_2, \dots, x_n)$ and another vector $y = (y_1, y_2, \dots, y_n)$, the Euclidean distance $d(x, y)$ can be obtained as shown in eqn. (31):

$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} \qquad (31)$$

-
Fuzzy C-means Clustering
In speech-based pattern recognition, VQ is a widely used feature modeling and classification algorithm, since it is a simple and computationally very efficient technique. FVQ reduces the disadvantages of classical vector quantization. Unlike the Linde-Buzo-Gray (LBG) and k-means algorithms, the FVQ technique follows the principle that a feature vector located between clusters should not be assigned to only one cluster; therefore, in FVQ each feature vector has an association with all clusters [23]. The discrete nature of hard partitioning also causes analytical and algorithmic intractability of algorithms based on such membership values, since the function values are not differentiable. Fuzzy c-means is a clustering technique that permits one piece of data to belong to more than one cluster at the same time. It aims at minimizing the objective function defined by eqn. (32):

$$J_m(U, V) = \sum_{i=1}^{C} \sum_{j=1}^{N} u_{ij}^{\,m} \left\| x_j - c_i \right\|^2, \qquad 1 < m < \infty \qquad (32)$$

where C is the number of clusters, N is the number of data elements, $x_j$ is a column vector of X, $c_i$ is the centroid of the i-th cluster, and $u_{ij}$ is an element of U denoting the membership of data element j in the i-th cluster, subject to the constraints $u_{ij} \in [0, 1]$ and $\sum_{i=1}^{C} u_{ij} = 1$ for all j. m is a free parameter which plays a central role in adjusting the blending degree of the different clusters; in the limit m → 1, J reduces to a sum-of-squared-error criterion and $u_{ij}$ becomes a Boolean membership value (either 0 or 1). The norm $\|\cdot\|$ can be any norm expressing similarity [24]. Fuzzy partitioning is carried out through an iterative optimization of the objective function, updating the membership $u_{ij}$ and the cluster centre $c_i$ using eqns. (33) and (34):

$$u_{ij} = \frac{1}{\displaystyle\sum_{k=1}^{C} \left( \frac{\|x_j - c_i\|}{\|x_j - c_k\|} \right)^{\frac{2}{m-1}}} \qquad (33)$$

$$c_i = \frac{\displaystyle\sum_{j=1}^{N} u_{ij}^{\,m}\, x_j}{\displaystyle\sum_{j=1}^{N} u_{ij}^{\,m}} \qquad (34)$$

The iteration stops when $\max_{ij} \left| u_{ij}^{(K+1)} - u_{ij}^{(K)} \right| < \varepsilon$, where ε is the termination criterion.
The algorithm for Fuzzy c-means clustering includes the following steps (a minimal sketch follows the list):
- Initialize C, N, m and U.
- Repeat:
- Minimize J by computing the memberships $u_{ij}$ using eqn. (33).
- Normalize $u_{ij}$ so that $\sum_{i=1}^{C} u_{ij} = 1$.
- Compute the centroids $c_i$ using eqn. (34).
- Until there is only a slight change in U and V.
- End
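A minimal NumPy sketch of the update loop in eqns. (32)-(34) and the steps above (`fuzzy_cmeans`, its defaults and the random initialisation are ours, not the authors' implementation):

```python
import numpy as np

def fuzzy_cmeans(x, n_clusters, m=2.0, eps=1e-4, max_iter=100, seed=0):
    """Fuzzy c-means: x is (N, dim); returns centroids V (C, dim) and memberships U (C, N)."""
    x = np.asarray(x, dtype=float)
    rng = np.random.default_rng(seed)
    u = rng.random((n_clusters, len(x)))
    u /= u.sum(axis=0, keepdims=True)                      # enforce sum_i u_ij = 1
    for _ in range(max_iter):
        um = u ** m
        v = (um @ x) / um.sum(axis=1, keepdims=True)       # centroid update, eqn. (34)
        d = np.linalg.norm(x[None, :, :] - v[:, None, :], axis=2) + 1e-12
        u_new = 1.0 / np.sum((d[:, None, :] / d[None, :, :]) ** (2.0 / (m - 1.0)), axis=1)
        if np.max(np.abs(u_new - u)) < eps:                # stopping rule: max|u^(K+1) - u^(K)| < eps
            u = u_new
            break
        u = u_new
    return v, u
```

The membership update of eqn. (33) already yields columns that sum to one, so the normalisation step is satisfied implicitly.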
-
A schematic description of this scheme for the parallel combination of classifiers is given in figure 6.

Fig. 6 Parallel classifier based SI system (in training and testing, the pre-processed speech is converted into MFCC and IMFCC feature streams, each matched against its own FVQ speaker models; the two matching scores are summed with weights W and 1−W)
-
Experimental setup
-
Pre-Processing Stage
In this work, each frame of speech is pre-processed as follows:
- Silence removal and end-point detection using an energy threshold criterion.
- Pre-emphasis with a 0.97 pre-emphasis factor.
- Frame blocking with a 20 ms frame length, i.e. $N_s = 160$ samples/frame with 50% overlap, and finally Hamming windowing.
The MFCC and IMFCC feature sets using the triangular, Gaussian and Tukey filters are then calculated.
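A minimal sketch of this pre-processing chain (NumPy only; the silence-removal threshold, set relative to the maximum frame energy, is our own illustrative choice since the paper only names an energy criterion):

```python
import numpy as np

def preprocess(signal, frame_len=160, overlap=0.5, pre_emph=0.97, energy_thresh=0.01):
    """Pre-emphasis, 20 ms / 50% overlap framing at 8 kHz, energy-based silence removal,
    and Hamming windowing of each remaining frame."""
    signal = np.asarray(signal, dtype=float)
    emphasized = np.append(signal[0], signal[1:] - pre_emph * signal[:-1])  # y(n) - 0.97*y(n-1)
    hop = int(frame_len * (1.0 - overlap))
    frames = np.array([emphasized[s:s + frame_len]
                       for s in range(0, len(emphasized) - frame_len + 1, hop)])
    energy = (frames ** 2).sum(axis=1)
    frames = frames[energy > energy_thresh * energy.max()]                  # drop silent frames
    return frames * np.hamming(frame_len)                                   # window each frame
```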
-
POLYCOST Database
The database was collected through the European telephone network. The recordings were made with ISDN cards on two XTL SUN platforms at an 8 kHz sampling rate. In this work, a closed set text independent speaker identification problem is addressed where only the mother tongue (MOT) files are used. The specified guideline [25] for conducting closed set speaker identification experiments is adhered to, i.e. MOT02 files from the first four sessions are used to build a speaker model while MOT01 files from session five onwards are taken for testing. In the POLYCOST corpus, the English prompts are fully annotated in terms of word boundaries, while the mother tongue prompts are only labelled at the word level with no segmentation. In both cases, the SpeechDat recommendations were used while performing the annotation.
-
Self Collected Voice Database
The voice corpus was collected in an uncontrolled environment using Microsoft Sound Recorder and a good quality headphone; the speakers belong to different parts of India. The average duration of the training samples was 6 seconds per speaker, and out of twenty utterances one was used for training. For matching purposes the remaining 19 utterances of length 6 seconds were used, each further divided into subsequences of lengths 6 s (100%), 3 s (50%), 2 s (33%), 1 s (16%) and 0.5 s (8%). Therefore, for 70 speakers we put 70 × 19 × 5 = 6650 utterances under test and evaluated the identification efficiency.
-
TIMIT Database
The TIMIT speech corpus consists of 630 speakers (438 male and 192 female). For each speaker only one recording session was used. The speech data was recorded in a sound booth and contains fixed text sentences read by speakers and recorded over a fixed wideband channel. TIMIT contains a total of 6300 sentences, 10 sentences spoken by each of 630 speakers from 8 major dialect regions of the United States. The speakers used American English. The main limitation of the TIMIT corpus is that the speech is recorded only during one session for each speaker, therefore the data does not reflect time related variations in speech characteristics. Moreover, the clean wideband speech environment in TIMIT has an ideal character and does not simulate the real world condition appearing in typical speaker recognition applications.
-
Score Calculation
For any closed-set speaker identification problem, speaker identification accuracy is defined as follows, and we have used the same definition:
Percentage of Identification Accuracy (PIA) = (number of utterances correctly identified / total number of utterances under test) × 100.
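A small sketch of how the fused decision of eqns. (28)-(29) and this accuracy measure could be computed (we assume one score per enrolled speaker per test utterance, with higher scores meaning a better match):

```python
import numpy as np

def identify(mfcc_scores, imfcc_scores, w=0.5):
    """Eqns. (28)-(29): fuse per-speaker scores with weight w and pick the best speaker index."""
    combined = w * np.asarray(mfcc_scores, float) + (1.0 - w) * np.asarray(imfcc_scores, float)
    return int(np.argmax(combined))

def percentage_identification_accuracy(predicted_ids, true_ids):
    """PIA = correctly identified utterances / total utterances, as a percentage."""
    predicted_ids, true_ids = np.asarray(predicted_ids), np.asarray(true_ids)
    return 100.0 * np.mean(predicted_ids == true_ids)
```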
-
Experimental Results
For each database, we evaluated the performance of an MFCC based classifier and an IMFCC based classifier, where each feature set was implemented using the TF, the GF and the Tukey filter.
-
Results for POLYCOST Database
Table II describes the identification results for various model orders of fuzzy c-means clustering VQ with the TF based MFCC and IMFCC feature sets. The last column in the table depicts the identification accuracies for the combined scheme. The combined scheme shows significant improvements over the MFCC based SI system for different model orders. Further, even the independent performance of the IMFCC based classifier is comparable to that of the MFCC based classifier. Table III presents the PIA of the individual MFCC, IMFCC and fused schemes when GFs are used. It is evident from the table that the individual performance of each feature set improves when compared against the conventional TF based MFCC and IMFCC. The fused scheme also outperforms the GF based single-stream MFCC as well as the earlier combined scheme using TFs, which in turn shows the enhancement of complementary information obtained by applying the GF to realize the filter bank. Table IV presents the PIA of the individual MFCC, IMFCC and fused schemes when Tukey filters are used. It is evident from the table that the individual performance of each feature set improves when compared against the conventional TF based MFCC and IMFCC. The fused scheme also outperforms the Tukey based single-stream MFCC as well as the earlier combined scheme using TFs, which in turn shows the enhancement of complementary information obtained by applying the Tukey filter to realize the filter bank.
Table II: Results (PIA) for POLYCOST database using TF based MFCC & IMFCC
No. of Utterances | MFCC | IMFCC | Combined Systems
1300 | 77.4515 | 76.2599 | 83.0345
650 | 79.2349 | 78.0557 | 84.1631

Table III: Results (PIA) for POLYCOST database using GF based MFCC & IMFCC
No. of Utterances | MFCC | IMFCC | Combined Systems
1300 | 78.8472 | 77.6599 | 84.0955
650 | 80.9019 | 79.5862 | 85.7586

Table IV: Results (PIA) for POLYCOST database using Tukey filter based MFCC & IMFCC
No. of Utterances | MFCC | IMFCC | Combined Systems
1300 | 78.5472 | 77.4599 | 83.3955
650 | 79.8019 | 79.3862 | 84.7586
Results show that the complementary information supplied helps to improve the performance of MFCC in the parallel classifier to a great extent for both types of filters. Thus it can be said that, compared to a single MFCC based classifier, a speaker can be modeled with the same accuracy but at a comparatively lower model order by an MFCC-IMFCC parallel classifier. It can further be concluded that GF based IMFCC provides better complementary information than TF and Tukey based IMFCC. Figure 7 shows the graphical presentation of the percentage of identification accuracy for the POLYCOST database.
Figure 7: Graphical presentation of PIA for the POLYCOST database (MFCC, IMFCC and combined systems using TF, GF and Tukey filters, for 1300 and 650 utterances)
-
Results for Self Collected Voice Database
Tables V, VI and VII show the identification accuracies for the self collected voice database for the TF, GF and Tukey based filters respectively. The PIA obtained using the GF based filter bank improves for the individual feature sets and the combined scheme over various model orders. As the results show, it can be observed from these tables that the combined scheme gives a significant improvement over the baseline MFCC based system irrespective of the filter type.
Table V: Results (PIA) for Self Collected Voice database using TF based MFCC & IMFCC
No. of Utterances | MFCC | IMFCC | Combined Systems
6650 | 81.9515 | 80.3259 | 85.2345
3325 | 83.8515 | 82.8557 | 86.6631

Table VI: Results (PIA) for Self Collected Voice database using GF based MFCC & IMFCC
No. of Utterances | MFCC | IMFCC | Combined Systems
6650 | 82.9515 | 81.8734 | 88.2445
3325 | 84.8515 | 83.8557 | 89.6731

Table VII: Results (PIA) for Self Collected Voice database using Tukey based MFCC & IMFCC
No. of Utterances | MFCC | IMFCC | Combined Systems
6650 | 82.5324 | 81.9934 | 87.5443
3325 | 84.3747 | 83.3145 | 88.4534
Figure 8 shows the graphical presentation of the percentage of identification accuracy for the self collected voice database. The GF performs the best among all the above mentioned filters.
-
Results for TIMIT
Tables VIII, IX and X show the identification accuracies for the TIMIT database for the TF, GF and Tukey based filters respectively.
Table VIII: Results (PIA) for TIMIT database using TF based MFCC & IMFCC
No. of Utterances | MFCC | IMFCC | Combined Systems
6300 | 80.9545 | 79.2389 | 83.6234
3150 | 80.8978 | 79.7695 | 83.6598

Table IX: Results (PIA) for TIMIT database using GF based MFCC & IMFCC
No. of Utterances | MFCC | IMFCC | Combined Systems
6300 | 81.8976 | 79.9876 | 85.2386
3150 | 82.8734 | 81.5623 | 85.2457
Figure 8: Graphical presentation of PIA for the Self Collected Voice Database (MFCC, IMFCC and combined systems using TF, GF and Tukey filters, for 6650 and 3325 utterances)
Table X: Results (PIA) for TIMIT database using Tukey based MFCC & IMFCC
No. of Utterances | MFCC | IMFCC | Combined Systems
6300 | 80.9356 | 79.3563 | 84.9823
3150 | 81.9576 | 80.886 | 84.8967
It could be further concluded that GF based IMFCC provides better complementary information than TF and Tukey based IMFCC.
Figure 9: Graphical presentation of PIA for the TIMIT database (MFCC, IMFCC and combined systems using TF, GF and Tukey filters, for 6300 and 3150 utterances)
Figure 9 shows the graphical presentation of the percentage of identification accuracy for the TIMIT database. As mentioned above, the GF performs the best among all the above mentioned filters.
-
Conclusion
A Gaussian filter based Mel and inverted Mel scaled filter bank is proposed in this paper, after it yielded promising accuracy in comparison with the TF and the Tukey filter. A uniform variance is used to design the filter banks, which maintains a good balance between a filter's coverage area and the amount of correlation. In both scales, cepstral vectors are obtained and modeled separately by the fuzzy c-means clustering VQ method. Performance is found to be superior when the individual performance of each newly proposed feature set is compared with its corresponding baseline. Results are shown for the individual cases as well as for the combined feature set on three speech databases, each of which contains a substantial number of speakers. The GF and the Tukey filter show better identification accuracy compared to the TF.
-
References
-
D. Gatica-Perez, G. Lathoud, J.-M. Odobez and I. McCowan, Audiovisual probabilistic tracking of multiple speakers in meetings, IEEE Transactions on Speech and Audio Processing, 15(2), 2007, pp. 601-616.
-
J. P. Campbell, Jr., Speaker Recognition: A Tutorial, Proceedings of the IEEE, 85(9), 1997, pp. 1437-1462.
-
Faundez-Zanuy M. and Monte-Moreno E., State-of-the-art in speaker recognition, IEEE Aerospace and Electronic Systems Magazine, 20(5), 2005, pp. 7-12.
-
K. Saeed and M. K. Nammous, Heuristic method of Arabic speech recognition, in Proc. IEEE 7th Int. Conf. DSPA, Moscow, Russia, 2005, pp. 528-530.
-
D. Olguin, P.A.Goor, and A. Pentland, Capturing individual and group behavior with wearable sensors, in Proceedings of AAAI Spring Symposium on Human Behavior Modeling 2009.
-
S. B. Davis and P. Mermelstein, Comparison of Parametric Representation for Monosyllabic Word Recognition in Continuously Spoken Sentences IEEE Trans. On ASSP, 28(4), 1980, pp. 357-365.
-
R. Vergin, D. O'Shaughnessy and A. Farhat, Generalized Mel frequency cepstral coefficients for large-vocabulary speaker-independent continuous-speech recognition, IEEE Trans. on ASSP, 7(5), 1999, pp. 525-532.
-
Chakroborty, S., Roy, A. and Saha, G, Improved Closed set Text- Independent Speaker Identification by Combining MFCC with Evidence from Flipped Filter Bank International Journal of Signal Processing, 4(2), 2007, pp. 114-122.
-
S. Singh and E. G. Rajan, A Vector Quantization Approach Using MFCC for Speaker Recognition, International Conference on Systemics, Cybernetics and Informatics (ICSCI), Pentagram Research Centre, Hyderabad, 2007, pp. 786-790.
-
K. Sri Rama Murty and B. Yegnanarayana, Combining evidence from residual phase and MFCC features for speaker recognition IEEE Signal Processing Letters, 13(1),2006, pp. 52-55.
-
Yegnanarayana B., Prasanna S.R.M., Zachariah J.M. and Gupta C. S, Combining evidence from source suprasegmental and spectral features for a fixed-text speaker verification system, IEEE Trans. Speech and Audio Processing, 13(4), 2005, pp. 575-582.
-
J. Kittler, M. Hatef, R. Duin, J. Mataz, On combining classifiers IEEE Trans, Pattern Anal. Mach. Intell, 20(3), 1998, pp. 226-239.
-
He, J., Liu, L., Palm, G, A Discriminative Training Algorithm for VQ-based Speaker Identification , IEEE Transactions on Speech and Audio Processing, 7(3), 1999, pp. 353-356.
-
Laurent Besacier and Jean-Francois Bonastre, Subband architecture for automatic speaker recognition, Signal Processing, 80, 2000, pp. 1245-1259.
-
Zheng F., Zhang, G. and Song, Z, Comparison of different implementations of MFCC , J. Computer Science & Technology 16(6), 2001, pp. 582-589.
-
Ganchev, T., Fakotakis, N., and Kokkinakis, G. Comparative Evaluation of Various MFCC Implementations on the Speaker Verification Task Proc. of SPECOM Patras, Greece, 2005, pp. 1191-194.
-
Zhen B., Wu X., Liu Z., Chi H, On the use of band pass filtering in speaker recognition, Proc. 6th Int. Conf. of Spoken Lang. Processing (ICSLP), Beijing,
China, 2000
-
S. Singh, E. G. Rajan, P. Sivakumar, M. Bhoopathy and V. Subha, Text Dependent Speaker Recognition System in Presence Monitoring, International Conference on Systemics, Cybernetics and Informatics (ICSCI), Pentagram Research Centre, Hyderabad, 2008, pp. 550-554.
-
A. Papoulis and S. U. Pillai, Probability, Random variables and Stochastic Processes, Tata McGraw-Hill Edition, Fourth Edition, Chap. 4, 2002, pp. 72-122.
-
Oppenheim, A.V., Schafer, R.W., Buck, J.R, Discrete-Time Signal Processing, 2nd ed., Upper Saddle River,NJ, Prentice Hall, 1999
-
Yegnanarayana B., Prasanna S.R.M., Zachariah J.M. and Gupta C. S, Combining evidence from source, suprasegmental and spectral features for a fixed-text speaker verification system, IEEE Trans. Speech and Audio Processing, Vol. 13, No. 4, 2005, pp. 575-582.
-
S.R. Mahadeva Prasanna, Cheedella S. Gupta, B. Yegnanarayana, Extraction of speaker-specific excitation information from linear prediction residual of speech, Speech Communication, 48(10), 2006, pp. 1243- 1261.
-
H. S. Jayanna and S. R. M. Prasanna, Fuzzy vector quantization for speaker recognition under limited data conditions, TENCON - IEEE Region 10 Conference, 2008, pp. 1-4.
-
Haipeng Wang, Xiang Zhang, Hongbin Suo, Qingwei Zhao and Y. Yan, "A novel fuzzy-based automatic speaker clustering algorithm," ISNN, 2009, pp. 639-646.
-
H. Melin and J. Lindberg, Guidelines for experiments on the polycost database, In Proceedings of a COST 250 workshop on Application of Speaker Recognition Techniques in Telephony, 1996, pp. 59- 69, Vigo, Spain.