- Open Access
- Total Downloads : 208
- Authors : Dr. G. Indumathi, V. S. Hewitt, V. Rajavel
- Paper ID : IJERTV4IS030556
- Volume & Issue : Volume 04, Issue 03 (March 2015)
- DOI : http://dx.doi.org/10.17577/IJERTV4IS030556
- Published (First Online): 21-03-2015
- ISSN (Online) : 2278-0181
- Publisher Name : IJERT
- License: This work is licensed under a Creative Commons Attribution 4.0 International License
Cepstral Domain Voice Conversion Based on Constrained Transformations
Dr. G. Indumathi,
Mepco Schlenk Engineering College,
Sivakasi, India.

V. S. Hewitt,
Mepco Schlenk Engineering College,
Sivakasi, India.

V. Rajavel,
Mepco Schlenk Engineering College,
Sivakasi, India.
Abstract: This paper proposes a method for voice conversion in the cepstral domain. The method involves two steps: bilinear frequency warping and amplitude scaling. Frequency warping moves the spectrum of the source speaker towards its image in the target speaker's spectrum. Amplitude scaling compensates for the warping inaccuracy. A bilinear function is used so that the signal can be warped without any significant decrease in quality scores. Fuzzy logic is applied to the amplitude-scaling process in order to improve perceptual quality. Despite its simplicity, this method achieves performance scores similar to previously available methods based on Gaussian mixture models.
Keywords: Voice conversion, Gaussian mixture model, bilinear function, frequency warping, fuzzy amplitude scaling.
INTRODUCTION
Voice conversion is the process of modifying the characteristics of a source speaker in such a way that the speech is perceived as the voice of a target speaker [1]. To date, voice conversion systems have mainly focused on spectral characteristics, while some also operate at the prosodic level. A voice conversion system involves two phases: a training phase and a conversion phase. During the training phase, the system learns a function to transform the source speaker's acoustic data into the target speaker's acoustic data. Usually this involves a database of speech signals from the involved speakers. By analyzing the training data, the parameters corresponding to speaker identity are extracted from both source and target speech. During the conversion phase, the learned function is used to transform any new input utterance from the source speaker, i.e., the source acoustic data are mapped to the target acoustic data. Finally, the converted speech, i.e., the source message in the voice of the target speaker, is synthesized [2]. The applications of voice conversion lie in the entertainment industry (mainly dubbed movies, gaming, and karaoke), voice masking for chat rooms, customization of speaking devices [3], etc.
At present, most methods for voice conversion use Gaussian mixture models, which provide statistical models of speech data [4]-[6]. A Gaussian mixture model is used to segregate the speech data into components and to determine the posterior probability of each component given the data. Frequency-domain transformation is preferred over time-domain transforms because the frequency transform does not remove any part of the source spectrum, thereby preserving the quality of the converted speech. In the line of work our method follows, frequency warping based on a piecewise linear frequency warping function along with an energy conversion filter was used first [7]. In [8], for instance, the lowest-distortion FW path was calculated from discretized spectra via dynamic programming, which is known as dynamic FW [9]. A further improvement was the use of amplitude scaling instead of the energy conversion filter, which resulted in improved quality scores. At the next level, bilinear frequency warping functions were adopted, which ensure computational simplicity. In our proposed method, fuzzy logic is applied to the amplitude scaling to improve the overall conversion performance [10]. The general block diagram of voice conversion is given in Fig. 1.
Fig. 1. General block diagram of voice conversion: the source and target voices are analyzed, a mapping is learned during training, and conversion followed by synthesis produces the converted speech.
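Because the warping factors and amplitude scaling vectors described later are all combined through the posterior probabilities p_k(x) of the GMM components, a minimal numpy sketch of that posterior computation may be useful; the toy weights, means and covariances in the usage below are illustrative, not taken from the paper.

```python
import numpy as np

def gmm_posteriors(x, weights, means, covs):
    """Posterior probability p_k(x) of each Gaussian component for a vector x."""
    d = x.shape[0]
    log_p = np.empty(len(weights))
    for k, (w, mu, cov) in enumerate(zip(weights, means, covs)):
        diff = x - mu
        _, logdet = np.linalg.slogdet(cov)
        maha = diff @ np.linalg.solve(cov, diff)       # Mahalanobis distance
        log_p[k] = np.log(w) - 0.5 * (d * np.log(2 * np.pi) + logdet + maha)
    log_p -= log_p.max()                               # stabilize before exponentiation
    p = np.exp(log_p)
    return p / p.sum()                                 # posteriors sum to one
```

In the training phase these posteriors weight each frame's contribution to its components; in the conversion phase they combine the per-component warping factors.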
DESCRIPTION OF THE METHOD
A. Bilinear Warping Functions
Bilinear functions are parametric frequency warping (FW) functions applied to perform vocal tract length normalization (VTLN), used in both speech recognition [11] and conversion. A bilinear function depends on one single parameter α:

H_α(z) = (z⁻¹ − α) / (1 − α·z⁻¹),  |α| < 1   (1)

Given a p-dimensional cepstral vector x, it has been proven [12]-[14] that the cepstral vector y that corresponds to the frequency-warped version of the spectrum represented by x is given by

y = W(α)·x   (2)

where the entries of the warping matrix W(α) are analytic functions of α whose full expressions are derived in [12]-[14]. The dependence between the warping matrix W(α) and the warping factor α is strongly nonlinear. However, it was observed in [14] that when α is sufficiently close to zero, the higher powers of α can be neglected and this dependence becomes linear:

W(α) ≈ [ 1     2α    0     0   ⋯
         −α    1     3α    0   ⋯
         0     −2α   1     4α  ⋯
         0     0     −3α   1   ⋯
         ⋮     ⋮     ⋮     ⋮   ⋱ ]   (3)

Then the warping operation is equivalent to

y ≈ x + α·d(x)   (4)

where d(x) is the vector whose ith element is given by

d(x)[i] = (i+1)·x[i+1] − (i−1)·x[i−1],  i = 1, …, p   (5)

B. Frequency Warping Factor
The frequency warping factor is used to determine the warping matrix used in the conversion phase. In practice the value of α_k may not be sufficiently close to zero, and then the approximation in Eq. (3) is not valid. This happens in cross-gender voice conversion, where the α value is not sufficiently small. Therefore an iterative process that yields an increasingly accurate solution in any gender-to-gender conversion is considered.

Step 1: initialize α_k = 0 for all k.

Step 2: for the current {α_k}, calculate a set of warped vectors {z_n}, z_n = W(α̃(x_n, Θ))·x_n, where the warping matrix is given by expression (2).

Step 3: calculate the differential warping factors {Δα_k} needed to make the warped source vectors {z_n} closer to the target vectors {y_n}. The differential warping vector Δα = [Δα_1 … Δα_m]ᵀ can be obtained from the distance matrix D and the error vector e, as explained in [16]-[17]:

D = [ p_1(x_1)·d(z_1)  ⋯  p_m(x_1)·d(z_1)
      ⋮                    ⋮
      p_1(x_N)·d(z_N)  ⋯  p_m(x_N)·d(z_N) ]  (size Np × m),

e = [ (y_1 − z_1)ᵀ  ⋯  (y_N − z_N)ᵀ ]ᵀ  (size Np × 1)   (6)

From the distance matrix and the error vector, the optimal value of the warping factor increment is

Δα_opt = (Dᵀ·D)⁻¹·Dᵀ·e   (7)

Step 4: accumulate {Δα_k} into the current {α_k}. According to [38], this can be done as follows:

α_k (updated) = (α_k + Δα_k) / (1 + α_k·Δα_k)   (8)

Step 5: if the updated α_k values in the previous step did not show significant change (i.e. |Δα_k| < 0.001 for all k), exit. Otherwise go to Step 2.

Using this iterative method, the conversion error between any pair of speakers is reduced after each iteration until the process converges. During the conversion phase, the precise bilinear frequency warping matrix is built from the obtained warping factors {α_k}. The block diagram of the bilinear frequency warping and fuzzy amplitude scaling process is given in Fig. 2.
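Steps 1-5 can be sketched as follows for the simplified single-component case (m = 1, so the posterior weighting disappears); the linearized matrix of Eq. (3) stands in here for the exact bilinear warping matrix of expression (2), and the function names and data shapes are illustrative.

```python
import numpy as np

def warp_matrix(alpha, p):
    """Linearized bilinear warping matrix of Eq. (3) (indices 1-based in the text)."""
    W = np.eye(p)
    for i in range(p):
        if i + 1 < p:
            W[i, i + 1] = (i + 2) * alpha    # superdiagonal: (i+1)*alpha
        if i >= 1:
            W[i, i - 1] = -i * alpha         # subdiagonal: -(i-1)*alpha
    return W

def d_op(x):
    """d(x)[i] = (i+1)x[i+1] - (i-1)x[i-1] from Eq. (5), zero outside 1..p."""
    p = len(x)
    d = np.zeros(p)
    for i in range(p):
        if i + 1 < p:
            d[i] += (i + 2) * x[i + 1]
        if i >= 1:
            d[i] -= i * x[i - 1]
    return d

def estimate_alpha(X, Y, tol=1e-3, max_iter=50):
    """Iterative warping-factor estimation (Steps 1-5) for one Gaussian component."""
    alpha = 0.0                                        # Step 1
    p = X.shape[1]
    for _ in range(max_iter):
        Z = X @ warp_matrix(alpha, p).T                # Step 2: warped source vectors
        D = np.concatenate([d_op(z) for z in Z])       # Step 3: distance matrix (m=1)
        e = np.concatenate([y - z for y, z in zip(Y, Z)])
        delta = (D @ e) / (D @ D)                      # Eq. (7), scalar least squares
        alpha = (alpha + delta) / (1 + alpha * delta)  # Step 4: Eq. (8)
        if abs(delta) < tol:                           # Step 5: convergence check
            break
    return alpha
```

With parallel cepstral matrices X (source) and Y (target) of shape N × p, the loop typically converges in a few iterations because the residual shrinks after each bilinear composition.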
Fig. 2. Block diagram of bilinear frequency warping and fuzzy amplitude scaling: the source and target speech databases are preprocessed and GMM-fitted to determine the FW factors; a new source utterance is then processed by bilinear frequency warping followed by fuzzy amplitude scaling to produce the converted speech.

C. Conversion Phase
After obtaining the frequency warping factor α_k for each Gaussian component of Θ, the conversion function is expressed as

y = W(α̃(x, Θ))·x   (9)

where the warping matrix W is given by expression (2), and α̃(x, Θ), the result of combining the basic warping factors of all the components of Θ, is given by

α̃(x, Θ) = Σ_{k=1..m} p_k(x)·α_k   (10)
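Under the same linearized-warping assumption as before, Eqs. (9)-(10) amount to a posterior-weighted average of the per-component factors followed by a single warp; the vectors in the usage below are illustrative.

```python
import numpy as np

def convert_vector(x, alphas, posteriors):
    """Eqs. (9)-(10): warp x with the combined factor, via the Eq. (4) approximation."""
    alpha = float(np.dot(posteriors, alphas))   # Eq. (10): combined factor
    p = len(x)
    y = x.astype(float).copy()
    for i in range(p):                          # y = x + alpha * d(x), Eq. (4)
        if i + 1 < p:
            y[i] += alpha * (i + 2) * x[i + 1]
        if i >= 1:
            y[i] -= alpha * i * x[i - 1]
    return y
```

When all posteriors concentrate on one component, this reduces to warping with that component's own factor, as expected from Eq. (10).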
D. Amplitude Scaling Vectors
After the warping factors {α_k} are determined, the amplitude scaling vectors are calculated in such a way that the error between the warped and target vectors is minimized, as in [2]:

ε = Σ_{n=1..N} || r_n − s̃(x_n, Θ) ||²,  r_n = y_n − W(α̃(x_n, Θ))·x_n   (11)

This means calculating the least-squares solution of the system P·S = R, where

P = [ p_1(x_1)  ⋯  p_m(x_1)
      ⋮             ⋮
      p_1(x_N)  ⋯  p_m(x_N) ]  (size N × m),

S = [ s_1 … s_m ]ᵀ  (size m × p),  R = [ r_1 … r_N ]ᵀ  (size N × p)   (12)

The solution of the system is the optimal value of the amplitude scaling vectors:

S_opt = (Pᵀ·P)⁻¹·Pᵀ·R   (13)

E. Fuzzy Amplitude Scaling
The fuzzy rule applied to the amplitude scaling process involves three steps: fuzzification, the rule base engine and defuzzification. The warped source signal and the frequency response of the target signal are the inputs to the fuzzy amplitude scaling system.

i. Fuzzification
Fuzzification refers to the process of converting a real value to a fuzzy value. The warped source signal is assumed to have three membership functions: low, medium and high amplitude ranges. The frequency response of the target signal is assumed to have three membership functions: low, medium and high frequency ranges. Trapezoidal membership functions are considered for the inputs of the fuzzy system. The fuzzy mapping function is shown in Fig. 3.

Fig. 3. The input signal amplitude is mapped as functions of degree-of-truth values. For example, an amplitude of 0.24 has degree-of-truth values of 0.22 and 0.48 in the medium and high range membership functions.

ii. Rule base engine
The rule base engine refers to the decision matrix of the fuzzy knowledge base, composed of expert IF<antecedents>THEN<conclusions> rules. It takes into account the range of amplitude values and the frequency response and decides the range of the output amplitude values [8]. All the possibilities are taken into account, and the corresponding output amplitude ranges are defined as in Table 1.

Table 1. Decision matrix
Amp \ Freq | Low    | Medium | High
Low        | Low    | High   | Low
Medium     | Medium | Low    | High
High       | Medium | Low    | Low

iii. Defuzzification
Defuzzification involves the process of transposing the fuzzy outputs to real outputs. The output amplitude values are determined by reverse-mapping the truth values of the membership functions to the corresponding amplitude values.
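The three fuzzy steps above can be sketched end to end as below. The membership breakpoints and the representative output amplitudes are assumptions for illustration; only the rule table follows the decision matrix of Table 1.

```python
def trapmf(x, a, b, c, d):
    """Trapezoidal membership: 0 outside (a, d), 1 on [b, c], linear ramps between."""
    if x <= a or x >= d:
        return 0.0
    if b <= x <= c:
        return 1.0
    return (x - a) / (b - a) if x < b else (d - x) / (d - c)

# Assumed breakpoints for normalized amplitude/frequency inputs (not from the paper).
SETS = {"low": (-0.01, 0.0, 0.2, 0.4),
        "medium": (0.2, 0.4, 0.6, 0.8),
        "high": (0.6, 0.8, 1.0, 1.01)}

# Rule base engine: decision matrix of Table 1, RULES[(amp, freq)] -> output range.
RULES = {("low", "low"): "low",       ("low", "medium"): "high",   ("low", "high"): "low",
         ("medium", "low"): "medium", ("medium", "medium"): "low", ("medium", "high"): "high",
         ("high", "low"): "medium",   ("high", "medium"): "low",   ("high", "high"): "low"}

# Assumed representative amplitude of each output range, used for defuzzification.
OUT = {"low": 0.2, "medium": 0.5, "high": 0.8}

def fuzzy_scale(amp, freq):
    """Fuzzify both inputs, fire the rules, defuzzify by weighted average."""
    num = den = 0.0
    for a_lbl, a_mf in SETS.items():
        for f_lbl, f_mf in SETS.items():
            w = min(trapmf(amp, *a_mf), trapmf(freq, *f_mf))  # rule firing strength
            if w > 0.0:
                num += w * OUT[RULES[(a_lbl, f_lbl)]]
                den += w
    return num / den if den else amp   # defuzzified output amplitude
```

The weighted-average defuzzifier is one common choice; centroid defuzzification over the output membership functions would serve equally well here.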
RESULTS AND DISCUSSION
A. Experimental Procedure
This section reports various results that expose the performance aspects of the bilinear frequency warping and fuzzy amplitude scaling method. The speech data used in the experiments were created with the Zabaware text-to-speech software, considering different source and target speakers. An equal number of training sentences is used in the training phase. The sampling frequency of the signals is 11.025 kHz. The Gaussian mixture models used in our experiments had 5 mixtures with full covariance matrices. The model parameters used in the EM algorithm are initialized so as to obtain consistent results.
B. Accuracy of the Frequency Warping Factor
At first, 25 parallel training sentences are given to the conversion system. The obtained value of the frequency warping factor is on the order of 10^-5. With values in that range it is sometimes not feasible to extract the time-domain signal from the cepstral coefficients. The system is then trained with 50 parallel sentences, and the warping factor reduces to the range of 10^-7. In this range it is possible to recover the signal, but the amplitude range gets disrupted. When the system is trained with 100 parallel sentences, the value of the warping factor falls to the range of 10^-8. With this value the time-domain signal is easily extracted and suitable for further processing.
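As background for the recoverability discussion above, the standard relation between a truncated real cepstrum and the log-magnitude spectrum can be sketched as follows; this is a textbook identity, not the paper's exact resynthesis chain, and the cepstral order and FFT length are illustrative.

```python
import numpy as np

def envelope_from_cepstrum(c, nfft=512):
    """Log-magnitude spectral envelope from a truncated real cepstrum.

    The real cepstrum is the inverse DFT of the log-magnitude spectrum, so
    placing the p+1 coefficients symmetrically in an nfft buffer and taking
    an FFT recovers a smooth log-spectral envelope."""
    p = len(c) - 1
    buf = np.zeros(nfft)
    buf[0] = c[0]
    buf[1:p + 1] = c[1:]
    buf[nfft - p:] = c[1:][::-1]        # even symmetry -> real spectrum
    return np.real(np.fft.fft(buf))     # log|X(k)| at nfft frequency bins
```

The symmetry step is what makes the resulting spectrum real; recovering the waveform itself additionally requires phase and excitation information, which is why very small warping factors can make resynthesis fragile.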
C. Accuracy of Fuzzy Amplitude Scaling
The performance of fuzzy amplitude scaling depends on the defined range and shape of the membership functions. Different shapes yield different results. The trapezoidal membership function is found to perform better than the triangular membership function.
CONCLUSION
This paper has proposed a voice conversion method based on bilinear frequency warping and fuzzy amplitude scaling. The method can be implemented in the cepstral domain. The conversion function parameters are trained to a high level of accuracy by an iterative process using a small amount of parallel data. The method is computationally efficient, and its average conversion performance is good in comparison with state-of-the-art statistical methods. The quality of the converted speech is increased without worsening the conversion performance. Subjective evaluation shows that there is a good trade-off between quality and simplicity.

REFERENCES
[1] E. Moulines and Y. Sagisaka, "Voice conversion: State of the art and perspectives," Speech Commun., Special Issue, vol. 16, no. 2, 1995.
[2] D. Erro, E. Navas, and I. Hernaez, "Parametric voice conversion based on bilinear frequency warping plus amplitude scaling," IEEE Trans. Audio, Speech, Lang. Process., vol. 21, no. 3, pp. 556-565, Mar. 2013.
[3] A. Kain and M. Macon, "Spectral voice conversion for text-to-speech synthesis," in Proc. ICASSP, 1998, vol. 1, pp. 285-288.
[4] Y. Stylianou, O. Cappe, and E. Moulines, "Continuous probabilistic transform for voice conversion," IEEE Trans. Speech Audio Process., vol. 6, no. 2, pp. 131-142, Mar. 1998.
[5] A. Kain, "High resolution voice transformation," Ph.D. dissertation, Oregon Health & Science Univ., Portland, 2001.
[6] T. Toda, A. W. Black, and K. Tokuda, "Voice conversion based on maximum-likelihood estimation of spectral parameter trajectory," IEEE Trans. Audio, Speech, Lang. Process., vol. 15, no. 8, pp. 2222-2235, Nov. 2007.
[7] D. Erro, A. Moreno, and A. Bonafonte, "Voice conversion based on weighted frequency warping," IEEE Trans. Audio, Speech, Lang. Process., vol. 18, no. 5, pp. 922-931, Jul. 2010.
[8] H. Valbret, E. Moulines, and J. P. Tubach, "Voice transformation using PSOLA techniques," Speech Commun., vol. 1, pp. 145-148, 1992.
[9] E. Godoy, O. Rosec, and T. Chonavel, "Voice conversion using dynamic frequency warping with amplitude scaling, for parallel or non-parallel corpora," IEEE Trans. Audio, Speech, Lang. Process., vol. 20, no. 4, pp. 1313-1323, May 2012.
[10] R. Alcala, J. Alcala-Fdez, M. J. Gacto, and F. Herrera, "Genetic lateral and amplitude tuning of membership functions for fuzzy systems," in Proc. Int. Conf. Machine Intelligence, Tozeur, Tunisia, 2005, pp. 589-595.
[11] P. Zhan and A. Waibel, "Vocal tract length normalization for large vocabulary continuous speech recognition," CMU Computer Science Tech. Rep., 1997.
[12] J. McDonough and W. Byrne, "Speaker adaptation with all-pass transforms," in Proc. ICASSP, 1999, pp. 757-760.
[13] M. Pitz and H. Ney, "Vocal tract normalization equals linear transformation in cepstral space," IEEE Trans. Speech Audio Process., vol. 13, no. 5, pp. 930-944, Sep. 2005.
[14] T. Emori and K. Shinoda, "Rapid vocal tract length normalization using maximum likelihood estimation," in Proc. Eurospeech, 2001, pp. 1649-1652.