- Open Access
- Total Downloads : 208
- Authors : Dr. G. Indumathi, V. S. Hewitt, V. Rajavel
- Paper ID : IJERTV4IS030556
- Volume & Issue : Volume 04, Issue 03 (March 2015)
- DOI : http://dx.doi.org/10.17577/IJERTV4IS030556
- Published (First Online): 21-03-2015
- ISSN (Online) : 2278-0181
- Publisher Name : IJERT
- License: This work is licensed under a Creative Commons Attribution 4.0 International License
Cepstral Domain Voice Conversion Based on Constrained Transformations
Dr. G. Indumathi,
Mepco Schlenk Engineering College,
Sivakasi, India.

V. S. Hewitt,
Mepco Schlenk Engineering College,
Sivakasi, India.

V. Rajavel,
Mepco Schlenk Engineering College,
Sivakasi, India.
Abstract: This paper proposes a method for voice conversion in the cepstral domain. The method involves two steps: bilinear frequency warping and amplitude scaling. Frequency warping moves the spectrum of the source speaker towards its image in the target speaker's spectrum. Amplitude scaling compensates for the warping inaccuracy. A bilinear function is used so that the signal can be warped without any significant decrease in quality scores. Fuzzy logic is applied to the amplitude-scaling process in order to improve perceptual quality. Despite its simplicity, this method achieves performance scores similar to previously available methods based on Gaussian mixture models.
Keywords: Voice conversion, Gaussian mixture model, bilinear function, frequency warping, fuzzy amplitude scaling.
INTRODUCTION
Voice conversion is the process of modifying the characteristics of a source speaker in such a way that the speech is perceived as the voice of a target speaker [1]. To date, voice conversion systems have mainly focused on spectral characteristics, while some also operate at the prosodic level. A voice conversion system involves two phases: a training phase and a conversion phase. During the training phase, the system learns a function to transform the source speaker's acoustic data into the target speaker's acoustic data. Usually this involves a database of speech signals from the involved speakers. By analyzing the training data, the parameters corresponding to speaker identity are extracted from both source and target speech. During the conversion phase, the learned function is used to transform any new input utterance from the source speaker, i.e., the source acoustic data are mapped to the target acoustic data. Finally, the converted speech, i.e., the source message in the voice of the target speaker, is synthesized [2]. The applications of voice conversion lie in the entertainment industry (mainly dubbed movies, gaming, and karaoke), voice masking for chat rooms, customization of speaking devices [3], etc.
At present, most methods for voice conversion use Gaussian mixture models, which provide statistical models of speech data [4]-[6]. A Gaussian mixture model is used to segregate the speech data into components and to determine the posterior probability of each component given the data. Frequency-domain transformation is preferred over time-domain transforms because the frequency transform does not remove any part of the source spectrum, thereby preserving the quality of the converted speech. In the line of work our method follows, frequency warping based on a piecewise linear frequency warping function along with an energy conversion filter was used first [7]. In [8], for instance, the lowest-distortion FW path was calculated from discretized spectra via dynamic programming, which is known as dynamic FW [9]. A further improvement was the use of amplitude scaling instead of the energy conversion filter, which resulted in improved quality scores. At the next level, bilinear frequency warping functions were adopted, which ensure computational simplicity. In our proposed method, fuzzy logic is applied to the amplitude scaling to improve the overall conversion performance [10]. The general block diagram of voice conversion is given in Fig. 1.
Fig. 1. General block diagram of voice conversion: the source and target voices are analyzed, a mapping is learned during training, and conversion followed by synthesis produces the converted speech.
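Because the warping factors and amplitude scaling vectors described later are all combined through the posterior probabilities p_k(x) of the GMM components, a minimal numpy sketch of that posterior computation may be useful; the toy weights, means and covariances in the usage below are illustrative, not taken from the paper.

```python
import numpy as np

def gmm_posteriors(x, weights, means, covs):
    """Posterior probability p_k(x) of each Gaussian component for a vector x."""
    d = x.shape[0]
    log_p = np.empty(len(weights))
    for k, (w, mu, cov) in enumerate(zip(weights, means, covs)):
        diff = x - mu
        _, logdet = np.linalg.slogdet(cov)
        maha = diff @ np.linalg.solve(cov, diff)       # Mahalanobis distance
        log_p[k] = np.log(w) - 0.5 * (d * np.log(2 * np.pi) + logdet + maha)
    log_p -= log_p.max()                               # stabilize before exponentiation
    p = np.exp(log_p)
    return p / p.sum()                                 # posteriors sum to one
```

In the training phase these posteriors weight each frame's contribution to its components; in the conversion phase they combine the per-component warping factors.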
DESCRIPTION OF THE METHOD
A. Bilinear Warping Functions
Bilinear functions are parametric frequency warping (FW) functions applied to perform vocal tract length normalization (VTLN), used in both speech recognition [11] and conversion. A bilinear function depends on one single parameter α:

H_α(z) = (z⁻¹ − α) / (1 − α·z⁻¹),  |α| < 1   (1)

Given a p-dimensional cepstral vector x, it has been proven [12]-[14] that the cepstral vector y that corresponds to the frequency-warped version of the spectrum represented by x is given by

y = W(α)·x   (2)

where the entries of the warping matrix W(α) are analytic functions of α whose full expressions are derived in [12]-[14]. The dependence between the warping matrix W(α) and the warping factor α is strongly nonlinear. However, it was observed in [14] that when α is sufficiently close to zero, the higher powers of α can be neglected and this dependence becomes linear:

W(α) ≈ [ 1     2α    0     0   ⋯
         −α    1     3α    0   ⋯
         0     −2α   1     4α  ⋯
         0     0     −3α   1   ⋯
         ⋮     ⋮     ⋮     ⋮   ⋱ ]   (3)

Then the warping operation is equivalent to

y ≈ x + α·d(x)   (4)

where d(x) is the vector whose ith element is given by

d(x)[i] = (i+1)·x[i+1] − (i−1)·x[i−1],  i = 1, …, p   (5)

B. Frequency Warping Factor
The frequency warping factor is used to determine the warping matrix used in the conversion phase. In practice the value of α_k may not be sufficiently close to zero, and then the approximation in Eq. (3) is not valid. This happens in cross-gender voice conversion, where the α value is not sufficiently small. Therefore an iterative process that yields an increasingly accurate solution in any gender-to-gender conversion is considered.

Step 1: initialize α_k = 0 for all k.

Step 2: for the current {α_k}, calculate a set of warped vectors {z_n}, z_n = W(α̃(x_n, Θ))·x_n, where the warping matrix is given by expression (2).

Step 3: calculate the differential warping factors {Δα_k} needed to make the warped source vectors {z_n} closer to the target vectors {y_n}. The differential warping vector Δα = [Δα_1 … Δα_m]ᵀ can be obtained from the distance matrix D and the error vector e, as explained in [16]-[17]:

D = [ p_1(x_1)·d(z_1)  ⋯  p_m(x_1)·d(z_1)
      ⋮                    ⋮
      p_1(x_N)·d(z_N)  ⋯  p_m(x_N)·d(z_N) ]  (size Np × m),

e = [ (y_1 − z_1)ᵀ  ⋯  (y_N − z_N)ᵀ ]ᵀ  (size Np × 1)   (6)

From the distance matrix and the error vector, the optimal value of the warping factor increment is

Δα_opt = (Dᵀ·D)⁻¹·Dᵀ·e   (7)

Step 4: accumulate {Δα_k} into the current {α_k}. According to [38], this can be done as follows:

α_k (updated) = (α_k + Δα_k) / (1 + α_k·Δα_k)   (8)

Step 5: if the updated α_k values in the previous step did not show significant change (i.e. |Δα_k| < 0.001 for all k), exit. Otherwise go to Step 2.

Using this iterative method, the conversion error between any pair of speakers is reduced after each iteration until the process converges. During the conversion phase, the precise bilinear frequency warping matrix is built from the obtained warping factors {α_k}. The block diagram of the bilinear frequency warping and fuzzy amplitude scaling process is given in Fig. 2.
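Steps 1-5 can be sketched as follows for the simplified single-component case (m = 1, so the posterior weighting disappears); the linearized matrix of Eq. (3) stands in here for the exact bilinear warping matrix of expression (2), and the function names and data shapes are illustrative.

```python
import numpy as np

def warp_matrix(alpha, p):
    """Linearized bilinear warping matrix of Eq. (3) (indices 1-based in the text)."""
    W = np.eye(p)
    for i in range(p):
        if i + 1 < p:
            W[i, i + 1] = (i + 2) * alpha    # superdiagonal: (i+1)*alpha
        if i >= 1:
            W[i, i - 1] = -i * alpha         # subdiagonal: -(i-1)*alpha
    return W

def d_op(x):
    """d(x)[i] = (i+1)x[i+1] - (i-1)x[i-1] from Eq. (5), zero outside 1..p."""
    p = len(x)
    d = np.zeros(p)
    for i in range(p):
        if i + 1 < p:
            d[i] += (i + 2) * x[i + 1]
        if i >= 1:
            d[i] -= i * x[i - 1]
    return d

def estimate_alpha(X, Y, tol=1e-3, max_iter=50):
    """Iterative warping-factor estimation (Steps 1-5) for one Gaussian component."""
    alpha = 0.0                                        # Step 1
    p = X.shape[1]
    for _ in range(max_iter):
        Z = X @ warp_matrix(alpha, p).T                # Step 2: warped source vectors
        D = np.concatenate([d_op(z) for z in Z])       # Step 3: distance matrix (m=1)
        e = np.concatenate([y - z for y, z in zip(Y, Z)])
        delta = (D @ e) / (D @ D)                      # Eq. (7), scalar least squares
        alpha = (alpha + delta) / (1 + alpha * delta)  # Step 4: Eq. (8)
        if abs(delta) < tol:                           # Step 5: convergence check
            break
    return alpha
```

With parallel cepstral matrices X (source) and Y (target) of shape N × p, the loop typically converges in a few iterations because the residual shrinks after each bilinear composition.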
Fig. 2. Block diagram of bilinear frequency warping and fuzzy amplitude scaling: the source and target speech databases are preprocessed and GMM-fitted to determine the FW factors; a new source utterance is then processed by bilinear frequency warping followed by fuzzy amplitude scaling to produce the converted speech.

C. Conversion Phase
After obtaining the frequency warping factor α_k for each Gaussian component of Θ, the conversion function is expressed as

y = W(α̃(x, Θ))·x   (9)

where the warping matrix W is given by expression (2), and α̃(x, Θ), the result of combining the basic warping factors of all the components of Θ, is given by

α̃(x, Θ) = Σ_{k=1..m} p_k(x)·α_k   (10)
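Under the same linearized-warping assumption as before, Eqs. (9)-(10) amount to a posterior-weighted average of the per-component factors followed by a single warp; the vectors in the usage below are illustrative.

```python
import numpy as np

def convert_vector(x, alphas, posteriors):
    """Eqs. (9)-(10): warp x with the combined factor, via the Eq. (4) approximation."""
    alpha = float(np.dot(posteriors, alphas))   # Eq. (10): combined factor
    p = len(x)
    y = x.astype(float).copy()
    for i in range(p):                          # y = x + alpha * d(x), Eq. (4)
        if i + 1 < p:
            y[i] += alpha * (i + 2) * x[i + 1]
        if i >= 1:
            y[i] -= alpha * i * x[i - 1]
    return y
```

When all posteriors concentrate on one component, this reduces to warping with that component's own factor, as expected from Eq. (10).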
D. Amplitude Scaling Vectors
After the warping factors {α_k} are determined, the amplitude scaling vectors are calculated in such a way that the error between the warped and target vectors is minimized, as in [2]:

ε = Σ_{n=1..N} || r_n − s̃(x_n, Θ) ||²,  r_n = y_n − W(α̃(x_n, Θ))·x_n   (11)

This means calculating the least-squares solution of the system P·S = R, where

P = [ p_1(x_1)  ⋯  p_m(x_1)
      ⋮             ⋮
      p_1(x_N)  ⋯  p_m(x_N) ]  (size N × m),

S = [ s_1 … s_m ]ᵀ  (size m × p),  R = [ r_1 … r_N ]ᵀ  (size N × p)   (12)

The solution of the system is the optimal value of the amplitude scaling vectors:

S_opt = (Pᵀ·P)⁻¹·Pᵀ·R   (13)

E. Fuzzy Amplitude Scaling
The fuzzy rule applied to the amplitude scaling process involves three steps: fuzzification, the rule base engine and defuzzification. The warped source signal and the frequency response of the target signal are the inputs to the fuzzy amplitude scaling system.

i. Fuzzification
Fuzzification refers to the process of converting a real value to a fuzzy value. The warped source signal is assumed to have three membership functions: low, medium and high amplitude ranges. The frequency response of the target signal is assumed to have three membership functions: low, medium and high frequency ranges. Trapezoidal membership functions are considered for the inputs of the fuzzy system. The fuzzy mapping function is shown in Fig. 3.

Fig. 3. The input signal amplitude is mapped as functions of degree-of-truth values. For example, an amplitude of 0.24 has degree-of-truth values of 0.22 and 0.48 in the medium and high range membership functions.

ii. Rule base engine
The rule base engine refers to the decision matrix of the fuzzy knowledge base, composed of expert IF<antecedents>THEN<conclusions> rules. It takes into account the range of amplitude values and the frequency response and decides the range of the output amplitude values [8]. All the possibilities are taken into account, and the corresponding output amplitude ranges are defined as in Table 1.

Table 1. Decision matrix
Amp \ Freq | Low    | Medium | High
Low        | Low    | High   | Low
Medium     | Medium | Low    | High
High       | Medium | Low    | Low

iii. Defuzzification
Defuzzification involves the process of transposing the fuzzy outputs to real outputs. The output amplitude values are determined by reverse-mapping the truth values of the membership functions to the corresponding amplitude values.
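The three fuzzy steps above can be sketched end to end as below. The membership breakpoints and the representative output amplitudes are assumptions for illustration; only the rule table follows the decision matrix of Table 1.

```python
def trapmf(x, a, b, c, d):
    """Trapezoidal membership: 0 outside (a, d), 1 on [b, c], linear ramps between."""
    if x <= a or x >= d:
        return 0.0
    if b <= x <= c:
        return 1.0
    return (x - a) / (b - a) if x < b else (d - x) / (d - c)

# Assumed breakpoints for normalized amplitude/frequency inputs (not from the paper).
SETS = {"low": (-0.01, 0.0, 0.2, 0.4),
        "medium": (0.2, 0.4, 0.6, 0.8),
        "high": (0.6, 0.8, 1.0, 1.01)}

# Rule base engine: decision matrix of Table 1, RULES[(amp, freq)] -> output range.
RULES = {("low", "low"): "low",       ("low", "medium"): "high",   ("low", "high"): "low",
         ("medium", "low"): "medium", ("medium", "medium"): "low", ("medium", "high"): "high",
         ("high", "low"): "medium",   ("high", "medium"): "low",   ("high", "high"): "low"}

# Assumed representative amplitude of each output range, used for defuzzification.
OUT = {"low": 0.2, "medium": 0.5, "high": 0.8}

def fuzzy_scale(amp, freq):
    """Fuzzify both inputs, fire the rules, defuzzify by weighted average."""
    num = den = 0.0
    for a_lbl, a_mf in SETS.items():
        for f_lbl, f_mf in SETS.items():
            w = min(trapmf(amp, *a_mf), trapmf(freq, *f_mf))  # rule firing strength
            if w > 0.0:
                num += w * OUT[RULES[(a_lbl, f_lbl)]]
                den += w
    return num / den if den else amp   # defuzzified output amplitude
```

The weighted-average defuzzifier is one common choice; centroid defuzzification over the output membership functions would serve equally well here.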
RESULTS AND DISCUSSION
A. Experimental Procedure
This section reports various results that expose the performance aspects of the bilinear frequency warping and fuzzy amplitude scaling method. The speech data used in the experiments were created with the Zabaware text-to-speech software, considering different source and target speakers. An equal number of training sentences is used in the training phase. The sampling frequency of the signals is 11.025 kHz. The Gaussian mixture models used in our experiments had 5 mixtures with full covariance matrices. The model parameters used in the EM algorithm are initialized so as to obtain consistent results.
B. Accuracy of the Frequency Warping Factor
At first, 25 parallel training sentences are given to the conversion system. The obtained value of the frequency warping factor is on the order of 10^-5. With values in that range it is sometimes not feasible to extract the time-domain signal from the cepstral coefficients. The system is then trained with 50 parallel sentences, and the warping factor reduces to the range of 10^-7. In this range it is possible to recover the signal, but the amplitude range gets disrupted. When the system is trained with 100 parallel sentences, the value of the warping factor falls to the range of 10^-8. With this value the time-domain signal is easily extracted and suitable for further processing.
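As background for the recoverability discussion above, the standard relation between a truncated real cepstrum and the log-magnitude spectrum can be sketched as follows; this is a textbook identity, not the paper's exact resynthesis chain, and the cepstral order and FFT length are illustrative.

```python
import numpy as np

def envelope_from_cepstrum(c, nfft=512):
    """Log-magnitude spectral envelope from a truncated real cepstrum.

    The real cepstrum is the inverse DFT of the log-magnitude spectrum, so
    placing the p+1 coefficients symmetrically in an nfft buffer and taking
    an FFT recovers a smooth log-spectral envelope."""
    p = len(c) - 1
    buf = np.zeros(nfft)
    buf[0] = c[0]
    buf[1:p + 1] = c[1:]
    buf[nfft - p:] = c[1:][::-1]        # even symmetry -> real spectrum
    return np.real(np.fft.fft(buf))     # log|X(k)| at nfft frequency bins
```

The symmetry step is what makes the resulting spectrum real; recovering the waveform itself additionally requires phase and excitation information, which is why very small warping factors can make resynthesis fragile.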
C. Accuracy of Fuzzy Amplitude Scaling
The performance of fuzzy amplitude scaling depends on the defined range and shape of the membership functions. Different shapes yield different results. The trapezoidal membership function is found to perform better than the triangular membership function.
CONCLUSION
This paper has proposed a voice conversion method based on bilinear frequency warping and fuzzy amplitude scaling. The method can be implemented in the cepstral domain. The conversion function parameters are trained to a high level of accuracy by an iterative process using a small amount of parallel data. The method is computationally efficient, and its average conversion performance is good in comparison with state-of-the-art statistical methods. The quality of the converted speech is increased without worsening the conversion performance. Subjective evaluation shows that there is a good trade-off between quality and simplicity.

REFERENCES
[1] E. Moulines and Y. Sagisaka, "Voice conversion: State of the art and perspectives," Speech Commun., Special Issue, vol. 16, no. 2, 1995.
[2] D. Erro, E. Navas, and I. Hernaez, "Parametric voice conversion based on bilinear frequency warping plus amplitude scaling," IEEE Trans. Audio, Speech, Lang. Process., vol. 21, no. 3, pp. 556-565, Mar. 2013.
[3] A. Kain and M. Macon, "Spectral voice conversion for text-to-speech synthesis," in Proc. ICASSP, 1998, vol. 1, pp. 285-288.
[4] Y. Stylianou, O. Cappe, and E. Moulines, "Continuous probabilistic transform for voice conversion," IEEE Trans. Speech Audio Process., vol. 6, no. 2, pp. 131-142, Mar. 1998.
[5] A. Kain, "High resolution voice transformation," Ph.D. dissertation, Oregon Health & Science Univ., Portland, 2001.
[6] T. Toda, A. W. Black, and K. Tokuda, "Voice conversion based on maximum-likelihood estimation of spectral parameter trajectory," IEEE Trans. Audio, Speech, Lang. Process., vol. 15, no. 8, pp. 2222-2235, Nov. 2007.
[7] D. Erro, A. Moreno, and A. Bonafonte, "Voice conversion based on weighted frequency warping," IEEE Trans. Audio, Speech, Lang. Process., vol. 18, no. 5, pp. 922-931, Jul. 2010.
[8] H. Valbret, E. Moulines, and J. P. Tubach, "Voice transformation using PSOLA techniques," Speech Commun., vol. 1, pp. 145-148, 1992.
[9] E. Godoy, O. Rosec, and T. Chonavel, "Voice conversion using dynamic frequency warping with amplitude scaling, for parallel or non-parallel corpora," IEEE Trans. Audio, Speech, Lang. Process., vol. 20, no. 4, pp. 1313-1323, May 2012.
[10] R. Alcala, J. Alcala-Fdez, M. J. Gacto, and F. Herrera, "Genetic lateral and amplitude tuning of membership functions for fuzzy systems," in Proc. Int. Conf. Machine Intelligence, Tozeur, Tunisia, 2005, pp. 589-595.
[11] P. Zhan and A. Waibel, "Vocal tract length normalization for large vocabulary continuous speech recognition," CMU Computer Science Tech. Rep., 1997.
[12] J. McDonough and W. Byrne, "Speaker adaptation with all-pass transforms," in Proc. ICASSP, 1999, pp. 757-760.
[13] M. Pitz and H. Ney, "Vocal tract normalization equals linear transformation in cepstral space," IEEE Trans. Speech Audio Process., vol. 13, no. 5, pp. 930-944, Sep. 2005.
[14] T. Emori and K. Shinoda, "Rapid vocal tract length normalization using maximum likelihood estimation," in Proc. Eurospeech, 2001, pp. 1649-1652.