Cepstral Domain Voice Conversion Based on Constrained Transformations

DOI : 10.17577/IJERTV4IS030556


Dr. G. Indumathi, Mepco Schlenk Engineering College, Sivakasi, India.

V. S. Hewitt, Mepco Schlenk Engineering College, Sivakasi, India.

Rajavel, Mepco Schlenk Engineering College, Sivakasi, India.

Abstract—This paper proposes a method for voice conversion in the cepstral domain. The method involves two steps: bilinear frequency warping and amplitude scaling. Frequency warping moves the spectrum of the source speaker towards its image in the target speaker's spectrum. Amplitude scaling compensates for the warping inaccuracy. A bilinear function is used so that the signal can be warped without any significant decrease in quality scores. Fuzzy logic is applied to the amplitude scaling process in order to improve the perceptual quality. Despite its simplicity, this method achieves performance scores similar to previously available methods based on Gaussian mixture models.

Keywords—Voice conversion, Gaussian mixture model, bilinear function, frequency warping, fuzzy amplitude scaling.

    1. INTRODUCTION

Voice conversion is the process of modifying the characteristics of a source speaker's speech in such a way that it is perceived as the voice of a target speaker [1]. To date, voice conversion systems have mainly focused on spectral characteristics, and some also operate at the prosodic level. A voice conversion system involves two phases: a training phase and a conversion phase. During the training phase, the system learns a function that transforms the source speaker's acoustic data into the target speaker's acoustic data. This usually requires a database of speech signals from both speakers. By analyzing the training data, the parameters corresponding to speaker identity are extracted from the source as well as the target speech. During the conversion phase, the learned function is used to transform any new utterance from the source speaker, i.e., the source acoustic data are mapped to the target acoustic data. Finally, the converted speech, i.e., the source message in the voice of the target speaker, is synthesized [2]. The applications of voice conversion lie mainly in the entertainment industry (dubbed movies, gaming, karaoke), voice masking for chat rooms, customization of speaking devices [3], etc.

At present, most voice conversion methods use Gaussian mixture models (GMMs), which provide a statistical model of the speech data [4]-[6]. The GMM segregates the speech data into components and determines the posterior probability of each component given the data. Frequency-domain transformation is preferred over time-domain transformation because it does not remove any part of the source spectrum, thereby preserving the quality of the converted speech. In the line of work our method follows, frequency warping based on a piecewise linear frequency warping function combined with an energy conversion filter was used first [7]. In [8], for instance, the lowest-distortion FW path was calculated from discretized spectra via dynamic programming, known as dynamic FW [9]. A further improvement replaced the energy conversion filter with amplitude scaling, which yielded better quality scores. At the next level, bilinear frequency warping functions were introduced, which ensure computational simplicity. In our proposed method, fuzzy logic is applied to the amplitude scaling to improve the overall conversion performance [10]. The general block diagram of voice conversion is given in Fig. 1.
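As a toy illustration of the GMM posterior computation mentioned above (a sketch, not the paper's implementation; the univariate model and all values are made up):

```python
import numpy as np

def gauss_pdf(x, mu, var):
    """Univariate Gaussian density."""
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

def component_posteriors(x, weights, means, variances):
    """Posterior probability of each GMM component given an observation x."""
    joint = weights * gauss_pdf(x, means, variances)
    return joint / joint.sum()

# two-component toy GMM
w = np.array([0.5, 0.5])
mu = np.array([-1.0, 1.0])
var = np.array([1.0, 1.0])
print(component_posteriors(0.5, w, mu, var))  # component centred at +1 dominates
```

In the actual method, the same posteriors are computed over cepstral vectors with full-covariance components.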

Fig. 1. General block diagram of voice conversion: the source and target voices are analyzed, a mapping is trained between them, and the conversion and synthesis stages produce the converted speech.

    2. DESCRIPTION OF THE METHOD

A. Bilinear Warping Functions

Bilinear functions are parametric frequency warping functions used to perform vocal tract length normalization (VTLN) in both speech recognition [11] and voice conversion. A bilinear function depends on one single parameter α:

    ψ_α(z) = (z^-1 − α) / (1 − α·z^-1),   |α| < 1     (1)

B. Frequency Warping Factor

The frequency warping factor is used to determine the warping matrix used in the conversion phase. In practice the value of α_k may not be sufficiently close to zero, and then the approximation in (3) is not valid. This happens in cross-gender voice conversion, where α is not sufficiently small. Therefore an iterative process that yields an increasingly accurate solution for any gender-to-gender conversion is considered.
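As an illustrative numerical sketch (not part of the paper), the effect of the bilinear function (1) on a given frequency can be evaluated on the unit circle; `bilinear_warp` is a hypothetical helper name:

```python
import numpy as np

def bilinear_warp(omega, alpha):
    """Warp a frequency omega (radians) through the bilinear function
    psi_alpha(z) = (z^-1 - alpha) / (1 - alpha * z^-1), |alpha| < 1."""
    z_inv = np.exp(-1j * omega)
    w = (z_inv - alpha) / (1 - alpha * z_inv)
    return np.mod(-np.angle(w), 2 * np.pi)  # warped frequency in radians

# alpha = 0 leaves the frequency axis unchanged; the sign of alpha
# controls the direction of the warp, while 0 and pi are fixed points
for a in (-0.2, 0.0, 0.2):
    print(a, bilinear_warp(np.pi / 4, a))
```

The endpoints 0 and π map to themselves for any |α| < 1, which is why the warp never discards any part of the source spectrum.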

Given a p-dimensional cepstral vector x, it has been proven [12]-[14] that the cepstral vector y corresponding to the frequency-warped version of the spectrum represented by x is given by

    y = W_α · x     (2)

where the elements of the warping matrix W_α are polynomial functions of the warping factor α [12]-[14]. The dependence between the warping matrix W_α and the warping factor α is strongly nonlinear. However, it was observed in [14] that when α is sufficiently close to zero, the higher powers of α can be neglected and this dependence becomes linear:

            | 1     2α    0     0   … |
            | −α    1     3α    0   … |
    W_α ≈   | 0     −2α   1     4α  … |     (3)
            | 0     0     −3α   1   … |

Then the warping operation is equivalent to

    y ≈ x + α·d(x)     (4)

where d(x) is the vector whose ith element is given by

    d(x)[i] = (i+1)·x[i+1] − (i−1)·x[i−1],   i = 1, …, p     (5)

The warping factors {α_k} of the m Gaussian components are estimated by the following iterative procedure.

Step 1: initialize α_k = 0 for all k.

Step 2: for the current {α_k}, calculate the set of warped vectors {z_n}, z_n = W_α(x_n,Θ) · x_n, where the warping matrix is given by expression (2).

Step 3: calculate the differential warping factors {Δα_k} needed to bring the warped source vectors {z_n} closer to the target vectors {y_n}. The differential warping vector Δα can be obtained from the distance matrix D and the error vector e, where p_k(x_n) denotes the posterior probability of the kth Gaussian component given x_n:

        | p_1(x_1)·d(z_1)   …   p_m(x_1)·d(z_1) |
    D = |       ⋮                     ⋮         |     (Np × m)
        | p_1(x_N)·d(z_N)   …   p_m(x_N)·d(z_N) |

    Δα = [Δα_1 … Δα_m]^T

    e = [(y_1 − z_1)^T … (y_N − z_N)^T]^T     (Np × 1)     (6)

From the distance matrix and error vector, the optimal value of the differential warping vector is

    Δα_opt = (D^T · D)^(−1) · D^T · e     (7)

Step 4: accumulate {Δα_k} into the current {α_k}. Since the composition of two bilinear warps is itself a bilinear warp, this can be done as follows:

    α_k(updated) = (α_k + Δα_k) / (1 + α_k·Δα_k)     (8)

Step 5: if the updated values show insignificant change (i.e., |Δα_k| < 0.001 for all k), exit. Otherwise go to Step 2.

Using this iterative method, the conversion error between any pair of speakers is reduced after each iteration until the process converges. During the conversion phase, the precise bilinear frequency warping matrix is built from the obtained warping factors {α_k}.

The block diagram of the bilinear frequency warping and fuzzy amplitude scaling process is given in Fig. 2.
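As a sanity check (a sketch, not code from the paper), the first-order operation of (4)-(5) can be implemented and compared against the truncated matrix of (3); `delta` and `warp_first_order` are hypothetical helper names:

```python
import numpy as np

def delta(x):
    """Eq. (5): d(x)[i] = (i+1)*x[i+1] - (i-1)*x[i-1] (1-based indices)."""
    p = len(x)
    d = np.zeros(p)
    for i in range(1, p + 1):                 # 1-based cepstral index
        x_next = x[i] if i < p else 0.0       # x[i+1], zero beyond order p
        x_prev = x[i - 2] if i >= 2 else 0.0  # x[i-1]; the i = 1 term vanishes
        d[i - 1] = (i + 1) * x_next - (i - 1) * x_prev
    return d

def warp_first_order(x, alpha):
    """Eq. (4): y ~ x + alpha * d(x), valid for alpha close to zero."""
    return x + alpha * delta(x)

# agrees with the truncated matrix of eq. (3) for p = 3
x, a = np.array([1.0, 0.5, 0.25]), 0.05
W = np.array([[1.0,  2 * a, 0.0],
              [-a,   1.0,   3 * a],
              [0.0, -2 * a, 1.0]])
print(np.allclose(warp_first_order(x, a), W @ x))  # True
```

The element-wise form (5) avoids building the full matrix, which is what makes the linearized update cheap inside the iteration.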

Fig. 2. Block diagram of bilinear frequency warping and fuzzy amplitude scaling: the source and target speech databases are preprocessed and fitted with GMMs in order to determine the FW factor; a new source utterance is then preprocessed, bilinearly frequency-warped and fuzzy amplitude-scaled to produce the converted speech.

C. Conversion Phase

After obtaining the frequency warping factor α_k for each Gaussian component of the model Θ, the conversion function is expressed as

    y = W_α(x,Θ) · x     (9)

where the warping matrix W is given by expression (2), and α(x,Θ), the result of combining the basic warping factors of all the components of Θ, is given by

    α(x,Θ) = Σ_{k=1..m} p_k(x) · α_k     (10)

where p_k(x) is the posterior probability of the kth component given x.
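A minimal sketch of the combination in (10), with made-up posterior and warping-factor values; `combined_alpha` is a hypothetical name:

```python
import numpy as np

def combined_alpha(posteriors, alphas):
    """Eq. (10): alpha(x, Theta) = sum_k p_k(x) * alpha_k."""
    return float(np.dot(posteriors, alphas))

# posteriors of the m = 3 GMM components for one frame (they sum to 1)
p_x = np.array([0.7, 0.2, 0.1])
# per-component warping factors obtained in training (illustrative values)
alphas = np.array([0.05, 0.10, -0.02])
print(combined_alpha(p_x, alphas))  # ~0.053
```

Because the posteriors vary smoothly from frame to frame, the combined factor interpolates between the per-component warps rather than switching abruptly.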

D. Amplitude Scaling Vectors

After the warping factors {α_k} are determined, the amplitude scaling vectors are calculated in such a way that the error between the warped and target vectors is minimized, as in [2]:

    ε = Σ_{n=1..N} || r_n − s(x_n, Θ) ||²,   r_n = y_n − W_α(x_n,Θ) · x_n     (11)

This means calculating the least-squares solution of the system P·S = R, where

        | p_1(x_1)  …  p_m(x_1) |
    P = |    ⋮            ⋮     |     (N × m)
        | p_1(x_N)  …  p_m(x_N) |

    S = [s_1 … s_m]^T     (m × p)

    R = [r_1 … r_N]^T     (N × p)     (12)

The solution of this system gives the optimal value of the amplitude scaling vectors:

    S_opt = (P^T · P)^(−1) · P^T · R     (13)

E. Fuzzy Amplitude Scaling

The fuzzy rules applied to the amplitude scaling process involve three steps: fuzzification, the rule base engine, and defuzzification. The warped source signal and the frequency response of the target signal are the inputs to the fuzzy amplitude scaling system.

i. Fuzzification

Fuzzification refers to the process of converting a real value to a fuzzy value. The warped source signal is assumed to have three membership functions: low, medium and high amplitude ranges. The frequency response of the target signal is assumed to have three membership functions: low, medium and high frequency ranges. Trapezoidal membership functions are considered for the inputs of the fuzzy system. The fuzzy mapping function is shown in Fig. 3.

Fig. 3. The input signal amplitude is mapped to degrees of truth. For example, an amplitude of 0.24 has degrees of truth of 0.22 and 0.48 in the medium and high range membership functions.

ii. Rule Base Engine

The rule base engine refers to the decision matrix of the fuzzy knowledge base, composed of expert IF <antecedents> THEN <conclusions> rules. It takes into account the range of amplitude values and the frequency response, and decides the range of the output amplitude values [8]. All the possibilities are taken into account, and the corresponding output amplitude ranges are defined as in Table 1.

Table 1. Decision matrix (rows: amplitude range; columns: frequency range)

    Amp \ Freq | Low    | Medium | High
    Low        | Low    | High   | Low
    Medium     | Medium | Low    | High
    High       | Medium | Low    | Low

iii. Defuzzification

Defuzzification involves transposing the fuzzy outputs to real outputs. The output amplitude values are determined by reverse-mapping the truth values of the membership functions to the corresponding amplitude values.
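A minimal sketch of trapezoidal fuzzification as described above; the breakpoints below are hypothetical illustrations, not the paper's actual membership ranges, so they will not reproduce the numbers in Fig. 3:

```python
def trapmf(x, a, b, c, d):
    """Trapezoidal membership: rises on [a, b], flat on [b, c], falls on [c, d]."""
    if x <= a or x >= d:
        return 0.0
    if b <= x <= c:
        return 1.0
    if x < b:
        return (x - a) / (b - a)
    return (d - x) / (d - c)

# hypothetical 'medium' and 'high' amplitude ranges
medium = (0.1, 0.3, 0.5, 0.7)
high = (0.4, 0.6, 0.9, 1.1)
amp = 0.24
print(trapmf(amp, *medium), trapmf(amp, *high))
```

Each input amplitude thus receives one degree of truth per membership function; the rule base engine then combines these with the frequency memberships to select the output range.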

    3. RESULTS AND DISCUSSIONS

      Experimental Procedure

This section reports various results that expose the performance of the bilinear frequency warping and fuzzy amplitude scaling method. The speech data used in the experiments were created with the Zabaware text-to-speech software, considering different source and target speakers. An equal number of training sentences is used in the training phase. The sampling frequency of the signals is 11.025 kHz. The Gaussian mixture models used in our experiments had 5 mixtures with full covariance matrices. The parameters of the model used in the EM algorithm are initialized so as to obtain consistent results.

      Accuracy of Frequency Warping Factor

At first, 25 parallel training sentences are given to the conversion system. The obtained frequency warping factor is on the order of 10^-5. With values in that range it is sometimes not feasible to recover the time-domain signal from the cepstral coefficients. The system is then trained with 50 parallel sentences, and the warping factor decreases to the order of 10^-7. In this range it is possible to recover the signal, but the amplitude range gets disrupted. When the system is trained with 100 parallel sentences, the warping factor falls to the order of 10^-8. With this value the time-domain signal is easily extracted and suitable for further processing.

      Accuracy of Fuzzy Amplitude scaling

The performance of fuzzy amplitude scaling depends on the defined range and shape of the membership functions. Different shapes yield different results; the trapezoidal membership function is found to perform better than the triangular membership function.

    4. CONCLUSION

This paper has proposed a voice conversion method based on bilinear frequency warping and fuzzy amplitude scaling. The method can be implemented in the cepstral domain. The conversion function parameters are trained to a high level of accuracy by an iterative process using a small amount of parallel data. The method is computationally efficient, and its average conversion performance compares well with state-of-the-art statistical methods. The quality of the converted speech is increased without worsening the conversion performance. Subjective evaluation shows that there is a good trade-off between quality and simplicity.

REFERENCES

  1. E. Moulines and Y. Sagisaka, "Voice conversion: State of the art and perspectives," Speech Commun., Special Issue, vol. 16, no. 2, 1995.

  2. D. Erro, E. Navas, and I. Hernaez, "Parametric voice conversion based on bilinear frequency warping plus amplitude scaling," IEEE Trans. Audio, Speech, Lang. Process., vol. 21, no. 3, pp. 556-565, Mar. 2013.

  3. A. Kain and M. Macon, "Spectral voice conversion for text-to-speech synthesis," in Proc. ICASSP, 1998, vol. 1, pp. 285-288.

  4. Y. Stylianou, O. Cappe, and E. Moulines, "Continuous probabilistic transform for voice conversion," IEEE Trans. Speech Audio Process., vol. 6, no. 2, pp. 131-142, Mar. 1998.

  5. A. Kain, "High resolution voice transformation," Ph.D. dissertation, Oregon Health & Science Univ., Portland, 2001.

  6. T. Toda, A. W. Black, and K. Tokuda, "Voice conversion based on maximum-likelihood estimation of spectral parameter trajectory," IEEE Trans. Audio, Speech, Lang. Process., vol. 15, no. 8, pp. 2222-2235, Nov. 2007.

  7. D. Erro, A. Moreno, and A. Bonafonte, "Voice conversion based on weighted frequency warping," IEEE Trans. Audio, Speech, Lang. Process., vol. 18, no. 5, pp. 922-931, Jul. 2010.

  8. H. Valbret, E. Moulines, and J. P. Tubach, "Voice transformation using PSOLA techniques," Speech Commun., vol. 1, pp. 145-148, 1992.

  9. E. Godoy, O. Rosec, and T. Chonavel, "Voice conversion using dynamic frequency warping with amplitude scaling, for parallel or non-parallel corpora," IEEE Trans. Audio, Speech, Lang. Process., vol. 20, no. 4, pp. 1313-1323, May 2012.

  10. R. Alcala, J. Alcala-Fdez, M. J. Gacto, and F. Herrera, "Genetic lateral and amplitude tuning of membership functions for fuzzy systems," in Proc. Int. Conf. on Machine Intelligence, Tozeur, Tunisia, 2005, pp. 589-595.

  11. P. Zhan and A. Waibel, "Vocal tract length normalization for large vocabulary continuous speech recognition," CMU Computer Science Tech. Rep., 1997.

  12. J. McDonough and W. Byrne, "Speaker adaptation with all-pass transforms," in Proc. ICASSP, 1999, pp. 757-760.

  13. M. Pitz and H. Ney, "Vocal tract normalization equals linear transformation in cepstral space," IEEE Trans. Speech Audio Process., vol. 13, no. 5, pp. 930-944, Sep. 2005.

  14. T. Emori and K. Shinoda, "Rapid vocal tract length normalization using maximum likelihood estimation," in Proc. Eurospeech, 2001, pp. 1649-1652.
