Signal Separation for Robust Speech Recognition Based on Phase Difference Information Obtained in the Frequency Domain

DOI : 10.17577/IJERTV1IS7369


Bhanu Prakesh Reddy@, Dr. S. A. K. Jilani, Ph.D.#, S. Nanda Kishor$

@PG Student, Department of ECE, Madanapalli Institute of Technology and Science, Madanapalli

#Professor, Department of ECE, Madanapalli Institute of Technology and Science, Madanapalli

$Associate Professor, S.S.N Engineering College, Ongole

    Abstract

In this paper, we present a new two-microphone approach that improves speech recognition accuracy when speech is masked by other speech. The algorithm improves on previous systems that have been successful in separating signals based on differences in arrival time of signal components from two microphones. The present algorithm differs from these efforts in that the signal selection takes place in the frequency domain. We observe that additional smoothing of the phase estimates over time and frequency is needed to support adequate speech recognition performance. We demonstrate that the algorithm described in this paper provides better recognition accuracy than time-domain-based signal separation algorithms, and at less than 10 percent of the computation cost.

Index Terms: robust speech recognition, signal separation, time delay analysis, phase difference analysis

Speech recognition systems have significantly improved in the past decades, but noise robustness and computational complexity remain critical issues. A number of algorithms have shown improvements for stationary noise (e.g. [1, 2]).

Nevertheless, improvement in non-stationary noise remains a difficult issue (e.g. [3]). In these environments, auditory processing [4] and missing-feature-based approaches [5] are promising. An alternative approach is signal separation based on analysis of differences in arrival time (e.g. [6, 7, 8]). It is well documented that the human binaural system has a remarkable ability to separate speech (e.g. [8]). Many models have been developed that describe various binaural phenomena (e.g. [9, 10]), typically based on interaural time difference (ITD), interaural phase difference (IPD), interaural intensity difference (IID), or changes of interaural correlation.

The Zero Crossing Amplitude Estimation (ZCAE) algorithm was recently introduced by Park and Stern [7]; it is similar in some respects to work by Srinivasan et al. [6]. These algorithms (and similar ones by other researchers) typically analyze incoming speech in bandpass channels and attempt to identify the subset of time-frequency components for which the ITD is close to the nominal ITD of the desired sound source (which is presumed to be known a priori). The signal to be recognized is reconstructed from only this subset of good time-frequency components. This selection of good components is frequently treated in the computational auditory scene analysis (CASA) literature as a multiplication of all components by a binary mask that is nonzero only for the desired signal components. Although ZCAE provides impressive performance even at low SNRs, it is very computationally intensive, which makes it unsuitable for hand-held devices.

The goals of this work are twofold. First, we would like to obtain improvements in word error rate (WER) for speech recognition systems that operate in real-world environments that include noise and reverberation. Second, we would like to develop a computationally efficient algorithm that can run in real time on embedded systems. In the present ZCAE algorithm much of the computation is taken up by the bandpass filtering operations. We found that the computational cost could be significantly reduced by estimating the ITD through examination of the phase difference between the two sensors in the frequency domain. We describe in the sections below how the binary mask is obtained using frequency-domain information. We also discuss the duration and shape of the analysis windows, which can contribute to further improvements in WER.

The rest of the paper is organized as follows: Sec. 1 describes our algorithm at a general level, and Sec. 2 presents our time-frequency weighting scheme. Experimental results are discussed in Sec. 3, and we summarize our work in Sec. 4.

    1. Phase-difference-based binary time-frequency mask estimation

Our work on signal separation is motivated by binaural speech processing. Sound sources are localized and separated by the human binaural system primarily through the use of ITD information at low frequencies and IID information at higher frequencies, with the crossover point between these two mechanisms determined by the physical distance between the two ears and the need to avoid spatial aliasing (which would occur when the path difference between the two sensors exceeds half a wavelength). In our work we focus on the use of ITD cues and avoid spatial aliasing by placing the two microphones closer together than the spacing of the human ears. When multiple sound sources are present, it is generally assumed that humans attend to the desired signal by attending only to information at the ITD corresponding to the desired sound source.
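As a rough numerical illustration of this spatial-aliasing constraint (our own addition, not from the original text; the 4-cm spacing is taken from the experimental setup in Sec. 3 and the speed of sound is an assumed value), the following Python sketch computes the highest frequency at which the interaural phase difference remains unambiguous for a given microphone spacing.

import math

SPEED_OF_SOUND = 343.0  # m/s, assumed value at room temperature

def max_unambiguous_frequency(mic_spacing_m):
    # Spatial aliasing sets in once the maximum path difference between the
    # microphones exceeds half a wavelength, i.e. above f = c / (2 * d).
    return SPEED_OF_SOUND / (2.0 * mic_spacing_m)

# A head-sized spacing of ~20 cm aliases above roughly 860 Hz, while the
# 4-cm spacing used later in this paper stays unambiguous up to ~4.3 kHz.
for spacing in (0.20, 0.04):
    print("spacing %.0f cm -> f_max %.0f Hz"
          % (spacing * 100, max_unambiguous_frequency(spacing)))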

Our processing approach, which we refer to as Phase Difference Channel Weighting (PDCW), crudely emulates human binaural processing, and is summarized in Fig. 1. Briefly, the system first performs a short-time Fourier transform (STFT) which decomposes the two input signals in time and in frequency. ITD is estimated indirectly by comparing the phase information from the two microphones at each frequency, and a time-frequency mask is identified that selects the components whose ITDs are close to the ITD of the target speaker. A set of channels is developed by weighting this subset of time-frequency components using a series of gammatone functions, and the time-domain signal is obtained by the overlap-add method. As noted above, the principal novel feature of this paper is the use of interaural phase information in the frequency domain, rather than ITD, IPD, or IID information in the time domain, to obtain the binary mask.
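The Python/NumPy sketch below outlines the PDCW processing chain as we read the description above; it is only a schematic, not the authors' implementation. The SciPy STFT helpers, the 16-kHz sampling rate, and the two callback functions (placeholders for the mask estimation and smoothing steps sketched in later sections) are our own assumptions rather than details given in the paper.

import numpy as np
from scipy.signal import stft, istft

FS = 16000    # assumed sampling rate (not stated explicitly in the text)
WIN = 1200    # roughly 75-ms analysis window
HOP = 600     # 37.5 ms between successive frames
NFFT = 2048   # FFT size used for 75-ms windows

def pdcw(x_left, x_right, estimate_mask, gammatone_smooth):
    # 1. Short-time Fourier transform of both microphone signals.
    _, _, XL = stft(x_left, FS, window='hamming', nperseg=WIN,
                    noverlap=WIN - HOP, nfft=NFFT)
    _, _, XR = stft(x_right, FS, window='hamming', nperseg=WIN,
                    noverlap=WIN - HOP, nfft=NFFT)

    # 2. Estimate the per-bin ITD from the interaural phase difference and
    #    build a binary mask keeping bins whose ITD is close to the
    #    target's (assumed to be zero).
    mask = estimate_mask(XL, XR)

    # 3. Smooth the mask over frequency with gammatone channel weighting.
    weight = gammatone_smooth(mask)

    # 4. Apply the weighting to the averaged spectrogram of the two
    #    channels and resynthesize the time-domain signal by overlap-add.
    X_avg = 0.5 * (XL + XR)
    _, x_hat = istft(weight * X_avg, FS, window='hamming', nperseg=WIN,
                     noverlap=WIN - HOP, nfft=NFFT)
    return x_hat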

Consider the two signals that are input to the system, which we refer to as xL[n] and xR[n]. We assume that the location of the desired target signal is known a priori and, without loss of generality, we assume its ITD to be equal to zero. For mathematical convenience, we refer to the number of interfering sources as L, with τl denoting their respective ITDs. Note that both L and the τl are unknown. With the above formulation, the signals at the microphones are given by (1), with x0[n] representing the target signal, xl (l ≠ 0) representing the interfering signals, and xL and xR representing the signals at the left and right microphones, respectively.

The corresponding short-time Fourier transforms XL(k, m) and XR(k, m) are computed using a finite-duration Hamming window w[n], where k indicates one of N frequency bins, with the positive frequency samples corresponding to ωk = 2πk/N for 0 ≤ k ≤ N/2 − 1. In our work N equals 512 for 26.5-ms windows and 2048 for 75-ms windows. Note that even though (1) indicates that the signals at the microphones are identical except for a time delay, it is more appropriate to consider the time delays associated with each frequency component of the signal. Correspondingly, we replace the frequency-independent ITD parameter of (1) by the frequency-dependent ITD parameter d(k, m) in (4).

Next, we assume that a specific time-frequency bin (k0, m0) is dominated by a single sound source l. This leads to a simple binary decision concerning whether the bin (k0, m0) belongs to the target speaker or not: the frequency-dependent ITD d(k0, m0) is estimated from the interaural phase difference between the two microphone signals at that bin, and only time-frequency bins for which |d(k0, m0)| is smaller than a threshold are presumed to belong to the target speaker. The remaining bins are attenuated by a small floor constant, for which we presently use a value of 0.01. The resulting mask µ(k, m) in (8) is applied to X(k, m), the averaged signal spectrogram from the two channels, and speech is reconstructed from the masked spectrogram using the overlap-add method.
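Because Eqs. (4)-(8) themselves do not survive in this text-only version, the following Python/NumPy sketch is only our hedged reconstruction of the steps just described: the interaural phase difference at each positive-frequency bin ωk = 2πk/N is converted to a frequency-dependent delay d(k, m) in samples, bins with small |d| are kept, and the remaining bins are scaled by the 0.01 floor constant. The threshold value shown is a placeholder of our own, not a figure from the paper.

import numpy as np

FLOOR = 0.01  # floor constant mentioned in the text

def estimate_itd(XL, XR, nfft):
    """Frequency-dependent ITD d(k, m) in samples.

    XL, XR: complex STFTs of shape (num_bins, num_frames) holding the
    positive-frequency bins."""
    num_bins = XL.shape[0]
    omega_k = 2.0 * np.pi * np.arange(num_bins) / nfft   # rad per sample

    # Principal-value interaural phase difference, in (-pi, pi].
    phase_diff = np.angle(XR * np.conj(XL))

    d = np.zeros_like(phase_diff)
    nz = omega_k > 0                                     # skip the DC bin
    d[nz, :] = phase_diff[nz, :] / omega_k[nz, None]
    return d

def phase_difference_mask(XL, XR, nfft, threshold_samples=1.0):
    # Binary mask: 1 where the bin appears to come from the target
    # (ITD close to zero), FLOOR elsewhere (our reading of Eq. (8)).
    d = estimate_itd(XL, XR, nfft)
    return np.where(np.abs(d) < threshold_samples, 1.0, FLOOR)

def enhance(XL, XR, nfft):
    # The mask weights the averaged spectrogram of the two channels; the
    # time-domain signal is then recovered by overlap-add (e.g. istft).
    mask = phase_difference_mask(XL, XR, nfft)
    return mask * 0.5 * (XL + XR)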

    2. Smoothed phase-difference-based binary mask estimation

While the basic procedure described in Sec. 1 produces signals that are audibly separated, the phase estimates are generally too noisy to provide useful speech recognition accuracy. In this section we describe the implementation of two methods that smooth the estimates over frequency and time.

        1. Gammatone channel weighting

As noted above, the estimates produced by Eq. (8) are generally noisy and must be smoothed. To achieve smoothing along frequency, we use a gammatone weighting that functions in a fashion similar to that of the familiar triangular weighting in MFCC features. Specifically, we obtain the gammatone channel weighting coefficients w(i, m) according to Eq. (11), where µ(k, m) is the original binary mask obtained using (8). With this weighting we effectively map the ITD for each of the 256 original frequencies to an ITD for one of what we refer to as I = 40 channels. Each of these channels is associated with Hi, the frequency response of one of a set of gammatone filters with center frequencies distributed according to the Equivalent Rectangular Bandwidth (ERB) scale [11]. The final spectrum weighting is obtained from the gammatone mask µg(k, m) according to (12).

Figure 2: Sample spectrograms illustrating the effects of PDCW processing: (a) original clean speech, (b) noise-corrupted speech, (c) reconstructed (enhanced) speech, (d) the time-frequency mask obtained with (8), (e) the gammatone channel weighting obtained from the time-frequency mask using (11), (f) the final frequency weighting shown in (12), and (g) the enhanced speech spectrogram produced by the entire PDCW algorithm.
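Equations (11) and (12) are likewise not reproduced above, so the sketch below is only a plausible rendering of the smoothing step: the binary mask is collapsed onto I = 40 gammatone channels to give the channel weights w(i, m), and those weights are then spread back over the original frequency bins to form the smoothed weighting µg(k, m). The gammatone magnitude responses Hi are assumed to be precomputed on the ERB scale, and the particular normalisation used here is our own choice rather than the paper's.

import numpy as np

def gammatone_channel_weighting(mask, H):
    """mask: binary time-frequency mask, shape (num_bins, num_frames).
    H: magnitude responses of I gammatone filters, shape (I, num_bins).
    Returns the smoothed weighting mu_g of shape (num_bins, num_frames)."""
    eps = 1e-10

    # Channel weights w(i, m): the fraction of each gammatone channel's
    # response that falls on bins accepted by the binary mask.
    w = (H @ mask) / np.maximum(H.sum(axis=1, keepdims=True), eps)

    # Spread the channel weights back over frequency, normalising by the
    # total filter response at each bin, to obtain mu_g(k, m).
    mu_g = (H.T @ w) / np.maximum(H.sum(axis=0)[:, None], eps)
    return mu_g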

        2. The effect of the window length

In conventional speech coding and speech recognition systems, we generally use a length of approximately 20 to 30 ms for the Hamming window w[n] in order to capture effectively the temporal fluctuations of speech signals. Nevertheless, longer observation durations are usually better for estimating environmental parameters. Using the procedures described below in Sec. 3, we examined the effect of window length on recognition accuracy. These results, obtained with PDCW as described in Secs. 1 and 2.1, are summarized in Fig. 3, which indicates that the best performance is achieved with a window length of about 75 ms. In the experiments described below we therefore use Hamming windows of duration 75 ms with 37.5 ms between successive frames.

Figure 3: The dependence of word recognition accuracy (100% − WER) on the window length, using an SIR of 10 dB and various reverberation times. The filled symbols at 0 ms represent baseline results obtained with a single microphone.
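As a quick check of the framing parameters quoted above (the 16-kHz sampling rate is our assumption; it makes 26.5-ms and 75-ms windows come out at 424 and 1200 samples, padded to FFT sizes of 512 and 2048), the following snippet relates window duration to FFT size.

FS = 16000  # assumed sampling rate

for win_ms, hop_ms in ((26.5, 13.25), (75.0, 37.5)):
    win = int(round(FS * win_ms / 1000.0))   # samples per Hamming window
    hop = int(round(FS * hop_ms / 1000.0))   # samples between frames (assumed 50% overlap)
    nfft = 1 << (win - 1).bit_length()       # next power of two for the FFT
    print("%.1f ms window -> %d samples, hop %d, FFT size %d"
          % (win_ms, win, hop, nfft))
# 26.5 ms -> 424 samples, FFT size 512; 75.0 ms -> 1200 samples, FFT size 2048.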

    3. Experimental Results

In this section, we present experimental results for two different environmental conditions. In the first condition, we simulate different reverberant environments in which the target is masked by an interfering speaker. We used the Room Impulse Response (RIR) software [12] to simulate the effects of room reverberation. We assumed a room of dimensions 5 × 4 × 3 m, a distance between the microphones and the speaker of 2 m, and microphones located at the center of the room. We assumed that the target source is located along the perpendicular bisector of the line between the two microphones, and that the masker is 45 degrees to one side. The target and noise signals are digitally added after simulating the reverberation effects. The two microphones are placed 4 cm apart from one another. We used sphinx_fe, included in Sphinxbase 0.4.1, for speech feature extraction, SphinxTrain 1.0 for speech recognition training, and Sphinx 3.8 for decoding, all of which are readily available in open-source form. We used subsets of 1600 utterances and 600 utterances from the DARPA Resource Management (RM1) database for training and testing, respectively.
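For context on the simulated geometry (microphones 4 cm apart, masker 45 degrees off the bisector), the hedged calculation below estimates the interfering speaker's ITD under a far-field approximation; the speed of sound and the 16-kHz sampling rate are our own assumptions, not values stated in the paper.

import math

SPEED_OF_SOUND = 343.0   # m/s, assumed
FS = 16000               # Hz, assumed sampling rate
MIC_SPACING = 0.04       # m, 4-cm spacing from the setup above

def masker_itd_seconds(angle_deg):
    # Far-field approximation: path difference = d * sin(theta).
    return MIC_SPACING * math.sin(math.radians(angle_deg)) / SPEED_OF_SOUND

tau = masker_itd_seconds(45.0)
# Roughly 82 microseconds, i.e. about 1.3 samples at 16 kHz, while the
# target on the perpendicular bisector has an ITD of zero (Sec. 1).
print("masker ITD: %.1f us (%.2f samples)" % (tau * 1e6, tau * FS))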

Fig. 4 compares word recognition accuracy for several of the algorithms discussed in this paper. ZCAE refers to the time-domain algorithm described in [7] with binary masking, as the better-performing continuous masking does not work in environments with reverberation or more than one masking source. PD refers to the algorithm described in Secs. 1 and 2 of this paper with the 75-ms analysis window but without the gammatone frequency weighting, and PDCW refers to the complete algorithm including the gammatone channel weighting (CW) described in Sec. 2.1 with the 75-ms analysis window. To show the effect of the window length, we also present PD results obtained with the conventional 25-ms analysis window. As can be seen, PDCW (and to a lesser extent PD) provides lower WER than ZCAE, and the superiority of PDCW over ZCAE increases as the amount of reverberation increases.

In our second set of experiments, the distance between the two microphones remains the same, but we added noise recorded in real environments with real two-microphone hardware in locations such as a public market, a food court, a city street, and a bus stop with background speech.

      Fig. 4(d) illustrates these experimental results. Again we observe that PDCW (and to a lesser extent PD) provides much better performance than ZCAE for all conditions.

We also profiled the run times of C implementations of the PDCW and ZCAE algorithms on two machines. PDCW ran in only 9.03% of the time required to run ZCAE on an 8-CPU Xeon E5450 3-GHz system, and in only 9.68% of the time required to run ZCAE on an embedded system with an ARM11 667-MHz processor using a vector floating point unit. The major reason for the speedup is that in ZCAE the signal must be passed through a bank of 40 filters, while PDCW requires only two FFTs and one IFFT for each frame. A MATLAB version of PDCW with sample audio files is available at http://www.cs.cmu.edu/robust/archive/algorithms/PDCW_IS2009. The code in this directory was used to obtain the results described in this paper.

Figure 4: Speech recognition accuracy using different algorithms (a) in the presence of an interfering speech source as a function of SNR in the absence of reverberation, (b, c) in the presence of reverberation and speech interference, as indicated, and (d) in the presence of natural real-world noise.
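A back-of-the-envelope comparison of the per-frame arithmetic, under our own simplifying assumptions (FFT cost approximated as 5·N·log2 N operations, the filter bank as a 40-channel FIR filter of an assumed length applied to both channels), illustrates why the FFT-based PDCW is so much cheaper than time-domain filtering; the measured 9-10% run times above of course include all remaining overhead.

import math

NFFT = 2048        # FFT size used by PDCW for 75-ms windows
FRAME = 1200       # samples per 75-ms frame at an assumed 16 kHz
NUM_FILTERS = 40   # bandpass channels in the time-domain filter bank
FIR_TAPS = 512     # assumed filter length (not given in the text)

# PDCW: two forward FFTs and one inverse FFT per frame, plus cheap masking.
pdcw_ops = 3 * 5 * NFFT * math.log2(NFFT)

# Filter-bank processing: every sample of both channels through 40 filters.
filterbank_ops = 2 * FRAME * NUM_FILTERS * FIR_TAPS

print("PDCW ~%.2f M ops/frame, filter bank ~%.1f M ops/frame"
      % (pdcw_ops / 1e6, filterbank_ops / 1e6))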

    4. Conclusions

In this work, we present a speech separation algorithm, PDCW, based on ITD that is inferred from phase information in the frequency domain. The algorithm uses gammatone channel weighting and longer analysis windows. This algorithm is quite computationally efficient and shows significant improvement in recognition accuracy under practical environmental conditions of noise and reverberation.

    5. Acknowledgements

This study was supported by NSF Grant IIS-0420866. The authors are grateful to David Huggins-Daines, Hyung-Joon Lim, and Umpei Kurokawa for helpful discussions and for providing the noise data.

    6. References

[1] R. Singh, R. M. Stern, and B. Raj, "Signal and feature compensation methods for robust speech recognition," in Noise Reduction in Speech Applications, G. M. Davis, Ed. CRC Press, 2002, pp. 219-244.

[2] R. Singh, B. Raj, and R. M. Stern, "Model compensation and matched condition methods for robust speech recognition," in Noise Reduction in Speech Applications, G. M. Davis, Ed. CRC Press, 2002, pp. 245-275.

[3] B. Raj, V. N. Parikh, and R. M. Stern, "The effects of background music on speech recognition accuracy," in Proc. IEEE Int. Conf. Acoust., Speech and Signal Processing, vol. 2, Apr. 1997, pp. 851-854.

[4] C. Kim, Y.-H. Chiu, and R. M. Stern, "Physiologically motivated synchrony-based processing for robust automatic speech recognition," in INTERSPEECH-2006, Sept. 2006, pp. 1975-1978.

[5] B. Raj and R. M. Stern, "Missing-feature methods for robust automatic speech recognition," Speech Communication, vol. 22, no. 5, pp. 101-116, Sept. 2005.

[6] S. Srinivasan, M. Roman, and D. Wang, "Binary and ratio time-frequency masks for robust speech recognition," Speech Communication, vol. 48, pp. 1486-1501, 2006.

[7] H. Park and R. M. Stern, "Spatial separation of speech signals using amplitude estimation based on interaural comparisons of zero crossings," Speech Communication, vol. 51, no. 1, pp. 15-25, Jan. 2009.

[8] R. M. Stern, E. Gouvea, C. Kim, K. Kumar, and H. Park, "Binaural and multiple-microphone signal processing motivated by auditory perception," in Hands-Free Speech Communication and Microphone Arrays, May 2008.

[9] R. M. Stern and C. Trahiotis, "Models of binaural interaction," in Hearing, B. C. J. Moore, Ed. Academic Press, 2002, pp. 347-386.

[10] H. S. Colburn and A. Kulkarni, "Models of sound localization," in Sound Source Localization, A. N. Popper and R. R. Fay, Eds.

[11] B. C. J. Moore and B. R. Glasberg, "A revision of Zwicker's loudness model," Acustica – Acta Acustica, vol. 82.

[12] S. G. McGovern, "A model for room acoustics," http://2pi.us/rir.html.
