Automatic Speech Recognition for Telephone Voice Dialling in Yoruba

DOI: 10.17577/IJERTV1IS4039


T.S. Ibiyemi, Dept. of Electrical & Electronics Engineering, University of Ilorin, Ilorin, Nigeria

A.G. Akintola, Dept. of Computer Science, University of Ilorin, Ilorin, Nigeria

Abstract

Human-computer interaction is still largely mediated by electromechanical devices. These interaction media are not user-friendly and can be risky, even life-threatening, in situations such as dialling a phone while driving. This paper presents our work on telephone auto-dialling in the Yorùbá language. The speech recognition algorithm was coded in C and run on a 2.6 GHz Pentium dual-core PC with 2 GB RAM, with a GSM set and a multimedia headset attached to the PC. The experiments yielded a 94% speaker recognition rate and an 82% phone sentence recognition rate.

  1. Introduction

Human-computer interaction, HCI, is largely by electromechanical devices such as the keyboard, mouse, joystick, printer, and monitor. These interaction media are not user-friendly and can be risky, even life-threatening, in some applications such as dialling a phone while driving. A natural and better human-machine interaction that eliminates this fatal risk of driving while phoning is voice dialling of the phone. However, telephone voice dialling by anybody who has access to the phone is not good enough; hence, there is a need to authenticate authorised users of the phone. In order to extend this technology to the grass roots, this speaker authentication prior to telephone voice auto-dialling is implemented in the Yorùbá language.

Yorùbá is one of the three dominant local languages spoken in Nigeria, by about 22 million people. It is a tonal language, that is, the tone with which a Yorùbá word is pronounced determines the meaning of that word. This is unlike non-tonal languages, where the spelling of a word suffices to infer its meaning. The problem is further compounded by the fact that Yorùbá is full of homographic words, words with the same spelling but different meanings depending on their pronunciation tones. Yorùbá has a 25-letter alphabet (a, b, d, e, ẹ, f, g, gb, h, i, j, k, l, m, n, o, ọ, p, r, s, ṣ, t, u, w, y); 7 of the letters are vowels (a, e, ẹ, i, o, ọ, u) and the remaining 18 are consonants (b, d, f, g, gb, h, j, k, l, m, n, p, r, s, ṣ, t, w, y). There are three tone levels, namely the low tone, mid tone, and high tone. The low and high tones are represented by the grave accent symbol (`) and the acute accent symbol (´) respectively; the mid tone has no accent symbol. An accent symbol, where required, is only placed on the vowel of a Yorùbá syllable. The ten Yorùbá numerals are (òfo, ení, èjì, ẹ̀ta, ẹ̀rin, àrún, ẹ̀fà, èje, ẹ̀jọ, ẹ̀sán), the equivalents of the ten digits (0, 1, 2, 3, 4, 5, 6, 7, 8, 9) respectively. Mobile telephone numbers in Nigeria consist of 11 digits drawn from these numerals, with the first digit always 0 for calls within the country.

Man-machine interfacing by speech is a new paradigm that is natural and most user-friendly. This paradigm shift is made possible by automatic speech recognition, ASR, systems.

  2. Automatic Speech Recognition System

Fig. 1 shows the basic structure of an automatic speech recognition system [1, 2]. The air pressure variation caused by speech is transformed into an electrical analogue signal by a microphone. The analogue output of the microphone is then fed into an analogue-to-digital converter for conversion into digital samples. These speech samples are pre-processed to put them in a form suitable for easy extraction of the discriminating characteristics inherent in the samples, represented as feature vectors. There are two phases to an ASR system, namely the training phase and the query or operational phase. During the training phase, the feature vectors that characterise each word or/and speaker are stored as that word's or speaker's reference template in a database.

After the training phase, the system is ready to be deployed for speech recognition. The word to be recognised passes through the microphone, pre-processing, and feature extraction units as in the training session. However, the extracted feature vectors are this time not stored but compared with each word's feature vectors in the reference template database. The output of the comparison is compared against a threshold to decide whether the word is recognised or not. The different words used in the training phase constitute the vocabulary of words that can be recognised.

  3. Hardware Model

The hardware model of this ASR system consists of a PC with an in-built sound card, a microphone, and an RS-232 GSM set. A speaker makes an utterance in Yorùbá, which is captured and transduced from a sound wave to an electrical analogue signal by the microphone. The output of the microphone is fed into the in-built sound card on the PC for conversion into digital form. During training, the digitised speech data are stored for offline processing. The software used for driving the sound card allows some flexibility in configuring the bit rate, the number of audio channels, and the number of bits per sample. During the operational phase, however, speech data are captured online, that is, no intermediate storage of speech data is required.

  4. Software Model

The software model of the automatic speech recognition system consists of a series of algorithms for realising the speech recognition, as described next [3,4,5,6].

    1. Pre-processing

      The pre-processing implementation consists of the following steps:

      1. Voice Activity Detection, VAD

The VAD, also known as word endpoint detection, removes the inter-word silence periods. The energy and zero crossing rate, ZCR, of a word are used in conjunction with thresholds to segment the active region of the word from the silent background. The VAD algorithm is defined as follows. Let $s_i$, $i = 1,2,\ldots,N$ be a frame of speech samples, where $N$ is the number of samples per frame.

Frame Energy:
$$E = \frac{1}{N}\sum_{i=1}^{N} s_i^2 \qquad (1)$$

Frame Zero Crossing Rate:
$$z = \frac{1}{N}\sum_{j=2}^{N} \frac{\lvert \operatorname{sgn}(s_j) - \operatorname{sgn}(s_{j-1}) \rvert}{2}, \qquad \operatorname{sgn}(s) = \begin{cases} +1, & s \ge 0 \\ -1, & s < 0 \end{cases} \qquad (2)$$

Energy Upper Threshold:
$$T_u = \frac{1}{L}\sum_{i=1}^{L} E_i \qquad (3)$$
where $E_i$ is the $i$-th frame energy and $L$ is the number of frames assumed to be noise, 10 in our case.

Energy Lower Threshold:
$$T_l = 0.25\,T_u \qquad (4)$$

Zero Crossing Rate Threshold:
$$T_z = \frac{1}{L}\sum_{i=1}^{L} z_i \qquad (5)$$
where $z_i$ is the $i$-th frame ZCR and $L$ is the number of frames assumed to be noise, 10 in our case.
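As a concrete illustration of eqns (1)-(5), the C sketch below computes per-frame energy and ZCR and derives the three thresholds from the first ten frames. The function names, the frame size, and the activity decision rule at the end are illustrative assumptions of this sketch; the paper does not list its exact code.

/* Sketch of the VAD measurements of eqns (1)-(5).
 * Names and the main() driver are illustrative assumptions. */
#include <stdio.h>
#include <stdlib.h>

#define NOISE_FRAMES 10   /* L: leading frames assumed to be noise */

/* eqn (1): mean squared amplitude of one frame */
static double frame_energy(const double *s, int n)
{
    double e = 0.0;
    for (int i = 0; i < n; i++) e += s[i] * s[i];
    return e / n;
}

/* eqn (2): average number of sign changes per sample */
static double frame_zcr(const double *s, int n)
{
    double z = 0.0;
    for (int j = 1; j < n; j++) {
        int a = (s[j]     >= 0.0) ? 1 : -1;
        int b = (s[j - 1] >= 0.0) ? 1 : -1;
        z += abs(a - b) / 2.0;
    }
    return z / n;
}

int main(void)
{
    enum { FRAME = 256, NFRAMES = 40 };
    static double speech[NFRAMES][FRAME];   /* would hold real samples */

    /* eqns (3)-(5): thresholds from the first NOISE_FRAMES frames */
    double tu = 0.0, tz = 0.0;
    for (int i = 0; i < NOISE_FRAMES; i++) {
        tu += frame_energy(speech[i], FRAME);
        tz += frame_zcr(speech[i], FRAME);
    }
    tu /= NOISE_FRAMES;                 /* energy upper threshold, eqn (3) */
    double tl = 0.25 * tu;              /* energy lower threshold, eqn (4) */
    tz /= NOISE_FRAMES;                 /* ZCR threshold, eqn (5)          */

    /* One plausible decision rule: a frame is speech when its energy rises
       above the upper threshold, or lies above the lower threshold while
       its ZCR stays below the noise ZCR threshold. */
    for (int i = 0; i < NFRAMES; i++) {
        double e = frame_energy(speech[i], FRAME);
        double z = frame_zcr(speech[i], FRAME);
        int active = (e > tu) || (e > tl && z < tz);
        printf("frame %2d: E=%.4f z=%.4f %s\n", i, e, z,
               active ? "speech" : "silence");
    }
    return 0;
}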

      2. Pre-Emphasis Filter

The high-pass FIR filter implemented has the transfer function
$$H(z) = 1 - a z^{-1}, \qquad a = 0.95 \qquad (6a)$$
The time response of this filter is
$$\hat{s}(n) = s(n) - a\,s(n-1) \qquad (6b)$$
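A minimal sketch of the pre-emphasis filter of eqn (6b); the coefficient value follows eqn (6a), while the function name and the toy input are assumptions of this sketch.

/* Pre-emphasis, eqn (6b): s_hat(n) = s(n) - a*s(n-1), with a = 0.95. */
#include <stdio.h>

static void pre_emphasis(const double *s, double *out, int n, double a)
{
    out[0] = s[0];                       /* no previous sample for n = 0 */
    for (int i = 1; i < n; i++)
        out[i] = s[i] - a * s[i - 1];
}

int main(void)
{
    double s[6] = { 0.0, 0.5, 0.9, 1.0, 0.7, 0.2 };  /* toy samples */
    double y[6];
    pre_emphasis(s, y, 6, 0.95);
    for (int i = 0; i < 6; i++)
        printf("%d: %.3f -> %.3f\n", i, s[i], y[i]);
    return 0;
}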

      3. Frame Blocking

The partitioning of a word's samples into short blocks, in order to make the signal quasi-stationary, is based on eqn (7):
$$x_{i,j} = y\big((K - M)\,i + j\big), \qquad j = 0,1,\ldots,K-1;\; i = 0,1,\ldots,L-1 \qquad (7)$$
where:
N = no. of samples per word;
L = (N − M)/(K − M) = no. of frames per word;
M = no. of overlapped samples per frame;
K = no. of samples per frame.

      4. Windowing

Frame blocking with the rectangular window implied by eqn (7) leads to ringing in the frequency response, also known as the Gibbs phenomenon, as a result of the sharp frame edges. Hence, a Hamming window, which is a raised cosine window, is used to smooth the edges. The windowing is defined by eqn (8):
$$y(n) = x(n)\,w(n) \qquad (8)$$
where
$$w(n) = 0.54 - 0.46\cos\!\left(\frac{2\pi n}{N-1}\right), \qquad 0 \le n \le N-1$$
w(n) = Hamming window, x(n) = speech frame samples.
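The frame blocking of eqn (7) and the Hamming windowing of eqn (8) can be sketched together as below. The frame length K = 256 and overlap M = 128 follow the values quoted in Section 5; the helper names and the toy buffer are assumptions of this sketch.

/* Frame blocking (eqn 7) and Hamming windowing (eqn 8). */
#include <stdio.h>
#include <math.h>

#define K  256   /* samples per frame  */
#define M  128   /* overlapped samples */
#define PI 3.14159265358979323846

/* copy frame i out of the word buffer y: x[j] = y[(K-M)*i + j] */
static void get_frame(const double *y, double *x, int i)
{
    for (int j = 0; j < K; j++)
        x[j] = y[(K - M) * i + j];
}

/* apply the raised-cosine (Hamming) window in place */
static void hamming(double *x, int n)
{
    for (int j = 0; j < n; j++)
        x[j] *= 0.54 - 0.46 * cos(2.0 * PI * j / (n - 1));
}

int main(void)
{
    enum { NSAMP = 2048 };                 /* toy word length       */
    static double word[NSAMP];             /* would hold real audio */
    int nframes = (NSAMP - M) / (K - M);   /* L in eqn (7)          */

    double frame[K];
    for (int i = 0; i < nframes; i++) {
        get_frame(word, frame, i);
        hamming(frame, K);
        /* frame[] is now ready for the MFCC stage */
    }
    printf("processed %d frames of %d samples (overlap %d)\n",
           nframes, K, M);
    return 0;
}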

    2. Feature Extraction

The characterisation of speech signal data for the purpose of simultaneous recognition of what is said and who said it, that is, speech recognition and speaker verification respectively, is most efficiently handled by mel frequency cepstral coefficients, MFCC. This is because MFCC characterisation is very similar to human aural perception. Hence, MFCCs are used as feature vectors in this work for simultaneous speaker authentication and speech recognition [5,6,7,8,9].

The process of obtaining the MFCCs involves transforming the windowed frame of speech data from the time domain to the frequency domain and back to the time domain after processing. Firstly, the magnitude spectrum of the windowed speech signal is obtained on a linear frequency scale by the FFT. This magnitude is converted to a power spectrum, which is weighted by the magnitude response of a filter bank on the mel-frequency scale. In order to convert the obtained mel-scaled power spectrum back to the time domain, the inverse discrete cosine transform, DCT, is taken. The outputs of the inverse DCT are the mel frequency cepstral coefficients. The algorithm for obtaining the MFCCs is described in eqn (9) to eqn (13).

Apply the DFT to each windowed speech frame of eqn (8):
$$Y(n) = \sum_{k=0}^{N-1} w(k)\,x(k)\,e^{-j 2\pi k n / N}, \qquad n = 0,1,\ldots,N-1 \qquad (9)$$

Get the power spectrum of eqn (9):
$$\lvert Y(n)\rvert^2 = Y_{\mathrm{real}}(n)^2 + Y_{\mathrm{imag}}(n)^2, \qquad n = 0,1,\ldots,N-1 \qquad (10)$$

Convert the power spectrum of eqn (10) on the linear frequency scale to a power spectrum on the mel frequency scale:
$$P_{mel}(m) = \sum_{k=0}^{N-1} \lvert Y(k)\rvert^2\, H(k, m), \qquad m = 0,1,\ldots,L \qquad (11a)$$
where L = no. of mel filters and
$$H(k,m) = \begin{cases} 0, & f(k) < f_c(m-1) \\ \dfrac{f(k) - f_c(m-1)}{f_c(m) - f_c(m-1)}, & f_c(m-1) \le f(k) \le f_c(m) \\ \dfrac{f_c(m+1) - f(k)}{f_c(m+1) - f_c(m)}, & f_c(m) \le f(k) \le f_c(m+1) \\ 0, & f(k) > f_c(m+1) \end{cases}$$

$$f(k) = f_{\min} + (k-1)\,\Delta f, \qquad \Delta f = \frac{f_{\max} - f_{\min}}{N - 1}, \qquad k = 1,2,\ldots,N \qquad (11b)$$
where $f_{\min}$ = minimum speech frequency and $f_{\max}$ = maximum speech frequency.

$$f_c(m) = \hat f_{\min} + m\,\Delta\hat f, \qquad \Delta\hat f = \frac{\hat f_{\max} - \hat f_{\min}}{L + 1}, \qquad \hat f(m) = 2595\,\log_{10}\!\left(\frac{f(m)}{700} + 1\right), \qquad m = 1,2,\ldots,L \qquad (11c)$$
where $\hat f_{\min}$ = minimum mel frequency, $\hat f_{\max}$ = maximum mel frequency, $f$ = speech frequency, and $\hat f$ = mel speech frequency.

Obtain the logarithm of the mel-scale power spectrum:
$$\hat P_{mel}(m) = \log_{10} P_{mel}(m), \qquad m = 1,2,\ldots,L \qquad (12)$$

Convert the logarithmic mel frequency power spectrum to time-domain cepstral coefficients (MFCC) using the inverse DCT:
$$C_i = \sum_{j=1}^{L} \hat P_{mel}(j)\,\cos\!\big(\pi\, i\,(j - 0.5)/M\big), \qquad i = 1,2,\ldots,M \qquad (13)$$
where M = no. of coefficients and L = no. of mel filters.
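The chain of eqns (9)-(13) can be sketched in C as follows. For brevity a direct-form DFT stands in for the FFT, the filter centres are spaced uniformly on the mel scale of eqn (11c), the DCT is normalised by the number of filters (a common convention), and the 20-filter, 16-coefficient setting is taken from Section 5; all identifiers are illustrative rather than the authors' code.

/* Sketch of MFCC extraction, eqns (9)-(13): DFT -> power spectrum ->
 * triangular mel filter bank -> log10 -> DCT. A direct-form DFT is used
 * here instead of an FFT purely to keep the example short. */
#include <stdio.h>
#include <math.h>

#define N      256                 /* samples per frame            */
#define NFILT  20                  /* L: number of mel filters     */
#define NCEPS  16                  /* M: number of cepstral coeffs */
#define FS     8000.0              /* sampling rate, Hz            */
#define PI     3.14159265358979323846

static double mel(double f)  { return 2595.0 * log10(1.0 + f / 700.0); }
static double imel(double m) { return 700.0 * (pow(10.0, m / 2595.0) - 1.0); }

static void mfcc(const double *x, double *c)
{
    double power[N / 2 + 1];

    /* eqns (9)-(10): power spectrum of the (already windowed) frame */
    for (int n = 0; n <= N / 2; n++) {
        double re = 0.0, im = 0.0;
        for (int k = 0; k < N; k++) {
            re += x[k] * cos(2.0 * PI * k * n / N);
            im -= x[k] * sin(2.0 * PI * k * n / N);
        }
        power[n] = re * re + im * im;
    }

    /* eqn (11c): filter centre frequencies, equally spaced in mel */
    double centres[NFILT + 2];
    double lo = mel(0.0), hi = mel(FS / 2.0);
    for (int m = 0; m < NFILT + 2; m++)
        centres[m] = imel(lo + m * (hi - lo) / (NFILT + 1));

    /* eqns (11a)-(12): triangular filters, then log filter energies */
    double logmel[NFILT];
    for (int m = 1; m <= NFILT; m++) {
        double sum = 0.0;
        for (int k = 0; k <= N / 2; k++) {
            double f = k * FS / N;                 /* bin frequency, cf. eqn (11b) */
            double h = 0.0;
            if (f >= centres[m - 1] && f <= centres[m])
                h = (f - centres[m - 1]) / (centres[m] - centres[m - 1]);
            else if (f > centres[m] && f <= centres[m + 1])
                h = (centres[m + 1] - f) / (centres[m + 1] - centres[m]);
            sum += power[k] * h;
        }
        logmel[m - 1] = log10(sum + 1e-10);        /* guard against log(0) */
    }

    /* eqn (13): DCT of the log mel spectrum gives the cepstral coeffs */
    for (int i = 0; i < NCEPS; i++) {
        c[i] = 0.0;
        for (int j = 0; j < NFILT; j++)
            c[i] += logmel[j] * cos(PI * (i + 1) * (j + 0.5) / NFILT);
    }
}

int main(void)
{
    static double frame[N];                  /* would hold a windowed frame */
    for (int k = 0; k < N; k++)              /* toy signal: 1 kHz tone      */
        frame[k] = sin(2.0 * PI * 1000.0 * k / FS);

    double c[NCEPS];
    mfcc(frame, c);
    for (int i = 0; i < NCEPS; i++) printf("c[%d] = %.3f\n", i, c[i]);
    return 0;
}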

    3. Vector Quantisation, VQ

It is usual in speaker and speech recognition involving multiple utterances of words to generate a very large number of feature vectors per word during the training phase. Hence, the total number of feature vectors can easily become unmanageable in terms of storage as templates or in matching computation. These two problems can render speech recognition in embedded applications unrealisable. Hence, it is imperative to use vector quantisation as a data compression method [3,4,5]. The VQ problem definition is:

Given $T = \{X_1, X_2, \ldots, X_N\}$ feature vectors and the number of desired codevectors $M$, find the codevectors $C = \{c_1, c_2, \ldots, c_M\}$ and the codevectors' partition regions $P = \{s_1, s_2, \ldots, s_M\}$ such that the average distortion $D_{ave}$ is minimised.

This problem is solved, in our case, using the LBG-VQ algorithm of Fig. 2.

_start_LBG_VQ_algorithm
{
step0: Codebook Initialisation
- Input $N$, $M$, $X_i = \{x_{i,1}, x_{i,2}, \ldots, x_{i,k}\}$, $i = 1,2,\ldots,N$; $k$ = dimension of feature vector
  (* $N$ = total no. of feature vectors in the training set; $X$ = training set feature vectors; $M$ = no. of codevectors/vocabulary words *)
- Calculate the 1-codevector codebook:
  $c_1 = \frac{1}{N}\sum_{i=1}^{N} x_i$   (13a)
- Set $\varepsilon = 0.01$
- Set $m = 1$
- Set $n(h) = 0$, $h = 1,2,\ldots,M$
- Calculate the distortion $D$:
  $D = \frac{1}{N k}\sum_{i=1}^{N}\sum_{j=1}^{k} \big(x_{i,j} - c_{1,j}\big)^2$

step1: Double Codebook Size by Splitting
  for $i = 1$ to $m$ do
    $c_{m+i} = (1 - \varepsilon)\,c_i$;  $c_i = (1 + \varepsilon)\,c_i$   (13b)
  $m \leftarrow 2m$

step2: Distribute Feature Vectors by Clustering
- Set $D' = D$
  for $i = 1$ to $N$ do
    $j^* = \arg\min_j \lVert x_i - c_j \rVert^2$;  $s_{j^*} \leftarrow x_i$;  $n(j^*) = n(j^*) + 1$   (13c)

step3: Update Centroids/Codevectors
  $c_j = \frac{1}{n_j}\sum_{h=1}^{n_j} s_j(h)$, $j = 1,2,\ldots,M$   (13d)

step4: Calculate New Distortion
  $D = \frac{1}{N M k}\sum_{i=1}^{N}\sum_{h=1}^{M}\sum_{j=1}^{k} \big(x_{i,j} - c_{h,j}\big)^2$   (13e)
  if $(D' - D)/D > \varepsilon$ then goto step2, otherwise goto step5

step5: Repeat Until the Desired Number of Codewords
  if $(m < M)$ then goto step1, otherwise goto step6

step6: Output Codebook
  Output $c_j$, $j = 1,2,\ldots,M$

step7: stop
}_end_LBG_VQ_algorithm
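A condensed C sketch of the LBG training loop above, run on toy two-dimensional data. The ε value follows step 0; the dimensions, the toy vectors, and the way each centroid is recomputed directly from the vectors assigned to it (equivalent to steps 2-3) are choices of this sketch, not the authors' implementation.

/* Sketch of LBG vector-quantiser training (eqns 13a-13e).
 * Toy 2-D data and fixed-size arrays keep the example self-contained. */
#include <stdio.h>

#define NVEC  8     /* N: training vectors    */
#define DIM   2     /* k: feature dimension   */
#define MCODE 4     /* M: desired codevectors */
#define EPS   0.01

static double dist2(const double *a, const double *b)
{
    double d = 0.0;
    for (int j = 0; j < DIM; j++) d += (a[j] - b[j]) * (a[j] - b[j]);
    return d;
}

int main(void)
{
    double x[NVEC][DIM] = { {0,0},{0,1},{1,0},{1,1},
                            {8,8},{8,9},{9,8},{9,9} };
    double c[MCODE][DIM];
    int m = 1;

    /* step0: 1-codevector codebook = centroid of all data (eqn 13a) */
    for (int j = 0; j < DIM; j++) {
        c[0][j] = 0.0;
        for (int i = 0; i < NVEC; i++) c[0][j] += x[i][j];
        c[0][j] /= NVEC;
    }

    while (m < MCODE) {
        /* step1: split every codevector into (1+eps)c and (1-eps)c */
        for (int i = 0; i < m; i++)
            for (int j = 0; j < DIM; j++) {
                c[m + i][j] = (1.0 - EPS) * c[i][j];
                c[i][j]     = (1.0 + EPS) * c[i][j];
            }
        m *= 2;

        /* steps 2-4: cluster and update until the distortion settles;
           distortion here uses each vector's nearest codevector */
        double dprev = 1e30, d = 1e29;
        while ((dprev - d) / d > EPS) {
            dprev = d;
            double sum[MCODE][DIM] = {{0}};
            int    cnt[MCODE] = {0};
            d = 0.0;
            for (int i = 0; i < NVEC; i++) {
                int best = 0;
                for (int h = 1; h < m; h++)
                    if (dist2(x[i], c[h]) < dist2(x[i], c[best])) best = h;
                for (int j = 0; j < DIM; j++) sum[best][j] += x[i][j];
                cnt[best]++;
                d += dist2(x[i], c[best]);
            }
            d /= (double)NVEC * DIM;              /* average distortion   */
            for (int h = 0; h < m; h++)           /* step3: new centroids */
                if (cnt[h] > 0)
                    for (int j = 0; j < DIM; j++)
                        c[h][j] = sum[h][j] / cnt[h];
        }
    }

    for (int h = 0; h < MCODE; h++)               /* step6: output codebook */
        printf("c[%d] = (%.2f, %.2f)\n", h, c[h][0], c[h][1]);
    return 0;
}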

    4. Recognition

The recognition phase is implemented by a simple Euclidean distance measure and an empirically determined threshold, as given in eqn (14):
$$d = \min_{i} \lVert R - Q_i \rVert^2, \quad i = 1,2,\ldots,N; \qquad \text{if } d \le T \text{ then Recognised, otherwise Not Recognised} \qquad (14)$$
where R is the feature vector of the test utterance, the Q_i are the reference codevectors, and T is the stored threshold.
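The decision of eqn (14), that is, the nearest codevector by squared Euclidean distance accepted only when the distance falls below the stored threshold, might look like the following sketch; the reference vectors, thresholds, and dimensions are toy numbers rather than values from the paper.

/* Sketch of the recognition decision of eqn (14): the query vector is
 * matched to the nearest reference codevector and accepted only when
 * that distance is below the word's empirically set threshold. */
#include <stdio.h>

#define DIM   4      /* feature dimension (16 in the paper) */
#define NWORD 3      /* vocabulary size   (13 in the paper) */

static double dist2(const double *a, const double *b)
{
    double d = 0.0;
    for (int j = 0; j < DIM; j++) d += (a[j] - b[j]) * (a[j] - b[j]);
    return d;
}

int main(void)
{
    /* toy reference codevectors and per-word thresholds */
    double ref[NWORD][DIM] = { {0,0,0,0}, {1,1,1,1}, {2,2,2,2} };
    double thr[NWORD]      = { 0.5, 0.5, 0.5 };
    double query[DIM]      = { 0.9, 1.1, 1.0, 0.8 };

    int    best = 0;
    double dmin = dist2(query, ref[0]);
    for (int i = 1; i < NWORD; i++) {
        double d = dist2(query, ref[i]);
        if (d < dmin) { dmin = d; best = i; }
    }

    if (dmin <= thr[best])
        printf("recognised as word %d (distance %.3f)\n", best, dmin);
    else
        printf("not recognised (nearest word %d, distance %.3f)\n", best, dmin);
    return 0;
}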

  5. Experiment and Result

Speech contains more than just what is said; it also carries information about the speaker, such as accent, gender, and age group. Hence, a single process suffices to authenticate the speaker and to recognise the word. Experiments were conducted using 20 native Yorùbá speakers, each pronouncing each of the 13 words in the vocabulary 10 times. These 2,600 utterances were all used for training. For the recognition phase, the 20 speakers that participated in the training pronounced the telephone sentence (pè fnú + an 11-digit phone number of their choice) and the word (gbé fnú) once in Yorùbá. Each word is sampled at a rate of 8000 samples per second, with each sample quantised into 16 bits. Fig. 2 shows the speech waveform of the Nigerian mobile phone number 08034265239 pronounced as a sentence in Yorùbá. The pre-emphasis filter coefficient used is 0.97, with a framed window of 256 samples and an overlap of 128 samples. The MFCC feature extraction method is used with a 20-filter mel filter bank and 16-dimensional feature vectors.

There are 20 codebooks, one codebook for each speaker. Each codebook has 13 codebooklets, and each codebooklet represents one word of the vocabulary. A codebooklet contains 10 codevectors representing the 10 utterances per word per speaker. The codebooks, codebooklets, and codevectors are appropriately populated during training. An adaptive threshold is determined for each codevector, that is, for each word utterance, during training. A simple Euclidean distance measure is used in matching a test utterance with the templates in the codebooks. The computed distance is compared with the stored thresholds in determining the speaker and in recognising the spoken sentence for auto-dialling of the GSM set. The system is coded in C and run on a 2.6 GHz Pentium dual-core PC with 2 GB RAM on board. The PC has a GSM set and a multimedia headset attached. However, the final system will be an embedded front-end interfaced to a GSM set. The experiments yielded a 94% speaker recognition rate and an 82% phone sentence recognition rate.
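For reference, the front-end settings quoted in this section can be restated as a small block of C constants; this is only a summary of the reported parameters, not code from the paper.

/* Front-end parameters as reported in Section 5 (restated as constants). */
#define SAMPLE_RATE_HZ      8000   /* 8000 samples per second           */
#define BITS_PER_SAMPLE     16     /* 16-bit quantisation               */
#define FRAME_LEN           256    /* samples per framed window         */
#define FRAME_OVERLAP       128    /* overlapped samples between frames */
#define PREEMPH_COEFF       0.97   /* pre-emphasis coefficient          */
#define NUM_MEL_FILTERS     20     /* mel filter bank size              */
#define FEATURE_DIM         16     /* MFCC feature vector dimension     */
#define NUM_SPEAKERS        20     /* one codebook per speaker          */
#define VOCABULARY_SIZE     13     /* codebooklets per codebook         */
#define UTTERANCES_PER_WORD 10     /* codevectors per codebooklet       */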

  6. Conclusion

A user-friendly human-computer interaction based on speech recognition for telephone auto-dialling in Yorùbá was developed. The speech recognition algorithm was coded in C and run on a 2.6 GHz Pentium dual-core PC with 2 GB RAM, with a GSM set and a multimedia headset attached to the PC. The experiments yielded a 94% speaker recognition rate and an 82% phone sentence recognition rate. Though the system was developed on a PC, the target is an embedded front-end unit interfaced to a GSM set.

  7. Acknowledgement

We acknowledge with great appreciation the generous research and development grant received from the Federal Government of Nigeria, through the STEP-B project, to execute this work.

  8. References

1. Lipeika Antanas, Lipeikiene Joana, Telksnys Laimutis, Development of Isolated Word Speech Recognition, Informatica, vol. 13, no. 1, 2002, pp. 37-46

2. E-Hocine Bourouba, et al., Isolated Words Recognition System Based on Hybrid Approach DTW/GHMM, Informatica, vol. 30, 2006, pp. 373-384

3. Allam Musa, MareText Independent Speaker Identification based on K-Means Algorithm, International Journal on Electrical Engineering and Informatics, vol. 3, no. 1, 2011, pp. 100-108

4. Srinivasan A., Speaker Identification and Verification using Vector Quantisation and Mel Frequency Cepstral Coefficients, Research Journal of Applied Sciences, Engineering and Technology, vol. 4, no. 1, 2012, pp. 33-40

  5. Satyahad Singh, and Rajan E.G., MFCC VQ based Speaker Recognition and its Accuracy Affecting Factors, International Journal of Computer Applications, vol. 21, no.6, 2011, pp.1-6

6. Kekre H.B., and Vaishali Kulkarni, Performance Comparison of Speaker Recognition using Vector Quantization by LBG and KFCG, International Journal of Applications, vol. 3, no. 10, 2010, pp. 32-37

7. Rashidul Hasan, Mustafa Jamil, Golam Rabbani, Saifur Rahman, Speaker Identification using Mel Frequency Cepstral Coefficients, Proc. 3rd International Conference on Electrical and Computer Engineering, ICECE 2004, 28-30 December, Dhaka, Bangladesh, 2004, pp. 565-568

  8. Linde Y., Buzo A., Gray R.M., An Algorithm for Vector Quantiser Design, IEEE Trans on Communications, vol. COM-28, no. 1, 1980, pp. 84-95

9. Wael Al-Sawalmeh, Khaled Daqrouq, Omar Daoud, Abdel-Rahman Al-Qawasmi, Speaker Identification System based on Mel Frequency and Wavelet Transform using Neural Network Classifier, European Journal of Scientific Research, vol. 41, no. 4, 2010, pp. 515-525

10. Srinivasan A., Speech Recognition using Hidden Markov Model, Applied Mathematical Sciences, vol. 5, no. 79, 2011, pp. 3943-3948
