A Novel Approach for Speech Recognition

DOI : 10.17577/IJERTCONV4IS18020


T. Lakshmi Narayana1 (Ph.D), IEEE Member, Assistant Professor, Dept. of ECE, ALIET, Vijayawada.

Allu Venkatesh2, B.Tech. Student, Dept. of ECE, ALIET, Vijayawada.

B. Rupendra Reddy3, B.Tech. Student, Dept. of ECE, ALIET, Vijayawada.

Gollapudi Baji Babu4, B.Tech. Student, Dept. of ECE, ALIET, Vijayawada.

Uppu Madhu Sai Lohith5, B.Tech. Student, Dept. of ECE, ALIET, Vijayawada.

Abstract: This paper describes an approach to software and hardware control using a robust speech recognition system. Speech recognition is the process of automatically recognizing the spoken words of a person from the unique characteristics of their speech. The software and hardware in this system are controlled using isolated-word speech recognition. The major steps in speech recognition system design are feature detection, feature extraction and feature matching. Feature detection and extraction use the Mel Frequency Cepstral Coefficients (MFCC) algorithm, and feature matching is performed with the Dynamic Time Warping (DTW) algorithm. External hardware is controlled by interfacing the speech recognition system with a programmable logic device such as an Arduino board. The main difficulty in any recognition system design is the large database size; in the proposed system the database size is reduced by detecting a region of interest rather than processing the entire speech signal.

Keywords: MFCC, DTW, PLD, Region of Interest (ROI).

  1. INTRODUCTION:

    SPEECH RECOGNITION SYSTEM:

    Speech recognition refers to the study of speech signals and their processing methods. Speech is usually processed in a digital representation. The user gives a predefined voice instruction to the system; the system interprets this command and executes the required function.

    Most speech recognition systems are classified as isolated-word or continuous systems. Isolated speech contains only one word or utterance at a time, whereas continuous speech allows the user to speak naturally, i.e., as a continuous utterance without pauses between words.

    Speech recognition systems can be further classified as speaker-dependent or speaker-independent. A speaker-dependent system recognizes speech from only one particular person, whereas a speaker-independent system recognizes speech from any person. Recognition is performed by training the database with a fixed number of isolated words as different classes, where each class is trained with a unique word spoken in several utterances [1].

    (Block diagram: the test speech database and the trained speech database feed a matching block, whose output is a matched or unmatched decision.)

    Fig 1: Basic Speech Recognition System

  2. IMPORTANCE:

    The main purpose of a speech recognition system is to make digital systems human-friendly.

  3. ELEMENTS TO DESIGN SPEECH RECOGNITION SYSTEMS:

    1. Feature extraction:

      Feature extraction is performed by training the database with a limited number of isolated words and computing their features using suitable algorithms.

    2. Feature matching:

    Feature matching is performed by measuring the similarity between the features of the trained database and the features of the test speech using a specified algorithm.

  4. FEATURE EXTRACTION ALGORITHMS:

    1. Linear predictive coding coefficients:

      LPCC analysis is an effective method for estimating the main parameters of speech signals. Its key conclusion is that an all-pole filter H(z) is a good approximation for the speech production system, so the speech samples can be synthesized from the filter parameters by a difference equation. Each speech sample is then modeled as a linear combination of the previous p samples, which is why this speech production model is often called the linear prediction model, or the autoregressive model [2].
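      For reference, the all-pole transfer function referred to above (not reproduced in the text, so this is a standard-form completion rather than the paper's own notation) is H(z) = G / (1 - a1*z^-1 - a2*z^-2 - ... - ap*z^-p), so that each sample is predicted as s(n) ≈ a1*s(n-1) + a2*s(n-2) + ... + ap*s(n-p), where p is the prediction order, the ak are the LPC coefficients and G is the gain.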

      (Block diagram: speech → pre-emphasis → framing → windowing → FFT → autocorrelation and Durbin recursion → cepstral recursion → LPCC.)

      Fig 2: LPC coefficients extraction process

    2. Mel Frequency Cepstrum Coefficients:

      MFCC is a beneficial approach for speech recognition. Figure 3 illustrates the complete process of extracting the MFCC vectors from the speech signal. It should be emphasized that MFCC extraction is applied to each frame of the speech signal independently.

      The difference between the cepstrum and the Mel-frequency cepstrum is that in the MFC the frequency bands are positioned logarithmically (on the Mel scale), which approximates the human auditory system's response more closely than the linearly spaced frequency bands obtained directly from the FFT or DCT [3]. The Mel-frequency warping serves to:

      • Smooth the magnitude spectrum, so that the pitch of the speech signal is generally not present in the MFCCs.

      • Reduce the number of features involved.

    (Block diagram: speech → pre-emphasis → framing → windowing → FFT → Mel scale → logarithm → DCT → MFCC.)

    Fig 3: MFC coefficients extraction process

  5. FEATURE MATCHING:

    1. Linear Time Warping (LTW):

      Linear Time Warping is the method of calculating the Euclidean distance. The Euclidean distance (or Euclidean metric) is the ordinary, straight-line distance between two points in Euclidean space. Here the distance is computed element by element as

      d(i) = |x(i) - y(i)|

      x = [1,1,4,2,5] & y = [1,2,3,2,2]

      x= 1 1 4 2 5

      y= 1 2 3 2 2

      d= 0 1 1 0 3

      Here x is the test sequence and y is the database sequence. When the database sequence contains more samples than the test sequence, this element-by-element distance can no longer be computed, so for accurate distance measurement we move to the DTW technique.
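      A one-line MATLAB check of this element-wise distance for the example values (the summed total is added here only as an overall mismatch score; the paper lists only the per-element values):

      x = [1 1 4 2 5];
      y = [1 2 3 2 2];
      d = abs(x - y)      % per-element distances: 0 1 1 0 3, matching the rows above
      D = sum(d)          % total mismatch = 5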

    2. Dynamic Time Warping:

    Dynamic time warping is an algorithm for measuring the similarity between two temporal sequences which may vary in time or speed. It finds an optimal match between the two sequences subject to certain restrictions. For example, similarities in walking patterns can be measured with DTW even if one person walks faster than the other.

    The local distance between elements is d(i, j) = |x(i) - y(j)|, and the accumulated distance is built up as D(i, j) = d(i, j) + min[ D(i-1, j-1), D(i-1, j), D(i, j-1) ].

    (each cell of the table is the local distance |x - y|)

    y = 6 :  5  5  2  4  1
    y = 4 :  3  3  0  2  1
    y = 2 :  1  1  2  0  3
    y = 2 :  1  1  2  0  3
    y = 3 :  2  2  1  1  2
    y = 2 :  1  1  2  0  3
    y = 1 :  0  0  3  1  4
        x :  1  1  4  2  5

    x = (1, 1, 4, 2, 5) and y = (1, 2, 3, 2, 2, 4, 6)

    Fig 4: Tabular representation of Distance between x & y

    From the table above we can find the shortest accumulated distance between two sequences of different lengths, something that cannot be calculated with the plain Euclidean, element-by-element distance.
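    The paper does not give code for this step; the following MATLAB sketch is one illustrative way to compute the accumulated DTW distance. It takes two sequences stored column-wise (single numbers per column for the x and y above, or MFCC vectors per frame later) and uses the same |x - y| local distance as the table in Fig 4:

    function D = dtw_dist(X, Y)
    % DTW distance between two sequences stored column-wise:
    % X is d-by-n, Y is d-by-m (d = 1 for the scalar example above).
    n = size(X, 2);
    m = size(Y, 2);
    acc = inf(n + 1, m + 1);      % accumulated cost, padded with an extra row and column
    acc(1, 1) = 0;
    for i = 1:n
        for j = 1:m
            d = sum(abs(X(:, i) - Y(:, j)));                          % local distance, as in Fig 4
            acc(i+1, j+1) = d + min([acc(i, j), acc(i, j+1), acc(i+1, j)]);  % best predecessor
        end
    end
    D = acc(n + 1, m + 1);        % cost of the optimal warping path
    end

    For the example above, dtw_dist([1 1 4 2 5], [1 2 3 2 2 4 6]) walks the lowest-cost path through the table in Fig 4 and returns 4, whereas an element-by-element distance is not even defined because the two sequences have different lengths.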

  6. CONTROL DESIGN

    Here, after extracting the features of the trained database for the different speech classes, those features are matched against the features of the test speech to find the distance. Once the shortest-path distance is found, software and hardware devices such as computers, mobiles and PLDs (Arduino, FPGA and DSP boards, etc.) can be controlled.


  7. IMPLEMENTATION OF SPEECH SYSTEM:

    1. Block diagram:

      (Block diagram: Speech → Speech Recognition System → PLD (Arduino) → Hardware.)


      Fig 5: Speech Recognition System

      2. Design Aspects:

        1. Type of Input Speech.

          CONTINUOUS SPEECH.

          Continuous speech allows the user to speak naturally; it is a continuous utterance without pauses between words.

          ISOLATED SPEECH.

          Isolated speech consists of only one word or utterance at a time; the proposed system operates on isolated speech.

        2. Separation of voiced and unvoiced portions.

    By assigning a threshold value to the input isolated speech we can separate the voiced and unvoiced portions. After the envelope value is assigned, the voiced and unvoiced portions are separated using the absolute (magnitude) value of the input isolated speech: values below the threshold are treated as the unvoiced portion of the signal, and values above the threshold are treated as the voiced portion.
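    A minimal MATLAB sketch of this thresholding step, assuming the isolated word has already been recorded into a vector x at sampling rate fs; the fraction 0.1 of the peak magnitude used as the threshold is illustrative, not a value given in the paper:

    % Separate voiced and unvoiced portions by magnitude thresholding (illustrative sketch)
    mag = abs(x);                      % magnitude of the speech samples
    threshold = 0.1 * max(mag);        % assumed threshold: 10% of the peak magnitude
    voiced   = x(mag >= threshold);    % region of interest passed on to feature extraction
    unvoiced = x(mag <  threshold);    % discarded portion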

    3. Feature extraction.

    Feature extraction is performed on the voiced portion using the MFCC algorithm, implemented through the following steps.

    PRE-EMPHASIS:

    The digitized speech has a large dynamic range and suffers from additive noise. To reduce this range and spectrally flatten the speech signal, pre-emphasis is applied.

    FRAME BLOCKING:

    Audio signals change continuously, so the speech signal is split into small frames and each frame is analyzed over a short time instead of analyzing the entire signal at once. The frame size is of the order of 20 ms, and successive frames overlap because a window is applied to each frame.

    WINDOWING:

    Windowing is used to avoid discontinuities at the frame boundaries and the resulting distortion of the underlying spectrum [4]. In speech recognition the most commonly used window shape is the Hamming window; the choice of window is a trade-off between different factors [5].

    W(n) = 0.54 - 0.46*cos(2*pi*n/(N-1)), 0 <= n <= N-1

    FAST FOURIER TRANSFORM:

    The FFT is used to convert each frame of N samples from the time domain into the frequency domain. The FFT is a fast algorithm for implementing the Discrete Fourier Transform (DFT) [6]:

    X(k) = sum from n = 0 to N-1 of x(n)*exp(-j*2*pi*k*n/N), k = 0, 1, ..., N-1

    MEL-FREQUENCY SCALE:

    The Mel frequency scale relates to perceived frequency rather than to the actual frequency. The human ear's perception of the frequency content of sounds does not follow a linear scale. Therefore, for each tone with an actual frequency f, measured in Hz, a corresponding pitch is measured on a scale called the Mel scale. The Mel scale is approximately linear below 1000 Hz and logarithmic above 1000 Hz. The mel value for a given frequency f in Hz can be calculated as [7]

    Mel(f)=S(k)=2595*log10(1+f/700)

    By using the Mel frequency scale we smooth the magnitude spectrum and reduce the number of features involved.
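    As a quick worked check of the formula, a 1000 Hz tone maps to Mel(1000) = 2595*log10(1 + 1000/700) ≈ 1000 mel, while a 4000 Hz tone maps to only about 2146 mel, which shows how the scale compresses the higher frequencies.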

    LOGARITHM:

    Taking the logarithm compresses the filter-bank values so that they correspond more closely to the human audible range. The logarithm also improves robustness beyond the Mel scale alone, because it allows cepstral mean subtraction to be used.
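    The cepstral mean subtraction mentioned here amounts to removing the per-coefficient average over all frames of an utterance. As a one-line MATLAB sketch, assuming an MFCC matrix coeffs with one column per frame (as produced by the sketch later in this section):

    coeffs = coeffs - repmat(mean(coeffs, 2), 1, size(coeffs, 2));   % cepstral mean subtraction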

    CEPSTRUM:

    This is the final step of the feature extraction: the log Mel spectrum is converted back into the time domain. The result is called the Mel Frequency Cepstral Coefficients (MFCC). The cepstral representation of the speech spectrum provides a good description of the local spectral properties of the signal for the given analysis frame. Because the Mel spectrum coefficients (and hence their logarithm) are real numbers, they can be converted to the time domain using the Discrete Cosine Transform (DCT) [7]:


    c(n) = sum from m = 1 to M of log(S(m))*cos(pi*n*(m - 1/2)/M), n = 1, 2, ..., C

    where S(m) is the m-th Mel filter-bank output, M is the number of filters and C is the number of cepstral coefficients kept.

    4. Feature matching.


    Feature matching is done using the Dynamic Time Warping algorithm. It finds the shortest path between the MFCC coefficients of the different speech classes and the MFCC coefficients of the test speech.

    Unlike Linear Time Warping (LTW), which compares two time series based on a linear mapping of the two temporal dimensions, Dynamic Time Warping (DTW) allows a nonlinear warping alignment of one signal to another by minimizing the distance between the two, as shown in Fig 6 [8].

    Fig 6: comparison between LTW and DTW

    The distance between two sequences A = {x1, x2, ...} and B = {y1, y2, ...} is built from the local distances |xi - yj|, as shown in the figure below.

    Fig 7: Graphical representation to calculate shortest path using DTW

    5. Control design

    Using the feature matching technique, the shortest distance is calculated with the DTW algorithm. Software control commands can then be assigned to isolated words in order to control software devices, and hardware devices are interfaced with the speech recognition system through interfacing devices such as RS232. The hardware is controlled by dumping a program into it, so that the required functions are executed in response to the isolated words.

  8. RESULT ANALYSIS:

    The results obtained after the design of the software-controlled speech recognition system are shown below.

    Plot 1: Isolated speech

    Plot 2: separation of voiced & unvoiced portions

    Plot 3: MFCC extraction plots

    A sample program for controlling an LED on an Arduino Uno board from MATLAB (written for the legacy MATLAB ArduinoIO support package, which provides the pinMode/digitalWrite methods used here):

    delete(instrfind({'Port'},{'COM10'}));   % release the serial port if it is already in use
    a = arduino('COM10');                    % open the connection to the board
    a.pinMode(13,'OUTPUT');                  % LED on digital pin 13
    for i = 1:10
        a.digitalWrite(13,1);                % LED on
        pause(0.5);
        a.digitalWrite(13,0);                % LED off
        pause(0.5);
    end
    % end communication with Arduino
    clear a
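    As a hedged illustration of how the recognition output could drive this hardware action (the command words, the class indexing and the variable names trainMFCC and testMFCC are assumptions, not values given in the paper), the dtw_dist sketch from the feature matching section could be combined with the Arduino object a from the program above:

    % Hypothetical decision step: pick the trained class closest to the test utterance.
    % trainMFCC is a cell array of MFCC matrices for the trained words; testMFCC is the test word.
    dist = zeros(1, numel(trainMFCC));
    for c = 1:numel(trainMFCC)
        dist(c) = dtw_dist(testMFCC, trainMFCC{c});
    end
    [~, best] = min(dist);
    if best == 1                 % assume class 1 is the command word "on"
        a.digitalWrite(13,1);
    else
        a.digitalWrite(13,0);
    end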

  9. CONCLUSION:

    The main aim of this project was to recognize isolated speech using the MFCC and DTW techniques. Feature extraction was done with MFCC and feature matching with the DTW technique. Using the output of the speech recognition system, hardware devices are controlled through an Arduino-board interface. The main advancement in this paper is that the speech recognition system separates the voiced and unvoiced portions of the input speech signal, so the amount of data to be processed, and hence the processing time, is reduced.

  10. ACKNOWLEDGMENT

    We thank the Director, Fr. Dr. A. Francis Xavier SJ, the Principal, Dr. O. Mahesh, the Head of the Department, Mr. M. Rama Krishna, and our project guide, Mr. Thalluri Lakshmi Narayana, of the Department of Electronics and Communication Engineering, Andhra Loyola Institute of Engineering and Technology, for their esteemed support and guidance in the successful completion of our project.

    Venkatesh Allu was born in A.P., India. He is pursuing the B.Tech degree in Electronics & Communications Engineering from Jawaharlal Nehru Technological University, Kakinada.

    Rupendra Reddy Bammu was born in A.P., India. He is pursuing the B.Tech degree in Electronics & Communications Engineering from Jawaharlal Nehru Technological University, Kakinada.

    Baji Babu Gollapudi was born in A.P., India. He received the Diploma in Electronics & Communications Engineering from the State Board of Technical Education and Training. He is pursuing the B.Tech degree in Electronics & Communications Engineering from Jawaharlal Nehru Technological University, Kakinada.

    Madhu Sai Lohith Uppu was born in A.P., India. He is pursuing the B.Tech degree in Electronics & Communications Engineering from Jawaharlal Nehru Technological University, Kakinada.

  11. REFERENCES

  1. B. Gold and N. Morgan, Speech and Audio Signal Processing, John Wiley and Sons, New York, NY, 2000.

  2. C. Becchetti and L. P. Ricotti, Speech Recognition, John Wiley and Sons, England, 1999.

  3. E. Karpov, Real Time Speaker Identification, Master's thesis, Department of Computer Science, University of Joensuu, 2003.

  4. V. Tiwari, "MFCC and its applications in speaker recognition", Dept. of Electronics Engg., Gyan Ganga Institute of Technology and Management, Bhopal, (M.P.), India, 2010.

  5. J. Deller, J. Proakis, and J. Hansen, Discrete Time Processing of Speech Signals, Prentice Hall, NJ, USA, 1993.

  6. C. Ittichaichareon, S. Suksri and T. Yingthawornsuk, "Speech Recognition using MFCC", International Conference on Computer Graphics, Simulation and Modeling (ICGSM 2012), July 28-29, 2012, Pattaya, Thailand.

  7. A. H. Gray Jr. and J. D. Markel, "Distance Measures for Speech Processing", IEEE Transactions on Acoustics, Speech and Signal Processing, issue 5, pp. 380-391, Oct. 1976.

  8. M. Brookes, Voicebox: Speech Processing Toolbox for MATLAB [online], Imperial College London. Available: http://www.ee.ic.ac.uk/hp/staff/dmb/voicebox/voicebox.

