





































It is about speech recognition
Typology: Thesis






































A THESIS SUBMITTED IN PARTIAL FULFILLMENT FOR THE REQUIREMENTS OF BACHELORS IN TECHNOLOGY BY
&
UNDER THE GUIDANCE OF
This is to certify that the thesis titled "Real Time Speaker Recognition on TMS320C6713", submitted by MR. ABHIJIT TRIPATHY (108EC013) and MR. AKASH SAHOO (108EI010) in partial fulfillment of the requirements for the Bachelors in Technology degree in ELECTRONICS AND COMMUNICATION and ELECTRONICS AND INSTRUMENTATION at the National Institute of Technology, Rourkela, is an authentic work carried out by them under my supervision.
DATE:
PROF. Ayas Kanta Swain
Department of Electronics and Communication
We would like to take this wonderful opportunity of doing our BTech project on "Real Time Speaker Recognition on TMS320C6713", which has given us a lot of learning and practical exposure. First of all, we would like to express our gratitude to our guide, Prof. T.K. Dan, who gave us the opportunity to work on this project. He gave us the independence to pursue whatever we liked, and he inspired us throughout the project. Secondly, we would like to thank Prof. A.K. Swain, who has been the guiding force in our project. His constant motivation has been of great help to us. We would also like to thank Prof. U.K. Sahoo, who helped us with the signal processing concepts on which we needed clarification. Lastly, we would like to thank Mr. Sumit Kumar Sao, Mr. Kanhu Charan Bhuyan, Mr. Venkat and Mr. Karupannan, who provided us with everything we needed in the Embedded Systems Lab.
ABHIJIT TRIPATHY AKASH SAHOO
108EC013 108EI010
1.1 Introduction
Speaker recognition is the process of automatically recognizing who is speaking on the basis of the information contained in the speech signal. It is carried out in two phases: a training phase and a testing phase. In the training phase, the speaker utters speech samples from which the system builds a model of that speaker. In the testing phase, the input speech is matched against the stored reference models and a decision is made. Speaker recognition can thus be regarded as a biometric system that validates a person's claim to an identity based on features extracted from speech samples. Together, the training and testing phases make it possible to use the speaker's voice to verify his/her identity and control access to services such as voice dialing, telephone banking, telephone shopping, voice mail and other voice-activated computer access services.
1.2 Principles of Speaker Recognition
Speaker recognition can be classified into two categories: identification and verification. Speaker identification is the process of determining which registered speaker provides the utterance.
Figure 1 depicting SI (block diagram: input speech → feature extraction → similarity with the reference models for Speaker #1 through Speaker #N → maximum selection → identification result (speaker ID))
Speaker verification, on the other hand, is the process of determining whether the speaker's claim to an identity is true or not.
Figure 2 depicting SV
However, speaker recognition is a difficult task. It assumes that the human voice exhibits characteristics that are unique to the speaker. This principle is challenged by the highly variant nature of the speech signal. Speech signals in training and testing can vary greatly due to time-dependent changes of the voice, health conditions, acoustic noise variations in the environment, etc.
1.3 Speech Feature Extraction
The feature extraction process converts the speech waveform into a parametric representation for analysis. Feature extraction is also called the signal processing front end. The speech signal is a slowly time-varying signal. Speech characteristics are fairly stationary over short periods of time, i.e. between 5 and 100 ms. However, over longer periods of time, i.e. over 200 ms, the characteristics vary to reflect the different speech sounds being spoken. Therefore the STFT (short-time Fourier transform) is used. A wide range of possibilities exists for representing the speech signal for the speaker recognition task, such as Linear Prediction Coding (LPC) and Mel-frequency cepstrum coefficients (MFCC).
Figure 3 depicting truncation
1.4.2 Frame Blocking
In the frame blocking step, we block the continuous input speech signal into a finite number of frames. We block the speech into frames of N samples each, with adjacent frames separated by M samples (M < N). The first frame consists of the first N samples, and the second frame begins M samples after the start of the first frame, so it overlaps the first frame by N − M samples. Likewise, the third frame overlaps the second frame by N − M samples, and so on.
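The frame blocking step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `frame_blocking` and its parameters `frame_len` (samples per frame) and `hop` (shift between the starts of adjacent frames) are hypothetical names chosen here, and trailing samples that do not fill a whole frame are simply dropped.

```python
def frame_blocking(signal, frame_len, hop):
    """Split `signal` into overlapping frames.

    frame_len: number of samples in each frame
    hop:       shift between the starts of adjacent frames (hop < frame_len
               gives an overlap of frame_len - hop samples)
    Trailing samples that do not fill a complete frame are dropped.
    """
    if frame_len <= 0 or hop <= 0:
        raise ValueError("frame_len and hop must be positive")
    frames = []
    start = 0
    while start + frame_len <= len(signal):
        frames.append(list(signal[start:start + frame_len]))
        start += hop
    return frames


# Ten samples, frames of 4 samples shifted by 2: adjacent frames share 2 samples.
frames = frame_blocking(list(range(10)), frame_len=4, hop=2)
```

With these toy values the second frame starts two samples after the first, so each pair of adjacent frames overlaps by `frame_len - hop = 2` samples.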
After frame blocking we apply windowing. Each frame is multiplied by a window function so as to minimize the signal discontinuities at the beginning and end of the frame. The idea is to taper the signal toward zero at the beginning and end of each frame. The window function w(n) is defined for $0 \le n \le N-1$, where N denotes the number of samples in each frame. We used a Hamming window to minimize the interference from side lobes. The Hamming window has the formula
$$w(n) = 0.54 - 0.46\cos\left(\frac{2\pi n}{N-1}\right), \quad 0 \le n \le N-1.$$
FIGURE 4 depicting a Hamming Window
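As a minimal sketch of the windowing step, the following Python code builds a Hamming window with the standard coefficients 0.54 and 0.46 and multiplies a frame by it sample by sample. The function names `hamming_window` and `apply_window` are hypothetical, chosen only for this illustration.

```python
import math


def hamming_window(n_samples):
    """Return the Hamming window w(n) = 0.54 - 0.46*cos(2*pi*n/(N-1)), n = 0..N-1."""
    if n_samples < 1:
        raise ValueError("window length must be at least 1")
    if n_samples == 1:
        return [1.0]  # avoid dividing by N - 1 = 0
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (n_samples - 1))
            for n in range(n_samples)]


def apply_window(frame, window):
    """Multiply a frame by a window of the same length, sample by sample."""
    if len(frame) != len(window):
        raise ValueError("frame and window must have the same length")
    return [x * w for x, w in zip(frame, window)]


window = hamming_window(5)
windowed = apply_window([1.0, 1.0, 1.0, 1.0, 1.0], window)
```

Note that the Hamming window does not reach exactly zero at the frame edges: its end values are 0.54 − 0.46 = 0.08, and it rises to 1 at the center of the frame.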
Then we compute the fast Fourier transform (FFT), which maps each frame onto the frequency domain. The FFT is a fast algorithm that reduces the number of calculations required to compute the DFT of a given set of samples. For a frame of N samples x(n), the DFT (which, computed frame by frame, gives the STFT) is defined as
$$X(k) = \sum_{n=0}^{N-1} x(n)\, e^{-j 2\pi k n / N}, \quad 0 \le k \le N-1.$$
Here the X(k) values are complex numbers and we consider only their absolute values. The sequence {X(k)} is related to frequency as follows: positive frequencies $0 \le F < F_s/2$ correspond to $0 \le k \le N/2 - 1$, and negative frequencies $-F_s/2 \le F < 0$ correspond to $N/2 \le k \le N-1$, where $F_s$ is the sampling frequency. The result of this FFT calculation is also called the spectrum or periodogram.
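To make the spectrum computation concrete, here is a small sketch of a recursive radix-2 FFT in plain Python, together with the mapping from a DFT bin index to a frequency in Hz. It assumes the frame length is a power of two. The names `fft`, `magnitude_spectrum` and `bin_frequency` are hypothetical, and this is not the routine used on the DSP.

```python
import cmath
import math


def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 0:
        raise ValueError("input must not be empty")
    if n == 1:
        return [complex(x[0])]
    if n % 2:
        raise ValueError("length must be a power of two")
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def magnitude_spectrum(frame):
    """Absolute values |X(k)| of the DFT of one frame."""
    return [abs(value) for value in fft(frame)]


def bin_frequency(k, n, fs):
    """Frequency in Hz of DFT bin k; bins from n/2 upward are negative frequencies."""
    return k * fs / n if k < n // 2 else (k - n) * fs / n


# One cycle of a cosine across 8 samples: the energy sits in bins 1 and 7.
frame = [math.cos(2 * math.pi * n / 8) for n in range(8)]
mags = magnitude_spectrum(frame)
```

With a sampling frequency of 8000 Hz and 8 samples per frame, bin 1 corresponds to 1000 Hz and bin 7 to −1000 Hz, which is why only the first half of the magnitude spectrum is normally kept for real-valued speech frames.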
Figure 5 showing a mel filter bank for 20 mel coefficients (mel-spaced filterbank; x-axis: frequency (Hz), 0 to 6000)
1.5.3 MFCC Cepstral Coefficients
Figure 6 – MFCC Steps
Figure 6 shows the blocks for computing the MFCCs (mel frequency cepstrum coefficients). If we convert the log mel spectrum back to the time domain, the result is called the MFCCs. The cepstral representation of speech gives a good representation of the local spectral properties of the signal for a given analysis frame. The mel spectrum coefficients, being real quantities, can be converted to the time domain using the discrete cosine transform (DCT). If the mel spectrum coefficients obtained in the previous step are denoted by $S_k$, where $1 \le k \le K$, then the mel cepstrum coefficients $c_n$ are given by
$$c_n = \sum_{k=1}^{K} (\log S_k)\cos\left[n\left(k - \frac{1}{2}\right)\frac{\pi}{K}\right], \quad n = 0, 1, \ldots, K-1.$$
We exclude the first value $c_0$, since it represents the mean of the log mel spectrum (the overall signal level) and is of little importance.
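The cosine transform of the log mel spectrum can be sketched directly from the formula. The code below assumes the mel spectrum values are strictly positive and indexes them k = 1..K inside the sum. The function names `mel_cepstrum` and `mfcc_from_mel_spectrum` are hypothetical, used only for this illustration.

```python
import math


def mel_cepstrum(mel_spectrum):
    """c_n = sum_{k=1..K} log(S_k) * cos(n * (k - 1/2) * pi / K), for n = 0..K-1.

    mel_spectrum: the K mel filter bank outputs S_1..S_K (all strictly positive).
    """
    num_filters = len(mel_spectrum)
    log_s = [math.log(s) for s in mel_spectrum]
    return [
        sum(log_s[k - 1] * math.cos(n * (k - 0.5) * math.pi / num_filters)
            for k in range(1, num_filters + 1))
        for n in range(num_filters)
    ]


def mfcc_from_mel_spectrum(mel_spectrum):
    """Drop c_0, which only reflects the overall level of the log spectrum."""
    return mel_cepstrum(mel_spectrum)[1:]


# A perfectly flat mel spectrum has no spectral shape, so only c_0 is non-zero.
flat = [math.e] * 4
coefficients = mel_cepstrum(flat)
```

For the flat toy spectrum above, every log value equals 1, so c_0 equals the number of filters and all remaining coefficients vanish. This illustrates why c_0 carries the level rather than the shape of the spectrum.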
Figure 7 showing acoustic vectors corresponding to the 5th and 6th filters of the mel filter bank.
The figure below shows a conceptual diagram depicting the recognition process. In the figure, two speakers and two dimensions of the acoustic space are shown. The circles refer to the acoustic vectors from speaker 1, whereas the triangles are from speaker 2. In the training phase, using the clustering algorithm described above, a vector quantization codebook specific to each speaker is generated. The resulting codewords (centroids) are shown by black circles for speaker 1 and black triangles for speaker 2. The speaker corresponding to the codebook with the smallest total distortion is identified as the original speaker of the utterance.
Figure 7 depicting recognition process
The algorithm used to cluster the training vectors is the Linde-Buzo-Gray (LBG) algorithm. Let us illustrate the algorithm for clustering a set of L training vectors into a set of M codebook vectors.
Firstly, generate a single-vector codebook; this is the centroid of the complete set of training vectors.
Where, 1 ≤n≤size of code book
Figure 8 depicting LBG algorithm (flowchart: find centroid → split each centroid, m = 2 * m → cluster vectors → find centroids → compute D (distortion) → if (D' − D)/D < ε is not satisfied, set D' = D and cluster again; otherwise, if m < M, split again, else stop)
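The flowchart can be read as the following sketch of LBG training, followed by identification by smallest average distortion. Several details are assumptions rather than the thesis implementation: the splitting rule y(1 + split) and y(1 − split) is the commonly used one, squared Euclidean distance is used as the distortion measure, the codebook size is assumed to be a power of two, and all function and parameter names (`lbg`, `identify`, `split`, `eps`) are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def nearest(vector, codebook):
    """Index of the codeword closest to `vector`."""
    return min(range(len(codebook)), key=lambda i: sq_dist(vector, codebook[i]))


def average_distortion(vectors, codebook):
    """Mean distance from each vector to its nearest codeword."""
    return sum(min(sq_dist(v, c) for c in codebook) for v in vectors) / len(vectors)


def lbg(vectors, size, split=0.01, eps=1e-3):
    """Train a codebook with `size` codewords (assumed to be a power of two)."""
    codebook = [centroid(vectors)]  # start from a single-vector codebook
    while len(codebook) < size:
        # Split every codeword y into y*(1 + split) and y*(1 - split).
        codebook = [[c * (1 + s) for c in word]
                    for word in codebook for s in (split, -split)]
        previous = float("inf")
        while True:
            # Cluster vectors: assign each one to its nearest codeword.
            clusters = [[] for _ in codebook]
            for v in vectors:
                clusters[nearest(v, codebook)].append(v)
            # Find centroids; an empty cluster keeps its old codeword.
            codebook = [centroid(cl) if cl else word
                        for cl, word in zip(clusters, codebook)]
            distortion = average_distortion(vectors, codebook)
            if distortion == 0 or (previous - distortion) / distortion < eps:
                break
            previous = distortion
    return codebook


def identify(vectors, codebooks):
    """Return the speaker whose codebook gives the smallest average distortion."""
    return min(codebooks, key=lambda name: average_distortion(vectors, codebooks[name]))


# Toy two-dimensional "acoustic vectors" for two speakers.
train_a = [[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8]]
train_b = [[10.0, 10.0], [10.2, 9.8], [15.0, 15.0], [15.2, 14.8]]
codebooks = {"A": lbg(train_a, 2), "B": lbg(train_b, 2)}
speaker = identify([[1.1, 1.0], [5.0, 4.9]], codebooks)
```

In this toy example each speaker's codebook settles on the two cluster centers of that speaker's training vectors, and the test utterance is attributed to the speaker whose codebook quantizes it with the least distortion.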