Final thesis, Thesis of Electronics engineering

It is about speech recognition

Typology: Thesis

2014/2015

Uploaded on 12/22/2015

REAL TIME SPEAKER
RECOGNITION
on
TMS320C6713
A THESIS SUBMITTED IN PARTIAL FULFILLMENT
FOR THE REQUIREMENTS OF
BACHELORS IN TECHNOLOGY
BY
ABHIJIT TRIPATHY 108EC013
&
AKASH SAHOO 108EI010
UNDER THE GUIDANCE OF
PROF. A.K Swain
DEPARTMENT OF ELECTRONICS AND
COMMUNICATION
NATIONAL INSTITUTE OF TECHNOLOGY, ROURKELA
2008-2012

NATIONAL INSTITUTE OF TECHNOLOGY ROURKELA

CERTIFICATE

This is to certify that the thesis titled "Real Time Speaker Recognition on TMS320C6713" submitted by MR. ABHIJIT TRIPATHY (108EC013) and MR. AKASH SAHOO (108EI010) in partial fulfillment of the requirements for the Bachelors in Technology degree in ELECTRONICS AND COMMUNICATION and ELECTRONICS AND INSTRUMENTATION at National Institute of Technology, Rourkela is an authentic work carried out by them under my supervision.

DATE:

PROF. Ayas Kanta Swain

Department of Electronics and Communication

  • 1. Introduction
      • 1.1 Introduction to speaker recognition
      • 1.2 Principles of speaker recognition
      • 1.3 Speech's feature extraction
      • 1.4 Preprocessing
          • 1.4.1 Truncation
          • 1.4.2 Frame blocking
          • 1.4.3 Windowing
          • 1.4.4 STFT
      • 1.5 Feature extraction
      • 1.6 Feature matching
      • 1.7 Gaussian mixture modeling
      • 1.8 Model description
      • 1.9 Maximum likelihood estimations
  • 2. DSP platform
      • 2.1 Introduction
      • 2.2 DSP architecture
      • 2.3 TMS320C6713
  • 3. Programming with CCS
  • 4. Simulink on TMS320C6713
  • 5. Conclusions
  • 6. References

ACKNOWLEDGEMENT

We would like to take this wonderful opportunity of doing our BTech project on "Real Time Speaker Recognition on TMS320C6713", which has given us a lot of learning and practical exposure. First of all, we would like to express our gratitude to our guide, Prof. T.K. Dan, who gave us the opportunity to work on this project. He gave us the independence to do whatever we liked, and he inspired us throughout the project. Secondly, we would like to thank Prof. A.K. Swain, who has been the guiding force in our project. His constant motivation has been of great help to us. We would also like to thank Prof. U.K. Sahoo, who helped us with the signal processing concepts on which we needed clarification. Lastly, we would like to thank Mr. Sumit Kumar Sao, Mr. Kanhu Charan Bhuyan, Mr. Venkat and Mr. Karupannan, who provided us with everything we needed in the Embedded Systems Lab.

ABHIJIT TRIPATHY, 108EC013

AKASH SAHOO, 108EI010

MOTIVATION

Suppose we have audio data that has been recorded every day for several years, and we want to find out which recordings correspond to a particular speaker. This can be done using a speaker recognition system. Consider also the case of video conferencing, where we want to focus a camera on a particular person. This can be easily implemented if a speaker recognition system exists.

Automatic speaker verification and automatic speaker identification systems are an economical solution to the problem of unauthorized access to computers and other digital communication equipment. Since almost all computers have a microphone, and telephone networks are widely available, speaker recognition can be implemented in software. This reduces hardware complexity. Moreover, speaker recognition systems can be made fairly robust to noise, mimicry and environmental changes.

LITERATURE SURVEY

A lot of work on speaker recognition has already been done in industry, research labs and universities. For example, AT&T Labs have built speaker recognition systems. The Nippon Telegraph and Telephone company has its own speaker recognition systems. MIT in the USA, National Tsing Hua University in Taiwan and Texas Instruments have also tested various speaker recognition systems. Applications of automatic speaker identification systems include access control, telephone banking and telephone credit cards. ITT, Lernout & Hauspie, T-Netix and others are known for their automatic speaker verification systems.

It is estimated that cyber crimes involving 3-4 million dollars occur in banks in the USA every year. Imagine how much could be saved by using speaker recognition systems to check these fraudulent transactions. Products like Sprint's Foncard, which uses TI's speaker verification engine, are used to check the occurrence of these kinds of scams.

1.1 Introduction

Speaker recognition is the process of automatically recognizing who is speaking on the basis of the information contained in the speech signal. It is carried out in two phases: a training phase and a testing phase. In the training phase, the speaker utters something so as to provide speech samples. These samples enable the system to build a model of that speaker. In the testing phase, the input speech is matched against the stored reference models and a decision is made.

Speaker recognition can be regarded as a biometric system that validates a person's claim to an identity based on features extracted from speech samples. The training and testing phases make it possible to use the speaker's voice to verify his or her identity and to control access to services such as voice dialing, telephone banking, telephone shopping, voice mail and other voice-activated computer access services.

1.2 Principles of Speaker Recognition

Speaker recognition can be classified into two categories: identification and verification. Speaker identification is the process of determining which registered speaker provides the utterance.

Figure 1 depicting SI (input speech, feature extraction, similarity against the reference models of speakers #1 to #N, maximum selection, identification result as speaker ID)

Speaker verification, on the other hand, is the process of determining whether the identity claimed by the speaker is true or not.

Figure 2 depicting SV

However, speaker recognition is a difficult task. It assumes that the human voice exhibits characteristics that are unique to the speaker. This principle is challenged by the highly variant nature of the speech signal. Speech signals in training and testing can vary greatly due to time-dependent changes of the voice, health conditions, acoustic noise, variations in the environment, etc.

1.3 Speech's Feature Extraction

The feature extraction process converts the speech waveform into a parametric representation for analysis. Feature extraction is also called the signal processing front end. The speech signal is a slowly time-varying signal. Its characteristics are fairly stationary over short periods of time, i.e. between 5 and 100 ms. Over longer periods, i.e. over 200 ms, the characteristics vary because different speech sounds are being spoken. Therefore the STFT (short time Fourier transform) is used. A wide range of possibilities exists for representing the speech signal for the speaker recognition task, such as Linear Prediction Coding (LPC) and Mel-frequency cepstrum coefficients (MFCC).

Figure 3 depicting truncation

1.4.2 Frame Blocking

In the frame blocking step, we block the continuous input speech signal into a finite number of frames. We block the speech into frames of N samples each, with adjacent frames separated by M samples (M < N). The first frame consists of the first N samples, and the second frame begins M samples after the start of the first frame. The second frame therefore overlaps the first frame by N − M samples, the third frame overlaps the second frame by N − M samples, and so on.
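The frame blocking step can be sketched in a few lines of Python. This is only an illustration, not the code run on the TMS320C6713: the function name `frame_blocks` and its parameters `frame_len` (samples per frame) and `hop` (shift between adjacent frames) are assumptions made for this example, and trailing samples that do not fill a whole frame are simply dropped.

```python
def frame_blocks(signal, frame_len, hop):
    """Split `signal` into overlapping frames.

    frame_len: number of samples in each frame.
    hop: shift between the starts of adjacent frames, so consecutive
         frames share frame_len - hop samples.
    Samples at the end that do not fill a whole frame are dropped
    (an assumption of this sketch).
    """
    if not 0 < hop <= frame_len:
        raise ValueError("need 0 < hop <= frame_len")
    frames = []
    start = 0
    while start + frame_len <= len(signal):
        frames.append(list(signal[start:start + frame_len]))
        start += hop
    return frames


samples = list(range(10))  # stand-in for speech samples
frames = frame_blocks(samples, frame_len=4, hop=2)
for frame in frames:
    print(frame)
```

With a frame length of 4 and a shift of 2, each frame repeats the last two samples of the previous one.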

1.4.3 Windowing

After frame blocking we apply windowing. In windowing, we multiply each frame by a window function so as to minimize the signal discontinuities at the beginning and end of each frame. The idea is to taper the signal toward zero at the beginning and end of each frame. The window function $w(n)$ is defined for $0 \le n \le N-1$, where $N$ denotes the number of samples in each frame. We used a Hamming window to minimize the interference from side lobes. The Hamming window has the formula

$$w(n) = 0.54 - 0.46\cos\left(\frac{2\pi n}{N-1}\right), \quad 0 \le n \le N-1.$$

FIGURE 4 depicting a Hamming Window
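As a small illustration of the windowing step, the sketch below builds a Hamming window with the standard coefficients 0.54 and 0.46 and multiplies a frame by it sample by sample. The helper names `hamming` and `apply_window` are assumptions for this example and are not taken from the thesis code.

```python
import math


def hamming(n_points):
    """Return the Hamming window w(n), n = 0 .. n_points - 1."""
    if n_points < 2:
        raise ValueError("window needs at least 2 points")
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (n_points - 1))
            for n in range(n_points)]


def apply_window(frame, window):
    """Multiply a frame by the window, sample by sample."""
    return [x * w for x, w in zip(frame, window)]


window = hamming(5)
windowed = apply_window([1.0] * 5, window)
print([round(w, 4) for w in window])
```

The window equals 1 at the centre of the frame and 0.08 at both ends, so the frame edges are attenuated strongly but not set exactly to zero.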

1.4.4 STFT – Short Time Fourier Transform

Then we compute the fast Fourier transform (FFT), which maps each frame onto the frequency domain. The FFT is a fast algorithm that reduces the number of calculations required to compute the DFT of a given set of samples. For a frame of $N$ samples $x(n)$, the transform is defined as

$$X(k) = \sum_{n=0}^{N-1} x(n)\, e^{-j 2\pi k n / N}, \quad 0 \le k \le N-1.$$

The $X(k)$ values are complex numbers and we consider only their absolute values. The sequence $\{X(k)\}$ is related to frequency as follows. Positive frequencies $0 \le F < F_s/2$ correspond to $0 \le k \le N/2 - 1$. Negative frequencies $-F_s/2 \le F < 0$ correspond to $N/2 \le k \le N-1$. Here $F_s$ is the sampling frequency. The result of this calculation is also called the spectrum or periodogram.
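To make the relation between bin index and frequency concrete, here is a minimal sketch that evaluates the DFT sum directly (an FFT gives the same values faster) and maps each bin to a frequency in Hz. The function names `dft_magnitudes` and `bin_frequency`, the 8 kHz sampling rate and the 1 kHz test tone are assumptions chosen for this example.

```python
import cmath
import math


def dft_magnitudes(frame):
    """Absolute values |X(k)| of the DFT of one frame, by direct summation."""
    n_points = len(frame)
    mags = []
    for k in range(n_points):
        acc = sum(frame[n] * cmath.exp(-2j * cmath.pi * k * n / n_points)
                  for n in range(n_points))
        mags.append(abs(acc))
    return mags


def bin_frequency(k, n_points, fs):
    """Frequency in Hz of bin k; bins N/2 .. N-1 are negative frequencies."""
    if k < n_points // 2:
        return k * fs / n_points
    return (k - n_points) * fs / n_points


fs = 8000   # assumed sampling frequency in Hz
N = 8
tone = [math.cos(2 * math.pi * 1000 * n / fs) for n in range(N)]  # 1 kHz cosine
mags = dft_magnitudes(tone)
for k, m in enumerate(mags):
    print(k, bin_frequency(k, N, fs), round(m, 6))
```

The energy of the 1 kHz cosine appears in bin 1 (+1000 Hz) and in its mirror bin 7 (−1000 Hz), which is why only the first half of the magnitude spectrum is needed for a real signal.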

1.5.3 MFCC Cepstral Coefficients

Figure 6 – MFCC Steps

Figure 6 shows the blocks for computing the MFCCs (mel frequency cepstrum coefficients). If we convert the log mel spectrum back to the time domain, the results we get are called MFCCs. The cepstral representation of speech gives a good representation of the local spectral properties of the signal for a given analysis frame.

Figure 5 showing a mel-spaced filter bank for 20 mel coefficients (horizontal axis: frequency in Hz)

The mel spectrum coefficients, being real quantities, can be converted to the time domain using the discrete cosine transform. If the mel spectrum coefficients (the mel filter bank outputs computed from the absolute values of the FFT coefficients) are denoted $S_k$, where $1 \le k \le K$, then the mel cepstrum coefficients $c_n$ are given by

$$c_n = \sum_{k=1}^{K} \log(S_k)\, \cos\left[ n \left( k - \frac{1}{2} \right) \frac{\pi}{K} \right], \quad n = 0, 1, \ldots, K-1.$$

We exclude the first value $c_0$, since it represents the mean of the signal and is of little importance.
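The cosine-transform step can be illustrated as follows. The sketch assumes the mel spectrum coefficients S_k of one frame are already available as a list of positive numbers, and it returns the coefficients starting from n = 1, i.e. with the first value left out. The function name `mfcc_from_mel_spectrum` and the sample values are hypothetical.

```python
import math


def mfcc_from_mel_spectrum(mel_spectrum, num_coeffs):
    """Cosine transform of the log mel spectrum.

    mel_spectrum: positive values S_1 .. S_K for one frame.
    Returns c_1 .. c_num_coeffs; c_0 is skipped on purpose.
    """
    K = len(mel_spectrum)
    if not 0 < num_coeffs < K:
        raise ValueError("need 0 < num_coeffs < K")
    log_s = [math.log(s) for s in mel_spectrum]
    coeffs = []
    for n in range(1, num_coeffs + 1):
        c_n = sum(log_s[k - 1] * math.cos(n * (k - 0.5) * math.pi / K)
                  for k in range(1, K + 1))
        coeffs.append(c_n)
    return coeffs


flat = mfcc_from_mel_spectrum([2.0, 2.0, 2.0, 2.0], num_coeffs=3)
tilted = mfcc_from_mel_spectrum([1.0, math.e, math.e ** 2, math.e ** 3], num_coeffs=3)
print([round(c, 6) for c in flat])
print([round(c, 6) for c in tilted])
```

A flat mel spectrum yields coefficients that are all zero (up to rounding), because its only content is the mean level carried by the excluded first value; a spectrum with a tilt produces non-zero coefficients.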

Figure 7 showing acoustic vectors corresponding to the 5th and 6th filters of the mel filter bank.

The figure below shows a conceptual diagram depicting the recognition process. In the figure, two speakers and two dimensions of the acoustic space are shown. The circles refer to the acoustic vectors from speaker 1, whereas the triangles are from speaker 2. In the training phase, a clustering algorithm is used to generate a vector quantization codebook specific to each speaker. The resulting codewords (centroids) are shown by black circles for speaker 1 and black triangles for speaker 2. In the testing phase, the acoustic vectors of the input utterance are compared with each speaker's codebook, and the distances from each vector to its nearest codeword are summed to give a total distortion. The speaker corresponding to the smallest total distortion is identified as the utterance's original speaker.

Figure 7 depicting recognition process

The algorithm used to cluster the training vectors is the Linde-Buzo-Gray (LBG) algorithm. Let us illustrate the algorithm for clustering a set of L training vectors into a set of M codebook vectors.

  1. Firstly, generate a single-vector codebook. This is the centroid of the complete set of training vectors.

  2. Having completed the first step, double the size of the codebook by splitting each current codeword $y_n$ according to the rule

     $$y_n^{+} = y_n (1 + \varepsilon)$$

     $$y_n^{-} = y_n (1 - \varepsilon)$$

     where $1 \le n \le$ the current size of the codebook and $\varepsilon$ is a small splitting parameter.
  3. For each training vector, search the current codebook for the nearest neighbour, i.e. the codeword that matches it most closely, and assign the vector to the corresponding cell.
  4. Update the codeword in each cell using the centroid of the training vectors assigned to that cell.
  5. Repeat steps 3 and 4 until the mean distance falls below a predefined threshold.
  6. Repeat steps 2, 3 and 4 until a codebook of size M is designed.
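The steps above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in the project: it uses the squared Euclidean distance, stops the inner loop when the relative drop in mean distortion falls below `tol`, keeps the old codeword when a cell receives no vectors, and expects the target size to be a power of two. The names `lbg`, `eps` and `tol` and the toy data are hypothetical.

```python
def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(vector, codebook):
    """Index of the codeword closest to `vector`."""
    return min(range(len(codebook)), key=lambda i: sq_dist(vector, codebook[i]))


def lbg(training, size, eps=0.01, tol=1e-3):
    """Train a codebook with `size` codewords (size assumed a power of two)."""
    codebook = [centroid(training)]                      # step 1
    while len(codebook) < size:
        # Step 2: split every codeword y into y(1 + eps) and y(1 - eps).
        codebook = [[c * (1 + sign * eps) for c in word]
                    for word in codebook for sign in (1, -1)]
        previous = float("inf")
        while True:
            # Step 3: assign each training vector to its nearest codeword.
            cells = [[] for _ in codebook]
            for v in training:
                cells[nearest(v, codebook)].append(v)
            # Step 4: move each codeword to the centroid of its cell
            # (an empty cell keeps its old codeword).
            codebook = [centroid(cell) if cell else word
                        for cell, word in zip(cells, codebook)]
            # Step 5: stop once the mean distortion no longer improves much.
            mean_dist = sum(sq_dist(v, codebook[nearest(v, codebook)])
                            for v in training) / len(training)
            if previous - mean_dist <= tol * mean_dist:
                break
            previous = mean_dist
    return codebook


training = [[1.0, 1.0], [1.2, 0.8], [0.8, 1.2],
            [5.0, 5.0], [5.2, 4.8], [4.8, 5.2]]
codebook = lbg(training, 2)
print(sorted(codebook))
```

On this toy data the two codewords settle near (1, 1) and (5, 5), the centres of the two groups of training vectors. Because the split is multiplicative, a codeword at the origin would not split; a practical implementation would need to handle that case.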

Figure 8 depicting LBG algorithm (flowchart: find centroid; split each centroid, m = 2m; cluster vectors; find centroids; compute distortion D; if (D' − D)/D < ε does not hold, set D' = D and cluster again; otherwise, if m < M split again, else stop)
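Once a codebook has been trained for each registered speaker, the matching step can be sketched as below: the total distortion of the test vectors is computed against every codebook, and the smallest one decides the identity. The codebook values, the speaker labels and the `threshold` used for the verification variant are made-up numbers for illustration only, and the squared Euclidean distance is an assumption.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def total_distortion(vectors, codebook):
    """Sum over all vectors of the distance to the nearest codeword."""
    return sum(min(sq_dist(v, word) for word in codebook) for v in vectors)


def identify(vectors, codebooks):
    """Speaker identification: pick the codebook with the smallest distortion."""
    return min(codebooks, key=lambda spk: total_distortion(vectors, codebooks[spk]))


def verify(vectors, codebook, threshold):
    """Speaker verification: accept the claim if the mean distortion is small.

    `threshold` is a hypothetical tuning parameter of this sketch.
    """
    return total_distortion(vectors, codebook) / len(vectors) <= threshold


codebooks = {
    "speaker1": [[1.0, 1.0], [2.0, 1.5]],
    "speaker2": [[5.0, 5.0], [6.0, 4.5]],
}
test_vectors = [[1.1, 0.9], [1.9, 1.6]]

result = identify(test_vectors, codebooks)
accepted = verify(test_vectors, codebooks["speaker1"], threshold=0.1)
rejected = verify(test_vectors, codebooks["speaker2"], threshold=0.1)
print(result, accepted, rejected)
```

Identification compares the utterance against all registered speakers, whereas verification only checks the codebook of the claimed speaker against a threshold.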