Speaker Identification System Progress 1-Implementation and Applications In Computer Sciences-Project Report, Study Guides, Projects, Research of Applications of Computer Sciences

Birla Institute of Technology and Science Applications of Computer Sciences

This report is a final-year project submitted to complete a degree in Computer Science. It focuses on Applications of Computer Sciences and was supervised by Dr. Abhisri Yashwant at Bengal Engineering and Science University. Its main points are: Literature, Survey, Databases, Comprehensive, Information, Speakers, Addition, Frames

Typology: Study Guides, Projects, Research

2011/2012

Uploaded on 07/18/2012

padmini 🇮🇳


Table of Contents

1. INTRODUCTION
   1.1 Project Objectives
   1.2 Overview of the Report
2. DATABASE USED FOR THE EXPERIMENTS
   2.1 Literature Survey of Existing Databases
   2.2 Overview of Database Protocol
       2.2.1 Database Characterization
       2.2.2 Designed Tasks and Distribution of Age
   2.3 Comprehensive Information of Speakers in SDSRS
   2.4 Working of Speaker Recognition System
       2.4.1 Results Obtained
   2.5 Experiments Performed on Database for Analysis
       2.5.1 Addition of Noise in Voice Samples
       2.5.2 Pitch Alteration
3. GAUSSIAN MIXTURE MODEL
   3.1 Gaussian Mixture Model for Speaker Recognition
       3.1.1 Frames as Classifiers
4. REFERENCES

List of Figures

Figure 2.1: On-line certificate status protocol (OCSP) publication
Figure 2.2: Screenshot of the poster
Figure 2.3: Components of speaker identification system
Figure 2.4: Computing of mel-cepstrum [1]
Figure 2.5: Comparison of % accuracy between noisy samples
Figure 2.6: Screenshot of the WavLab environment
Figure 3.1: GMMs for speaker recognition [1]


Abstract

Given a speech signal there are two kinds of information that may be extracted from it. On one hand there is the linguistic information about what is being said, and on the other there is also speaker specific information. Nowadays it is obvious that speakers can be identified from their voices. Therefore in this work, “Speaker Identification System”, I have looked into the details of speaker identification by reviewing some well-known techniques used in speaker identification. The problem is made harder when the speakers are not constrained to a particular word sequence, or when there are different sources of errors in the speech database. In this work, the task of improving the performance of Gaussian Mixture models for speaker identification, without a substantial increase in computation, by extracting other features from Gaussian Mixture probabilities that are better indicators of who the speaker is, has been emphasized.


Time Schedule

(MFCC), Linear Predictive Coding (LPC) for extraction and Learning Vector Quantization (LVQ) for the classification purpose. In the implementation phase I have used MFCCs and LPCs as feature extraction techniques and VQ as the feature modeling method.

1.2 Overview of the Report

The rest of this report is organized as follows. Chapter 2 gives a detailed description of the database that was built for evaluating the various speaker recognition algorithms. It also covers the experiments performed on the database to check the performance of the classifiers in real-time applications. Chapter 3 gives a more detailed description of the standard Gaussian Mixture Model for speaker recognition and the situations over which the method is to be evaluated.

2. Database Used for the Experiments

Over the last two decades there has been an increasing interest in speaker recognition. In order to get adequate amounts of speech to train and test the speaker recognition system, speech databases are needed. There are several applications of speaker recognition, leading to a diversity of the structure and content of speaker recognition databases. The most obvious benefit of using standard and readily available (public) databases is that system performances using different techniques on the same database become comparable, hence, enabling quantitative evaluation of methods and speaker recognition protocols.

2.1 Literature Survey of Existing Databases

According to the survey, the organization of speaker recognition databases may be based on features such as:

- The recording protocol
- The population of participating subjects
- The recording device
- Language
- Type of verbal statement
- The intended use

The intra-speaker and inter-speaker variability are important parameters for a speech database. Intra-speaker variability can be very important for speaker recognition performance and can be estimated if the same sentence is read several times by the same subjects. The intra-speaker variation can originate from a variable speaking rate, changing emotions or other mental variables, and environmental noise. The variance brought by different speakers is denoted inter-speaker variance and is caused by the individual variability in vocal systems involving source excitation, vocal tract articulation, and lips and/or nostril radiation. If the inter-speaker variability dominates the intra-speaker variability, speaker recognition is feasible.

Speech databases are most commonly classified into single-session and multi-session. Multi-session databases allow estimation of temporal intra-speaker variability. Combination sets are also possible

rapidly changing or highly degraded, acquisition processes are not always under control, incriminated people exhibit low degree of cooperativeness, etc., inducing a wide range of variability sources on speech utterances. In this sense, real approaches to speaker identification necessarily imply taking into account all these variability factors. In order to isolate, analyze and measure the effect of some of the main variability sources that can be found in real commercial and computer-human interaction applications and their influence in speaker recognition systems, a specific speech database in English called SDSRS has been designed and acquired under controlled conditions. In this report, together with a detailed description of the database, some experimental results are also presented.

2.2.1 Database Characterization

An important secondary outcome of the work on the database survey is a series of questions for characterizing a speaker recognition corpus:

i. name and availability,
ii. speaker material (including questions on the number of speakers, inter-speaker variation, intra-speaker variation, and impostor characterization),
iii. speech contents,
iv. recording equipment,
v. recording environment,
vi. other information.

Statistics: 50 persons (males and females)
Recording Equipment: Recorder, Lab
Format: .wav
Text:

- and that feeling of untouched wilderness continues as we go deep into the mangroves in search of an isolated place called Crocodile Creek
- then judges wander around giving point scores to each bird
- Mr. Wright should write to Ms. Wright right away about his ford or four door Honda
- One two three four five six seven eight nine ten.
- Gorgeous

2.2.2 Designed Tasks and Distribution of age

Consequently, delimiting the problem of speech variability, together with analyzing the quantitative results of speaker recognition systems, will lead to an integral and comprehensive approach to commercial and forensic speaker recognition. All speakers uttered the same sentences:

- Gorgeous
- One two three four five six seven eight nine ten.

In order to determine an adequate age distribution of speakers in the database, the sociological implications of the technology should be taken into account, as an equal distribution of ages may not correspond to the real age distribution of users in a specific commercial application. Likewise, in forensic applications criminals are unequally distributed in age.

During the recording sessions two different microphones (Somic and A-tech4) were used. All the voice samples were recorded in Lab-215 (B-Block) at a sampling rate of 22050 Hz. Posters publicizing the recording sessions were pasted on different notice boards. Announcements about the recording date were also made in the junior classes, and faculty and staff members were asked in person. A screenshot of the poster is shown in Figure 2.2.

Figure 2.2: Screenshot of the poster

2.3 Comprehensive Information of Speakers in SDSRS

Detailed information on all the speakers in the Speech Database for Speaker Recognition

Fayyaz ul Amir Afsar Minhas    SE-Fellow         Lab
Sana Nazir                     8th               Lab
Anam Saleem                    8th               Lab
Irum Inayat                    8th               Lab
Tayyaba Nasir                  8th               Lab
Nausheeen Majeed               8th               Lab
Abdullah Waseer                6th               Lab
Mr. Nauman Shamim              Faculty Member    Lab
Bushra Sadia                   6th               Lab

Table 2.1: Speaker database information

2.4 Working of Speaker Recognition System

First, speech is recorded with a microphone or telephone handset, and environmental noise (computer hum, car engine, door slams, keyboard clicks, traffic noise, background babble, music) adds to the speech wave. Reverberation adds delayed versions of the original signal to the recorded signal [1]. Poor-quality microphones introduce nonlinear distortion to the true speech spectrum. The A/D converter adds its own distortion, and the recording device might pick up interference from mobile phone radio waves. If the speech is transmitted through a telephone network, it is compressed using lossy techniques, which may add further noise to the signal. Speech coding can degrade speaker recognition performance significantly [2].

Feature extraction is the first component in an automatic speaker recognition system. It transforms the raw speech signal into a compact but effective representation that is more stable and discriminative than the original signal. Since the front-end is the first component in the chain, the quality of the later components (speaker modeling and pattern matching) is strongly determined by the quality of the front-end.

Figure 2.3 shows the abstraction of an automatic speaker recognition system. Regardless of the type of task (identification or verification), the system operates in two modes: training and recognition. In the training mode, a new speaker (with known identity) is enrolled into the system's database. In the recognition mode, an unknown speaker gives a speech input and the system makes a decision about the speaker's identity.

Both the training and the recognition modes include feature extraction, sometimes called the front-end of the system. The feature extractor converts the digital speech signal into a sequence of numerical descriptors, called feature vectors. The features provide a more stable, robust, and compact representation than the raw input signal. Feature extraction can be considered a data reduction process that attempts to capture the essential characteristics of the speaker at a small data rate.

Figure 2.3: Components of speaker identification system

In the training phase, a speaker model is created from the feature vectors. The aim is to model the speaker's voice so that it generalizes beyond the training material; in other words, so that unseen vectors can be classified correctly. A recent overview of various modeling techniques is given in [3].

In the recognition phase, features are extracted from the unknown speaker's voice sample. Pattern matching refers to an algorithm, or several algorithms, that compute a match score between the unknown speaker's feature vectors and the models stored in the database. The output of the pattern matching module is a similarity score.

The last phase in the recognition chain is decision making. The decision module takes the match scores as its input and makes the final decision on the speaker identity, possibly with a confidence value [4]. For the verification task, the binary decision is either acceptance or rejection of the speaker. In the case of identification, there are two possibilities. In the closed-set identification task, the decision is the ID number of the speaker most similar to the unknown speaker. In the open-set task, there is an additional decision that the speaker is none of the registered speakers ("no decision").
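The decision rules described above can be sketched in a few lines of Python. This is only an illustration, not the report's implementation: the function names, the speaker IDs, the score values and the thresholds are all hypothetical, and scores are assumed to be similarity scores where higher means a better match.

```python
def identify(scores, threshold=None):
    """Pick a speaker from match scores (higher score = more similar).

    scores:    dict mapping speaker ID -> similarity score (hypothetical format).
    threshold: None for closed-set identification; a number enables the
               open-set rule, which returns None ("no decision") when even
               the best score is too low.
    """
    if not scores:
        raise ValueError("no enrolled speakers")
    best_id = max(scores, key=scores.get)
    if threshold is not None and scores[best_id] < threshold:
        return None  # open-set: none of the registered speakers
    return best_id


def verify(score, threshold):
    """Verification: accept (True) or reject (False) a claimed identity."""
    return score >= threshold


# Made-up scores for three enrolled speakers.
scores = {"spk01": -41.2, "spk02": -37.5, "spk03": -52.0}
print(identify(scores))                          # closed-set decision
print(identify(scores, threshold=-30.0))         # open-set decision
print(verify(scores["spk02"], threshold=-40.0))  # verification decision
```

In practice the threshold would have to be tuned on held-out data; the values here are arbitrary.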

2.4.1 Results Obtained

The implementation of the speaker recognizer is done in two phases, i.e. the training phase and the testing phase, using MFCCs as the feature extraction technique and VQ as a

speaker conditions in different sessions, i.e., he may be ill or thirsty or tired or any other condition he may have. The parameters of the MFCC are shown in Table 2.3.

Parameter                                Value
Sampling frequency                       8 kHz
Window type                              Hamming
Number of coefficients                   19
Number of filters in the filter bank     20
Length of the frame                      256
Frame increment                          100

Table 2.3: Mel-Cepstrum Parameters.

The number of filters in the filter bank was set to 20, chosen with the coverage of the telephone bandwidth in mind. The frame length was set to 256 samples. At a sampling rate of 8 kHz, 256 samples correspond to a frame length of 32 ms (256 / 8000 Hz = 32 ms), short enough that the speech signal can be assumed stationary within one frame. The MFCC feature vectors were then given to a vector quantization classifier. This function takes the MFCC features and the codebook size as input. The codebook size could be 16, 32, 64, 128 or 256, depending upon the signal variations. The most suitable codebook size for my data was 16. A complete description of how the algorithms work is given in [6].
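As a rough illustration of the framing step implied by the parameters above, the sketch below cuts a signal into 256-sample frames with an increment of 100 samples and applies a Hamming window. It covers only the framing stage, not the full mel-cepstrum computation, and the function names and the synthetic test tone are assumptions made for the example.

```python
import math

SAMPLE_RATE_HZ = 8000   # sampling frequency from Table 2.3
FRAME_LEN = 256         # samples per frame
FRAME_INC = 100         # samples between the starts of successive frames


def hamming(n):
    """Hamming window of length n (standard 0.54 / 0.46 form)."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * k / (n - 1)) for k in range(n)]


def frame_signal(samples, frame_len=FRAME_LEN, frame_inc=FRAME_INC):
    """Split samples into overlapping, Hamming-windowed frames.

    Trailing samples that do not fill a whole frame are dropped.
    """
    window = hamming(frame_len)
    frames = []
    start = 0
    while start + frame_len <= len(samples):
        chunk = samples[start:start + frame_len]
        frames.append([s * w for s, w in zip(chunk, window)])
        start += frame_inc
    return frames


# Frame duration and hop in milliseconds.
frame_ms = 1000 * FRAME_LEN / SAMPLE_RATE_HZ
hop_ms = 1000 * FRAME_INC / SAMPLE_RATE_HZ

# One second of a synthetic 440 Hz tone stands in for a voice sample.
one_second = [math.sin(2 * math.pi * 440 * t / SAMPLE_RATE_HZ)
              for t in range(SAMPLE_RATE_HZ)]
frames = frame_signal(one_second)
print(frame_ms, hop_ms, len(frames), len(frames[0]))
```

With these settings consecutive frames overlap by 156 samples, so each frame would yield one feature vector roughly every 12.5 ms of speech.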

2.5 Experiments Performed on Database for Analysis

Different experiments have been performed on the database to analyze the performance of the classifiers in real-time applications.

2.5.1 Addition of Noise in voice samples

Gaussian noise was added to all the voice samples used for training and testing, because it models most of the natural noise that arises from many random sources acting together. The classification accuracy achieved with noisy samples is given in Table 2.4.
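A minimal sketch of such a noise experiment is given below. The report does not state how the noise ratio is defined, so the scaling used here (noise standard deviation equal to the ratio times the RMS level of the clean signal) is an assumption, as are the function name and the synthetic test signal.

```python
import math
import random


def add_gaussian_noise(samples, noise_ratio, seed=None):
    """Return a copy of samples with zero-mean Gaussian noise added.

    ASSUMPTION: the noise standard deviation is noise_ratio times the RMS
    level of the clean signal; the report's own definition is not given.
    """
    rng = random.Random(seed)
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    sigma = noise_ratio * rms
    return [s + rng.gauss(0.0, sigma) for s in samples]


# A synthetic tone stands in for a recorded voice sample.
clean = [math.sin(2 * math.pi * 200 * t / 8000) for t in range(800)]
noisy = add_gaussian_noise(clean, noise_ratio=0.05, seed=0)
print(len(noisy), noisy != clean)
```

Passing a fixed seed makes the corrupted samples reproducible, which helps when the same noisy set has to be used for both training and testing.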

Table 2.4: Testing results for text dependent SR for noisy voice samples

The following graph compares the percentage accuracy for samples with different noise ratios when MFCC is used for feature extraction and VQ for feature matching.

[Chart: Classification Accuracy (vertical axis, 0 to 100) against Noise Ratio (horizontal axis, from 0.02 upward)]

Figure 2.5: Comparison of % accuracy between noisy samples
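To make the VQ matching step more concrete, here is a small sketch of codebook training and minimum-distortion matching. It is not the report's algorithm: plain k-means is used as a stand-in for whatever codebook design method [6] describes, the feature vectors are made-up 2-D points rather than 19-dimensional MFCCs, and all names are hypothetical.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(vec, codebook):
    """Index of the code vector closest to vec."""
    return min(range(len(codebook)), key=lambda k: sq_dist(vec, codebook[k]))


def train_codebook(vectors, size, iterations=20, seed=0):
    """Build a codebook with plain k-means (stand-in for the real method)."""
    rng = random.Random(seed)
    codebook = [list(v) for v in rng.sample(vectors, size)]
    for _ in range(iterations):
        clusters = [[] for _ in range(size)]
        for v in vectors:
            clusters[nearest(v, codebook)].append(v)
        for k, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous code vector
                dim = len(members[0])
                codebook[k] = [sum(m[d] for m in members) / len(members)
                               for d in range(dim)]
    return codebook


def avg_distortion(vectors, codebook):
    """Mean squared distance from each vector to its nearest code vector."""
    return sum(sq_dist(v, codebook[nearest(v, codebook)])
               for v in vectors) / len(vectors)


def closest_speaker(test_vectors, codebooks):
    """Speaker whose codebook gives the lowest average distortion."""
    return min(codebooks,
               key=lambda spk: avg_distortion(test_vectors, codebooks[spk]))


rng = random.Random(1)


def cloud(cx, cy, n=40):
    """Toy feature vectors scattered around (cx, cy)."""
    return [[cx + rng.gauss(0, 0.1), cy + rng.gauss(0, 0.1)] for _ in range(n)]


train = {"A": cloud(0, 0) + cloud(1, 1), "B": cloud(5, 5) + cloud(6, 4)}
codebooks = {spk: train_codebook(vecs, size=2) for spk, vecs in train.items()}
print(closest_speaker(cloud(0, 0, n=10), codebooks))
```

The codebook size of 2 is chosen only to suit the toy data; the report found 16 to be the most suitable size for its own recordings.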

2.5.2 Pitch Alteration

To analyze the performance of the classifiers when the pitch of a particular voice sample is altered, the software WavLab was used. WavLab is proprietary software for professional mastering, high-resolution multi-channel audio editing, audio restoration, sample design and radio broadcast work, right through to complete CD/DVD-A production. It is a standard application for digital audio editing and processing, valued for its flexibility and audio quality.

Features    No. of Speakers Tested    Accuracy (%)
MFCCs       50                        86

3. Gaussian Mixture Model

Various models can be applied to the task of text-independent speaker identification, such as Neural Networks, Vector Quantization, Radial Basis Functions, Hidden Markov Models and Gaussian Mixture Models (GMMs). Among these methods GMMs are usually preferred because they offer high classification accuracy while still being robust to corruptions in the speech signal [7]. A number of different speech features have been shown to be indicative of speaker identity. These include pitch-related features, Linear Prediction Cepstral Coefficients (LPCCs) and Linear Predictive Coding (LPC). Although there are no exclusively speaker-distinguishing features, the speech spectrum has been shown to be very effective for speaker recognition. The focus of this report is on extracting additional information from Gaussian Mixture probabilities irrespective of the features used.

3.1 Gaussian Mixture Model for Speaker Recognition

The ability of Gaussian components to represent some general speaker-dependent spectral shapes, and the capability of Gaussian mixtures to model arbitrary densities, have motivated the use of Gaussian Mixture Models for modeling speaker identity.

Figure 3.1: GMMs for speaker recognition [7]

A GMM is the weighted sum of M component densities, as shown in Figure 3.1, given by the equation

$$p(x \mid \lambda) = \sum_{i=1}^{M} p_i \, b_i(x) \qquad (3.1)$$

where $x$ is a $D$-dimensional speech feature vector taken from $X$, the sequence of feature vectors from the audio data, $b_i(x)$, $i = 1, \dots, M$, are the component densities and $p_i$, $i = 1, \dots, M$, are the mixture weights. Each component density is a $D$-variate Gaussian function of the form

$$b_i(x) = \frac{1}{(2\pi)^{D/2} \, |\Sigma_i|^{1/2}} \exp\left\{ -\frac{1}{2} (x - u_i)' \, \Sigma_i^{-1} (x - u_i) \right\}$$

with mean vector $u_i$ and covariance matrix $\Sigma_i$. The mixture weights are such that

$$\sum_{i=1}^{M} p_i = 1$$

For speaker identification, each speaker is represented by a GMM $\lambda_i$ which is completely parameterized by its mixture weights, means and covariance matrices,

$$\lambda_i = \{ p_i, u_i, \Sigma_i \} \qquad (3.3)$$
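As a worked illustration of Equation 3.1, the following sketch evaluates a GMM log-density and scores a sequence of feature vectors against several speaker models. It makes two simplifying assumptions that are not stated in the report: the covariance matrices are diagonal (so each is stored as a list of variances), and the feature vectors in a sequence are treated as independent, so their log-densities are summed. All function names and the toy models are hypothetical.

```python
import math


def log_gaussian_diag(x, mean, var):
    """Log of a D-variate Gaussian density with DIAGONAL covariance.

    ASSUMPTION: var holds the diagonal of the covariance matrix, so the
    determinant is the product of the variances.
    """
    d = len(x)
    log_det = sum(math.log(v) for v in var)
    maha = sum((xi - mi) ** 2 / vi for xi, mi, vi in zip(x, mean, var))
    return -0.5 * (d * math.log(2 * math.pi) + log_det + maha)


def gmm_log_density(x, weights, means, variances):
    """log p(x | lambda) for a mixture with strictly positive weights."""
    terms = [math.log(p) + log_gaussian_diag(x, m, v)
             for p, m, v in zip(weights, means, variances)]
    top = max(terms)  # log-sum-exp trick for numerical stability
    return top + math.log(sum(math.exp(t - top) for t in terms))


def sequence_log_likelihood(X, model):
    """Sum of per-vector log-densities (vectors assumed independent)."""
    weights, means, variances = model
    return sum(gmm_log_density(x, weights, means, variances) for x in X)


def most_likely_speaker(X, models):
    """Speaker whose model gives the highest log-likelihood for X."""
    return max(models, key=lambda spk: sequence_log_likelihood(X, models[spk]))


# Two toy speaker models, each with M = 2 components in D = 2 dimensions.
models = {
    "spk01": ([0.5, 0.5], [[0.0, 0.0], [1.0, 1.0]], [[1.0, 1.0], [1.0, 1.0]]),
    "spk02": ([0.3, 0.7], [[5.0, 5.0], [6.0, 4.0]], [[1.0, 1.0], [1.0, 1.0]]),
}
X = [[0.2, -0.1], [0.9, 1.1], [0.4, 0.3]]
print(most_likely_speaker(X, models))
```

Working in the log domain avoids underflow when many frames are combined, since the product of per-frame densities quickly becomes too small to represent directly.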

There are two principal motivations for using GMMs to model speaker identity. The first is that the components of such a multi-modal density may represent some underlying set of acoustic classes. It is reasonable to assume that the acoustic space corresponding to a speaker's voice can be characterized by a set of acoustic classes. These acoustic classes reflect some general speaker-dependent vocal tract configurations that are useful for characterizing speaker identity. The spectral shape of the $i$th acoustic class can in turn be represented by the mean $u_i$ and covariance matrix $\Sigma_i$. Because all the training or testing speech is unlabeled, the acoustic classes are hidden in that the class of an observation is unknown.

The second motivation for using Gaussian mixture densities for speaker identification is that a linear combination of Gaussian basis functions is capable of modeling a large class of sample distributions. A GMM can form smooth approximations to arbitrarily shaped densities.

There are several techniques that can be used to estimate the parameters of a GMM $\lambda_i$, which describes the distribution of the training feature vectors. By far the most popular and well-established is Maximum Likelihood (ML)