















This report is a final-year project submitted to complete a degree in Computer Science. It emphasizes applications of computer science. It was supervised by Dr. Abhisri Yashwant at Bengal Engineering and Science University. Its main points are: Literature, Survey, Databases, Comprehensive, Information, Speakers, Addition, Frames
















Given a speech signal, there are two kinds of information that may be extracted from it. On one hand there is the linguistic information about what is being said, and on the other there is speaker-specific information. It is well established that speakers can be identified from their voices. Therefore in this work, “Speaker Identification System”, I have looked into the details of speaker identification by reviewing some well-known techniques used in the field. The problem is made harder when the speakers are not constrained to a particular word sequence, or when there are different sources of error in the speech database. This work emphasizes improving the performance of Gaussian Mixture Models for speaker identification, without a substantial increase in computation, by extracting from the Gaussian mixture probabilities additional features that are better indicators of who the speaker is.
(MFCC), Linear Predictive Coding (LPC) for extraction and Learning Vector Quantization (LVQ) for the classification purpose. In the implementation phase I have used MFCCs and LPCs as feature extraction techniques and VQ as the feature modeling method.
The rest of this report is organized as follows. Chapter 2 gives a detailed description of the database that has been built for the speaker recognition system in order to evaluate the various algorithms. It also contains the experiments performed on the database to check the performance of the classifiers in real-time applications. Chapter 3 gives a more detailed description of the standard Gaussian Mixture Model for speaker recognition and the situations over which the method is to be evaluated.
Over the last two decades there has been an increasing interest in speaker recognition. In order to get adequate amounts of speech to train and test the speaker recognition system, speech databases are needed. There are several applications of speaker recognition, leading to a diversity of the structure and content of speaker recognition databases. The most obvious benefit of using standard and readily available (public) databases is that system performances using different techniques on the same database become comparable, hence, enabling quantitative evaluation of methods and speaker recognition protocols.
According to the survey, the organization of speaker recognition databases may be based on features such as the recording protocol, the population of participating subjects, the recording device, the language, the type of verbal statement, and the intended use. The intra-speaker and inter-speaker variability are important parameters for a speech database. Intra-speaker variability can be very important for speaker recognition performance and can be estimated if the same sentence is read several times by the same subjects. The intra-speaker variation can originate from a variable speaking rate, changing emotions or other mental variables, and environmental noise. The variance brought by different speakers is denoted inter-speaker variance and is caused by the individual variability in vocal systems involving source excitation, vocal tract articulation, and lip and/or nostril radiation. If the inter-speaker variability dominates the intra-speaker variability, speaker recognition is feasible. Speech databases are most commonly classified into single-session and multi-session. Multi-session databases allow estimation of temporal intra-speaker variability. Combination sets are also possible.
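The feasibility condition above (inter-speaker variability dominating intra-speaker variability) can be sketched numerically. The following is a minimal illustration, not the report's method: it assumes a single scalar feature per utterance, and the function name `variability_ratio` and the pitch values are hypothetical, chosen only to show the computation.

```python
from statistics import mean, pvariance

def variability_ratio(features_by_speaker):
    """Return inter-speaker variance divided by mean intra-speaker variance.

    features_by_speaker maps a speaker label to a list of scalar feature
    values, one per repetition of the same sentence by that speaker.
    """
    # Inter-speaker: how far the per-speaker averages lie from each other.
    speaker_means = [mean(values) for values in features_by_speaker.values()]
    inter = pvariance(speaker_means)
    # Intra-speaker: how much each speaker varies across repetitions.
    intra = mean(pvariance(values) for values in features_by_speaker.values())
    return inter / intra

# Hypothetical average-pitch values (Hz), three repetitions per speaker.
data = {
    "speaker_a": [118.0, 120.0, 122.0],
    "speaker_b": [198.0, 200.0, 202.0],
}
ratio = variability_ratio(data)
```

A ratio well above 1 means the speakers differ from each other more than each speaker differs from themselves, which is the situation in which recognition is feasible.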
rapidly changing or highly degraded, acquisition processes are not always under control, incriminated people exhibit low degree of cooperativeness, etc., inducing a wide range of variability sources on speech utterances. In this sense, real approaches to speaker identification necessarily imply taking into account all these variability factors. In order to isolate, analyze and measure the effect of some of the main variability sources that can be found in real commercial and computer-human interaction applications and their influence in speaker recognition systems, a specific speech database in English called SDSRS has been designed and acquired under controlled conditions. In this report, together with a detailed description of the database, some experimental results are also presented.
An important secondary outcome of work with the database survey is a series of questions for characterizing a speaker recognition corpus:
i. name and availability,
ii. speaker material (including questions on the number of speakers, inter-speaker variation, intra-speaker variation, and impostor characterization),
iii. speech contents,
iv. recording equipment,
v. recording environment,
vi. other information.

Statistics: 50 persons (males and females)
Recording Equipment: Recorder, Lab
Format: .wav
Text:
and that feeling of untouched wilderness continues as we go deep into the mangroves in search of an isolated place called Crocodile Creek
then judges wander around giving point scores to each bird
Mr. Wright should write to Ms. Wright right away about his ford or four door Honda
One two three four five six seven eight nine ten.
Gorgeous
Consequently, delimiting the problem of speech variability, together with analyzing the quantitative results of speaker recognition systems, will lead to an integral and comprehensive approach to commercial and forensic speaker recognition. All speakers uttered the same sentences: Gorgeous; One two three four five six seven eight nine ten. In order to determine an adequate age distribution of speakers in the database, sociological implications of technology should be taken into account, as an equal distribution of ages may not correspond to the real age distribution of users in a specific commercial application. On the other hand, in forensic applications criminals are also unequally distributed in age. During the recording sessions two different microphones (Somic and A- tech4 ) were used. All the voice samples were recorded in Lab-215 (B-Block) at a sampling rate of 22050 Hz. Posters publicizing the recording sessions were pasted on different notice boards. Moreover, announcements were made in the junior classes regarding the recording date, and requests were made in person to faculty and staff members. A screenshot of the poster is shown in Figure 2.
Figure 2.2: Screenshot of the poster
Detailed information of all the speakers in Speech database for Speaker Recognition
First, speech is recorded with a microphone or telephone handset, and environmental noise (computer hum, car engine, door slams, keyboard clicks, traffic noise, background babble, music) adds to the speech wave. Reverberation adds delayed versions of the original signal to the recorded signal [1]. Poor-quality microphones introduce nonlinear distortion to the true speech spectrum. The A/D converter adds its own distortion, and the recording device might pick up interference from mobile phone radio waves. If the speech is transmitted through a telephone network, it is compressed using lossy techniques, which may add noise to the signal. Speech coding can degrade speaker recognition performance significantly [2]. Feature extraction is the first component in an automatic speaker recognition system. Feature extraction transforms the raw speech signal into a compact but effective representation that is more stable and discriminative than the original signal. Since the front-end is the first component in the chain, the quality of the later components (speaker modeling and pattern matching) is strongly determined by the quality of the front-end. Figure 2.3 shows the abstraction of an automatic speaker recognition system. Regardless of the type of task (identification or verification), the system operates in two modes: training and recognition. In the training mode, a new speaker (with known identity) is enrolled into the system's database. In the recognition mode, an unknown speaker gives a speech input and the system makes a decision about the speaker's identity. Both the training and the recognition modes include feature extraction, sometimes called the front-end of the system. The feature extractor converts the digital
speech signal into a sequence of numerical descriptors, called feature vectors. The features provide a more stable, robust, and compact representation than the raw input signal. Feature extraction can be considered as a data reduction process that attempts to capture the essential characteristics of the speaker with a small data rate.
Figure 2.3: Components of speaker identification system

In the training phase, a speaker model is created from the feature vectors. The aim is to model the speaker's voice so that it generalizes beyond the training material. In other words, unseen vectors can be classified correctly. A recent overview of various modeling techniques is given in [3]. In the recognition phase, features are extracted from the unknown speaker's voice sample. Pattern matching refers to an algorithm, or several algorithms, that compute a match score between the unknown speaker's feature vectors and the models stored in the database. The output of the pattern matching module is a similarity score. The last phase in the recognition chain is decision making. The decision module takes the match scores as its input, and makes the final decision on the speaker identity, possibly with a confidence value [4]. For the verification task, the binary decision is either acceptance or rejection of the speaker. In the case of identification, there are two possibilities. In the closed-set identification task, the decision is the ID number of the speaker most similar to the unknown speaker. In the open-set task, there is an additional decision that the speaker is none of the registered speakers (“no decision”).
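The three decision rules described above can be sketched in a few lines. This is an illustrative sketch only: it assumes the pattern matching module has already produced one similarity score per enrolled speaker (higher meaning more similar), and the function names, speaker IDs, score values and thresholds are all hypothetical.

```python
def identify_closed_set(scores):
    """Return the enrolled speaker ID with the highest similarity score."""
    return max(scores, key=scores.get)

def identify_open_set(scores, threshold):
    """Return the best speaker ID, or None ("no decision") when even the
    best score does not reach the threshold."""
    best = identify_closed_set(scores)
    return best if scores[best] >= threshold else None

def verify(score, threshold):
    """Accept (True) or reject (False) a claimed identity."""
    return score >= threshold

# Hypothetical similarity scores for one unknown utterance.
scores = {"spk01": 0.42, "spk02": 0.87, "spk03": 0.55}
closed = identify_closed_set(scores)          # best match among enrolled
open_hit = identify_open_set(scores, 0.80)    # best match passes threshold
open_miss = identify_open_set(scores, 0.90)   # nobody passes: no decision
```

If the match score is a distance rather than a similarity (as with VQ distortion), the comparisons would be reversed: the smallest score wins and acceptance requires the score to fall below the threshold.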
The implementation for a speaker recognizer is done in two phases, i.e. training phase and the testing phase, using MFCCs as features extraction technique and VQ as a
speaker conditions in different sessions, i.e., he may be ill or thirsty or tired or in any other condition. Parameters of MFCC are shown in Table 2.

Parameter                              Value
Sampling frequency                     8 kHz
Window type                            Hamming
Number of coefficients                 19
Number of filters in the filter bank   20
Length of the frame                    256
Frame increment                        100

Table 2.3: Mel-Cepstrum Parameters.

The number of filters used in the filter bank was selected to be 20. This number was selected keeping in mind the coverage of the telephone bandwidth. The length of the frame was selected to contain 256 samples. For a sampling rate of 8 kHz, 256 samples correspond to a frame length of 32 ms (256 / 8000 Hz = 32 ms), short enough that the speech signal can be assumed to be stationary within one frame. These MFCC feature vectors were then given to a vector quantization classifier; this function takes as input the MFCC features and the number of codebooks to make. The size of the codebooks could be 16, 32, 64, 128 or 256, depending upon the signal variations. The most suitable codebook size for my data was 16. A complete description of the functioning of the algorithms is given in [6].
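The framing parameters and the VQ matching step can be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation from [6]: the codebook is trained with plain k-means as a stand-in for whatever VQ training the report used, the feature vectors are assumed to be lists of floats (for example 19 MFCCs per frame), and all function names are hypothetical.

```python
import math
import random

def frame_signal(samples, frame_len=256, hop=100):
    """Split a signal into overlapping frames and apply a Hamming window."""
    window = [0.54 - 0.46 * math.cos(2 * math.pi * n / (frame_len - 1))
              for n in range(frame_len)]
    return [[samples[start + n] * window[n] for n in range(frame_len)]
            for start in range(0, len(samples) - frame_len + 1, hop)]

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def train_codebook(vectors, size, iterations=20, seed=0):
    """Build a codebook of `size` codewords with plain k-means (assumed)."""
    rng = random.Random(seed)
    codebook = [list(v) for v in rng.sample(vectors, size)]
    for _ in range(iterations):
        cells = [[] for _ in range(size)]
        for v in vectors:
            nearest = min(range(size), key=lambda k: sq_dist(v, codebook[k]))
            cells[nearest].append(v)
        for k, cell in enumerate(cells):
            if cell:  # keep the old codeword when its cell is empty
                codebook[k] = [sum(col) / len(cell) for col in zip(*cell)]
    return codebook

def average_distortion(vectors, codebook):
    """Mean squared distance from each vector to its nearest codeword."""
    return sum(min(sq_dist(v, c) for c in codebook)
               for v in vectors) / len(vectors)

frame_ms = 1000 * 256 / 8000           # 32.0 ms per frame at 8 kHz
frames = frame_signal([0.0] * 8000)    # one second of (silent) audio
```

In use, one codebook would be trained per enrolled speaker, and an unknown utterance would be assigned to the speaker whose codebook gives the lowest average distortion.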
Different experiments have been performed on database to analyze the performance of classifiers in real time applications.
Gaussian noise has been added to all the voice samples used for training and testing, because it models most of the natural noise that arises from random sources acting together. The classification accuracy achieved with noisy samples is given in Table 2.
Table 2.4: Testing results for text dependent SR for noisy voice samples
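Adding Gaussian noise to a voice sample can be sketched as below. The report does not define its "noise ratio", so this sketch assumes one possible definition: the noise standard deviation is the ratio multiplied by the peak absolute amplitude of the signal. The function name `add_gaussian_noise` is hypothetical.

```python
import random

def add_gaussian_noise(samples, noise_ratio, seed=None):
    """Add zero-mean Gaussian noise to a list of audio samples.

    Assumed definition: sigma = noise_ratio * peak absolute amplitude.
    """
    rng = random.Random(seed)
    peak = max(abs(s) for s in samples)
    sigma = noise_ratio * peak
    return [s + rng.gauss(0.0, sigma) for s in samples]

# Example: a square-like test signal with peak amplitude 1.0.
clean = [1.0, -1.0] * 5000
noisy = add_gaussian_noise(clean, 0.05, seed=0)
```

Other definitions (for example a ratio of noise power to signal power) would change only how `sigma` is computed.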
The following graph compares the percentage accuracy for samples with different noise ratios when MFCC is used for feature extraction and VQ is used as the feature matching technique.
[Graph: Classification Accuracy (%, axis from 0 to 100) versus Noise Ratio]
Figure 2.5: Comparison of % accuracy between noisy samples
To analyze the performance of the classifiers when the pitch of a particular voice sample is altered, the software WavLab has been used. WavLab is proprietary software used for professional mastering, high-resolution multi-channel audio editing, audio restoration, sample design and radio broadcast work, right through to complete CD/DVD-A production. It is already a standard application for digital audio editing and processing owing to its flexibility and audio quality.
Features    No. of Speakers Tested    Accuracy (%)
MFCCs       50                        86
Various models can be applied to the task of text-independent speaker identification, such as Neural Networks, Vector Quantization, Radial Basis Functions, Hidden Markov Models and Gaussian Mixture Models (GMMs). Among these methods GMMs are usually preferred because they offer high classification accuracy while still being robust to corruptions in the speech signal [7]. There are a number of different speech features that have been shown to be indicative of speaker identity. These include pitch-related features, Linear Prediction Cepstral Coefficients (LPCCs) and Linear Predictive Coding (LPC). Although there are no exclusively speaker-distinguishing features, the speech spectrum has been shown to be very effective for speaker recognition. The focus of this report is on extracting additional information from Gaussian mixture probabilities irrespective of the features used.
Representation of some general speaker-dependent spectral shapes by Gaussian components, and the capability of Gaussian mixtures to model arbitrary densities, have motivated the use of Gaussian Mixture Models for modeling speaker identity.

Figure 3.1: GMMs for speaker recognition [7]

A GMM is the weighted sum of M component densities, as shown in Figure 3.1, given by the equation

$$p(x \mid \lambda) = \sum_{i=1}^{M} p_i \, b_i(x)$$

where X is a sequence of feature vectors from the audio data, x is a D-dimensional speech feature vector, $b_i(x)$, $i = 1, \ldots, M$, are the component densities, $p_i$, $i = 1, \ldots, M$, are the mixture weights, and $\lambda$ denotes the model parameters. Each component density is a D-variate Gaussian function of the form

$$b_i(x) = \frac{1}{(2\pi)^{D/2} \, |\Sigma_i|^{1/2}} \exp\left\{ -\frac{1}{2} (x - \mu_i)' \, \Sigma_i^{-1} (x - \mu_i) \right\}$$

with mean vector $\mu_i$ and covariance matrix $\Sigma_i$. The mixture weights satisfy the constraint

$$\sum_{i=1}^{M} p_i = 1$$

A GMM is completely parameterized by its mixture weights, means and covariance matrices, $\lambda = \{p_i, \mu_i, \Sigma_i\}$, $i = 1, \ldots, M$.
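Evaluating the mixture density for a feature vector can be sketched as follows. This is an illustrative sketch rather than the report's implementation: it assumes diagonal covariance matrices (each covariance is passed as a list of per-dimension variances), works in the log domain for numerical stability, and treats the frames of a sequence as independent. All function names are hypothetical.

```python
import math

def log_gaussian_diag(x, mean, var):
    """Log of a D-variate Gaussian density with diagonal covariance."""
    d = len(x)
    log_det = sum(math.log(v) for v in var)           # log |Sigma_i|
    maha = sum((xi - mi) ** 2 / vi                    # (x-mu)' Sigma^-1 (x-mu)
               for xi, mi, vi in zip(x, mean, var))
    return -0.5 * (d * math.log(2 * math.pi) + log_det + maha)

def gmm_log_density(x, weights, means, variances):
    """log p(x | lambda) = log sum_i p_i b_i(x), computed via log-sum-exp."""
    terms = [math.log(w) + log_gaussian_diag(x, m, v)
             for w, m, v in zip(weights, means, variances)]
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))

def sequence_log_likelihood(frames, weights, means, variances):
    """Sum of per-frame log densities (frames assumed independent)."""
    return sum(gmm_log_density(x, weights, means, variances) for x in frames)

# Hypothetical two-component model in two dimensions.
weights = [0.5, 0.5]
means = [[0.0, 0.0], [1.0, 1.0]]
variances = [[1.0, 1.0], [1.0, 1.0]]
score = sequence_log_likelihood([[0.1, -0.2], [0.9, 1.1]],
                                weights, means, variances)
```

For identification, one such model would be held per enrolled speaker, and the speaker whose model gives the highest sequence log-likelihood would be chosen.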
There are two principal motivations for using GMMs to model speaker identity. The first is that the components of such a multi-modal density may represent some underlying set of acoustic classes. It is reasonable to assume that the acoustic space corresponding to a speaker's voice can be characterized by a set of acoustic classes. These acoustic classes reflect some general speaker-dependent vocal tract configurations
that are useful for characterizing speaker identity. The spectral shape of the ith acoustic
the training or testing speech is unlabeled, the acoustic classes are hidden in that the class of an observation is unknown. The second motivation for using Gaussian mixture densities for speaker identification is that a linear combination of Gaussian basis functions is capable of modeling a large class of sample distributions. A GMM can form smooth approximations to arbitrarily shaped densities. There are several techniques that can be used to estimate
vectors. By far the most popular and well-established is Maximum Likelihood (ML)