Face Detection and Recognition in Indoor Environment
HADI ESTEKI
Master of Science Thesis Stockholm, Sweden 2007
Master’s Thesis in Numerical Analysis (30 ECTS credits) at the Scientific Computing International Master Program Royal Institute of Technology year 2007 Supervisors at CSC were Alireza Tavakoli Targhi and Babak Rasolzadeh Examiner was Axel Ruhe TRITA-CSC-E 2007:139 ISRN-KTH/CSC/E--07/139--SE ISSN-1653-5715 Royal Institute of Technology School of Computer Science and Communication KTH CSC SE-100 44 Stockholm, Sweden URL: www.csc.kth.se
Abstract
This thesis examines and implements some state-of-the-art methods for human face recognition. In order to examine all aspects of the face recognition task at a generic level, we have divided the task into three consecutive steps: 1) skin color detection (for identifying skin-colored regions), 2) face detection (no identification) and 3) face recognition (for identifying which face is detected). Using a statistical model for the color region we can narrow down the possible regions in which faces could be found. Furthermore, trying to do an identification on a region that does not include a face is inefficient, since identification is a computationally complex process. Therefore, running a faster and less complex algorithm for general face detection before face recognition is a faster way to identify faces. In this work, we use a machine learning approach that boosts weaker hypotheses into a stronger one in order to detect faces. For face recognition, we look deeper into the appearance of the face and define the identity as a specific texture that we try to recognize under different appearances. Finally, we merge all the different steps into an integrated system for full-frontal face detection and recognition. We evaluate the system based on accuracy and performance.
Acknowledgement
This research would not have been started at all without the great response in the early days from Alireza Tavakoli. Without the encouraging support from my supervisors, Alireza Tavakoli and Babak Rasolzadeh at the Royal Institute of Technology (KTH-CVAP), there would have been very little to write. To this end, the great interest shown by my examiner Axel Ruhe at the Royal Institute of Technology (KTH-NA) has been encouraging.
Acknowledgement
1 Introduction
  1.1 Face Detection and Recognition
  1.2 Review of Recent Work
  1.3 Thesis Outline
2 Color: An Overview
  2.1 Skin modeling
    2.1.1 Gaussian Model
    2.1.2 Single Gaussian
    2.1.3 Mixture of Gaussians
  2.2 Skin-Color Space
    2.2.1 RGB
    2.2.2 Normalized RGB
    2.2.3 HSI, HSV, HSL - Hue Saturation Intensity (Value, Lightness)
    2.2.4 YCrCb
    2.2.5 Which skin color space
3 Face Detection
  3.1 Feature Based Techniques
  3.2 Image Based Techniques
  3.3 Face Detection based on Haar-like features and AdaBoost algorithm
    3.3.1 Haar-like features: Feature extraction
    3.3.2 Integral Image: Speed up the feature extraction
4 Face Recognition
  4.1 Local Binary Patterns
5 The Face Recognition System
  5.1 Pre-processing with Gaussian Mixture Model
  5.2 Face Detection
    5.2.1 Feature extraction with AdaBoost
    5.2.2 A fast decision structure (The Cascade)
  5.3 Face Recognition based on LBP
6 The Experiment Evaluation
  6.1 Gaussian Mixture Model
    6.1.1 Skin color space
    6.1.2 Number of components
    6.1.3 Results
  6.2 Face Detection
  6.3 Face Recognition
    6.3.1 Results
  6.4 Overall performance
7 Conclusions and Future Works
  7.1 Conclusions
  7.2 Future work
List of Figures
Introduction
1.1 Face Detection and Recognition
Finding faces in an arbitrary scene and successfully recognizing them has been an active topic in Computer Vision for decades. A general statement of the face recognition problem (in computer vision) can be formulated as follows: Given still or video images of a scene, identify or verify one or more persons in the scene using a stored database of faces. Face detection and recognition is still an unsolved problem, meaning that there is no 100% accurate face detection and recognition system. However, during the past decade many methods and techniques have gradually been developed and applied to the problem.
Basically, there are three types of methods in automatic face recognition: verification, identification and watch-list. In the verification method, only two images are compared. The comparison is positive if the two images match. In the identification method, more than one comparison is done in order to return the closest match to the input image. The watch-list method works like the identification method, with the difference that the input face can also be rejected (no match).
The method presented in this thesis consists of three steps: skin detection, face detection, and face recognition. The novelty of the proposed method is the use of a skin detection filter as a pre-processing step for face detection. A scheme of the main tasks is shown in Figure 1.1.
• Skin Detection: The first step of the system consists of detecting the skin color regions in the input data, which generally consists of still images or video sequences taken from some source such as a camera, a scanner or a file. Experience suggests that human skin has a characteristic color, which is easily recognized by humans. The aim here is to employ skin color modeling for face detection.
• Face Detection: The second step is detecting the presence of face(s) and determining their locations from the result of step 1. Thus, the input of step 2 is the output of step 1. If no skin is detected, the entire original image is used as input for face detection.
• Facial Representation and Matching: In the third step, a facial representation capturing the characteristics of the face is computed. A comparison for matching is then done. If the match is close enough to one of the faces in the database, the identity of the face is determined. Otherwise, the face is rejected in the watch-list mode, and the closest face is returned in the identification mode.
Figure 1.1. General scheme for our system
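The difference between the identification and watch-list modes described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `match_face`, the dictionary of distances, and the threshold values are hypothetical and are not taken from the thesis; a smaller distance is assumed to mean a closer match.

```python
def match_face(distances, threshold, mode="identification"):
    """Return the matched identity, or None when the face is rejected.

    distances: dict mapping a database identity to the distance between its
               stored representation and the input face (smaller = closer).
    threshold: hypothetical acceptance threshold on that distance.
    mode:      "identification" always returns the closest identity;
               "watch-list" rejects the face when no entry is close enough.
    """
    if not distances:
        return None
    best_id = min(distances, key=distances.get)
    if mode == "watch-list" and distances[best_id] > threshold:
        return None  # no stored face is close enough: reject
    return best_id


# Made-up distances for three stored identities.
db_distances = {"alice": 0.42, "bob": 0.17, "carol": 0.80}
print(match_face(db_distances, threshold=0.25))
print(match_face(db_distances, threshold=0.10, mode="watch-list"))
```

With these made-up numbers the first call returns the closest identity, while the second call rejects the face because even the closest entry exceeds the threshold.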
1.2 Review of Recent Work
A primitive face detection method is to find faces in images with a controlled background, using either a plain single-color background or a predefined static background. Removing the background then always yields the face boundaries; the drawback of such methods is that they depend on this controlled setting.
When color is available, another method is to find faces with the help of color. Given access to color images, one can use the typical skin color to find face segments. The process is carried out in two main steps. The first step is skin filtering: detecting regions in the color image which are likely to contain human skin. The result of this step, followed by thresholding, is a binary skin map which shows the skin regions. The second step is face detection: extracting information from regions which might indicate the location of a face, by taking the marked skin regions (from the first step) and removing the darkest and brightest regions from the map. Empirical tests have shown that the removed regions correspond to those parts of faces which are usually the eyes and eyebrows, nostrils, and mouth. Thus, the skin detection is performed using a skin filter which relies on color and texture information. The face detection is performed on a greyscale image containing only the detected skin areas. A combination of thresholding and mathematical morphology is used to extract object features that would indicate the presence of a face. The face detection process works predictably and fairly reliably. The test results show very good performance when a face occupies a large portion of the image, and reasonable performance on images depicting people as part of a larger scene. The main drawbacks are:
1. Risk of detecting non-face objects when the face objects do not occupy a significant area in the image.
2. Large skin map (for example, a naked person as the image).
Finally, the method does not work with all kinds of skin colors, and is not very robust under varying lighting conditions.
Another method is face detection in color images using Principal Component Analysis (PCA). PCA can be used for the localization of the face region. An image pattern is classified as a face if its distance to the face model in the face space is smaller than a certain threshold. The disadvantage of this method is that it leads to a significant number of false classifications if the face region is relatively small. A classification based on shape may fail if only parts of the face are detected or the face region is merged with skin-colored background. In , the color information and the face detection scheme based on PCA are incorporated in such a way that instead of performing a pixel-based color segmentation, a new image which indicates the probability of each image pixel belonging to a skin region (skin probability image) is created. Using the fact that the original luminance image and the probability image have similar gray level distributions in facial regions, PCA is used to detect facial regions in the probability image. The utilization of color information in a PCA framework results in robust face detection even in the presence of complex and skin-colored backgrounds.
Hausdorff Distance (HD) is another method used for face detection. HD is a metric between two point sets. It is used in image processing as a similarity measure between a general face model and possible instances of the object within the image. According to the definition of HD in 2D, if $A = \{a_1, \dots, a_n\}$ and $B = \{b_1, \dots, b_m\}$ denote two finite point sets, then
$$H(A,B) = \max\bigl(h(A,B),\, h(B,A)\bigr),$$
$$h(A,B) = \max_{a \in A} \min_{b \in B} \lVert a - b \rVert .$$
In , $h(A,B)$ is called the directed Hausdorff distance from set $A$ to $B$. A modification of the above definition, the so-called modified Hausdorff distance (MHD), is useful for image processing applications. It is defined as
$$h_{\mathrm{mod}}(A,B) = \frac{1}{|A|} \sum_{a \in A} \min_{b \in B} \lVert a - b \rVert .$$
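The definitions above translate almost literally into code. The following is a minimal brute-force sketch, assuming point sets given as lists of 2D tuples; the function names and the sample points are illustrative only and are not part of the cited system.

```python
import math


def directed_hd(A, B):
    """Directed Hausdorff distance h(A, B): max over a in A of min over b in B."""
    return max(min(math.dist(a, b) for b in B) for a in A)


def hausdorff(A, B):
    """Symmetric Hausdorff distance H(A, B) = max(h(A, B), h(B, A))."""
    return max(directed_hd(A, B), directed_hd(B, A))


def modified_hd(A, B):
    """Modified directed distance: mean (instead of max) of the nearest distances."""
    return sum(min(math.dist(a, b) for b in B) for a in A) / len(A)


# Hypothetical point sets: A equals B plus one far-away outlier.
B = [(0, 0), (1, 0), (0, 1)]
A = B + [(10, 10)]

print(directed_hd(A, B))   # dominated entirely by the outlier
print(modified_hd(A, B))   # the outlier's contribution is averaged over |A| points
print(directed_hd(B, A))   # every point of B also lies in A
```

In this toy example the single outlier determines the directed distance completely, while the averaged version is four times smaller because the outlier is one of four points.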
By taking the average of the single point distances, the modified version decreases the impact of outliers, making it more suitable for pattern recognition purposes. Now let $A$ and $B$ be the image and the object respectively. The goal is to find the transformation parameters $p$ such that the HD between the transformed model $T_p(B)$ and $A$ is minimized. The detection optimization problem can be formulated as:
$$d_{\bar{p}} = \min_{p \in P} H(A, T_p(B))$$
Here $T_p(B)$ is the transformed model, and $h(T_p(B), A)$ and $h(A, T_p(B))$ are the forward and reverse distance, respectively. The value of $d_{\bar{p}}$ is the distance value of the best matching position and scale. The implemented face detection system consists of a coarse detection phase and a refinement phase, each containing a segmentation and a localization step.
Coarse detection: An AOI (Area Of Interest) with a preset width/height ratio is defined for an incoming image. This AOI is then resampled to a fixed size which is independent of the dimensions of the image.
• Segmentation: An edge intensity image is calculated from the resized AOI with the Sobel operator. Local thresholding then gives the edge points.
• Localization: The modified forward distance, $h_{\mathrm{mod}}(T_p(B), A)$, is sufficient to give an initial guess for the best position. The $d_{\bar{p}}$ obtained by minimizing $h_{\mathrm{mod}}(T_p(B), A)$ is the input for the next step (refinement).
Refinement phase: Given $d_{\bar{p}}$, a second AOI is defined covering the expected area of the face. This AOI is resampled from the original image, resulting in a greyscale image of the face area. Segmentation and localization then proceed as in the previous phase, with the modified box reverse distance $h_{\mathrm{box}}(\bar{A}', T_{p'}(B'))$. Validation is based on the distance between the expected and the estimated eye positions, the so-called (normalized) relative error, defined as
$$d_{\mathrm{eye}} = \frac{\max(d_l, d_r)}{\lVert C_l - C_r \rVert},$$
where $d_l$ and $d_r$ are the distances between the true eye centers and the estimated positions, and $\lVert C_l - C_r \rVert$ is the distance between the two eye centers $C_l$ and $C_r$. In , a face is found if $d_{\mathrm{eye}} < 0.25$. Two different databases are utilized. The first one contains 1180 color images of 295 test persons (360 × 288). The second one contains 1521 images of 23 persons with a larger variety of illumination, background and face size (384 × 288). The robustness of the method is 98.4% on the first test set and 91.8% on the second. The average processing time per frame on a PIII 850 MHz system is 23.5 ms for the coarse detection step and an additional 7.0 ms for the refinement step, which allows use in real-time video applications (> 30 fps).
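As a quick check of the real-time claim, using only the two timings quoted above and assuming both steps run sequentially on every frame:

```latex
t_{\text{frame}} = 23.5\,\text{ms} + 7.0\,\text{ms} = 30.5\,\text{ms},
\qquad
\frac{1000\,\text{ms/s}}{30.5\,\text{ms/frame}} \approx 32.8\ \text{frames/s} > 30\ \text{fps}.
```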
A major problem of the Hausdorff Distance method is the actual creation of a proper face model ($T_p(B)$). While a simple "hand-drawn" model is sufficient for the detection of simple objects, a general face model must cover the broad variety of different faces. To optimize the method, finding a well-suited model for HD-based face localization can be formulated as a discrete global optimization problem. For this problem a Genetic Algorithm (GA) is employed as a standard approach to multi-dimensional global optimization, namely the simple Genetic Algorithm (SGA) described by Goldberg.
A Genetic Algorithm (GA) approach is presented for obtaining a binary edge model that allows localization of a wide variety of faces with the HD method. The GA performs better when starting from scratch than from a hand-drawn model. Three different initializations of the population are tested in : blank model, average edge model, and hand-drawn model. Localization performance improves from 60% to 90%. Therefore, GA is a powerful tool that can help in finding an appropriate model for face localization. Face localization can be improved further by a multi-step detection approach that uses more than one model at different levels of detail. Each of these models can then be optimized separately. This not only speeds up the localization procedure but also produces more exact face coordinates.
1.3 Thesis Outline
The rest of the thesis consists of three main parts, namely color, face detection and face recognition. Each part is described in a separate chapter. Chapter 2 discusses color. Chapters 3 and 4 explain the basic principles of face detection and recognition, respectively. Chapter 5 discusses the method used in this work, which comprises three main parts. The experimental evaluation is then presented in Chapter 6, and finally the conclusions and proposed future work in Chapter 7.
Color: An Overview
Skin color has proved to be a useful and robust cue for face detection. Image content filtering and image color balancing applications can also benefit from automatic detection of skin regions in images. Numerous techniques for skin color modeling and recognition have been proposed in the past years. Face detection methods that use skin color as a detection cue have gained strong popularity among other techniques. Color allows fast processing and is highly robust to geometric variations of the skin pattern. Experience suggests that human skin has a characteristic color, which is easily recognized by humans. Employing skin color modeling for face detection is therefore an idea suggested both by the task properties and by common sense. In this chapter, we discuss pixel-based skin detection methods, which classify each pixel as skin or non-skin individually. Our goal is to evaluate the two most important color spaces and to summarize their advantages.
2.1 Skin modeling
The major goal of skin modeling is to discriminate between skin and non-skin pixels. This is usually accomplished by introducing a metric which measures the distance (in a general sense) of a pixel color to skin tone. The type of this metric is defined by the skin color modeling method. A classification of skin-color modeling is accomplished by . In this work, the Gaussian model will be discussed.
2.1.1 Gaussian Model
The Gaussian model is the most popular parametric skin model. The model performance directly depends on the representativeness of the training set, which is going to be more compact for certain applications in skin model representation.
2.1.2 Single Gaussian
Skin color distribution can be modeled by an elliptical Gaussian joint probability density function (pdf), defined as:
$$P(c \mid skin) = \frac{1}{2\pi |\Sigma_s|^{1/2}} \exp\left(-\frac{1}{2} (c - \mu_s)^T \Sigma_s^{-1} (c - \mu_s)\right)$$
Here, $c$ is a color vector and $\mu_s$ and $\Sigma_s$ are the distribution parameters (mean vector and covariance matrix respectively). The model parameters are estimated from the training data by:
$$\mu_s = \frac{1}{n} \sum_{j=1}^{n} c_j \quad \text{and} \quad \Sigma_s = \frac{1}{n-1} \sum_{j=1}^{n} (c_j - \mu_s)(c_j - \mu_s)^T,$$
where $j = 1, \dots, n$ and $n$ is the total number of skin color samples $c_j$. The probability $P(c \mid skin)$ can be used directly as the measure of how "skin-like" the color $c$ is, or alternatively, the Mahalanobis distance from the color vector $c$ to the mean vector $\mu_s$, given the covariance matrix $\Sigma_s$, can serve the same purpose:
$$\lambda_s(c) = (c - \mu_s)^T \Sigma_s^{-1} (c - \mu_s)$$
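A minimal sketch of the single-Gaussian model for a two-component color vector may make the estimation and the two skin-likeness measures concrete. The function names and the five training samples below are made up for illustration (they are not data from the thesis), and the 2×2 covariance is inverted by hand so that only the standard library is needed.

```python
import math


def fit_gaussian(samples):
    """Estimate the mean and the unbiased covariance of 2-D color samples."""
    n = len(samples)
    mx = sum(c[0] for c in samples) / n
    my = sum(c[1] for c in samples) / n
    sxx = sum((c[0] - mx) ** 2 for c in samples) / (n - 1)
    syy = sum((c[1] - my) ** 2 for c in samples) / (n - 1)
    sxy = sum((c[0] - mx) * (c[1] - my) for c in samples) / (n - 1)
    return (mx, my), ((sxx, sxy), (sxy, syy))


def mahalanobis_sq(c, mean, cov):
    """lambda_s(c) = (c - mu)^T Sigma^{-1} (c - mu) for a 2x2 covariance."""
    (a, b), (_, d) = cov
    det = a * d - b * b
    dx, dy = c[0] - mean[0], c[1] - mean[1]
    # The inverse of [[a, b], [b, d]] is (1 / det) * [[d, -b], [-b, a]].
    return (d * dx * dx - 2 * b * dx * dy + a * dy * dy) / det


def skin_pdf(c, mean, cov):
    """P(c | skin) for the 2-D elliptical Gaussian."""
    (a, b), (_, d) = cov
    det = a * d - b * b
    return math.exp(-0.5 * mahalanobis_sq(c, mean, cov)) / (2 * math.pi * math.sqrt(det))


# Hypothetical chrominance samples of skin pixels (illustrative values only).
samples = [(150, 110), (155, 104), (148, 113), (160, 101), (152, 107)]
mu, sigma = fit_gaussian(samples)
print(mu, sigma)
print(skin_pdf((152, 108), mu, sigma), skin_pdf((100, 150), mu, sigma))
```

A color close to the estimated mean receives a much higher density (and a smaller Mahalanobis distance) than a color far from it, which is what a threshold on either quantity exploits.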
2.1.3 Mixture of Gaussians
A more sophisticated model, capable of describing complex-shaped distributions, is the Gaussian mixture model. It is the generalization of the single Gaussian; the pdf in this case is:
$$P(c \mid skin) = \sum_{i=1}^{k} \pi_i \, P_i(c \mid skin),$$
where $k$ is the number of mixture components, $\pi_i$ are the mixing parameters, obeying the normalization constraint $\sum_{i=1}^{k} \pi_i = 1$, and $P_i(c \mid skin)$ are Gaussian pdfs, each with its own mean and covariance matrix. Model training is performed with a well-known iterative technique called the Expectation Maximization (EM) algorithm, which assumes the number of components $k$ to be known beforehand. The details of training a Gaussian mixture model with EM can be found, for example, in . Classification with a Gaussian mixture model is done by comparing the $P(c \mid skin)$ value to some threshold. The choice of the number of components $k$ is important here. The model needs to explain the training data reasonably well on the one hand, and avoid over-fitting on the other. The number of components used by different researchers varies significantly: from 2 in  to 16 in . A bootstrap test for justification of the $k = 2$ hypothesis was performed in . In , $k = 8$ was chosen as a "good compromise between the accuracy of estimation of the true distributions and the computational load for thresholding".
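Once the parameters are known, evaluating the mixture and thresholding it is straightforward. The sketch below covers only this classification step, not EM training; the two components, their weights and the threshold are hypothetical values chosen for illustration, and the color vector is assumed to be two-dimensional.

```python
import math


def gaussian_pdf(c, mean, cov):
    """2-D Gaussian density with a 2x2 covariance given as nested tuples."""
    (a, b), (_, d) = cov
    det = a * d - b * b
    dx, dy = c[0] - mean[0], c[1] - mean[1]
    maha = (d * dx * dx - 2 * b * dx * dy + a * dy * dy) / det
    return math.exp(-0.5 * maha) / (2 * math.pi * math.sqrt(det))


def mixture_pdf(c, components):
    """P(c | skin) as a weighted sum; components = [(weight, mean, cov), ...]."""
    return sum(w * gaussian_pdf(c, m, s) for w, m, s in components)


def is_skin(c, components, threshold):
    """Classify a color as skin when the mixture density reaches the threshold."""
    return mixture_pdf(c, components) >= threshold


# Hypothetical k = 2 mixture; the weights obey the normalization constraint.
components = [
    (0.6, (153.0, 107.0), ((22.0, -10.0), (-10.0, 22.0))),
    (0.4, (140.0, 120.0), ((40.0, 0.0), (0.0, 40.0))),
]

print(is_skin((152, 108), components, threshold=1e-4))
print(is_skin((60, 200), components, threshold=1e-4))
```

The threshold trades false positives against false negatives, so in practice it would be tuned on labeled data rather than fixed in advance as it is here.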
2.2 Skin-Color Space
Skin color detection and modelling have been frequently used for face detection. A brief survey of common color spaces is given first. Then, we try to determine which color space is more appropriate for our purposes.
2.2.1 RGB
RGB is a color space that originated from CRT display applications (or similar applications), where it is convenient to describe color as a combination of three colored rays (red, green and blue). It is one of the most widely used color spaces for processing and storing digital image data. However, the high correlation between channels, significant perceptual non-uniformity, and the mixing of chrominance and luminance data make RGB a not very favorable choice for color analysis and color-based recognition algorithms.
2.2.2 Normalized RGB
Normalized RGB is a representation that is easily obtained from the RGB values by a simple normalization procedure:
$$r = \frac{R}{R+G+B}, \qquad g = \frac{G}{R+G+B}, \qquad b = \frac{B}{R+G+B}.$$
As the sum of the three normalized components is known (r + g + b = 1), the third component does not hold any significant information and can be omitted, reducing the space dimensionality. The remaining components are often called "pure colors", because the dependence of r and g on the brightness of the source RGB color is diminished by the normalization. A remarkable property of this representation holds for matte surfaces: when ambient light is ignored, normalized RGB is invariant (under certain assumptions) to changes of surface orientation relative to the light source. This, together with the simplicity of the transformation, helped this color space gain popularity among researchers.
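The normalization and its reduced sensitivity to brightness can be illustrated with a short sketch. The function name is hypothetical, and the handling of a black pixel (where the sum is zero) is an assumption made here, since the text does not address that case.

```python
def normalize_rgb(R, G, B):
    """Return the (r, g) chromaticity pair; b = 1 - r - g is redundant."""
    total = R + G + B
    if total == 0:
        # Black pixel: chromaticity is undefined. Returning equal shares is
        # a convention assumed here, not something stated in the text.
        return (1 / 3, 1 / 3)
    return (R / total, G / total)


print(normalize_rgb(200, 120, 80))
print(normalize_rgb(100, 60, 40))   # same color at half the brightness
```

Both calls give the same (r, g) pair, which is the sense in which the normalization removes the brightness of the source color.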
2.2.3 HSI, HSV, HSL - Hue Saturation Intensity (Value, Lightness)
Hue-saturation based color spaces were introduced when there was a need for the user to specify color properties numerically. They describe color with intuitive values, based on the artist’s idea of tint, saturation and tone. Hue defines the dominant color (such as red, green, purple and yellow) of an area; saturation measures the colorfulness of an area in proportion to its brightness. The "intensity", "lightness" or "value" is related to the color luminance. The intuitiveness of the color space components and the explicit discrimination between luminance and chrominance properties made these color spaces popular in works on skin color segmentation. However,  points out several undesirable features of these color spaces, including hue discontinuities and the computation of "brightness" (lightness, value), which conflicts badly with the properties of color vision.
$$H = \arccos \frac{\frac{1}{2}\bigl((R-G)+(R-B)\bigr)}{\sqrt{(R-G)^2 + (R-B)(G-B)}}$$
$$S = 1 - 3\,\frac{\min(R,G,B)}{R+G+B}$$
$$V = \frac{1}{3}(R+G+B)$$
An alternative way of computing hue and saturation uses log-opponent values, introducing an additional logarithmic transformation of the RGB values aimed at reducing the dependence of chrominance on the illumination level. The polar coordinate system of hue-saturation spaces, which results in the cyclic nature of the color space, makes it inconvenient for parametric skin color models that need a tight cluster of skin colors for best performance. Here, a different representation of hue and saturation using Cartesian coordinates can be used:
$$X = S \cos H, \qquad Y = S \sin H.$$
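The hue, saturation and value formulas, together with the Cartesian representation, can be sketched as follows. The function names are hypothetical. Two choices in the code are assumptions not found in the text: a grey pixel (R = G = B) makes the hue formula divide by zero, so hue is set to 0 there, and the arccos formula is implemented as written, which returns hue in [0, π] only; a reflection to 2π − H for B > G, as commonly used elsewhere, is not part of the formula given here.

```python
import math


def rgb_to_hsv_like(R, G, B):
    """Hue (radians), saturation and value computed from the formulas above."""
    total = R + G + B
    V = total / 3
    S = 0.0 if total == 0 else 1 - 3 * min(R, G, B) / total
    num = 0.5 * ((R - G) + (R - B))
    den = math.sqrt((R - G) ** 2 + (R - B) * (G - B))
    if den == 0:
        H = 0.0  # grey pixel: hue is undefined, 0 is an arbitrary choice
    else:
        # Clamp against rounding errors before calling acos.
        H = math.acos(max(-1.0, min(1.0, num / den)))
    return H, S, V


def hue_sat_to_cartesian(H, S):
    """Cartesian form X = S cos H, Y = S sin H of the hue-saturation pair."""
    return S * math.cos(H), S * math.sin(H)


H, S, V = rgb_to_hsv_like(255, 0, 0)          # pure red
print(H, S, V, hue_sat_to_cartesian(H, S))
print(rgb_to_hsv_like(0, 255, 0))             # pure green
print(rgb_to_hsv_like(50, 50, 50))            # grey
```

The Cartesian pair varies smoothly where the hue angle wraps around, which is the property that makes it more convenient for parametric models.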
2.2.4 YCrCb
YCrCb is an encoded nonlinear RGB signal, commonly used by European television studios and for image compression work. Color is represented by luma (which is luminance computed from nonlinear RGB), constructed as a weighted sum of the RGB values, and two color difference values Cr and Cb that are formed by subtracting luma from the red and blue components of RGB.
$$Y = 0.299R + 0.587G + 0.114B, \qquad C_r = R - Y, \qquad C_b = B - Y$$
The simplicity of the transformation and the explicit separation of luminance and chrominance components make this color space attractive for skin color modelling.
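The transformation is simple enough to state in three lines. The sketch below follows the formulas exactly as given above; note that digital video standards additionally scale and offset the two difference signals, which is deliberately not done here. The function name is hypothetical.

```python
def rgb_to_ycrcb(R, G, B):
    """Luma and the two color differences, following the formulas above."""
    Y = 0.299 * R + 0.587 * G + 0.114 * B
    return Y, R - Y, B - Y


print(rgb_to_ycrcb(128, 128, 128))  # grey: both color differences vanish (up to rounding)
print(rgb_to_ycrcb(255, 0, 0))      # red: positive Cr, negative Cb
```

Because the three luma weights sum to one, any grey pixel has Cr and Cb equal to zero, so the chrominance plane carries no brightness information for achromatic colors.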
2.2.5 Which skin color space
One of the major questions in using skin color for skin detection is how to choose a suitable color space. A wide variety of color spaces has been applied to the problem of skin color modelling. A brief review of the most popular color spaces and their properties is presented in a recent study. For real-world applications and dynamic scenes, color spaces that separate the chrominance and luminance components of color are typically preferable. The main reason is that when only the chrominance-dependent components of color are considered, increased robustness to illumination changes can be achieved. For example, HSV seems to be a good alternative, but the HSV family shows lower reliability when the scenes are complex and contain similar colors such as wood textures. Moreover, transforming a frame requires converting each pixel to the new color space, which can be avoided if the camera provides RGB images directly, as most of them do. Therefore, for the purpose of this thesis, a choice between the RGB and YCrCb color spaces is considered.
Face Detection
Face detection is a useful task in many applications such as video conferencing, human-machine interfaces, Content Based Image Retrieval (CBIR), surveillance systems etc. It is also often used as the first step of automatic face recognition, determining the presence of faces (if any) in the input image (or video sequence). The output of a face detection step is the face region, including its location and size. In general, the face recognition problem (in computer vision) can be formulated as follows: Given still or video images of a scene, determine the presence of faces and then identify or verify one or more faces in the scene using a stored database of faces. Thus, the accuracy of a face recognition system depends on the accuracy of the face detection system. However, the variability in the appearance of face patterns makes detection a difficult task. A robust face detector should be able to find the faces regardless of their number, color, positions, occlusions, orientations, facial expressions, etc. Although this is still an unsolved problem, many methods have been proposed for detecting faces. Additionally, color and motion, when available, may be useful cues in face detection. Even if disadvantages of color based methods, such as sensitivity to varying lighting conditions, make them less robust, they can still easily be used as a pre-processing step in face detection.
Most of the robust face detection methods can be classified into two main categories: feature based and image based techniques. The feature based techniques make explicit use of face knowledge. They start by deriving low-level features and then apply knowledge based analysis. The image based techniques rely on the 2D image of the face. Using training schemes and learning algorithms, the data can be classified into face or non-face groups. Here, a brief summary of feature and image based techniques is presented.
3.1 Feature Based Techniques
As the title indicates, the focus in this class is on extracting facial features. The foundation of the face detection task in feature based methods is the facial feature search problem. These techniques are quite old and were actively studied up to the mid-1990s. However, some feature extraction is still in use, e.g. facial features extracted with Gabor filters. The advantages of the feature based methods are their relative insensitivity to illumination conditions, occlusions and viewpoint, whereas the complex (computationally heavy) analysis and the difficulties with low-quality images are the main drawbacks.
3.2 Image Based Techniques
Basically, these methods scan an input image at all possible locations and scales and then classify the sub-windows as either face or non-face. The techniques rely on training sets to capture the large variability in facial appearance, instead of extracting visual facial features as the previous techniques do.
Since the face detection step strongly affects the performance of the whole system, a robust face detector should be employed. The accuracy and speed of face detectors have been studied in previous works. In this thesis, the chosen face detector is an efficient detector scheme presented by Viola and Jones (2001), using Haar-like features and AdaBoost as the training algorithm. In the next section, a brief description of the chosen scheme is given.
3.3 Face Detection based on Haar-like features and AdaBoost algorithm
This technique relies on the use of simple Haar-like features with a new image representation (the integral image). AdaBoost is then used to select the most prominent features among a large number of extracted features. Finally, a strong classifier is obtained by boosting a set of weak classifiers. This approach has proven to be an effective algorithm for visual object detection and was also one of the first real-time frontal-view face detectors. The effectiveness of this approach is based on four particular facts.
1. Using a set of simple masks similar to Haar-filters.
2. Using integral image representation which speeds up the feature extraction.
3. Using a learning algorithm, AdaBoost, yielding an effective classifier, which decreases the number of features.
4. Using the Attentional Cascade structure which allows background region of an image to be quickly discarded while spending more computation on promising object-like regions.
A discussion of each particular fact is presented below.
3.3.1 Haar-like features: Feature extraction
Working with only image intensities (i.e. the grey-level pixel values at each and every pixel of the image) generally makes the task computationally expensive. An alternative feature set to the usual image intensities can be much faster. This feature set considers rectangular regions of the image and sums up the pixels in each region. Additionally, features carry better domain knowledge than pixels. The Viola-Jones features can be thought of as pixel intensity set evaluations in their simplest form. The feature value is defined as the difference between the sum of the luminance of the pixels in some region(s) and the sum of the luminance of the pixels in other region(s). The position and the size of the features depend on the detection box. For instance, features of type 3 in Figure 3.1 have 4 parameters: the position (x, y) in the detection box, the size of the white (or positive) region (w), the size of the black (or negative) region (b), and the height (h) of the feature.
Figure 3.1. Four examples of the type of feature normally used in the Viola-Jones system.
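To make the definition of a feature value concrete, here is a minimal sketch of a two-rectangle feature that reuses the parameter names (x, y), w, b and h from the text. The layout (white region on the left, black region directly to its right) and the sign convention (white minus black) are assumptions made for this example, as are the function names. The sums are computed naively, pixel by pixel, without the integral image.

```python
def region_sum(img, x, y, width, height):
    """Sum of the pixels in the rectangle with top-left corner (x, y)."""
    return sum(img[r][c]
               for r in range(y, y + height)
               for c in range(x, x + width))


def two_rect_feature(img, x, y, w, b, h):
    """White region of width w at (x, y), black region of width b to its right."""
    white = region_sum(img, x, y, w, h)
    black = region_sum(img, x + w, y, b, h)
    return white - black


# Toy 4x4 detection box: bright left half, dark right half (a vertical edge).
box = [[9, 9, 1, 1] for _ in range(4)]
print(two_rect_feature(box, x=0, y=0, w=2, b=2, h=4))
```

A large absolute value indicates a strong intensity contrast between the two regions, while a uniform patch gives zero.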
3.3.2 Integral Image: Speed up the feature extraction
A reliable detection algorithm must address two main issues, namely accuracy and speed. There is generally a trade-off between them. One efficient way to improve the speed of the feature extraction is to use the integral image representation. The integral image representation of an image $I$ is defined as:
$$Int(x, y) = \sum_{x' \le x} \sum_{y' \le y} I(x', y')$$
Hence, the integral image at location $(x, y)$ is the sum of all pixel values above and to the left of $(x, y)$, inclusive. Using the integral image representation simplifies the computation and speeds up the feature extraction process: the sum of the pixels in any rectangle of an image can be calculated from the corresponding integral image by indexing the integral image only four times.
In Figure 3.2, the evaluation of a rectangle is shown as an example. The rectangle is specified by four coordinates: $(x_1, y_1)$ for the upper left corner and $(x_4, y_4)$ for the lower right corner.
$$A(x_1, y_1, x_4, y_4) = Int(x_1, y_1) + Int(x_4, y_4) - Int(x_1, y_4) - Int(x_4, y_1).$$
Figure 3.2. An example of integral image application.
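The four-lookup evaluation can be checked against a direct pixel sum with the short sketch below. The helper names are hypothetical, and the code assumes the inclusive integral image defined in this section; under that assumption the formula covers the pixels with x1 < x <= x4 and y1 < y <= y4, i.e. the corner (x1, y1) itself lies just outside the summed region.

```python
def integral_image(img):
    """Inclusive integral image: Int[y][x] sums all pixels above and left of (x, y)."""
    height, width = len(img), len(img[0])
    integral = [[0] * width for _ in range(height)]
    for y in range(height):
        row_sum = 0
        for x in range(width):
            row_sum += img[y][x]
            integral[y][x] = row_sum + (integral[y - 1][x] if y > 0 else 0)
    return integral


def rect_sum(integral, x1, y1, x4, y4):
    """Rectangle sum from four lookups: Int(x1,y1) + Int(x4,y4) - Int(x1,y4) - Int(x4,y1)."""
    return (integral[y1][x1] + integral[y4][x4]
            - integral[y4][x1] - integral[y1][x4])


def rect_sum_direct(img, x1, y1, x4, y4):
    """Reference implementation: explicit sum over x1 < x <= x4, y1 < y <= y4."""
    return sum(img[y][x]
               for y in range(y1 + 1, y4 + 1)
               for x in range(x1 + 1, x4 + 1))


image = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
int_img = integral_image(image)
fast = rect_sum(int_img, 0, 0, 2, 2)
direct = rect_sum_direct(image, 0, 0, 2, 2)
print(fast, direct)
```

The cost of `rect_sum` is independent of the rectangle size, whereas the direct sum grows with the rectangle area.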
CHAPTER 4. FACE RECOGNITION
A simple definition of face recognition is determining the identity of a given face. Thus, the facial representation and characteristics must first be extracted and then matched against a database. The database consists of facial representations of known individuals. After matching, the given face is either recognized as a known person or rejected as a new, unknown person. The main issue is extracting the facial representation. Several methods have been proposed, based on six different approaches.
1. Feature based methods: The feature based method is the earliest approach in face recognition, based on geometrical relations in poses, image conditions and rotation, e.g. distances between facial features such as the eyes. The recognition is usually done by using the Euclidean distance.
2. Model based methods: The model based methods basically consist of three steps: (a) defining a model of the face structure, (b) fitting the model to the given face image, and (c) using the parameters of the fitted model as the feature vector to calculate the similarity between the input face and those in the database.
3. Appearance based methods: The appearance based methods are similar to model based methods. The aim here is to achieve higher accuracy through a larger training data set by finding a transformation that maps the faces from the input space into a low-dimensional feature space (e.g. Principal Component Analysis).
4. 3D based methods: As the name suggests, 3D based methods rely on three-dimensional poses. The time-consuming nature of working with 3D poses makes these methods complicated to use.
5. Video based methods: Most of these methods are based on facial special structure. The best frames of the video are chosen and fed into the face recognition system.
6. Hybrid & multimodal based methods: These methods combine some of the methods presented above in order to achieve better performance, e.g. a combination of appearance and feature based methods.
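Whichever representation is chosen, the matching step described at the start of this chapter (recognize as a known person or reject as unknown) can be sketched as a nearest-neighbour search with a rejection threshold. The following is a hedged illustration only: the function `identify`, the three-dimensional feature vectors, the names in the database and the threshold value are all made up for the example, and the Euclidean distance is used simply because it is the measure mentioned for feature based methods.

```python
import math


def identify(query, database, threshold):
    """Return the name of the closest known face, or None if nothing is close enough."""
    best_name, best_dist = None, math.inf
    for name, reference in database.items():
        dist = math.dist(query, reference)  # Euclidean distance between feature vectors
        if dist < best_dist:
            best_name, best_dist = name, dist
    # Reject the face as unknown when even the best match is too far away.
    return best_name if best_dist <= threshold else None


# Hypothetical database of facial representations of known individuals.
database = {
    "alice": [0.9, 0.1, 0.4],
    "bob": [0.2, 0.8, 0.5],
}

known = identify([0.85, 0.15, 0.4], database, threshold=0.2)
unknown = identify([0.5, 0.5, 0.9], database, threshold=0.2)
print(known, unknown)
```

In practice the threshold trades off false acceptances against false rejections and would have to be tuned on validation data.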
In this work, a new feature space for representing face images is used. The facial representation is based on the Local Binary Pattern (LBP) operator introduced by Ojala et al. (1996). A description of LBP is given in the following section.
4.1 Local Binary Patterns
In short, a Local Binary Pattern is a texture descriptor. Since 2D surface texture is an important cue in machine vision, an adequate description of texture is useful in various applications. The Local Binary Patterns operator is one of the best performing texture descriptors. The LBP operator labels the pixels of an image by thresholding the NxN neighborhood of each pixel with the value of the center pixel and treating the result as a binary mask. Figure 4.1 shows an example of an LBP operator using 3x3 neighborhoods. The operator assigns a binary value (0 or 1) to each neighbor in the mask. For 3x3 masks, the code of each pixel is therefore an 8-bit binary number, and the LBP codes of the entire image can be calculated in a single scan through the image. A simple way to summarize the final LBP codes over the image is the histogram of the labels: a 256-bin histogram represents the texture description of the image, and each bin can be regarded as a micro-pattern; see Figure 4.2 for more details.
Figure 4.1. Example of an LBP calculation.
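The 3x3 operator and the 256-bin histogram described above can be sketched as follows. Several details are assumptions made for illustration rather than choices confirmed by the thesis: the neighbours are read clockwise starting from the top-left pixel, a neighbour contributes a 1 when its value is greater than or equal to the center, the first neighbour is the least significant bit, and border pixels without a full neighbourhood are skipped.

```python
# Offsets (dy, dx) of the 8 neighbours, clockwise from the top-left pixel (assumed order).
NEIGHBOURS = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]


def lbp_code(img, y, x):
    """8-bit LBP code of the pixel at row y, column x (must not lie on the border)."""
    center = img[y][x]
    code = 0
    for bit, (dy, dx) in enumerate(NEIGHBOURS):
        if img[y + dy][x + dx] >= center:  # threshold the neighbour against the center
            code |= 1 << bit
    return code


def lbp_histogram(img):
    """256-bin histogram of the LBP codes of all interior pixels."""
    hist = [0] * 256
    for y in range(1, len(img) - 1):
        for x in range(1, len(img[0]) - 1):
            hist[lbp_code(img, y, x)] += 1
    return hist


patch = [
    [5, 9, 1],
    [4, 4, 6],
    [7, 2, 3],
]
code = lbp_code(patch, 1, 1)
hist = lbp_histogram(patch)
print(code, format(code, "08b"))
```

A different starting neighbour or bit order only permutes the histogram bins, so it does not matter as long as the same convention is used for all images that are compared.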
Since each bin represents a micro-pattern, curved edges can be easily detected by LBP. The local primitives coded by these bins include different types of curved edges, spots, flat areas, etc. An example of texture primitives that can be detected by LBP is shown in Figure 4.3.
Ideally, a good texture descriptor should be easy to compute and should have high extra-class variance (i.e., between different persons in the case of face recognition) and low intra-class variance, which means that the descriptor should be robust with respect to aging of the subjects, varying illumination and other factors.
Figure 4.2. 256-bin LBP histograms of two samples.
Figure 4.3. Example of texture primitives detected by LBP (white circles represent ones and black circles zeros).
The original LBP operator has been extended to consider different neighborhood sizes (Ojala et al. 2002b). For example, the operator LBP(4, 1) uses only 4 neighbors, while LBP(16, 2) considers the 16 neighbors on a circle of radius 2. In general, the operator LBP(P, R) refers to a neighborhood of P equally spaced pixels on a circle of radius R that form a circularly symmetric neighbor set. Figure 4.4 shows some examples of neighborhood sets. LBP(P, R) produces 2^P different output values, corresponding to the 2^P different binary patterns that can be formed by the P pixels in the neighbor set. It has been shown that certain bins contain more information than others. Therefore, it is possible to use only a subset of the 2^P local binary patterns to describe the textured images. Ojala et al. (2002b) defined these fundamental patterns (also called "uniform" patterns) as those with a small number of bitwise transitions from 0 to 1 and vice versa. For example, 00000000 and 11111111 contain 0 transitions, while 00000110 and 01111000 contain 2 transitions, and so on. Thus a pattern is uniform if the number of bitwise transitions is