

















A comprehensive survey of current methods for detecting faces in images. The authors discuss various approaches to face detection, including knowledge-based top-down methods, multiple features methods, and machine learning techniques. They also discuss the use of eigenfaces and neural networks for face recognition. The document also covers the importance of face databases for research and development in this field.


















Abstract—Images containing faces are essential to intelligent vision-based human computer interaction, and research efforts in face processing include face recognition, face tracking, pose estimation, and expression recognition. However, many reported methods assume that the faces in an image or an image sequence have been identified and localized. To build fully automated systems that analyze the information contained in face images, robust and efficient face detection algorithms are required. Given a single image, the goal of face detection is to identify all image regions which contain a face regardless of its three-dimensional position, orientation, and lighting conditions. Such a problem is challenging because faces are nonrigid and have a high degree of variability in size, shape, color, and texture. Numerous techniques have been developed to detect faces in a single image, and the purpose of this paper is to categorize and evaluate these algorithms. We also discuss relevant issues such as data collection, evaluation metrics, and benchmarking. After analyzing these algorithms and identifying their limitations, we conclude with several promising directions for future research.
Index Terms—Face detection, face recognition, object recognition, view-based recognition, statistical pattern recognition, machine learning.
WITH the ubiquity of new information technology and media, more effective and friendly methods for human computer interaction (HCI) are being developed which do not rely on traditional devices such as keyboards, mice, and displays. Furthermore, the ever decreasing price/performance ratio of computing coupled with recent decreases in video image acquisition cost implies that computer vision systems can be deployed in desktop and embedded systems [111], [112], [113]. The rapidly expanding research in face processing is based on the premise that information about a user’s identity, state, and intent can be extracted from images, and that computers can then react accordingly, e.g., by observing a person’s facial expression. In the last five years, face and facial expression recognition have attracted much attention though they have been studied for more than 20 years by psychophysicists, neuroscientists, and engineers. Many research demonstrations and commercial applications have been developed from these efforts. A first step of any face processing system is detecting the locations in images where faces are present. However, face detection from a single image is a challenging task because of variability in scale, location, orientation (up-right, rotated), and pose (frontal, profile). Facial expression, occlusion, and lighting conditions also change the overall appearance of faces.
We now give a definition of face detection: Given an arbitrary image, the goal of face detection is to determine whether or not there are any faces in the image and, if present, return the image location and extent of each face. The challenges associated with face detection can be attributed to the following factors:
• Pose. The images of a face vary due to the relative camera-face pose (frontal, 45 degree, profile, upside down), and some facial features such as an eye or the nose may become partially or wholly occluded.
• Presence or absence of structural components. Facial features such as beards, mustaches, and glasses may or may not be present and there is a great deal of variability among these components including shape, color, and size.
• Facial expression. The appearance of faces is directly affected by a person’s facial expression.
• Occlusion. Faces may be partially occluded by other objects. In an image with a group of people, some faces may partially occlude other faces.
• Image orientation. Face images directly vary for different rotations about the camera’s optical axis.
• Imaging conditions. When the image is formed, factors such as lighting (spectra, source distribution and intensity) and camera characteristics (sensor response, lenses) affect the appearance of a face.
There are many closely related problems of face detection. Face localization aims to determine the image position of a single face; this is a simplified detection problem with the assumption that an input image contains only one face [85], [103]. The goal of facial feature detection is to detect the presence and location of features, such as eyes, nose, nostrils, eyebrow, mouth, lips, ears, etc., with the assumption that there is only one face in an image [28], [54]. Face recognition or face identification compares an input image (probe) against a database (gallery) and reports a match, if
34 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 24, NO. 1, JANUARY 2002
• M.-H. Yang is with Honda Fundamental Research Labs, 800 California Street, Mountain View, CA 94041. E-mail: [email protected].
• D.J. Kriegman is with the Department of Computer Science and Beckman Institute, University of Illinois at Urbana-Champaign, Urbana, IL 61801. E-mail: [email protected].
• N. Ahuja is with the Department of Electrical and Computer Engineering and Beckman Institute, University of Illinois at Urbana-Champaign, Urbana, IL 61801. E-mail: [email protected].
Manuscript received 5 May 2000; revised 15 Jan. 2001; accepted 7 Mar. 2001. Recommended for acceptance by K. Bowyer. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number 112058. 0162-8828/02/$17.00 © 2002 IEEE
any [163], [133], [18]. The purpose of face authentication is to verify the claim of the identity of an individual in an input image [158], [82], while face tracking methods continuously estimate the location and possibly the orientation of a face in an image sequence in real time [30], [39], [33]. Facial expression recognition concerns identifying the affective states (happy, sad, disgusted, etc.) of humans [40], [35]. Evidently, face detection is the first step in any automated system which solves the above problems. It is worth mentioning that many papers use the term “face detection,” but the methods and the experimental results only show that a single face is localized in an input image. In this paper, we differentiate face detection from face localization since the latter is a simplified problem of the former. Meanwhile, we focus on face detection methods rather than tracking methods. While numerous methods have been proposed to detect faces in a single intensity or color image, we are unaware of any surveys on this particular topic. A survey of early face recognition methods before 1991 was written by Samal and Iyengar [133]. Chellappa et al. wrote a more recent survey on face recognition and some detection methods [18]. Among the face detection methods, the ones based on learning algorithms have attracted much attention recently and have demonstrated excellent results. Since these data-driven methods rely heavily on the training sets, we also discuss several databases suitable for this task. A related and important problem is how to evaluate the performance of the proposed detection methods. Many recent face detection papers compare the performance of several methods, usually in terms of detection and false alarm rates. It is also worth noticing that many metrics have been adopted to evaluate algorithms, such as learning time, execution time, the number of samples required in training, and the ratio between detection rates and false alarms.
Evaluation becomes more difficult when researchers use different definitions for detection and false alarm rates. In this paper, detection rate is defined as the ratio between the number of faces correctly detected and the number of faces determined by a human. An image region identified as a face by a classifier is considered to be correctly detected if the image region covers more than a certain percentage of a face in the image (see Section 3.3 for details). In general, detectors can make two types of errors: false negatives, in which faces are missed, resulting in low detection rates, and false positives, in which an image region is declared to be a face but is not. A fair evaluation should take these factors into consideration since one can tune the parameters of one’s method to increase the detection rates while also increasing the number of false detections. In this paper, we discuss the benchmarking data sets and the related issues in a fair evaluation. With over 150 reported approaches to face detection, the research in face detection has broader implications for computer vision research on object recognition. Nearly all model-based or appearance-based approaches to 3D object recognition have been limited to rigid objects while attempting to robustly perform identification over a broad range of camera locations and illumination conditions. Face detection can be viewed as a two-class recognition problem
in which an image region is classified as being a “face” or “nonface.” Consequently, face detection is one of the few attempts to recognize from images (not abstract representations) a class of objects for which there is a great deal of within-class variability (described previously). It is also one of the few classes of objects for which this variability has been captured using large training sets of images and, so, some of the detection techniques may be applicable to a much broader class of recognition problems. Face detection also provides interesting challenges to the underlying pattern classification and learning techniques. When a raw or filtered image is considered as input to a pattern classifier, the dimension of the feature space is extremely large (i.e., the number of pixels in normalized training images). The classes of face and nonface images are decidedly characterized by multimodal distribution functions and effective decision boundaries are likely to be nonlinear in the image space. To be effective, either classifiers must be able to extrapolate from a modest number of training samples or be efficient when dealing with a very large number of these high-dimensional training samples. With an aim to give a comprehensive and critical survey of current face detection methods, this paper is organized as follows: In Section 2, we give a detailed review of techniques to detect faces in a single image. Benchmarking databases and evaluation criteria are discussed in Section 3. We conclude this paper with a discussion of several promising directions for face detection in Section 4.
1. Though we report error rates for each method when available, tests are often done on unique data sets and, so, comparisons are often difficult. We indicate those methods that have been evaluated with a publicly available test set. It can be assumed that a unique data set was used if we do not indicate the name of the test set.
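The detection rate and false positive count defined in the introduction can be illustrated with a minimal Python sketch. The box format `(x0, y0, x1, y1)`, the function names, and the 50 percent coverage threshold are assumptions made here for illustration; the survey itself defers the exact criterion to Section 3.3.

```python
def coverage(det, face):
    """Fraction of the ground-truth face box covered by the detection box.

    Boxes are (x0, y0, x1, y1) with x1 > x0 and y1 > y0 (assumed format).
    """
    ix0, iy0 = max(det[0], face[0]), max(det[1], face[1])
    ix1, iy1 = min(det[2], face[2]), min(det[3], face[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    face_area = (face[2] - face[0]) * (face[3] - face[1])
    return inter / face_area


def evaluate(detections, faces, min_coverage=0.5):
    """Return (detection_rate, false_positives).

    detection_rate = faces correctly detected / faces labeled by a human.
    A detection that covers no labeled face well enough is a false positive.
    The 0.5 threshold is a hypothetical choice, not a value from the survey.
    """
    matched = set()
    false_positives = 0
    for det in detections:
        hits = [i for i, f in enumerate(faces) if coverage(det, f) > min_coverage]
        if hits:
            matched.update(hits)
        else:
            false_positives += 1
    detection_rate = len(matched) / len(faces) if faces else 0.0
    return detection_rate, false_positives
```

Note that a coverage-only criterion rewards very large detection boxes, which is one reason differing definitions make published rates hard to compare.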
In this section, we review existing techniques to detect faces from a single intensity or color image. We classify single image detection methods into four categories; some methods clearly overlap category boundaries and are discussed at the end of this section.
YANG ET AL.: DETECTING FACES IN IMAGES: A SURVEY 35
part of the face (the dark shaded parts in Fig. 2) has four cells with a basically uniform intensity,” “the upper round part of a face (the light shaded parts in Fig. 2) has a basically uniform intensity,” and “the difference between the average gray values of the center part and the upper round part is significant.” The lowest resolution (Level 1) image is searched for face candidates and these are further processed at finer resolutions. At Level 2, local histogram equalization is performed on the face candidates received from Level 1, followed by edge detection. Surviving candidate regions are then examined at Level 3 with another set of rules that respond to facial features such as the eyes and mouth. Evaluated on a test set of 60 images, this system located faces in 50 of the test images while there are 28 images in which false alarms appear. One attractive feature of this method is that a coarse-to-fine or focus-of-attention strategy is used to reduce the required computation. Although it does not result in a high detection rate, the ideas of using a multiresolution hierarchy and rules to guide searches have been used in later face detection works [81]. Kotropoulos and Pitas [81] presented a rule-based localization method which is similar to [71] and [170]. First, facial features are located with a projection method that Kanade successfully used to locate the boundary of a face [71]. Let $I(x, y)$ be the intensity value of an $m \times n$ image at position $(x, y)$; the horizontal and vertical projections of the image are defined as $HI(x) = \sum_{y=1}^{n} I(x, y)$ and $VI(y) = \sum_{x=1}^{m} I(x, y)$. The horizontal profile of an input image is obtained first, and then the two local minima, determined by detecting abrupt changes in $HI$, are said to correspond to the left and right side of the head. Similarly, the vertical profile is obtained and the local minima are determined for the locations of mouth lips, nose tip, and eyes. These detected features constitute a facial candidate. Fig. 3a shows one example where the boundaries
of the face correspond to the local minimum where abrupt intensity changes occur. Subsequently, eyebrow/eyes, nostrils/nose, and the mouth detection rules are used to validate these candidates. The proposed method has been tested using a set of faces in frontal views extracted from the European ACTS M2VTS (MultiModal Verification for Teleservices and Security applications) database [116] which contains video sequences of 37 different people. Each image sequence contains only one face in a uniform background. Their method provides correct face candidates in all tests. The detection rate is 86.5 percent if successful detection is defined as correctly identifying all facial features. Fig. 3b shows one example in which it becomes difficult to locate a face in a complex background using the horizontal and vertical profiles. Furthermore, this method cannot readily detect multiple faces as illustrated in Fig. 3c. Essentially, the projection method can be effective if the window over which it operates is suitably located to avoid misleading interference.
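The horizontal and vertical projections used by this method are simple to compute. The following NumPy sketch is an illustration only: the array layout `img[y, x]`, the function names, and the strict local-minimum test are assumptions made here, and the rules that Kotropoulos and Pitas apply to the profiles are not reproduced.

```python
import numpy as np


def projections(img):
    """Return (HI, VI) for a grayscale image stored as img[y, x].

    HI(x) sums intensities over y (one value per column);
    VI(y) sums intensities over x (one value per row).
    """
    img = np.asarray(img, dtype=float)
    hi = img.sum(axis=0)
    vi = img.sum(axis=1)
    return hi, vi


def local_minima(profile):
    """Indices that are strictly lower than both neighbors.

    A deliberately naive stand-in for "detecting abrupt changes";
    real profiles would need smoothing first (assumption).
    """
    p = np.asarray(profile, dtype=float)
    return [i for i in range(1, len(p) - 1) if p[i] < p[i - 1] and p[i] < p[i + 1]]
```

On a face against a uniform background, the minima of `hi` would be candidates for the sides of the head and the minima of `vi` candidates for the eye, nose, and mouth rows. As the text notes, a cluttered background or several faces produce profiles in which such minima are no longer meaningful.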
In contrast to the knowledge-based top-down approach, researchers have been trying to find invariant features of faces for detection. The underlying assumption is based on the observation that humans can effortlessly detect faces and objects in different poses and lighting conditions and, so, there must exist properties or features which are invariant over these variabilities. Numerous methods have been proposed to first detect facial features and then to infer the presence of a face. Facial features such as eyebrows, eyes, nose, mouth, and hair-line are commonly extracted using edge detectors. Based on the extracted features, a statistical model is built to describe their relationships and to verify the existence of a face. One problem with these feature-based algorithms is that the image features can be severely corrupted due to illumination, noise, and occlusion. Feature boundaries can be weakened for faces, while shadows can cause numerous strong edges which together render perceptual grouping algorithms useless.
Sirohey proposed a localization method to segment a face from a cluttered background for face identification [145]. It uses an edge map (Canny detector [15]) and heuristics to remove and group edges so that only the ones on the face
Fig. 2. A typical face used in knowledge-based top-down methods: Rules are coded based on human knowledge about the characteristics (e.g., intensity distribution and difference) of the facial regions [170].
Fig. 3. (a) and (b) n = 8. (c) n = 4. Horizontal and vertical profiles. It is feasible to detect a single face by searching for the peaks in horizontal and vertical profiles. However, the same method has difficulty detecting faces in complex backgrounds or multiple faces as shown in (b) and (c).
contour are preserved. An ellipse is then fit to the boundary between the head region and the background. This algorithm achieves 80 percent accuracy on a database of 48 images with cluttered backgrounds. Instead of using edges, Chetverikov and Lerch presented a simple face detection method using blobs and streaks (linear sequences of similarly oriented edges) [20]. Their face model consists of two dark blobs and three light blobs to represent eyes, cheekbones, and nose. The model uses streaks to represent the outlines of the faces, eyebrows, and lips. Two triangular configurations are utilized to encode the spatial relationship among the blobs. A low resolution Laplacian image is generated to facilitate blob detection. Next, the image is scanned to find specific triangular occurrences as candidates. A face is detected if streaks are identified around a candidate. Graf et al. developed a method to locate facial features and faces in gray scale images [54]. After band pass filtering, morphological operations are applied to enhance regions with high intensity that have certain shapes (e.g., eyes). The histogram of the processed image typically exhibits a prominent peak. Based on the peak value and its width, adaptive threshold values are selected in order to generate two binarized images. Connected components are identified in both binarized images to identify the areas of candidate facial features. Combinations of such areas are then evaluated with classifiers, to determine whether and where a face is present. Their method has been tested with head-shoulder images of 40 individuals and with five video sequences where each sequence consists of 100 to 200 frames. However, it is not clear how morphological operations are performed and how the candidate facial features are combined to locate a face. Leung et al. developed a probabilistic method to locate a face in a cluttered scene based on local feature detectors and random graph matching [87]. 
Their motivation is to formulate the face localization problem as a search problem in which the goal is to find the arrangement of certain facial features that is most likely to be a face pattern. Five features (two eyes, two nostrils, and nose/lip junction) are used to describe a typical face. For any pair of facial features of the same type (e.g., left-eye, right-eye pair), their relative distance is computed, and over an ensemble of images the distances are modeled by a Gaussian distribution. A facial template is defined by averaging the responses to a set of multiorientation, multiscale Gaussian derivative filters (at the pixels inside the facial feature) over a number of faces in a data set. Given a test image, candidate facial features are identified by matching the filter response at each pixel against a template vector of responses (similar to correlation in spirit). The top two feature candidates with the strongest response are selected to search for the other facial features. Since the facial features cannot appear in arbitrary arrangements, the expected locations of the other features are estimated using a statistical model of mutual distances. Furthermore, the covariance of the estimates can be computed. Thus, the expected feature locations can be estimated with high probability. Constellations are then formed only from candidates that lie inside the appropriate locations, and the most face-like constellation is determined. Finding the best constellation is formulated as a random graph matching problem in which the nodes of the
graph correspond to features on a face, and the arcs represent the distances between different features. Ranking of constellations is based on a probability density function that a constellation corresponds to a face versus the probability it was generated by an alternative mechanism (i.e., nonface). They used a set of 150 images for experiments in which a face is considered correctly detected if any constellation correctly locates three or more features on the faces. This system is able to achieve a correct localization rate of 86 percent. Instead of using mutual distances to describe the relationships between facial features in constellations, an alternative method for modeling faces was also proposed by Leung et al. [13], [88]. The representation and ranking of the constellations is accomplished using the statistical theory of shape, developed by Kendall [75] and Mardia and Dryden [95]. The shape statistics is a joint probability density function over $N$ feature points, represented by $(x_i, y_i)$ for the $i$th feature, under the assumption that the original feature points are positioned in the plane according to a general $2N$-dimensional Gaussian distribution. They applied the same maximum-likelihood (ML) method to determine the location of a face. One advantage of these methods is that partially occluded faces can be located. However, it is unclear whether these methods can be adapted to detect multiple faces effectively in a scene. In [177], [178], Yow and Cipolla presented a feature-based method that uses a large amount of evidence from the visual image and their contextual evidence. The first stage applies a second derivative Gaussian filter, elongated at an aspect ratio of three to one, to a raw image. Interest points, detected at the local maxima in the filter response, indicate the possible locations of facial features. The second stage examines the edges around these interest points and groups them into regions.
The perceptual grouping of edges is based on their proximity and similarity in orientation and strength. Measurements of a region’s characteristics, such as edge length, edge strength, and intensity variance, are computed and stored in a feature vector. From the training data of facial features, the mean and covariance matrix of each facial feature vector are computed. An image region becomes a valid facial feature candidate if the Mahalanobis distance between the corresponding feature vectors is below a threshold. The labeled features are further grouped based on model knowledge of where they should occur with respect to each other. Each facial feature and grouping is then evaluated using a Bayesian network. One attractive aspect is that this method can detect faces at different orientations and poses. The overall detection rate on a test set of 110 images of faces with different scales, orientations, and viewpoints is 85 percent [179]. However, the reported false detection rate is 28 percent and the implementation is only effective for faces larger than 60 × 60 pixels. Subsequently, this approach has been enhanced with active contour models [22], [179]. Fig. 4 summarizes their feature-based face detection method. Takacs and Wechsler described a biologically motivated face localization method based on a model of retinal feature extraction and small oscillatory eye movements [157]. Their algorithm operates on the conspicuity map or region of interest, with a retina lattice modeled after the magnocellular
where is a threshold selected empirically from the histogram of samples. Saxe and Foulds proposed an iterative skin identification method that uses histogram intersection in HSV color space [138]. An initial patch of skin color pixels, called the control seed, is chosen by the user and is used to initiate the iterative algorithm. To detect skin color regions, their method moves through the image, one patch at a time, and presents the control histogram and the current histogram from the image for comparison. Histogram intersection [155] is used to compare the control histogram and current histogram. If the match score or number of instances in common (i.e., intersection) is greater than a threshold, the current patch is classified as being skin color. Kjeldsen and Kender defined a color predicate in HSV color space to separate skin regions from background [79]. In contrast to the nonparametric methods mentioned above, Gaussian density functions [14], [77], [173] and a mixture of Gaussians [66], [67], [174] are often used to model skin color. The parameters in a unimodal Gaussian distribution are often estimated using maximum-likelihood [14], [77], [173]. The motivation for using a mixture of Gaussians is based on the observation that the color histogram for the skin of people with different ethnic background does not form a unimodal distribution, but rather a multimodal distribution. The parameters in a mixture of Gaussians are usually estimated using an EM algorithm [66], [174]. Recently, Jones and Rehg conducted a large-scale experiment in which nearly 1 billion labeled skin tone pixels are collected (in normalized RGB color space) [69]. Comparing the performance of histogram and mixture models for skin detection, they find histogram models to be superior in accuracy and computational cost. 
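The histogram intersection test described above can be sketched as follows. This is an illustration under stated assumptions, not Saxe and Foulds' implementation: pixels are assumed to be already converted to HSV with all channels scaled to [0, 1], only hue and saturation are binned, patches are assumed to have equal size, and the bin count and threshold are hypothetical.

```python
import numpy as np


def hs_histogram(hsv_pixels, bins=8):
    """2D hue/saturation histogram of a patch; HSV channels assumed in [0, 1]."""
    px = np.asarray(hsv_pixels, dtype=float).reshape(-1, 3)
    hist, _, _ = np.histogram2d(
        px[:, 0], px[:, 1], bins=bins, range=[[0.0, 1.0], [0.0, 1.0]]
    )
    return hist


def histogram_intersection(control, current):
    """Count of samples the two histograms have in common, normalized
    by the size of the control histogram so the score lies in [0, 1]."""
    return float(np.minimum(control, current).sum() / control.sum())


def is_skin_patch(control_hist, patch_hsv, threshold=0.6, bins=8):
    """Classify a patch as skin if its intersection with the control
    histogram exceeds a threshold (0.6 is an arbitrary illustrative value)."""
    score = histogram_intersection(control_hist, hs_histogram(patch_hsv, bins))
    return score > threshold
```

In use, the control histogram would be built once from the user-selected seed patch and then compared against each patch visited in the image.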
Color information is an efficient tool for identifying facial areas and specific facial features if the skin color model can be properly adapted for different lighting environments. However, such skin color models are not effective where the spectrum of the light source varies significantly. In other words, color appearance is often unstable due to changes in both background and foreground lighting. Though the color constancy problem has been addressed through the formulation of physics-based models [45], several approaches have been proposed to use skin color in varying lighting conditions. McKenna et al. presented an adaptive color mixture model to track faces under varying illumination conditions [99]. Instead of relying on a skin color model based on color constancy, they used a stochastic model to estimate an object’s color distribution online and adapt to accommodate changes in the viewing and lighting conditions. Preliminary results show that their system can track faces within a range of illumination conditions. However, this method cannot be applied to detect faces in a single image. Skin color alone is usually not sufficient to detect or track faces. Recently, several modular systems using a combination of shape analysis, color segmentation, and motion information for locating or tracking heads and faces in an image sequence have been developed [55], [173], [172], [99], [147]. We review these methods in the next section.
Recently, numerous methods that combine several facial features have been proposed to locate or detect faces. Most of them utilize global features such as skin color, size, and shape
to find face candidates, and then verify these candidates using local, detailed features such as eye brows, nose, and hair. A typical approach begins with the detection of skin-like regions as described in Section 2.2.3. Next, skin-like pixels are grouped together using connected component analysis or clustering algorithms. If the shape of a connected region has an elliptic or oval shape, it becomes a face candidate. Finally, local features are used for verification. However, others, such as [17], [63], have used different sets of features. Yachida et al. presented a method to detect faces in color images using fuzzy theory [19], [169], [168]. They used two fuzzy models to describe the distribution of skin and hair color in CIE XYZ color space. Five (one frontal and four side views) head-shape models are used to abstract the appearance of faces in images. Each shape model is a 2D pattern consisting of $m \times n$ square cells where each cell may contain several pixels. Two properties are assigned to each cell: the skin proportion and the hair proportion, which indicate the ratios of the skin area (or the hair area) within the cell to the area of the cell. In a test image, each pixel is classified as hair, face, hair/face, and hair/background based on the distribution models, thereby generating skin-like and hair-like regions. The head shape models are then compared with the extracted skin-like and hair-like regions in a test image. If they are similar, the detected region becomes a face candidate. For verification, eye-eyebrow and nose-mouth features are extracted from a face candidate using horizontal edges. Sobottka and Pitas proposed a method for face localization and facial feature extraction using shape and color [147]. First, color segmentation in HSV space is performed to locate skin-like regions. Connected components are then determined by region growing at a coarse resolution.
For each connected component, the best fit ellipse is computed using geometric moments. Connected components that are well approximated by an ellipse are selected as face candidates. Subsequently, these candidates are verified by searching for facial features inside of the connected components. Features, such as eyes and mouths, are extracted based on the observation that they are darker than the rest of a face. In [159], [160], a Gaussian skin color model is used to classify skin color pixels. To characterize the shape of the clusters in the binary image, a set of 11 lowest-order geometric moments is computed using Fourier and radial Mellin transforms. For detection, a neural network is trained with the extracted geometric moments. Their experiments show a detection rate of 85 percent based on a test set of 100 images. The symmetry of face patterns has also been applied to face localization [131]. Skin/nonskin classification is carried out using the class-conditional density function in YES color space followed by smoothing in order to yield contiguous regions. Next, an elliptical face template is used to determine the similarity of the skin color regions based on Hausdorff distance [65]. Finally, the eye centers are localized using several cost functions which are designed to take advantage of the inherent symmetries associated with face and eye locations. The tip of the nose and the center of the mouth are then located by utilizing the distance between the eye centers. One drawback is that it is effective only for a single frontal-view face and when both
eyes are visible. A similar method using color and local symmetry was presented in [151]. In contrast to pixel-based methods, a detection method based on structure, color, and geometry was proposed in [173]. First, multiscale segmentation [2] is performed to extract homogeneous regions in an image. Using a Gaussian skin color model, regions of skin tone are extracted and grouped into ellipses. A face is detected if facial features such as eyes and mouth exist within these elliptic regions. Experimental results show that this method is able to detect faces at different orientations with facial features such as beard and glasses. Kauth et al. proposed a blob representation to extract a compact, structurally meaningful description of multispectral satellite imagery [74]. A feature vector at each pixel is formed by concatenating the pixel’s image coordinates to the pixel’s spectral (or textural) components; pixels are then clustered using this feature vector to form coherent connected regions, or “blobs.” To detect faces, each feature vector consists of the image coordinates and normalized chrominance, i.e., $X = \left(x, y, \frac{r}{r+g+b}, \frac{g}{r+g+b}\right)$ [149], [105]. A connectivity algorithm is then used to grow blobs and the resulting skin blob whose size and shape is closest to that of a canonical face is considered as a face. Range and color have also been employed for face detection by Kim et al. [77]. Disparity maps are computed and objects are segmented from the background with a disparity histogram using the assumption that background pixels have the same depth and they outnumber the pixels in the foreground objects. Using a Gaussian distribution in normalized RGB color space, segmented regions with a skin-like color are classified as faces. A similar approach has been proposed by Darrell et al. for face detection and tracking [33].
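Several of the methods above rely on normalized chrominance and a unimodal Gaussian skin color model. The NumPy sketch below shows one way these pieces could fit together. It is not taken from any of the cited systems: the function names, the maximum-likelihood fit on a handful of samples, and the Mahalanobis distance cutoff of 3 are all assumptions chosen for illustration.

```python
import numpy as np


def normalized_chrominance(rgb):
    """Map RGB values (last axis) to (r/(r+g+b), g/(r+g+b))."""
    rgb = np.asarray(rgb, dtype=float)
    total = rgb.sum(axis=-1, keepdims=True)
    total[total == 0] = 1.0  # black pixels map to (0, 0) instead of dividing by zero
    return rgb[..., :2] / total


def blob_features(rgb_image):
    """Per-pixel feature vector X = (x, y, r', g') for an H x W x 3 image."""
    rgb_image = np.asarray(rgb_image, dtype=float)
    h, w = rgb_image.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    chroma = normalized_chrominance(rgb_image)
    return np.dstack([xs, ys, chroma[..., 0], chroma[..., 1]])


def fit_gaussian(chroma_samples):
    """Maximum-likelihood mean and covariance (divides by N, not N - 1)."""
    s = np.asarray(chroma_samples, dtype=float).reshape(-1, 2)
    mean = s.mean(axis=0)
    centered = s - mean
    cov = centered.T @ centered / len(s)
    return mean, cov


def skin_mask(rgb_image, mean, cov, max_distance=3.0):
    """True where the Mahalanobis distance to the skin model is small.

    The cutoff is a hypothetical choice; the cited papers may use a
    likelihood threshold or a different decision rule.
    """
    diff = normalized_chrominance(rgb_image) - mean
    inv = np.linalg.inv(cov)
    d2 = np.einsum("...i,ij,...j->...", diff, inv, diff)
    return d2 <= max_distance ** 2
```

A mixture of Gaussians fitted with EM, as mentioned earlier in the section, would replace `fit_gaussian` and the single-distance test while leaving the chrominance step unchanged.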
In template matching, a standard face pattern (usually frontal) is manually predefined or parameterized by a function. Given an input image, the correlation values with the standard patterns are computed for the face contour, eyes, nose, and mouth independently. The existence of a face is determined based on the correlation values. This approach has the advantage of being simple to implement. However, it has proven to be inadequate for face detection since it cannot effectively deal with variation in scale, pose, and shape. Multiresolution, multiscale, subtemplates, and deformable templates have subsequently been proposed to achieve scale and shape invariance.
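The correlation step can be illustrated with a short Python sketch. It uses zero-mean normalized cross-correlation, which is one common choice of correlation measure and not necessarily the one used by any particular method surveyed here; the function names are hypothetical. The exhaustive search at a single scale also makes the stated weakness concrete: a face of a different size or pose than the template will score poorly.

```python
import math


def normalized_correlation(window, template):
    """Zero-mean normalized cross-correlation of two equal-sized 2D lists."""
    a = [v for row in window for v in row]
    b = [v for row in template for v in row]
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    da = [v - mean_a for v in a]
    db = [v - mean_b for v in b]
    denom = math.sqrt(sum(v * v for v in da) * sum(v * v for v in db))
    if denom == 0.0:
        return 0.0  # flat patch: correlation undefined, treated as no match
    return sum(p * q for p, q in zip(da, db)) / denom


def best_match(image, template):
    """Slide the template over the image; return (score, row, col) of the best window."""
    th, tw = len(template), len(template[0])
    best = (-2.0, -1, -1)
    for r in range(len(image) - th + 1):
        for c in range(len(image[0]) - tw + 1):
            window = [row[c:c + tw] for row in image[r:r + th]]
            score = normalized_correlation(window, template)
            if score > best[0]:
                best = (score, r, c)
    return best
```

A face would be declared present when the best score exceeds a chosen threshold; in the subtemplate schemes, the same routine would be run separately for the contour, eyes, nose, and mouth templates.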
An early attempt to detect frontal faces in photographs is reported by Sakai et al. [132]. They used several subtemplates for the eyes, nose, mouth, and face contour to model a face. Each subtemplate is defined in terms of line segments. Lines in the input image are extracted based on greatest gradient change and then matched against the subtemplates. The correlations between subimages and contour templates are computed first to detect candidate locations of faces. Then, matching with the other subtemplates is performed at the candidate positions. In other words, the first phase determines the focus of attention or region of interest and the second phase examines the details to determine the existence of a
face. The idea of focus of attention and subtemplates has been adopted by later works on face detection. Craw et al. presented a localization method based on a shape template of a frontal-view face (i.e., the outline shape of a face) [27]. A Sobel filter is first used to extract edges. These edges are grouped together to search for the template of a face based on several constraints. After the head contour has been located, the same process is repeated at different scales to locate features such as eyes, eyebrows, and lips. Later, Craw et al. described a localization method using a set of 40 templates to search for facial features and a control strategy to guide and assess the results from the template-based feature detectors [28]. Govindaraju et al. presented a two-stage face detection method in which face hypotheses are generated and tested [52], [53], [51]. A face model is built in terms of features defined by the edges. These features describe the curves of the left side, the hair-line, and the right side of a frontal face. The Marr-Hildreth edge operator is used to obtain an edge map of an input image. A filter is then used to remove objects whose contours are unlikely to be part of a face. Pairs of fragmented contours are linked based on their proximity and relative orientation. Corners are detected to segment the contour into feature curves. These feature curves are then labeled by checking their geometric properties and relative positions in the neighborhood. Pairs of feature curves are joined by edges if their attributes are compatible (i.e., if they could arise from the same face). The ratio of the feature pairs forming an edge is compared with the golden ratio and a cost is assigned to the edge. If the cost of a group of three feature curves (with different labels) is low, the group becomes a hypothesis.
When detecting faces in newspaper articles, collateral information, which indicates the number of persons in the image, is obtained from the caption of the input image to select the best hypotheses [52]. Their system reports a detection rate of approximately 70 percent based on a test set of 50 photographs. However, the faces must be upright, unoccluded, and frontal. The same approach has been extended by extracting edges in the wavelet domain by Venkatraman and Govindaraju [165]. Tsukamoto et al. presented a qualitative model for face pattern (QMF) [161], [162]. In QMF, each sample image is divided into a number of blocks, and qualitative features are estimated for each block. To parameterize a face pattern, “lightness” and “edgeness” are defined as the features in this model. Consequently, this blocked template is used to calculate “faceness” at every position of an input image. A face is detected if the faceness measure is above a predefined threshold. Silhouettes have also been used as templates for face localization [134]. A set of basis face silhouettes is obtained using principal component analysis (PCA) on face examples in which the silhouette is represented by an array of bits. These eigen-silhouettes are then used with a generalized Hough transform for localization. A localization method based on multiple templates for facial components was proposed in [150]. Their method defines numerous hypotheses for the possible appearances of facial features. A set of hypotheses for the existence of a face is then defined in terms of the hypotheses for facial components using the Dempster-Shafer theory [34]. Given an image, feature detectors compute
YANG ET AL.: DETECTING FACES IN IMAGES: A SURVEY 41
characteristics of face and nonface images. The learned characteristics are in the form of distribution models or discriminant functions that are consequently used for face detection. Meanwhile, dimensionality reduction is usually carried out for the sake of computation efficiency and detection efficacy. Many appearance-based methods can be understood in a probabilistic framework. An image or feature vector derived from an image is viewed as a random variable $x$, and this random variable is characterized for faces and nonfaces by the class-conditional density functions $p(x \mid \text{face})$ and $p(x \mid \text{nonface})$. Bayesian classification or maximum likelihood can be used to classify a candidate image location as face or nonface. Unfortunately, a straightforward implementation of Bayesian classification is infeasible because of the high dimensionality of $x$, because $p(x \mid \text{face})$ and $p(x \mid \text{nonface})$ are multimodal, and because it is not yet understood if there are natural parameterized forms for $p(x \mid \text{face})$ and $p(x \mid \text{nonface})$. Hence, much of the work in an appearance-based method concerns empirically validated parametric and nonparametric approximations to $p(x \mid \text{face})$ and $p(x \mid \text{nonface})$. Another approach in appearance-based methods is to find a discriminant function (i.e., decision surface, separating hyperplane, threshold function) between face and nonface classes. Conventionally, image patterns are projected to a lower dimensional space and then a discriminant function is formed (usually based on distance metrics) for classification [163], or a nonlinear decision surface can be formed using multilayer neural networks [128]. Recently, support vector machines and other kernel methods have been proposed. These methods implicitly project patterns to a higher dimensional space and then form a decision surface between the projected face and nonface patterns [107].
An early example of employing eigenvectors in face recognition was done by Kohonen [80] in which a simple neural network is demonstrated to perform face recognition for aligned and normalized face images. The neural network computes a face description by approximating the eigenvectors of the image’s autocorrelation matrix. These eigenvectors are later known as Eigenfaces. Kirby and Sirovich demonstrated that images of faces can be linearly encoded using a modest number of basis images [78]. This demonstration is based on the Karhunen-Loève transform [72], [93], [48], which also goes by other names, e.g., principal component analysis [68], and the Hotelling transform [50]. The idea is arguably proposed first by Pearson in 1901 [110] and then by Hotelling in 1933 [62]. Given a collection of n by m pixel training images, each represented as a vector of size m × n, basis vectors spanning an optimal subspace are determined such that the mean square error between the projection of the training images onto this subspace and the original images is minimized. They call the set of optimal basis vectors eigenpictures since these are simply the eigenvectors of the covariance matrix computed from the vectorized face images in the training set. Experiments with a set of 100 images show that a face image of 91 × 50 pixels can be effectively encoded using only 50 eigenpictures, while retaining a reasonable likeness (i.e., capturing 95 percent of the variance).
Turk and Pentland applied principal component analysis to face recognition and detection [163]. Similar to [78], principal component analysis on a training set of face images is performed to generate the Eigenpictures (here called Eigenfaces) which span a subspace (called the face space) of the image space. Images of faces are projected onto the subspace and clustered. Similarly, nonface training images are projected onto the same subspace and clustered. Images of faces do not change radically when projected onto the face space, while the projections of nonface images appear quite different. To detect the presence of a face in a scene, the distance between an image region and the face space is computed for all locations in the image. The distance from face space is used as a measure of “faceness,” and the result of calculating the distance from face space is a “face map.” A face can then be detected from the local minima of the face map. Many works on face detection, recognition, and feature extraction have adopted the idea of eigenvector decomposition and clustering.
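The distance-from-face-space measure can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the implementation of [163]: the mean face and an orthonormal set of Eigenfaces are assumed to have been computed beforehand by some PCA routine, image regions are assumed to be already vectorized, and all names are hypothetical.

```python
import math


def distance_from_face_space(patch, mean_face, eigenfaces):
    """Residual norm of `patch` after projecting onto span(eigenfaces).

    All arguments are flat lists of floats; `eigenfaces` is assumed to be
    an orthonormal basis (hypothetical input, e.g. from a PCA routine).
    """
    centered = [p - m for p, m in zip(patch, mean_face)]
    # Projection coefficients onto each basis vector.
    coeffs = [sum(c * e for c, e in zip(centered, face)) for face in eigenfaces]
    # For an orthonormal basis: ||residual||^2 = ||centered||^2 - ||projection||^2.
    residual_sq = sum(c * c for c in centered) - sum(w * w for w in coeffs)
    return math.sqrt(max(residual_sq, 0.0))


def face_map(patches, mean_face, eigenfaces):
    """Distance from face space for each candidate patch (lower = more face-like)."""
    return [distance_from_face_space(p, mean_face, eigenfaces) for p in patches]
```

Evaluating `face_map` over every window position yields the face map described above, and candidate faces correspond to its local minima.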
Sung and Poggio developed a distribution-based system for face detection [152], [154] which demonstrated how the distributions of image patterns from one object class can be learned from positive and negative examples (i.e., images) of that class. Their system consists of two components, distribution-based models for face/nonface patterns and a multilayer perceptron classifier. Each face and nonface example is first normalized and processed to a 19 × 19 pixel image and treated as a 361-dimensional vector or pattern. Next, the patterns are grouped into six face and six nonface clusters using a modified k-means algorithm, as shown in Fig. 6. Each cluster is represented as a multidimensional Gaussian function with a mean image and a covariance matrix. Fig. 7 shows the distance measures in their method. Two distance metrics are computed between an input image pattern and the prototype clusters. The first distance component is the normalized Mahalanobis distance between the test pattern and the cluster centroid, measured within a lower-dimensional subspace spanned by the cluster’s 75 largest eigenvectors. The second distance component is the Euclidean distance between the test pattern and its projection
Fig. 6. Face and nonface clusters used by Sung and Poggio [154]. Their method estimates density functions for face and nonface patterns using a set of Gaussians. The centers of these Gaussians are shown on the right (Courtesy of K.-K. Sung and T. Poggio).
onto the 75-dimensional subspace. This distance component accounts for pattern differences not captured by the first distance component. The last step is to use a multilayer perceptron (MLP) network to classify face window patterns from nonface patterns using the twelve pairs of distances to each face and nonface cluster. The classifier is trained using standard backpropagation from a database of 47,316 window patterns. There are 4,150 positive examples of face patterns and the rest are nonface patterns. Note that it is easy to collect a representative sample of face patterns, but much more difficult to get a representative sample of nonface patterns. This problem is alleviated by a bootstrap method that selectively adds images to the training set as training progresses. Starting with a small set of nonface examples in the training set, the MLP classifier is trained with this database of examples. Then, they run the face detector on a sequence of random images and collect all the nonface patterns that the current system wrongly classifies as faces. These false positives are then added to the training database as new nonface examples. This bootstrap method avoids the problem of explicitly collecting a representative sample of nonface patterns and has been used in later works [107], [128].
A probabilistic visual learning method based on density estimation in a high-dimensional space using an eigenspace decomposition was developed by Moghaddam and Pentland [103]. Principal component analysis (PCA) is used to define the subspace best representing a set of face patterns. These principal components preserve the major linear correlations in the data and discard the minor ones. This method decomposes the vector space into two mutually exclusive and complementary subspaces: the principal subspace (or feature space) and its orthogonal complement.
Therefore, the target density is decomposed into two components: the density in the principal subspace (spanned by the principal components) and its orthogonal complement (which is discarded in standard PCA) (See Fig. 8). A multivariate Gaussian and a mixture of Gaussians are used to learn the statistics of the local features of a face. These probability densities are then used for object detection based on maximum likelihood estimation. The proposed method has been applied to face localization, coding, and recognition.
Compared with the classic eigenface approach [163], the proposed method shows better performance in face recognition. In terms of face detection, this technique has only been demonstrated on localization; see also [76]. In [175], a detection method based on a mixture of factor analyses was proposed. Factor analysis (FA) is a statistical method for modeling the covariance structure of high dimensional data using a small number of latent variables. FA is analogous to principal component analysis (PCA) in several aspects. However, PCA, unlike FA, does not define a proper density model for the data since the cost of coding a data point is equal anywhere along the principal component subspace (i.e., the density is unnormalized along these directions). Further, PCA is not robust to independent noise in the features of the data since the principal components maximize the variances of the input data, thereby retaining unwanted variations. Synthetic and real examples in [36], [37], [9], [7] have shown that the projected samples from different classes in the PCA subspace can often be smeared. For the cases where the samples have certain structure, PCA is suboptimal from the classification standpoint. Hinton et al. have applied FA to digit recognition, and they compare the performance of PCA and FA models [61]. A mixture model of factor analyzers has recently been extended [49] and applied to face recognition [46]. Both studies show that FA performs better than PCA in digit and face recognition. Since pose, orientation, expression, and lighting affect the appearance of a human face, the distribution of faces in the image space can be better represented by a multimodal density model where each modality captures certain characteristics of certain face appearances. The authors of [175] present a probabilistic method that uses a mixture of factor analyzers (MFA) to detect faces with wide variations. The parameters in the mixture model are estimated using an EM algorithm.
A second method in [175] uses Fisher’s Linear Discriminant (FLD) to project samples from the high dimensional image space to a lower dimensional feature space. Recently, the Fisherface method [7] and others [156], [181] based on linear discriminant analysis have been shown to outperform the widely used Eigenface method [163] in face recognition on several data sets, including the Yale face database where face images are taken under varying lighting conditions. One possible explanation is that FLD provides a better projection than PCA for pattern classification since it aims to find the most discriminant projection direction. Consequently, the classification results in the projected subspace may be
Fig. 7. The distance measures used by Sung and Poggio [154]. Two distance metrics are computed between an input image pattern and the prototype clusters. (a) Given a test pattern, the distance between that image pattern and each cluster is computed, giving a set of 12 distances between the test pattern and the model’s 12 cluster centroids. (b) Each distance measurement between the test pattern and a cluster centroid is a two-value distance metric. $D_1$ is a Mahalanobis distance between the test pattern’s projection and the cluster centroid in a subspace spanned by the cluster’s 75 largest eigenvectors. $D_2$ is the Euclidean distance between the test pattern and its projection in the subspace. Therefore, a distance vector of 24 values is formed for each test pattern and is used by a multilayer perceptron to determine whether the input pattern belongs to the face class or not (Courtesy of K.-K. Sung and T. Poggio).
Fig. 8. Decomposition of a face image space into the principal subspace $F$ and its orthogonal complement $\bar{F}$ for an arbitrary density. Every data point x is decomposed into two components: distance in feature space (DIFS) and distance from feature space (DFFS) [103] (Courtesy of B. Moghaddam and A. Pentland).
nonface images (i.e., the intensities and spatial relationships of pixels) whereas Sung [152] used a neural network to find a discriminant function to classify face and nonface patterns using distance measures. They also used multiple neural networks and several arbitration methods to improve performance, while Burel and Carel [12] used a single network, and Vaillant et al. [164] used two networks for classification. There are two major components: multiple neural networks (to detect face patterns) and a decision-making module (to render the final decision from multiple detection results). As shown in Fig. 10, the first component of this method is a neural network that receives a 20 × 20 pixel region of an image and outputs a score ranging from -1 to 1. Given a test pattern, the output of the trained neural network indicates the evidence for a nonface (close to -1) or face pattern (close to 1). To detect faces anywhere in an image, the neural network is applied at all image locations. To detect faces larger than 20 × 20 pixels, the input image is repeatedly subsampled, and the network is applied at each scale. Nearly 1,050 face samples of various sizes, orientations, positions, and intensities are used to train the network. In each training image, the eyes, tip of the nose, corners, and center of the mouth are labeled manually and used to normalize the face to the same scale, orientation, and position. The second component of this method is to merge overlapping detections and arbitrate between the outputs of multiple networks. Simple arbitration schemes such as logic operators (AND/OR) and voting are used to improve performance. Rowley et al. [127] reported several systems with different arbitration schemes that are less computationally expensive than Sung and Poggio’s system and have higher detection rates based on a test set of 24 images containing 144 faces.
One limitation of the methods by Rowley [127] and by Sung [152] is that they can only detect upright, frontal faces. Recently, Rowley et al. [129] extended this method to detect rotated faces using a router network which processes each input window to determine the possible face orientation and then rotates the window to a canonical orientation; the rotated window is presented to the neural networks as described above. However, the new system has a lower detection rate on upright faces than the upright detector. Nevertheless, the system is able to detect 76.9 percent of faces over two large test sets with a small number of false positives.
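The scan-and-arbitrate structure described above can be sketched as follows. This is a simplified illustration, not Rowley et al.'s system: the trained networks are replaced by arbitrary scoring callables returning a value in [-1, 1], the pyramid subsamples by a factor of 2 without smoothing (an assumption chosen for brevity), overlapping detections are not merged, voting is omitted, and all function names are hypothetical.

```python
def subsample(image, step=2):
    """Crude pyramid level: keep every `step`-th pixel (no smoothing)."""
    return [row[::step] for row in image[::step]]


def scan(image, classify, win=20):
    """Yield (row, col) of every win x win window that `classify` accepts."""
    for r in range(len(image) - win + 1):
        for c in range(len(image[0]) - win + 1):
            window = [row[c:c + win] for row in image[r:r + win]]
            if classify(window) > 0.0:
                yield (r, c)


def detect(image, networks, win=20, mode="and"):
    """Scan all pyramid levels; combine the networks' outputs by AND or OR."""
    combine = set.intersection if mode == "and" else set.union
    detections = []
    scale = 1
    while len(image) >= win and len(image[0]) >= win:
        hits = [set(scan(image, net, win)) for net in networks]
        for r, c in sorted(combine(*hits)):
            # Map back to original-image coordinates: (row, col, window size).
            detections.append((r * scale, c * scale, win * scale))
        image = subsample(image)
        scale *= 2
    return detections
```

ANDing the networks removes detections that only one network supports, which tends to lower false positives at the cost of some missed faces, while ORing has the opposite effect.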
Support Vector Machines (SVMs) were first applied to face detection by Osuna et al. [107]. SVMs can be considered as a new paradigm to train polynomial function, neural network, or radial basis function (RBF) classifiers. While most methods for training a classifier (e.g., Bayesian, neural networks, and RBF) are based on minimizing the training error, i.e., empirical risk, SVMs operate on another induction principle, called structural risk minimization, which aims to minimize an upper bound on the expected generalization error. An SVM classifier is a linear classifier where the separating hyperplane is chosen to minimize the expected classification error of the unseen test patterns. This optimal hyperplane is defined by a weighted combination of a small subset of the training vectors, called support vectors. Estimating the optimal hyperplane is equivalent to solving a linearly constrained quadratic programming problem. However, the computation is both time and memory intensive. In [107], Osuna et al. developed an efficient method to train an SVM for large scale problems, and applied it to face detection. Based on two test sets of 10,000,000 test patterns of 19 × 19 pixels, their system has slightly lower error rates and runs approximately 30 times faster than the system by Sung and Poggio [153]. SVMs have also been used to detect faces and pedestrians in the wavelet domain [106], [108], [109].
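For reference, the quadratic program mentioned above is usually written in its dual form. The notation below is the standard textbook one and is an assumption of this sketch, not taken from [107]: $x_i$ are the $N$ training patterns, $y_i \in \{-1, +1\}$ their face/nonface labels, $K$ a kernel function, and $C$ a regularization constant.

```latex
\max_{\alpha}\; \sum_{i=1}^{N} \alpha_i
  - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j \, y_i y_j \, K(x_i, x_j)
\quad \text{subject to} \quad
\sum_{i=1}^{N} \alpha_i y_i = 0, \qquad 0 \le \alpha_i \le C,

f(x) = \operatorname{sign}\!\left( \sum_{i=1}^{N} \alpha_i \, y_i \, K(x_i, x) + b \right).
```

Only the training vectors with $\alpha_i > 0$ contribute to $f(x)$; these are the support vectors. The objective involves an $N \times N$ kernel matrix, which is why training is time and memory intensive for large training sets.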
Yang et al. proposed a method that uses the SNoW learning architecture [125], [16] to detect faces with different features and expressions, in different poses, and under different lighting conditions [176]. They also studied the effect of learning with primitive as well as with multiscale features. SNoW (Sparse Network of Winnows) is a sparse network of linear functions that utilizes the Winnow update rule [92]. It is specifically tailored for learning in domains in which the potential number of features taking part in decisions is very large, but may be unknown a priori. Some of the characteristics of this learning architecture are its sparsely connected units, the allocation of features and links in a data driven way, the decision mechanism, and the utilization of an efficient update rule. In training the SNoW-based face detector, 1,681 face images from Olivetti [136], UMIST [56], Harvard [57], Yale [7], and FERET [115] databases are used to capture the variations in face patterns. To compare with other methods, they report results with two readily
Fig. 10. System diagram of Rowley’s method [128]. Each face is preprocessed before feeding it to an ensemble of neural networks. Several arbitration methods are used to determine whether a face exists based on the output of these networks (Courtesy of H. Rowley, S. Baluja, and T. Kanade).
available data sets which contain 225 images with 619 faces [128]. With an error rate of 5.9 percent, this technique performs as well as other methods evaluated on data set 1 in [128], including those using neural networks [128], Kullback relative information [24], naive Bayes classifier [140] and support vector machines [107], while being computationally more efficient. See Table 4 for performance comparisons with other face detection methods.
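The Winnow rule at the core of SNoW is a mistake-driven multiplicative update over the features that are active in an example. The following is a minimal sketch of a single Winnow unit, not the SNoW architecture itself; the promotion and demotion factors, the threshold, the initial weight of 1.0, and the function names are all assumptions chosen for illustration.

```python
def winnow_predict(weights, active, threshold):
    """Predict 1 (face) if the summed weights of active features exceed the threshold."""
    return 1 if sum(weights.get(f, 0.0) for f in active) > threshold else 0


def winnow_update(weights, active, label, threshold, alpha=2.0, beta=0.5):
    """One mistake-driven Winnow step; weights are allocated lazily per feature."""
    for f in active:
        weights.setdefault(f, 1.0)  # data-driven allocation of new links
    prediction = winnow_predict(weights, active, threshold)
    if prediction == label:
        return weights  # no mistake, no update
    factor = alpha if label == 1 else beta  # promote on a miss, demote on a false alarm
    for f in active:
        weights[f] *= factor
    return weights
```

Because weights exist only for features that have actually appeared in some example, the cost of prediction and update depends on the number of active features and not on the size of the full feature space.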
In contrast to the methods in [107], [128], [154] which model the global appearance of a face, Schneiderman and Kanade described a naive Bayes classifier to estimate the joint probability of local appearance and position of face patterns (subregions of the face) at multiple resolutions [140]. They emphasize local appearance because some local patterns of an object are more unique than others; the intensity patterns around the eyes are much more distinctive than the pattern found around the cheeks. There are two reasons for using a naive Bayes classifier (i.e., no statistical dependency between the subregions). First, it provides better estimation of the conditional density functions of these subregions. Second, a naive Bayes classifier provides a functional form of the posterior probability to capture the joint statistics of local appearance and position on the object. At each scale, a face image is decomposed into four rectangular subregions. These subregions are then projected to a lower dimensional space using PCA and quantized into a finite set of patterns, and the statistics of each projected subregion are estimated from the projected samples to encode local appearance. Under this formulation, their method decides that a face is present when the likelihood ratio is larger than the ratio of prior probabilities. With a detection rate of 93.0 percent on data set 1 in [128], the proposed Bayesian approach shows comparable performance to [128] and is able to detect some rotated and profile faces. Schneiderman and Kanade later extended this method with wavelet representations to detect profile faces and cars [141]. A related method using joint statistical models of local features was developed by Rickert et al. [124]. Local features are extracted by applying multiscale and multiresolution filters to the input image. The distribution of the feature vectors (i.e., filter responses) is estimated by clustering the data and then forming a mixture of Gaussians.
After the model is learned and further refined, test images are classified by computing the likelihood of their feature vectors with respect to the model. Their experiments on face and car detection show promising results.
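The decision rule of the naive Bayes classifier of Schneiderman and Kanade [140] described above can be written compactly. The notation is an assumption of this sketch: $q_k$ denotes the quantized pattern of the $k$-th subregion and $pos_k$ its position, and the product form follows from the stated assumption of no statistical dependency between subregions.

```latex
\frac{P(\text{region} \mid \text{face})}{P(\text{region} \mid \text{nonface})}
\;\approx\;
\prod_{k} \frac{P(q_k, pos_k \mid \text{face})}{P(q_k, pos_k \mid \text{nonface})}
\;>\;
\frac{P(\text{nonface})}{P(\text{face})}
\quad \Longrightarrow \quad \text{decide face}.
```

In practice, the ratio of priors acts as a tunable threshold that trades detection rate against false positives.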
The underlying assumption of the Hidden Markov Model (HMM) is that patterns can be characterized as a parametric random process and that the parameters of this process can be estimated in a precise, well-defined manner. In developing an HMM for a pattern recognition problem, a number of hidden states need to be decided first to form a model. Then, one can train an HMM to learn the transitional probability between states from the examples where each example is represented as a sequence of observations. The goal of training an HMM is to maximize the probability of observing the training data by adjusting the parameters in an HMM model with the standard Viterbi segmentation method and Baum-Welch algorithms [122]. After the HMM has been trained, the output probability of an observation determines the class to which it belongs. Intuitively, a face pattern can be divided into several regions such as the forehead, eyes, nose, mouth, and chin. A face pattern can then be recognized by a process in which these regions are observed in an appropriate order (from top to bottom and left to right). Instead of relying on accurate alignment as in template matching or appearance-based methods (where facial features such as eyes and noses need to be aligned well with respect to a reference point), this approach aims to associate facial regions with the states of a continuous density Hidden Markov Model. HMM-based methods usually treat a face pattern as a sequence of observation vectors where each vector is a strip of pixels, as shown in Fig. 11a. During training and testing, an image is scanned in some order (usually from top to bottom) and an observation is taken as a block of pixels. For face patterns, the boundaries between strips of pixels are represented by probabilistic transitions between states, as shown in Fig. 11b, and the image data within a region is modeled by a multivariate Gaussian distribution.
An observation sequence consists of all intensity values from each block. The output states correspond to the classes to which the observations belong. HMMs have been applied to both face recognition and localization. Samaria [136] showed that the states of the HMM he trained correspond to facial regions, as shown in
Fig. 11. Hidden Markov model for face localization. (a) Observation vectors: To train an HMM, each face sample is converted to a sequence of observation vectors. Observation vectors are constructed from a window of W × L pixels. By scanning the window vertically with P pixels of overlap, an observation sequence is constructed. (b) Hidden states: When an HMM with five states is trained with sequences of observation vectors, the boundaries between states are shown in (b) [136].
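The construction of the observation sequence in Fig. 11a can be sketched as follows, assuming the window spans the full image width so that only its height (L rows) and the overlap (P rows) matter. The function name and argument names are hypothetical, and training the HMM on the resulting sequences is not shown.

```python
def observation_sequence(image, strip_height, overlap):
    """Cut an image (list of pixel rows) into overlapping horizontal strips.

    Each observation vector is the flattened intensities of `strip_height`
    rows; consecutive strips share `overlap` rows.
    """
    if not 0 <= overlap < strip_height:
        raise ValueError("overlap must satisfy 0 <= overlap < strip_height")
    step = strip_height - overlap
    sequence = []
    for top in range(0, len(image) - strip_height + 1, step):
        strip = image[top:top + strip_height]
        sequence.append([pixel for row in strip for pixel in row])
    return sequence
```

For an image with H rows, this yields floor((H - L) / (L - P)) + 1 observation vectors, ordered from top to bottom.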
Inductive learning algorithms have also been applied to locate and detect faces. Huang et al. applied Quinlan’s C4.5 algorithm [121] to learn a decision tree from positive and negative examples of face patterns [64]. Each training example is an 8 × 8 pixel window and is represented by a vector of 30 attributes which is composed of entropy, mean, and standard deviation of the pixel intensity values. From these examples, C4.5 builds a classifier as a decision tree whose leaves indicate class identity and whose nodes specify tests to perform on a single attribute. The learned decision tree is then used to decide whether a face exists in the input example. The experiments show a localization accuracy rate of 96 percent on a set of 2,340 frontal face images in the FERET data set. Duta and Jain [38] presented a method to learn the face concept using Mitchell’s Find-S algorithm [101]. Similar to [154], they conjecture that the distribution of face patterns $p(x \mid \text{face})$ can be approximated by a set of Gaussian clusters and that the distance from a face instance to one of the cluster centroids should be smaller than a fraction of the maximum distance from the points in that cluster to its centroid. The Find-S algorithm is then applied to learn the thresholding distance such that faces and nonfaces can be differentiated. This method has several distinct characteristics. First, it does not use negative (nonface) examples, while [154], [128] use both positive and negative examples. Second, only the central portion of a face is used for training. Third, feature vectors consist of images with 32 intensity levels or textures, while [154] uses full-scale intensity values as inputs. This method achieves a detection rate of 90 percent on the first CMU data set.
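The three kinds of attribute used in the decision-tree method of Huang et al. (entropy, mean, and standard deviation of pixel intensities) can be computed as in the sketch below. How [64] arranges these statistics into 30 attributes per window is not specified here, so the sketch computes them once over a whole window; the function name is hypothetical and the entropy is taken over the empirical intensity histogram, which is an assumption.

```python
import math
from collections import Counter


def window_attributes(window):
    """Return (entropy, mean, std) of the intensities in a 2D window."""
    pixels = [v for row in window for v in row]
    n = len(pixels)
    mean = sum(pixels) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in pixels) / n)
    # Shannon entropy (in bits) of the empirical intensity histogram.
    entropy = -sum((k / n) * math.log2(k / n) for k in Counter(pixels).values())
    return entropy, mean, std
```

Attribute vectors of this kind, labeled face or nonface, are what a decision-tree learner such as C4.5 would take as training input.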
We have reviewed and classified face detection methods into four major categories. However, some methods can be classified into more than one category. For example, template matching methods usually use a face model and subtemplates to extract facial features [132], [27], [180], [143], [51], and then use these features to locate or detect faces. Furthermore, the boundary between knowledge-based methods and some template matching methods is blurry since the latter usually implicitly apply human knowledge to define the face templates [132], [28], [143]. On the other hand, face detection methods can also be categorized otherwise. For example, these methods can be classified based on whether they rely on local features [87], [140], [124] or treat a face pattern as a whole (i.e., holistic) [154], [128]. Nevertheless, we think the four major classes categorize most methods sufficiently and appropriately.
Most face detection methods require a training data set of face images and the databases originally developed for face recognition experiments can be used as training sets for face detection. Since these databases were constructed to empirically evaluate recognition algorithms in certain domains, we first review the characteristics of these databases and their applicability to face detection. Although numerous face
detection algorithms have been developed, most of them have not been tested on data sets with a large number of images. Furthermore, most experimental results are reported using different test sets. In order to compare methods fairly, a few benchmark data sets have recently been compiled. We review these benchmark data sets and discuss their characteristics. There are still a few issues that need to be carefully considered in performance evaluation even when the methods use the same test set. One issue is that researchers have different interpretations of what a “successful detection” is. Another issue is that different training sets are used, particularly, for appearance-based methods. We conclude this section with a discussion of these issues.
Although many face detection methods have been proposed, less attention has been paid to the development of an image database for face detection research. The FERET database consists of monochrome images taken in different frontal views and in left and right profiles [115]. Only the upper torso of an individual (mostly head and neck) appears in an image on a uniform and uncluttered background. The FERET database has been used to assess the strengths and weaknesses of different face recognition approaches [115]. Since each image consists of an individual on a uniform and uncluttered background, it is not suitable for face detection benchmarking. This is similar to many databases that were created for the development and testing of face recognition algorithms. Turk and Pentland created a face database of 16 people [163] (available at ftp://whitechapel.media.mit.edu/pub/images/). The images are taken in frontal view with slight variability in head orientation (tilted upright, right, and left) on a cluttered background. The face database from AT&T Cambridge Laboratories (formerly known as the Olivetti database) consists of 10 different images for each of 40 distinct subjects (available at http://www.uk.research.att.com/facedatabase.html) [136]. The images were taken at different times, varying the lighting, facial expressions, and facial details (glasses). The Harvard database consists of cropped, masked frontal face images taken from a wide variety of light sources [57]. It was used by Hallinan for a study on face recognition under the effect of varying illumination conditions. With 16 individuals, the Yale face database (available at http://cvc.yale.edu/) contains 10 frontal images per person, each with different facial expressions, with and without glasses, and under different lighting conditions [7]. The M2VTS multimodal database from the European ACTS projects was developed for access control experiments using multimodal inputs [116].
It contains sequences of face images of 37 people. The five sequences for each subject were taken over one week. Each image sequence contains images from right profile (-90 degrees) to left profile (90 degrees) while the subject counts from “0” to “9” in their native language. The UMIST database consists of 564 images of 20 people with varying pose. The images of each subject cover a range of poses from right profile to frontal views [56]. The Purdue AR database contains over 3,276 color images of 126 people (70 males and 56 females) in frontal view [96]. This database is designed for face recognition experiments under several mixing factors, such as facial expressions, illumination conditions, and occlusions. All the faces appear with different facial expressions (neutral, smile, anger, and scream), illumination
YANG ET AL.: DETECTING FACES IN IMAGES: A SURVEY 49
(left light source, right light source, and sources from both sides), and occlusion (wearing sunglasses or scarf). The images were taken during two sessions separated by two weeks. All the images were taken by the same camera setup under tightly controlled conditions of illumination and pose. This face database has been applied to image and video indexing as well as retrieval [96]. Table 2 summarizes the characteristics of the abovementioned face image databases.
The abovementioned databases are designed mainly to measure the performance of face recognition methods and, thus, each image contains only one individual. Therefore, such databases are best utilized as training sets rather than test sets. The tacit reason for comparing classifiers on test sets is that these data sets represent problems that systems might face in the real world and that superior performance on these benchmarks may translate to superior performance on other real-world tasks. Toward this end, researchers have compiled a collection of data sets from a wide variety of images. Sung and Poggio created two databases for face detection [152], [154]. The first set consists of 301 frontal and near-frontal mugshots of 71 different people. These images are high quality digitized images with a fair amount of lighting variation. The second set consists of 23 images with a total of 149 face patterns. Most of these images have complex backgrounds, with faces taking up only a small amount of the total image area. The most widely used face detection database was created by Rowley et al. [127], [130] (available at http://www.cs.cmu.edu/~har/faces.html). It consists of 130 images with a total of 507 frontal faces. This data set includes the 23 images of the second data set used by Sung and Poggio [154]. Most images contain more than one face on a cluttered background, so this is a good test set to assess algorithms which detect upright frontal faces. Fig. 12 shows some images in the data set collected by Sung and
50 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 24, NO. 1, JANUARY 2002
Table 2. Face Image Database
Fig. 12. Sample images in Sung and Poggio’s data set [154]. Some images are scanned from newspapers and, thus, have low resolution. Though most faces in the images are upright and frontal, some appear in different poses.
expressions and in profile views [141]. Fig. 15 shows some images in the test set. Recently, Kodak compiled an image database as a common test bed for direct benchmarking of face detection and recognition algorithms [94]. Their database has 300 digital photos that are captured in a variety of resolutions, with face sizes ranging from as small as 13 × 13 pixels to as large as 300 × 300 pixels. Table 3 summarizes the characteristics of the abovementioned test sets for face detection.
In order to obtain a fair empirical evaluation of face detection methods, it is important to use a standard and representative test set for experiments. Although many face detection methods have been developed over the past
Fig. 15. Sample images of profile faces from Schneiderman and Kanade’s data set [141]. This data set contains images with faces in profile views and some with facial expressions.
Table 3. Test Sets for Face Detection
decade, only a few of them have been tested on the same data set. Table 4 summarizes the reported performance among several appearance-based face detection methods on two standard data sets described in the previous section. Although Table 4 shows the performance of these methods on the same test set, such an evaluation may not characterize how well these methods will compare in the field. There are a few factors that complicate the assessment of these appearance-based methods. First, the reported results are based on different training sets and different tuning parameters. The number and variety of training examples have a direct effect on the classification performance. However, this factor is often ignored in performance evaluation, which is an appropriate criterion if the goal is to evaluate the systems rather than the learning methods. The second factor is the training time and execution time. Although the training time is usually ignored by most systems, it may be important for real-time applications that require online training on different data sets. Third, the number of scanning windows in these methods varies because they are designed to operate in different environments (i.e., to detect faces within a size range). For example, Colmenarez and Huang argued that their method scans more windows than others and, thus, the number of false detections is higher than others [24]. Furthermore, the criterion adopted in reporting the detection rates is usually not clearly described in most systems. Fig. 16a shows a test image and Fig. 16b shows some subimages to be classified as a face or nonface. Suppose that all the subimages in Fig. 16b are classified as face patterns. Some criteria may consider all of them as “successful” detections. However, a stricter criterion (e.g., each successful detection must contain all the visible eyes and mouths in an image) may classify most of them as false alarms.
It is clear that a uniform criterion should be adopted to assess different classifiers. In [128], Rowley et al. adjust the criteria until the experimental results match their intuition of what a correct detection is, i.e., the square window should contain the eyes and also the mouth. The criterion they eventually use is that the center of the detected bounding box must be within four pixels and the scale must be within a factor of 1.2 (their scale step size) of ground truth (recorded manually). Finally, the evaluation criteria may and should depend on the purpose of the detector. If the detector is going to be used to count people, then the sum of false positives and false
negatives is appropriate. On the other hand, if the detector is to be used to verify that an individual is who he/she claims to be (validation), then it may be acceptable for the face detector to have additional false detections since it is unlikely that these false detections will be acceptable images of the individual, i.e., the validation process will reject the false detections. In other words, the penalty or cost of one type of error should be properly weighted such that one can build an optimal classifier using the Bayes decision rule (see Sections 2.2-2.4 in [36]). This argument is supported by a recent study which points out that the accuracy of the classifier (i.e., detection rate in face detection) is not an appropriate goal for many real-world tasks [118]. One reason is that classification accuracy assumes equal misclassification costs. This assumption is problematic because for most real-world problems one type of classification error is much more expensive than another. In some face detection applications, it is important that all the existing faces are detected. Another reason is that accuracy maximization assumes that the class distribution is known for the target environment. In other words, we assume the test data sets represent the “true” working environment for the face detectors. However, this assumption is rarely justified. When detection methods are used within real systems, it is important to consider what computational resources are required, particularly time and memory. Accuracy may need to be sacrificed for speed.
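The cost-weighting argument above can be made concrete with a minimal sketch of the two-class minimum-risk Bayes decision. The function names and the cost values are hypothetical and chosen only for illustration; the sketch assumes that correct decisions cost nothing and that a posterior probability of "face" is already available for each window.

```python
def bayes_threshold(cost_fp, cost_fn):
    """Posterior above which declaring "face" has lower expected cost.

    Assumes zero cost for correct decisions (an illustrative simplification).
    Declaring "face" risks (1 - p) * cost_fp; declaring "nonface" risks
    p * cost_fn. The two risks are equal at p = cost_fp / (cost_fp + cost_fn).
    """
    return cost_fp / (cost_fp + cost_fn)


def decide_face(posterior, cost_fp=1.0, cost_fn=1.0):
    """Return True when declaring "face" minimizes the expected cost."""
    return posterior > bayes_threshold(cost_fp, cost_fn)


def expected_risk(posterior, say_face, cost_fp, cost_fn):
    """Expected cost of one decision given the posterior P(face | window)."""
    return (1.0 - posterior) * cost_fp if say_face else posterior * cost_fn


# Hypothetical window with P(face | window) = 0.3.
p = 0.3
equal_costs = decide_face(p)                          # threshold 0.5
miss_is_costly = decide_face(p, cost_fp=1.0, cost_fn=4.0)  # threshold 0.2
```

With equal costs the window is rejected, which corresponds to maximizing accuracy. When a missed face is assumed to be four times as costly as a false detection, the same window is accepted, which is the behavior one would want in an application where all existing faces must be detected.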
Table 4. Experimental Results on Images from Test Set 1 (125 Images with 483 Faces) and Test Set 2 (23 Images with 136 Faces) (See Text for Details)
Fig. 16. (a) Test image. (b) Detection results. Different criteria lead to different detection results. Suppose all the subimages in (b) are classified as face patterns by a classifier. A loose criterion may declare all the faces as “successful” detections, while a more strict one would declare most of them as nonfaces.
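The dependence of reported results on the matching criterion, as illustrated in Fig. 16, can be sketched as follows. This is a minimal illustration and not the evaluation code of any cited system: the `Box` type, the function names, the use of Euclidean distance between centers, and the greedy one-to-one matching (which counts duplicate detections of one face as false positives) are all assumptions. The default thresholds follow the criterion attributed to Rowley et al. in the text (center within four pixels, scale within a factor of 1.2).

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """A square window: center (cx, cy) and side length, all in pixels."""
    cx: float
    cy: float
    size: float


def is_match(det, truth, max_center_dist=4.0, max_scale_ratio=1.2):
    """True if a detection agrees with a ground-truth box under the criterion."""
    # Assumption: "within four pixels" is read as Euclidean center distance.
    center_dist = math.hypot(det.cx - truth.cx, det.cy - truth.cy)
    # Ratio of the larger side to the smaller, so the test is symmetric.
    scale_ratio = max(det.size, truth.size) / min(det.size, truth.size)
    return center_dist <= max_center_dist and scale_ratio <= max_scale_ratio


def score(detections, truths, **criterion):
    """Greedy one-to-one matching; returns (true_pos, false_pos, false_neg)."""
    unmatched = list(truths)
    true_pos = false_pos = 0
    for det in detections:
        hit = next((t for t in unmatched if is_match(det, t, **criterion)), None)
        if hit is None:
            false_pos += 1
        else:
            unmatched.remove(hit)  # each ground-truth face is matched once
            true_pos += 1
    return true_pos, false_pos, len(unmatched)


# Hypothetical ground truth and detector output for one image.
truths = [Box(50, 50, 20), Box(120, 80, 40)]
detections = [Box(58, 50, 20), Box(200, 200, 20)]

strict = score(detections, truths)  # default thresholds: 4 pixels, factor 1.2
loose = score(detections, truths, max_center_dist=10.0, max_scale_ratio=2.0)
```

In this example the first detection is eight pixels off center, so it is a false alarm under the strict criterion and a correct detection under the loose one. The same detector output therefore yields different detection rates and false detection counts depending only on the criterion.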