The History Of The Boosted Appearance Model

Published Date: 02 Nov 2017

1 Introduction

With the explosion of inexpensive technological devices, as well as the increased popularity of

media-sharing websites like Facebook, Picasa, YouTube, and Flickr, face alignment has gained popularity

amongst the researched areas in computer vision. Face alignment or locating semantic facial

landmarks (e.g. eyes, nose, mouth) became essential for problems such as face recognition, face

tracking, face animation, and 3D face modeling. However, such requirements are still challenging

for current approaches in unconstrained environments, due to the large variations in facial

appearance, illumination, and partial occlusions.

The aim of face alignment methods is to find a transformation (i.e. warping) between two or

more images in order for the images to match. In recent years, probabilistic methods emerged

as an effective class of facial-alignment approaches, which attempt to find optimal parameters

of a probability measure given the landmark positions. Probabilistic methods can be divided

into discriminative face-alignment models (DFA) and generative face-alignment models (GFA).

The GFA is able to generate values of any variable in the model [11], whereas a DFA allows

only sampling of the target variables conditional on the observed quantities [11]. DFAs have the

advantage that they do not need to model the distribution of the observed variables as they calculate

the posterior probability directly. However, they cannot express complex relationships between

the observed and target variables [11]. In general, selecting between discriminative and generative

approaches is done by the user and it is application dependent. Examples of DFA approaches

include Active Shape Model (ASM), Boosted Appearance Model (BAM), Ranking Appearance

Model (RAM) as well as their extensions; while GFA approaches encompass Active Appearance

Model (AAM), Conditional Local Models (CLM), Congealing, and Batch Set Methods.

Because face-alignment involves warping and deformable surfaces, most of the algorithms addressing

the topic are non-rigid and can be separated into two categories: holistic or local. Holistic

methods involve manual selection of the landmark positions [15], while the local methods are fully

automated [14]. Holistic methods, in which the whole face is warped, have been recently employed

to align batches of images, rather than individual faces. Examples of holistic approaches include

ASM, AAM, BAM, RAM, and Active Wavelet Networks (AWN). Local methods use patch-based

representation and assume that image observations made for each landmark are conditionally independent.

This assumption leads to better generalization with a limited amount of data compared

to holistic representations, because it only accounts for local correlations between pixel values.

However, local methods suffer from detection ambiguities as a direct result of the use of a local representation.

Therefore a better technique would be to use a non-parametric set of shape models and

combine detection results from several local detectors, as observed in congealing, CLM and Batch

Set Methods, which are all local approaches. A detailed description of these methods is provided

in the following chapter.

Despite substantial progress in the face-alignment area, the problem of real-time deformable face

alignment in the presence of pose, identity, expression, and lighting variations as well as image

noise, resolution and partial occlusions, remains unresolved. We believe that future directions

lead towards BAM extensions, which can benefit from the availability of boosting and pattern-classification

techniques from machine learning. For example, incremental boosting can be used

for incorporating warped hard-to-classify images into the training data, so as to improve the classification

capability of BAM methods. Also, sophisticated optimization methods that reduce computational

complexity and avoid local optima (e.g. constrained mean-shift [9], or non-parametric

strategies) can be used to maximize the classification score computed in these state-of-the-art methods.

2 Deformable Face Alignment

Face alignment is a key component of computer-vision methods that analyze and recognize faces

in images and videos. Practical applications of face alignment include face and expression recognition

[5, 4], age estimation [4], face fitting [8], image coding [1], performance-driven animation

[13], and object tracking [6, 7].

[Figure 1 about here.]

Algorithms for facial alignment aim to find a transformation between two facial images, so that the

two images match as closely as possible [?]. In the computer-vision literature, this transformation

is often achieved by using a deformable model that warps a template image onto a target image,

while minimizing a measure of error between the deformed template and the target. The warping is

in general computed with the help of polygonal meshes, a popular and efficient class of deformation

models [?]. When using polygonal meshes, pixels inside a polygon (e.g. triangle) deform equally

with that polygon (e.g. affine, projective) causing the image to distort along with the mesh. Thus

the alignment problem reduces to finding the mesh deformation (i.e. an offset vector for every

vertex).
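The per-triangle deformation described above can be sketched as follows. This is a minimal NumPy illustration, not the implementation used in the cited works: the function names `affine_from_triangles` and `warp_point`, as well as the sample triangles, are assumptions introduced for the example. It solves for the 3x2 affine matrix that maps the three vertices of a template triangle onto their displaced positions, then applies that matrix to a pixel inside the triangle.

```python
import numpy as np

def affine_from_triangles(src_tri, dst_tri):
    """Return the 3x2 matrix a such that [1, x, y] @ a maps each
    source vertex onto the corresponding destination vertex."""
    src = np.asarray(src_tri, dtype=float)        # shape (3, 2)
    dst = np.asarray(dst_tri, dtype=float)        # shape (3, 2)
    design = np.hstack([np.ones((3, 1)), src])    # rows are [1, x, y]
    return np.linalg.solve(design, dst)           # shape (3, 2)

def warp_point(point, a):
    """Apply the affine matrix a to a single (x, y) point."""
    x, y = point
    return np.array([1.0, x, y]) @ a

# Hypothetical triangles: the target is the source scaled by (2, 3)
# and translated by (2, 3).
src = [(0, 0), (1, 0), (0, 1)]
dst = [(2, 3), (4, 3), (2, 6)]
a = affine_from_triangles(src, dst)

# A pixel inside the source triangle moves together with the triangle.
print(warp_point((0.25, 0.25), a))  # approximately (2.5, 3.75)
```

Because every pixel inside a triangle shares the same matrix, the full image warp only needs one such matrix per triangle plus a lookup of which triangle each pixel falls in.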

This can be seen in Figure 1, where a male face is tracked after aligning using a mesh. This method

works with different poses (a), while varying the facial expressions (b), as well as under changing

illumination (c).

Another promising approach to face alignment involves the selection of landmarks on the template

face (e.g. corners of eyes, corners of mouth, tip of nose, iris center), which are then warped by a

function to correspond to consistent locations on the target face. Finding the correct warping is a

challenging task as it involves optimization in high dimensions due to the landmark parametrization

(e.g. computing the parametric values that distort the landmarks), as appearance can vary greatly

between images due to lighting conditions, image noise, and resolution [9].

[Figure 2 about here.]

An example of landmark selection and tracking is shown in Figure 2, where 34 landmarks are

aligned over several sequential frames.

Today's increasing availability of face photos on the Web (e.g. Faces in the Wild, Bosphorus,

FERET, BioID, FRGC, CMU Pie, LFW), and new applications such as face search and annotation

[16] have raised new requirements for the face alignment: fully automatic, efficient, and robust to

facial images in under-controlled conditions [10].

A popular approach to address the facial alignment introduced probabilistic models to solve the

warping problem, starting from the seminal works of Cootes and Taylor with their Active Shape

Model (ASM) [3] and its extension, Active Appearance Model (AAM) [2]. ASM and AAM were the first

to use probabilistic methods. The particular characteristic that distinguishes

these methods, as well as later ones, from earlier work is the use of statistical models of

the face's appearance and geometry learned from the data provided. This means that these facial alignment

methods statistically model all sources of variation shown in the images, namely that of geometry

and appearance.

In general, algorithms for face alignment using ASM or its extensions encompass three major

areas:

Template representation. This can be a simple image patch or a more sophisticated model. It

includes the descriptors used for learning the key features of the face, which are described in

more detail in the following subsection. The main idea is to use two eigenspaces to model

the facial shape and shape-free appearance respectively.

Distance metric. The distance (e.g. the Mean Squared Error (MSE)) between the appearance instance

synthesized from the appearance eigenspace and the warped appearance from the image observation

is minimized by iteratively updating the shape and/or appearance parameters [7].

Optimization method. Gradient descent methods are commonly used to iteratively update the

shape parameters (e.g. Gauss-Newton, Levenberg-Marquardt).
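The interplay of the three components listed above can be illustrated with a toy problem. The sketch below is an assumption-laden example rather than any published fitting procedure: the "template" is a set of 2D landmark points, the distance metric is the MSE between transformed template points and target points, and the optimizer is a plain Gauss-Newton loop over a hypothetical rigid parameterization (rotation angle and translation). All function names are introduced for the illustration.

```python
import numpy as np

def transform(points, params):
    """Rotate the (n, 2) points by theta and translate by (tx, ty)."""
    theta, tx, ty = params
    c, s = np.cos(theta), np.sin(theta)
    rot = np.array([[c, -s], [s, c]])
    return points @ rot.T + np.array([tx, ty])

def residuals(params, template, target):
    """Stacked differences [dx0, dy0, dx1, dy1, ...]."""
    return (transform(template, params) - target).ravel()

def jacobian(params, template):
    """Analytic derivative of the residuals w.r.t. (theta, tx, ty)."""
    theta = params[0]
    c, s = np.cos(theta), np.sin(theta)
    d_rot = np.array([[-s, -c], [c, -s]])    # d(rotation matrix)/d(theta)
    jac = np.zeros((template.size, 3))
    jac[:, 0] = (template @ d_rot.T).ravel()
    jac[0::2, 1] = 1.0                       # x residuals depend on tx
    jac[1::2, 2] = 1.0                       # y residuals depend on ty
    return jac

def gauss_newton(template, target, params0, iterations=20):
    """Iteratively update the parameters to minimize the MSE."""
    params = np.asarray(params0, dtype=float)
    for _ in range(iterations):
        r = residuals(params, template, target)
        jac = jacobian(params, template)
        step, *_ = np.linalg.lstsq(jac, -r, rcond=None)
        params = params + step
    mse = np.mean(residuals(params, template, target) ** 2)
    return params, mse

template = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
true_params = np.array([0.3, 1.0, -2.0])
target = transform(template, true_params)
estimate, mse = gauss_newton(template, target, params0=[0.0, 0.0, 0.0])
```

Real face-alignment models use far richer templates and warps, but the loop structure (evaluate residual, linearize, solve for an update) is the same pattern that Gauss-Newton and Levenberg-Marquardt optimizers follow.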

2.1 Discriminative vs. Generative Models

Discriminative Models

Because the face-recognition step mentioned earlier is deterministic, most facial alignment

procedures use discriminative models (also called conditional models) to incorporate class-specific

knowledge. Within a probabilistic framework, this is done by modeling the conditional

probability distribution P(y|x), which can be used for predicting y, the facial appearance,

from x, the measured image data. In contrast with their generative counterparts,

discriminative models do not allow the generation of samples from the joint distribution of x

and y. However, for tasks such as classification and regression, which do not require the joint

distribution, they can yield better performance. In addition, most discriminative models are

inherently supervised and cannot easily be extended to unsupervised learning. Incorporating

prior knowledge in a principled way is also an issue.

Discriminative face alignment (DFA) methods construct a face-alignment model for each

person to be recognized. This is different from conventional face alignment, which concentrates

on general-purpose face alignment (GPFA). GPFA builds the model from faces of

numerous individuals (i.e. different from the people to be recognized in order to cover the

variance of all the faces). Thus it attains the ability of generalization at the cost of specialization.

Moreover, GPFA does not take into account the higher-level tasks. However, the

requirements of different tasks can be different (e.g. face recognition needs discriminative

features, while face animation requires accurate positions of key points). So the higher-level

task should be a priority for effective face alignment. As face recognition needs discriminative

features, it would be better that face alignment could also give discriminative features.

However, the goal of GPFA, as used in bottom-up approaches, is accurate localization. Therefore,

the performance of GPFA is not directly related to the performance of the face recognition

system. In contrast, DFA provides accurate localization, and thus good features, for the

person on which its model is built. If the person being recognized is not the person the

discriminative alignment model was built for, the model localizes poorly and extracts poor

features, which prevents that person from being recognized as the model's subject. Therefore,

DFA can provide discriminative features for face recognition, which makes it better suited

to that task than GPFA.

Generative Models

Generative models describe a joint probability distribution over observation and label sequences.

Generative models are used in machine learning for either modeling data directly,

or as an intermediate step to forming a conditional probability density function, and are

typically more flexible than discriminative models in expressing dependencies in complex

learning tasks.

Generative models describe full probabilistic models of all variables, whereas discriminative

models provide a pattern only for the target variable(s) conditioned on the observed data.

Thus a generative model can be used, for example, to generate values of any variable in the

model, whereas a discriminative model allows only sampling of the target variables conditional

on the observed quantities. While discriminative models do not need to model the

distribution of the observed variables, they cannot generally express complex relationships

between the observed and target variables. Also, they do not necessarily perform better than

generative models at classification and regression tasks.

Examples of generative models include Conditional Local Models (CLM), Congealing, Active

Appearance Model, as well as Batch Set Alignments.

If the observed data are truly sampled from the generative model, then fitting the parameters of the

generative model to maximize the data likelihood is a common method. However, because most

statistical models only approximate the true distribution, if the goal is to make inferences about a subset of

variables conditioned on known values of others, then it can be inferred that the approximation

makes more assumptions than are necessary to solve the problem at hand. In such cases, it can

be more accurate to model the conditional density functions directly using a discriminative model

(see above), although application-specific details will ultimately dictate which approach is most

suitable in any particular case.
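The relationship between the two model families can be made explicit with Bayes' rule. The notation below follows the text (x for the measured image data, y for the target variable); the summation assumes a discrete y and would become an integral for a continuous one.

```latex
P(y \mid x) \;=\; \frac{P(x, y)}{P(x)}
          \;=\; \frac{P(x \mid y)\, P(y)}{\sum_{y'} P(x \mid y')\, P(y')}
```

A generative model estimates the joint distribution P(x, y), for instance through P(x | y) and P(y), and obtains the posterior from the right-hand side. A discriminative model parameterizes the left-hand side directly and never needs P(x).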

2.2 Parametric vs Non-parametric Descriptors

2.2.1 Parametric Descriptors

Parametric descriptor learning yields excellent performance with high dimensionality, whereas the

non-parametric learning has a small number of dimensions, but with a slightly inferior performance.

Nevertheless, parametric descriptors are sometimes hard to use due to the high number of

parameter choices that are difficult to optimize by hand.

2.2.2 Non-Parametric Descriptors

Although non-parametric descriptors are desirable since they are low dimensional descriptors without

imposing a choice of parameters, they have two main drawbacks:

1. Firstly, they are less statistically powerful than the analogous parametric descriptors when the

data follows a Gaussian distribution. This means that there is a smaller probability that the

procedure will tell us that two variables are associated with each other when they in fact truly

are associated. For this reason, the key feature points need to be more numerous to have the

same strength as the corresponding parametric descriptor.

2. Secondly, the non-parametric descriptors are harder to analyze than their counterparts. Several

non-parametric tests use rankings of the values in the data, rather than using the actual

data.

To summarize, non-parametric descriptors are useful in many cases and necessary in some, but

they are not a perfect solution. Thus, the best approach is to combine these descriptors, by running

a stage of non-parametric dimensionality reduction after a stage of parametric learning [7].

[Table 1 about here.]

As observed from Table 1, all the parametric descriptors are local descriptors, dominantly

used in recognition and registration. In traditional vision tasks such as panoramic stitching and

structure from motion, they have widely replaced other methods due to their speed, robustness,

and the ability to work without initialization [?]. One downside of the local image descriptors

has been their high dimensionality (e.g. 128 dimensions for SIFT). However, recent techniques

combine the non-parametric and parametric descriptors (e.g. PCA-SIFT, which uses the principal

components of gradient patches to form local descriptors). An even better approach is to look

for projections that actively discriminate between classes, instead of just modelling the total input

data variance [?].

Ever since the introductory works of Cootes and Taylor on statistical shape models [6,7,8], major

advancements in the deformable face alignment field have dealt with improvements in accuracy,

computational complexity and generalisation. These optimization algorithms can be broadly categorised

into two groups:

Regression-based algorithms. Regression-based approaches learn a regression model that predicts

the location of facial features in an image, given a set of features extracted from it. Improvements

to the original regression approach include the use of more sophisticated feature representations

[17], averaging multiple predictions [19, 22], and leveraging successive regression

iterations [24]. The main advantage of the regression-based approach is its simplicity and

efficiency: we simply need to extract features from the image and then use the regression

model on them. The drawback of this approach is that the relationship between the image

features and the location of facial features in an image is often highly non-linear, requiring

high-capacity regression models that are difficult to train and often generalise

poorly.

Optimisation-based algorithms. Optimisation-based approaches design/learn an objective function that encodes

the degree of misalignment between the model and face in the image [23,18]. To

reduce sensitivity to local minima, we can either learn an objective function that exhibits

fewer local minima or simplify components of the problem such that the objective is amenable

to exact inference methods [1,]. The main problem with local minima is that they

often pull the solution away from the true configuration in the image.

3 Methods

Since facial alignment deals with warping and deformable surfaces, most of the algorithms are

non-rigid and can be separated into two categories:

• Holistic Methods

• Local Methods.

The most recent publications use holistic methods that operate on batches of input images simultaneously

[?], or employ methods that combine the output of local detectors with a non-parametric

set of shape models [13]. Other promising directions use patch-based representation and assume

image observations made for each landmark are conditionally independent [2, 3, 4, 5, 16]. This

leads to better generalization with limited data compared to holistic representations [10, 11, 14,

15], since it only accounts for local correlations between pixel values. However, the method suffers

from detection ambiguities as a direct result of its local representation. Thus, it seems that a

better technique should combine detection results from several local detectors in order to achieve

optimal results. In the following section, we highlight some key differences between holistic and

local methods.

4 Holistic Methods

4.1 Active Shape Model (ASM)

The majority of the previous research in face alignment is based on ASM, as well as its extensions

and variations [43,44,47,50], because of their elegant mathematical formulation and efficient computation.

As the pioneer of statistical modelling introduced by Cootes and Taylor in 1995 [], the

ASM infers the distribution of the target face shape and profile texture. A comprehensive survey

on this topic is presented by Cootes and Taylor [8,9] for a more in-depth understanding.

In the classical global ASM, the shape is holistically represented using a generative shape model, as

a set of pre-defined points, called landmarks, along the shape contour. The ASM detects these facial

landmarks through a local-based search constrained by a global shape model, statistically learned

from the training data. This leads to a shape vector s, concatenating all x and y coordinates of the

ordered landmark points:

$$s(p) = s_0 + \sum_{i=1}^{n} p_i s_i. \qquad (1)$$

This model is known as the Point Distribution Model (PDM) which has been broadly applied to

statistical deformable models to employ a linear approximation of how the shape of facial features

deforms. The PDM captures the shape parameters, which are iteratively updated by locally finding

the best nearby match for each landmark point. The landmarks, $x_i = (x_i, y_i)$, $i = 1, \dots, l$, are placed over

the selected facial feature points and face contours. $s_0$ defines the mean shape, while $s_i$ represents

the i-th shape basis. $p = [p_1, \dots, p_n]$ represents the shape parameter vector. The mean shape and

the shape basis are learned from an annotated training set via Principal Component Analysis (PCA)

after a normalization step using the Procrustes method. This model is both simple and efficient,

and has been shown to span the deformation space of objects such as the human face and organs

in medical image analysis [10,12]. We present the pseudo-code for the ASM method in Table ??,

and also provide a visualization of the method in Figure 3.
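The construction of the Point Distribution Model described above (Procrustes normalization followed by PCA) can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions, not the procedure of the cited works: the function names are hypothetical, the alignment is a basic iterated similarity Procrustes that does not guard against reflections or renormalize the mean scale, the shape vector is ordered as x1, y1, ..., xl, yl, and the training shapes are random synthetic data.

```python
import numpy as np

def align_to_reference(shape, reference):
    """Similarity-align an (l, 2) shape to a reference shape
    (translation, scale, and orthogonal Procrustes rotation)."""
    shape_c = shape - shape.mean(axis=0)
    ref_c = reference - reference.mean(axis=0)
    u, sing, vt = np.linalg.svd(shape_c.T @ ref_c)
    rotation = u @ vt                        # may include a reflection
    scale = sing.sum() / (shape_c ** 2).sum()
    return scale * shape_c @ rotation

def build_pdm(shapes, n_modes, n_align_iters=3):
    """Return the mean shape s0 and the first n_modes shape bases."""
    shapes = np.asarray(shapes, dtype=float)         # (num_shapes, l, 2)
    reference = shapes[0] - shapes[0].mean(axis=0)
    for _ in range(n_align_iters):
        aligned = np.array([align_to_reference(s, reference) for s in shapes])
        reference = aligned.mean(axis=0)
    data = aligned.reshape(len(shapes), -1)          # one shape vector per row
    s0 = data.mean(axis=0)
    # PCA via SVD of the centered data: rows of vt are the shape bases.
    _, _, vt = np.linalg.svd(data - s0, full_matrices=False)
    return s0, vt[:n_modes]

def synthesize(s0, basis, p):
    """Evaluate s(p) = s0 + sum_i p_i * s_i."""
    return s0 + p @ basis

def project(s0, basis, shape_vec):
    """Recover the shape parameters of an aligned shape vector."""
    return (shape_vec - s0) @ basis.T

# Synthetic training set: 10 noisy copies of a random 5-landmark shape.
rng = np.random.default_rng(0)
base = rng.normal(size=(5, 2))
shapes = [base + 0.1 * rng.normal(size=(5, 2)) for _ in range(10)]
s0, basis = build_pdm(shapes, n_modes=3)

p = np.array([0.5, -0.2, 0.1])
shape_vec = synthesize(s0, basis, p)
```

Because the bases are orthonormal, projecting a synthesized shape back onto them returns the parameters it was generated from, which is what makes the linear model convenient during fitting.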

[Figure 3 about here.]

Although PCA provides a significant compaction of the space of plausible facial shapes, the dimensionality

of the deformation basis that accounts for a major portion of variations is rarely sufficient

to eliminate enough spurious configurations. That is, the employment of the PCA prior over facial

shapes in the previous equation does not eliminate a sufficient number of local minima such that

local optimisation strategies converge to the global solution at acceptable rates.

Since the first introduction of the traditional ASM, numerous extensions have been proposed to

improve its robustness, performance and efficiency. For example, [8] employs mixtures of Gaussians

for representing the shape, [29] uses Kernel PCA and Support Vector Machine (SVM), and

models nonlinear shape changes by 3D rotation, [30] applies robust least squares algorithms to

match the shapes to observations, [11] uses more robust texture descriptors to replace the 1D profile

model and uses k-nearest-neighbor search for profile searching, and [12,33,34] rely on Bayesian

inference. Among those extensions to classical ASM, the recent work of Milborrow and Nicolls

[1] with the introduction of the 2D profile model and denser point set obtained promising results

through quantitative evaluation. The recent work [44] combines global and local models based on

Markov Random Fields (MRF) and an iterative fitting scheme, but this approach is mainly focused

on localizing a very sparse set of landmark points. Other notable recent works exploring alternatives

to ASM include [35]. In [35], a robust approach for facial feature localization is proposed by a

discriminative search approach combining component detectors with learned directional classifiers.

In [3], a generative model with shape regularization prior is learned and used for face alignment

to robustly deal with challenging cases such as expression, occlusion and noise. As an alternative

to the parametric model approaches, a principled optimization strategy with nonparametric

representations for deformable shapes is recently proposed in [43].

[Table 2 about here.]

4.2 Active Appearance Model (AAM)

AAM is probably the most well known extension of ASM and shares the statistical model of joint

variations in facial feature locations with its predecessor. The core difference between ASM and

AAM is in how the appearance of the face is modelled. In AAMs, the appearance of the whole face

is modelled jointly, whereas in ASM, each facial feature is modelled independently of all others.

As such, AAMs can potentially capture a more faithful representation of the underlying statistics

of facial appearance. However, in practice, due to the large space of appearance variability of

the face and its high dimensionality, a compact representation that generalises well can only be

afforded in highly restricted settings such as in the person-specific case [18, 34]. In contrast, ASM

has the advantages of being more accurate in point (contour) localization, less sensitive to lighting

variations and more efficient, hence it proves more suitable for applications requiring accurate

contour fitting.

[Figure 4 about here.]

Still, in the field of non-rigid alignments, AAM is extremely popular, with numerous recent works

[3, 6, 7, 15, 20] relying on its framework. The AAM algorithm elegantly combines shape and

texture model by learning generative statistical models and assumes a linear relationship between

appearance and pose variation. During a training phase, the AAM learns from labeled data the

statistical generative models for the shape of a face (represented by landmark positions), and for

the appearance of a face (represented by pixel intensities in the shape-normalized domain). Thus,

given a face database, each facial image is manually labelled with a set of 2D landmarks $[x_i, y_i]$,

$i = 1, \dots, n$. The set of feature keypoints is considered as a random process defined by the shape

model s, which concatenates all x and y coordinates of the ordered landmark points, just like in

the ASM case. Eigenanalysis is utilized to retrieve the shape model:

$$s(p) = s_0 + \sum_{i=1}^{n} p_i s_i. \qquad (2)$$

$s_0$ defines the mean shape, while $s_i$ represents the i-th shape basis. $p = [p_1, \dots, p_n]$ represents the

shape parameter vector. For the appearance model, a piece-wise affine warping function from the

model coordinate system to the coordinate in the image observation is defined as $W(x, y; p)$, where

$(x, y)$ is the coordinate of a pixel within the face region denoted $R(s_0)$ and defined by the mean

shape $s_0$:

$$W(x, y; p) = [1 \;\; x \;\; y] \, a(p). \qquad (3)$$

$a(p) = [a_1(p) \;\; a_2(p)]$ is a 3×2 affine transformation matrix unique to each triangle pair between $s_0$

and $s(p)$. Given the shape parameters p, the $a(p)$ matrix needs to be computed for each triangle.

Since the triangle that each pixel belongs to is known in advance, this assignment can be pre-computed. Let

I be the image observation and $I(W(x, y; p))$ the resulting warped image, an N-dimensional vector. The system can be

linearized to present the appearance model as:

$$A(x, y; \lambda) = A_0(x, y) + \sum_{i=1}^{m} \lambda_i A_i(x, y), \qquad (4)$$

where $A_0$ is the mean appearance, $A_i$ is the i-th appearance basis, and $\lambda = [\lambda_1, \lambda_2, \dots, \lambda_m]^{\top}$ are

the appearance parameters to be computed. Here, we can select the rectangular Haar-like features,

mainly because of their computational efficiency, which exploits the integral image representation

[], as well as due to their success in face-related applications []. Finally, to fit a specific shape over

a target face, we need to minimize the cost function used for model fitting:

$$J(p, \lambda) = \frac{1}{N} \sum_{(x, y) \in R(s_0)} \left\| I(W(x, y; p)) - A(x, y; \lambda) \right\|^2, \qquad (5)$$

which is the MSE between the warped observation $I(W(x, y; p))$ and the synthesized appearance

instance $A(x, y; \lambda)$, where N is the total number of pixels in $R(s_0)$ [36,38,39].
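To make the fitting cost concrete, the sketch below evaluates it for a warped observation that has already been sampled into a vector with one entry per pixel of the mean-shape region. It is an illustrative example only: the function names are hypothetical, the appearance basis is a random orthonormal one rather than a learned eigenspace, and the cost is taken as the mean of the squared per-pixel differences. With an orthonormal basis, the appearance parameters that minimize the cost for a fixed warp follow from a simple projection.

```python
import numpy as np

def appearance_instance(a0, basis, lam):
    """Synthesize A = A0 + sum_i lam_i * A_i (rows of basis are A_i)."""
    return a0 + lam @ basis

def fitting_cost(warped, a0, basis, lam):
    """Mean squared difference between the warped observation and
    the synthesized appearance instance."""
    diff = warped - appearance_instance(a0, basis, lam)
    return np.mean(diff ** 2)

def best_appearance_params(warped, a0, basis):
    """Cost-minimizing appearance parameters for a fixed warp,
    assuming the rows of basis are orthonormal."""
    return (warped - a0) @ basis.T

# Hypothetical model with 6 pixels and 2 appearance modes.
rng = np.random.default_rng(0)
n_pixels, n_modes = 6, 2
q, _ = np.linalg.qr(rng.normal(size=(n_pixels, n_modes)))
basis = q.T                                   # orthonormal rows
a0 = rng.normal(size=n_pixels)

lam_true = np.array([1.5, -0.5])
warped = appearance_instance(a0, basis, lam_true)   # perfectly explained by the model

lam_hat = best_appearance_params(warped, a0, basis)
cost_at_best = fitting_cost(warped, a0, basis, lam_hat)
cost_at_zero = fitting_cost(warped, a0, basis, np.zeros(n_modes))
```

In a full fitting loop the shape parameters change the warped observation itself, so the cost has to be re-evaluated after every shape update; only the appearance part admits this closed-form step.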

The overview of the AAM algorithm can be seen in Table 3. Compared to the feature-based tracking,

AAM can track faces more accurately and stably with little jitter. Nevertheless, AAM may

have difficulty generalizing to unseen images - the alignment tends to diverge on images that are

not included as the training data, especially when the model is used on a large dataset where it

can encounter performance degradation. In part, this is caused by the fact that the appearance

model only learns the appearance variation retained in the training data. When more training data

is used to model larger appearance variations, the representational power of the eigenspace is very

limited even under the cost of a much higher-dimensional appearance subspace, which in turn results

in a harder optimization problem. Also, using the MSE as the distance metric essentially

employs an analysis-by-synthesis approach, further limiting the generalization capability by the

representational power of the appearance model. Researchers have noticed this problem and proposed

methods to handle it. Jiao et al. [26] suggest using Gabor wavelet features to represent the

local appearance information. Hu et al. [45] utilize a wavelet network representation to replace

the eigenspace-based appearance model, and demonstrate improved alignment with respect to illumination

changes and occlusions. Another downside is that AAM is sensitive to the initial shape

and may easily get stuck in local minima because of its gradient descent optimization. Cluttered

backgrounds also reduce AAM stability in tracking a face outline. During the fitting phase, the

AAM is aligned in such a way that the data can be best explained, or reproduced by the model in

the least mean square error sense. In AAM, the appearance is modeled globally by PCA on the

mean shape coordinates (also called shape-normalized frame). The shape parameters are locally

searched using a linear regression function on the texture residual.
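The regression-based search mentioned above can be sketched as follows. This is a toy illustration under strong assumptions, not the training procedure of any cited method: the texture residual is simulated as an exactly linear function of the parameter offset (the matrix `true_map` is a hypothetical stand-in for a real model and image), and the update matrix `R` is obtained by ordinary least squares from randomly perturbed training offsets.

```python
import numpy as np

def learn_update_matrix(param_offsets, texture_residuals, rcond=1e-8):
    """Least-squares fit of R such that texture_residuals @ R
    approximates param_offsets."""
    R, *_ = np.linalg.lstsq(texture_residuals, param_offsets, rcond=rcond)
    return R

rng = np.random.default_rng(0)
n_params, n_pixels, n_samples = 3, 50, 200

# Simulated training data: known parameter offsets and the texture
# residuals they would produce under the assumed linear relationship.
true_map = rng.normal(size=(n_params, n_pixels))
offsets = rng.normal(size=(n_samples, n_params))
residuals = offsets @ true_map

R = learn_update_matrix(offsets, residuals)

# At fitting time, a newly observed residual is mapped to a predicted
# parameter offset, which is then used to correct the current estimate.
new_offset = np.array([0.1, -0.2, 0.05])
new_residual = new_offset @ true_map
predicted = new_residual @ R
```

On real images the relationship between residual and offset is only approximately linear and only near the optimum, which is one reason the search depends on a good initial shape.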

[Table 3 about here.]

Unlike the ASM which only deals with shapes, the appearance of the face is also considered. The

model combines constraints on both shape and texture by learning statistical generative models for

the shape of a face and the appearance of a face. Shape is represented by landmark positions, while

the appearance is represented by pixel intensities in the shape-free face image. The fitting of the

AAM is defined by solving a least mean square error (LMSE) problem, where the difference between the

warped image and the model appearance is minimized. Efficient optimization algorithms such as the

Inverse Compositional (IC) and Simultaneous Inverse Compositional (SIC) methods have been

proposed by Baker and Matthews [2], which enable fast face alignment for real-time applications.

However, the alignment performance degrades quickly when generic AAMs are trained instead

of person-specific AAMs [36,37]. The generalization issue is caused by generative appearance

modeling and the LMSE optimization schema. Other variations of AAM have also been proposed

to improve the original algorithm, namely view-based AAM [6], Direct Appearance Models [12],

a compositional approach [22] and 3D AAM. Despite the success of these methods, problems

still remain to be solved. For example, AAM is sensitive to illumination changes, especially if

the lighting in the test image significantly differs from the lighting encoded in the training set.

Moreover, under the presence of partial occlusion, the PCA-based texture model of AAM causes

the reconstruction error to be globally spread over the image, thus impairing alignment.

Both models described above require good initialization because most of their variants use gradient

descent or local search from a rough initial shape. The objective functions of these models might

have a large number of local minima in the high-dimensional solution space. Even using the

coarse-to-fine strategy, these approaches often get stuck in a local minimum if the initial shape is

far from the correct solution, especially on the facial images in real applications.

For this reason other variants of ASM or AAM have been proposed:

• 3D ASM face alignment [11,39]

• Boosted Appearance Model (BAM) [27]

• Boosted Regression [29]

• Active Wavelet Method [31]

The list is longer and could include Bayesian Tangent Shape Model (BTSM) [28] - in which a

Bayesian inference solution and expectation maximization (EM) method are used for estimating

a maximum a posteriori probability (MAP) estimate [32,36], Mixture or non-linear shape model [7,22] or

View-based AAM [], Direct Appearance Model (DAM) [12], Constrained Markov Network [25],

Inverse Compositional updating [19], but we are not going to address them in more depth in this

work, since they are not as efficient as the rest.

4.3 3D ASM

3D ASM is an extension of the classic 2D ASM. One of the downsides of the 2D model

is the difficulty of fitting it under large pose variations. This can happen for two reasons: either

the 2D texture around facial landmarks changes a lot when the head rotates amply, or the 2D shape

model constrains variation only in two dimensions, impeding pose estimation. The latter fact is also

related to the idea that at present there is no database which contains all head poses. Since the

algorithm is similar to the classical ASM, the pseudo-code was omitted.

[Figure 5 about here.]

3D ASM uses PDM and local texture model. These two models store global shape variations and

local texture prior knowledge of each landmark. The local texture is used to find the best matching

points while the PDM constrains global shape to avoid extensive variations. Although the shape model is

no different from the 2D variant, in the 3D case we are dealing with a covariance matrix, which

helps compute the shape parameters. However, the critical issue here is to achieve the extension

of PDM to three and more dimensions. This is done by point correspondence: the landmarks have

to be placed in a consistent way over a large database of training shapes, otherwise resulting in an

incorrect parameterization of the object class [49]. The general idea of this method is to align all

the images of the training set to a mean atlas, which can be seen in Figure 5. The transformations

are a concatenation of a global rigid registration with nine degrees of freedom (translation, rotation,

and anisotropic scaling) and a local transformation using non-rigid registration. After registration

of all samples to the mean shape, the transformations are inverted to propagate a topologically

fixed point set onto the atlas surface to the coordinate system of each training shape. While it

is still necessary to manually segment each training image, this technique reliefs from manual

landmark definition. Another issue arrises when having densely sampled data, when a 3Ddata

volume can be reconstructed, allowing for 3D updates in each model landmark. However, this is

not the case when dealing with a sparsely sampled dataset containing large undersampled regions.

In such situation, interpolation between sparse image slices with different orientations needs to be

computed- a non-trivial task. In void locations, no information can be extracted from the image

data to contribute to a new model instance. However, for the calculation of new model parameters,

updates for the complete landmark set are required: setting updates of zero displacement would

fixate the nodes to their current position, thus preventing proper model deformation.

4.3.1 Boosted Appearance Model (BAM)

In order to tackle the generalization problem of AAM, Liu proposed the Boosted Appearance Model
(BAM) [8], in which a shape representation similar to that of the AAM is used. However, the
appearance model is represented by a set of discriminative features (weak classifiers based on
Haar-like rectangular features [21, 25]) trained to form a boosted classifier. The discriminative
appearance model is able to distinguish between correct and incorrect alignment and can also
improve on the generalization capabilities of the AAM. BAM fitting can be done by iteratively
updating the landmark positions according to gradient ascent on the corresponding classifier score
function. An improvement to this model is explained in [7], where the Pseudo Census Transform (PCT)
feature is used for boosting to obtain a BAM that is more robust against illumination changes than
one using Haar features or modified census transform (MCT) features. The shape model of BAM is
identical to that of ASM and AAM, and is not discussed further. The appearance model, on the other
hand, is a collection of M features computed over the shape-free face image I(W(x; p)). To fit the
face model, we are interested in learning a score function F such that, when maximized with respect
to the parameters p, it returns the shape parameter corresponding to the correct alignment:

$$p^{*} = \arg\max_{p} F(p) \tag{6}$$

where p* is the shape parameter denoting the correct alignment. With this formulation, the
appearance model is actually a two-class classifier. In particular, a linear combination of several
PCT features defines the appearance model:

$$F(I(W(x; p))) = \sum_{m=1}^{M} f_m(I(W(x; p))). \tag{7}$$

The f_m are the weak classifiers, while F(p) is the strong classifier. Our weak classifier using the
PCT features is defined as follows:

where A_{mi} is the i-th template defined at the m-th position (r_m, c_m). Since the classifier
response f_m(p) must be continuous within -1 and 1, the atan() function is used to ensure both
discriminability and differentiability. S is a sigmoid function defined as
S(t) = 1 / (1 + e^{-\lambda t}), with \lambda as a scale parameter. The sigmoid function normalizes
the raw PCT feature values into the range (0, 1) before a linear projection. The projection vector
w_m and bias b_m are learned on the training data with linear support vector machines (SVM). The
individual SVM cost parameter C for each feature location is found by cross-validation. The
GentleBoost algorithm [4] is used to boost the weak classifiers, as suggested in [8], for two
reasons. First, it is a soft classifier with continuous output, which enables us to derive a
gradient ascent algorithm for maximizing the strong classifier function. Second, GentleBoost
outperforms other boosting algorithms since it is more robust against noisy data.
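The role of the soft weak classifier in fitting can be illustrated with a small sketch. It is not the implementation of [8]: it assumes a weak classifier of the form (2/pi) * atan(w * S(x) - t), a scalar shape parameter p, and a made-up raw feature -(p - c)^2 that peaks at a hypothetical position c. The names CENTERS, W and T are illustrative only. The sketch compares gradient ascent on a strong score built from soft weak classifiers with the same procedure on hard (sign) weak classifiers.

```python
import math

# Hypothetical toy setup: three weak classifiers whose raw feature peaks
# at slightly different values of a scalar shape parameter p.
CENTERS = [-0.2, 0.0, 0.2]
W, T = 4.0, 1.0  # assumed projection weight and threshold


def sigmoid(u, scale=1.0):
    """Sigmoid S(u) = 1 / (1 + exp(-scale * u))."""
    return 1.0 / (1.0 + math.exp(-scale * u))


def raw_feature(p, c):
    """Made-up raw feature value, largest when p equals c."""
    return -(p - c) ** 2


def soft_weak(p, c):
    """Soft weak classifier with continuous output in (-1, 1)."""
    return (2.0 / math.pi) * math.atan(W * sigmoid(raw_feature(p, c)) - T)


def hard_weak(p, c):
    """Hard weak classifier with output in {-1, +1}."""
    return 1.0 if W * sigmoid(raw_feature(p, c)) - T > 0 else -1.0


def strong(p, weak):
    """Strong score: sum of the weak classifier responses."""
    return sum(weak(p, c) for c in CENTERS)


def numeric_grad(f, p, h=1e-4):
    """Central-difference derivative of f at p."""
    return (f(p + h) - f(p - h)) / (2.0 * h)


def gradient_ascent(p0, weak, step=0.1, iters=300):
    """Plain gradient ascent on the strong score."""
    p = p0
    for _ in range(iters):
        p += step * numeric_grad(lambda q: strong(q, weak), p)
    return p


p_soft = gradient_ascent(1.0, soft_weak)  # moves to the score maximum near 0
p_hard = gradient_ascent(1.0, hard_weak)  # zero gradient, never moves
print(p_soft, p_hard)
```

In this toy setting the soft score has a usable gradient everywhere, whereas the hard score is piecewise constant, so the update from the same starting point is zero and the parameter does not move.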

[Table 4 about here.]

Different variants of boosting have been proposed in the literature [20]. We use the GentleBoost

algorithm [11] based on two considerations. First, unlike the commonly used AdaBoost algorithm

[10], the weak classifier in the GentleBoost algorithm is a soft classifier with continuous output.

This property allows the output of the strong classifier to be smoother and favorable as an alignment

metric. In contrast, the hard weak classifiers in the AdaBoost algorithm lead to a piecewise

constant strong classifier, which is difficult to optimize. Second, as shown in [17], for object detection

tasks, the GentleBoost algorithm outperforms other boosting methods in that it is more robust

to noisy data and more resistant to outliers. On the generic face alignment problem, the proposed
framework greatly improves the robustness, accuracy, and efficiency of alignment. However, there is
no guarantee that moving along the gradient of the score function will always improve alignment,
which is detrimental to its generalization capabilities. As seen from the pseudo-code, the basic
idea of BAM is optimization via maximizing a classification score. Similar ideas have been explored
in object tracking research

[1, 14, 26]. Avidan [1] estimates the 2D translation parameters by maximizing the Support Vector

Machine (SVM) classification score. Limitations of this method include dealing with partial occlusions

and the large number of support vectors which might be needed for tracking, burdening

both computation and storage. Williams et al. [26] build a displacement expert, which takes an

image as input and returns the displacement, by using Relevance Vector Machine (RVM). Since

RVM is basically a probabilistic SVM, it still suffers from the problem of requiring a large set of

support vectors. The recent work by Hidaka et al. [14] performs face tracking (2D translation only)

via maximizing the score from a Viola and Jones face detector [25], where a face versus non-face

classifier is trained.

4.3.2 Boosted Regression with Ranking Appearance Model (BRM)

Gradient boosting is a machine learning technique for regression problems, which produces a prediction

model formed by several weak prediction models, typically represented by decision trees.

It builds the model in a stage-wise fashion like other boosting methods do, and it generalizes them

by allowing optimization of an arbitrary differentiable loss function. Gradient boosting method

can also be used for classification problems by reducing them to regression with a suitable loss

function. Most solutions tend to build discriminative appearance models to replace the generative

model. For example, Tresadern et al. [18] learn a discriminative appearance model via boosted regression.
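The stage-wise procedure described above can be sketched in a few lines. This is a minimal illustration, assuming a squared-error loss (whose negative gradient is simply the residual), one-dimensional inputs, and single-threshold regression stumps as weak learners. The function names, the learning rate and the toy sine data are assumptions made for the example and do not come from the cited works.

```python
import math


def fit_stump(xs, residuals):
    """Least-squares regression stump (one threshold) on 1-D inputs."""
    best = None
    for thr in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= thr]
        right = [r for x, r in zip(xs, residuals) if x > thr]
        if not right:  # threshold at the largest x gives no split
            continue
        lmean = sum(left) / len(left)
        rmean = sum(right) / len(right)
        err = (sum((r - lmean) ** 2 for r in left)
               + sum((r - rmean) ** 2 for r in right))
        if best is None or err < best[0]:
            best = (err, thr, lmean, rmean)
    _, thr, lmean, rmean = best
    return lambda x: lmean if x <= thr else rmean


def gradient_boost(xs, ys, n_stages=100, lr=0.3):
    """Stage-wise additive model fitted to the negative loss gradient."""
    f0 = sum(ys) / len(ys)  # initial constant model
    stumps = []
    preds = [f0] * len(xs)
    for _ in range(n_stages):
        # Negative gradient of 0.5 * (y - F(x))^2 with respect to F(x).
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + lr * stump(x) for p, x in zip(preds, xs)]
    return lambda x: f0 + lr * sum(s(x) for s in stumps)


def mse(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


xs = [i / 39 for i in range(40)]
ys = [math.sin(2 * math.pi * x) for x in xs]
mean_y = sum(ys) / len(ys)

baseline_mse = mse(lambda x: mean_y, xs, ys)
final_mse = mse(gradient_boost(xs, ys), xs, ys)
print(baseline_mse, final_mse)
```

Replacing the residual with the negative gradient of another differentiable loss is what lets the same loop cover classification or ranking losses.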

As for the ranking problem, the goal is to learn an ordering or ranking over objects. The wide
variety of applications in which ranking is required includes information retrieval, collaborative
filtering, computational biology, econometrics, and the social sciences. As relevant references, we
mention [16], which proposes an SVM-based ranking method to improve search engines, and [12], which
proposes RankBoost for collaborative filtering. RankBoost [12] is an algorithm that, from
information about the relative ranking of individual pairs of instances, learns a ranking function
by combining a number of weak ranking functions selected in a greedy fashion. Roughly speaking, the
algorithm is a direct extension of AdaBoost [13] in that the ranking function is the byproduct of
learning a classifier, which says whether a pair of instances appears to be ranked in ascending or
descending order. This process therefore minimizes the weighted number of incorrectly ranked pairs,
whereas AdaBoost minimizes the weighted number of misclassifications. One further point regarding
the practical applicability of boosting is worth mentioning: large parts of the early boosting
literature persistently contained the misconception that boosting would not overfit even when run
for a large number of iterations. Simulations by [31, 35] on data sets with higher noise content
clearly showed overfitting effects, which can only be avoided by regularizing boosting so as to
limit the complexity of the function class. In [33] RankBoost learning is used

to provide a relative similarity measure between a given shape and a reference shape. This can be

used to rank a fixed number of predefined warpings of an image, and then combine the first few top

ranked to perform shape detection. In contrast, the generic boosted algorithm is trained to model

the shape variability, which we then utilize to optimize the learned alignment score function.

[Table 5 about here.]

BRM has similar representations for shape and appearance as the BAM, but arises from a very
different formulation of the alignment problem. In particular, it learns from data an alignment
score function that is concave within the neighborhood of the correct alignment. This ensures that
updating the alignment of the BRM via gradient ascent always deforms the model towards the correct
fitting. The overall concept is that BRM learns a classifier that is able to say whether or not
switching from one alignment to another moves the BRM closer to the correct solution. In order to
learn the weak classifiers, a boosting procedure called GentleBoost is employed. Compared to
AdaBoost, it is numerically more robust, has been shown experimentally to have better convergence
properties, and performs well on several face-related applications [22,24]. Once the classifier is
trained, it indicates which of two paired images warped from different landmarks corresponds to a
better alignment. This naturally leads to the creation of a positive training set and a negative
training set with the same cardinality, making the learning problem balanced. The particular
structure of the resulting classifier allows the original problem to be mapped to a ranking problem,
because it implies learning a function (here, the alignment score function) that can be interpreted
as a ranking function. We propose to learn the alignment score function by extending the use of
GentleBoost to ranking, and show experimentally that it converges faster and performs better
alignment ranking than other approaches, such as RankBoost [24]. The overview of the algorithm is
shown in Table 5. Here, a ranking model is considered well suited for learning an objective function
free of local maxima. Ideally, the model returns a higher value if the corresponding shape parameter
is closer to the ground truth than the other one:

$$F(p_2) > F(p_1) \iff p_2 \succ p_1 \tag{8}$$

where p_2 ≻ p_1 means that p_2 is superior to p_1, i.e. ‖p_2 − p_0‖ < ‖p_1 − p_0‖, and p_0
corresponds to the shape parameter of the ground truth. This ensures that the learned ranking model
is a unimodal objective function with its maximum located exactly at p_0, since it is concave.
GentleBoost is applied for boosting weak rankers as in [19]. Equation 8 suggests that learning the
ranking function F can be formulated as a classification problem. More precisely, we define a
classifier H(p_1, p_2) = sign[F(p_2) − F(p_1)], so that H(p_1, p_2) = +1 if p_2 ≻ p_1 and
H(p_1, p_2) = −1 otherwise. The classifier H helps decide whether or not switching from p_1 to p_2
constitutes an alignment improvement. In the boosting framework, we assume H to be an additive
model:

$$H = \sum_{m=1}^{M} h_m(p_1, p_2) \tag{9}$$

where h_m(p_1, p_2) = f_m(p_2) − f_m(p_1), and f_m is the m-th weak ranking function, defined as:

$$f_m(p) = \frac{2}{\pi} \arctan\left(w_m' S(\phi_m) - t_m\right) \tag{10}$$

The atan function is used to ensure both discriminability and differentiability. S is a sigmoid
function, just as in the BAM, which normalizes the raw PCT feature values into the range (0, 1)
before the linear projection defined by a projection vector w_m learned with RankSVM [25]. The
threshold t_m needs to be determined during boosting. The strong ranking function is again assumed
to be an additive model and needs to be maximized. To learn the strong ranking function F, we sample
ordered pairs from a training dataset containing D facial images with annotated landmarks. For each
training image, we randomly perturb the ground truth p_i in U different directions
{Δp_i^u}, u = 1, ..., U. In each direction, V shape parameters {p_i + v Δp_i^u}, v = 1, ..., V, are
evenly sampled. For each direction, V ordinal adjacent pairs are selected using the samples
including the ground truth; these pairs are denoted x_l, with corresponding labels z_l = ±1.
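The sampling scheme just described can be sketched as follows for a single training image. This is an illustrative reading of the text, not the procedure of the cited works: the function name sample_ordinal_pairs, the step length, the use of unit-norm Gaussian directions, and the choice to emit each adjacent pair in both orders (so that positive and negative labels are balanced) are all assumptions.

```python
import random


def distance(a, b):
    """Euclidean distance between two shape-parameter vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def sample_ordinal_pairs(p_gt, U, V, step=0.1, seed=0):
    """Sample labelled pairs ((p1, p2), z) around the ground truth p_gt.

    z = +1 when p2 is closer to the ground truth than p1, else z = -1.
    """
    rng = random.Random(seed)
    pairs = []
    for _ in range(U):
        # Random perturbation direction, scaled to a fixed step length.
        direction = [rng.gauss(0.0, 1.0) for _ in p_gt]
        norm = sum(d * d for d in direction) ** 0.5
        delta = [step * d / norm for d in direction]
        # v = 0 is the ground truth itself; v = 1..V move evenly away from it.
        samples = [[p + v * d for p, d in zip(p_gt, delta)]
                   for v in range(V + 1)]
        for v in range(V):
            closer, farther = samples[v], samples[v + 1]
            pairs.append(((farther, closer), +1))  # switching improves
            pairs.append(((closer, farther), -1))  # switching degrades
    return pairs


ground_truth = [0.0, 0.0, 0.0]
pairs = sample_ordinal_pairs(ground_truth, U=4, V=5)
print(len(pairs))
```

Each direction contributes V adjacent pairs, so with both orderings the toy call above yields 2 * U * V labelled pairs.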

In [19, 21], RAM is investigated by boosting the score function in a pairwise ordinal classification
manner. This model ensures that the score function returns a higher value if the current alignment
is closer to the ground truth than the others in the shape parameter space. A local optimizer
benefits from such a model, as the gradient of the learned score function is constrained to point in
the same direction, towards the ground truth. The RAM is constructed in a boosting manner, just like
the BAM. However, experiments show that the PCT-based RAM is more robust and generalizes better than
the PCT-based BAM. Another RAM model is proposed, in which the ranking problem is formulated with
regression trees. Gradient boosted regression trees (GBRT) are used to learn the appearance model,
as they achieve top results in the domain of web-search ranking [12]. To overcome the drawbacks of
GBRT, Random Forests (RF) are trained and their outputs are used as the initial estimate for GBRT
learning. In the third model, both the PCT features and the MCT features are used for appearance
representation, as a derivative-free local optimizer is applied for face alignment. Experimental
results show that the regression-tree-based RAM achieves better results than the pairwise ordinal
classification model. The initialization step for GBRT learning results in a very robust face
alignment, which improves the performance.

4.4 Active Wavelet Network (AWN)

Many applications of signal processing entail detecting, extracting and classifying specific
elements from high-dimensional data. Classical tools for signal processing such as the Fourier
Transform (FT) have interesting properties for emphasising frequency features. However, because of
the way they are defined, they are unable to distinguish stationary signals from signals that vary
over time. The wavelet transform was specifically conceived for evaluating time-varying frequency
information, and it is especially suitable for highlighting time-frequency properties in that it
decomposes functions over test functions which have compact support and minimal spread in the
time-frequency domain. In particular, the Active Wavelet Network (AWN) [9] approach was recently
proposed for automatic face alignment, showing advantages over AAM, such as greater robustness
against partial occlusions and illumination changes. AWNs model the face texture with a wavelet
network, as an alternative to the Principal Component Analysis used in the standard AAM.

To further improve this method, certain researchers [32,44] use Gabor filters, which are recognized
as good feature detectors and provide the best trade-off between spatial and frequency resolution
[14]. A Gabor wavelet network (GWN) for a given image consists of a set of n wavelets and a set of
associated weights w_k, specifically chosen so that the GWN reconstruction best approximates the
target image. The advantage of this approach is that one can trade off computational effort against
representational accuracy by increasing or decreasing the number n of wavelets. The GWN method
represents a face image through a linear combination of 2D Gabor functions whose parameters
(position, scale and orientation) and weights are optimally determined to preserve the maximum image
information for a chosen number of wavelets. Texture parameters from AAM can be replaced by the
wavelet coefficients, which are obtained by projecting the image into the learned wavelet subspace.
For an orthogonal wavelet basis, these coefficients may be calculated by simple inner products of
the image with each wavelet function, which guarantees an optimal image reconstruction in the
least-squares sense. However, Gabor functions are not orthogonal, so the texture parameters cannot
be computed by inner products of the image with the wavelet functions. In this case, a family of
dual wavelets needs to be considered to obtain the set of coefficients for an optimal image
reconstruction [44].
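The difference between plain inner products and dual-wavelet coefficients can be seen in a small one-dimensional sketch. It assumes three overlapping real Gabor-like functions with made-up parameters and a signal synthesized from known weights. Taking inner products with the dual wavelets is equivalent to solving the Gram system G w = b, where G holds the inner products between wavelets and b the inner products with the signal. The helper names and the small Gaussian-elimination solver are illustrative, not taken from [44].

```python
import math


def gabor(x, center, scale, freq):
    """Real 1-D Gabor-like function: Gaussian envelope times a cosine."""
    u = scale * (x - center)
    return math.exp(-0.5 * u * u) * math.cos(freq * u)


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x


# Hypothetical wavelet parameters: (center, scale, frequency).
params = [(-1.0, 1.0, 2.0), (0.0, 1.0, 2.0), (1.0, 1.0, 2.0)]
xs = [i / 10 for i in range(-50, 51)]
basis = [[gabor(x, *p) for x in xs] for p in params]

true_w = [0.5, -1.0, 2.0]
signal = [sum(w * g[i] for w, g in zip(true_w, basis)) for i in range(len(xs))]

# Inner products alone: only correct for an orthonormal basis.
naive_w = [dot(signal, g) for g in basis]

# Dual-wavelet coefficients: solve the Gram system G w = b.
gram = [[dot(gi, gj) for gj in basis] for gi in basis]
dual_w = solve(gram, naive_w)


def recon_error(weights):
    recon = [sum(w * g[i] for w, g in zip(weights, basis))
             for i in range(len(xs))]
    return sum((s - r) ** 2 for s, r in zip(signal, recon))


print(dual_w, recon_error(naive_w), recon_error(dual_w))
```

In this toy case the Gram-based weights recover the synthesis weights, while the raw inner products give a much larger reconstruction error because the wavelets overlap.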

5 Local Methods

5.1 Constrained Local Models (CLM)

Modelling the local appearance of facial features independently of the rest of the face often exhibits

better generalisation properties than holistic representations largely because the dimensionality of

the data is much lower and some degree of invariance towards lighting variation can be obtained by

a simple power normalisation [48]. One of the main problems with patch based methods is that the

appearance of facial features can vary greatly between people, pose, lighting and expression. Thus,

to account for these variations, it may be necessary to use high-capacity models, which tend to have

poorer generalisation and higher evaluation costs. The second, and perhaps more pressing, issue is
the aperture problem: the local appearance of some facial features is inherently ambiguous. By
observing only a small patch around a facial feature, it can be very difficult to pinpoint where
within an image the feature is located, as many locations can share a similar local appearance. This
is complicated further when one has to account for inter-personal variability. As such, even the use
of highly sophisticated models that account for inter-personal variations may not help resolve these
ambiguities. In recent years, an approach that utilizes an ensemble of local detectors (see
[2, 3, 4, 5, 16]) has attracted some interest, as it circumvents many of the drawbacks of holistic
approaches, such as modeling complexity and sensitivity to lighting changes.

Constrained Local Models (CLMs), developed by Cristinacce and Cootes [9], represent a face as a
combination of shape and local feature templates [9, 14, 16, 19] and are able to overcome many of
the problems inherent in holistic methods. For example, CLMs have inherent computational advantages
(e.g., opportunities for parallelization) [14], as well as reduced modeling complexity and
sensitivity to illumination changes [16, 19]. CLMs also generalize well to new face images, and can
be made robust against other confounding effects such as reflectance, image blur, and occlusion [9].
CLMs model the appearance of the face locally via an ensemble of region experts, or local detectors.
A variety of local detectors have been proposed for this purpose, including those based on
discriminative classifiers that operate on local image patches [19], or feature descriptors such as
SIFT [13]. These local detectors generate a likelihood map around the current estimate of each
landmark location. The likelihood maps are then combined with an overall face shape model to jointly
recover the location of landmarks. Like AAMs, CLMs typically model non-rigid shape variation
linearly.

All instantiations of CLMs can be considered to be pursuing the same two goals:

• perform an exhaustive local search for each PDM landmark around their current estimate using some kind of feature detector

• optimize the PDM parameters such that the detection responses over all of its landmarks are jointly maximized.
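The two goals above can be sketched with a toy model. The sketch assumes three 2D landmarks, a PDM with a single unit-length shape mode, and synthetic quadratic response functions standing in for real feature detectors. The joint optimization is approximated in the simplest possible way, by projecting the per-landmark response peaks onto the PDM and clamping the parameter. All names and values (MEAN, MODE, clm_fit and so on) are hypothetical.

```python
# Toy PDM: mean shape plus one mode that moves the first two landmarks apart.
MEAN = [(10, 10), (20, 10), (15, 20)]
_RAW_MODE = [(-1, 0), (1, 0), (0, 0)]
_NORM = sum(dx * dx + dy * dy for dx, dy in _RAW_MODE) ** 0.5
MODE = [(dx / _NORM, dy / _NORM) for dx, dy in _RAW_MODE]  # unit length


def pdm_shape(b):
    """Landmark positions generated by shape parameter b."""
    return [(mx + b * dx, my + b * dy)
            for (mx, my), (dx, dy) in zip(MEAN, MODE)]


def local_search(response, current, radius=3):
    """Step 1: exhaustive search in a window around the current estimate."""
    cx, cy = round(current[0]), round(current[1])
    candidates = [(cx + dx, cy + dy)
                  for dx in range(-radius, radius + 1)
                  for dy in range(-radius, radius + 1)]
    return max(candidates, key=response)


def fit_pdm(peaks, b_max=5.0):
    """Step 2 (simplified): project the peaks onto the mode and clamp b."""
    b = sum((px - mx) * dx + (py - my) * dy
            for (px, py), (mx, my), (dx, dy) in zip(peaks, MEAN, MODE))
    return max(-b_max, min(b_max, b))


def clm_fit(responses, iters=5):
    """Alternate local search and PDM fitting for a few iterations."""
    b = 0.0
    for _ in range(iters):
        shape = pdm_shape(b)
        peaks = [local_search(r, pt) for r, pt in zip(responses, shape)]
        b = fit_pdm(peaks)
    return b, pdm_shape(b)


def make_detector(target):
    """Synthetic response map that peaks at the target location."""
    tx, ty = target
    return lambda pt: -((pt[0] - tx) ** 2 + (pt[1] - ty) ** 2)


TRUE = [(8, 10), (22, 10), (15, 20)]  # first two landmarks 2 px farther apart
responses = [make_detector(t) for t in TRUE]
b_hat, shape_hat = clm_fit(responses)
print(b_hat, shape_hat)
```

Actual CLM variants differ mainly in how the second step combines the full response maps rather than only their peaks. The projection used here is the crudest such choice.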

[Figure 6 about here.]

A distinguishing feature of CLM, compared to other approaches, is that the likelihood of a

particular location in the image corresponding to a particular facial feature is assumed to be

conditionally independent of all other facial features. Although this assumption may appear

restrictive, the advantage of this parameterisation is two-fold. First, modelling each facial

part independently of all others is far easier; it requires less data to generalise well compared

to holistic representations that model the appearance of all parts jointly together. The second

advantage of this form is that the likelihood of each part over all locations in the image can

be computed independently, admitting an efficient implementation through parallelisation.

Having a lookup-table of the likelihood of part locations in the image is also advantageous

as it allows posing alignment as a generic graph inference problem [16,17].

5.2 Congealing or Joint Face Alignment

As digital cameras become cheaper and more ubiquitous, and visual media-sharing websites

like Flickr, Picasa, Facebook, and YouTube become more popular, it is increasingly convenient

to perform face alignment on batches of images for other applications such as face

image retrieval [17] and digital media management and exploration [11].

The joint face alignment problem was initially studied through Learned-Miller's influential
congealing procedure, which registers a batch of images by minimizing the entropy of each column of
pixels through the image set [4]. Congealing has been shown to work well on simple images, such as
binary handwritten digits. Later on, many efforts were devoted to making congealing more robust to
complex real-world faces [5,6], including exploring more robust features for the congealing
algorithm instead of raw pixels [5] and using least-squares constraints to estimate warping
parameters [6]. Besides these congealing-style methods, Robust Alignment by Sparse and Low-rank
Decomposition (RASL) was proposed to make face alignment more robust in the case of occlusions and
large lighting changes by minimizing the rank of the image ensemble [7]. However, because they
ignore non-rigid transformations, the above methods cannot be applied to scenarios which need more
accurate face alignment, such as face swapping [8] and face animation [9]. Moreover, human face
information, such as the shape and appearance models of the faces, is not exploited to remove
outliers produced in the process of joint face alignment.
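The entropy criterion behind congealing can be illustrated on a deliberately small example. The sketch assumes binary one-dimensional "images" and restricts the transformations to circular shifts, whereas the original procedure works on 2D images with richer warps. It sums the binary entropy of every pixel column through the stack and then lets each image, in turn, pick the shift that lowers that sum. The function names and the toy template are assumptions for illustration.

```python
import math


def binary_entropy(p):
    """Entropy in bits of a binary pixel that is 1 with probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)


def shift(image, s):
    """Circularly shift a 1-D image by s pixels to the right."""
    n = len(image)
    return [image[(i - s) % n] for i in range(n)]


def stack_entropy(images):
    """Sum of per-pixel entropies through the image stack."""
    count = len(images)
    return sum(binary_entropy(sum(col) / count) for col in zip(*images))


def congeal(images, candidates=(-1, 1, -2, 2), sweeps=3, tol=1e-9):
    """Coordinate descent: each image in turn takes the best small shift."""
    aligned = [img[:] for img in images]
    for _ in range(sweeps):
        for k in range(len(aligned)):
            best_shift, best_cost = 0, stack_entropy(aligned)
            for s in candidates:
                trial = aligned[:k] + [shift(aligned[k], s)] + aligned[k + 1:]
                cost = stack_entropy(trial)
                if cost < best_cost - tol:  # accept only real improvements
                    best_shift, best_cost = s, cost
            aligned[k] = shift(aligned[k], best_shift)
    return aligned


template = [0, 0, 0, 1, 1, 1, 0, 0]
images = [shift(template, s) for s in (0, 1, -1)]  # misaligned copies

initial_cost = stack_entropy(images)
aligned = congeal(images)
final_cost = stack_entropy(aligned)
print(initial_cost, final_cost)
```

When the copies line up, every pixel column is constant and the summed entropy drops to zero, which is the state the procedure searches for.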

Unlike typical supervised image alignment methods [6, 23, 10, 19], which often require manual
labeling, the unsupervised approaches only assume that the parameterization of the alignment is
known and that the input images have a similar appearance, making them more flexible and practical
for some real applications. Least squares congealing [7] treats the entire facial region at once. It
assumes that an affine transformation suffices for the alignment. Such an approach can only deal
with frontal faces with a neutral expression. Moreover, the method has a high computational cost and
can therefore only handle a few images at a time. To better deal with pose variations, one can split
the face into several patches and apply image alignment to each patch separately. Such an approach
[1] improves recognition performance under pose variations, but discards the consistency between
neighboring patches and still assumes a rigid transformation per patch. Splitting a face into
disconnected patches also precludes the synthesis of photo-realistic, rectified facial images. The
result of congealing can be observed in Figure 7, where multiple images were aligned.

[Figure 7 about here.]

To align facial images without any supervision, earlier methods focused on estimating a set of
aligned basis images to account for spatial variations. Most of this work is again closely related
to subspace learning, such as PCA and transformed component analysis [9]. Recently, a new strand of
fully unsupervised face alignment from exemplars has been introduced. The seminal congealing method
[15, 11] employed a sum-of-entropies cost function and a sequential algorithm to find the
transformation parameters. Later, Cox et al. [7] extended this approach by introducing a
sum-of-squared-errors cost function. The alignment was formulated under the Lucas-Kanade framework
[17], and the optimization can be solved iteratively by the Gauss-Newton method. However, their
approach only handles frontal face alignment, using 2D affine transformations. Only a small set of
facial images with neutral expressions and uniform lighting were evaluated. Our work is also related
to viewpoint-invariant face recognition. Kanade and Yamada [14] developed a probabilistic model of
how face appearance changes with viewpoint, in which the facial images are separated into a set of
independent patches. Later, Lucey and Chen [18] proposed a joint distribution model of individual
patches to deal with the misalignment problem. Ashraf et al. [1] added a stack-flow algorithm to
find the spatial deformations through the transformation between the separate patches. Although it
gains great improvements in face recognition under viewpoint changes, this method assumes that the
face is aligned and the head pose is known. In addition, the input facial images must be grouped by
their poses at the start.

Intuitively, their approach aims to simultaneously

• identify the person-specific appearance space of the input images, which is assumed to be linear and low-dimensional

• align the person-specific appearance space with the generic appearance space, which are assumed to be proximate rather than distant

This joint approach was shown to produce excellent results on a wide variety of real-world images
from the internet. However, the approach breaks down under several common conditions, including
significant occlusion or shadow, image degradation, and outliers.

5.3 Set Alignment (Manifold Alignment)

To align an image set with the reference set, we further formulate the problem as a quadratic
program. It integrates three constraints to guarantee robust alignment: an appearance matching cost
term exploiting principal angles, geometric structure consistency using affine-invariant
reconstruction weights, and a smoothness constraint preserving local neighborhood relationships.
Recently, there has been increasing interest in Video-based Face Recognition (VFR) [1,38], because
video cameras are commonly available and provide more information than still cameras. In the case of
VFR, both the gallery and the query sets are video sequences rather than still images, so the VFR
problem can be converted into measuring the similarity between two video sequences. Intuitively, one
could build an appearance-based system by choosing a subset of representative frames (so-called
key-frames or exemplars) from a video sequence as models and then performing still-image-based
recognition. Obviously, such an approach does not fully utilize spatiotemporal information. To make
use of it, some techniques have been developed, for instance using the Hidden Markov Model (HMM)
[1,2]. However, approaches based on temporal models have not yet shown their full potential, as they
also suffer from some drawbacks, such as only using global features while ignoring local
information, and the lack of discriminability between facial dynamics.

To avoid the above bias originating from clustering, alignment between two image sets is a possible
solution. One scheme is to align the test image set to each gallery image set in turn and then
compare them directly. This can be seen in Figure 8. However, such a strategy is impractical in VFR
for two reasons: (1) in many cases, the query image set corresponds only partially, not wholly, with
the gallery image set, which makes alignment difficult; (2) it is too time-consuming to align the
query set with each of the gallery sets online. To address the above issues, a reference set is
introduced to bridge the query set and the gallery set. Furthermore, to obtain more discriminant
features, multiple linear transformations can be learned from corresponding local models, which are
structured by aligning all gallery image sets with the pre-partitioned reference set.

[Figure 8 about here.]

6 Future Directions

Despite substantial progress in all aspects of the ASM/AAM paradigm, the problem of realtime

deformable face alignment in the presence of pose, identity, expression and lighting

variations as well as image noise, resolution and partial occlusions, remains unresolved.

Nonetheless, the space of variations that state-of-the-art methods can handle is such that

their application in real-world settings has started to be realised. There are several future

direc
