Towards modelling the Semantics of Natural Human Body Movement
Hayley Hung
http://www.dcs.qmul.ac.uk/~hswh/report/
A method of feature extraction using temporal frame differencing and motion moments provides a trajectory-based matching framework. In this case, six different gestures are performed by nine subjects and matched against data from six subjects that were used to train the model. Tests are carried out to ascertain the overall performance of the condensation algorithm by varying parameters within the algorithm. Results have been analysed based on the performance of each gesture and subject. From this experiment, extrapolations to further studies in the area of human behaviour analysis have been made.
• Human body movement can be categorised into:
– Gait or posture is usually an unconscious form of body movement, which can be observed when a person is walking.
– Actions are usually body movements that consciously interact with objects.
– Gesture is a subconscious communicative form, which aids the ability of a person to communicate.
– Sign language is a conscious form of communicative language between people.
• Temporal information about a gesture is important since it indicates where a gesture begins and ends.
• Context can come from preceding and/or following gestures, but also from interaction with objects or other people in an environment.
• This report concentrates on the application of the condensation algorithm for gesture recognition.
• The algorithm performs an exhaustive search of the search space and is therefore more likely to provide the best global match between an observed gesture and one in the training set.
• It is able to propagate multiple probable states, hence allowing for ambiguity in the performance of a gesture.
• This method is advantageous in that the training data is not processed before it is used for inference.
• Inference occurs through Dynamic Time Warping (DTW) of a set of training data over a particular time interval. These warped trajectories are matched against an observed gesture for the most likely classification.
• The heuristic of this search algorithm is the conditional density of a particular gesture over a large number of samples.
• A basic feature extraction technique has been used to represent the motions of the gestures: temporal frame differencing.
• Due to the limitations of this technique, all test subjects are required to perform the gestures facing the camera. The algorithm has been implemented using MATLAB.
• They are also asked to try to minimise movement from the rest of the body as the gestures are performed with either one or both hands.
• In this application, 6 gestures have been taken from 18 different subjects, each sitting facing the camera.
Feature extraction
• The simplest form of feature extraction, through motion moments, was used.
• Motion moments are obtained using two-frame temporal differencing. This involves taking the difference between consecutive frames and creating a binary image of the pixels that have changed.
• FEATURES (a sketch of their computation follows this list):
– Motion area: the number of moving pixels in the binary difference image.
– Centroid coordinates: the mean position of the moving pixels.
– Displacement of centroid coordinates: the frame-to-frame change in the centroid position.
– Elongation: the ratio of the major to minor axes of the moving region.
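The report gives no source code (the implementation was in MATLAB), but the differencing and moment computation it describes can be sketched. The following Python/NumPy fragment is an illustrative reconstruction, not the author's code; the `threshold` parameter and the principal-axis definition of elongation are assumptions:

```python
import numpy as np

def motion_features(prev_frame, frame, threshold=25):
    """Two-frame temporal differencing followed by motion moments.

    prev_frame, frame: consecutive greyscale frames as 2-D uint8 arrays.
    Returns (area, cx, cy, elongation); the centroid displacement feature
    is the frame-to-frame difference of (cx, cy).
    """
    # Binary image of pixels whose intensity changed by more than the threshold.
    diff = np.abs(frame.astype(np.int16) - prev_frame.astype(np.int16))
    binary = diff > threshold

    # Motion area: the number of moving pixels (zeroth-order moment).
    area = int(binary.sum())
    if area == 0:
        return 0, 0.0, 0.0, 0.0

    # Centroid: mean position of the moving pixels (first-order moments / area).
    ys, xs = np.nonzero(binary)
    cx, cy = xs.mean(), ys.mean()

    # Elongation: ratio of the principal-axis lengths of the moving region,
    # from the eigenvalues of the second-order central moment matrix.
    mu20 = ((xs - cx) ** 2).mean()
    mu02 = ((ys - cy) ** 2).mean()
    mu11 = ((xs - cx) * (ys - cy)).mean()
    spread = np.sqrt(4 * mu11 ** 2 + (mu20 - mu02) ** 2)
    lam_major = (mu20 + mu02 + spread) / 2
    lam_minor = (mu20 + mu02 - spread) / 2
    elongation = np.sqrt(lam_major / lam_minor) if lam_minor > 0 else 0.0

    return area, float(cx), float(cy), float(elongation)
```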
Condensation algorithm
• Since the advantage of the condensation algorithm is that it is able to propagate the probability densities of many states in the search space, it is particularly useful for multiple-tracking problems as well as gesture recognition.
• Though the technique used here takes the motion of the gesture to be one motion trajectory, it is quite possible for the condensation algorithm to track both hands separately, for example.
• Dynamic Time Warping allows generalisation of a particular gesture by distorting each feature trajectory by three different parameters, $\alpha$, $\rho$ and $\phi$, which represent amplitude, rate and phase adjustments.
• Each state in the search space contains values for these three variables, as well as a variable $\mu$ to represent the model or gesture; $\mu$ generally represents a different number for each gesture. A state at time t is defined as $s_t = (\mu, \alpha, \rho, \phi)_t$.
• Essentially, the condensation algorithm consists of four basic steps: initialisation, selection, prediction and updating, as shown in Figure 3.1.
Figure 3.1: High-level block diagram of the condensation algorithm.
Figure 3.2: Flow diagram of the matching and selection processes of the condensation algorithm.
• Each trajectory described in this figure represents the variation of one particular feature over time.
• The training data for each gesture was treated as a vector of N values, one for each time step of the model trajectory.
• The search space was first initialised by choosing S sample states (typically of the order of 1000). This produced a set of S samples. A sketch of this initialisation follows below.
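As a concrete illustration, one sample state might be initialised as follows. This is a hypothetical Python sketch: the parameter ranges are assumptions, loosely guided by the $1 \pm 0.2$ warp ranges quoted later in the experiment section:

```python
import numpy as np

def init_state(n_models, warp_range=0.2, phi_max=10.0):
    """Uniform random initialisation of one sample state (mu, alpha, rho, phi).

    The warp range and maximum initial phase are assumptions; the report
    quotes warp ranges of 1 +/- 0.2 for alpha and rho in its experiments.
    """
    return (
        np.random.randint(n_models),                        # model / gesture number mu
        np.random.uniform(1 - warp_range, 1 + warp_range),  # amplitude scale alpha
        np.random.uniform(1 - warp_range, 1 + warp_range),  # rate rho
        np.random.uniform(0.0, phi_max),                    # phase phi into the model
    )
```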
• The purpose of the algorithm is to find the most likely state, i.e. the one that creates the best match for the input or observation data.
• The observation vector for a particular trajectory i (or each variable of the feature set) is denoted $z^{(i)}$.
• To find likelihoods for each state, DTW, according to the state parameters, must be performed on the model data. This is calculated as the probability of the observation given the state, $p(z_t \mid s_t)$, which is given by:
$p(z_t \mid s_t) = \prod_{i} \exp\left( -\frac{1}{2\,w\,\sigma_i^2} \sum_{j=0}^{w-1} \left( z^{(i)}_{t-j} - \alpha\, m^{(\mu,i)}_{\phi - \rho j} \right)^2 \right)$   (3.8)
• where $w$ is the size of the temporal window over which matching from t backwards to $t-w+1$ occurs, and the $\sigma_i$ are estimates of the standard deviation for each of the trajectories i for the whole sequence.
• Equation (3.8) represents the mean distance between the test gesture and a DTW'd model for a $w$-sized window of the trajectories.
• The term $\alpha\, m^{(\mu,i)}_{\phi - \rho j}$ performs the dynamic time warping of the trajectory i from the training data. The model number is $\mu$; the trajectory is shifted by $\phi$, interpolated by $\rho$, and scaled by $\alpha$.
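A minimal Python sketch of Equation (3.8) as reconstructed above, assuming linear interpolation for the off-grid model lookups; the function and argument names are illustrative:

```python
import numpy as np

def observation_likelihood(z, t, state, models, sigma, w=10):
    """Sketch of Equation (3.8): p(z_t | s_t) over a w-frame window.

    z      : observed feature trajectories, shape (T, n_features)
    state  : (mu, alpha, rho, phi)
    models : per-gesture model trajectories, models[mu] has shape (N, n_features)
    sigma  : per-trajectory standard deviation estimates, shape (n_features,)
    """
    mu, alpha, rho, phi = state
    model = models[mu]
    log_p = 0.0
    for i in range(z.shape[1]):                 # each feature trajectory i
        sq_dist = 0.0
        for j in range(min(w, t + 1)):          # match from t backwards
            # DTW of the model: shift by phi, rate-adjust by rho,
            # linearly interpolate the off-grid position.
            pos = min(max(phi - rho * j, 0.0), model.shape[0] - 1.0)
            lo = int(pos)
            hi = min(lo + 1, model.shape[0] - 1)
            m = model[lo, i] + (pos - lo) * (model[hi, i] - model[lo, i])
            # Amplitude-scale the warped model value and accumulate the distance.
            sq_dist += (z[t - j, i] - alpha * m) ** 2
        log_p -= sq_dist / (2.0 * w * sigma[i] ** 2)
    return np.exp(log_p)
```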
• Using S values of $p(z_t \mid s_t)$, it is possible to create a probability distribution of the whole search space at one time instant.
• Each conditional probability acts as a weighting for its corresponding state, and with successive iterations, the distributions of the states in the search space cluster round areas which represent the more likely gestures.
• The weights, or normalised probabilities, are calculated as follows:
$\pi_t^{(n)} = p(z_t \mid s_t^{(n)}) \Big/ \sum_{j=1}^{S} p(z_t \mid s_t^{(j)})$
• From these weights, it is possible to predict the probability distribution over the search space at the next time instant.
• Thus, more probable states are more likely to be propagated over the total time of the observed sequence.
• It is emphasised here that more than one probable state can be propagated at each time instant.
• The sample set is first initialised by sampling uniformly for each parameter of every sample; S samples are initialised, where S is chosen to be 1000.
• Once the samples have been initialised, the states to be propagated must be chosen.
• This is done by constructing a cumulative probability distribution using the weights, as shown in Figure 3.2.
• A value r is chosen uniformly from $[0, 1]$, and then the smallest value of the cumulative weight $c_{t-1}^{(n)}$ such that $c_{t-1}^{(n)} \geq r$ is found, where (t-1) represents the current time frame and t indexes the samples and weights of the next time frame that is being predicted.
• The corresponding state $s_{t-1}^{(n)}$ is then selected for propagation. A sketch of this selection step follows below.
• With this method of selection, larger weights are more likely to be chosen.
• The ordering of the cumulative weight distribution is therefore irrelevant.
• To avoid getting trapped in local minima or maxima, 5% to 10% of the sample set are randomly chosen and re-initialised, as described above.
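The selection step amounts to weighted resampling plus partial random re-initialisation. A sketch, assuming the hypothetical `init_state` helper from earlier:

```python
import numpy as np

def select_states(states, weights, frac_random=0.1, init_fn=None):
    """Select S states for propagation by weighted resampling."""
    S = len(states)
    cumulative = np.cumsum(weights)
    cumulative /= cumulative[-1]
    new_states = []
    for _ in range(S):
        r = np.random.uniform()
        # Smallest n such that the cumulative weight c[n] >= r.
        n = np.searchsorted(cumulative, r)
        new_states.append(states[n])
    # Randomly re-initialise 5-10% of the set to escape local maxima.
    if init_fn is not None:
        for k in np.random.choice(S, int(frac_random * S), replace=False):
            new_states[k] = init_fn()
    return new_states
```

Because selection depends only on where r falls in the cumulative distribution, the ordering of the weights is indeed irrelevant, as the slide notes.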
• After states have been selected for propagation, the parameters for that state at the next time step are predicted using the following equations:
$\mu_t = \mu_{t-1}, \quad \alpha_t = \alpha_{t-1} + \mathcal{N}(0, \sigma_\alpha), \quad \rho_t = \rho_{t-1} + \mathcal{N}(0, \sigma_\rho), \quad \phi_t = \phi_{t-1} + \rho_{t-1} + \mathcal{N}(0, \sigma_\phi)$
• where each $\mathcal{N}(0, \sigma)$ is a zero-mean Gaussian perturbation whose standard deviation is chosen for the corresponding parameter.
• After this stage, the new state is evaluated using the probability $p(z_t \mid s_t)$.
• If the conditional probability is effectively zero, then the state is predicted again using the above equations and $p(z_t \mid s_t)$ is recalculated.
• If this process needs to be repeated more than a predetermined number of times, then the state is deemed unlikely and it is reinitialised using the random initialisation described previously. The number of `tries' is a predetermined amount, and this was chosen to be 100.
• Once all S new states have been generated, the normalised weights $\pi_t^{(n)}$ are recalculated for state selection and propagation at the next time instant.
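A sketch of the prediction step with the `tries' limit; the Gaussian diffusion scales in `sigmas` are assumptions, since the report does not state its noise values:

```python
import numpy as np

def predict_state(state, sigmas, likelihood_fn, max_tries=100, init_fn=None):
    """Diffuse a selected state to the next time step.

    sigmas: dict of Gaussian noise scales, e.g. {"alpha": ..., "rho": ..., "phi": ...}.
    """
    mu, alpha, rho, phi = state
    for _ in range(max_tries):
        candidate = (
            mu,                                                # the gesture label is kept
            alpha + np.random.normal(0.0, sigmas["alpha"]),    # amplitude diffusion
            rho + np.random.normal(0.0, sigmas["rho"]),        # rate diffusion
            phi + rho + np.random.normal(0.0, sigmas["phi"]),  # phase advances at rate rho
        )
        # Keep the candidate only if its conditional probability is non-degenerate.
        if likelihood_fn(candidate) > 0.0:
            return candidate
    # Deemed unlikely after the allowed number of tries: re-initialise at random.
    return init_fn() if init_fn is not None else state
```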
• The process of selection, prediction and updating is repeated until the end of the observed sequence is reached or the gesture is considered recognised. A sketch of the overall loop follows below.
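Tying the earlier sketches together, an iteration-per-frame loop might look as follows. This is a hypothetical composition of the fragments above, not the author's MATLAB code; the recognition stopping criteria are discussed under Model Fitting and omitted here:

```python
def run_condensation(z, models, sigma, diffusion, S=1000):
    """Initialise once, then select / predict / update at every frame."""
    rand_init = lambda: init_state(len(models))
    states = [rand_init() for _ in range(S)]
    weights = [1.0 / S] * S
    for t in range(z.shape[0]):
        # Update: evaluate and normalise the observation likelihoods (the pi weights).
        raw = [observation_likelihood(z, t, s, models, sigma) for s in states]
        total = sum(raw) or 1.0
        weights = [p / total for p in raw]
        # Select states for propagation and predict their parameters at t + 1.
        selected = select_states(states, weights, init_fn=rand_init)
        states = [
            predict_state(s, diffusion,
                          lambda c: observation_likelihood(z, t, c, models, sigma),
                          init_fn=rand_init)
            for s in selected
        ]
    return states, weights
```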
Model Training
• The model was trained using six subjects, chosen at random, to represent the model data or training set for the algorithm.
• The model trajectories were created by interpolating each example of a particular gesture to the mean length of the trajectory for that gesture (see the sketch after this list).
• Then, the mean value at each time step, for each of the four trajectories, was calculated. Hence, each gesture was represented by four model trajectories. These were then used to match with the observed gestures. The adjustable parameter values were:
1. The window size: the total length of gestures varied between 30 and 90 frames.
2. The number of samples.
3. The percentage of randomised samples at each iteration: to minimise the risk of the condensation algorithm getting stuck in local maxima in the search space.
4. The warp ranges of the state parameters: control over the amount of generalisation of the model.
5. The number of tries: the size of the local search.
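A sketch of the model-trajectory construction just described: each training example is interpolated to the mean length for its gesture, then averaged pointwise. The function name and array layout are illustrative:

```python
import numpy as np

def build_model_trajectory(examples):
    """Average several example trajectories of one gesture.

    examples: list of arrays, one per training subject, each of shape
    (T_k, n_features). Every example is interpolated to the mean length,
    then the mean at each time step gives the model trajectory.
    """
    mean_len = int(round(np.mean([e.shape[0] for e in examples])))
    grid = np.linspace(0.0, 1.0, mean_len)
    resampled = []
    for e in examples:
        src = np.linspace(0.0, 1.0, e.shape[0])
        # Interpolate each of the feature trajectories onto the common grid.
        cols = [np.interp(grid, src, e[:, i]) for i in range(e.shape[1])]
        resampled.append(np.stack(cols, axis=1))
    return np.mean(resampled, axis=0)          # shape (mean_len, n_features)
```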
The Recognition Process: Model Fitting
• The video sequences that were used were manually trimmed, so that each sequence contained one gesture, performed once.
• It would have been artificial to find the most likely gesture only at the end of the sequence, since the likelihood of each gesture evolved in time and depended on which part of the gesture was being matched.
• Also, in reality, such a system would have to deal with a whole sequence of many gestures, so that the start and end of each gesture would be unknown. Therefore, it was important to find a suitable method of measuring when a gesture had been recognised by the algorithm.
• The method used involved taking the proportion of the highest likelihood against the second highest likelihood. If this fell below a predefined threshold for a certain number of consecutive frames, then the states would be reset and the gesture considered recognised.
• As well as this, a maximum value for the phase parameter $\phi$ was used, so that when $\phi$ was greater than it, the gesture would also be considered recognised. This was used to emulate a situation where the end of a gesture might be unknown.
• The algorithm stops when either one of these two criteria is satisfied. A sketch of the criteria follows below.
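A sketch of the two stopping criteria; the threshold, window length and $\phi_{max}$ values are placeholders, and the likelihood-ratio reading follows the slide's wording (recognition when the best-to-second-best proportion falls below the threshold):

```python
def gesture_recognised(likelihood_history, phases, ratio_threshold=1.5,
                       frames_needed=5, phi_max=100.0):
    """Sketch of the two stopping criteria.

    likelihood_history: per-frame lists of per-gesture likelihood totals,
                        most recent frame last.
    phases:             phase values phi of the current best states.
    """
    # Criterion 1: the proportion of the highest to the second-highest
    # likelihood stays below the threshold for consecutive frames.
    recent = likelihood_history[-frames_needed:]
    if len(recent) == frames_needed:
        def ratio(frame):
            best, second = sorted(frame, reverse=True)[:2]
            return best / second if second > 0 else float("inf")
        if all(ratio(f) < ratio_threshold for f in recent):
            return True
    # Criterion 2: the phase parameter has exceeded its maximum value,
    # emulating an unknown gesture end point.
    return max(phases) > phi_max
```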
Experiment
• The condensation algorithm was implemented using MATLAB to recognise six possible gestures, as shown in Figure 4.1.
• Video images of these six gestures were taken from sixteen different subjects, four times each. The gestures were: `come', `go', a high wave, a low wave, point left, and point right.
• All subjects were asked to perform the `come' and `go' gestures with both hands, and the waving and pointing gestures with their right hand.
• Training data was produced by taking six subjects, chosen at random. Nine subjects were chosen for input observation sequences.
• Confusion matrices were generated for all results.
• The mean values, and interpolation to the mean lengths of each gesture, for six of the subjects were calculated to create the training set.
• The rest of the subjects were used to test the algorithm.
• Figure 4.1: The six gestures that were used for recognition, clockwise from top left: come, go, point right, point left, a high wave and a low wave.
• The parameter values were chosen to be:
1. The window size: 10;
2. The number of samples: 1000;
3. The percentage of randomised samples at each iteration: 10%;
4. The warp ranges of the state parameters $\rho$ and $\alpha$: $1 \pm 0.2$;
5. The number of tries: 100.
Gesture Characteristics of Individual Subjects
• The overall performance of the algorithm was not good. Most subjects only had 2 or 3 gestures that were recognised at all. Only one subject had 5 out of 6 gestures recognised once or more.
• The TPRs, FPRs and accuracy for each gesture for every subject are shown in Tables 4.1, 4.2 and 4.3.
• Figure 4.2: Feature vectors of the `come' gesture from the subject, `Chris'.
• Figure 4.3: Feature vectors of the `come' gesture from the training set.
• Figure 4.4: Feature vectors of the low waving gesture from the training set.
• Figure 4.5: Feature vectors of the left pointing gesture from the training set.
• Figure 4.6: Feature vectors of an example of the left pointing gesture from the subject, `Cth'.
• Figure 4.7: Feature vectors of an example of the left pointing gesture from the subject, `Kate'.
The Recognition of Individual Gestures
• The `come' and `go' gestures were the most distinguishable from other gestures, since they had comparatively high TPRs and amongst the highest accuracy values.
• However, when tests were run with just `come' and `go' as possible classifications, the two gestures were completely indistinguishable.
• Observing the actual training trajectories of `come' and `go' in Figures 4.3 and 4.8, we can see that they are also virtually indistinguishable.
• Hence, the feature extraction technique was not able to capture the hand pose or orientation of the hands and wrist, which was where the fundamental difference between the two gestures lay.
• Figure 4.8: Feature vectors of the `go' gesture from the training set.
• Figure 4.9: Comparison of the motion area trajectories of the `come' and `go' gestures for subject, `Kate'.
• Figure 4.10: Comparison of the motion area trajectories of the `come' and `go' gestures for subject, `Chris'.
• Figure 4.11: Feature vectors of the high waving gesture from the training set.
The Effect of Classifying Fewer Gestures
The Effect of Altering Parameters in the Algorithm