Fusion of Neural Networks, Fuzzy Systems and Genetic Algorithms: Industrial Applications
by Lakhmi C. Jain; N.M. Martin
CRC Press, CRC Press LLC
ISBN: 0849398045   Pub Date: 11/01/98
  



As an alternative to articulatory measurement based on video camera acquisition (by far the more common and easier method), infra-red cameras capable of detecting the 3D position of small markers (passive reflectors) placed along the lip contour can be used. This approach offers high time resolution (100 Hz) together with precise spatial localization and tracking of the marked points, and it requires negligible storage. However, the nature and size of the markers typically limit the spatial resolution and prevent placing them on the tongue. An effective solution would be to integrate the two acquisition procedures and record both video and markers, at different rates. When video is recorded for parameter extraction, make-up is almost unavoidable to facilitate lip/tongue segmentation (see Figures 4 and 5). Lipstick in a suitable color, such as blue-cyan, which differs markedly in hue from the typical “pink” of human skin, can be used to enhance the contrast between lips and cheeks. The tongue is usually difficult to detect through image processing; an appreciable advantage is obtained by pointing a light source frontally at the mouth and by coloring the mouth cavity with natural substances (such as methylene blue, or the special paste used by TV actors and showmen to even out the tongue color and enhance its contrast with the teeth).
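The chroma-based lip segmentation described above can be illustrated with a minimal sketch. It assumes that pixels are available as 8-bit (R, G, B) triples and that the lipstick falls in a blue-cyan hue band; the function names, the hue band, and the saturation and brightness thresholds are hypothetical values chosen for illustration and would need tuning on real footage.

```python
import colorsys

def is_lipstick_pixel(rgb, hue_range=(0.45, 0.70), min_saturation=0.35, min_value=0.20):
    """Return True if an 8-bit (R, G, B) pixel lies in the assumed blue-cyan hue band."""
    r, g, b = (c / 255.0 for c in rgb)
    h, s, v = colorsys.rgb_to_hsv(r, g, b)  # h, s, v are all in [0, 1]
    return hue_range[0] <= h <= hue_range[1] and s >= min_saturation and v >= min_value

def segment_lips(image):
    """Build a binary mask (rows of 0/1) from an image given as nested lists of RGB triples."""
    return [[1 if is_lipstick_pixel(px) else 0 for px in row] for row in image]

# Tiny synthetic frame: pinkish "skin" pixels around two blue-cyan "lipstick" pixels.
skin = (224, 172, 160)
lips = (30, 140, 220)
frame = [
    [skin, skin, skin, skin],
    [skin, lips, lips, skin],
    [skin, skin, skin, skin],
]
mask = segment_lips(frame)
```

Thresholding on hue rather than on raw RGB values is what makes the large chromatic distance between the lipstick and the skin useful: the decision depends mostly on color and only weakly on illumination level.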

In order to evaluate the vertical aperture of the jaw (jaw motion has three degrees of freedom: vertical rotation, back-to-front translation, and side-to-side translation), the distance between two rigid reference points must be measured. Typical points are the tip of the nose and the tip of the chin, which must be marked in a way that facilitates their extraction and tracking.

With infra-red imaging and 3D localization, two reflectors can be placed at these points. With video acquisition and processing, instead, a suitable marker can be obtained by painting a small colored cross on the skin of the nose and chin. The previous considerations on color still apply.
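Once the nose and chin markers have been tracked, the jaw aperture measurement reduces to a point-to-point distance per frame. The sketch below assumes marker positions are given as coordinate tuples (2D pixel coordinates from video, or 3D coordinates from the infra-red system); the function names and the idea of expressing the opening relative to a rest frame are illustrative assumptions, not part of the original method.

```python
import math

def jaw_aperture(nose_tip, chin_tip):
    """Euclidean distance between the two reference markers (works in 2D or 3D)."""
    return math.dist(nose_tip, chin_tip)

def aperture_track(nose_track, chin_track, rest_frame=0):
    """Per-frame jaw opening relative to an assumed rest frame (hypothetical convention)."""
    distances = [jaw_aperture(n, c) for n, c in zip(nose_track, chin_track)]
    rest = distances[rest_frame]
    return [d - rest for d in distances]

# Two frames of toy 2D marker positions in pixels: mouth closed, then jaw lowered.
nose_track = [(100, 50), (100, 50)]
chin_track = [(100, 110), (100, 125)]
opening = aperture_track(nose_track, chin_track)
```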

Since mouth articulation is inherently three-dimensional, and since some visemes are characterized by lip protrusion and by the position of the tongue tip with respect to the teeth, a side view of the speaker’s mouth is almost necessary to complement the frontal information in video-based acquisition. Stereo video acquisition of two orthogonal views (frontal and side) can be achieved by means of a mirror placed at one side of the speaker’s head and oriented at 45° with respect to the camera.
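As a rough illustration of how the two orthogonal views complement each other, the sketch below merges a point seen in the frontal view, which gives (x, y), with the same point seen in the side view, which gives (z, y), into one 3D position. It assumes an idealized orthographic projection, equal scale in both views, and views that have already been cropped and aligned; the function name and the tolerance on the shared vertical coordinate are hypothetical.

```python
def merge_views(frontal_xy, side_zy, y_tolerance=2.0):
    """Combine a frontal (x, y) and a side (z, y) observation of one point into (x, y, z)."""
    x, y_front = frontal_xy
    z, y_side = side_zy
    # Both views share the vertical axis, so a large disagreement suggests a mismatch.
    if abs(y_front - y_side) > y_tolerance:
        raise ValueError("vertical coordinates disagree; views may be misaligned")
    return (x, (y_front + y_side) / 2.0, z)

# Toy coordinates for one lip point seen in both views.
point_3d = merge_views((12.0, 40.0), (7.5, 41.0))
```

In this idealized setting, lip protrusion can be read directly from the z coordinate, which the frontal view alone cannot provide.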


Figure 4  Make-up with coloured lipstick for chroma-key segmentation


Figure 5  Make-up with lipstick and reference point on the forehead


Figure 6  Model of the mouth with associated articulatory parameters (1063-6528/95$04.00 © 1995 IEEE).

H external height of the mouth
h internal height of the mouth
W external width of the mouth
w internal width of the mouth
dw segment of adjacency between the upper and the lower lips
LC mouth-nose distance
Lup external lip-nose distance
lup internal lip-nose distance
LM chin-nose distance (jaw aperture)

The mouth model employed in [8] and sketched in Figure 6 is defined by a vector of 10 parameters (LC, lup, Lup, dw, w, W, LM, h, H, teeth). These articulatory parameters have been analyzed to evaluate their cross-correlation and thus provide a measure of their mutual dependence. Some examples of significant cross-correlation surfaces are reported in Figure 7; from this analysis, a basis of 5 almost uncorrelated parameters (LM, H, W, dw, Lup) has been defined.
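The idea of reducing the parameter vector to an almost uncorrelated basis can be sketched as follows. This is not the procedure used in [8]: it substitutes the plain Pearson correlation coefficient for the cross-correlation surfaces of Figure 7 and uses a simple greedy selection. The threshold, the function names, and the parameter tracks are synthetic and purely illustrative.

```python
import math

def pearson(a, b):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def select_weakly_correlated(tracks, threshold=0.8):
    """Greedily keep parameters whose |correlation| with every kept one is below threshold."""
    selected = []
    for name, values in tracks.items():
        if all(abs(pearson(values, tracks[kept])) < threshold for kept in selected):
            selected.append(name)
    return selected

# Synthetic per-frame tracks (arbitrary units); h is built as H - 6, so it adds no information.
tracks = {
    "H": [10.0, 14.0, 18.0, 12.0, 20.0, 16.0],
    "h": [4.0, 8.0, 12.0, 6.0, 14.0, 10.0],
    "W": [50.0, 46.0, 52.0, 48.0, 47.0, 51.0],
}
basis = select_weakly_correlated(tracks)
```

With these toy tracks, h is discarded because it is fully correlated with H, while W is kept because it is only weakly correlated with H.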


Figure 7  Analysis of the cross-correlation among H-LM and H-W pairs of articulatory parameters

3.2 Acoustic/Visual Speech Analysis

Extensive experimentation on normal-hearing and hearing-impaired subjects [2-4] has clearly demonstrated that, while phonemes can be associated rather easily with well-defined mouth configurations (called “visemes”), the inverse association is usually troublesome, since the same mouth posture can correspond to different phonemes. For example, the “bilabial” viseme is associated with different phonemes such as /m, p, b/, and the “velar” viseme with different phonemes such as /k, g/.
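The asymmetry between the two directions of the association can be made concrete with a small lookup table. The sketch below uses only the two viseme classes mentioned above; the dictionary and function names are illustrative.

```python
from collections import defaultdict

# Many-to-one: each phoneme maps to exactly one viseme.
PHONEME_TO_VISEME = {
    "m": "bilabial", "p": "bilabial", "b": "bilabial",
    "k": "velar", "g": "velar",
}

def invert(mapping):
    """Invert a phoneme-to-viseme table; each viseme yields a list of candidate phonemes."""
    inverse = defaultdict(list)
    for phoneme, viseme in mapping.items():
        inverse[viseme].append(phoneme)
    return dict(inverse)

VISEME_TO_PHONEMES = invert(PHONEME_TO_VISEME)
```

The forward lookup always returns a single viseme, whereas the inverse lookup returns several candidate phonemes, which is why the visual cue alone cannot resolve the phoneme.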

Moreover, intensive investigations of articulatory dynamics [5-11] stress the role played by coarticulatory phenomena, which describe the effects on articulation of past acoustic outputs (backward coarticulation) and of acoustic information still to be produced (forward coarticulation).

A rather common approach consists of a preliminary phoneme recognition step followed by phoneme-to-viseme mapping, as shown in the scheme of Figure 8. In this case, estimates of the articulatory parameters are typically obtained by means of vector quantizers, neural networks, or Hidden Markov Models (HMMs) [12-19], which rely on preliminary learning procedures that train the system to associate acoustic speech representations with coherent visual information. These methodologies are very widely used in bimodal speech recognition, where visual cues are added to the conventional acoustic cues to improve system performance. This exploits the complementarity of audio and video, much as humans do in speech comprehension [20]. In all these methodologies, the phoneme-viseme association is performed in two separate, consecutive steps: phoneme recognition and articulatory estimation. In this respect, the approach resembles that of various algorithms proposed for converting text or phonetic transcriptions into audio/visual speech. Here, coarticulation modeling relies on a priori knowledge and is performed either during phoneme recognition or during articulatory estimation.
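The two-step scheme can be sketched with the simplest of the estimators mentioned above, a vector quantizer. In the toy example below, step 1 assigns each acoustic frame to the phoneme with the nearest codebook centroid, and step 2 maps the phoneme to a viseme and then to a set of articulatory targets. The codebook, the 2-D features, the target values, and all names are made-up placeholders; a real system would learn them from training data, and no coarticulation modeling is attempted here.

```python
import math

# Hypothetical codebook: one centroid of toy 2-D acoustic features per phoneme.
CODEBOOK = {
    "m": (1.0, 0.0),
    "b": (1.0, -1.0),
    "k": (-1.0, 0.0),
    "g": (-1.0, -1.0),
}
PHONEME_TO_VISEME = {"m": "bilabial", "b": "bilabial", "k": "velar", "g": "velar"}
# Placeholder articulatory targets (arbitrary units) for two of the basis parameters.
VISEME_TARGETS = {
    "bilabial": {"H": 0.0, "W": 40.0},
    "velar": {"H": 8.0, "W": 45.0},
}

def recognize_phoneme(features):
    """Step 1: vector quantization, pick the phoneme with the nearest centroid."""
    return min(CODEBOOK, key=lambda p: math.dist(features, CODEBOOK[p]))

def speech_to_mouth(frames):
    """Step 2: map each recognized phoneme to its viseme and articulatory targets."""
    result = []
    for features in frames:
        phoneme = recognize_phoneme(features)
        viseme = PHONEME_TO_VISEME[phoneme]
        result.append((phoneme, viseme, VISEME_TARGETS[viseme]))
    return result

# Two toy acoustic frames.
estimates = speech_to_mouth([(0.9, 0.1), (-0.8, 0.3)])
```

Because the two steps run in sequence, any error in phoneme recognition propagates directly to the estimated mouth shape.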


Figure 8  Speech is converted to lip movements after an intermediate stage of phoneme recognition (1063-6528/95$04.00 © 1995 IEEE).



Copyright © CRC Press LLC