Fusion of Neural Networks, Fuzzy Systems and Genetic Algorithms: Industrial Applications
by Lakhmi C. Jain; N.M. Martin
CRC Press, CRC Press LLC
ISBN: 0849398045   Pub Date: 11/01/98
  



4.6 MSE Minimization vs. Cross-Correlation Maximization

Unlike MSE-based learning, in which the TDNN output values try to track the target sequence exactly, when the cross-correlation is maximized the network output is similar in shape to the target sequence but usually different in amplitude.

Since two coherent sinusoids always have unitary cross-correlation despite possible differences in amplitude and mean value, the output produced by the TDNN can easily be adapted to the specific target sequence by means of suitable scale and shift factors. The main advantage of this kind of learning is that, since there is no constraint on the absolute value of the output, the TDNN neurons operate in the linear interval of the activation function, thus leading to fast convergence.
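To make the distinction concrete, the following sketch (illustrative Python, not the code used in the reported experiments; the helper names are ours) contrasts the pattern-target MSE with the normalized cross-correlation r and shows the scale/shift adaptation mentioned above.

# Illustrative sketch: MSE vs. normalized cross-correlation, plus the
# post-hoc scale/shift step that maps an r-trained output onto the target.
import numpy as np

def mse(output, target):
    # Conventional pattern-target mean squared error
    return np.mean((output - target) ** 2)

def r_score(output, target):
    # Normalized cross-correlation: invariant to amplitude and mean,
    # so two coherent signals score r = 1 even if their scales differ
    o = output - output.mean()
    t = target - target.mean()
    return np.dot(o, t) / (np.linalg.norm(o) * np.linalg.norm(t) + 1e-12)

def fit_scale_shift(output, target):
    # Least-squares scale a and shift b such that a*output + b ~ target
    a, b = np.polyfit(output, target, deg=1)
    return a * output + b

# A rescaled, offset copy of the target still has r close to 1,
# while its MSE stays large until the scale/shift correction is applied.
t = np.sin(np.linspace(0, 4 * np.pi, 200))
y = 0.3 * t + 0.7                      # same shape, different amplitude and mean
print(mse(y, t), r_score(y, t))        # large MSE, r close to 1
print(mse(fit_scale_shift(y, t), t))   # near-zero MSE after adaptation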


Figure 17  The lower curves provide a comparison between the MSE-based and r-based performances expressed in terms of pattern-target MSE. The first 1000 iterations have been done using DB1 as training/testing set while DB2 was used in the second 1000 iterations. The upper curves have been computed using DB3 as cross-validation set for the entire (2000 iterations) learning phase (1051-8215/97$10.00 © 1997 IEEE).

The size of the TDNN can be indicated as nU(12-8-3-1), meaning 12-dimensional input vectors, two hidden layers with 8 and 3 neurons, respectively, and a single output neuron. Delays have been sized as nD(2-4-6), indicating a delay of 2, 4, and 6 time instants at the first hidden, second hidden, and output layer, respectively. The training of the network has been performed in two steps, first on a simple audio-video database DB1 (vowels) and then on a more complex database DB2 (V/C/V transitions). In Figures 17 and 18 a comparison between the MSE and the r learning curves is reported with reference to 2000 iterations “by epoch,” 1000 for each database. The comparison is made with reference to the training set (DB1-DB2) as well as to a testing set (DB3) containing isolated words and used for cross-validation. Performance is expressed in terms of both pattern-target MSE (minimization) and pattern-target cross-correlation (maximization). The experimental curves show that, for this specific estimation problem, learning based on cross-correlation outperforms conventional MSE-based learning in terms of both convergence speed and distortion.
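One common way to realize this delay structure (a sketch under our own assumptions, not the authors’ original implementation) is as 1-D convolutions over time, where a layer delay of D corresponds to a kernel spanning the current frame plus D past frames:

# Illustrative TDNN with nU(12-8-3-1) units and nD(2-4-6) delays,
# implemented as temporal convolutions with kernel_size = delay + 1.
import torch
import torch.nn as nn

class TDNN(nn.Module):
    def __init__(self, n_units=(12, 8, 3, 1), delays=(2, 4, 6)):
        super().__init__()
        layers = []
        for n_in, n_out, d in zip(n_units[:-1], n_units[1:], delays):
            # each unit sees the current frame plus d delayed frames of the previous layer
            layers.append(nn.Conv1d(n_in, n_out, kernel_size=d + 1))
            layers.append(nn.Tanh())
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        # x: (batch, 12, time) sequence of 12-dimensional acoustic vectors
        return self.net(x)

model = TDNN()
x = torch.randn(1, 12, 100)   # 100 input frames
y = model(x)                  # (1, 1, 88): each output spans 2 + 4 + 6 = 12 past frames
print(y.shape)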


Figure 18  The lower curves provide a comparison between the MSE-based and r-based performances expressed in terms of pattern-target cross-correlation. The first 1000 iterations have been done using DB1 as training/testing set while DB2 was used in the second 1000 iterations. The upper curves have been computed using DB3 as cross-validation set for the entire (2000 iterations) learning phase (1051-8215/97$10.00 © 1997 IEEE).

5. Speech Visualization and Experimental Results

Our experimental results are satisfactory using networks with two hidden layers, composed of 8 and 3 units, respectively, with D(1) = 2, D(2) = 3, and D(3) = 4; since these delays sum to 9, each output pattern depends directly on the previous 9 input patterns.

In Figures 19 and 20, the articulatory parameter H, estimated through the network with reference to a test sequence, is compared to the actual parameter values.


Figure 19  Performances of the network evaluated on a test word extracted from the training set: the solid line indicates the actual mouth external height H while the dashed line represents the estimated H parameter (1063-6528/95$04.00 © 1995 IEEE).

Any effective synthesis of visual speech cues should reproduce on the screen all the necessary articulatory information, usually associated with a talking mouth. The articulatory estimates derived from the analysis of acoustic speech are usually very coarse and basically limited to the horizontal and vertical aperture of the lips. Important visible cues such as teeth visibility and tongue position are generally characterized by acoustic signatures too weak to be discriminated in noise; this results in poor visualization and consequent confusion in speech reading.


Figure 20  Performances of the network evaluated on a test word outside the training set: the solid line indicates the actual mouth external height H while the dashed line represents the H parameter estimated by the network (1063-6528/95$04.00 © 1995 IEEE).

The visual information associated with the talking mouth of one single speaker, seen from a constant point of view, without rotations, occlusions, and lighting variations, can be considered reasonably stationary. Based on this hypothesis, a valid statistical analysis of the image content can be carried out and a compact representation of it can be obtained.

The visualization methodology adopted is based on Vector Quantization procedures applied to cluster visemes in articulatory spaces of increasing dimensionality and then to represent them in the pixel domain as a combination of elementary blocks. The audio-video Italian corpus we have recorded includes more than 30,000 images synchronized with speech, in which all the necessary details of the visible mouth articulation are reproduced. Each image has been automatically classified in terms of articulatory descriptors of varying complexity, ranging from the plain pixel coordinates of some specific features to more sophisticated shape descriptions of the lips’ contour. Since each image contains both front and side views of the speaker’s face, the articulatory description is expressed by orthogonal parameters. The articulatory vectors that characterize the corpus images have been clustered in spaces of increasing dimensionality, yielding more and more precise quantized descriptions of the mouth configuration.

The resulting vector distribution is concentrated in small subregions of the articulatory space which identify particular mouth configurations representative of the various Italian visemes; the good properties of the employed clustering algorithms allow these small clusters to be identified. The main articulatory trajectories between the Italian visemes have been tracked and quantized into a pre-defined number of clusters (128 and 256), each of them associated with a corresponding image selected from the corpus, namely the image whose articulatory vector is closest to the cluster centroid. These images have been taken as “key-frames.”
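A minimal sketch of this key-frame selection step is given below (illustrative only; the function name, the use of k-means as the clustering algorithm, and the input arrays are our assumptions, not the authors’ pipeline):

# Cluster the articulatory vectors of the corpus images into a fixed codebook
# (e.g., 128 or 256 configurations) and pick, for each cluster, the image whose
# vector lies closest to the centroid as the key-frame.
import numpy as np
from sklearn.cluster import KMeans

def select_key_frames(articulatory_vectors, image_ids, n_clusters=128, seed=0):
    # articulatory_vectors and image_ids are assumed to be aligned index by index
    vectors = np.asarray(articulatory_vectors)
    km = KMeans(n_clusters=n_clusters, random_state=seed, n_init=10).fit(vectors)

    key_frames = []
    for c, centroid in enumerate(km.cluster_centers_):
        members = np.where(km.labels_ == c)[0]                     # images assigned to cluster c
        dists = np.linalg.norm(vectors[members] - centroid, axis=1)
        key_frames.append(image_ids[members[np.argmin(dists)]])    # closest image to the centroid
    return key_frames

# Usage sketch: 30,000 corpus images described by a few articulatory parameters.
# vectors = np.load("articulatory_vectors.npy")                    # hypothetical file, shape (30000, d)
# key_frames = select_key_frames(vectors, list(range(len(vectors))), n_clusters=256)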


Figure 21  Description of the speech visualization procedure.



Copyright © CRC Press LLC