Fusion of Neural Networks, Fuzzy Systems and Genetic Algorithms: Industrial Applications
by Lakhmi C. Jain; N.M. Martin. CRC Press LLC, ISBN 0849398045, Pub Date: 11/01/98
To reduce the memory requirements further, the key-frames have been subdivided into 4×4 pixel blocks, and the resulting vectors have been clustered a second time in the 16-dimensional pixel space, yielding a predefined number of key-blocks (128 or 256) that form the viseme reconstruction codebook (see Figure 21). Each of the 128 key-frames has finally been associated with a list of 7-bit indexes addressing the appropriate blocks in the reconstruction codebook. Experiments have been performed using 256×256 and 128×128 pel image formats, composed of 4096 and 1024 key-blocks, respectively.
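The block-based coding scheme above amounts to vector quantization in pixel space: each frame is cut into 4×4 patches, a codebook of key-blocks is learned by clustering, and a frame is then stored as a list of codebook indexes. The sketch below illustrates this pipeline under stated assumptions; the plain k-means routine, array layout, and function names are illustrative, not the book's actual implementation.

```python
import numpy as np

def extract_blocks(frame, block=4):
    """Split a square frame into non-overlapping block x block patches,
    flattened to 16-dimensional vectors (for block=4, as in the text)."""
    h, w = frame.shape
    return (frame.reshape(h // block, block, w // block, block)
                 .swapaxes(1, 2)
                 .reshape(-1, block * block))

def kmeans(vectors, k, iters=10, seed=0):
    """Plain k-means in pixel space; stands in for the (unspecified)
    clustering procedure that produces the key-block codebook."""
    rng = np.random.default_rng(seed)
    centroids = vectors[rng.choice(len(vectors), k, replace=False)].astype(float)
    for _ in range(iters):
        d = ((vectors[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        for j in range(k):
            if (labels == j).any():  # skip empty clusters
                centroids[j] = vectors[labels == j].mean(0)
    return centroids, labels

def encode(frame, codebook):
    """Map each 4x4 block to the index of its nearest key-block
    (7 bits suffice for a 128-entry codebook)."""
    blocks = extract_blocks(frame)
    d = ((blocks[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return d.argmin(1)

def decode(indexes, codebook, size):
    """Rebuild a size x size frame from its key-block indexes."""
    n = size // 4
    blocks = codebook[indexes].reshape(n, n, 4, 4)
    return blocks.swapaxes(1, 2).reshape(size, size)
```

With a 256-entry codebook the scheme would need 8-bit indexes; the 7-bit indexes mentioned in the text correspond to the 128-entry configuration.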
The visual synthesis of speech has been evaluated by computing the Mean Square Error (MSE) between the original images of the corpus and those reconstructed by means of the key-blocks. In this evaluation, the viseme reconstruction codebook has been addressed either with the actual articulatory vectors measured from the images or with the estimates derived through speech analysis. Various dimensionalities of the articulatory space and of the reconstruction codebook have been used, and the estimation MSE for each articulatory parameter has also been evaluated. The objective MSE evaluation alone, however, cannot give a sufficient indication of performance, since relevant components of speech reading depend on the quality with which the coarticulatory dynamics are rendered and on the level of coherence with acoustic speech.
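The objective measure used here is the standard per-pixel MSE; a minimal sketch, with the averaging convention assumed (the text does not specify whether the error is averaged per pixel or per frame):

```python
import numpy as np

def mse(original, reconstructed):
    """Mean Square Error between an original frame and its key-block
    reconstruction, averaged over all pixels (assumed convention)."""
    original = np.asarray(original, dtype=float)
    reconstructed = np.asarray(reconstructed, dtype=float)
    return ((original - reconstructed) ** 2).mean()
```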
For this reason, a set of subjective experiments has been carried out with both normal-hearing and hearing-impaired subjects. The experiments consisted of showing sample sequences of visualized speech to observers, who were asked to express their evaluation in terms of readability, visual discrimination, quality of the articulatory dynamics, and level of coherence with acoustic speech. Sequences were encoded off-line with different configurations; in particular, two choices were made both for spatial resolution (128×128 and 256×256 pixels) and for time resolution (12.5 and 25 frames/sec). The number of articulatory parameters used to synthesize the mouth articulation was increased from 2 (mouth height and width) to 10 (including the protrusion parameter extracted from the side view). The original video sequence, showing the speaker's mouth pronouncing a list of Italian isolated words (from a corpus of 400 words), was displayed at 12.5 frames/sec without audio, at half resolution (128×128 pels) and at full resolution (256×256 pels). Only the frontal view of the mouth was displayed. Observers, seated at a distance of 30 cm from a 21-inch monitor in a dark, quiet area of the laboratory, were asked to write down the words they succeeded in speech reading. The sequence was then displayed a second time with the time resolution increased to 25 frames/sec.
The presentation of the original sequence provided an indication of each observer's personal proficiency. In fact, besides the evident difference in sensitivity between normal-hearing and hearing-impaired people, significant variability is also present within each class of subjects. Therefore, the subjective evaluation score has been normalized for each individual on the basis of his/her speech-reading perception threshold. For each observer, and for each of the two image formats (128×128 or 256×256), the minimum time resolution (12.5 or 25 Hz) allowing successful speech reading was found. Success was measured on a restricted set of articulatorily easy words, for which 90% correct recognition was required.
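The per-observer threshold procedure above can be sketched as follows; the data layout, function name, and example words are illustrative assumptions, not part of the original study.

```python
def speechreading_threshold(results, easy_words, required=0.90):
    """For one observer and one image format, return the lowest frame
    rate at which at least `required` (90%) of the articulatorily easy
    words were read correctly, or None if no rate suffices.

    `results` maps frame rate (Hz) -> set of correctly read words;
    this layout is an illustrative assumption."""
    for rate in sorted(results):
        recognized = results[rate] & easy_words
        if len(recognized) / len(easy_words) >= required:
            return rate
    return None
```

A subsequent synthetic-image test for this observer would then be run at the returned frame rate, as described in the next paragraph of the text.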
Further experimentation with synthetic images was then performed, observer by observer, using the exact frame rate that had allowed his/her successful speech reading of the original sequences. The whole test was repeated, replacing the original images with synthetic ones reconstructed from key-blocks addressed by the actual (error-free) articulatory parameters. Finally, a third repetition of the test was performed, addressing the reconstruction codebook by means of parameters estimated from speech through the TDNNs (with estimation error). Before each repetition, the order of the words in the sequence was randomly shuffled to avoid expectation effects in the observers.
The number of people involved in the tests is still too small, especially as far as pathological subjects are concerned, but ongoing experimentation aims at enlarging this number significantly. A total of 15 observers took part in the evaluations, only 2 of whom are hearing impaired, with a 70 dB loss. The preliminary results reported in Tables 4 and 5 take into account, for each observer, only the words that he/she correctly recognized in the original sequence. This reflects our particular interest in evaluating how closely the reconstructed images resemble the originals as far as the possibility of correct speech reading is concerned.
The use of parameters W and H alone did not allow speech reading; since observers were more inclined to guess than to recognize, these test outcomes were considered unreliable and have been omitted. From the results in both tables, it is evident that the progressive introduction of parameters W, H, dw, LM, and Lup significantly raises the recognition rate, while only a slight improvement is gained when parameters h, w, LC, and lup are added. This is easily explained by the fact that the former set of parameters forms a basis (mutual independence), while the latter set is strongly correlated with the former and can supply only marginal information. The information associated with the teeth (supplied manually, since it could not be estimated through the TDNN) has proved to be of great importance for improving the quality of speech visualization, since it directly concerns the dental articulatory place and provides information on the tongue position.