Project guidelines

This project is oriented toward the design and recording of a multimodal speech corpus for applications in areas such as audio-visual information integration, multimodal communication, cortex-inspired models of multimodal perception, speech recognition, 3D head model animation, and image synthesis. The availability of a multimodal speech corpus is fundamental for the development and evaluation of multimodal communication systems. This project would contribute to the long-term objective shared by these research areas: designing Human Computer Interfaces that allow for more natural interaction with devices.

Project description

During the last decades there has been increasing interest in research on communication among humans, motivated by the need for more natural Human Computer Interfaces (HCI). Human communication is inherently multimodal. The audio channel (acoustic signal) generally provides the phonetic information necessary to communicate a linguistic message. The visual channel (human face) provides additional information that complements the message; in particular, the lips convey most of the visual information relevant to speech, while the face as a whole also conveys information related to prosody, expression and emotion. This suprasegmental information can become essential when the perception of the acoustic signal is degraded, for instance for hard-of-hearing people, or when the perception of emotions is impaired, as in children with autism. The third channel is gesture, a universal feature of human communication usually associated with speech production. Humans frequently accompany acoustic utterances with symbolic gestures that express the same meaning as the utterance; these gestures are usually redundant, but they become indispensable in cued speech, for instance. The availability of a multimodal speech corpus is fundamental for the development of audiovisual human computer interfaces. In contrast to the abundance of audio-only corpora, audio-visual speech databases are very limited, due to the inherent difficulty of compiling audio and visual data simultaneously. The main limitations of the existing audio-visual speech corpora are that they are specific to particular applications such as digit and isolated word recognition, they usually employ a single camera, and they are generally compiled from a small number of participants [Hazen et al., 2004] [Cooke et al., 2006] [Theobald et al., 2008]. A comprehensive overview of existing audio-visual corpora can be found in [Fanelli et al., 2010], where a 3D audio-visual corpus for affective communication is described.
The main goal of this project is to compile a multimodal speech corpus containing synchronized audio-visual data recorded from talking individuals. The corpus will incorporate several communication modes that appear in human communication, such as the acoustic signal, facial movements and body gestures during speech. The proposed corpus will be recorded in at least two languages (Spanish and French), with a relatively large number of participants, and with different data acquisition setups (single camera, multiple cameras, 3D cameras, a motion capture system, and an electromagnetic articulography system). The corpus will also cover different communication situations, such as isolated words, continuous speech and spontaneous speech. These characteristics would allow the corpus to be used in a wide range of applications and research areas.
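To make the intended organization more concrete, the following sketch (in Python) shows how the metadata of a single recording session could be described. The field names, enumerated values and default parameters are hypothetical placeholders, not part of the corpus specification.

```python
# Illustrative metadata record for one recording session of the corpus.
# All field names and enumerated values are hypothetical placeholders.
from dataclasses import dataclass, field
from typing import List

@dataclass
class RecordingSession:
    session_id: str
    speaker_id: str
    language: str          # e.g. "es" (Spanish) or "fr" (French)
    speech_style: str      # "isolated_words" | "continuous" | "spontaneous"
    setups: List[str] = field(default_factory=list)  # e.g. ["single_camera", "mocap"]
    sample_rate_hz: int = 48000
    video_fps: float = 50.0

# Example: a continuous-speech session recorded with several setups at once
session = RecordingSession(
    session_id="S001_T01",
    speaker_id="S001",
    language="es",
    speech_style="continuous",
    setups=["multi_camera", "3d_camera", "mocap"],
)
```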
To reach these goals, we combine in a tri-national project (Argentina, Chile and France) the skills of four teams respectively specialized in cortex-inspired models of multimodal perception (Cortex-LORIA), multimodal communication (Parole-LORIA), audio-visual information integration (CIFASIS-UNR) and image synthesis (DCC-UChile). The participants in this project come from multidisciplinary research teams with complementary competences, and they share as a core topic the study of multimodal communication from the production, synthesis and perception standpoints. In this context, several participants of this project have already carried out collaborative research activities in the field of bimodal human communication, working on audio-visual information integration, bio-inspired behavioral connectionist models and virtual head model animation. In that previous research, the integration of audio-visual information was analyzed for applications of virtual face model animation driven by video [Cerda et al., 2010] and by speech [Terissi et al., 2013]. In both cases, the visual information was represented by the movements of the mouth during speech [Terissi et al., 2010], captured using conventional video cameras. The resulting animations were evaluated in terms of intelligibility of visual speech through subjective tests, which showed that the visual information provided by the animated avatar improves the recognition of speech in noisy environments. However, these experimental results also suggest that facial expressions, head movements [Schultz et al., 2009] and tongue movements [Terissi et al., 2013] play an important role in the communication process, underlining the importance of multimodal communication.
The availability of a multimodal speech corpus has thus become a common need for the individual and collaborative research activities of these teams. The research activities of the participating teams related to the core topic of this project are briefly described in the following. These activities do not lie within the scope of the currently submitted project; their description only highlights the long-term goals of our teams that would take advantage of a multimodal speech corpus. It also shows how the diversity of our research lines ensures that the definition of the corpus in this project will take sufficiently different viewpoints into account. The Cortex-LORIA group develops models of neural networks inspired by the architectural and behavioral structure of the human brain. These models mostly address perceptive and motor tasks, trying both to mimic and to understand the way sensorimotor loops are handled by the human cortex. This work has led to the definition of bio-inspired neural network models for different kinds of perception, mostly visual, as well as neural models of multimodal integration and self-organization. Details on the neural models that we develop to detect motion patterns and to deal with multiple modalities are given in section B7, since they are closely related to the multimodal speech corpus.
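As a generic illustration of self-organized multimodal integration (and not of the Cortex-LORIA models themselves), the sketch below trains a small textbook self-organizing map on concatenated audio and visual feature vectors; the feature dimensions and training parameters are arbitrary assumptions.

```python
# Minimal self-organizing map (SOM) trained on concatenated audio-visual
# feature vectors: a generic textbook illustration, not the Cortex-LORIA models.
import numpy as np

rng = np.random.default_rng(0)

def train_som(data, grid=(8, 8), epochs=20, lr0=0.5, sigma0=3.0):
    n_units = grid[0] * grid[1]
    weights = rng.normal(size=(n_units, data.shape[1]))
    # 2D coordinates of each map unit, used by the neighborhood function
    coords = np.array([(i, j) for i in range(grid[0]) for j in range(grid[1])], float)
    n_steps = epochs * len(data)
    t = 0
    for _ in range(epochs):
        for x in rng.permutation(data):
            lr = lr0 * (1 - t / n_steps)                # decaying learning rate
            sigma = sigma0 * (1 - t / n_steps) + 1e-3   # shrinking neighborhood radius
            bmu = np.argmin(np.linalg.norm(weights - x, axis=1))  # best-matching unit
            h = np.exp(-np.sum((coords - coords[bmu]) ** 2, axis=1) / (2 * sigma ** 2))
            weights += lr * h[:, None] * (x - weights)  # pull neighborhood toward input
            t += 1
    return weights

# Hypothetical features: 12-dim acoustic vectors plus 6-dim lip-shape vectors
audio = rng.normal(size=(500, 12))
visual = rng.normal(size=(500, 6))
som_weights = train_som(np.hstack([audio, visual]))
```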
The Parole-LORIA group works on the modeling of speech to facilitate oral-based communication. One goal is to model speech mechanisms via a geometrical model of the speaker’s vocal tract and face, together with acoustic simulation of speech production. The first impact is the incorporation of speech production modeling into audiovisual speech synthesis: the group works on developing a highly intelligible talking head, i.e., a 3D virtual head that is animated, from an input text, together with the corresponding acoustic speech. This is crucial to better understand the mechanisms and processes involved in audiovisual speech communication.
The group at DCC-UChile works on the animation of virtual faces driven by speech or video, and on its applications. The group has experience in the creation and transformation of surface and volume meshes, and treats animation as a particular case of mesh deformation. Specifically, since face models have a limited number of points, face animation raises two research questions: (1) how to match and track different face models, and (2) how to animate any face model realistically based on the animation of a simple one. In terms of geometric meshes, face animation presents an important challenge because these meshes undergo large and fast deformations, while at the same time properties such as correctness and smoothness must be ensured to obtain good quality animations. Among the possible applications, face animation and tracking have potential in clinical assessment and treatment (visual feedback) of facial paralysis, an approach that our group plans to explore in collaboration with local physiotherapy institutions.
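A common way to treat face animation as a particular case of mesh deformation is linear blendshape interpolation. The sketch below illustrates that generic technique only; it is not the DCC-UChile animation pipeline, and the vertex counts, targets and weights are chosen arbitrarily.

```python
# Face animation as mesh deformation via linear blendshapes: each frame is a
# weighted sum of displacement targets added to a neutral (rest) mesh.
# Generic illustration, not the DCC-UChile method.
import numpy as np

def animate(neutral, blendshapes, weights):
    """neutral: (V, 3) vertex positions of the rest face.
    blendshapes: (B, V, 3) per-target vertex displacements from the neutral face.
    weights: (T, B) per-frame blendshape weights.
    Returns (T, V, 3) animated vertex positions."""
    return neutral[None, :, :] + np.einsum("tb,bvc->tvc", weights, blendshapes)

# Tiny example: 4 vertices, 2 hypothetical targets ("jaw open", "smile"), 3 frames
neutral = np.zeros((4, 3))
shapes = np.random.rand(2, 4, 3) * 0.1
weights = np.array([[0.0, 0.0], [0.5, 0.2], [1.0, 0.4]])
frames = animate(neutral, shapes, weights)   # shape (3, 4, 3)
```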
The group at CIFASIS-UNR works on the integration of audio-visual information for speech processing, mimicking the communication mode of humans. In particular, two applications are being considered: speech-driven face animation and audiovisual speech recognition. In these applications, a fundamental task is feature extraction from both the acoustic and the visual signals. In this research, combined Hidden Markov Models (HMMs) are proposed to integrate the audiovisual information. In the speech-driven face animation application, the HMM is used to estimate the visual features from the acoustic ones, while in audiovisual speech recognition it is used to enhance recognition by exploiting both the audio and visual signals.
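As a simplified illustration of this HMM-based approach, the sketch below trains one Gaussian HMM per word on concatenated audio-visual feature vectors and classifies a test utterance by maximum log-likelihood. It relies on the third-party hmmlearn library, and the whole-word, single-stream formulation is an assumption made for brevity; it is not the group's actual combined-HMM models.

```python
# Whole-word audiovisual speech recognition sketch: one Gaussian HMM per word,
# trained on concatenated audio+visual feature frames, decision by likelihood.
import numpy as np
from hmmlearn import hmm

def train_word_models(train_data, n_states=5):
    # train_data: dict mapping word -> list of (T_i, D) feature arrays,
    # where each frame concatenates acoustic and visual features.
    models = {}
    for word, sequences in train_data.items():
        X = np.concatenate(sequences)           # stack frames of all utterances
        lengths = [len(s) for s in sequences]   # per-utterance frame counts
        m = hmm.GaussianHMM(n_components=n_states, covariance_type="diag", n_iter=50)
        m.fit(X, lengths)
        models[word] = m
    return models

def recognize(models, utterance):
    # utterance: (T, D) array; choose the word whose model gives the highest score
    return max(models, key=lambda w: models[w].score(utterance))
```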
Each participating team will contribute to the present project by specifying its particular requirements for the corpus, and all teams are expected to participate in a balanced way in every stage of the project.
The compilation of a multimodal speech corpus is a very challenging undertaking, since it involves a variety of complex tasks. It requires the design and definition of the content of the corpus, the selection of the data acquisition setups and recording equipment, the specification of the features to record, the recording format, and the recording protocol. It also requires post-recording processing associated with the segmentation and annotation of the data, and low-level pre-processing of the raw data. In particular, the following tasks will be carried out for the compilation of the corpus.
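As a purely illustrative example of the post-recording segmentation and annotation step mentioned above, the sketch below stores word- and gesture-level time alignments of one utterance as a JSON record. The tier names, labels and fields are hypothetical; the actual annotation format will be defined during the project.

```python
# Illustrative time-aligned annotation for one utterance, stored as JSON.
# Tier names, labels and fields are placeholders, not the final format.
import json

annotation = {
    "session_id": "S001_T01",
    "utterance_id": "U0007",
    "tiers": {
        "words": [
            {"label": "hola", "start_s": 0.42, "end_s": 0.78},
            {"label": "mundo", "start_s": 0.81, "end_s": 1.30},
        ],
        "gestures": [
            {"label": "head_nod", "start_s": 0.40, "end_s": 0.95},
        ],
    },
}

with open("S001_T01_U0007.json", "w", encoding="utf-8") as f:
    json.dump(annotation, f, ensure_ascii=False, indent=2)
```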

Project scope

This project focuses on the design and recording of a multimodal speech corpus that satisfies the needs of the participating research teams. Since a multimodal speech corpus is fundamental in several research areas, it is expected that the proposed corpus could be employed in a wide range of applications and research lines. As mentioned before, the design and compilation of such a corpus are complex and time-consuming tasks, so the complete compilation of the proposed corpus is expected to extend beyond the two-year duration of this project. This project is a first step, whose main objectives are the definition of the central characteristics and recording protocols of the final corpus through the evaluation of prototype corpora.