This project addresses the design and recording of a multimodal speech corpus for applications in areas such as audio-visual information integration, multimodal communication, cortex-inspired models of multimodal perception, speech recognition, 3D head model animation, and image synthesis. The availability of a multimodal speech corpus is fundamental for the development and evaluation of multimodal communication systems. This project would contribute to the long-term objective shared by these research areas: designing Human Computer Interfaces that allow more natural interaction with devices.
Project description
During the last decades there has been increasing interest in research on human communication, motivated by the need for more natural Human Computer Interfaces (HCI). Human communication is inherently multimodal. The audio channel (acoustic signal) generally provides the phonetic information necessary to communicate a linguistic message. The visual channel (human face) provides additional information that complements the message; in particular, the lips convey most of the visual speech information. The visual channel also carries information related to prosody, expression, and emotion. This suprasegmental information can become essential when perception of the acoustic signal is degraded, for instance for hard-of-hearing people, or when the perception of emotions is impaired, as in autistic children. The third channel is gesture. Gesture is a universal feature of human communication, usually associated with speech production. Humans frequently produce acoustic utterances while performing symbolic gestures that express the same meaning as the utterance; these gestures are usually redundant, but become indispensable in cued speech, for instance.
The availability of a multimodal speech corpus is fundamental for the development of audiovisual human computer interfaces. In contrast to the abundance of audio-only corpora, audio-visual speech databases are very limited, due to the inherent difficulty of compiling audio and visual data simultaneously. The existing audio-visual speech corpora have several limitations: they are specific to particular applications such as digit and isolated word recognition, they usually employ a single camera, and they are generally compiled from a small number of participants [Hazen et al., 2004] [Cooke et al., 2006] [Theobald et al., 2008]. A comprehensive overview of existing audio-visual corpora is given in [Fanelli et al., 2010], where a 3D audio-visual corpus for affective communication is described.
The main goal of this project is to compile a multimodal speech corpus containing synchronized audio-visual data recorded from talking individuals. The corpus will incorporate several communication modes that appear in human communication, such as the acoustic signal, facial movements, and body gestures during speech. The proposed corpus will be recorded in at least two different languages (Spanish and French), with a relatively large number of participants, and with different data acquisition setups (single camera, multiple cameras, 3D cameras, motion capture system, and electromagnetography system). The corpus will also cover different communication situations such as isolated words, continuous speech, and spontaneous speech. These characteristics would allow the corpus to be used in a wide range of applications and research areas.
To reach these goals, we combine in a tri-national project (Argentina, Chile and France)
the skills of four teams that are respectively specialized in cortex-inspired models of multimodal
perception (Cortex-LORIA), multimodal communication (Parole-LORIA), audio-visual
information integration (CIFASIS-UNR), and image synthesis (DCC-UChile). The participants in this project come from multidisciplinary research teams and have complementary competences, while sharing as a core topic the study of multimodal communication from the production, synthesis, and perception standpoints. In this context, several participants of this project have already carried out collaborative research activities in the field of bimodal human communication, working on audio-visual information integration, behavioral bio-inspired connectionist models, and virtual head model animation. In this previous research, the integration of audio-visual information was analyzed for applications of virtual face model animation driven by video [Cerda et al., 2010] and by speech [Terissi et al., 2013]. In both cases, the visual information was represented by the movements of the mouth during speech [Terissi et al., 2010], captured using conventional video cameras. The resulting animations were evaluated in terms of the intelligibility of visual speech through subjective tests. These results show that the visual information provided by the animated avatar improves speech recognition in noisy environments. However, they also suggest that facial expressions, head movements [Schultz et al., 2009], and tongue movements [Terissi et al., 2013] play an important role in the communication process, underscoring the importance of multimodal communication.
The availability of a multimodal speech corpus has thus become a common need for the individual and collaborative research activities of these teams. The research activities of the participating teams, related to the core topic of this project, are briefly described in the following. These activities do not lie within the scope of the currently submitted project; their description only highlights the long-term goals of our teams that will take advantage of a multimodal speech corpus. It also shows how the diversity of our research lines ensures that the definition of the corpus in this project will take sufficiently different viewpoints into account.
The Cortex-LORIA group develops neural network models inspired by the architectural and behavioral structure of the human brain. These models mostly address perceptual and motor tasks, aiming both to mimic and to understand the way sensorimotor loops are handled by the human cortex. This work has led to the definition of bio-inspired neural network models for different kinds of perception, mostly visual, as well as neural models of multimodal integration and self-organization. Details on the neural models that we develop to detect motion patterns and to deal with multiple modalities are given in section B7, since they are closely related to the multimodal speech corpus.
The Parole-LORIA group is working on the modeling of speech to facilitate oral-based communication. One goal is to model speech mechanisms via a geometrical model of the speaker's vocal tract and face, together with acoustic simulation of speech production. The first impact is the incorporation of speech production modeling into audiovisual speech synthesis. Indeed, the group works on developing a highly intelligible talking head (i.e., from an input text, a 3D virtual head is animated together with the acoustic speech corresponding to that text). This is crucial to better understand the mechanisms and processes involved in audiovisual speech communication.
The group at DCC-UChile is working on the animation of virtual faces driven by speech or video, and its applications. The group has experience in the creation and transformation of surface and volume meshes, and treats animation as a particular case of mesh deformation. Specifically, since face models have a limited number of points, face animation raises two research questions: (1) how to match and track different face models, and (2) how to realistically animate any face model based on the animation of a simpler one. In terms of geometric meshes, face animation presents an important challenge because these meshes undergo large and fast deformations, while it is critical to ensure properties such as correctness and smoothness to obtain good-quality animations. Among the possible applications, face animation and tracking show potential for clinical assessment and treatment (visual feedback) of facial paralysis, an approach that our group plans to explore in collaboration with local physiotherapy institutions.
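As an illustration of treating face animation as a particular case of mesh deformation, the following minimal sketch uses a linear blendshape scheme, a common technique in this area; it is not the group's specific method, and the array names and shapes are assumptions made for the example.

    # Minimal blendshape sketch: a deformed face is the neutral mesh plus a
    # weighted sum of per-expression vertex displacements.
    # neutral: (V, 3); blendshape_deltas: (K, V, 3); weights: (K,).
    import numpy as np

    def deform_face(neutral, blendshape_deltas, weights):
        """Return the deformed vertex positions for one animation frame."""
        weights = np.asarray(weights).reshape(-1, 1, 1)        # (K, 1, 1)
        return neutral + (weights * blendshape_deltas).sum(axis=0)

    def animate(neutral, blendshape_deltas, weight_sequence):
        """Apply a sequence of weight vectors (one per frame) to the mesh."""
        return np.stack([deform_face(neutral, blendshape_deltas, w)
                         for w in weight_sequence])            # (T, V, 3)

In such a scheme, retargeting the animation of a simple model to a richer one amounts to transferring the weight sequence, provided corresponding blendshapes exist on both meshes; the smoothness and correctness requirements mentioned above then constrain how those blendshapes are built.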
The group at CIFASIS-UNR is working on the integration of audio-visual information for speech processing, mimicking the communication mode in humans. In particular, two applications are being considered: speech-driven face animation and audiovisual speech recognition. In these applications, a fundamental task is feature extraction from both the acoustic and visual signals. In this research, combined Hidden Markov Models (HMMs) are proposed to integrate the audiovisual information. In the speech-driven face animation application, the HMM is used to estimate the visual features from the acoustic ones. In audiovisual speech recognition, the HMM is used to enhance recognition by exploiting both the audio and visual signals.
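To give an idea of the kind of HMM-based mapping involved, the following is a minimal sketch assuming the hmmlearn library and pre-extracted, frame-synchronized acoustic and visual feature arrays; it reduces the combined-HMM approach of [Terissi et al., 2013] to a simple per-state lookup and is not the actual implementation.

    # Minimal sketch of speech-driven visual feature estimation with an HMM.
    # audio_feats: (n_frames, n_audio_dims), e.g. MFCCs per video frame.
    # visual_feats: (n_frames, n_visual_dims), e.g. mouth-shape parameters.
    import numpy as np
    from hmmlearn.hmm import GaussianHMM

    def train_audio_to_visual_hmm(audio_feats, visual_feats, n_states=16):
        """Fit an HMM on acoustic features and associate each hidden state
        with the mean of the synchronized visual features."""
        hmm = GaussianHMM(n_components=n_states, covariance_type="diag",
                          n_iter=50)
        hmm.fit(audio_feats)
        states = hmm.predict(audio_feats)
        visual_means = np.vstack([
            visual_feats[states == s].mean(axis=0) if np.any(states == s)
            else np.zeros(visual_feats.shape[1])
            for s in range(n_states)])
        return hmm, visual_means

    def estimate_visual_features(hmm, visual_means, new_audio_feats):
        """Decode unseen audio and map each frame's state to visual
        features, which can then drive a face animation model."""
        return visual_means[hmm.predict(new_audio_feats)]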
Each participating team will contribute to the present project by specifying its particular requirements for the corpus, and all teams are expected to participate in a balanced way in every stage of the project.
The compilation of a multimodal speech corpus is a challenging undertaking, since it involves a variety of complex tasks. It requires the design and definition of the content of the corpus, the selection of the data acquisition setups and recording equipment to be employed, the specification of the features to record, the registration format, and the recording protocol. It also requires post-recording processing associated with the segmentation and annotation of the data, and low-level pre-processing of the raw data. In particular, the following tasks will be carried out for the compilation of the corpus.
This project focuses on the design and recording of a multimodal speech corpus that satisfies the needs of the participating research teams. Since a multimodal speech corpus is fundamental in several research areas, it is expected that the proposed corpus could be employed in a wide range of applications and research lines. As mentioned before, the design and compilation of such a corpus are complex and time-consuming tasks, so the complete compilation of the proposed corpus is expected to extend beyond the two-year duration of this project. This project is a first step whose main objectives are the definition of the central characteristics and recording protocols of the final corpus, through the evaluation of prototype corpora.
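As a purely illustrative example of the kind of registration format and annotation metadata that the recording protocol will need to define, a per-session record might look as follows; every field name and value here is a hypothetical placeholder, not the project's final specification.

    # Hypothetical sketch of a per-recording metadata record for the corpus;
    # the actual registration format will be defined during the project.
    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class Segment:
        start_s: float          # segment start time, in seconds
        end_s: float            # segment end time, in seconds
        transcript: str         # orthographic or phonetic annotation

    @dataclass
    class RecordingSession:
        speaker_id: str         # anonymized participant identifier
        language: str           # e.g. "es" or "fr"
        task: str               # "isolated_word", "continuous", "spontaneous"
        modalities: List[str]   # e.g. ["audio", "video_front", "mocap"]
        sample_rate_hz: int     # audio sampling rate
        video_fps: float        # video frame rate
        segments: List[Segment] = field(default_factory=list)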