Project BAVI aims to explore the possible contributions of
computational neuroscience to more classical approaches to
audio-visual information integration and to the visual animation of face
motions. The project does not include an industrial partner. It mostly
consists of the initial definition of new bio-inspired computational
models and computer imaging concepts for audio-visual processing,
although practical applications of these concepts are expected in the
near future.
The project brings together three very active lines of research:
audio-visual information integration, behavioral bio-inspired
connectionist models, and visual animation. Building on previous
cooperation on image analysis and motion detection, we intend to
establish a strong collaboration between our teams, each covering one
of the three complementary research fields mentioned above. This
collaboration will directly benefit from the research tasks carried out
by each academic partner. To combine and coordinate these research
activities, we request funding from the Stic-AmSud program to
facilitate scientific exchanges between our teams through yearly
meetings and exchanges of PhD and Master students.
The use of bio-inspired models for the processing of visual information is
currently a topic of great interest, due to the natural robustness, parallelism
and precision that can be achieved with these models. One of the remarkable properties
of the visual processing performed by the brain is its ability to rapidly explore
and focus on a set of relevant visual features related to a specific
cognitive task. Our main goal is to study, model, and exploit
this ability in the specific case of the visual perception of motion that is
involved in audio-visual information integration. This includes the
visual extraction of retinotopically distributed patterns of motion, defined in
a cortically inspired way. Taking our inspiration from the functionalities
of the cerebral cortex, we consider that audio-visual integration partially relies on
a duality between perception and visualization. Audio-visual
pattern-based animation of speech is therefore essential to our project.
The various aspects of this research are initially organized around complementary bilateral collaborations between our teams.
The Cortex group develops models of neural networks that are inspired by the
architectural and behavioral structure of the human brain. These models mostly address
perceptive and motor tasks, trying to both mimic and understand the way the sensorimotor
loops are handled by the human cortex. This work has led to the definition of bio-inspired
neural network models for different kinds of visual perception. One aspect of
visual processing of particular interest to this research group is the study
of how the brain detects particular patterns of motion, how these patterns are coded,
and how they are retrieved from previous knowledge in a robust way that tolerates
natural deformations such as changes of scale, partial rotations, and translations.
Such patterns of motion emerge from local movements that are extracted and integrated
through the dynamics of massively distributed, locally connected maps of neurons.
These patterns may correspond to complex motions associated with sets of moving objects,
or to combined motions and deformations of an articulated object (such as the face or the mouth).
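To make this mechanism more concrete, the following sketch (in Python with NumPy/SciPy) shows how a locally connected map with a difference-of-Gaussians lateral interaction can integrate a noisy retinotopic map of local motion energy into a stable bump of activity marking a coherent pattern of motion. All function names, array sizes, and parameter values are illustrative assumptions, not the Cortex group's actual models.

import numpy as np
from scipy.signal import fftconvolve

def dog_kernel(size=21, sigma_exc=1.5, sigma_inh=4.0, a_exc=1.0, a_inh=0.7):
    """Difference-of-Gaussians lateral kernel: local excitation, broader inhibition."""
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    d2 = xx**2 + yy**2
    return (a_exc * np.exp(-d2 / (2 * sigma_exc**2))
            - a_inh * np.exp(-d2 / (2 * sigma_inh**2)))

def relax_field(motion_energy, steps=60, dt=0.1, tau=1.0):
    """Let the field relax on a retinotopic map of local motion energy.
    The surviving bump of activity marks the dominant pattern of motion."""
    u = np.zeros_like(motion_energy)          # membrane potential of the map
    w = dog_kernel()
    for _ in range(steps):
        fr = np.maximum(u, 0.0)               # rectified firing rate
        lateral = fftconvolve(fr, w, mode="same")
        u += dt / tau * (-u + lateral + motion_energy)
    return np.maximum(u, 0.0)

# Illustrative input: noisy motion energy with one coherent moving region.
rng = np.random.default_rng(0)
energy = 0.2 * rng.random((64, 64))
energy[20:30, 35:45] += 1.0                   # coherent local movements
activity = relax_field(energy)
print("peak of the emergent motion pattern:",
      np.unravel_index(activity.argmax(), activity.shape))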
In a different but related line of work, the Laboratory for System Dynamics and Signal
Processing (LSD&SP) at the Universidad Nacional de Rosario in Argentina is
working on the integration of audio-visual information for speech processing, mimicking
human communication, which is bimodal in nature. In particular, two
applications are being considered: speech-driven face animation and audiovisual speech
recognition. In both applications, a fundamental task is feature extraction from
the acoustic and visual signals. Standard cepstral-domain features are employed
to represent the acoustic signal, while the visual information is represented
by the positions of markers around the mouth region. Independent Component
Analysis is used to obtain a more compact representation of the visual information.
Combined Hidden Markov Models (HMMs) are proposed to integrate the audiovisual information.
In the speech-driven face animation application, the HMM is used to estimate the
visual features from the acoustic ones; in audiovisual speech recognition, it is
used to enhance recognition by exploiting both the audio and visual signals.
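As a rough illustration of this pipeline, the Python sketch below combines cepstral features (MFCCs) with ICA-compressed marker positions and trains one Gaussian HMM per word. It relies on the librosa, scikit-learn, and hmmlearn libraries as stand-ins for the LSD&SP implementation; all names, dimensions, and the naive frame alignment are assumptions made for illustration only.

import numpy as np
import librosa
from sklearn.decomposition import FastICA
from hmmlearn.hmm import GaussianHMM

def audiovisual_features(wav_path, marker_xy, sr=16000, n_mfcc=13, n_ica=6):
    """Cepstral features for the acoustic stream, ICA-compressed marker
    positions for the visual stream, concatenated frame by frame."""
    signal, sr = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc).T   # (T_audio, 13)
    ica = FastICA(n_components=n_ica, random_state=0)
    visual = ica.fit_transform(marker_xy)                           # (T_video, 6)
    T = min(len(mfcc), len(visual))       # naive alignment of the two frame rates
    return np.hstack([mfcc[:T], visual[:T]])

def train_word_model(observation_sequences, n_states=5):
    """One HMM per word/viseme class, trained on the joint observations."""
    X = np.vstack(observation_sequences)
    lengths = [len(s) for s in observation_sequences]
    model = GaussianHMM(n_components=n_states, covariance_type="diag", n_iter=50)
    model.fit(X, lengths)
    return model

Recognition would then amount to scoring a test sequence under each word model and keeping the best one, while speech-driven animation would replace this classification step by a regression from acoustic to visual features.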
In the case of speech-driven facial animation, the visual features are used to compute the
Facial Animation Parameters of the MPEG-4 standard, which drive the animation of virtual
heads. An alternative way to extract visual information from videos would be to use the
bio-inspired motion detection models developed by the Cortex group to capture the
deformation of the mouth during speech.
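As an illustration of this last step, the sketch below converts a hypothetical set of tracked mouth landmarks into two MPEG-4-style normalized displacements. The landmark names, the neutral-face calibration, and the final mapping to actual FAP indices and integer scaling (which follow the MPEG-4 tables) are assumptions made for illustration only.

import numpy as np

def fap_like_displacements(neutral, frame):
    """Illustrative conversion of tracked mouth landmarks (dicts of 2D points,
    hypothetical names) into MPEG-4-style displacements normalized by
    Facial Animation Parameter Units measured on the neutral face."""
    mw0 = np.linalg.norm(neutral["mouth_left"] - neutral["mouth_right"])    # mouth width
    mns0 = np.linalg.norm(neutral["nose_tip"] - neutral["upper_lip_mid"])   # mouth-nose separation
    # Positive when the lower lip moves down (y-up coordinates assumed).
    jaw_opening = (neutral["lower_lip_mid"][1] - frame["lower_lip_mid"][1]) / mns0
    mouth_stretch = (np.linalg.norm(frame["mouth_left"] - frame["mouth_right"]) - mw0) / mw0
    return {"jaw_opening": jaw_opening, "mouth_stretch": mouth_stretch}

# Toy example: the lower lip drops while the mouth width stays constant.
neutral = {"mouth_left": np.array([-30.0, 0.0]), "mouth_right": np.array([30.0, 0.0]),
           "nose_tip": np.array([0.0, 40.0]), "upper_lip_mid": np.array([0.0, 8.0]),
           "lower_lip_mid": np.array([0.0, -8.0])}
frame = dict(neutral, lower_lip_mid=np.array([0.0, -20.0]))
print(fap_like_displacements(neutral, frame))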
In addition, the Department of Computer Science (DCC) at the Universidad de Chile
has long experience in the generation of 2D, 2½D, and 3D meshes, using both
triangulations and mixed-element meshes. Efficient algorithms have been developed
for (a) the generation of an initial mesh that fits complex domains, (b) the refinement
of a mesh according to criteria such as maximum edge length and maximum
area, (c) the improvement of a mesh according to criteria such as maximum
or minimum angle, and (d) the post-processing of the mesh to adapt it to the
particular requirements of the underlying numerical method. This work was mainly
done for fixed domain geometries, but over the last three years a 2½D mesh
generator has been developed in order to model and visualize the tree growth process.
This mesh generator required the design and implementation of algorithms that handle
small and large deformations of the tree geometry: after each growth
simulation step, new positions are computed for each mesh point, and the original
mesh must be deformed to represent the new tree geometry.
To visualize the growth process, the mesh generator also includes
visualization techniques that allow the whole process to be animated. This work
on mesh optimization, deformation, and animation can serve as a basis for a tool
that animates virtual heads using the visual information extracted from videos.
The same technique can also be used to animate virtual humans using body motion features extracted from videos.
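As a minimal illustration of the refinement criterion mentioned above, the following Python sketch performs one pass of edge bisection on the triangles whose longest edge exceeds a maximum length. It is not the DCC mesh generator; all names and thresholds are assumptions, and a real implementation would also propagate each split to the neighboring triangle to avoid hanging nodes.

import numpy as np

def refine_by_max_edge(points, triangles, max_len):
    """One refinement pass: any triangle whose longest edge exceeds `max_len`
    is split in two by inserting the midpoint of that edge."""
    points = [np.asarray(p, dtype=float) for p in points]
    out = []
    for tri in triangles:
        p = [points[i] for i in tri]
        # length of the edge opposite each local vertex k
        lens = [np.linalg.norm(p[(k + 1) % 3] - p[(k + 2) % 3]) for k in range(3)]
        k = int(np.argmax(lens))
        if lens[k] <= max_len:
            out.append(tuple(tri))
            continue
        a, b = tri[(k + 1) % 3], tri[(k + 2) % 3]     # endpoints of the longest edge
        points.append((points[a] + points[b]) / 2.0)  # new midpoint vertex
        m = len(points) - 1
        out.append((tri[k], a, m))                    # two children, same orientation
        out.append((tri[k], m, b))
    return np.array(points), out

# Toy example: a single large triangle refined once.
pts = [(0.0, 0.0), (4.0, 0.0), (0.0, 3.0)]
tris = [(0, 1, 2)]
new_pts, new_tris = refine_by_max_edge(pts, tris, max_len=3.0)
print(len(new_tris), "triangles after refinement")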
In this way, the three lines of research would complement and enrich each other,
leading to a synergistic collaboration between the groups. A schematic representation
of the proposed collaboration is depicted in Fig. 1.
Our project is an attempt to design a new kind of model for robust audio-visual cognitive tasks, based on the emerging bio-inspired research domain of computational neuroscience. Apart from "ambitious" long-term perspectives, we do not target a large set of such tasks within the two-year duration of the project. Nor do we expect our models to outperform current speech recognition methods or to reach the realism of state-of-the-art face animation projects. Our main goal is to demonstrate the relevance of cortical inspiration in the field of audio-visual processing, opening promising perspectives in terms of robust multimodal integration. As a first step, we focus on the relation between the distributed processing performed along the visual cortical pathway and the definition of generic patterns of face motion that may both improve speech recognition and code for visual animation sequences.
As mentioned above, the main expected contributions range from bio-inspired models
for audio-visual integration to motion visualization. More precisely, our goal is
to strongly impact our three main fields of research with new results covering
the following aspects: