Project BAVI aims to explore the possible contributions of computational neuroscience to more classical approaches to audio-visual information integration and to the visual animation of face motions. The project does not include an industrial partner. It mostly consists of the initial definition of new bio-inspired computational models and computer-imaging concepts for audio-visual processing, although practical applications of these concepts are expected to follow shortly after the project.

The project brings together three very active lines of research: audio-visual information integration, behavioral bio-inspired connectionist models, and visual animation. Building on previous cooperation on image analysis and motion detection, we intend to establish a strong collaboration between our teams, each covering one of the three complementary research fields mentioned above. This collaboration will directly benefit from the research tasks carried out by each academic partner. In order to combine and coordinate these research activities, we request funding from the Stic-AmSud program to facilitate scientific exchanges between our teams through yearly meetings and exchanges of PhD and Master students.


BAVI Project description

The use of bio-inspired models for the processing of visual information is currently a topic of great interest, due to the natural robustness, parallelism and precision that can be achieved with these models. One of the remarkable properties of the visual processing performed by the brain is its ability to rapidly explore and focus on the set of relevant visual features related to a specific cognitive task. Our main goal is to study, model and exploit this ability in the specific case of the visual perception of motion involved in audio-visual information integration. This includes the visual extraction of retinotopically distributed patterns of motion, defined in a cortically inspired way. Taking our inspiration from the functionalities of the cerebral cortex, we consider that audio-visual integration partly relies on a duality between perception and visualization. Audio-visual pattern-based animation of speech is therefore essential to our project.

The various aspects of this research are first organized around complementary bilateral collaborations between our different teams.

The Cortex group develops models of neural networks inspired by the architectural and behavioral structure of the human brain. These models mostly address perceptive and motor tasks, trying both to mimic and to understand the way sensorimotor loops are handled by the human cortex. This work has led to the definition of bio-inspired neural network models for different kinds of visual perception. One aspect of visual processing of particular interest to this group is the study of how the brain detects particular patterns of motion, how they are coded, and how they are retrieved from previous knowledge in a robust way that is tolerant to natural deformations such as changes of scale, partial rotations and translations. Such patterns of motion emerge from local movements that are extracted and integrated through the dynamics of massively distributed, locally connected maps of neurons. These patterns may correspond to complex motions associated with sets of moving objects, or to the combined motions and deformations of an articulated object (such as a face or mouth).
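To make this kind of mechanism more concrete, the following sketch shows a toy, rate-based neural field on a retinotopic map: local motion energy is injected as input, and a difference-of-Gaussians lateral kernel (short-range excitation, longer-range inhibition) lets a coherent bubble of activity emerge from noisy local measurements. All names, parameter values and the kernel shape are illustrative assumptions for this example; the Cortex group's actual models are considerably richer.

```python
import numpy as np
from scipy.signal import convolve2d

def dog_kernel(size, sigma_exc, sigma_inh, k_exc=1.0, k_inh=0.7):
    """Difference-of-Gaussians lateral kernel: short-range excitation,
    longer-range inhibition (illustrative parameter values)."""
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    d2 = xx ** 2 + yy ** 2
    return (k_exc * np.exp(-d2 / (2 * sigma_exc ** 2))
            - k_inh * np.exp(-d2 / (2 * sigma_inh ** 2)))

def field_step(u, stimulus, kernel, tau=10.0, h=-0.1, dt=1.0):
    """One Euler step of a simple rate-based neural field on a retinotopic map."""
    rates = np.clip(u, 0.0, 1.0)  # rectified, saturating rates keep the toy dynamics bounded
    lateral = convolve2d(rates, kernel, mode="same", boundary="wrap")
    return u + dt * (-u + lateral + stimulus + h) / tau

# Toy run: a noisy blob of local motion energy is integrated into a stable,
# spatially coherent bubble of activity on the map.
rng = np.random.default_rng(0)
size = 64
yy, xx = np.mgrid[0:size, 0:size]
stimulus = np.exp(-((xx - 40) ** 2 + (yy - 24) ** 2) / (2 * 5.0 ** 2))
stimulus += 0.2 * rng.standard_normal((size, size))
u = np.zeros((size, size))
kernel = dog_kernel(17, sigma_exc=2.0, sigma_inh=6.0)
for _ in range(200):
    u = field_step(u, stimulus, kernel)
print("peak activity at", np.unravel_index(np.argmax(u), u.shape))
```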

In a different but related line of work, the Laboratory for System Dynamics and Signal Processing (LSD&SP) at the Universidad Nacional de Rosario in Argentina is working on the integration of audio-visual information for speech processing, mimicking human communication, which is bimodal in nature. In particular, two applications are being considered: speech-driven face animation and audio-visual speech recognition. In both applications, a fundamental task is feature extraction from the acoustic and visual signals. Standard cepstral-domain features are employed to represent the acoustic signal, while the visual information is represented by the positions of markers around the mouth region. Independent Component Analysis is used to obtain a more compact representation of the visual information. Combined Hidden Markov Models (HMMs) are proposed to integrate the audio-visual information. In speech-driven face animation, the HMM is used to estimate the visual features from the acoustic ones; in audio-visual speech recognition, it is used to enhance recognition by exploiting both the audio and visual signals.
In the case of speech-driven facial animation, the visual features are used to compute the Facial Animation Parameters of the MPEG-4 standard, which drive the animation of virtual heads. An alternative way to extract visual information from video would be to use the bio-inspired motion detection models developed by the Cortex group to capture the deformation of the mouth during speech.
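As a purely illustrative sketch of this pipeline, the fragment below trains a single Gaussian HMM on concatenated audio-visual features and then performs a crude audio-to-visual conversion: the state sequence is decoded from the audio block alone and each state's visual mean is read out. The feature dimensions, the random stand-in data, and the helper functions (audio_loglik, viterbi) are assumptions made for the example; the combined HMMs and inversion procedures actually used by LSD&SP are more elaborate.

```python
import numpy as np
from hmmlearn.hmm import GaussianHMM

rng = np.random.default_rng(0)

# Hypothetical stand-in features: 13 MFCC-like coefficients per audio frame and
# 4 ICA components summarising mouth-marker positions (toy, correlated data).
T, Da, Dv = 500, 13, 4
audio = rng.standard_normal((T, Da))
visual = 0.5 * audio[:, :Dv] + 0.1 * rng.standard_normal((T, Dv))

# Train one Gaussian HMM on the concatenated audio-visual observation vectors.
joint = GaussianHMM(n_components=8, covariance_type="diag", n_iter=25, random_state=0)
joint.fit(np.hstack([audio, visual]))

means = joint.means_                                # (K, Da + Dv) state means
covars = np.asarray(joint.covars_)
if covars.ndim == 3:                                # hmmlearn may expand diag covariances
    covars = np.diagonal(covars, axis1=1, axis2=2)  # ... keep only the variances (K, D)

def audio_loglik(frames, means_a, vars_a):
    """Per-frame, per-state diagonal-Gaussian log-likelihood of the audio block."""
    diff = frames[:, None, :] - means_a[None, :, :]
    return -0.5 * (np.log(2 * np.pi * vars_a).sum(axis=1)[None, :]
                   + (diff ** 2 / vars_a[None, :, :]).sum(axis=2))

def viterbi(log_b, log_pi, log_A):
    """Most likely state sequence given per-frame state log-likelihoods."""
    n_frames, K = log_b.shape
    delta = np.empty((n_frames, K))
    back = np.zeros((n_frames, K), dtype=int)
    delta[0] = log_pi + log_b[0]
    for t in range(1, n_frames):
        scores = delta[t - 1][:, None] + log_A      # scores[i, j]: from state i to state j
        back[t] = scores.argmax(axis=0)
        delta[t] = scores[back[t], np.arange(K)] + log_b[t]
    path = np.empty(n_frames, dtype=int)
    path[-1] = delta[-1].argmax()
    for t in range(n_frames - 2, -1, -1):
        path[t] = back[t + 1, path[t + 1]]
    return path

# "Inversion": decode the states from the audio part only, then output each
# state's visual mean as the predicted visual feature vector.
log_b = audio_loglik(audio, means[:, :Da], covars[:, :Da])
states = viterbi(log_b, np.log(joint.startprob_ + 1e-12), np.log(joint.transmat_ + 1e-12))
predicted_visual = means[states, Da:]               # one visual vector per audio frame
print(predicted_visual.shape)                       # e.g. (500, 4)
```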

In addition, the Department of Computer Science (DCC) at the Universidad de Chile has long experience in the generation of 2D, 2½D and 3D meshes, using both triangulations and mixed-element meshes. Efficient algorithms have been developed for (a) the generation of an initial mesh that fits complex domains, (b) the refinement of a mesh according to refinement criteria such as maximum edge length and maximum area, (c) the improvement of a mesh according to improvement criteria such as maximum or minimum angle, and (d) the post-processing of the mesh in order to adapt it to particular requirements of the underlying numerical method. This work was mainly done for fixed domain geometries, but in the last three years the development of a 2½D mesh generator was started in order to model and visualize the tree growth process. This mesh generator required the design and implementation of algorithms that handle small and large deformations of the tree geometry: after each growth simulation step, new positions are computed for each mesh point, and the original mesh must then be deformed to represent the new tree geometry. To visualize the growth process, the mesh generator also includes visualization techniques that allow the whole process to be animated. This work on mesh optimization, deformation and animation can serve as a basis for the development of a tool for animating virtual heads from visual information extracted from video. The same techniques can also be used to animate virtual humans from body motion features extracted from video.
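As a rough illustration of the kind of feature-driven deformation involved, the sketch below displaces a small set of control points (standing in for tracked mouth markers) and propagates their displacements to all mesh vertices with Gaussian radial-basis weights, then checks the minimum interior angle as a simple quality criterion. The functions deform_mesh and min_angle, the weighting scheme and the toy grid are illustrative assumptions, not the DCC mesh generator.

```python
import numpy as np

def deform_mesh(vertices, handles, displacements, sigma=0.15):
    """Propagate the displacements of a few control points (e.g. tracked mouth
    markers) to every mesh vertex using Gaussian radial-basis weights.
    vertices: (N, dim), handles: (H, dim), displacements: (H, dim)."""
    d2 = ((vertices[:, None, :] - handles[None, :, :]) ** 2).sum(axis=2)
    w = np.exp(-d2 / (2.0 * sigma ** 2))                # (N, H) influence weights
    w /= w.sum(axis=1, keepdims=True) + 1e-12           # normalise per vertex
    return vertices + w @ displacements

def min_angle(vertices, triangles):
    """Smallest interior angle (degrees) over all triangles: a simple quality
    criterion to check that the deformation has not degraded the mesh."""
    worst = np.pi
    for a, b, c in triangles:
        p = vertices[[a, b, c]]
        for i in range(3):
            u = p[(i + 1) % 3] - p[i]
            v = p[(i + 2) % 3] - p[i]
            cosang = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12)
            worst = min(worst, np.arccos(np.clip(cosang, -1.0, 1.0)))
    return np.degrees(worst)

# Toy example: a regular 2D grid around a "mouth", two handle points pulled apart.
n = 11
xs, ys = np.meshgrid(np.linspace(0, 1, n), np.linspace(0, 1, n))
verts = np.column_stack([xs.ravel(), ys.ravel()])
tris = ([(i + j * n, i + 1 + j * n, i + (j + 1) * n)
         for j in range(n - 1) for i in range(n - 1)] +
        [(i + 1 + j * n, i + 1 + (j + 1) * n, i + (j + 1) * n)
         for j in range(n - 1) for i in range(n - 1)])
handles = np.array([[0.5, 0.4], [0.5, 0.6]])
disp = np.array([[0.0, -0.05], [0.0, 0.05]])            # mouth "opening"
deformed = deform_mesh(verts, handles, disp)
print("min angle before: %.1f deg, after: %.1f deg"
      % (min_angle(verts, tris), min_angle(deformed, tris)))
```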

In this way, the three lines of research complement and enrich each other, leading to a synergistic collaboration between the groups. A schematic representation of the proposed collaboration is depicted in Fig. 1.


Fig. 1: Flow chart of the collaboration proposal.



Project scope

Our project is an attempt to design a new kind of model for robust audio-visual cognitive tasks, based on the emerging bio-inspired research domain of computational neuroscience. Apart from "ambitious" long-term perspectives, we do not target a large set of such tasks within the two-year duration of the project. We expect our models neither to outperform current speech recognition methods nor to reach the realism of state-of-the-art face animation projects. Our main goal is to demonstrate the relevance of cortical inspiration in the field of audio-visual processing, opening promising perspectives in terms of robust multimodal integration. As a first step, we focus on the relation between the distributed processing performed along the visual cortical pathway and the definition of generic patterns of face motion that may both improve speech recognition and serve as a code for visual animation sequences.


Expected results

As mentioned above, the main expected contributions range from bio-inspired models for audio-visual integration to motion visualization. More precisely, our goal is to make a strong impact on our three main fields of research with new results covering the following aspects:

  • Analysis of related neurophysiological experiments and theories.
  • Definition of bio-inspired patterns of distributed motion for articulated and deformable objects.
  • Definition of patterns of motion based on extracted motion features.
  • Visualization of motion features.
  • Audio-visual statistical model construction using bio-inspired motion features and the associated acoustic features.
  • Audio-to-visual conversion based on inversion of statistical audio-visual models.
  • Face and body animation based on the conversion of motion features into animation parameters.
  • Mesh generation tool for the modeling of speech-driven facial animation.