
Multimodal Co-Speech Gesture Generation System

Introduction

This repository hosts a collaborative project focused on 3D Co-Speech Gesture Generation. The system aims to synthesize natural, diverse, and semantically aligned gestures for virtual avatars from different input modalities (text and speech audio).

The project is developed by a team of three, building on a shared baseline architecture. Each member works on a separate branch to optimize one component of the generation pipeline: the generative algorithm, semantic representation, or multimodal input processing.

Module Overview

This repository is organized into three sub-modules. For detailed instructions on environment setup, model training, and inference scripts, see the README.md in each subdirectory.

1. Diffusion-based Gesture Generation

  • Directory: ./Diffusion_Text2Gesture/
  • Maintainer: [WEI LI]
  • Description: This module replaces the traditional regression-based output layer with a Transformer-based Gaussian diffusion model. It addresses the "mean collapse" problem common in deterministic models, producing gestures with realistic dynamics and high diversity (see the sketch after this list).
  • Key Features: 12D feature representation, iterative denoising, and text-conditioned generation.
  • Documentation: Read Module Documentation
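As a rough illustration of the iterative denoising described above, here is a minimal sampling sketch assuming a standard DDPM-style linear noise schedule and a hypothetical Transformer denoiser that predicts noise conditioned on a text embedding. The names (`model`, `text_emb`, `seq_len`, `feat_dim`) are illustrative, not the module's actual API.

```python
# Minimal text-conditioned DDPM sampling loop (illustrative, not the module's exact code).
import torch

T = 1000                                   # number of diffusion steps
betas = torch.linspace(1e-4, 0.02, T)      # linear noise schedule
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

@torch.no_grad()
def sample_gesture(model, text_emb, seq_len=120, feat_dim=12):
    """Denoise pure noise into a gesture sequence of shape (seq_len, feat_dim)."""
    x = torch.randn(1, seq_len, feat_dim)              # start from Gaussian noise
    for t in reversed(range(T)):
        t_batch = torch.full((1,), t, dtype=torch.long)
        eps = model(x, t_batch, text_emb)              # predicted noise, text-conditioned
        coef = betas[t] / torch.sqrt(1.0 - alpha_bars[t])
        mean = (x - coef * eps) / torch.sqrt(alphas[t])
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[t]) * noise        # DDPM posterior sample
    return x.squeeze(0)                                # (seq_len, 12) pose features
```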

2. BERT-based Semantic Embedding

  • Directory: ./BERT_Embedding_Module/
  • Maintainer: [Mateus DE BRITO GUIRARDELLO]
  • Description: This module integrates pre-trained BERT embeddings in place of standard word vectors, enhancing the model's ability to capture complex linguistic nuances, emotional tone, and semantic context from long text inputs (see the sketch after this list).
  • Key Features: Context-aware text encoding and semantic alignment optimization.
  • Documentation: Read Module Documentation
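The following is a minimal sketch of context-aware text encoding with a pre-trained BERT, assuming the Hugging Face `transformers` library; the mean pooling over token embeddings is an assumption for illustration and may differ from the module's actual alignment strategy.

```python
# Minimal BERT sentence-encoding sketch (pooling strategy is an assumption).
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")
bert.eval()

@torch.no_grad()
def encode_text(sentence: str) -> torch.Tensor:
    """Return a 768-d context-aware sentence embedding."""
    inputs = tokenizer(sentence, return_tensors="pt", truncation=True, max_length=512)
    outputs = bert(**inputs)                               # last_hidden_state: (1, seq, 768)
    mask = inputs["attention_mask"].unsqueeze(-1).float()  # ignore padding tokens
    summed = (outputs.last_hidden_state * mask).sum(dim=1)
    return (summed / mask.sum(dim=1)).squeeze(0)           # mean-pooled embedding

emb = encode_text("She waved goodbye with a big smile.")
print(emb.shape)  # torch.Size([768])
```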

3. Voice-driven Motion Interface

  • Directory: ./Voice_Input_Module/ (adjust the folder name if it differs)
  • Maintainer: [Yujia GUO]
  • Description: This module expands the system's input modality to support direct audio signals. It extracts acoustic features (such as MFCCs and prosody) to synchronize gesture rhythm and intensity with the speech audio (see the sketch after this list).
  • Key Features: Audio feature extraction, cross-modal alignment, and speech-driven synthesis.
  • Documentation: Read Module Documentation
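Below is a minimal sketch of the acoustic feature extraction mentioned above (MFCCs plus pitch and energy as simple prosody proxies), assuming `librosa`; frame sizes and feature counts are illustrative defaults, not necessarily the module's configuration.

```python
# Minimal per-frame acoustic feature extraction sketch (illustrative defaults).
import librosa
import numpy as np

def extract_audio_features(wav_path: str, sr: int = 16000) -> np.ndarray:
    """Return per-frame MFCC + prosodic (pitch, energy) features, shape (frames, 15)."""
    y, sr = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, hop_length=160)   # (13, frames)
    f0, _, _ = librosa.pyin(y, fmin=librosa.note_to_hz("C2"),
                            fmax=librosa.note_to_hz("C7"),
                            sr=sr, hop_length=160)                       # pitch contour
    energy = librosa.feature.rms(y=y, hop_length=160)                    # (1, frames)
    f0 = np.nan_to_num(f0)[np.newaxis, :]                                # unvoiced -> 0
    frames = min(mfcc.shape[1], f0.shape[1], energy.shape[1])
    return np.vstack([mfcc[:, :frames], f0[:, :frames], energy[:, :frames]]).T
```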

Contributors

  • [WEI LI]: Diffusion Model Implementation & Optimization.
  • [Mateus DE BRITO GUIRARDELLO]: BERT Representation Learning & Text Encoder.
  • [Yujia GUO]: Audio Feature Extraction & Voice Pipeline.
