ChatAnyone: Alibaba’s HumanAIGC Team Unveils Open-Source Real-Time Digital Avatar Project
Introduction
ChatAnyone is an open-source project developed by Alibaba's HumanAIGC team, designed to generate real-time, stylized upper-body animation videos from a single portrait photo and an audio input. Launched in 2025, the project is detailed in the paper ChatAnyone: Stylized Real-Time Portrait Video Generation with Hierarchical Motion Diffusion Model, authored by Jinwei Qi et al. and published on arXiv. The project addresses the growing demand for virtual streamers, digital humans, and real-time interactive applications, leveraging a Hierarchical Motion Diffusion Model for efficient, high-quality video generation.
Project Background: Why ChatAnyone Matters
Industry Trends Driving Development
In recent years, the rise of virtual streamers, digital humans, and online education has fueled demand for immersive experiences. Static avatars and simple voice interactions no longer suffice—real-time, dynamic, and realistic video generation has become the new industry standard.
Technical Challenges Addressed
ChatAnyone tackles key challenges in real-time video generation:
- Lip Syncing: Ensures precise synchronization between audio and lip movements.
- Natural Expressions: Generates lifelike facial expressions driven by the audio input.
- Stylized Output: Supports customizable styles such as cartoon rendering.
- Dual-Host Scenarios: Enables multi-person interaction, such as dual-host podcast videos.
According to the technical report, ChatAnyone achieves a 30 fps generation speed on an RTX 4090 GPU at resolutions up to 512×768, meeting the demands of real-time applications.
Use Cases
The project supports generating upper-body animations from a single photo, making it ideal for:
- Virtual streamer live broadcasts
- Podcast video creation
- Interactive online education
For example, content creators can use ChatAnyone to produce dual-host podcast videos, significantly reducing production costs.
Team Background: HumanAIGC at Alibaba
HumanAIGC is part of Alibaba Group’s Tongyi Lab, specializing in human-centered generative AI technologies. The team has made significant strides in areas like real-time portrait video generation, virtual try-ons, and character animation.
However, there has been some skepticism about the team's commitment to open-sourcing. Related projects such as AnimateAnyone and Emote Portrait Alive were initially announced as open-source but never fully released their source code, sparking community debate. This reflects the delicate balance large tech companies must strike between open innovation and commercial interests.
Key Highlights of ChatAnyone
1. Real-Time Generation
   - Achieves 30 fps on an RTX 4090 GPU.
   - Supports resolutions up to 512×768, suitable for consumer-grade hardware.
2. Stylized Output
   - Offers multiple styles, including cartoon rendering, for greater customization.
3. Dual-Host Support
   - Enables multi-person collaboration scenarios, such as dual-host podcast videos.
4. Audio-Driven Animation
   - Combines voice feature extraction with lip syncing and expression generation for lifelike results.
Technical Architecture: How ChatAnyone Works
ChatAnyone’s architecture is built on a Hierarchical Motion Diffusion Model, breaking down the video generation process into layers for efficient real-time rendering. Here’s a closer look:
Core Technology: Hierarchical Motion Diffusion Model
- Layered Design: Decomposes video generation into layers (e.g., overall pose, facial expressions, hand gestures). Each layer uses an independent diffusion model, reducing computational complexity while enhancing naturalness (see the sketch after this list).
- Motion Modeling: Handles different types of motion (head, hands, body) separately, ensuring smooth transitions and fluid movement.
- Diffusion Models: Leverages diffusion models combined with conditional control (e.g., audio input) for audio-driven video generation.
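To make the layered idea concrete, here is a minimal PyTorch sketch of per-layer, audio-conditioned denoisers. The class names, motion dimensions, and conditioning scheme are illustrative assumptions, not taken from the ChatAnyone paper or codebase.

```python
# Hypothetical sketch of hierarchical, audio-conditioned motion denoising.
# All names and dimensions are assumptions for illustration only.
import torch
import torch.nn as nn

class MotionLayer(nn.Module):
    """One small denoiser for a single motion stream
    (e.g., head pose, facial expression, or hand gesture)."""
    def __init__(self, motion_dim: int, audio_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(motion_dim + audio_dim + 1, hidden),  # +1 for timestep
            nn.ReLU(),
            nn.Linear(hidden, motion_dim),
        )

    def forward(self, noisy_motion, audio_feat, t):
        # Predict the noise to remove, conditioned on audio features.
        t_emb = t.float().unsqueeze(-1) / 1000.0
        x = torch.cat([noisy_motion, audio_feat, t_emb], dim=-1)
        return self.net(x)

class HierarchicalMotionModel(nn.Module):
    """Independent denoisers per motion layer, as the article describes:
    pose, expression, and gesture are modeled separately."""
    def __init__(self, audio_dim: int = 80):
        super().__init__()
        self.layers = nn.ModuleDict({
            "pose": MotionLayer(6, audio_dim),         # head/body pose params
            "expression": MotionLayer(64, audio_dim),  # expression coefficients
            "gesture": MotionLayer(42, audio_dim),     # hand keypoint offsets
        })

    def denoise_step(self, motions, audio_feat, t):
        # Each stream runs its own small model, keeping per-step cost low.
        return {name: layer(motions[name], audio_feat, t)
                for name, layer in self.layers.items()}
```

Because each stream has its own compact model, a denoising step for all layers is cheaper than one pass through a single monolithic diffusion network, which is what makes the layered design attractive for real-time use.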
Audio Processing & Driving
- Voice Feature Extraction: Extracts features such as mel-spectrograms from the input audio to drive lip movements and facial expressions (a minimal extraction sketch follows this list).
- Lip Syncing: Aligns lip movements with the input audio for accurate synchronization.
- Expression Generation: Uses emotional cues in the audio to produce natural expression changes, enhancing realism.
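As an illustration of the kind of feature extraction described above, the following sketch computes log-mel features with librosa. The sampling rate, hop size, and mel-band count are assumptions; ChatAnyone's actual audio frontend may differ.

```python
# Minimal log-mel feature extraction with librosa; the parameters below
# are illustrative, not ChatAnyone's actual configuration.
import librosa
import numpy as np

def extract_mel_features(wav_path: str, sr: int = 16000, n_mels: int = 80):
    """Load audio and return a (frames, n_mels) log-mel feature matrix."""
    audio, _ = librosa.load(wav_path, sr=sr)
    mel = librosa.feature.melspectrogram(
        y=audio, sr=sr, n_fft=1024, hop_length=160, n_mels=n_mels
    )  # hop of 160 samples = 10 ms frames at 16 kHz
    log_mel = librosa.power_to_db(mel, ref=np.max)
    return log_mel.T  # time-major: one row per audio frame

# Each 10 ms feature frame can then condition the motion model,
# which is what ties lip and expression motion to the speech signal.
```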
Image Generation & Stylization
- Portrait Generation: Creates dynamic portraits from a single photo, supporting stylized outputs such as cartoon rendering.
- Resolution Support: Handles resolutions up to 512×768, suitable for HD video output.
Real-Time Performance Optimization
- Hardware Acceleration: Achieves 30 fps on an RTX 4090 GPU.
- Model Optimization: Reduces latency through the layered design and efficient diffusion models, enabling real-time applications (see the frame-budget sketch after this list).
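For intuition, here is a quick back-of-envelope check of what the 30 fps claim implies per frame; the stage split is purely illustrative, not measured from ChatAnyone.

```python
# Back-of-envelope frame budget implied by the article's 30 fps figure.
TARGET_FPS = 30
frame_budget_ms = 1000.0 / TARGET_FPS  # ~33.3 ms per frame

# Every stage (audio features, motion denoising, rendering) must fit
# within this budget end to end. The split below is an assumption,
# chosen only to show how tight the per-stage constraints are.
stage_budget_ms = {"audio": 5.0, "motion": 15.0, "render": 13.0}
assert sum(stage_budget_ms.values()) <= frame_budget_ms
print(f"Per-frame budget: {frame_budget_ms:.1f} ms")
```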
Performance Comparison
For a detailed performance analysis, refer to the technical report linked below.
Official Resources
- Project Website: https://humanaigc.github.io/chat-anyone/
- Technical Paper: https://arxiv.org/pdf/2503.21144
(Note: a code release is still pending; at present, only the paper and project page describing the method are publicly available.)
Takeaways
ChatAnyone represents a significant leap forward in real-time digital human technology, offering creators powerful tools to produce immersive, interactive content. While questions remain about its full open-source status, the project underscores Alibaba’s commitment to advancing generative AI and pushing the boundaries of what’s possible in virtual storytelling.
Stay tuned for updates as the codebase evolves!