Hunyuan Announces HunyuanVideo-Avatar With Audio Support
HunyuanVideo-Avatar is capable of simultaneously generating dynamic, emotion-controllable, and multi-character dialogue videos.
Tencent’s Hunyuan has released a new model, HunyuanVideo-Avatar, which turns a photo into a talking video with audio support. You upload a photo and a voice clip, and the AI infers the context, emotion, and lip movements to create a realistic animated video.
It sounds a lot like what Google’s Veo 3 can do. The difference is that HunyuanVideo-Avatar is open-weights, so you can run it on your own machine if you have powerful enough hardware.
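Running it locally means driving it from the command line. The snippet below is a hypothetical sketch of what that might look like: the script name and every flag are placeholder assumptions, not the repo’s actual CLI, so check the project’s README for the real entry point.

```python
# Hypothetical local run. "inference.py" and all flags below are
# placeholder assumptions, not the project's actual CLI.
import subprocess

subprocess.run(
    [
        "python", "inference.py",        # assumed entry-point script
        "--image", "portrait.png",       # the photo you want to animate
        "--audio", "voice_clip.wav",     # the driving voice clip
        "--output", "avatar_video.mp4",  # where the result is written
    ],
    check=True,  # raise if the run fails
)
```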
Let’s explore the details of this new video model.
What Is HunyuanVideo-Avatar?
HunyuanVideo-Avatar is based on a multimodal diffusion transformer (MM-DiT) architecture capable of simultaneously generating dynamic, emotion-controllable, and multi-character dialogue videos.
It brings three major improvements:
Character Image Injection Module — Instead of the usual way of conditioning the model on character information, this approach avoids mismatches between training and real-world use. It keeps the character’s appearance consistent while still allowing natural, expressive movement.
Audio Emotion Module (AEM) — This module extracts emotional cues from a reference image and transfers them to the generated video, allowing precise, fine-grained control over the character’s emotional expression.
Face-Aware Audio Adapter (FAA) — This module isolates each character’s face at the latent level using a face mask, then injects a separate audio stream for each character via cross-attention, which makes scenes with multiple speakers work properly (see the sketch after this list).
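To make the FAA idea concrete, here is a minimal PyTorch sketch of masked audio cross-attention. The class name, tensor shapes, and the way the face mask gates the injected features are my assumptions for illustration; the paper’s actual implementation differs in detail.

```python
import torch
import torch.nn as nn

class FaceAwareAudioAttention(nn.Module):
    """Toy masked cross-attention: video tokens attend to one character's
    audio features, and a face mask restricts the injection to that
    character's face region. Shapes and wiring are assumptions."""

    def __init__(self, dim: int, audio_dim: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(
            embed_dim=dim, num_heads=heads,
            kdim=audio_dim, vdim=audio_dim,
            batch_first=True,
        )

    def forward(self, video_tokens, audio_feats, face_mask):
        # video_tokens: (B, N, dim)       latent video tokens
        # audio_feats:  (B, T, audio_dim) one character's audio features
        # face_mask:    (B, N) with 1s on tokens in that character's face
        attended, _ = self.attn(video_tokens, audio_feats, audio_feats)
        # Add the audio signal only where the mask says "this face".
        return video_tokens + face_mask.unsqueeze(-1) * attended
```

For a two-person scene you would call this once per character, each time with that character’s audio features and face mask, which is what keeps each voice attached to the right face.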
Here’s a high-level view of how these pieces fit together:
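The sketch below wires the three modules around a tiny transformer standing in for the MM-DiT backbone, reusing the `FaceAwareAudioAttention` class from above. Every dimension, projection, and the backbone itself are illustrative assumptions, not the released code.

```python
class AvatarDiTSketch(nn.Module):
    """Illustrative wiring of the three modules around a tiny transformer
    standing in for MM-DiT. All shapes and projections are assumptions."""

    def __init__(self, dim: int = 64, audio_dim: int = 32):
        super().__init__()
        self.char_proj = nn.Linear(dim, dim)   # character image injection
        self.emo_proj = nn.Linear(dim, dim)    # audio emotion module (AEM)
        self.faa = FaceAwareAudioAttention(dim, audio_dim)  # defined above
        self.backbone = nn.TransformerEncoder( # stand-in for the MM-DiT
            nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True),
            num_layers=2,
        )

    def forward(self, latents, char_feats, emo_feats, audio_per_char, masks):
        # latents:    (B, N, dim)  noisy video tokens
        # char_feats: (B, 1, dim)  encoded reference photo (identity)
        # emo_feats:  (B, 1, dim)  emotion cues from the reference image
        tokens = latents + self.char_proj(char_feats)  # keep identity stable
        tokens = tokens + self.emo_proj(emo_feats)     # set the emotion
        for audio, mask in zip(audio_per_char, masks): # one pass per speaker
            tokens = self.faa(tokens, audio, mask)     # FAA: voice -> face
        return self.backbone(tokens)                   # denoise the tokens
```

The per-speaker loop is the key point: audio never touches tokens outside that speaker’s face mask, which is what FAA appears designed to guarantee in multi-character scenes.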