Today we're announcing the open-source release of HunyuanVideo-Foley, our new end-to-end Text-Video-to-Audio (TV2A) framework for generating high-fidelity audio.🚀

This tool empowers creators in video production, filmmaking, and game development to generate professional-grade audio that precisely aligns with visual dynamics and semantic context, addressing key challenges in video-to-audio (V2A) generation.🔊

Key Innovations:

🔹Exceptional Generalization: Trained on a massive 100k-hour multimodal dataset, the model generates contextually aware soundscapes for a wide range of scenes, from natural landscapes to animated shorts.

🔹Balanced Multimodal Response: Our innovative multimodal diffusion transformer (MMDiT) architecture ensures the model balances video and text cues, generating rich, layered sound effects that capture every detail—from the main subject to subtle background elements.

🔹High-Fidelity Audio: Using a Representation Alignment (REPA) loss function and a powerful Audio VAE, we've improved generation stability and produce professional-grade audio, free of noise and inconsistencies.
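
For readers unfamiliar with representation alignment, here is a minimal sketch of the general idea, assuming the common formulation in which intermediate features of the generator are pulled toward features from a frozen pretrained encoder by maximizing cosine similarity. The function names and toy vectors are hypothetical and are not taken from the HunyuanVideo-Foley code; the technical report describes the loss actually used.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Small epsilon avoids division by zero for all-zero vectors.
    return dot / (norm_u * norm_v + 1e-8)


def repa_loss(projected_hidden, target_features):
    """Mean negative cosine similarity over frames.

    projected_hidden: per-frame vectors from an intermediate generator layer,
        assumed to be already projected into the target feature space
        (hypothetical input).
    target_features: per-frame vectors from a frozen pretrained audio encoder
        (hypothetical input).
    """
    if len(projected_hidden) != len(target_features):
        raise ValueError("both inputs must have the same number of frames")
    sims = [cosine_similarity(h, t)
            for h, t in zip(projected_hidden, target_features)]
    # Lower is better: -1 means every frame is perfectly aligned.
    return -sum(sims) / len(sims)


# Toy example: the first frame is aligned, the second is orthogonal.
hidden = [[1.0, 0.0], [0.0, 1.0]]
target = [[2.0, 0.0], [1.0, 0.0]]
loss = repa_loss(hidden, target)
```

In a training setup of this kind, a term like this would typically be added to the main generation loss with a weighting factor, so the alignment signal regularizes the model without replacing its primary objective.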

HunyuanVideo-Foley achieves SOTA on multiple benchmarks, surpassing all open-source models in audio quality, visual-semantic alignment, and temporal alignment.

👉Try it now: https://hunyuan.tencent.com/video/zh?tabIndex=0
🌐Project Page: https://szczesnys.github.io/hunyuanvideo-foley/
🔗Code: https://github.com/Tencent-Hunyuan/HunyuanVideo-Foley
📄Technical Report: https://arxiv.org/abs/2508.16930
🤗Hugging Face: https://huggingface.co/tencent/HunyuanVideo-Foley