FlashVideo: Flowing Fidelity to Detail for Efficient High-Resolution Video Generation
Abstract
DiT diffusion models have achieved great success in text-to-video generation, leveraging their scalability in model capacity and data scale. However, high content and motion fidelity aligned with text prompts often requires large model parameters and a substantial number of function evaluations (NFEs). Realistic and visually appealing detail is typically reflected in high-resolution output, which further amplifies computational demands, especially for single-stage DiT models. To address these challenges, we propose FlashVideo, a novel two-stage framework that strategically allocates model capacity and NFEs across stages to balance generation fidelity and quality. In the first stage, prompt fidelity is prioritized: a low-resolution generation process uses large model parameters and sufficient NFEs while remaining computationally efficient. The second stage establishes flow matching between low and high resolutions, effectively generating fine detail with minimal NFEs. Quantitative and visual results demonstrate that FlashVideo achieves state-of-the-art high-resolution video generation with superior computational efficiency. Additionally, the two-stage design lets users preview the initial output before committing to full-resolution generation, significantly reducing computational cost and wait time and enhancing commercial viability.
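The two-stage pipeline described in the abstract can be sketched as a toy example. Everything below is hypothetical: the "target" and "detail" tensors are random stand-ins for the learned text-conditioned networks, and the function names are invented for illustration. Only the control flow mirrors the paper's design: many NFEs at low resolution for fidelity, then a few straight-path flow-matching steps that carry the upsampled low-resolution video toward a high-resolution result.

```python
import numpy as np

def stage1_low_res(prompt_seed, shape=(8, 32, 32), nfe=50):
    """Stage 1 (hypothetical): spend many NFEs at low resolution for prompt fidelity.

    A toy 'denoiser' nudges random noise toward a prompt-seeded target; in the
    real method this would be a large text-conditioned DiT.
    """
    rng = np.random.default_rng(prompt_seed)
    target = rng.standard_normal(shape)   # stand-in for the text-conditioned mode
    x = rng.standard_normal(shape)        # initial noise
    for _ in range(nfe):
        x = x + (target - x) / nfe        # crude Euler steps toward the target
    return x

def upsample(video, factor=2):
    """Nearest-neighbor spatial upsampling of a (T, H, W) array."""
    return video.repeat(factor, axis=1).repeat(factor, axis=2)

def stage2_flow_matching(low_res, nfe=4, detail_scale=0.1, seed=0):
    """Stage 2 (hypothetical): flow matching from low to high resolution.

    The upsampled low-resolution video is the starting point x_0, and a few
    Euler steps integrate a straight-path (rectified-flow style) velocity field
    toward the high-resolution endpoint x_1, so very few NFEs are needed.
    """
    rng = np.random.default_rng(seed)
    x = upsample(low_res)
    detail = detail_scale * rng.standard_normal(x.shape)  # stand-in for learned detail
    x1 = x + detail                                       # hypothetical high-res endpoint
    dt = 1.0 / nfe
    for step in range(nfe):
        t = step * dt
        v = (x1 - x) / (1.0 - t)          # straight-path velocity toward x_1
        x = x + dt * v                    # Euler integration step
    return x

low = stage1_low_res(prompt_seed=0)       # stage 1: 50 NFEs at 32x32
high = stage2_flow_matching(low, nfe=4)   # stage 2: only 4 NFEs at 64x64
```

With a straight-path velocity, exact Euler integration reaches the endpoint in the final step, which is why rectified-flow trajectories tolerate so few NFEs; the real second stage would predict the velocity with a network rather than from a known `x1`.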
Community
project page: https://jshilong.github.io/flashvideo-page/
model & code: https://github.com/FoundationVision/FlashVideo
The following papers were recommended by the Semantic Scholar API
- CascadeV: An Implementation of Wurstchen Architecture for Video Generation (2025)
- FLowHigh: Towards Efficient and High-Quality Audio Super-Resolution with Single-Step Flow Matching (2025)
- Vchitect-2.0: Parallel Transformer for Scaling Up Video Diffusion Models (2025)
- Tuning-Free Long Video Generation via Global-Local Collaborative Diffusion (2025)
- RelightVid: Temporal-Consistent Diffusion Model for Video Relighting (2025)
- DiffVSR: Enhancing Real-World Video Super-Resolution with Diffusion Models for Advanced Visual Quality and Temporal Consistency (2025)
- Enhancing Image Generation Fidelity via Progressive Prompts (2025)