PS-SR: Pseudo-Single-Step Video Super-Resolution via Speculative Diffusion

1 University of Science and Technology of China     2 HiDream.ai Inc    

Abstract

Video Super-Resolution (VSR) faces a fundamental trade-off: single-step models are highly efficient but often lack the high-frequency detail, creativity, and visual quality of their multi-step diffusion counterparts, which in turn are computationally prohibitive for practical use. In this paper, we propose PS-SR, a novel "pseudo" single-step VSR framework that overcomes this trade-off through a computationally asymmetric sampling pipeline. The key to PS-SR is its speculative diffusion mechanism: a powerful base model performs only a single, comprehensive sampling step that establishes the global structure and content fidelity, after which a lightweight draft model, directly augmented by the base model's features, speculatively performs the subsequent refinements. Crucially, we further enforce a frequency-domain update rule that constrains these refinements to inject only high-frequency details, preserving the foundational low-frequency content and preventing semantic drift across sampling steps. In this way, PS-SR creates the "illusion" of a single-step model, delivering similar inference speed and input-output content consistency, while achieving the visual richness and creativity typically reserved for costly multi-step generative models. We demonstrate that our "pseudo-single-step" paradigm achieves state-of-the-art quality at a speed comparable to single-step models, paving the way for real-time, high-fidelity video enhancement.

Overview

Framework

Given a low-quality video input \( \mathbf{x}_L \), we first encode it into a latent representation \( \mathbf{z}_L \). Our computationally asymmetric sampling pipeline then consists of three stages: (1) Base Model Execution: A powerful base model \( \phi_{\text{base}} \) performs a single, comprehensive denoising step, transforming \( \mathbf{z}_L \) into an intermediate latent \( \mathbf{z}_{T-1} \) that establishes the global structure and content. (2) Draft Model Refinement: The latent \( \mathbf{z}_{T-1} \) is then iteratively refined over multiple steps by a lightweight draft model \( \phi_{\text{draft}} \). Crucially, each draft model prediction is guided by features inherited from the base model. (3) Frequency-Domain Update: After each draft step, the prediction is converted to pixel space and our frequency-domain update rule is applied: it preserves the low-frequency content from the previous step and adaptively blends in only the high-frequency components of the new prediction.
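The frequency-domain update in stage (3) can be illustrated with a minimal NumPy sketch. It is not the actual PS-SR implementation. The function names (`lowpass_mask`, `frequency_update`, `refine`), the hard circular low-pass mask, the `cutoff` value, and the fixed blending weight `alpha` are all assumptions made for illustration; the description above only states that the blending is adaptive and does not specify the filter or the weights. The hypothetical `draft_step` callable stands in for one draft-model step followed by decoding to pixel space.

```python
import numpy as np

def lowpass_mask(height, width, cutoff=0.1):
    """True where the spatial frequency radius (cycles/pixel) is <= cutoff.
    The hard circular mask and the cutoff value are illustrative assumptions."""
    fy = np.fft.fftfreq(height)[:, None]
    fx = np.fft.fftfreq(width)[None, :]
    return np.sqrt(fx ** 2 + fy ** 2) <= cutoff

def frequency_update(prev, new, cutoff=0.1, alpha=0.5):
    """Keep the low frequencies of `prev`; blend the high frequencies of
    `prev` and `new` with a fixed weight `alpha` (assumed, not adaptive).
    Inputs are real arrays shaped (..., H, W), e.g. (frames, H, W)."""
    f_prev = np.fft.fft2(prev, axes=(-2, -1))
    f_new = np.fft.fft2(new, axes=(-2, -1))
    low = lowpass_mask(prev.shape[-2], prev.shape[-1], cutoff)
    high_blend = (1.0 - alpha) * f_prev + alpha * f_new
    f_out = np.where(low, f_prev, high_blend)  # low band always comes from prev
    return np.fft.ifft2(f_out, axes=(-2, -1)).real

def refine(x_base, draft_step, num_steps=3, cutoff=0.1, alpha=0.5):
    """Apply a hypothetical pixel-space `draft_step` repeatedly, filtering
    each new prediction through the frequency-domain update."""
    x = x_base
    for _ in range(num_steps):
        x = frequency_update(x, draft_step(x), cutoff, alpha)
    return x
```

Under these assumptions, a draft prediction that only shifts overall brightness (a purely low-frequency change) is rejected entirely, for example `refine(x, lambda v: v + 1.0)` returns `x` up to floating-point error, which is the kind of drift the update rule is meant to prevent.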

Video Examples

[12 side-by-side video comparisons, each pairing a low-quality Input with the corresponding PS-SR Result]