PS-SR: Pseudo-Single-Step Video Super-Resolution via Speculative Diffusion

Aiqiu Wu¹, Zhaofan Qiu², Ting Yao², Tao Mei²

¹ University of Science and Technology of China, ² HiDream.ai Inc

Abstract

Video Super-Resolution (VSR) faces a critical trade-off: single-step models offer unmatched efficiency but often lack the high-frequency detail, creativity, and visual quality of their multi-step diffusion counterparts, which in turn are computationally prohibitive for practical use. In this paper, we propose PS-SR, a novel "pseudo" single-step VSR framework that overcomes this trade-off through a computationally asymmetric sampling pipeline. The key to PS-SR is its speculative diffusion mechanism: a powerful base model performs only a single, comprehensive sampling step that establishes the global structure and content fidelity, after which a lightweight draft model, directly augmented by the base model's features, speculatively performs the subsequent refinements. Crucially, we further enforce a frequency-domain update rule that constrains these refinements to inject only high-frequency details, preserving the foundational low-frequency content and preventing semantic drift across sampling steps. As a result, PS-SR creates the "illusion" of a single-step model, delivering similar inference speed and input-output content consistency while achieving the visual richness and creativity typically reserved for costly multi-step generative models. We demonstrate that our "pseudo-single-step" paradigm achieves state-of-the-art quality at a speed comparable to single-step models, paving the way for real-time, high-fidelity video enhancement.

Overview

Given a low-quality input video \( \mathbf{x}_L \), we first encode it into a latent representation \( \mathbf{z}_L \). Our computationally asymmetric sampling pipeline then consists of three stages: (1) Base Model Execution: A powerful base model \( \phi_{\text{base}} \) performs a single, comprehensive denoising step, transforming \( \mathbf{z}_L \) into an intermediate latent \( \mathbf{z}_{T-1} \) that establishes the global structure and content. (2) Draft Model Refinement: The latent \( \mathbf{z}_{T-1} \) is then iteratively refined over multiple steps by a lightweight draft model \( \phi_{\text{draft}} \). Crucially, each draft model prediction is guided by features inherited from the base model. (3) Frequency-Domain Update: After each draft step, the prediction is converted to pixel space and our frequency-domain update rule is applied: it preserves the low-frequency content from the previous step while adaptively blending in only the high-frequency components of the new prediction.
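The three stages above can be illustrated with a minimal NumPy sketch. It is not the PS-SR implementation: the encoder and decoder are omitted so everything operates directly on pixel arrays, the ideal circular low-pass mask, the `cutoff` value, and the scalar blend weight `alpha` are assumptions standing in for the adaptive rule, and `base_step` / `draft_step` are hypothetical callables standing in for \( \phi_{\text{base}} \) and \( \phi_{\text{draft}} \).

```python
import numpy as np

def lowpass_mask(h, w, cutoff=0.25):
    """Ideal circular low-pass mask in unshifted FFT layout.

    `cutoff` is a radius in cycles per pixel (an assumed, illustrative value).
    """
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    return (np.sqrt(fx ** 2 + fy ** 2) <= cutoff).astype(np.float64)

def frequency_update(prev, new, cutoff=0.25, alpha=1.0):
    """Keep the low frequencies of `prev`; blend high frequencies toward `new`.

    `alpha` is a fixed scalar here; an adaptive weight is not modelled.
    Arrays are shaped (..., H, W).
    """
    h, w = prev.shape[-2:]
    low = lowpass_mask(h, w, cutoff)
    high = 1.0 - low
    f_prev = np.fft.fft2(prev, axes=(-2, -1))
    f_new = np.fft.fft2(new, axes=(-2, -1))
    f_out = low * f_prev + high * ((1.0 - alpha) * f_prev + alpha * f_new)
    # The mask is radially symmetric, so the inverse transform of real inputs
    # is real up to floating-point error.
    return np.real(np.fft.ifft2(f_out, axes=(-2, -1)))

def ps_sr_sample(x_lq, base_step, draft_step, num_draft_steps=3,
                 cutoff=0.25, alpha=1.0):
    """Asymmetric sampling loop: one base step, then several draft steps."""
    # (1) A single step of the (hypothetical) base model, which also
    #     returns features for the draft model to reuse.
    x, feats = base_step(x_lq)
    for _ in range(num_draft_steps):
        # (2) Lightweight refinement guided by the base model's features.
        pred = draft_step(x, feats)
        # (3) Accept only the high-frequency part of the draft prediction.
        x = frequency_update(x, pred, cutoff=cutoff, alpha=alpha)
    return x
```

Under these assumptions, any change a draft step makes below the cutoff, such as a global brightness shift, is discarded by the update, which is the sense in which the rule guards against drift in low-frequency content.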

Video Examples

Input / Result (12 side-by-side video comparisons)