PS-SR: Pseudo-Single-Step Video Super-Resolution via Speculative Diffusion
Abstract
Video Super-Resolution (VSR) faces a fundamental trade-off: single-step models are highly efficient but often lack the high-frequency detail, creativity, and visual quality of their multi-step diffusion counterparts, which in turn are computationally prohibitive for practical use. In this paper, we propose PS-SR, a novel "pseudo" single-step VSR framework that overcomes this trade-off through a computationally asymmetric sampling pipeline. The key to PS-SR is its speculative diffusion mechanism: a powerful base model performs only a single, comprehensive sampling step that establishes the global structure and content fidelity, after which a lightweight draft model, directly augmented by the base model's features, speculatively performs the subsequent refinements. Crucially, we further enforce a frequency-domain update rule that constrains these refinements to inject only high-frequency details, preserving the foundational low-frequency content and preventing semantic drift across sampling steps. In this way, PS-SR creates the "illusion" of a single-step model, delivering similar inference speed and input-output content consistency while achieving the visual richness and creativity typically reserved for costly multi-step generative models. We demonstrate that our "pseudo-single-step" paradigm achieves state-of-the-art quality at a speed comparable to that of single-step models, paving the way for real-time, high-fidelity video enhancement.
Overview
Given a low-quality video input \( \mathbf{x}_L \), we first encode it into a latent representation \( \mathbf{z}_L \). Our computationally asymmetric sampling pipeline then consists of three stages: (1) Base Model Execution: A powerful base model \( \phi_{\text{base}} \) performs a single, comprehensive denoising step, transforming \( \mathbf{z}_L \) into an intermediate latent \( \mathbf{z}_{T-1} \) that establishes the global structure and content. (2) Draft Model Refinement: The latent \( \mathbf{z}_{T-1} \) is then iteratively refined over multiple steps by a lightweight draft model \( \phi_{\text{draft}} \). Crucially, each draft model prediction is guided by features inherited from the base model. (3) Frequency-Domain Update: After each draft step, the prediction is converted to pixel space and our frequency-domain update rule is applied: it preserves the low-frequency content of the previous step while adaptively blending in only the high-frequency components of the new prediction.
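To make the frequency-domain update in stage (3) concrete, the following is a minimal NumPy sketch, not the paper's implementation. It assumes an ideal radial low-pass mask with a fixed `cutoff` and a scalar blending weight `alpha`; both are hypothetical stand-ins, since the actual filter and the adaptive blending scheme are not specified here. Under these assumptions, with \( M \) the low-pass mask and \( \mathcal{F} \) the 2D Fourier transform, the sketch computes \( \mathcal{F}(\mathbf{x}_{\text{next}}) = M \odot \mathcal{F}(\mathbf{x}_{\text{prev}}) + (1 - M) \odot \left[ (1 - \alpha)\, \mathcal{F}(\mathbf{x}_{\text{prev}}) + \alpha\, \mathcal{F}(\mathbf{x}_{\text{pred}}) \right] \).

```python
import numpy as np


def lowpass_mask(h, w, cutoff=0.25):
    """Ideal radial low-pass mask in FFT layout.

    `cutoff` is in cycles per pixel; the hard radial cutoff is an
    illustrative assumption, not the filter used by PS-SR.
    """
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    return (np.sqrt(fx**2 + fy**2) <= cutoff).astype(np.float64)


def frequency_update(x_prev, x_pred, alpha=0.5, cutoff=0.25):
    """Keep the low band of x_prev and blend in the high band of x_pred.

    x_prev, x_pred: real arrays of shape (..., H, W) in pixel space.
    alpha: scalar high-frequency blending weight (hypothetical; the
           paper describes the blending as adaptive).
    """
    h, w = x_prev.shape[-2:]
    mask = lowpass_mask(h, w, cutoff)

    f_prev = np.fft.fft2(x_prev, axes=(-2, -1))
    f_pred = np.fft.fft2(x_pred, axes=(-2, -1))

    # Low frequencies come only from the previous step.
    low = mask * f_prev
    # High frequencies are a convex blend of previous and new prediction.
    high = (1.0 - mask) * ((1.0 - alpha) * f_prev + alpha * f_pred)

    # The mask is symmetric, so the result is real up to rounding error.
    return np.fft.ifft2(low + high, axes=(-2, -1)).real
```

Calling `frequency_update(x_prev, x_pred, alpha=0.0)` returns `x_prev` unchanged, while `alpha=1.0` replaces the whole high band with that of the new prediction. In both cases the low band of `x_prev` is untouched, which is the property the update rule relies on to prevent semantic drift.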