RefVSR++: A High-Precision Video Super-Resolution Method Utilizing Reference Inputs

This study proposes RefVSR++, a novel method for reference-based video super-resolution (VSR) that leverages the multi-camera systems found in modern smartphones to restore low-resolution videos to high resolution. Traditional VSR enhances resolution by exploiting temporal information within a single low-resolution (LR) video stream. Multi-camera setups, in contrast, capture the same scene simultaneously with different fields of view (FoVs), making it possible to use a higher-resolution reference (Ref) video as auxiliary input for more accurate super-resolution.

This study points out two key limitations of the existing RefVSR method. First, RefVSR fuses and propagates LR and Ref features within a single stream, even though cameras with different FoVs capture different content. While overlapping regions provide high-quality information, non-overlapping regions are prone to misalignment and missing details, degrading output quality. Second, RefVSR heuristically propagates a confidence map to maintain feature consistency, but this approach accumulates errors over time, which degrades performance.

RefVSR++ addresses these issues with two major improvements. First, it introduces a dual-stream architecture where Ref and LR features are aggregated and propagated independently over time. The Ref stream accumulates high-frequency details (as residuals), while the SR stream fuses LR and Ref features to generate the final super-resolved images. This design prevents information degradation caused by mixed feature propagation and enables more effective use of high-frequency cues.
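
The following is a minimal PyTorch sketch of this dual-stream propagation idea. The module structure and names (DualStreamPropagation, ref_cell, sr_cell) are illustrative assumptions, not the authors' implementation, and the per-frame features are assumed to be already aligned to the current frame.

import torch
import torch.nn as nn

class DualStreamPropagation(nn.Module):
    """Toy two-stream recurrent propagation over per-frame features."""

    def __init__(self, channels: int = 64):
        super().__init__()
        # Ref stream: accumulates high-frequency residuals over time.
        self.ref_cell = nn.Sequential(
            nn.Conv2d(2 * channels, channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
        )
        # SR stream: fuses LR features with the Ref stream's state.
        self.sr_cell = nn.Sequential(
            nn.Conv2d(3 * channels, channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, lr_feats, ref_feats):
        # lr_feats, ref_feats: lists of (B, C, H, W) tensors, one per frame,
        # assumed here to be already aligned to the current frame.
        ref_state = torch.zeros_like(lr_feats[0])
        sr_state = torch.zeros_like(lr_feats[0])
        outputs = []
        for lr_t, ref_t in zip(lr_feats, ref_feats):
            # Propagate only the residual (high-frequency) part of the Ref
            # features, discarding low-frequency content shared with LR.
            ref_state = self.ref_cell(torch.cat([ref_t - lr_t, ref_state], 1))
            # Fuse the current LR features with both recurrent states.
            sr_state = self.sr_cell(torch.cat([lr_t, ref_state, sr_state], 1))
            outputs.append(sr_state)
        return outputs

Propagating the residual ref_t - lr_t rather than the raw Ref features mirrors the finding of the ablation study described below: removing redundant low-frequency information improves cooperation between the two streams.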

Second, RefVSR++ incorporates deformable convolutional networks (DCNs) and patch matching to handle both temporal and cross-FoV misalignments. In particular, the Ref stream uses patch matching to accurately align LR and Ref features and integrates them through confidence-based fusion. It also eliminates the heuristic confidence-map propagation of prior work, recomputing alignment confidence at each time step and thereby reducing accumulated errors.
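
To make the alignment step concrete, here is a minimal sketch of patch matching with per-frame confidence, simplified to 1x1 patches (per-pixel matching). The function match_and_fuse and its confidence weighting are illustrative assumptions rather than the paper's exact formulation.

import torch
import torch.nn.functional as F

def match_and_fuse(lr_feat: torch.Tensor, ref_feat: torch.Tensor) -> torch.Tensor:
    """Align Ref features to LR features by nearest-neighbor matching,
    then fuse them weighted by a freshly computed confidence map.
    lr_feat, ref_feat: (B, C, H, W)."""
    b, c, h, w = lr_feat.shape
    q = F.normalize(lr_feat.flatten(2), dim=1)   # (B, C, H*W) LR queries
    k = F.normalize(ref_feat.flatten(2), dim=1)  # (B, C, H*W) Ref keys
    # Global cosine-similarity matching; a practical implementation would
    # restrict the search to local windows to save memory.
    sim = torch.bmm(q.transpose(1, 2), k)        # (B, H*W, H*W)
    conf, idx = sim.max(dim=2)                   # best Ref match per LR position
    # Warp the Ref features by gathering each position's best match.
    idx = idx.unsqueeze(1).expand(-1, c, -1)     # (B, C, H*W)
    aligned_ref = ref_feat.flatten(2).gather(2, idx).view(b, c, h, w)
    # Confidence is recomputed from the similarities at every time step,
    # so no heuristic map is carried over from previous frames.
    conf = conf.view(b, 1, h, w).clamp(min=0.0)
    return conf * aligned_ref + (1.0 - conf) * lr_feat

Recomputing conf from the current frame's similarities, instead of propagating it, is what avoids the error accumulation observed in the original RefVSR.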

Experiments were conducted on the RealMCVSR dataset, which consists of multi-camera videos captured with an iPhone 12 Pro Max. RefVSR++ was compared against state-of-the-art methods including RCAN, BasicVSR++, DCSR, ERVSR, and RefVSR. It achieved a PSNR improvement of up to 1.5 dB and performed well even in non-overlapping FoV regions. Evaluations on the Inter4K dataset, a collection of real-world 4K videos from the internet, further confirmed the method's generalization capability.

An ablation study was also conducted to quantify the contribution of each component. Variants with and without the SR stream, Ref stream, and residual propagation were compared. Results clearly demonstrated the importance of each module. Notably, propagating Ref residuals proved effective in removing redundant low-frequency information and enhancing cooperation with the SR stream.

Despite its strengths, RefVSR++ has a limitation: its dual-stream architecture consumes substantial GPU memory (approximately 19 GB). Future work may focus on more efficient architectures, lower-precision computation, or selective use of reference frames to reduce computational cost while maintaining performance.

In conclusion, RefVSR++ makes a meaningful contribution to the field of video super-resolution in multi-camera environments. By effectively leveraging both spatial and temporal cues from LR and Ref inputs, the method demonstrates both architectural and theoretical advancements. As demand for high-resolution video content grows, RefVSR++ holds strong potential for practical applications, particularly in maximizing the imaging capabilities of smartphone cameras.

Publication

Zou, Han, Masanori Suganuma, and Takayuki Okatani. “RefVSR++: Exploiting Reference Inputs for Reference-based Video Super-resolution.” Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 2025.

@inproceedings{zou2025refvsr++,
  title={RefVSR++: Exploiting Reference Inputs for Reference-based Video Super-resolution},
  author={Zou, Han and Suganuma, Masanori and Okatani, Takayuki},
  booktitle={Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision},
  year={2025}
}