In June of this year, the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), recognized as one of the top international academic conferences, witnessed the remarkable achievements of Professor Lin Wang’s team. A total of four papers on which Professor Wang served as the corresponding author were accepted, including one selected as a CVPR “highlight”, leaving a lasting impression with their groundbreaking contributions.
CVPR, recognized as one of the highest-level international academic conferences in the field of artificial intelligence and computer vision, is also recommended by the China Computer Federation (CCF) as an A-level conference. With an exceptional h-index and extensive influence, the conference is held annually in June. This year, CVPR took place in Vancouver, Canada, from June 18 to 22.
Among the numerous outstanding research works, Professor Wang’s team made a groundbreaking contribution with their paper titled “Patch-Mix Transformer for Unsupervised Domain Adaptation: A Game Perspective.” Recognized as a standout highlight, this paper underwent a rigorous selection process by CVPR 2023’s program committee, with only approximately 2.5% of all submitted papers receiving this honor.
This notable paper has garnered special attention and recognition on CVPR 2023’s online conference platform and will also be showcased during the poster session.
Patch-Mix Transformer for Unsupervised Domain Adaptation: A Game Perspective
Abstract: Endeavors have been recently made to leverage the vision transformer (ViT) for the challenging unsupervised domain adaptation (UDA) task. They typically adopt cross-attention in ViT for direct domain alignment. However, as the performance of cross-attention highly relies on the quality of pseudo labels for target samples, it becomes less effective when the domain gap becomes large. We solve this problem from a game theory perspective with the proposed model dubbed PMTrans, which bridges source and target domains with an intermediate domain. Specifically, we propose a novel ViT-based module called PatchMix that effectively builds up the intermediate domain, i.e., probability distribution, by learning to sample patches from both domains based on game-theoretical models. This way, it learns to mix the patches from the source and target domains to maximize the cross entropy (CE), while exploiting two semi-supervised mixup losses in the feature and label spaces to minimize it. As such, we interpret the process of UDA as a min-max CE game with three players, including the feature extractor, the classifier, and PatchMix, to find the Nash Equilibria. Moreover, we leverage attention maps from ViT to re-weight the label of each patch by its importance, making it possible to obtain more domain-discriminative feature representations. We conduct extensive experiments on four benchmark datasets, and the results show that PMTrans significantly surpasses the ViT-based and CNN-based SoTA methods by +3.6% on Office-Home, +1.4% on Office-31, and +7.2% on DomainNet, respectively.
Learn more at: https://vlis2022.github.io/cvpr23/PMTrans
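As a rough illustration of the patch-mixing idea described in the abstract, the sketch below mixes patch-sized blocks from a source and a target image according to a sampled ratio and derives a soft source-label weight from the actual mix. This is not the authors’ implementation: the patch size, the Beta sampling of the ratio, and the label weighting here are simplified assumptions, whereas PMTrans learns the sampling via game-theoretical models and re-weights patch labels by ViT attention.

```python
import numpy as np

def patch_mix(src_img, tgt_img, patch=4, lam=None, rng=None):
    """Mix patches from a source and a target image (simplified sketch).

    Each patch-sized block is taken from the source with probability lam,
    otherwise from the target; the returned soft-label weight equals the
    actual fraction of source patches used.
    """
    rng = rng or np.random.default_rng(0)
    if lam is None:
        lam = rng.beta(1.0, 1.0)  # mixing ratio, sampled per image
    h, w = src_img.shape[:2]
    mixed = tgt_img.copy()
    # Patch-wise Bernoulli choice between the two domains.
    mask = rng.random((h // patch, w // patch)) < lam
    for i in range(h // patch):
        for j in range(w // patch):
            if mask[i, j]:
                ys, xs = i * patch, j * patch
                mixed[ys:ys + patch, xs:xs + patch] = \
                    src_img[ys:ys + patch, xs:xs + patch]
    src_weight = mask.mean()  # soft-label weight for the source class
    return mixed, src_weight
```

In the paper’s min-max CE game, such mixed samples define the intermediate domain that the feature extractor and classifier are trained against.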
Learning Spatial-Temporal Implicit Neural Representations for Event-Guided Video Super-Resolution
Abstract: Event cameras sense the intensity changes asynchronously and produce event streams with high dynamic range and low latency. This has inspired research endeavors utilizing events to guide the challenging video super-resolution (VSR) task. In this paper, we make the first attempt to address a novel problem of achieving VSR at random scales by taking advantage of the high temporal resolution property of events. This is hampered by the difficulties of representing the spatial-temporal information of events when guiding VSR. To this end, we propose a novel framework that incorporates the spatial-temporal interpolation of events into VSR in a unified framework. Our key idea is to learn implicit neural representations from queried spatial-temporal coordinates and features from both RGB frames and events. Our method contains three parts. Specifically, the Spatial-Temporal Fusion (STF) module first learns the 3D features from events and RGB frames. Then, the Temporal Filter (TF) module unlocks more explicit motion information from the events near the queried timestamp and generates the 2D features. Lastly, the Spatial-Temporal Implicit Representation (STIR) module recovers the SR frame at arbitrary resolutions from the outputs of these two modules. In addition, we collect a real-world dataset with spatially aligned events and RGB frames. Extensive experiments show that our method significantly surpasses the prior arts and achieves VSR with random scales, e.g., 6.5.
Learn more at: https://vlis2022.github.io/cvpr23/egvsr
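The coordinate-query mechanism behind the implicit representation can be sketched as follows: an MLP decoder is evaluated at a dense grid of (x, y, t) coordinates, so the same features can be rendered at any output resolution. This is only a toy illustration of the interface; the class and function names, the random (untrained) weights, and the feature shapes are all assumptions, not the STIR module itself.

```python
import numpy as np

def make_coord_grid(h, w, t):
    """Dense (x, y, t) query coordinates in [-1, 1] for an arbitrary
    output resolution h x w at a normalized timestamp t."""
    ys = np.linspace(-1, 1, h)
    xs = np.linspace(-1, 1, w)
    gy, gx = np.meshgrid(ys, xs, indexing="ij")
    return np.stack([gx, gy, np.full_like(gx, t)], axis=-1)  # (h, w, 3)

class TinyImplicitMLP:
    """Toy MLP mapping (coordinate, feature) -> RGB. Random weights
    stand in for a trained implicit decoder."""
    def __init__(self, feat_dim=16, hidden=32, seed=0):
        rng = np.random.default_rng(seed)
        self.w1 = rng.normal(0, 0.1, (3 + feat_dim, hidden))
        self.w2 = rng.normal(0, 0.1, (hidden, 3))

    def __call__(self, coords, feats):
        x = np.concatenate([coords, feats], axis=-1)
        h = np.maximum(x @ self.w1, 0.0)  # ReLU hidden layer
        return h @ self.w2                # one RGB value per query

# Querying at two different resolutions, including a non-integer scale:
feat_dim = 16
mlp = TinyImplicitMLP(feat_dim)
rgb_lo = mlp(make_coord_grid(8, 8, t=0.5), np.zeros((8, 8, feat_dim)))
rgb_hi = mlp(make_coord_grid(13, 13, t=0.5), np.zeros((13, 13, feat_dim)))
```

Because the decoder is queried per coordinate, nothing ties the output to a fixed upsampling factor, which is what enables VSR at random scales.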
HRDFuse: Monocular 360° Depth Estimation by Collaboratively Learning Holistic-with-Regional Depth Distributions
Abstract: Depth estimation from a monocular 360° image is a burgeoning problem owing to its holistic sensing of a scene. Recently, some methods, e.g., OmniFusion, have applied the tangent projection (TP) to represent a 360° image and predicted depth values via patch-wise regressions, which are merged to get a depth map with equirectangular projection (ERP) format. However, these methods suffer from 1) a non-trivial process of merging plenty of patches; 2) capturing less holistic-with-regional contextual information by directly regressing the depth value of each pixel. In this paper, we propose a novel framework, HRDFuse, that subtly combines the potential of convolutional neural networks (CNNs) and transformers by collaboratively learning the holistic contextual information from the ERP and the regional structural information from the TP. Firstly, we propose a spatial feature alignment (SFA) module that learns feature similarities between the TP and ERP to aggregate the TP features into a complete ERP feature map in a pixel-wise manner. Secondly, we propose a collaborative depth distribution classification (CDDC) module that learns the holistic-with-regional histograms capturing the ERP and TP depth distributions. As such, the final depth values can be predicted as a linear combination of histogram bin centers. Lastly, we adaptively combine the depth predictions from ERP and TP to obtain the final depth map. Extensive experiments show that our method predicts smoother and more accurate depth results while achieving favorably better results than the SOTA methods.
Learn more at: https://vlis2022.github.io/HRDFuse/
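The abstract’s “linear combination of histogram bin centers” step can be made concrete with a minimal sketch: per-pixel scores over depth bins are turned into probabilities, and depth is their weighted average of bin centers. The function name and shapes are illustrative assumptions, not the CDDC module’s actual interface.

```python
import numpy as np

def depth_from_bins(logits, bin_centers):
    """Predict per-pixel depth as a softmax-weighted combination of
    histogram bin centers.

    logits: (H, W, K) per-pixel scores over K depth bins
    bin_centers: (K,) depth value at the center of each bin
    """
    z = logits - logits.max(axis=-1, keepdims=True)  # stable softmax
    p = np.exp(z)
    p /= p.sum(axis=-1, keepdims=True)
    return p @ bin_centers  # (H, W) continuous depth map
```

Treating depth as classification over bins, then taking the soft expectation, keeps the output continuous while letting the network reason about depth distributions.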
Both Style and Distortion Matter: Dual-Path Unsupervised Domain Adaptation for Panoramic Semantic Segmentation
Abstract: The ability of scene understanding has sparked active research for panoramic image semantic segmentation. However, the performance is hampered by distortion of the equirectangular projection (ERP) and a lack of pixel-wise annotations. For this reason, some works treat the ERP and pinhole images equally and transfer knowledge from the pinhole to ERP images via unsupervised domain adaptation (UDA). However, they fail to handle the domain gaps caused by: 1) the inherent differences between camera sensors and captured scenes; 2) the distinct image formats (e.g., ERP and pinhole images). In this paper, we propose a novel yet flexible dual-path UDA framework, DPPASS, taking ERP and tangent projection (TP) images as inputs. To reduce the domain gaps, we propose cross-projection and intra-projection training. The cross-projection training includes tangent-wise feature contrastive training and prediction consistency training. That is, the former formulates the features with the same projection locations as positive examples and vice versa, for the models’ awareness of distortion, while the latter ensures the consistency of cross-model predictions between the ERP and TP. Moreover, adversarial intra-projection training is proposed to reduce the inherent gaps between the features of the pinhole images and those of the ERP and TP images, respectively. Importantly, the TP path can be freely removed after training, leading to no additional inference cost. Extensive experiments on two benchmarks show that our DPPASS achieves a +1.06% mIoU improvement over the state-of-the-art approaches.
Learn more at: https://vlis2022.github.io/cvpr23/DPPASS
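The tangent-wise contrastive idea, where features at the same projection location form positive pairs and all other locations act as negatives, can be sketched as a standard InfoNCE-style loss. This is a generic illustration under simplifying assumptions (flattened per-location features, a single temperature), not DPPASS’s actual loss implementation.

```python
import numpy as np

def location_contrastive_loss(f_erp, f_tp, temperature=0.1):
    """Toy InfoNCE over projection locations: the ERP feature and the TP
    feature at the same location are a positive pair; features at other
    locations serve as negatives.

    f_erp, f_tp: (N, D) L2-normalized features at N shared locations
    """
    logits = f_erp @ f_tp.T / temperature        # (N, N) similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # stability
    log_p = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_p))  # match the i-th ERP to the i-th TP
```

Minimizing this pulls ERP and TP features at the same location together despite their different distortions, which is the distortion-awareness signal the abstract describes.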