Friday, April 14, 2023
Abstract:
Cross-modal generation is an important task under the generative AI umbrella, within which I have focused on visual-to-text and text-to-visual generation. Translating semantic information freely across modalities poses two key challenges: (1) handling unrecognizable visual instances, and (2) generating controllable, complex content with high quality. In this talk, solutions from several viewpoints will be introduced. First, approaches to unsupervised language structure inference and to uncovering domain-specific concepts will be discussed, which enhance the performance of visual-to-text generation models. Then, to simultaneously achieve high-fidelity visual generation and cross-modal semantic matching, inversion and online alignment frameworks will be presented. These research findings have been validated in various scenarios and show promise for advancing domains such as game development and health care.
Speaker Bio:
Mr. Hao WANG
PhD Candidate (final year), School of Computer Science and Engineering
Nanyang Technological University, Singapore
Hao WANG is a final-year PhD candidate in the School of Computer Science and Engineering at Nanyang Technological University, Singapore. He received his B.E. degree from Huazhong University of Science and Technology. His research focuses on developing AI-powered perception and generation algorithms for the multimodal domain. In particular, his recent work investigates translation between visual and textual data to generate controllable content with efficiency and robustness. He has published first-authored work at top-tier conferences and journals in the computer vision and multimedia fields, including CVPR, ECCV, ACM MM, IEEE TPAMI, IEEE TIP, and IEEE TMM.