Efficient and Multi-modal Agent with Large Language Models

Talk by Yukang CHEN

Mar 22, 2024 Friday


Multi-modal AI agents are artificial intelligence systems that plan and reason with long-context large language models (LLMs) through multi-modal understanding, including the processing of 2D images, videos, and potentially other formats such as 3D point clouds. These agents hold the potential to significantly impact many aspects of human life, from everyday convenience to advances in specialized fields. In this presentation, I will focus on three key aspects of my research, all of which are crucial for the development of multi-modal AI agents: perception, reasoning, and efficient deep learning. I will detail three representative works covering multi-modal perception, long-context reasoning with LLMs, and automatic machine learning. Lastly, I will outline our future research directions aimed at advancing multi-modal reasoning in the domains of robotics and generation.



Rm W1-101, GZ Campus

Online Zoom

Join Zoom at https://hkust-gz-edu-cn.zoom.us/j/4236852791 (Meeting ID: 423 685 2791)

Speaker Bio:

Yukang CHEN

Final-year PhD candidate

The Chinese University of Hong Kong

Yukang Chen is a final-year PhD candidate at the Chinese University of Hong Kong. His research focuses on efficient deep learning, large language models, and computer vision. He has contributed to more than 20 publications in leading conferences and journals, 10 of them first-authored. His work has been selected for multiple oral presentations at prestigious conferences such as ICLR and CVPR. His first-authored open-source projects have garnered nearly 5,000 stars on GitHub, reflecting their significant impact and recognition within the community. Additionally, Yukang has secured multiple winning or first-place positions in renowned competitions and leaderboards, including Microsoft COCO, ScanNet, and nuScenes.