Advancing Self-supervised Audio Learning and Its Integration with Large Language Models

Talk By Xie CHEN

Mar 15, 2024 Friday

Abstract:

Recently, self-supervised learning has received widespread research interest in speech and audio processing. It demonstrates great potential for learning underlying structural information from large amounts of unlabeled audio. In this talk, I will first introduce our recent progress in self-supervised learning on audio and emotional speech data. By introducing joint utterance-level and frame-level learning, we achieve significant performance improvements in audio classification and speech emotion recognition. Next, I will introduce our effort to integrate these powerful speech representations with a large language model (LLM), extending the ability of LLMs to speech recognition and spatial audio understanding. We demonstrate that a powerful audio representation plays a vital role, and that a simple combination of the audio representation and the LLM is sufficient to yield promising performance.

Time:

Mar 15, 2024 Friday

11:00-11:50

Location:

Rm W1-101, GZ Campus

Online Zoom

Join Zoom at https://hkust-gz-edu-cn.zoom.us/j/4236852791 OR 423 685 2791

Speaker Bio:

Xie CHEN

Tenure-Track Associate Professor, Department of Computer Science and Engineering,

Shanghai Jiao Tong University, China

Xie Chen is currently a Tenure-Track Associate Professor in the Department of Computer Science and Engineering at Shanghai Jiao Tong University (SJTU), China. He obtained his Bachelor's degree from the Department of Electronic Engineering at Xiamen University in 2009, his Master's degree from the Department of Electronic Engineering at Tsinghua University in 2012, and his PhD from the Department of Information Engineering at Cambridge University (U.K.) in 2017. Prior to joining SJTU, he worked at Cambridge University as a Research Associate from 2017 to 2018, and as a senior and principal researcher in the speech and language research group at Microsoft from 2018 to 2021. His main research interest lies in deep learning, especially its application to speech processing, including speech recognition and synthesis.