About me

🎓 Academic Background

I am currently an undergraduate student in the Department of Computer Science and Technology at Tsinghua University, having enrolled in 2023.

As of the end of my sophomore year, my GPA is 3.90 (ranked 23 of 171 in my cohort). I earned a 4.0 in every core Computer Science course and every fundamental Mathematics course.

Before that, I attended high school in Jinan, Shandong Province, where I scored 704 on the National College Entrance Examination, placing among the top 10 students in the province.

👩‍💻 Research Experience

During my undergraduate studies, I have been actively involved in research internships under the guidance of Prof. Song-Hai Zhang, Prof. Xiao-Lin Hu, and Prof. Hao Zhao.

Key Achievements

  • SRT Project (Prof. Zhang): My individual work on the Student Research Training (SRT) project received an A+, the highest grade available.
  • Research Grant (Prof. Hu): During my internship with Prof. Hu, I initiated a project and secured a 20,000 RMB grant through the competitive university-level “Xuetui Plan” (Student Research Promotion Program).

🔬 Research Interests

Broadly speaking, my research centers on the perception, understanding, and generation of visual and auditory modalities. My ultimate goal is to construct a highly realistic and interactive audiovisual world, serving as a foundation to achieve Spatial Intelligence capable of robust perception, understanding, and reasoning.

Specifically, I focus (or plan to focus) on the following areas:

1. 3D Vision

  • 3D Representation Learning: Given the currently fragmented landscape of 3D representations (e.g., voxels, NeRF, Gaussian Splatting), I aim to explore and define a unified 3D representation paradigm.
  • 3D Scene Generation & Reconstruction: Generating interactive and physically realistic 3D scenes.

2. Spatial Audio

(For a comprehensive overview of this field, refer to the survey: ASAudio: A Survey of Advanced Spatial Audio Research)

  • Acoustic Field Reconstruction & Generation: Inspired by works such as NeRAF and AV-DAR, I plan to develop a feedforward Audio-Visual Gaussian Splatting (AV-GS) framework that supports multiple sound sources and generalizes across scenes without per-scene optimization.
  • Unified Spatial Audio Generation: Developing a unified framework capable of generating spatial audio from diverse inputs, including text, egocentric video, 360° video, and audio.
  • Spatial Audio Perception & Reasoning: Investigating how Multimodal LLMs (MLLMs) perform reasoning using spatial audio, and how Embodied AI frameworks (e.g., Vision-Language-Action (VLA) models) utilize spatial audio for decision-making.

3. Multimodal Large Language Models (MLLM)

  • Robustness in Audiovisual Contexts: Benchmarking and enhancing MLLM robustness under challenging conditions, such as noisy environments, multi-speaker audio, or extremely low-quality visual inputs.
  • Enhancing Spatial Intelligence: Improving performance on downstream spatial tasks, including but not limited to:
    • Vision-and-Language Navigation (VLN)
    • 3D Object Detection & Grounding
    • Spatial Question Answering (Spatial QA)
  • MLLM Paradigm Research: Critically examining current multimodal alignment paradigms (e.g., designing a dedicated discrete tokenizer or continuous encoder for each modality) to investigate whether they are the path to true multimodal intelligence or merely an expedient, short-sighted solution adopted for simplicity.

📚 Publications / Works

📝 Publications

  • Efficient Audio-Visual Speech Separation with Discrete Lip Semantics
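    • Authors: Kai Li*, Kejun Gao*, et al. (applicant: co-first author)
    • Project Page: https://dolphin-avss.github.io/Dolphin/
    • Venue: Submitted to ICLR 2026, under review (current scores: 6/6/6/4, placing it in roughly the top 10% of all submissions; the conference's acceptance rate in previous years was about 30%)
    • Highlights: Proposes a lightweight multimodal speech separation model built on a “semantic-reconstruction” dual-path visual encoding framework. It surpasses the SOTA across all performance metrics, with >50% fewer parameters, >2.4× lower computation (MACs), and >6× faster inference (GitHub: 159+ stars).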
  • LottieGPT: Tokenizing Vector Animation for Autoregressive Generation
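    • Authors: Junhao Chen*, Kejun Gao*, et al. (applicant: co-first author)
    • Venue: Submitted to CVPR 2026, under review
    • Highlights: Builds the first autoregressive generation framework based on the Lottie vector animation format. It outperforms existing SOTA methods (e.g., OmniSVG) on single-frame SVG generation and surpasses existing closed-source models (e.g., Sora2, Kling) on vector video generation.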
  • ViewSeeker: Locating Camera via Monocular RGB Image
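    • Authors: Hanxi Zhu, Kejun Gao, et al. (applicant: second author)
    • Venue: Submitted to IEEE TVCG (CCF-A journal), under review
    • Highlights: Designs MaskXY derivatives based on detection/segmentation outputs and proposes a differentiable rendering framework for target-viewpoint navigation.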
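  • Narrative Workshop: Interactive Narrative Scene Construction (《叙事工坊:交互式叙事场景构建》)
    • Authors: Hanxi Zhu, Kejun Gao, et al. (applicant: second author)
    • Venue: Chinagraph 2024 Best Paper Award; accepted by the Chinese Journal of Computers (《计算机学报》, a CCF-A Chinese-language journal); 309 downloads on CNKI
    • Highlights: Proposes a large language model (LLM)-based strategy for optimizing narrative scene layouts.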

🏆 Awards

  • 2024-2025 Academic Year: Awarded the “KuanDe” Comprehensive Excellence Scholarship (宽德综合优秀奖学金).
  • 2023-2024 Academic Year: Awarded the “Tsinghua Friend-Huawei” Comprehensive Excellence Scholarship (清华之友-华为综合优秀奖学金).