About me

🎓 Academic Background

I am currently an undergraduate student in the Department of Computer Science and Technology at Tsinghua University, having enrolled in 2023.

As of the end of my sophomore year, my GPA is 3.90 (ranked 23rd of 171 in my cohort). I earned a 4.0 in every core Computer Science course and every fundamental Mathematics course.

Prior to this, I completed my high school education in Jinan, Shandong Province, where I achieved a score of 704 on the National College Entrance Examination, ranking among the Top 10 students in the province.

👩‍💻 Research Experience

During my undergraduate studies, I have been actively involved in research internships under the guidance of Prof. Song-Hai Zhang, Prof. Xiao-Lin Hu, and Prof. Hao Zhao.

Key Achievements

SRT Project (Prof. Zhang): My individual work on the Student Research Training (SRT) project under Prof. Zhang received an A+, the highest grade available.
Research Grant (Prof. Hu): During my internship with Prof. Hu, I initiated and secured a 20,000 RMB grant through the competitive university-level “Xuetui Plan” (Student Research Promotion Program).

🔬 Research Interests

Broadly speaking, my research centers on the perception, understanding, and generation of visual and auditory modalities. My ultimate goal is to construct a highly realistic and interactive audiovisual world, serving as a foundation to achieve Spatial Intelligence capable of robust perception, understanding, and reasoning.

Specifically, I focus (or plan to focus) on the following areas:

1. 3D Vision

3D Representation Learning: Considering the current fragmented landscape of 3D representations (e.g., Voxels, NeRF, Gaussian Splatting), I aim to explore and define a unified 3D representation paradigm.
3D Scene Generation & Reconstruction: Generating interactive and physically realistic 3D scenes.

2. Spatial Audio

(For a comprehensive overview of this field, refer to the survey: ASAudio: A Survey of Advanced Spatial Audio Research)

Acoustic Field Reconstruction & Generation: Inspired by works such as NeRAF and AV-DAR, I plan to develop a feedforward Audio-Visual Gaussian Splatting (AV-GS) framework, which supports multiple sound sources and generalizes across scenes without requiring per-scene optimization.
Unified Spatial Audio Generation: Developing a unified framework capable of generating spatial audio from diverse inputs, including text, egocentric video, 360° video, and audio.
Spatial Audio Perception & Reasoning: Investigating how Multimodal LLMs (MLLMs) perform reasoning using spatial audio, and how Embodied AI frameworks (e.g., Vision-Language-Action (VLA) models) utilize spatial audio for decision-making.

3. Multimodal Large Language Models (MLLM)

Robustness in Audiovisual Contexts: Benchmarking and enhancing MLLM robustness under challenging conditions, such as noisy environments, multi-speaker audio, or extremely low-quality visual inputs.
Enhancing Spatial Intelligence: Improving performance on downstream spatial tasks, including but not limited to:
- Vision-and-Language Navigation (VLN)
- 3D Object Detection & Grounding
- Spatial Question Answering (Spatial QA)
MLLM Paradigm Research: Critically examining current multimodal alignment paradigms (e.g., designing a specific discrete tokenizer or continuous encoder for each modality). I aim to investigate whether this approach is the path to true multimodal intelligence or merely an expedient, short-sighted solution chosen for simplicity.

📚 Publications / Works

📝 Publications

Efficient Audio-Visual Speech Separation with Discrete Lip Semantics
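- Authors: Kai Li*, Kejun Gao*, et al. (applicant: co-first author)
- Project Page: https://dolphin-avss.github.io/Dolphin/
- Venue: Submitted to ICLR 2026, under review (current scores: 6/6/6/4, placing it in roughly the top 10% of all submissions; the conference's acceptance rate in past years has been about 30%)
- Highlights: Proposes a lightweight multimodal speech separation model built on a "semantic-reconstruction" dual-path visual encoding framework. It surpasses the SOTA across performance metrics: more than 50% fewer parameters, computational cost (MACs) reduced by more than 2.4×, and inference more than 6× faster (159+ GitHub stars).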
LottieGPT: Tokenizing Vector Animation for Autoregressive Generation
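- Authors: Junhao Chen*, Kejun Gao*, et al. (applicant: co-first author)
- Venue: Submitted to CVPR 2026, under review
- Highlights: Builds the first autoregressive generation framework based on the Lottie vector animation format. It outperforms existing SOTA methods (e.g., OmniSVG) on single-frame SVG generation and surpasses existing closed-source models (e.g., Sora2, Kling) on vector video generation.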
ViewSeeker: Locating Camera via Monocular RGB Image
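- Authors: Hanxi Zhu, Kejun Gao, et al. (applicant: second author)
- Venue: Submitted to IEEE TVCG (a CCF-A journal), under review
- Highlights: Designs MaskXY derivatives based on detection/segmentation outputs and proposes a differentiable rendering framework for target-viewpoint navigation.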
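Narrative Workshop: Interactive Narrative Scene Construction (《叙事工坊：交互式叙事场景构建》)
- Authors: Hanxi Zhu, Kejun Gao, et al. (applicant: second author)
- Venue: Chinagraph 2024 Best Paper Award; accepted by the Chinese Journal of Computers (《计算机学报》, a CCF-A Chinese-language journal); 309 downloads on CNKI
- Highlights: Proposes a large language model (LLM)-based strategy for optimizing the layout of narrative scenes.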

🏆 Awards

2024-2025 Academic Year: Awarded the “KuanDe” Comprehensive Excellence Scholarship (宽德综合优秀奖学金).
2023-2024 Academic Year: Awarded the “Tsinghua Friend-Huawei” Comprehensive Excellence Scholarship (清华之友-华为综合优秀奖学金).

Kejun Gao