I am an undergraduate student in computer science at the University of Minnesota Twin Cities, and a visiting student at the HCP Lab of Sun Yat-sen University.
Previously, my research focused on learning with noisy labels, semi-supervised learning, and test-time adaptation. My current research interests include Multimodal Large Models, Reinforcement Learning, Embodied Instruction Following, and Prompt Learning.
💻 Experience
- 2024.12 - present, Sun Yat-sen University, HCP Lab, Visiting Student.
- Research Focus: Multimodal Large Models, Reinforcement Learning, Embodied Instruction Following
🔥 News
- 2025.06: 🎉🎉🎉 One paper accepted at ICCV 2025 on test-time adaptation.
- 2025.03: 🎉🎉🎉 One paper accepted at ICME 2025 on test-time adaptation.
- 2025.03: 🎉🎉🎉 One paper accepted at the ICLR 2025 FM-Wild Workshop on test-time adaptation.
📝 Publications (* Equal Contribution)

Multi-Cache enhanced Prototype Learning for Test-Time Generalization of Vision-Language Models
We observed that cache-based test-time adaptation performance is positively correlated with intra-class compactness. To address the unreliability of low-entropy samples under distribution shifts, we propose MCP, which uses an entropy cache for prototype initialization, an align cache to fuse visual and textual information and tighten intra-class distributions, and a negative cache to calibrate high-entropy predictions. We further extend this into the MCP++ framework by introducing cross-modal prototype alignment and residual learning, achieving state-of-the-art generalization on 15 downstream tasks.
Haotian Zhai* et al. (co-first author)
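To illustrate the entropy-cache idea behind MCP, here is a minimal NumPy sketch; the class name, parameters, and capacity policy are hypothetical simplifications for illustration, not the paper's implementation. Each class keeps only its lowest-entropy (most confident) test features, and the class prototype is their mean:

```python
import numpy as np

def entropy(probs):
    """Shannon entropy of a probability vector."""
    return -np.sum(probs * np.log(probs + 1e-12))

class EntropyCache:
    """Per-class cache of the lowest-entropy test features seen so far.

    Hypothetical sketch of an entropy cache for prototype initialization:
    each class stores at most `capacity` (entropy, feature) pairs, and the
    class prototype is the mean of its cached features.
    """

    def __init__(self, num_classes, capacity=3):
        self.capacity = capacity
        self.cache = {c: [] for c in range(num_classes)}

    def update(self, feature, probs):
        c = int(np.argmax(probs))          # pseudo-label from the prediction
        self.cache[c].append((entropy(probs), feature))
        # Keep only the `capacity` most confident (lowest-entropy) samples.
        self.cache[c].sort(key=lambda t: t[0])
        self.cache[c] = self.cache[c][: self.capacity]

    def prototype(self, c):
        feats = [f for _, f in self.cache[c]]
        return np.mean(feats, axis=0) if feats else None
```

Low-entropy gating like this is what makes cache quality, and hence intra-class compactness, matter: only confident samples shape the prototypes.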

Mitigating Cache Noise in Test-Time Adaptation for Large Vision-Language Models
Also accepted at ICLR 2025 FM-Wild Workshop
We analyzed the root causes of the performance gap between zero-shot and few-shot TTA, identifying noisy cache labels as a critical bottleneck. We then propose the CRG framework, which maintains positive and negative visual prototypes alongside text prototypes, employs learnable residuals to align modalities, and leverages Gaussian Discriminant Analysis to dynamically model class distributions and suppress noisy samples. Finally, by jointly minimizing prediction entropy and maximizing inter-prototype distances, CRG achieves superior robustness and generalization across 13 benchmarks.
Haotian Zhai* et al. (co-first author)
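The Gaussian Discriminant Analysis step for suppressing noisy cache entries can be sketched as follows. This is a simplified, hypothetical version (one Gaussian per class with a shared diagonal covariance, and an invented function name), not CRG's actual implementation: cached features that score poorly under their own class's Gaussian are flagged as likely label noise.

```python
import numpy as np

def gda_scores(features, labels, num_classes, eps=1e-3):
    """Score cached features by class-conditional Gaussian log-likelihood.

    Hypothetical sketch of a GDA-based noise filter: fit a per-class mean
    with a shared diagonal covariance, then return each feature's
    log-likelihood under its own class. Low scores flag likely-noisy
    cache entries for suppression.
    """
    d = features.shape[1]
    means = np.stack([
        features[labels == c].mean(axis=0) if np.any(labels == c) else np.zeros(d)
        for c in range(num_classes)
    ])
    # Shared diagonal covariance pooled over all cached features.
    var = features.var(axis=0) + eps
    diffs = features - means[labels]
    return -0.5 * np.sum(diffs**2 / var + np.log(2 * np.pi * var), axis=1)
```

A mislabeled feature sits far from its assigned class mean, so its log-likelihood drops well below that of correctly cached samples.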
📨 Submissions
- ActionLMM: A Large Multimodal Model for Detailed Action Description in Long Videos. Introduces a dual-branch Q-Former that jointly learns from raw video frames and 3D pose sequences, supported by video- and motion-memory banks, and trained on the new ActionCap-30k dataset of 30,000 videos. ActionLMM delivers fine-grained action captions and surpasses Video-LLaMA and other baselines on VQA, video captioning, and action captioning benchmarks.
- FASMM: Frame-Aware Sparse Multimodal Model for Scalable Long-Video Comprehension. Proposes Frame-Aware Sparse Attention (FASA) with an importance-driven block selector, cutting KV-cache memory by ≈8.8× while retaining fidelity. FASMM processes videos of tens of thousands of frames end-to-end and achieves state-of-the-art results on multiple long-video understanding tasks.
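An importance-driven block selector of the kind FASA describes can be sketched roughly as below; the function name, scoring rule, and shapes are hypothetical illustrations, not the submission's design. KV blocks are scored by their best query-key match, and only the top fraction is kept, which is where the large KV-cache savings come from:

```python
import numpy as np

def select_kv_blocks(key_blocks, query, keep_ratio=0.125):
    """Keep only the KV blocks most relevant to a query.

    Hypothetical sketch of an importance-driven block selector:
    `key_blocks` has shape (num_blocks, block_len, dim). Each block is
    scored by its maximum query-key similarity, and only the top
    `keep_ratio` fraction of blocks is retained, roughly matching the
    ~8x cache reduction described above.
    """
    num_blocks = key_blocks.shape[0]
    keep = max(1, int(num_blocks * keep_ratio))
    # Importance of a block = best match between the query and any key in it.
    scores = np.einsum('bld,d->bl', key_blocks, query).max(axis=1)
    top = np.argsort(scores)[-keep:]
    return np.sort(top)  # block indices, in temporal order
```

Attention then runs only over the selected blocks, so memory scales with the kept fraction rather than with total video length.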
🎖 Honors and Awards
- 2025.05: 🎉🎉🎉 Selected for the Spring 2025 Dean's List, University of Minnesota.
- 2023.10: 🎉🎉🎉 Won a silver (🥈) and a bronze (🥉) medal at the ICPC Asia Regional Contest.
📖 Education
- Present - 2027.06 (expected), Bachelor of Arts in Computer Science, University of Minnesota Twin Cities.