I am an undergraduate student in Computer Science at the University of Minnesota Twin Cities and a visiting student at the HCP Lab of Sun Yat-sen University.

Previously, my research interests were learning with noisy labels, semi-supervised learning, and test-time adaptation. My current research interests include multimodal large models, reinforcement learning, embodied instruction following, and prompt learning.

💻 Experience

  • 2024.12 - present, Sun Yat-sen University, HCP Lab, Visiting Student.
    • Research focus: multimodal large models, reinforcement learning, and embodied instruction following.

🔥 News

  • 2025.06: 🎉🎉🎉 One paper on test-time adaptation accepted at ICCV 2025.
  • 2025.03: 🎉🎉🎉 One paper on test-time adaptation accepted at ICME 2025.
  • 2025.03: 🎉🎉🎉 One paper on test-time adaptation accepted at the ICLR 2025 FM-Wild Workshop.

πŸ“ Publications (* Equal Contribution)

ICCV 2025 (Poster)

Multi-Cache enhanced Prototype Learning for Test-Time Generalization of Vision-Language Models

Project Page

We observe that the performance of cache-based test-time adaptation correlates positively with intra-class compactness. To address the unreliability of low-entropy samples under distribution shifts, we propose MCP, which uses an entropy cache for prototype initialization, an align cache that fuses visual and textual information to tighten intra-class distributions, and a negative cache to calibrate high-entropy predictions. We further extend MCP into the MCP++ framework by introducing cross-modal prototype alignment and residual learning, achieving state-of-the-art generalization across 15 downstream tasks.

Haotian Zhai* et al. (co-first author)

ICME 2025 (Oral)

Mitigating Cache Noise in Test-Time Adaptation for Large Vision-Language Models

Also accepted at ICLR 2025 FM-Wild Workshop

We analyze the root causes of the performance gap between zero-shot and few-shot TTA, identifying noisy cache labels as a critical bottleneck. We then propose the CRG framework, which maintains positive and negative visual prototypes alongside text prototypes, employs learnable residuals to align the modalities, and leverages Gaussian Discriminant Analysis to dynamically model class distributions and suppress noisy samples. Finally, by jointly minimizing prediction entropy and maximizing inter-prototype distances, CRG achieves superior robustness and generalization across 13 benchmarks.

Haotian Zhai* et al. (co-first author)

📨 Submissions

  • ActionLMM: A Large Multimodal Model for Detailed Action Description in Long Videos. Introduces a dual-branch Q-Former that jointly learns from raw video frames and 3D pose sequences, supported by video- and motion-memory banks, and trained on the new ActionCap-30k dataset of 30,000 videos. ActionLMM delivers fine-grained action captions and surpasses Video-LLaMA and other baselines on VQA, video captioning, and action captioning benchmarks.

  • FASMM: Frame-Aware Sparse Multimodal Model for Scalable Long-Video Comprehension. Proposes Frame-Aware Sparse Attention (FASA) with an importance-driven block selector, cutting KV-cache memory by ≈8.8× while retaining fidelity. FASMM processes videos with tens of thousands of frames end-to-end and achieves state-of-the-art results on multiple long-video understanding tasks.

🎖 Honors and Awards

  • 2025.05: 🎉🎉🎉 Selected for the Spring 2025 Dean's List, University of Minnesota.

  • 2023.10: 🎉🎉🎉 Won a silver (🥈) medal and a bronze (🥉) medal at the ICPC Asia Regional Contest.

📖 Education

  • 2027.06 (expected), Bachelor of Arts in Computer Science, University of Minnesota Twin Cities.