I am currently an undergraduate student in computer science at the University of Minnesota Twin Cities.

💻 Experience

  • 2024.11 - 2025.10, Sun Yat-sen University, HCP Lab
    • Research Focus: Multimodal Large Models, Embodied Instruction Following
  • 2024.1 - 2025.3, Beijing University of Chemical Technology, Vision Lab
    • Research Focus: Vision-Language Model, Noisy-Label Learning, Test-Time Adaptation
  • 2023.9 - 2023.12, Beijing University of Chemical Technology, ACM Lab
    • Track: Data Structures / Graph Theory / Dynamic Programming — 📈 Contest rating: 1942.

🔥 News

  • 2025.11: 🎉 FAST-CAD accepted as an Oral at AAAI 2026. Congrats to Tommy Sha!
  • 2025.10: ✈️ Attended ICCV 2025.
  • 2025.08: 📝 Submitted one paper on healthcare to AAAI 2026.
  • 2025.06: 🎉 One paper on test-time adaptation accepted at ICCV 2025.
  • 2025.05: 📝 Submitted one paper on video understanding to NeurIPS 2025.
  • 2025.03: 🎉 One paper on test-time adaptation accepted at ICME 2025.
  • 2025.03: 📝 Submitted one paper on test-time adaptation to ICCV 2025.
  • 2025.03: 🎉 One paper on test-time adaptation accepted at the ICLR 2025 FM-Wild Workshop.
  • 2024.12: 📝 Submitted two papers to ICME 2025: one on test-time adaptation, the other on video action understanding.

πŸ“ Publications (* Equal Contribution)

AAAI 2026 (Oral • 17.6% acceptance)
Paper Thumbnail

FAST-CAD: A Fairness-Aware Framework for Non-Contact Stroke Diagnosis

Project Page | MIT Tech Review

Stroke is an acute cerebrovascular disease that demands rapid, equitable diagnosis. We propose FAST-CAD, a DAT + Group-DRO framework that jointly enforces demographic-invariant representations and worst-group robustness for non-contact stroke diagnosis. Built on a multimodal dataset spanning 12 demographic subgroups, it couples adversarial domain discrimination with self-supervised encoders and optimizes worst-group risk, delivering 91.2% AUC and tight fairness bounds backed by domain-adaptation and minimax theory.

Role: Collaborating Author.

ICCV 2025 (Poster • 24% acceptance)
Paper Thumbnail

Multi-Cache enhanced Prototype Learning for Test-Time Generalization of Vision-Language Models

Project Page

We observed that cache-based test-time adaptation performance is positively correlated with intra-class compactness. To address the unreliability of low-entropy samples under distribution shifts, we propose MCP, which uses an entropy cache for prototype initialization, an align cache to fuse visual and textual information and tighten intra-class distributions, and a negative cache to calibrate high-entropy predictions. We further extend this into the MCP++ framework by introducing cross-modal prototype alignment and residual learning, achieving state-of-the-art generalization on 15 downstream tasks.

Role: Co-first Author.

ICME 2025 (Oral • 27% acceptance)
Paper Thumbnail

Mitigating Cache Noise in Test-Time Adaptation for Large Vision-Language Models

Also accepted at ICLR 2025 FM-Wild Workshop

We analyze the root causes of the performance gap between zero-shot and few-shot TTA, identifying noisy cache labels as a critical bottleneck. We then propose the CRG framework, which maintains positive and negative visual prototypes alongside text prototypes, employs learnable residuals to align modalities, and leverages Gaussian Discriminant Analysis to dynamically model class distributions and suppress noisy samples. By jointly minimizing prediction entropy and maximizing inter-prototype distances, CRG achieves superior robustness and generalization across 13 benchmarks.

Role: First Author.

🔨 Projects

CSCI 5561 Course Project
Feature-3DGS pipeline

"Can We Make Feature-3DGS Faster, Better, and Smaller?"

We accelerate and shrink Feature-3DGS with semantic-aware Gaussian pruning and a consistency loss, preserving fine details while boosting FPS and mIoU across Replica and Gopher/LindHall scenes.

  • One paper on Embodied Instruction Following and LLM planning — TBA
  • One paper on computer-use agents — TBA
CSCI 5541 Course Project
Abstain-R1 pipeline

Abstain-R1: Abstention Recognition and Calibration in Post-Training of LLMs

Project Page

We propose Abstain-R1, which explicitly learns when to abstain on unanswerable queries and generates semantically meaningful post-refusal clarifications, improving refusal calibration while preserving strong performance on answerable prompts.

📨 Submissions

ICME 2025 Submission • Reviews 5/4/3/3/2
ActionLMM Framework

ActionLMM: A Large Multimodal Model for Detailed Action Description in Long Videos

We observe that existing video-language models (e.g., Video-LLaMA, VideoChat) fail to capture fine-grained human actions because of data scarcity, modality misalignment, and the difficulty of modeling long clips. ActionLMM introduces a dual-branch Q-Former that jointly learns from raw video frames and 3D pose sequences, supported by video- and motion-memory banks, and is trained on the new ActionCap-30k dataset of 30k videos. ActionLMM delivers fine-grained action captions and surpasses Video-LLaMA and other baselines on VQA, video captioning, and action captioning benchmarks.

  • FASMM: Frame-Aware Sparse Multimodal Model for Scalable Long-Video Comprehension — Proposes Frame-Aware Sparse Attention (FASA) with an importance-driven block selector, cutting KV-cache memory by ≈8.8× while retaining fidelity. FASMM processes videos with tens of thousands of frames end-to-end and achieves state-of-the-art results on multiple long-video understanding tasks.

🎖 Honors and Awards

  • 2025.05: 🎉🎉🎉 Selected for the Spring 2025 Dean’s List, University of Minnesota.

  • 2023.10: 🎉🎉🎉 Won a silver (🥈) and a bronze (🥉) medal at the ICPC Asia Regional Contest.

📖 Education

  • Present - 2027.6 (expected), Bachelor of Arts in Computer Science, University of Minnesota Twin Cities

🤝 Academic Service

  • Conferences: Reviewer — ICME 2025/2026, AAAI 2026, ICASSP 2026
  • Journals: Reviewer — IEEE Transactions on Industrial Informatics (TII)