
Figure 3: An overview of our MCP++ method. The Entropy Cache stores low-entropy samples, the Align Cache retains samples closest to the prototype center, and the Negative Cache preserves high-entropy samples whose pseudo-labels have been refined through a reflecting mechanism. We introduce textual and visual prototypes, refined with learnable residuals, to construct the prototype center, which is optimized with an alignment loss, a contrastive loss, and an entropy loss. The final prediction is derived through a retrieval mechanism that aggregates similarity scores from the negative cache features, the prototype center, and the adaptive cache features.
Figure 1: Illustration of the average classification accuracy, test-time training GFLOPs, and FPS for different methods on cross-dataset classification tasks. For prompt-based and cache-based methods, the icon sizes denote the FPS values.
Figure 2: The x-axis represents the compactness of the test data (defined as the inverse of the average distance between each sample and its class center), and the y-axis represents the accuracy improvement of TDA relative to zero-shot CLIP. The curve shows a positive correlation, further illustrated by the two t-SNE visualizations on the EuroSAT and Aircraft datasets.
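For concreteness, a minimal sketch of how this compactness metric could be computed in PyTorch; the function name and the use of L2 distance are our assumptions, as the caption only specifies the inverse of the average distance to the class center:

```python
import torch

def compactness(features: torch.Tensor, labels: torch.Tensor) -> float:
    """Inverse of the average distance between each sample and its class center."""
    dists = []
    for c in labels.unique():
        feats_c = features[labels == c]               # (n_c, d) features of class c
        center = feats_c.mean(dim=0, keepdim=True)    # class center
        dists.append((feats_c - center).norm(dim=1))  # per-sample L2 distances (assumed metric)
    return (1.0 / torch.cat(dists).mean()).item()
```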
In the zero-shot setting, test-time adaptation (TTA) adjusts pre-trained models using unlabeled data from the test phase to enhance performance on unknown test distributions. Existing cache-enhanced TTA methods rely on a low-entropy criterion to select samples for prototype construction, assuming intra-class compactness. However, low-entropy samples may be unreliable under distribution shifts, and the resulting prototypes may not ensure compact intra-class distributions. This study identifies a positive correlation between cache-enhanced performance and intra-class compactness. Based on this observation, we propose Multi-Cache enhanced Prototype-based Test-Time Adaptation (MCP), featuring three caches: an entropy cache for initializing prototype representations with low-entropy samples, an align cache for integrating visual and textual information to achieve compact intra-class distributions, and a negative cache for prediction calibration using high-entropy samples. We further develop MCP++, a framework that incorporates cross-modal prototype alignment and residual learning, introducing prototype residual fine-tuning. Comparative and ablation experiments across 15 downstream tasks demonstrate that the proposed method and framework achieve state-of-the-art generalization performance.
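As a rough illustration of the low-entropy criterion that the entropy cache builds on, the sketch below keeps the lowest-entropy samples per pseudo-class; the capacity, data structure, and names are illustrative assumptions rather than the released implementation:

```python
import torch

def update_entropy_cache(cache, feat, logits, capacity=3):
    """Keep the `capacity` lowest-entropy samples per pseudo-class.

    cache: dict mapping class id -> list of (entropy, feature) pairs.
    feat:  (d,) image feature; logits: (num_classes,) CLIP logits.
    """
    probs = logits.softmax(dim=-1)
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum().item()
    cls = probs.argmax().item()                 # pseudo-label from the CLIP prediction
    cache.setdefault(cls, []).append((entropy, feat.detach()))
    # Evict the highest-entropy entries once the per-class budget is exceeded.
    cache[cls] = sorted(cache[cls], key=lambda x: x[0])[:capacity]
    return cache
```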
Table 1: Results on the Cross-Domain Benchmark. Top-1 accuracy (%) results are presented for all evaluated methods employing the ViT-B/16 visual backbone of CLIP. The best results are highlighted in bold.
Table 2: Results on the OOD Benchmark. Top-1 accuracy (%) results are presented for all evaluated methods employing the ViT-B/16 visual backbone of CLIP. The best results are highlighted in bold.
While prevailing methods for cache construction largely adhere to a per-class organizational model, our work extends this paradigm not only by exploring selection metrics beyond the conventional entropy criterion but also by constructing a synergistic multi-cache mechanism. We believe future research can continue along this trajectory to investigate more diverse cache construction strategies and their corresponding evaluation metrics.
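As one example of a selection metric beyond entropy, the align cache retains samples closest to the prototype center; below is a minimal sketch, where the names, capacity, and eviction rule are our assumptions:

```python
import torch

def update_align_cache(cache, feat, cls, prototype_center, capacity=3):
    """Keep the samples whose features lie closest to their class prototype.

    cache: dict mapping class id -> list of (distance, feature) pairs.
    prototype_center: (num_classes, d) fused visual/textual prototypes.
    """
    dist = (feat - prototype_center[cls]).norm().item()  # proximity, not entropy
    cache.setdefault(cls, []).append((dist, feat.detach()))
    # Smaller distance to the prototype center = more representative sample.
    cache[cls] = sorted(cache[cls], key=lambda x: x[0])[:capacity]
    return cache
```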
Once an effective cache is constructed, the subsequent challenge lies in designing its readout mechanism. We have proposed a solution that combines prototypes with an attention mechanism to balance class-level and instance-level information, and in separate work we have also validated the efficacy of Gaussian Discriminant Analysis (GDA). Crucially, once a rich cache is available, this stage essentially reduces to a few-shot learning problem, implying that a broad range of mature methodologies from the few-shot domain can be adapted and applied.
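A minimal sketch of such a readout, blending class-level prototype similarity with instance-level attention over cached features; the fusion weight, temperature, and function names are illustrative assumptions rather than the exact MCP formulation:

```python
import torch
import torch.nn.functional as F

def cache_readout(q, proto, cache_feats, cache_labels, num_classes,
                  beta=5.0, alpha=0.5):
    """Fuse class-level prototype logits with instance-level attention logits.

    q:            (d,)  L2-normalized query feature.
    proto:        (num_classes, d) L2-normalized prototype centers.
    cache_feats:  (n, d) L2-normalized cached features.
    cache_labels: (n,)   long tensor of pseudo-labels for cached samples.
    """
    proto_logits = proto @ q                                # class-level similarity
    attn = torch.softmax(beta * (cache_feats @ q), dim=0)   # attention over the cache
    inst_logits = attn @ F.one_hot(cache_labels, num_classes).float()
    return alpha * proto_logits + (1 - alpha) * inst_logits # illustrative fusion
```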
Furthermore, when integrating unsupervised loss paradigms such as Test-Time Prompt Tuning (TPT), exploring alternative loss functions and strategies for selecting augmented views for the unsupervised objective remains a highly valuable research direction. In summary, as an emerging research paradigm, the test-time adaptation (TTA) of vision-language models (VLMs) is not only challenging but also encompasses a multitude of fundamental topics worthy of in-depth exploration.
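For reference, the TPT-style objective mentioned above averages the predictions of the most confident augmented views and minimizes the entropy of that marginal distribution; the sketch below follows TPT's 10% confidence-selection ratio, while the function name is ours:

```python
import torch

def tpt_marginal_entropy(logits_aug, select_ratio=0.1):
    """Marginal entropy over the most confident augmented views.

    logits_aug: (n_views, num_classes) logits of one test image's augmented views.
    """
    probs = logits_aug.softmax(dim=-1)
    ent = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)  # per-view entropy
    k = max(1, int(select_ratio * logits_aug.size(0)))
    idx = ent.topk(k, largest=False).indices    # keep the lowest-entropy views
    p_bar = probs[idx].mean(dim=0)              # marginal class distribution
    return -(p_bar * p_bar.clamp_min(1e-12).log()).sum()
```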
If you find our work helpful or this project page inspiring, please consider citing our paper.
@article{chen2025multi,
  title={Multi-Cache Enhanced Prototype Learning for Test-Time Generalization of Vision-Language Models},
  author={Chen, Xinyu and Zhai, Haotian and Zhang, Can and Shi, Xiupeng and Li, Ruirui},
  journal={arXiv preprint arXiv:2508.01225},
  year={2025}
}