Prompting Vision Foundation Models with Cascaded Semantics

Xiao, Xi; Li, Xingjian; Han, Cheng; Wang, Tianyang; Hu, Guosheng; Zhang, Yunbei; Zhao, Lin; Jiang, Runmin; Li, Xi; Wang, Xiao; Xu, Min

Cascaded Semantics injects color, texture, shape, and attention-derived priors into visual prompt tuning so frozen vision backbones adapt with richer semantic guidance.

Accepted at TMLR 2026

Xi Xiao¹, Xingjian Li², Cheng Han³, Tianyang Wang¹, Guosheng Hu⁴, Yunbei Zhang⁵, Lin Zhao⁶, Runmin Jiang², Xi Li¹, Xiao Wang⁷, Min Xu²

¹University of Alabama at Birmingham · ²Carnegie Mellon University · ³University of Missouri-Kansas City · ⁴University of Bristol · ⁵Tulane University · ⁶Northeastern University · ⁷Oak Ridge National Laboratory

Figure: Overview of cascaded semantic prompting for vision foundation models. Cascaded Semantics enriches visual prompt tuning by deriving complementary semantic priors and feeding them into frozen vision foundation models layer by layer.

Abstract

Prompt tuning needs semantic structure, not only learnable tokens.

Visual prompt tuning adapts large frozen vision backbones with a small number of trainable tokens, but conventional prompts can be semantically under-specified and brittle across datasets.

This work builds a cascaded prompting framework that extracts color, texture, shape, and attention-map priors, then integrates them as semantic prompts. The result is parameter-efficient transfer that improves both classification and localization while tuning only about 0.74% of parameters.

Three claims

What this paper changes

Prompts should carry visual meaning.

Explicit semantic priors give prompt tokens a stronger inductive bias than randomly initialized embeddings alone.

Semantics should be cascaded.

Combining low-level appearance, shape, and attention cues lets each transformer stage receive information matched to its role.

Efficiency can coexist with generality.

The approach tunes about 0.74% of parameters while improving results across FGVC, HTA, and VTAB-style transfer tasks.

Method

Building prompts from complementary visual priors

Color and texture

Low-level cues capture appearance regularities that are useful for fine-grained and specialized categories.
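To make "low-level cues" concrete, the sketch below computes a per-channel color histogram and a crude gradient-energy texture score in plain Python. The choice of descriptors, the function names `color_histogram` and `texture_energy`, the bin count, and the toy images are all assumptions for illustration; this page does not say how the paper extracts its color and texture priors.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]          # (R, G, B), each in 0..255
Image = List[List[Pixel]]             # rows of pixels


def color_histogram(image: Image, bins: int = 4) -> List[float]:
    """Concatenated per-channel histogram; each channel's bins sum to 1."""
    counts = [[0] * bins for _ in range(3)]
    n_pixels = 0
    for row in image:
        for pixel in row:
            n_pixels += 1
            for channel, value in enumerate(pixel):
                # Map an intensity in 0..255 to a bin index in 0..bins-1.
                counts[channel][min(value * bins // 256, bins - 1)] += 1
    return [c / n_pixels for channel_counts in counts for c in channel_counts]


def texture_energy(image: Image) -> float:
    """Mean absolute intensity difference between neighboring pixels."""
    gray = [[sum(p) / 3.0 for p in row] for row in image]
    h, w = len(gray), len(gray[0])
    total, count = 0.0, 0
    for y in range(h):
        for x in range(w):
            if x + 1 < w:  # horizontal neighbor
                total += abs(gray[y][x + 1] - gray[y][x])
                count += 1
            if y + 1 < h:  # vertical neighbor
                total += abs(gray[y + 1][x] - gray[y][x])
                count += 1
    return total / count if count else 0.0


# Hypothetical 2x2 images: a flat red patch and a black/white checkerboard.
flat = [[(255, 0, 0), (255, 0, 0)], [(255, 0, 0), (255, 0, 0)]]
checker = [[(0, 0, 0), (255, 255, 255)], [(255, 255, 255), (0, 0, 0)]]

flat_hist = color_histogram(flat)
flat_texture = texture_energy(flat)
checker_texture = texture_energy(checker)
```

The flat patch puts all of its red mass in the top bin and has zero texture energy, while the checkerboard has the maximum possible energy. Separating examples like these is the job an appearance prior would plausibly do.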

Shape and structure

Shape priors encode object boundaries and geometry, complementing appearance prompts.
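One common way to expose object boundaries is an edge map. The sketch below applies the standard Sobel operator to a small grayscale image in plain Python. Using Sobel as the shape descriptor is an assumption made for illustration, as are the name `sobel_magnitude` and the toy input; the paper's actual shape prior may be computed differently.

```python
from typing import List

Gray = List[List[float]]

SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]


def sobel_magnitude(gray: Gray) -> Gray:
    """Gradient magnitude at interior pixels; border pixels are left at 0."""
    h, w = len(gray), len(gray[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = gy = 0.0
            for dy in range(3):
                for dx in range(3):
                    v = gray[y + dy - 1][x + dx - 1]
                    gx += SOBEL_X[dy][dx] * v
                    gy += SOBEL_Y[dy][dx] * v
            out[y][x] = (gx * gx + gy * gy) ** 0.5
    return out


# Hypothetical 4x4 image with a vertical boundary between columns 1 and 2.
step = [[0.0, 0.0, 1.0, 1.0] for _ in range(4)]
edges = sobel_magnitude(step)

# A constant image has no boundaries, so its edge map is all zeros.
flat_edges = sobel_magnitude([[0.5] * 4 for _ in range(4)])
```

The response is concentrated on the pixels next to the boundary and vanishes on the constant image. That boundary signal is what complements the appearance cues.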

Attention semantics

Self-attention maps expose where the frozen model already looks, allowing prompts to reinforce task-relevant regions.
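To illustrate "where the frozen model already looks", the sketch below computes scaled dot-product attention weights from a single query (for example the class token) to a set of patch keys, then picks the most-attended patches. The helper names `cls_attention` and `top_patches` and the toy vectors are hypothetical; in practice the queries and keys would be read from a frozen backbone layer. How the paper turns such maps into prompts is not described on this page.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cls_attention(cls_query: Vector, patch_keys: List[Vector]) -> List[float]:
    """Scaled dot-product attention weights from one query to each patch key."""
    scale = math.sqrt(len(cls_query))
    scores = [sum(q * k for q, k in zip(cls_query, key)) / scale for key in patch_keys]
    return softmax(scores)


def top_patches(weights: List[float], k: int) -> List[int]:
    """Indices of the k most-attended patches, highest weight first."""
    return sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)[:k]


# Hypothetical query and keys standing in for activations of a frozen layer.
query = [1.0, 0.0]
keys = [[0.0, 1.0], [2.0, 0.0], [1.0, 0.0], [-1.0, 0.0]]
weights = cls_attention(query, keys)
focus = top_patches(weights, k=2)
```

Here `focus` selects patches 1 and 2, whose keys align best with the query. A prompt built from such a map could emphasize those regions.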

Figure: Building prompts from complementary visual priors. The semantic prior module converts hand-crafted and model-derived cues into prompt tokens, then cascades them through the frozen visual backbone.
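The conversion and cascading steps can be sketched roughly as follows: each prior descriptor is mapped to a prompt token by a small trainable projection, and the token is placed between the class token and the patch tokens at a given layer. The layer schedule, the shared projection matrix, the dimensions, and every name here (`project`, `layer_input`, `SCHEDULE`) are assumptions for illustration, not the paper's implementation. The frozen transformer blocks themselves are omitted.

```python
from typing import Dict, List

Vector = List[float]
Matrix = List[Vector]


def project(prior: Vector, weight: Matrix) -> Vector:
    """Linear map from a prior descriptor to one prompt token (weight is trainable)."""
    return [sum(w * p for w, p in zip(row, prior)) for row in weight]


def layer_input(cls_token: Vector, prompts: List[Vector], patches: List[Vector]) -> List[Vector]:
    """VPT-style sequence: [CLS], then prompt tokens, then patch tokens."""
    return [cls_token] + prompts + patches


# Assumed schedule of which prior feeds which layer. Purely illustrative.
SCHEDULE: Dict[int, str] = {0: "color", 1: "texture", 2: "shape", 3: "attention"}

# Hypothetical 3-dim prior descriptors and a 2x3 projection (embedding dim 2).
priors: Dict[str, Vector] = {
    "color": [1.0, 0.0, 0.0],
    "texture": [0.0, 1.0, 0.0],
    "shape": [0.0, 0.0, 1.0],
    "attention": [1.0, 1.0, 1.0],
}
weight: Matrix = [[1.0, 2.0, 3.0], [0.5, 0.5, 0.5]]

cls_token = [0.0, 0.0]
patches = [[0.1, 0.2], [0.3, 0.4]]

sequences: Dict[int, List[Vector]] = {}
for layer, name in SCHEDULE.items():
    prompt = project(priors[name], weight)
    # A frozen transformer block would consume this sequence; it is omitted here.
    sequences[layer] = layer_input(cls_token, [prompt], patches)
```

Only `weight` would receive gradients in a setup like this. The patch embeddings and transformer blocks stay frozen, which is what keeps the tuned fraction small.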

Prompt Budget

semantic prompts + frozen backbone = transfer with 0.74% tuned parameters
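To give the 0.74% figure a sense of scale, the sketch below converts a percentage budget into a parameter count. The backbone size is an assumption: a ViT-B/16-sized model with roughly 86 million parameters is used only as an example, since this page does not state which backbone the figure refers to. Under that assumption the budget works out to about 0.64 million trainable parameters.

```python
def tuned_fraction(tuned_params: int, total_params: int) -> float:
    """Share of parameters that receive gradients, as a percentage."""
    return 100.0 * tuned_params / total_params


def tuned_budget(total_params: int, percent: float) -> int:
    """Number of trainable parameters implied by a percentage budget."""
    return round(total_params * percent / 100.0)


# Assumption: a ViT-B/16-sized backbone with roughly 86 million parameters.
BACKBONE_PARAMS = 86_000_000
budget = tuned_budget(BACKBONE_PARAMS, 0.74)
check = tuned_fraction(budget, BACKBONE_PARAMS)
```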

Results

Broad transfer gains with a small parameter budget

76.30 Mean score across 34 image classification datasets

90.20 FGVC average under the cascaded semantic prompt setting

82.95 Harmonic mean for semantic versus text prompt transfer

32.9 Mean localization IoU, compared with 26.5 for VPT
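For readers less familiar with the two metrics named above, the sketch below gives their standard definitions: intersection over union for two axis-aligned boxes, and the harmonic mean of two scores. Whether the paper measures localization with boxes or masks, and which two quantities enter its harmonic mean, is not stated on this page. The box format and the example values are assumptions.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def harmonic_mean(x: float, y: float) -> float:
    """Harmonic mean of two non-negative scores; 0 if both are 0."""
    return 2 * x * y / (x + y) if (x + y) > 0 else 0.0


# Hypothetical boxes: two 2x2 squares overlapping in a 1x1 region.
overlap = box_iou((0.0, 0.0, 2.0, 2.0), (1.0, 1.0, 3.0, 3.0))
disjoint = box_iou((0.0, 0.0, 1.0, 1.0), (2.0, 2.0, 3.0, 3.0))

# Hypothetical score pairs: the harmonic mean penalizes imbalance.
balanced = harmonic_mean(80.0, 80.0)
skewed = harmonic_mean(100.0, 50.0)
```

The harmonic mean is pulled toward the weaker of the two scores, so a high value requires both to be strong.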

Figure: Ablation table for cascaded semantic prompting. Ablation results show how different semantic priors contribute to the final transfer performance.

Figure: Visualization examples for cascaded semantic prompting. Visual diagnostics indicate that semantic prompting improves attention localization on task-relevant regions.

Why it matters

Parameter-efficient adaptation should still be semantically expressive.

Prompt tuning is attractive because it avoids full fine-tuning, but a small prompt budget makes the choice of prompt information especially important.

Cascaded Semantics shows that frozen vision models can be steered more reliably when prompt tokens are grounded in visual priors that the task can actually use.

Citation

Cite Cascaded Semantics

If our work is helpful to your research, please consider citing this paper. Thank you.

@article{xiao2026cascadedsemantics,
  title        = {Prompting Vision Foundation Models with Cascaded Semantics},
  author       = {Xi Xiao and Xingjian Li and Cheng Han and Tianyang Wang and Guosheng Hu and Yunbei Zhang and Lin Zhao and Runmin Jiang and Xi Li and Xiao Wang and Min Xu},
  journal      = {Transactions on Machine Learning Research},
  year         = {2026},
  url          = {https://openreview.net/forum?id=SSsobNZJPO}
}