Prompting Vision Foundation Models with Cascaded Semantics

Cascaded Semantics injects color, texture, shape, and attention-derived priors into visual prompt tuning so frozen vision backbones adapt with richer semantic guidance.

Xi Xiao¹, Xingjian Li², Cheng Han³, Tianyang Wang¹, Guosheng Hu⁴, Yunbei Zhang⁵, Lin Zhao⁶, Runmin Jiang², Xi Li¹, Xiao Wang⁷, Min Xu²

¹ University of Alabama at Birmingham · ² Carnegie Mellon University · ³ University of Missouri-Kansas City · ⁴ University of Bristol · ⁵ Tulane University · ⁶ Northeastern University · ⁷ Oak Ridge National Laboratory

Overview of cascaded semantic prompting for vision foundation models.
Cascaded Semantics enriches visual prompt tuning by deriving complementary semantic priors and feeding them into frozen vision foundation models layer by layer.

Abstract

Prompt tuning needs semantic structure, not only learnable tokens.

Visual prompt tuning adapts large frozen vision backbones with a small number of trainable tokens, but conventional prompts can be semantically under-specified and brittle across datasets.

This work builds a cascaded prompting framework that extracts color, texture, shape, and attention-map priors, then integrates them as semantic prompts. The result is parameter-efficient transfer that improves both classification and localization while tuning only about 0.74% of parameters.

Three claims

What this paper changes

01

Prompts should carry visual meaning.

Explicit semantic priors give prompt tokens a stronger inductive bias than randomly initialized embeddings alone.

02

Semantics should be cascaded.

Combining low-level appearance, shape, and attention cues lets each transformer stage receive information matched to its role.

03

Efficiency can coexist with generality.

The approach tunes about 0.74% of parameters while improving results across FGVC, HTA, and VTAB-style transfer tasks.

Method

Building prompts from complementary visual priors

Color and texture

Low-level cues capture appearance regularities that are useful for fine-grained and specialized categories.
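
To make the idea of a low-level appearance cue concrete, here is a minimal sketch of two hand-crafted descriptors: a per-channel color histogram and a simple texture-energy score. The page does not say which descriptors the method uses, so `color_histogram`, `texture_energy`, and the nested-list image format are illustrative assumptions, not the authors' implementation.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]   # (R, G, B), each in 0..255
Image = List[List[Pixel]]      # rows of pixels (hypothetical toy format)


def color_histogram(img: Image, bins: int = 4) -> List[float]:
    """Concatenated per-channel histogram, normalized to sum to 1."""
    hist = [0.0] * (3 * bins)
    width = 256 // bins
    count = 0
    for row in img:
        for px in row:
            for channel, value in enumerate(px):
                bin_idx = min(value // width, bins - 1)
                hist[channel * bins + bin_idx] += 1.0
            count += 1
    total = 3.0 * count
    return [h / total for h in hist]


def texture_energy(img: Image) -> float:
    """Mean absolute grayscale difference between neighboring pixels."""
    gray = [[sum(px) / 3.0 for px in row] for row in img]
    h, w = len(gray), len(gray[0])
    diffs = []
    for y in range(h):
        for x in range(w):
            if x + 1 < w:
                diffs.append(abs(gray[y][x + 1] - gray[y][x]))
            if y + 1 < h:
                diffs.append(abs(gray[y + 1][x] - gray[y][x]))
    return sum(diffs) / len(diffs) if diffs else 0.0


flat_red = [[(255, 0, 0)] * 2 for _ in range(2)]
checker = [[(0, 0, 0), (255, 255, 255)], [(255, 255, 255), (0, 0, 0)]]
print(color_histogram(flat_red), texture_energy(flat_red), texture_energy(checker))
```

A flat patch yields zero texture energy while a checkerboard yields a high value, which is the kind of contrast an appearance prior could pass on to a prompt token.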

Shape and structure

Shape priors encode object boundaries and geometry, complementing appearance prompts.
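
One common way to expose boundaries is an edge map. The sketch below uses a Sobel gradient magnitude as a stand-in shape cue; the page does not state which shape extractor the method uses, so `sobel_edges` and the grayscale grid input are assumptions made for illustration.

```python
import math
from typing import List

Gray = List[List[float]]  # grayscale intensities, one list per row


def sobel_edges(gray: Gray) -> Gray:
    """Sobel gradient magnitude; border pixels are left at 0."""
    h, w = len(gray), len(gray[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            # Horizontal gradient: right column minus left column.
            gx = (gray[y - 1][x + 1] + 2 * gray[y][x + 1] + gray[y + 1][x + 1]) - (
                gray[y - 1][x - 1] + 2 * gray[y][x - 1] + gray[y + 1][x - 1]
            )
            # Vertical gradient: bottom row minus top row.
            gy = (gray[y + 1][x - 1] + 2 * gray[y + 1][x] + gray[y + 1][x + 1]) - (
                gray[y - 1][x - 1] + 2 * gray[y - 1][x] + gray[y - 1][x + 1]
            )
            out[y][x] = math.hypot(gx, gy)
    return out


# A vertical step edge: dark left half, bright right half.
step = [[0.0, 0.0, 1.0, 1.0] for _ in range(4)]
edges = sobel_edges(step)
print(edges[1])
```

The response is concentrated where intensity changes, so the map describes geometry rather than color, which is why it complements the appearance cues.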

Attention semantics

Self-attention maps expose where the frozen model already looks, allowing prompts to reinforce task-relevant regions.
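
As a rough illustration, an attention prior can be read off the class-token row of a self-attention map. The sketch assumes the per-head attention weights from the class token to each patch are already available as plain lists; `cls_attention_prior` and the head-averaging rule are hypothetical choices, not the procedure described by the authors.

```python
from typing import List, Tuple


def cls_attention_prior(
    attn_heads: List[List[float]], top_k: int
) -> Tuple[List[float], List[int]]:
    """Average class-token attention over heads, normalize it, and
    return the saliency vector plus the indices of the top-k patches."""
    num_patches = len(attn_heads[0])
    mean = [
        sum(head[i] for head in attn_heads) / len(attn_heads)
        for i in range(num_patches)
    ]
    total = sum(mean)
    if total > 0:
        saliency = [m / total for m in mean]
    else:
        saliency = [1.0 / num_patches] * num_patches  # uniform fallback
    ranked = sorted(range(num_patches), key=lambda i: saliency[i], reverse=True)
    return saliency, sorted(ranked[:top_k])


# Two heads attending over four patches (toy numbers).
heads = [[0.10, 0.60, 0.25, 0.05], [0.10, 0.40, 0.45, 0.05]]
saliency, top_patches = cls_attention_prior(heads, top_k=2)
print(saliency, top_patches)
```

The selected patch indices mark the regions the frozen model already favors; a prompt built from them can reinforce those regions instead of learning them from scratch.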

The semantic prior module converts hand-crafted and model-derived cues into prompt tokens, then cascades them through the frozen visual backbone.
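
The cascade can be pictured as a deep-prompting loop in which each frozen layer receives the prompts of the stage it belongs to. The sketch below is an assumption-heavy toy: the equal-depth split, the ordering of priors from shallow to deep, and the names `stage_for_layer` and `cascaded_forward` are illustrative and are not taken from the paper.

```python
from typing import Callable, Dict, List, Sequence

Token = List[float]
Layer = Callable[[List[Token]], List[Token]]  # a frozen transformer block


def stage_for_layer(layer_idx: int, num_layers: int, stages: Sequence[str]) -> str:
    """Split the depth into equal stages and return the prior for a layer."""
    stage_idx = min(layer_idx * len(stages) // num_layers, len(stages) - 1)
    return stages[stage_idx]


def cascaded_forward(
    patch_tokens: List[Token],
    layers: Sequence[Layer],
    prompts: Dict[str, List[Token]],
    stages: Sequence[str],
) -> List[Token]:
    """Prepend the stage's semantic prompts before every frozen layer."""
    tokens = list(patch_tokens)
    for idx, layer in enumerate(layers):
        stage_prompts = prompts[stage_for_layer(idx, len(layers), stages)]
        out = layer(stage_prompts + tokens)
        # Drop the prompt outputs; the next layer gets its own prompts.
        tokens = out[len(stage_prompts):]
    return tokens


seen: List[float] = []


def recording_layer(seq: List[Token]) -> List[Token]:
    """Stand-in for a frozen block: logs the first prompt, returns input."""
    seen.append(seq[0][0])
    return seq


stages = ["color_texture", "shape", "attention"]  # assumed shallow-to-deep order
prompts = {"color_texture": [[1.0]], "shape": [[2.0]], "attention": [[3.0]]}
patches = [[9.0], [8.0]]
output = cascaded_forward(patches, [recording_layer] * 6, prompts, stages)
print(seen, output)
```

Only the entries of `prompts` would be trainable in such a setup; the layers themselves stay frozen, which is what keeps the tuned-parameter count small.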

Prompt Budget

semantic prompts + frozen backbone = transfer with 0.74% tuned parameters
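
The budget is simple arithmetic once the trainable and frozen parameter counts are known. The sketch below shows one way to compute it; the layer count, prompt length, and embedding width are hypothetical, and defining the fraction as tuned over tuned plus frozen is an assumption, since the page does not say how the 0.74% figure is normalized.

```python
def prompt_param_count(num_layers: int, prompts_per_layer: int, embed_dim: int) -> int:
    """Parameters held by learnable prompt tokens inserted at every layer."""
    return num_layers * prompts_per_layer * embed_dim


def tuned_fraction(tuned: int, frozen: int) -> float:
    """Share of all parameters that receive gradient updates."""
    return tuned / (tuned + frozen)


# Hypothetical configuration: 12 layers, 10 prompts per layer, width 768.
prompt_params = prompt_param_count(num_layers=12, prompts_per_layer=10, embed_dim=768)
print(prompt_params)

# Toy check of the ratio: 74 tuned out of 10,000 total parameters is 0.74%.
print(f"{tuned_fraction(74, 9926):.2%}")
```

In practice the tuned count would also include any trainable prior-encoding modules and the classification head, so the prompt tokens alone give only a lower bound.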

Results

Broad transfer gains with a small parameter budget

76.30 Mean score across 34 image classification datasets
90.20 FGVC average under the cascaded semantic prompt setting
82.95 Harmonic mean for semantic versus text prompt transfer
32.9 Mean localization IoU, compared with 26.5 for VPT
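
For readers less familiar with the last two metrics, here is a minimal sketch of how a harmonic mean and an intersection-over-union score are computed. The inputs are toy values, the box form of IoU is an assumption (the page does not say whether localization is scored on boxes or masks), and `harmonic_mean_score` and `box_iou` are illustrative names.

```python
from statistics import harmonic_mean
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def harmonic_mean_score(score_a: float, score_b: float) -> float:
    """Harmonic mean of two scores; it is pulled toward the weaker one."""
    return harmonic_mean([score_a, score_b])


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


print(harmonic_mean_score(60.0, 30.0))
print(box_iou((0.0, 0.0, 2.0, 2.0), (1.0, 1.0, 3.0, 3.0)))
```

Because the harmonic mean penalizes imbalance, a high value indicates that both settings being compared perform well, not just one of them.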

Ablation table for cascaded semantic prompting.
Ablation results show how different semantic priors contribute to the final transfer performance.

Visualization examples for cascaded semantic prompting.
Visual diagnostics indicate that semantic prompting improves attention localization on task-relevant regions.

Why it matters

Parameter-efficient adaptation should still be semantically expressive.

Prompt tuning is attractive because it avoids full fine-tuning, but a small prompt budget makes the choice of prompt information especially important.

Cascaded Semantics shows that frozen vision models can be steered more reliably when prompt tokens are grounded in visual priors that the task can actually use.

Citation

Cite Cascaded Semantics

If our work is helpful to your research, please consider citing this paper. Thank you.

@misc{xiao2026cascadedsemantics,
  title        = {Prompting Vision Foundation Models with Cascaded Semantics},
  author       = {Xi Xiao and Xingjian Li and Cheng Han and Tianyang Wang and Guosheng Hu and Yunbei Zhang and Lin Zhao and Runmin Jiang and Xi Li and Xiao Wang and Min Xu},
  year         = {2026},
  note         = {Project page: https://xixiaouab.github.io/Cascaded-Semantics/}
}