Staying VIGILant: Mitigating Visual Laziness via Counterfactual Visual Alignment in MLLMs

VIGIL teaches multimodal models to rely on visual evidence by contrasting what they say when they can see with what they still claim when made counterfactually blind.

Xi Xiao1,*, Chen Liu2,*, Chih-Ting Liao3, Yunbei Zhang4, Qizhen Lan1, Yuxiang Wei5, Lin Zhao6, Janet Wang4, Jianyang Gu7, Muchao Ye8, Tianyang Wang1,†, Hao Xu9,†

1University of Alabama at Birmingham · 2Yale University · 3University of New South Wales · 4Tulane University · 5Georgia Institute of Technology · 6Northeastern University · 7The Ohio State University · 8University of Iowa · 9Harvard University

*Equal contribution. †Co-advising.

[Figure: Overview of VIGIL for mitigating visual laziness in multimodal large language models.]
VIGIL aligns MLLMs by penalizing blind confidence: if the model remains certain after visual evidence is masked, the answer is likely driven by language priors rather than pixels.

Abstract

The model often sees the evidence, then answers from priors.

Multimodal large language models extend LLMs with visual perception, yet they remain prone to hallucinations that contradict the input image. A key failure mode is visual laziness: the model may encode useful visual evidence internally while decoding from strong language priors.

VIGIL is an offline post-training framework that aligns MLLMs by maximizing Visual Information Gain. It constructs a matched counterfactual blind state through attention masking, then penalizes cases where the preferred response stays high-confidence even without visual access. This seeing-versus-blind contrast anchors high-confidence predictions to visual evidence rather than linguistic shortcuts.

Three claims

What VIGIL changes in multimodal alignment

01

Visual laziness is a dependency failure.

Outcome-level preference losses can reward correct strings even when the model reaches them through language priors instead of visual evidence.

02

Blind confidence exposes ungrounded predictions.

VIGIL compares the same response under seeing and blind states, making visual dependence directly observable during training.

03

Counterfactual alignment scales efficiently.

The visual anchor improves hallucination mitigation across model scales and architectures, and it matches full-data baselines while using only 25% of the preference data.

Method

Learning visual dependence from seeing and blind states

Seeing path

The model receives the full multimodal input and can attend from text tokens to visual tokens.

Blind path

Attention masking blocks text-vision interaction while keeping input tensors matched, creating a counterfactual blind state.
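
The blind path can be pictured as a change to the attention mask only, with the input sequence left untouched. The sketch below is a minimal illustration in plain Python, not the authors' implementation: the helper names (causal_mask, blind_mask) and the is_visual token layout are assumptions made for this example, and it assumes that "blind" means text queries cannot read visual keys while visual tokens still attend to each other.

```python
from typing import List

def causal_mask(n: int) -> List[List[bool]]:
    """mask[q][k] is True when query position q may attend to key position k."""
    return [[k <= q for k in range(n)] for q in range(n)]

def blind_mask(is_visual: List[bool]) -> List[List[bool]]:
    """Causal mask that additionally blocks text queries from visual keys.

    Assumption (hypothetical layout): the sequence itself is unchanged, so
    the seeing and blind passes share identical input shapes.
    """
    n = len(is_visual)
    mask = causal_mask(n)
    for q in range(n):
        for k in range(n):
            # A text token (query) may not look at a visual token (key).
            if not is_visual[q] and is_visual[k]:
                mask[q][k] = False
    return mask

# Toy sequence: two visual tokens followed by three text tokens.
is_visual = [True, True, False, False, False]
seeing = causal_mask(len(is_visual))
blind = blind_mask(is_visual)
```

Both masks have the same shape, so the two forward passes differ only in which attention links are allowed. That is one way to read "keeping input tensors matched".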

VIG alignment

The objective enlarges the log-likelihood gap between seeing and blind states and suppresses high-confidence blind responses.

[Figure: VIGIL method diagram showing the dual-path forward pass, visual information gain, and dynamic gating.]
VIGIL combines the standard DPO objective with a Counterfactual Visual Decoupling constraint, gated by the seeing-blind gap.
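
To make the combination concrete, here is a rough sketch of how a DPO loss and a gated grounding penalty could be added together. Only the DPO term follows the standard formulation. The penalty shape (a hinge on the seeing-blind gap), the gate (a sigmoid of the negative gap), and the names margin and grounding_weight are all assumptions for illustration, since this page does not spell out the exact constraint or gating function.

```python
import math

def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def sigmoid(x: float) -> float:
    """Numerically stable sigmoid."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def dpo_loss(pol_w, pol_l, ref_w, ref_l, beta=0.1):
    """Standard DPO loss on sequence log-likelihoods (w = preferred, l = rejected)."""
    return -log_sigmoid(beta * ((pol_w - ref_w) - (pol_l - ref_l)))

def gated_decoupling_penalty(seeing_w, blind_w, margin=1.0):
    """Hypothetical penalty: large when the preferred response is still likely while blind."""
    gap = seeing_w - blind_w              # seeing-blind log-likelihood gap
    shortfall = max(0.0, margin - gap)    # hinge: zero once the gap exceeds the margin
    gate = sigmoid(-gap)                  # assumed gate: opens when the gap is small
    return gate * shortfall

def combined_loss(pol_w, pol_l, ref_w, ref_l, blind_w,
                  beta=0.1, grounding_weight=1.0, margin=1.0):
    """DPO term plus the weighted, gated penalty (illustrative combination only)."""
    return (dpo_loss(pol_w, pol_l, ref_w, ref_l, beta)
            + grounding_weight * gated_decoupling_penalty(pol_w, blind_w, margin))

# Same preference pair, two blind behaviours for the preferred response.
grounded = combined_loss(-1.0, -3.0, -2.0, -2.0, blind_w=-6.0)  # confidence drops when blind
lazy = combined_loss(-1.0, -3.0, -2.0, -2.0, blind_w=-1.0)      # confidence unchanged when blind
```

In this toy setup the two cases share the same DPO term, so the only difference is the penalty paid by the response that stays confident without the image.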

Visual Information Gain

VIG(y) = log p(y | image, text) - log p(y | blind, text)
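
As a small worked example of the formula above, the sketch below sums per-token log-probabilities of one response under seeing and blind logits and takes the difference. The toy logits, the three-token vocabulary, and the helper names are made up for illustration. In practice the two sets of logits would come from two forward passes of the same model.

```python
import math
from typing import List, Sequence

def log_softmax(logits: Sequence[float]) -> List[float]:
    """Log-probabilities over the vocabulary for one decoding step."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def sequence_logprob(step_logits: Sequence[Sequence[float]],
                     token_ids: Sequence[int]) -> float:
    """log p(y | context): sum of the log-probabilities of the response tokens."""
    return sum(log_softmax(logits)[tok]
               for logits, tok in zip(step_logits, token_ids))

def visual_information_gain(seeing_logits, blind_logits, token_ids) -> float:
    """VIG(y) = log p(y | image, text) - log p(y | blind, text)."""
    return (sequence_logprob(seeing_logits, token_ids)
            - sequence_logprob(blind_logits, token_ids))

# Toy response of two tokens over a three-token vocabulary.
response = [0, 1]
seeing_logits = [[2.0, 0.0, 0.0], [0.0, 3.0, 0.0]]  # confident with the image
blind_logits = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]   # uniform without it

vig = visual_information_gain(seeing_logits, blind_logits, response)
vig_same = visual_information_gain(seeing_logits, seeing_logits, response)
```

A positive vig means the response becomes more likely once the image is visible. A value near zero, as in vig_same, is the signature of blind confidence: the image adds nothing to the model's belief in the response.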

Results

Stronger grounding with less post-training data

+4.1 POPE Adv gain on Qwen2.5-VL-7B
+5.3 POPE Adv gain on Qwen2.5-VL-72B
25% of the preference data needed to match full-data DA-DPO
+4.3 zero-shot RefCOCOg Acc@0.5 improvement

[Figure: VIGIL data efficiency and hyperparameter analysis.]
VIGIL matches strong full-data baselines with 25% of the post-training data and remains stable across KL and grounding weights.

[Figure: Qualitative examples comparing DPO and VIGIL on perception, counting, and causal reasoning.]
Qualitative examples show VIGIL reducing prior-driven guesses in fine-grained perception, counting, and physical-state reasoning.

Why it matters

Alignment should constrain the dependency path, not only the answer.

Standard multimodal preference optimization can improve final answers while leaving visual dependence under-specified. VIGIL instead asks whether the same answer remains plausible when visual evidence is causally removed.

This turns visual grounding into a measurable training signal: high confidence is rewarded only when the model actually needs the image to sustain that confidence.

Citation

Cite VIGIL

If our work is helpful to your research, please consider citing VIGIL. Thank you.

@misc{xiao2026vigil,
  title        = {Staying VIGILant: Mitigating Visual Laziness via Counterfactual Visual Alignment in MLLMs},
  author       = {Xiao, Xi and Liu, Chen and Liao, Chih-Ting and Zhang, Yunbei and Lan, Qizhen and Wei, Yuxiang and Zhao, Lin and Wang, Janet and Gu, Jianyang and Ye, Muchao and Wang, Tianyang and Xu, Hao},
  year         = {2026},
  note         = {Project page: https://xixiaouab.github.io/VIGIL/}
}