Staying VIGILant: Mitigating Visual Laziness via Counterfactual Visual Alignment in MLLMs

VIGIL teaches multimodal models to rely on visual evidence by contrasting what they say when they can see with what they still claim when made counterfactually blind.

Xi Xiao¹*, Chen Liu²*, Chih-Ting Liao³, Yunbei Zhang⁴, Qizhen Lan¹, Yuxiang Wei⁵, Lin Zhao⁶, Janet Wang⁴, Jianyang Gu⁷, Muchao Ye⁸, Tianyang Wang¹†, Hao Xu⁹†

¹University of Alabama at Birmingham · ²Yale University · ³University of New South Wales · ⁴Tulane University · ⁵Georgia Institute of Technology · ⁶Northeastern University · ⁷The Ohio State University · ⁸University of Iowa · ⁹Harvard University

*Equal contribution. †Co-advising.

ECCV 2026

Demo of visual laziness in multimodal large language models.

Abstract

The model often sees the evidence, then answers from priors.

Multimodal large language models extend LLMs with visual perception, yet they remain prone to hallucinations that contradict the input image. A key failure mode is visual laziness: the model may encode useful visual evidence internally while decoding from strong language priors.

VIGIL is an offline post-training framework that aligns MLLMs by maximizing Visual Information Gain. It constructs a matched counterfactual blind state through attention masking, then penalizes cases where the preferred response stays high-confidence even without visual access. This seeing-versus-blind contrast anchors high-confidence predictions to visual evidence rather than linguistic shortcuts.

Overview of VIGIL for visually grounded MLLMs. (a) A longstanding limitation of MLLMs is visual laziness: heavy reliance on textual knowledge priors over visual evidence, which can lead to hallucination. As a remedy, we introduce VIGIL, a post-training RL framework that promotes dependence on visual evidence. (b) Rather than altering text-based rewards, VIGIL penalizes insufficient reliance on visual information. (c) Qwen2.5-VL trained with VIGIL rapidly learns to leverage visual inputs during generation. (d) VIGIL consistently improves hallucination mitigation and multimodal reasoning across model scales, architectures, and benchmarks.

Three claims

What VIGIL changes in multimodal alignment

Visual laziness is a dependency failure.

Outcome-level preference losses can reward correct strings even when the model reaches them through language priors instead of visual evidence.

Blind confidence exposes ungrounded predictions.

VIGIL compares the same response under seeing and blind states, making visual dependence directly observable during training.

Counterfactual alignment scales efficiently.

The visual anchor improves hallucination mitigation across model scales and architectures, and it matches full-data baselines using only 25% of the preference data.
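The first claim can be illustrated with a minimal sketch. It uses the standard DPO loss as a stand-in for an outcome-level preference loss, which is an assumption: the page only names DPO and DA-DPO as baselines. All log-probabilities below are made-up numbers, and the `grounded` and `lazy` models are hypothetical. The point is that the loss only ever reads log-probabilities computed with the image visible, so two models that differ solely in how confident they stay without the image receive the same loss.

```python
import math

def dpo_loss(pol_chosen, pol_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one preference pair, from sequence log-probs
    computed with the image visible."""
    margin = (pol_chosen - ref_chosen) - (pol_rejected - ref_rejected)
    # -log(sigmoid(beta * margin)), written as a numerically stable softplus.
    z = -beta * margin
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

# Hypothetical reference-model log-probs for the chosen / rejected responses.
REF = {"chosen": -6.0, "rejected": -7.0}

# Two hypothetical policies: identical when seeing, different when blind.
models = {
    "grounded": {"seeing_chosen": -4.0, "seeing_rejected": -9.0, "blind_chosen": -15.0},
    "lazy":     {"seeing_chosen": -4.0, "seeing_rejected": -9.0, "blind_chosen": -4.5},
}

# The preference loss never touches the blind-state log-prob.
losses = {
    name: dpo_loss(m["seeing_chosen"], m["seeing_rejected"], REF["chosen"], REF["rejected"])
    for name, m in models.items()
}

# The seeing-versus-blind gap is where the two models actually differ.
gaps = {name: m["seeing_chosen"] - m["blind_chosen"] for name, m in models.items()}

print(losses)  # identical values for both models
print(gaps)    # large gap for "grounded", small gap for "lazy"
```

Both models get the same preference loss, while their seeing-versus-blind gaps differ by an order of magnitude. That gap is the quantity a dependency-aware objective would need to look at.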

Method

Learning visual dependence from seeing and blind states

Seeing path

The model receives the full multimodal input and can attend from text tokens to visual tokens.

Blind path

Attention masking blocks text-vision interaction while keeping input tensors matched, creating a counterfactual blind state.
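The two paths can be sketched as a pair of attention masks over the same token sequence. This is a toy illustration under several assumptions that the page does not confirm: a causal decoder, image tokens placed before text tokens, and a blind mask that removes only text-to-image attention while leaving image-to-image attention intact. The function name and the `"image"` / `"text"` labels are hypothetical.

```python
from typing import List, Tuple

Mask = List[List[bool]]

def build_attention_masks(token_types: List[str]) -> Tuple[Mask, Mask]:
    """Return (seeing, blind) masks; mask[q][k] is True when query position q
    may attend to key position k."""
    n = len(token_types)
    # Seeing path: ordinary causal mask, so text can attend to earlier image tokens.
    seeing = [[k <= q for k in range(n)] for q in range(n)]
    # Blind path: same causal mask, minus every text-query -> image-key entry.
    blind = [
        [
            seeing[q][k]
            and not (token_types[q] == "text" and token_types[k] == "image")
            for k in range(n)
        ]
        for q in range(n)
    ]
    return seeing, blind

# Toy sequence: three image tokens followed by two text tokens.
token_types = ["image", "image", "image", "text", "text"]
seeing, blind = build_attention_masks(token_types)

for name, mask in (("seeing", seeing), ("blind", blind)):
    print(name)
    for row in mask:
        print("".join("1" if allowed else "." for allowed in row))
```

Because only the mask changes, the token sequence and tensor shapes are identical in both passes, which is one way to read "keeping input tensors matched".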

VIG alignment

The objective enlarges the log-likelihood gap between seeing and blind states and suppresses high-confidence blind responses.
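The page does not give the loss itself, so the following is only a rough sketch of a term with the described behaviour, not the paper's objective. The functional form, the `margin` and `blind_ceiling` thresholds, and both weights are hypothetical. One part shrinks as the seeing-versus-blind log-likelihood gap grows; the other is active only when the blind pass is still confident on average per token.

```python
import math

def softplus(z: float) -> float:
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def grounding_penalty(
    seeing_logprob: float,
    blind_logprob: float,
    num_tokens: int,
    margin: float = 1.0,         # hypothetical target for the gap
    blind_ceiling: float = -1.0, # hypothetical cap on blind per-token log-prob
    gap_weight: float = 1.0,     # hypothetical weight
    blind_weight: float = 1.0,   # hypothetical weight
) -> float:
    """Illustrative penalty on one preferred response (lower is better)."""
    # Part 1: decreases monotonically as the seeing-vs-blind gap widens.
    gap = seeing_logprob - blind_logprob
    gap_term = softplus(margin - gap)
    # Part 2: positive only when the blind pass remains confident per token.
    blind_avg = blind_logprob / num_tokens
    blind_term = max(0.0, blind_avg - blind_ceiling)
    return gap_weight * gap_term + blind_weight * blind_term

# Made-up sequence log-probs for the same three-token response.
penalty_grounded = grounding_penalty(seeing_logprob=-0.4, blind_logprob=-6.0, num_tokens=3)
penalty_lazy = grounding_penalty(seeing_logprob=-0.4, blind_logprob=-0.6, num_tokens=3)

print(f"grounded: {penalty_grounded:.3f}  lazy: {penalty_lazy:.3f}")
```

In this toy setting the response that stays confident without the image is penalized far more than the one whose confidence collapses, which matches the qualitative behaviour described above.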

Visual Information Gain

VIG(y) = log p(y | image, text) - log p(y | blind, text)
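As a worked example of the formula, the sketch below computes VIG from per-token log-probabilities, assuming log p(y | ...) is the sum of token log-probabilities of the same response y under each state. The function names and all numbers are illustrative, not taken from the paper.

```python
from typing import Sequence

def sequence_logprob(token_logprobs: Sequence[float]) -> float:
    """log p(y | ...) as the sum of per-token log-probabilities of response y."""
    return sum(token_logprobs)

def visual_information_gain(
    seeing_token_logprobs: Sequence[float],
    blind_token_logprobs: Sequence[float],
) -> float:
    """VIG(y) = log p(y | image, text) - log p(y | blind, text)."""
    if len(seeing_token_logprobs) != len(blind_token_logprobs):
        raise ValueError("both states must score the same response tokens")
    return sequence_logprob(seeing_token_logprobs) - sequence_logprob(blind_token_logprobs)

# Made-up per-token log-probs for one three-token response.
seeing_lp = [-0.1, -0.2, -0.1]
blind_lp_grounded = [-2.0, -1.5, -2.5]  # confidence collapses without the image
blind_lp_lazy = [-0.2, -0.2, -0.2]      # confidence barely changes without the image

vig_grounded = visual_information_gain(seeing_lp, blind_lp_grounded)
vig_lazy = visual_information_gain(seeing_lp, blind_lp_lazy)

print(f"grounded VIG = {vig_grounded:.2f}, lazy VIG = {vig_lazy:.2f}")
```

With these numbers the grounded response has a VIG of about 5.6 nats and the lazy one about 0.2 nats: a small VIG means the response is almost as likely without visual access as with it.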

Results

Stronger grounding with less post-training data

+4.1: POPE Adv gain on Qwen2.5-VL-7B

+5.3: POPE Adv gain on Qwen2.5-VL-72B

25%: share of preference data needed to match full-data DA-DPO

+4.3: zero-shot RefCOCOg Acc@0.5 improvement

Generalization across base model architectures.

Data efficiency and hyperparameter analysis: VIGIL matches strong full-data baselines with 25% of the post-training data and remains stable across KL and grounding weights.

Qualitative examples comparing DPO and VIGIL on perception, counting, and causal reasoning: VIGIL reduces prior-driven guesses in fine-grained perception, counting, and physical-state reasoning.

Why it matters

Alignment should constrain the dependency path, not only the answer.

Standard multimodal preference optimization can improve final answers while leaving visual dependence under-specified. VIGIL instead asks whether the same answer remains plausible when visual evidence is causally removed.

This turns visual grounding into a measurable training signal: high confidence is rewarded only when the model actually needs the image to sustain that confidence.

Citation

Cite VIGIL

If our work is helpful to your research, please consider citing VIGIL. Thank you.

@inproceedings{xiao2026vigil,
    title={Staying VIGILant: Mitigating Visual Laziness via Counterfactual Visual Alignment in MLLMs},
    author={Xiao, Xi and Liu, Chen and Liao, Chih-Ting and Zhang, Yunbei and Lan, Qizhen and Wei, Yuxiang and Zhao, Lin and Wang, Janet and Gu, Jianyang and Ye, Muchao and Wang, Tianyang and Xu, Hao},
    booktitle={European Conference on Computer Vision},
    year={2026},
    organization={Springer}
}