Seeing Far and Clearly: Mitigating Hallucinations in MLLMs with Attention Causal Decoding

1Monash University, 2MBZUAI, 3XJTLU, 4Shanghai Jiao Tong University, 5Fudan University, 6University of Minnesota, 7Cornell University
Figure: (a) Attention Collapse in MLLMs: outlier tokens from different modalities are assigned disproportionately high attention scores, hindering interaction between relevant tokens. (b) Positional Information Decay: as text generation progresses, attention to visual information gradually diminishes. (c) Our FarSight, as a plug-in, mitigates these issues by effectively reducing attention interference from outlier tokens and improving response accuracy.

Abstract

Recent advancements in multimodal large language models (MLLMs) have significantly improved performance in visual question answering. However, they often suffer from hallucinations. In this work, we categorize hallucinations into two main types: initial hallucinations and snowball hallucinations. We argue that adequate contextual information can be extracted directly from the token interaction process. Inspired by causal inference, we propose a decoding strategy that leverages causal masks to establish information propagation between multimodal tokens. Our hypothesis is that insufficient interaction between these tokens leads the model to rely on outlier tokens, overlooking dense and rich contextual cues. We therefore propose to intervene in the propagation process by tackling outlier tokens to enhance in-context inference. To this end, we present FarSight, a versatile plug-and-play decoding strategy that reduces attention interference from outlier tokens merely by optimizing the causal mask. The heart of our method is effective token propagation. We design an attention register structure within the upper triangular matrix of the causal mask that dynamically allocates attention to capture the attention diverted to outlier tokens. Moreover, we propose a positional awareness encoding method with a diminishing masking rate, allowing the model to attend to tokens further back in the sequence, which is especially important for video sequence tasks. With extensive experiments, FarSight demonstrates significant hallucination-mitigating performance across different MLLMs on both image and video benchmarks, proving its effectiveness.

Methodology

Figure: The scheme of the proposed FarSight strategy, which integrates with the softmax operation and replaces the traditional causal mask. Specifically, the attention score matrix \( \omega \) is first cleared of attention values in its upper triangular part; register-attention scores are then added via the matrix \( \mathcal{P} \), which decays linearly in the upper triangular part and is zero in the lower triangular part; finally, the softmax is computed. After the softmax operation, the remaining attention probabilities in the upper triangular part are cleared so that the causal decoding property is preserved.
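
Below is a minimal PyTorch sketch of the attention step described in this caption. It is an illustrative reconstruction under stated assumptions, not the released implementation: the function name farsight_attention, the tensor layout, and the parameterization of \( \mathcal{P} \) as a linear decay with rate sigma are all assumptions made for the example.

import torch
import torch.nn.functional as F

def farsight_attention(q, k, v, sigma=0.1):
    # q, k, v: (batch, heads, seq_len, head_dim)
    L = q.size(-2)
    scale = q.size(-1) ** 0.5
    scores = (q @ k.transpose(-2, -1)) / scale  # attention score matrix "omega"

    # Strict upper triangle: the positions repurposed as attention registers.
    upper = torch.triu(torch.ones(L, L, dtype=torch.bool, device=q.device), diagonal=1)

    # 1) Clear the attention values in the upper triangular part of omega.
    scores = scores.masked_fill(upper, 0.0)

    # 2) Add register-attention scores P: linearly decaying in the upper triangle,
    #    zero in the lower triangle. The -sigma * offset form is an assumed
    #    instantiation of the caption's "linear decay".
    idx = torch.arange(L, device=q.device)
    offset = (idx[None, :] - idx[:, None]).clamp(min=0).to(scores.dtype)
    P = torch.where(upper, -sigma * offset, torch.zeros_like(offset))
    scores = scores + P

    # 3) Softmax over the full row: the register slots absorb probability mass
    #    that would otherwise pile onto outlier tokens.
    probs = F.softmax(scores, dim=-1)

    # 4) Clear the remaining upper-triangular probabilities so that decoding
    #    stays causal (no token attends to the future in the output).
    probs = probs.masked_fill(upper, 0.0)

    return probs @ v

Note that after step 4 the rows of the attention matrix no longer sum to one: per the caption, the mass captured by the registers is discarded rather than renormalized, which is how this sketch realizes the attention-register idea.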

Visualization

Figure: Qualitative visualization of FarSight on the image understanding task with LLaVA-1.5. (a) Comparison of the average attention allocated to images during text generation among Base (vanilla MLLMs), EDVT, and our FarSight; (b) visual attention decay across different methods over the generation of 60 text tokens; (c) FarSight's attention distribution on images under varying decay rates \( \sigma \). More detailed visualizations of images and videos are provided in Appendix F.
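
For panel (a), here is a hedged sketch of how such a statistic could be computed: average the attention probability mass that each generated token assigns to the image tokens. The attention-tensor layout and the image-token index range are assumptions for illustration, not the paper's actual instrumentation.

import torch

def avg_image_attention(attn, img_start, img_end):
    # attn: (layers, heads, seq_len, seq_len) attention probabilities from one
    # forward pass; image tokens assumed to occupy positions [img_start, img_end).
    mean_attn = attn.mean(dim=(0, 1))                   # average over layers and heads
    return mean_attn[:, img_start:img_end].sum(dim=-1)  # per-step mass on image tokens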

BibTeX


@misc{tang2025seeingfarclearlymitigating,
      title={Seeing Far and Clearly: Mitigating Hallucinations in MLLMs with Attention Causal Decoding},
      author={Feilong Tang and Chengzhi Liu and Zhongxing Xu and Ming Hu and Zelin Peng and Zhiwei Yang and Jionglong Su and Minquan Lin and Yifan Peng and Xuelian Cheng and Imran Razzak and Zongyuan Ge},
      year={2025},
      eprint={2505.16652},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2505.16652},
}