Systematic Reward Gap Optimization for Mitigating VLM Hallucinations

Lehan He*1,2 , Zeren Chen*1,3 , Zhelun Shi1 , Tianyu Yu4 , Jing Shao†2,3 , Lu Sheng†1
1School of Software, Beihang University, 2Shanghai Innovation Institute
3Shanghai AI Laboratory, 4Tsinghua University

*Equal contribution, †Corresponding authors


Background and Motivation

Despite their success, Vision Language Models (VLMs) suffer from a critical limitation: visual hallucinations. They might confidently describe non-existent objects, misrepresent attributes, or misjudge spatial relationships, posing significant risks in safety-critical scenarios.

Recent efforts to mitigate hallucinations increasingly leverage alignment techniques like Direct Preference Optimization (DPO). However, the success of DPO critically hinges on the true reward gaps within preference pairs. Current methods, relying on ranking or rewriting strategies, struggle to optimize these reward gaps systematically. A core difficulty lies in precisely characterizing and strategically manipulating the reward gap configuration to guide the model effectively.
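For context, the reward gap in question is the quantity inside DPO's sigmoid: DPO fine-tunes the policy against an implicit reward defined by its likelihood ratio to a reference model (standard DPO notation, not specific to TPR):

```latex
% DPO's implicit reward for a response y given an image-prompt input x
r_\theta(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}

% DPO loss on a preference pair (y_w chosen, y_l rejected); the argument
% of \sigma is the reward gap that the preference data must encode
\mathcal{L}_{\mathrm{DPO}} = -\,\mathbb{E}_{(x,\, y_w,\, y_l)}
\left[ \log \sigma\big( r_\theta(x, y_w) - r_\theta(x, y_l) \big) \right]
```

If the pair's true quality difference is small or noisy, the gap being optimized is uninformative, which is why controlling its configuration matters.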

(a) Topic-level Preference Rewriting. Depending on the selection strategy, TPR selectively replaces each topic using the model's internally resampled candidates to adjust the reward gap. (b) Data Efficiency. Among methods that do not rely on manual annotation, TPR achieves the best data efficiency for visual hallucination reduction.

Topic-level Preference Rewriting

To address these limitations, we propose Topic-level Preference Rewriting (TPR), a novel framework designed for the systematic optimization of reward gap configuration. TPR operates at the topic level, providing precise, fine-grained control over semantic details by first obtaining a high-quality, diverse pool of topic-level alternatives.


These alternatives are then used to construct preference pairs through selective replacement. By strategically choosing between the highest- and lowest-scoring alternatives for each topic, we can deliberately control the resulting reward gap, creating high-quality preference pairs for subsequent model alignment.


The overall pipeline can be summarized in the following steps:

(1) Decomposing candidate responses from the VLM into fine-grained semantic units.
(2) Grouping these units into distinct topic clusters based on textual consistency and visual correlation.
(3) Generating a diverse pool of topic-level alternatives via intra-topic self-resampling.
(4) Constructing preference pairs by selectively rewriting a template response with high- and low-scoring alternatives for each topic.
(5) Fine-tuning the reference model on the curated preference data via DPO.
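The selective-rewriting step (steps 3–4 above) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the topic-keyed template, the toy alternatives, and the scorer are all hypothetical stand-ins for the paper's resampled candidates and visual-faithfulness scoring.

```python
def build_preference_pair(template, alternatives, score):
    """Rewrite `template` topic by topic: the chosen response takes the
    highest-scoring alternative per topic, the rejected response the
    lowest-scoring one, so the reward gap is set deliberately."""
    chosen, rejected = dict(template), dict(template)
    for topic, alts in alternatives.items():
        ranked = sorted(alts, key=lambda alt: score(topic, alt))
        chosen[topic] = ranked[-1]   # highest-scoring alternative
        rejected[topic] = ranked[0]  # lowest-scoring alternative
    return chosen, rejected

# Toy usage: two topics drawn from a resampled candidate pool, scored by a
# stand-in faithfulness function (hand-assigned numbers, for illustration).
template = {"object": "a dog", "color": "brown"}
alternatives = {
    "object": ["a dog", "a cat"],
    "color": ["brown", "purple"],
}
toy_scores = {
    ("object", "a dog"): 0.9, ("object", "a cat"): 0.1,
    ("color", "brown"): 0.8, ("color", "purple"): 0.2,
}
chosen, rejected = build_preference_pair(
    template, alternatives, lambda t, a: toy_scores[(t, a)])
# chosen keeps the faithful content; rejected collects the low-scoring
# alternatives, widening the gap between the two responses.
```

Picking both extremes per topic is what lets the pair's reward gap be configured rather than left to chance, as a ranking over whole responses would.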

📃 Highlights

Without bells and whistles, TPR achieves state-of-the-art performance on multiple hallucination benchmarks, outperforming previous methods by an average of ~20%. Notably, it significantly reduces hallucinations by up to 93% on ObjectHal-Bench and ~41% on MMHal-Bench, while also exhibiting superior data efficiency.



Data and Feedback Quality Analysis: Our meticulous curation process yields higher-quality data. As shown in Figure 4, we evaluate the quality of the overall constructed responses against GPT-4V, the quality of the intermediate topic alternatives, and the distribution of hallucination types. TPR's topic-focused curation not only improves downstream performance but also genuinely enhances the intrinsic quality of the preference data, enabling robust VLM alignment.

(a) Quality of Overall Constructed Responses. (b) Quality of Topic Alternatives. (c) Statistics of Hallucination Types.

🖌 Examples

Correct content and hallucinations are highlighted in different colors. Below are qualitative results comparing our aligned model (TPR-7B) with the baseline and a more powerful model.


BibTeX


        @inproceedings{he2025systematic,
          title={Systematic Reward Gap Optimization for Mitigating VLM Hallucinations},
          author={He, Lehan and Chen, Zeren and Shi, Zhelun and Yu, Tianyu and Shao, Jing and Sheng, Lu},
          booktitle={39th Conference on Neural Information Processing Systems (NeurIPS)},
          year={2025}
        }