TemMed-Bench: Evaluating Temporal Medical Image Reasoning in Vision-Language Models

1UCLA  2Amazon 

TemMed-Bench



TemMed-Bench features three primary highlights.

  • Temporal reasoning focus: Each sample in TemMed-Bench includes historical condition information, which challenges models to analyze changes in patient conditions over time.
  • Multi-image input: Each sample in TemMed-Bench contains multiple images from different visits as input, emphasizing the need for models to process and reason over multiple images.
  • Diverse task suite: TemMed-Bench comprises three tasks: VQA, report generation, and image-pair selection. It also includes a knowledge corpus of more than 17,000 instances to support retrieval-augmented generation (RAG); an illustrative data layout is sketched after this list.
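To make the composition concrete, below is a minimal, hypothetical sketch of how a TemMed-Bench instance and a knowledge-corpus entry could be represented in code. All class and field names are illustrative assumptions, not the benchmark's actual schema.

from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class TemMedInstance:
    """Hypothetical layout of one TemMed-Bench sample (illustrative only)."""
    historical_image: str    # path to the image from an earlier visit
    current_image: str       # path to the image from the current visit
    task: str                # "vqa" | "report_generation" | "image_pair_selection"
    question: Optional[str]  # question text for VQA / image-pair selection, if applicable
    answer: Optional[str]    # gold answer or gold report describing the condition change
    candidate_image_pairs: Optional[List[Tuple[str, str]]] = None  # candidates for image-pair selection

@dataclass
class CorpusEntry:
    """One of the 17,000+ knowledge-corpus instances used for retrieval augmentation."""
    historical_image: str
    current_image: str
    report: str              # report describing the condition change between the two visits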

TemMed-Bench -- Composition




  • Examples of the three tasks in TemMed-Bench. Each question in these tasks is designed to challenge LVLMs' ability to analyze condition changes, providing a comprehensive evaluation of their temporal medical image reasoning ability.




  • Comparison with previous works. TemMed-Bench focuses on evaluating LVLMs in temporal reasoning over multiple medical images.

TemMed-Bench -- Statistics




  • Key statistics and keywords distribution of TemMed-Bench

Abstract

Existing medical reasoning benchmarks for vision-language models primarily focus on analyzing a patient's condition based on an image from a single visit. However, this setting deviates significantly from real-world clinical practice, where doctors typically refer to a patient's historical conditions to provide a comprehensive assessment by tracking their changes over time. In this paper, we introduce TemMed-Bench, the first benchmark designed for analyzing changes in patients' conditions between different clinical visits, which challenges large vision-language models (LVLMs) to reason over temporal medical images. TemMed-Bench consists of a test set comprising three tasks - visual question answering (VQA), report generation, and image-pair selection - and a supplementary knowledge corpus of over 17,000 instances. With TemMed-Bench, we evaluate twelve LVLMs, comprising six proprietary and six open-source models. Our results show that most LVLMs lack the ability to analyze patients' condition changes over temporal medical images, and a large proportion perform only at a random-guessing level in the closed-book setting. In contrast, GPT o3, o4-mini, and Claude 3.5 Sonnet demonstrate comparatively decent performance, though they have yet to reach the desired level. To enhance the tracking of condition changes, we explore augmenting the input with retrieved information in both the visual and textual modalities in the medical domain. We show that multi-modal retrieval augmentation yields notably higher performance gains than either no retrieval or text-only retrieval across most models on our benchmark, with the VQA task showing an average improvement of 2.59%. Overall, we compose a benchmark grounded in real-world clinical practice that reveals LVLMs' limitations in temporal medical image reasoning and highlights multi-modal retrieval augmentation as a promising direction worth exploring to address this challenge.

Evaluation Results



  • Evaluation results on TemMed-Bench in the closed-book setting. TemMed-Bench clearly reveals the limitations of current LVLMs in temporal medical image reasoning: most LVLMs perform only around the random-guessing level on the VQA and image-pair selection tasks and achieve relatively low average scores on the report generation task.




  • Evaluation results on TemMed-Bench with text-only and multi-modal retrieval augmentation (using top-1 retrieval). The evaluation results demonstrate that multi-modal retrieval augmentation generally yields greater performance improvements across most models compared to text-only retrieval augmentation. Notably, compared to their text-only counterparts, HealthGPT, Claude 3.5 Sonnet, and GPT-4o demonstrate substantial gains in the multi-modal setting, with increases in VQA accuracy of 10.85%, 7.90%, and 4.75%, and improvements in report generation average score of 0.73, 0.68, and 1.59, respectively.


Analysis and Discussion


Ablation Study on Retrieval Methods

Results indicate that pairwise image retrieval achieves the highest performance, primarily for two reasons. First, for image-to-text retrieval, a report in TemMed-Bench does not correspond to a single image: it describes the condition change between two images, so directly computing the feature similarity between the report and a single image introduces bias. Second, image-to-image retrieval relies solely on the similarity of current-visit images; it ensures that the target and retrieved instances have similar current conditions, but does not guarantee similar condition changes. In TemMed-Bench, therefore, only by considering both the historical and current images during retrieval, and requiring similarity in both, can the retrieved instances reflect similar condition changes.
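As a rough illustration of the pairwise retrieval idea, the sketch below scores each corpus entry by combining the similarity of the historical images with the similarity of the current images, so that a high score requires both visits to match. The image encoder, the precomputed corpus features, and the equal weighting are assumptions made for illustration, not the exact configuration used in our experiments.

import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine similarity with a small epsilon to avoid division by zero.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def pairwise_image_retrieval(query_hist, query_curr, corpus, encode, k=1):
    """Rank corpus entries by joint similarity of (historical, current) image pairs.

    `encode` is an assumed image encoder returning a feature vector (e.g., a
    CLIP-style visual encoder); `corpus` is a list of dicts with precomputed
    'hist_feat' and 'curr_feat' arrays plus the associated 'report' text.
    """
    q_hist, q_curr = encode(query_hist), encode(query_curr)
    scored = []
    for entry in corpus:
        # Require similarity of BOTH visits so the retrieved pair reflects a
        # similar condition *change*, not merely a similar current condition.
        score = 0.5 * cosine(q_hist, entry["hist_feat"]) + 0.5 * cosine(q_curr, entry["curr_feat"])
        scored.append((score, entry))
    scored.sort(key=lambda x: x[0], reverse=True)
    return [entry for _, entry in scored[:k]]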


Impact of Top-k Retrieval Augmentation

To further analyze how the number of retrieved instances affects augmentation performance, we evaluate LVLMs under top-1 through top-5 retrieval augmentation settings. Notably, multi-modal retrieval augmentation consistently outperforms text-only retrieval augmentation across all of these settings, further confirming the effectiveness of incorporating multi-modal retrieved information to enhance LVLM performance in the medical domain.

Furthermore, comparing the performance improvement from top-1 to top-5, GPT-4o shows a substantially larger accuracy gain (6.6%) than HealthGPT (2.37%). These results suggest that LVLMs with stronger multi-image processing capabilities benefit more from a larger number of retrieved instances, underscoring the importance of improving multi-image processing ability to fully leverage multi-modal retrieved information.
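For concreteness, the sketch below shows one plausible way to assemble a multi-modal, top-k retrieval-augmented input for an LVLM: each retrieved instance contributes its image pair and report, followed by the query image pair and the question. The chat-style message schema and field names are assumptions for illustration; the exact prompt format depends on the target model's API.

def build_multimodal_rag_input(query_hist, query_curr, question, retrieved, k=3):
    """Assemble a chat-style multi-modal prompt with top-k retrieved instances.

    `retrieved` is assumed to be a ranked list of dicts holding image paths
    ('hist_img', 'curr_img') and a 'report' describing their condition change.
    """
    content = []
    for i, inst in enumerate(retrieved[:k], start=1):
        content += [
            {"type": "text", "text": f"Reference case {i}: historical and current images."},
            {"type": "image", "path": inst["hist_img"]},
            {"type": "image", "path": inst["curr_img"]},
            {"type": "text", "text": f"Reference report {i}: {inst['report']}"},
        ]
    content += [
        {"type": "text", "text": "Query case: historical and current images."},
        {"type": "image", "path": query_hist},
        {"type": "image", "path": query_curr},
        {"type": "text", "text": question},
    ]
    return [{"role": "user", "content": content}]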


BibTeX

@misc{zhang2025temmedbenchevaluatingtemporalmedical,
            title={TemMed-Bench: Evaluating Temporal Medical Image Reasoning in Vision-Language Models}, 
            author={Junyi Zhang and Jia-Chen Gu and Wenbo Hu and Yu Zhou and Robinson Piramuthu and Nanyun Peng},
            year={2025},
            eprint={2509.25143},
            archivePrefix={arXiv},
            primaryClass={cs.CV},
            url={https://arxiv.org/abs/2509.25143}, 
      }