arxiv:2406.19280

HuatuoGPT-Vision, Towards Injecting Medical Visual Knowledge into Multimodal LLMs at Scale

Published on Jun 27, 2024
· Submitted by jymcc on Jul 1, 2024
#2 Paper of the day
Authors: Ke Ji, et al.

Abstract

The rapid development of multimodal large language models (MLLMs), such as GPT-4V, has led to significant advancements. However, these models still face challenges in medical multimodal capabilities due to limitations in the quantity and quality of medical vision-text data, stemming from data privacy concerns and high annotation costs. While pioneering approaches utilize PubMed's large-scale, de-identified medical image-text pairs to address these limitations, they still fall short due to inherent data noise. To tackle this, we refined medical image-text pairs from PubMed and employed MLLMs (GPT-4V) in an 'unblinded' capacity to denoise and reformat the data, resulting in the creation of the PubMedVision dataset with 1.3 million medical VQA samples. Our validation demonstrates that: (1) PubMedVision can significantly enhance the medical multimodal capabilities of current MLLMs, showing significant improvement in benchmarks including the MMMU Health & Medicine track; (2) manual checks by medical experts and empirical results validate the superior data quality of our dataset compared to other data construction methods. Using PubMedVision, we train a 34B medical MLLM HuatuoGPT-Vision, which shows superior performance in medical multimodal scenarios among open-source MLLMs.
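As a rough illustration of the denoising-and-reformatting step the abstract describes, the sketch below prompts an MLLM in an 'unblinded' way, i.e. with the image alongside its PubMed caption and surrounding text. It assumes the OpenAI Python SDK; the model name, prompt wording, and JSON schema are illustrative guesses, not the paper's exact pipeline.

```python
import base64
import json

from openai import OpenAI

client = OpenAI()

def reformat_to_vqa(image_path: str, caption: str, context: str) -> dict:
    """Rewrite one PubMed image-text pair into a VQA sample.

    Hypothetical sketch: the paper's actual prompt and output schema differ.
    """
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()

    response = client.chat.completions.create(
        model="gpt-4o",  # stand-in for the GPT-4V endpoint used in the paper
        response_format={"type": "json_object"},
        messages=[{
            "role": "user",
            "content": [
                # "Unblinded": the model receives the image itself, not just
                # the text, so it can denoise captions against the pixels.
                {"type": "text", "text": (
                    "You are shown a medical image with its PubMed caption "
                    "and surrounding text. If the pair is noisy or "
                    "uninformative, return {\"question\": null}. Otherwise "
                    "write one question a clinician might ask about the "
                    "image and a detailed answer grounded in the caption. "
                    "Return JSON with keys 'question' and 'answer'.\n"
                    f"Caption: {caption}\nContext: {context}"
                )},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    return json.loads(response.choices[0].message.content)
```

Running this over each refined PubMed pair and keeping the non-null outputs would yield VQA samples in the spirit of PubMedVision's construction.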

Community

Paper author Paper submitter

This study introduces PubMedVision, a dataset of 1.3 million high-quality medical image-text samples, created to overcome the challenges that multimodal large language models (MLLMs) face in medical scenarios. We refined image-text pairs from PubMed papers and employed a GPT-4V-powered reformatting method to enhance the data. Experiments demonstrate that: (1) PubMedVision can significantly improve the medical multimodal capabilities of MLLMs, enabling models like LLaVA-v1.5-LLaMA-3-8B to outperform other open-source MLLMs in medical multimodal scenarios; (2) manual checks by medical experts validate the superior data quality of PubMedVision. Based on PubMedVision, we construct our medical multimodal model, HuatuoGPT-Vision. We open-source our dataset and models.


That's cool! I don't see a comparison against GPT-4o using PubMedVision, though. It would be interesting to measure GPT-4o's baseline performance, then use PubMedVision samples for few-shot learning (perhaps many-shot, as in https://arxiv.org/abs/2405.09798), and see how competitive the open-source models are against GPT-4o enhanced with PubMedVision.
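For what it's worth, here is a minimal sketch of that many-shot setup, assuming the OpenAI chat API and a hypothetical PubMedVision record schema with 'image', 'question', and 'answer' fields:

```python
import base64

from openai import OpenAI

client = OpenAI()

def image_part(path: str) -> dict:
    # Encode a local image as a base64 data URL for the chat API.
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    return {"type": "image_url",
            "image_url": {"url": f"data:image/png;base64,{b64}"}}

def many_shot_answer(examples: list[dict],
                     test_image: str, test_question: str) -> str:
    """Answer a medical VQA query with k in-context PubMedVision exemplars.

    The 'image'/'question'/'answer' keys are assumed, not the released schema.
    """
    messages = []
    for ex in examples:
        # Each exemplar becomes one user turn (image + question) and one
        # assistant turn (reference answer).
        messages.append({"role": "user", "content": [
            {"type": "text", "text": ex["question"]},
            image_part(ex["image"]),
        ]})
        messages.append({"role": "assistant", "content": ex["answer"]})
    # The actual test query comes last.
    messages.append({"role": "user", "content": [
        {"type": "text", "text": test_question},
        image_part(test_image),
    ]})
    response = client.chat.completions.create(model="gpt-4o",
                                              messages=messages)
    return response.choices[0].message.content
```

Comparing this many-shot GPT-4o against HuatuoGPT-Vision on the same benchmarks would show how much of the gap the dataset closes via in-context learning alone.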

HuatuoGPT-Vision is very interesting! Will it also be integrated into hospital medical systems soon, like HuatuoGPT?

·
Paper author

Thank you for your interest! Integration into hospital systems is currently being planned.

·
Paper author

Thank you very much for sharing. It's a great summary!

Awesome! This work is really interesting and valuable!


Models citing this paper: 4
Datasets citing this paper: 3
Spaces citing this paper: 2
Collections including this paper: 13