Xia, T.C., Bertini, F., Montesi, D. (2025). Large Language Models Evaluation for PubMed Extractive Summarisation. ACM TRANSACTIONS ON COMPUTING FOR HEALTHCARE, 7(1), 1-23 [10.1145/3766905].
Large Language Models Evaluation for PubMed Extractive Summarisation
Xia, Tian Cheng; Bertini, Flavio; Montesi, Danilo
2025
Abstract
The increasingly large amount of available biomedical literature is making it difficult to gather and synthesise all the necessary information. Moreover, this domain-specific task demands a high level of reliability in the generated text and concepts. Pre-trained large language models have recently shown promising results. Given the specific requirements of biomedical text summarisation, our evaluation focuses on extractive models, prioritising the accuracy of the generated text. In this paper, we evaluate the capabilities of eighteen general-domain and biomedical pre-trained language models in various configurations on the biomedical extractive summarisation task, using one single-document and two multi-document datasets consisting of 33,000, 5,000, and 470,000 PubMed articles, respectively. We performed the comparison using several well-known metrics, namely ROUGE-1, ROUGE-2, ROUGE-L, BERTScore, BLEU, and METEOR. The main contribution of this work lies in providing a detailed performance analysis, highlighting the differences between general-domain and biomedical models, and identifying key factors that influence model performance in extractive summarisation tasks within the biomedical domain. Experimental results show that biomedical models tend to achieve higher recall, while general-domain models produce higher precision. This corresponds to more expressive summaries for biomedical models and shorter summaries for general-domain models.

| File | Size | Format |
|---|---|---|
| 3766905.pdf (open access; publisher's PDF / Version of Record; licence: Creative Commons Attribution - NonCommercial - NoDerivatives (CC BY-NC-ND)) | 831.93 kB | Adobe PDF |
Documents in IRIS are protected by copyright, and all rights are reserved unless otherwise indicated.