Xia, T.C., Bertini, F., Montesi, D. (2025). Large Language Models Evaluation for PubMed Extractive Summarisation. ACM TRANSACTIONS ON COMPUTING FOR HEALTHCARE, 7(1), 1-23 [10.1145/3766905].
Large Language Models Evaluation for PubMed Extractive Summarisation
Xia, Tian Cheng; Bertini, Flavio; Montesi, Danilo
2025
Abstract
The increasingly large amount of available biomedical literature is making it difficult to gather and synthesise all the necessary information. Moreover, this domain-specific task demands a high level of reliability in the generated text and concepts. Pre-trained large language models have recently shown promising results. Given the specific requirements of biomedical text summarisation, our evaluation focuses on extractive models, prioritising the accuracy of the generated text. In this paper, we evaluate the capabilities of eighteen general-domain and biomedical pre-trained language models in various configurations on the biomedical extractive summarisation task, using one single-document and two multi-document datasets consisting of 33,000, 5,000, and 470,000 PubMed articles, respectively. We performed the comparison using several well-known metrics, namely ROUGE-1, ROUGE-2, ROUGE-L, BERTScore, BLEU, and METEOR. The main contribution of this work lies in providing a detailed performance analysis, highlighting the differences between general-domain and biomedical models, and identifying key factors that influence model performance in extractive summarisation tasks within the biomedical domain. Experimental results show that biomedical models tend to achieve higher recall, while general-domain models produce higher precision. This corresponds to more expressive summaries for biomedical models and shorter summaries for general-domain models.

| File | Size | Format |
|---|---|---|
| 3766905.pdf (open access; publisher's PDF / Version of Record; licence: Creative Commons Attribution - NonCommercial - NoDerivatives (CC BY-NC-ND)) | 831.93 kB | Adobe PDF |
Documents in IRIS are protected by copyright, and all rights are reserved unless otherwise indicated.