Cocchieri, A., Frisoni, G., Zangrillo, F., Ragazzi, L., Martínez Galindo, M., Tagliavini, G., et al. (2026). OpenBioNER-v2: A Suite of Lightweight Models for Zero-Shot Medical Named Entity Recognition via Type Descriptions. Expert Systems with Applications, 318, 131725-131754. doi:10.1016/j.eswa.2026.131725
OpenBioNER-v2: A Suite of Lightweight Models for Zero-Shot Medical Named Entity Recognition via Type Descriptions
Alessio Cocchieri (co-first author); Giacomo Frisoni (co-first author); Francesco Zangrillo (co-first author); Luca Ragazzi (co-first author); Giuseppe Tagliavini; Gianluca Moro (co-first author)
2026
Abstract
Named entity recognition (NER) in medicine is challenging due to specialized terminology, inconsistent annotation guidelines, and the continuous emergence of new entity types, requiring models that can adapt to unseen targets. Large language models (LLMs) exhibit strong generalization but are impractical for scalable deployment, whereas recent encoder-only approaches leverage entity names for zero-shot inference but struggle with disambiguation in complex domains. We introduce OpenBioNER-v2, a family of lightweight transformer encoders (15M-110M parameters) designed for zero-shot recognition of biomedical and clinical entities by conditioning on natural language descriptions of target types. Our cross-encoder architecture jointly models the input text and entity-type descriptions, enabling semantic matching. Pretrained on LLM-generated silver annotations and multi-view descriptions covering thousands of medical types, OpenBioNER-v2 achieves state-of-the-art results across 11 benchmarks, including a new dataset for personal de-identification. Variants with fewer than 56M parameters outperform both large and small language models, such as UniversalNER and GliNER. Ablation studies reveal effective strategies for formulating descriptions. All data, code, and model checkpoints are publicly released under open-science principles.
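The abstract describes a cross-encoder that jointly encodes the input text with a natural-language description of the target entity type. A minimal sketch of how such a description-conditioned input pair might be formed is shown below; the function name, separator token, and description format are illustrative assumptions, not the actual OpenBioNER-v2 API or prompt format.

```python
def build_cross_encoder_input(text: str, type_description: str,
                              sep: str = "[SEP]") -> str:
    """Pair a sentence with an entity-type description so a single
    encoder can attend over both jointly (description-conditioned
    zero-shot NER). Illustrative only; not the paper's exact format."""
    return f"{type_description} {sep} {text}"


# Hypothetical natural-language description of an unseen target type.
desc = "Disease: a pathological condition of an organism, e.g. diabetes."
sentence = "The patient was diagnosed with type 2 diabetes."

pair = build_cross_encoder_input(sentence, desc)
print(pair)
```

At inference time, one such pair would be built per candidate entity type, letting the encoder score spans in the text against each description without any type-specific training.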


