Ilaria Bartolini, V.M. (2022). COSINER: COntext SImilarity data augmentation for Named Entity Recognition.
COSINER: COntext SImilarity data augmentation for Named Entity Recognition
Ilaria Bartolini
2022
Abstract
To alleviate the scarcity of manually annotated data in Named Entity Recognition (NER) tasks, data augmentation methods can be applied to automatically generate labeled data and improve the performance of existing models. However, current techniques, which are based on manipulations of the input text, may generate many noisy and mislabeled samples. In this paper we propose COntext SImilarity-based data augmentation for NER (COSINER), a method for NER data augmentation based on context similarity, i.e., we replace entity mentions with the most plausible ones given the available training data and the contexts in which entities usually appear. We conduct experiments on popular benchmark datasets, showing that our method outperforms current baselines in various few-shot scenarios, where training data is assumed to be strongly limited. Experimental results show that not only does COSINER outperform baselines in terms of NER performance in highly limited scenarios (2%, 5%), but its computing times are also comparable to those of the simplest augmentation methods.
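The core idea described in the abstract — replacing an entity mention with the most plausible same-type mention according to the contexts in which mentions appear — can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes a simple bag-of-words context representation and cosine similarity, and all names (`mention_index`, `augment`, etc.) are hypothetical.

```python
import math
from collections import Counter

def context_vector(tokens, span):
    # Hypothetical featurization: a bag-of-words of the tokens
    # surrounding the mention span (the mention itself is excluded).
    start, end = span
    return Counter(tokens[:start] + tokens[end:])

def cosine(a, b):
    # Cosine similarity between two sparse bag-of-words vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def augment(tokens, span, label, mention_index):
    """Replace the mention at `span` with the same-type mention from the
    training data whose typical context is most similar to the current one.

    `mention_index` maps each entity label to a dict from mention tuples
    to aggregated context vectors collected from the training set.
    """
    ctx = context_vector(tokens, span)
    current = tuple(tokens[span[0]:span[1]])
    best, best_sim = None, -1.0
    for mention, profile in mention_index.get(label, {}).items():
        if mention == current:
            continue  # never "replace" a mention with itself
        sim = cosine(ctx, profile)
        if sim > best_sim:
            best, best_sim = list(mention), sim
    if best is None:
        return tokens  # no candidate available: keep the sentence unchanged
    return tokens[:span[0]] + best + tokens[span[1]:]
```

In this toy form the context profiles are raw token counts accumulated per mention; the actual method would rely on learned representations, but the selection step — rank same-label candidates by context similarity and substitute the top one — is the same.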