Recent studies have demonstrated the effectiveness of cross-lingual language model pre-training on different NLP tasks, such as natural language inference and machine translation. In our work, we test this approach on social media data, which are particularly challenging to process within this framework, since the limited length of the textual messages and the irregularity of the language make it harder to learn meaningful encodings. More specifically, we propose a hybrid emoji-based Masked Language Model (MLM) to leverage the common information conveyed by emojis across different languages and improve the learned cross-lingual representation of short text messages, with the goal of performing zero-shot abusive language detection. We compare the results obtained with the original MLM to the ones obtained by our method, showing improved performance on German, Italian and Spanish.

Corazza, M., Menini, S., Cabrio, E., Tonelli, S., Villata, S. (2020). Hybrid Emoji-Based Masked Language Models for Zero-Shot Abusive Language Detection. Association for Computational Linguistics [10.18653/v1/2020.findings-emnlp.84].

Hybrid Emoji-Based Masked Language Models for Zero-Shot Abusive Language Detection

Michele Corazza
2020

Findings of the Association for Computational Linguistics: EMNLP 2020, pp. 943-949
Corazza, Michele; Menini, Stefano; Cabrio, Elena; Tonelli, Sara; Villata, Serena
Files in this item:
2020.findings-emnlp.84.pdf (open access)
Type: Publisher's version (PDF)
License: Open access license. Creative Commons Attribution (CC BY)
Size: 215.2 kB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11585/836682