Gambarelli, G., Gangemi, A., Tripodi, R. (2023). Is Your Model Sensitive? SPeDaC: A New Resource and Benchmark for Training Sensitive Personal Data Classifiers. IEEE ACCESS, 11, 10864-10880 [10.1109/ACCESS.2023.3240089].

Is Your Model Sensitive? SPeDaC: A New Resource and Benchmark for Training Sensitive Personal Data Classifiers

Gambarelli, G.; Gangemi, A.; Tripodi, R.

2023

Abstract

In recent years, there has been an exponential growth of applications, including dialogue systems, that handle sensitive personal information. This has brought to light the extremely important issue of personal data protection in virtual environments. Sensitive information detection (SID) has been studied across different domains and languages in the literature. In the personal data domain, however, the absence of a shared standard benchmark makes comparison with the state of the art difficult. To fill this gap, we introduce and release SPEDAC, a new annotated resource for the identification of sensitive personal data categories in English. SPEDAC enables the evaluation of computational models on three SID subtasks of increasing complexity. SPEDAC 1 is a binary classification task in which a model has to detect whether a sentence contains sensitive information; SPEDAC 2 contains sentences labeled with 5 categories that correspond to macro-domains of personal information; SPEDAC 3 uses fine-grained labeling with 61 personal data categories. We conduct an extensive evaluation of the resource using different state-of-the-art classifiers. The results show that SPEDAC is challenging, particularly for fine-grained classification. Classifiers based on transformer architectures achieve good results on SPEDAC 1 and 2 but have difficulty discriminating among the fine-grained classes in SPEDAC 3.
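
The three subtasks differ mainly in the size of their label space (2, 5, and 61 classes). The following is a minimal sketch, not the authors' evaluation code, of how that growth affects a naive baseline and how a per-class metric could be computed. The class counts come from the abstract; the `Subtask` type, the balanced-class assumption, the choice of macro-averaged F1, and the toy label names are all assumptions made for illustration, since this record does not state which metrics or label names the paper uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Subtask:
    """Hypothetical container: a subtask name and its number of classes."""
    name: str
    num_classes: int


# Class counts are taken from the abstract; nothing else here is from the paper.
SUBTASKS = [
    Subtask("SPEDAC 1", 2),
    Subtask("SPEDAC 2", 5),
    Subtask("SPEDAC 3", 61),
]


def uniform_chance_accuracy(task: Subtask) -> float:
    """Expected accuracy of uniform random guessing, ASSUMING balanced classes."""
    return 1.0 / task.num_classes


def macro_f1(gold: list[str], pred: list[str]) -> float:
    """Unweighted mean of per-class F1 over the classes that occur in `gold`."""
    scores = []
    for label in sorted(set(gold)):
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


if __name__ == "__main__":
    for task in SUBTASKS:
        print(f"{task.name}: chance accuracy {uniform_chance_accuracy(task):.4f}")

    # Toy binary labels, invented for this example (not SPEDAC data).
    gold = ["sensitive", "sensitive", "non-sensitive", "non-sensitive"]
    pred = ["sensitive", "non-sensitive", "non-sensitive", "non-sensitive"]
    print(f"toy macro-F1: {macro_f1(gold, pred):.4f}")
```

Under the balanced-class assumption, random guessing drops from 1/2 on the binary subtask to 1/61 on the fine-grained one. A macro-averaged score weights every class equally, so rare fine-grained categories count as much as frequent ones.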

Year	2023
Journal	IEEE ACCESS
DOI	https://dx.doi.org/10.1109/ACCESS.2023.3240089
Citation	Gambarelli, G., Gangemi, A., Tripodi, R. (2023). Is Your Model Sensitive? SPeDaC: A New Resource and Benchmark for Training Sensitive Personal Data Classifiers. IEEE ACCESS, 11, 10864-10880 [10.1109/ACCESS.2023.3240089].
All authors	Gambarelli, G.; Gangemi, A.; Tripodi, R.
Appears in types	1.01 Journal article

Files in this item:

File	Size	Format
Is_Your_Model_Sensitive_SPEDAC_A_New_Resource_for_the_Automatic_Classification_of_Sensitive_Personal_Data.pdf (open access; description: Article; type: publisher's version (PDF); license: Open Access License, Creative Commons Attribution (CC BY))	3.34 MB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11585/916339
