Detecting Vague Clauses in Italian Privacy Policies Using Transformers, LLMs, and Cross-Lingual Techniques

Giulia Grundler; Mariaceleste Musicco; Andrea Galassi; Francesca Lagioia; Ruta Liepina; Giorgio Resta; Giovanni Sartor; Paolo Torroni
2025

Abstract

Privacy policies often fall short of providing a comprehensive account of how personal data is used, thus failing to comply with GDPR requirements. Such shortcomings hamper users' ability to make informed decisions about using a service and to ensure that their data is used properly and fairly. This calls for automatic tools that can effectively identify potentially unlawful policies. Here we present a new corpus of Italian privacy policies, with clauses labelled by experts in data protection law to indicate how comprehensive the information they provide is. We focus on the categories of data processed, classifying each clause as either sufficiently or insufficiently informative (“vague”). We perform six different classification and detection tasks, comparing the performance of BERT-based models and generative Large Language Models. Addressing multilingualism is crucial in the EU, whose 24 official languages are an integral part of its cultural heritage. Consequently, we also perform cross-language experiments to evaluate whether a pre-existing English corpus or classifiers can be leveraged for Italian and, vice versa, whether our corpus is informative enough to generalize to other languages.
Frontiers in Artificial Intelligence and Applications: ECAI 2025, pp. 4594-4602
Grundler, G., Musicco, M., Galassi, A., Lagioia, F., Liepina, R., Resta, G., et al. (2025). Detecting Vague Clauses in Italian Privacy Policies Using Transformers, LLMs, and Cross-Lingual Techniques [10.3233/FAIA251362].
Grundler, Giulia; Musicco, Mariaceleste; Galassi, Andrea; Lagioia, Francesca; Liepina, Ruta; Resta, Giorgio; Roccu, Sara; Sartor, Giovanni; Torroni, P...
Files in this record:
FAIA-413-FAIA251362.pdf (open access)
Type: Publisher's version (PDF) / Version of Record
License: Open Access license, Creative Commons Attribution-NonCommercial (CC BY-NC)
Size: 311.99 kB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11585/1027194