Grundler, G., Musicco, M., Galassi, A., Lagioia, F., Liepina, R., Resta, G., et al. (2025). Detecting Vague Clauses in Italian Privacy Policies Using Transformers, LLMs, and Cross-Lingual Techniques [10.3233/FAIA251362].
Detecting Vague Clauses in Italian Privacy Policies Using Transformers, LLMs, and Cross-Lingual Techniques
Giulia Grundler; Mariaceleste Musicco; Andrea Galassi; Francesca Lagioia; Ruta Liepina; Giorgio Resta; Giovanni Sartor; Paolo Torroni
2025
Abstract
Privacy policies often fall short of providing a comprehensive account of how personal data is used, thus failing to comply with GDPR requirements. As a result, they hamper users’ ability to make informed decisions about using services and to ensure that their data is used properly and fairly. This calls for automatic tools that can effectively identify potentially unlawful policies. Here we present a new corpus of Italian privacy policies, with clauses labelled by experts in data protection law to indicate how comprehensive the information provided is. We focus on the categories of data processed, classifying each clause as either sufficiently or insufficiently informative (“vague”). We perform six different classification and detection tasks, comparing the performance of BERT-based models and generative Large Language Models. Addressing multilingualism is crucial in the EU, whose 24 official languages are an integral part of its cultural heritage. Consequently, we also perform cross-language experiments to evaluate whether a pre-existing English corpus or classifiers can be leveraged for Italian and, vice versa, whether our corpus is informative enough to generalize to other languages.

| File | Access | Type | License | Size | Format |
|---|---|---|---|---|---|
| FAIA-413-FAIA251362.pdf | Open access | Publisher's version (PDF) / Version of Record | Open access license: Creative Commons Attribution-NonCommercial (CC BY-NC) | 311.99 kB | Adobe PDF |
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.


