This paper introduces a novel approach to data collection from social media by presenting a fully automated pipeline that scrapes and cleans data from Reddit. Designed to complement traditional survey methods and administrative data, the workflow efficiently extracts large volumes of user-generated content, identifies relevant submissions via targeted keywords, and systematically cleans the resulting text for subsequent analysis. Through customized modules that remove duplicates, standardize linguistic features, and handle domain-specific stopwords, the pipeline produces a high-quality dataset readily applicable to sentiment analysis, topic modeling, or other statistical techniques. By capturing timely, organic discussions on rapidly evolving topics, this method offers a valuable supplement to conventional data-collection strategies for researchers, policymakers, and statistical agencies seeking richer, more immediate insights from online communities.
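The cleaning stages named in the abstract (keyword-based relevance filtering, duplicate removal, linguistic standardization, and domain-specific stopword handling) could be sketched roughly as follows. This is a minimal illustration and not the author's implementation: the keyword set, the stopword list, the function names, and the use of `id`, `title`, and `selftext` fields for already-downloaded submissions are all assumptions made for the example, and the scraping step itself is left out.

```python
import re
import unicodedata

# Illustrative values only; the abstract does not list its keywords or stopwords.
KEYWORDS = {"vaccine", "vaccination"}
DOMAIN_STOPWORDS = {"reddit", "post", "thread", "the", "a", "is", "on", "this"}


def normalize(text):
    """Standardize text: Unicode-normalize, lowercase, drop URLs and punctuation."""
    text = unicodedata.normalize("NFKC", text).lower()
    text = re.sub(r"https?://\S+", " ", text)   # remove links
    text = re.sub(r"[^a-z0-9\s]", " ", text)    # keep ASCII letters, digits, spaces
    return re.sub(r"\s+", " ", text).strip()    # collapse whitespace


def remove_stopwords(text, stopwords=DOMAIN_STOPWORDS):
    """Drop tokens that appear in the (assumed) domain-specific stopword list."""
    return " ".join(tok for tok in text.split() if tok not in stopwords)


def clean_submissions(submissions, keywords=KEYWORDS, stopwords=DOMAIN_STOPWORDS):
    """Filter by keyword, remove duplicates, and clean the text of each submission.

    `submissions` is assumed to be an iterable of dicts with the keys
    'id', 'title', and 'selftext'.
    """
    seen = set()
    cleaned = []
    for sub in submissions:
        raw = f"{sub.get('title', '')} {sub.get('selftext', '')}"
        norm = normalize(raw)
        # Relevance: keep only submissions mentioning at least one keyword.
        if not set(norm.split()) & set(keywords):
            continue
        # Duplicates: skip submissions whose normalized text was already seen.
        if norm in seen:
            continue
        seen.add(norm)
        cleaned.append({"id": sub["id"], "text": remove_stopwords(norm, stopwords)})
    return cleaned
```

Deduplicating on the normalized text rather than the raw text treats reposts that differ only in case, punctuation, or links as the same submission; whether the pipeline described here does the same is not stated in the abstract.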

Stracqualursi, L. (2025). Harnessing social media for innovative data collection: A Reddit scraping and data-cleaning pipeline. Bologna: Alma Mater Studiorum - Università di Bologna.

Harnessing social media for innovative data collection: A Reddit scraping and data-cleaning pipeline

Stracqualursi, Luisa (first author)
2025

Published in: Book of Abstract - ITACOSM2025-IASS Satellite Conference: Shaping the future of survey statistics in the data-driven era, p. 102

Use this identifier to cite or link to this document: https://hdl.handle.net/11585/1018463