Stracqualursi, L. (2025). Harnessing social media for innovative data collection: A Reddit scraping and data-cleaning pipeline. Bologna: Alma Mater Studiorum - Università di Bologna.
Harnessing social media for innovative data collection: A Reddit scraping and data-cleaning pipeline
STRACQUALURSI LUISA
First author
2025
Abstract
This paper introduces a novel approach to data collection from social media by presenting a fully automated pipeline that scrapes and cleans data from Reddit. Designed to complement traditional survey methods and administrative data, the workflow efficiently extracts large volumes of user-generated content, identifies relevant submissions via targeted keywords, and systematically cleans the resulting text for subsequent analysis. Through customized modules that remove duplicates, standardize linguistic features, and handle domain-specific stopwords, the pipeline produces a high-quality dataset readily applicable to sentiment analysis, topic modeling, or other statistical techniques. By capturing timely, organic discussions on rapidly evolving topics, this method offers a valuable supplement to conventional data-collection strategies for researchers, policymakers, and statistical agencies seeking richer, more immediate insights from online communities.
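The cleaning stages named in the abstract (keyword filtering, duplicate removal, linguistic standardization, and domain-specific stopword handling) can be illustrated with a minimal sketch. This is not the paper's implementation: the function names `normalize` and `clean_submissions`, the keyword and stopword lists, and the sample texts are all hypothetical, and the scraping step is replaced by a small in-memory list so the example runs without network access.

```python
import re

# Hypothetical lists for illustration; the paper's actual keywords and
# domain-specific stopwords are not given in this record.
KEYWORDS = {"vaccine", "vaccination"}
STOPWORDS = {"the", "a", "is", "and", "to", "of", "reddit", "edit"}


def normalize(text):
    """Lowercase, drop URLs and non-letter characters, collapse whitespace.

    Assumption: English ASCII text; accented letters would also be removed.
    """
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)
    text = re.sub(r"[^a-z\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def clean_submissions(submissions):
    """Return token lists for relevant, de-duplicated submissions."""
    seen = set()
    cleaned = []
    for raw in submissions:
        norm = normalize(raw)
        tokens = norm.split()
        # Keep only submissions that mention at least one target keyword.
        if not KEYWORDS.intersection(tokens):
            continue
        # Treat texts that are identical after normalization as duplicates.
        if norm in seen:
            continue
        seen.add(norm)
        cleaned.append([t for t in tokens if t not in STOPWORDS])
    return cleaned


# Stand-in for scraped Reddit submissions (made-up examples).
sample = [
    "The vaccine rollout is slow! https://example.com/news",
    "the vaccine rollout is slow",
    "Best pizza in Bologna?",
    "EDIT: vaccination appointments are open to everyone",
]
result = clean_submissions(sample)
print(result)
```

Normalizing before the duplicate check means that posts differing only in case, punctuation, or attached links collapse into one record, which is one plausible reading of "remove duplicates"; a real pipeline might instead de-duplicate on submission identifiers.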


