Biodiversity databases provide unprecedented opportunities to use species occurrence data in large-scale biodiversity analyses. However, these records often contain taxonomic uncertainties that can affect the outcomes of downstream analyses. Although several tools have been developed to address these issues, there is limited guidance on how to use and integrate them efficiently. Here, we present a reproducible workflow for handling vascular plant occurrence data and provide the first comparative analysis of R packages for the taxonomic harmonisation of vascular plant names. Our goal is to assess the differences in performance across the tested tools and to highlight best practices for leveraging large biodiversity databases. We first downloaded occurrence data for vascular plants in Italy from the Botanical Information and Ecology Network (BIEN) and the Global Biodiversity Information Facility (GBIF). We then compared seven R packages for taxonomic harmonisation, evaluating their ability to resolve names to accepted taxa and their overall performance. Our results highlight heterogeneity in the number of names resolved by the different tools: packages that rely on plant-specific databases and implement fuzzy matching outperform those that rely on generalist databases and lack fuzzy matching. These findings underscore that the choice of both packages and taxonomic authorities can strongly influence data cleaning outcomes.
Santovito, D., Chiarucci, A., Rocchini, D., Santi, F., Cortès Lobos, R.B., Testolin, R. (2026). Bridging biodiversity gaps: Assessing R tools for harmonising vascular plant records. Ecological Informatics, 93(103543), 1-10 [10.1016/j.ecoinf.2025.103543].
Bridging biodiversity gaps: Assessing R tools for harmonising vascular plant records
Santovito, Diletta; Chiarucci, Alessandro; Rocchini, Duccio; Santi, Francesco; Testolin, Riccardo
2026


