A machine learning approach for the identification of population-informative markers from high-throughput genotyping data: Application to several pig breeds / Schiavo G.; Bertolini F.; Galimberti G.; Bovo S.; Dall'olio S.; Nanni Costa L.; Gallo M.; Fontanesi L. - In: ANIMAL. - ISSN 1751-7311. - Electronic. - 14:2(2020), pp. 223-232. [10.1017/S1751731119002167]

A machine learning approach for the identification of population-informative markers from high-throughput genotyping data: Application to several pig breeds

Schiavo G.; Bertolini F.; Galimberti G.; Bovo S.; Dall'olio S.; Nanni Costa L.; Gallo M.; Fontanesi L.
2020

Abstract

Single nucleotide polymorphisms (SNPs) that describe population differences have important applications in livestock, including breed assignment of individual animals, authentication of mono-breed products and parentage verification. Several methods have been used to identify the most discriminating SNPs among the thousands of markers available on commercial SNP chips. Random forest (RF) is a machine learning technique that has been proposed for this purpose. In this study, we used RF to analyse PorcineSNP60 BeadChip array genotyping data from a total of 2737 pigs of 7 Italian pig breeds (3 cosmopolitan-derived breeds: Italian Large White, Italian Duroc and Italian Landrace; and 4 autochthonous breeds: Apulo-Calabrese, Casertana, Cinta Senese and Nero Siciliano) to identify reduced, breed-informative SNP panels using the Mean Decrease in the Gini Index and the Mean Decrease in Accuracy parameters with stability evaluation. Other reduced informative SNP panels were obtained using Delta, Fixation index and principal component analysis statistics. The performances of these panels were compared with those of the RF-defined panels using the RF classification method and its derived out-of-bag rates and correct prediction proportions. In total, the performances of six reduced panels were evaluated. The correct assignment of the animals to their breed was close to 100% for all tested approaches. Porcine chromosome 8 harboured the largest number of selected SNPs across all panels. Many SNPs were located in genomic regions in which previous studies identified signatures of selection or genes (e.g. ESR1, KITL and LCORL) that could contribute to explaining, at least in part, phenotypically or economically relevant traits that might differentiate cosmopolitan and autochthonous pig breeds.
Random forest used as a preselection statistic highlighted informative SNPs that were not the same as those identified by the other methods. This might be due to specific features of this machine learning methodology. It will be interesting to explore whether adapting RF methods to the identification of selection signature regions could describe population-specific features that are not captured by other approaches.
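The per-SNP statistics named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the abstract does not state which software, estimators or parameters were used. The sketch assumes genotypes coded as 0, 1 or 2 copies of one allele, takes Delta as the absolute allele-frequency difference between two populations, uses an unweighted Nei-style formulation of the Fixation index, F_ST = (H_T - H_S) / H_T, and computes the Gini impurity decrease of a single genotype split, which is the quantity that the Mean Decrease in the Gini Index accumulates over the trees of a forest. All function names and the toy data are hypothetical.

```python
from collections import Counter


def allele_freq(genotypes):
    """Frequency of the counted allele, genotypes coded 0, 1 or 2."""
    return sum(genotypes) / (2 * len(genotypes))


def delta(pop_a, pop_b):
    """Absolute allele-frequency difference between two populations."""
    return abs(allele_freq(pop_a) - allele_freq(pop_b))


def fst(pops):
    """Unweighted Nei-style F_ST = (H_T - H_S) / H_T for one biallelic SNP.

    This is one of several possible estimators (an assumption of this sketch).
    """
    freqs = [allele_freq(g) for g in pops]
    p_bar = sum(freqs) / len(freqs)
    h_t = 2 * p_bar * (1 - p_bar)                           # total heterozygosity
    h_s = sum(2 * p * (1 - p) for p in freqs) / len(freqs)  # mean within-population
    return 0.0 if h_t == 0 else (h_t - h_s) / h_t


def gini(labels):
    """Gini impurity of a list of class (breed) labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def gini_decrease(genotypes, labels, threshold):
    """Impurity decrease of the single split 'genotype <= threshold'."""
    left = [b for g, b in zip(genotypes, labels) if g <= threshold]
    right = [b for g, b in zip(genotypes, labels) if g > threshold]
    if not left or not right:
        return 0.0  # the split separates nothing
    n = len(labels)
    weighted = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(labels) - weighted


# Toy data (hypothetical): one SNP genotyped in four animals of each of two breeds.
breed_a = [2, 2, 2, 1]
breed_b = [0, 0, 1, 0]
labels = ["A"] * len(breed_a) + ["B"] * len(breed_b)

print("Delta:", delta(breed_a, breed_b))
print("F_ST:", fst([breed_a, breed_b]))
print("Gini decrease:", round(gini_decrease(breed_a + breed_b, labels, 0), 4))
```

On this toy SNP, Delta is 0.75, F_ST is about 0.56 and the Gini decrease of the split at genotype 0 is about 0.30. Delta and F_ST score each SNP on its own, whereas an RF importance depends on which other SNPs are available to the trees, which is one plausible reason why the rankings can differ.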
Files in this item:

File: a machine learning approach 1-s2.0-S1751731119002167-main.pdf
Access: open access
Type: Publisher's version (PDF)
License: Open Access License. Creative Commons Attribution - NonCommercial - NoDerivatives (CC BY-NC-ND)
Size: 307.3 kB
Format: Adobe PDF

File: 1-s2.0-S1751731119002167-mmc1.docx
Access: open access
Type: Supplementary file
License: Open Access License. Creative Commons Attribution - NonCommercial - NoDerivatives (CC BY-NC-ND)
Size: 584.6 kB
Format: Microsoft Word XML

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11585/743115
Citations
  • PMC 14
  • Scopus 34
  • Web of Science 33