High Throughput Comparison of Prokaryotic Genomes

BARTOLI, LISA;FARISELLI, PIERO;MARTELLI, PIER LUIGI;MONTANUCCI, LUDOVICA;CASADIO, RITA
2007

Abstract

This work addresses the optimization of grid computing performance for a data-intensive, high-"throughput" comparison of protein sequences. We borrow the word "throughput" from telecommunication science to mean the number of concurrent independent jobs on the grid. All the proteins of 355 completely sequenced prokaryotic organisms were compared to find common traits of prokaryotic life, producing in parallel tens of gigabytes of information to store, duplicate, check and analyze. To support a large number of concurrent runs with data access on shared storage devices and a manageable data format, the output information was stored in many flat files according to a semantic logical/physical directory structure. Because many concurrent runs could cause a reading bottleneck on the same storage device, we propose methods to optimize the grid computing based on the balance between wide data access and the emergence of reading bottlenecks. The proposed analytical approach has the following advantages: it not only optimizes the duration of the overall task, but also checks whether the estimated duration is compliant with the scientific requirements and whether the related grid computing is really advantageous compared to execution on a local farm.
PPAM 2007
69
69
Carota L.; Bartoli L.; Fariselli P.; Martelli P.L.; Montanucci L.; Maggi G.; Casadio R.
Use this identifier to cite or link to this document: https://hdl.handle.net/11585/73328