High-performance computing (HPC) is the cornerstone of technological advancements in our digital age, but its management is becoming increasingly challenging, particularly as systems approach exascale. Operational data analytics (ODA) and holistic monitoring frameworks aim to alleviate this burden by collecting live telemetry from HPC systems. ODA frameworks rely on NoSQL databases for scalability, with implicit data structures embedded in metric names, necessitating domain knowledge for navigating telemetry data relations. To address the imperative need for explicit representation of relations in telemetry data, we propose a novel ontology for ODA, which we apply to a real HPC installation. The proposed ontology captures relationships between topological components and links hardware components(compute nodes, rack, systems) with job's execution and allocations collected telemetry. This ontology forms the basis for constructing a knowledge graph, enabling graph queries for ODA. Moreover, we propose a comparative analysis of the complexity (expressed in lines of code) and domain knowledge requirement (qualitatively assessed by informed end-users) of complex query implementation with the proposed method and NoSQL methods commonly employed in today's ODAs. We focused on six queries informed by facility managers' daily operations, aiming to benefit not only facility managers but also system administrators and user support. Our comparative analysis demonstrates that the proposed ontology facilitates the implementation of complex queries with significantly fewer lines of code and domain knowledge required as compared to NoSQL methods.
Junaid Ahmed Khan, M.M. (2024). ExaQuery: Proving Data Structure to Unstructured Telemetry Data in Large-Scale HPC [10.1145/3629527.3652898].
ExaQuery: Proving Data Structure to Unstructured Telemetry Data in Large-Scale HPC
Junaid Ahmed Khan
;Martin Molan;Matteo Angelinelli;Andrea BartoliniSupervision
2024
Abstract
High-performance computing (HPC) is the cornerstone of technological advancements in our digital age, but its management is becoming increasingly challenging, particularly as systems approach exascale. Operational data analytics (ODA) and holistic monitoring frameworks aim to alleviate this burden by collecting live telemetry from HPC systems. ODA frameworks rely on NoSQL databases for scalability, with implicit data structures embedded in metric names, necessitating domain knowledge for navigating telemetry data relations. To address the imperative need for explicit representation of relations in telemetry data, we propose a novel ontology for ODA, which we apply to a real HPC installation. The proposed ontology captures relationships between topological components and links hardware components(compute nodes, rack, systems) with job's execution and allocations collected telemetry. This ontology forms the basis for constructing a knowledge graph, enabling graph queries for ODA. Moreover, we propose a comparative analysis of the complexity (expressed in lines of code) and domain knowledge requirement (qualitatively assessed by informed end-users) of complex query implementation with the proposed method and NoSQL methods commonly employed in today's ODAs. We focused on six queries informed by facility managers' daily operations, aiming to benefit not only facility managers but also system administrators and user support. Our comparative analysis demonstrates that the proposed ontology facilitates the implementation of complex queries with significantly fewer lines of code and domain knowledge required as compared to NoSQL methods.File | Dimensione | Formato | |
---|---|---|---|
ExaQuery_GraphSys_24.pdf
accesso aperto
Tipo:
Postprint
Licenza:
Licenza per Accesso Aperto. Creative Commons Attribuzione (CCBY)
Dimensione
1.56 MB
Formato
Adobe PDF
|
1.56 MB | Adobe PDF | Visualizza/Apri |
I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.