Non-deterministic results often arise unexpectedly in High Performance Computing (HPC) applications.
Such results complicate debugging and can compromise the correctness of scientific HPC simulations.
ANACIN-X is a software framework that uses graph kernel distances to measure the degree of non-determinism in point-to-point communication within MPI applications.
This tutorial will enable developers and scientists to use ANACIN-X to understand the origins of non-determinism.
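To make the idea of a graph kernel distance concrete, the following is a minimal Python sketch, not ANACIN-X's actual implementation. It assumes each run is summarized as a dictionary mapping a vertex ID to a label, uses a simple vertex-label histogram kernel, and converts the kernel into a distance with the standard kernel-induced formula. The graph encoding, the label strings, and all function names are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def label_histogram(event_graph):
    """Count vertex labels in an event graph given as {vertex_id: label}."""
    return Counter(event_graph.values())


def histogram_kernel(g1, g2):
    """Vertex-label histogram kernel: dot product of the label counts."""
    h1, h2 = label_histogram(g1), label_histogram(g2)
    return sum(h1[label] * h2[label] for label in h1)


def kernel_distance(g1, g2):
    """Kernel-induced distance: sqrt(k(a, a) + k(b, b) - 2 * k(a, b))."""
    squared = (
        histogram_kernel(g1, g1)
        + histogram_kernel(g2, g2)
        - 2 * histogram_kernel(g1, g2)
    )
    return math.sqrt(max(squared, 0.0))


# Hypothetical event graphs for two runs of the same program.
# Rank 0 posts two receives; only the order in which senders match differs.
run_a = {
    "s1": "send from 1",
    "s2": "send from 2",
    "r0": "recv 0 matched 1",
    "r1": "recv 1 matched 2",
}
run_b = {
    "s1": "send from 1",
    "s2": "send from 2",
    "r0": "recv 0 matched 2",
    "r1": "recv 1 matched 1",
}

print(kernel_distance(run_a, run_a))  # 0.0
print(kernel_distance(run_a, run_b))  # 2.0
```

In this toy setting, a distance of zero means two runs produced identical label counts, while larger values indicate that more communication events were matched differently.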
Who is the tutorial for?
The tutorial is designed for undergraduate and graduate students in computer science, software developers of HPC codes, and scientists working in HPC. Undergraduate and graduate degree programs in computer science often introduce students to the concept of parallel code but leave a more in-depth understanding of code execution on large-scale HPC systems to individual exploration.
This tutorial bridges the knowledge gap between data science and HPC domain science.
By the end of the module, students should be able to:
- Understand the emergence of non-determinism in MPI applications and the impact of non-determinism on scientific code executions (Chapter 1).
- Understand and identify how MPI communication events are modeled as event graphs (Chapter 2) and learn the process by which ANACIN-X utilizes event graphs to identify non-determinism (Chapter 3).
- Install and run the ANACIN-X software (Chapter 5) across various benchmark applications (Chapter 4), and interpret the outcomes produced by ANACIN-X (Chapter 6).
- Evaluate their understanding through benchmark applications in use cases A, B, and C.
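As a small illustration of the first objective, the sketch below simulates, in plain Python and without MPI, a rank that accumulates floating-point values in whatever order messages happen to arrive, as can occur with wildcard receives such as `MPI_ANY_SOURCE`. The function name, the sender ranks, and the payload values are hypothetical; the point is only that the same inputs can yield different results when the arrival order changes.

```python
def accumulate_in_arrival_order(values_by_sender, arrival_order):
    """Sum the senders' values in the order their messages arrive."""
    total = 0.0
    for sender in arrival_order:
        total += values_by_sender[sender]
    return total


# Hypothetical payloads sent by ranks 1, 2, and 3 to rank 0.
values_by_sender = {1: 0.1, 2: 0.2, 3: 0.3}

# Two runs of the same program that differ only in message arrival order.
sum_run_a = accumulate_in_arrival_order(values_by_sender, [1, 2, 3])
sum_run_b = accumulate_in_arrival_order(values_by_sender, [3, 2, 1])

# Floating-point addition is not associative, so the two sums differ.
print(sum_run_a == sum_run_b)  # False
print(abs(sum_run_a - sum_run_b))
```

The difference is tiny here, but in long-running simulations such discrepancies can accumulate and make runs hard to reproduce or debug.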
The tutorial is organized into six chapters:
- Chapter 1: Introduces MPI and non-determinism; provides examples of ...
- Chapter 2: Shows how MPI communication events are modeled as event ...
- Chapter 3: Introduces different software modules from ANACIN-X for analysis ...
- Chapter 4: Defines four application benchmarks for testing ANACIN-X's ability ...
- Chapter 5: Provides a step-by-step guide to install and run ANACIN-X.
- Chapter 6: Demonstrates how to interpret results from ANACIN-X and identify ...
This project is a joint effort between the University of Tennessee, Knoxville (UTK) and the University of North Texas (UNT).
This work was supported in part by the National Science Foundation under Grants 1900888, 1900765, and 1916454. Results were generated in part with the support of the Tellico cluster computer and XSEDE computational resources.