ANACIN-X Tutorial

Non-deterministic results often arise unexpectedly in High Performance Computing (HPC) applications. They complicate debugging and can compromise the correctness of scientific HPC simulations.

ANACIN-X is a software framework specifically designed to measure the degree of non-determinism in point-to-point communication within MPI applications, utilizing graph kernel distances.
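To give a rough sense of what a graph kernel distance measures, the sketch below compares two toy "runs" of the same program. It is not ANACIN-X's implementation: the edge-histogram kernel, the function names, and the event labels are all assumptions chosen for illustration. Each run is reduced to a list of message edges (send event, matching receive event), and the distance is the standard one induced by a kernel k, namely sqrt(k(a,a) + k(b,b) - 2 k(a,b)).

```python
from collections import Counter
from math import sqrt

def edge_histogram_kernel(g1, g2):
    """Toy kernel (an assumption, not ANACIN-X's): dot product of edge-label histograms."""
    h1, h2 = Counter(g1), Counter(g2)
    return sum(count * h2[edge] for edge, count in h1.items())

def kernel_distance(g1, g2):
    """Distance induced by the kernel: sqrt(k(a,a) + k(b,b) - 2*k(a,b))."""
    squared = (edge_histogram_kernel(g1, g1)
               + edge_histogram_kernel(g2, g2)
               - 2 * edge_histogram_kernel(g1, g2))
    return sqrt(max(squared, 0))

# Hypothetical message edges from two runs: rank 0 posts two receives,
# and the messages from ranks 1 and 2 arrive in a different order in each run.
run_a = [("rank1:send0", "rank0:recv0"), ("rank2:send0", "rank0:recv1")]
run_b = [("rank2:send0", "rank0:recv0"), ("rank1:send0", "rank0:recv1")]

print(kernel_distance(run_a, run_a))  # identical runs
print(kernel_distance(run_a, run_b))  # runs whose message matching differs
```

Identical runs are at distance zero, while runs whose messages were matched in a different order are farther apart, so a larger distance between runs points to more non-determinism in their communication.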

This tutorial will enable developers and scientists to understand the origins of non-determinism using ANACIN-X.

Who is the tutorial for?

The tutorial is designed for undergraduate and graduate students in computer science, software developers of HPC codes, and scientists working in HPC. Undergraduate and graduate degree programs in computer science often introduce students to the concept of parallel code, but leave a deeper understanding of how that code executes on large-scale HPC systems to individual exploration.

This tutorial bridges the knowledge gap between data science and HPC domain science.

Tutorial Objectives

By the end of the tutorial, students should be able to:

  • Understand the emergence of non-determinism in MPI applications and the impact of non-determinism on scientific code executions (Chapter 1).
  • Understand how MPI communication events are modeled as event graphs (Chapter 2) and learn how ANACIN-X uses event graphs to identify non-determinism (Chapter 3).
  • Install and run the ANACIN-X software (Chapter 5) across various benchmark applications (Chapter 4), and interpret the outcomes produced by ANACIN-X (Chapter 6).
  • Evaluate their understanding through benchmark applications in use cases A, B, and C.

Chapters

1. MPI and Non-Determinism
   Introduces MPI and non-determinism; provides examples of how non-determinism impacts scientific applications.

2. Graphs
   Shows how MPI communication events are modeled as event graphs; presents ways to analyze event graphs to identify non-determinism.

3. ANACIN-X
   Introduces the ANACIN-X software modules for analyzing non-deterministic behavior in MPI applications.

4. Benchmarks
   Defines four application benchmarks for testing ANACIN-X's ability to characterize non-determinism.

5. Running ANACIN-X
   Provides a step-by-step guide to installing and running ANACIN-X.

6. Results Interpretation
   Demonstrates how to interpret results from ANACIN-X and identify the sources of non-determinism in an application.

Acknowledgment

This project is a joint effort between the University of Tennessee, Knoxville (UTK) and the University of North Texas (UNT).

This work was supported in part by the National Science Foundation under Grants 1900888, 1900765, and 1916454. Results were generated in part with the support of the Tellico cluster computer and XSEDE computational resources.