Message Passing Interface (MPI) is the de facto standard for programming parallel applications on distributed systems at petascale and exascale. MPI is not a programming language; rather, it is a library of functions that developers can use within C, C++, or Fortran code to write parallel programs.
MPI offers three communication methods for sharing data among computing nodes:
Point-to-Point Communication, where a message is passed between two processes. One process performs a send operation while the other performs a matching receive.
Collective Communication, where communication involves a group of processes.
One-Sided Communication, where a process can directly access the memory space of another process without that process's involvement.
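The point-to-point pattern can be sketched in a few lines of C. This is a minimal illustration, not a complete application; the rank numbers, message tag, and payload value are chosen arbitrarily for the example.

```c
#include <mpi.h>
#include <stdio.h>

/* Minimal point-to-point example: rank 0 sends one integer to rank 1,
 * which performs the matching receive.
 * Build with mpicc and run with: mpirun -np 2 ./a.out */
int main(int argc, char **argv) {
    int rank, value = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;
        /* Send one MPI_INT to rank 1 with tag 0. */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* Matching receive: same source, tag, and datatype. */
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}
```

Note that the send and receive must agree on datatype, tag, and communicator for the messages to match; the program requires an MPI installation and a launcher such as mpirun to execute.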
When a parallel program exhibits non-deterministic behavior, it means that multiple executions of the same program with the same inputs can produce different results.
In the context of MPI, non-determinism can occur due to factors such as message ordering, synchronization,
and race conditions between processes.
These examples demonstrate how non-determinism arises in MPI applications.
When thread or process order influences the order of arithmetic operations, results can vary unexpectedly across runs. An error on the order of 10^-9 may seem small, but it can become "not-so-innocent": if the sum later appears in a denominator, that tiny discrepancy can be the difference between a normal arithmetic result and an FPE (floating-point exception).
Petascale to Exascale
At the petascale level, systems involve a high degree of parallelism and concurrency, but they are not as massive or complex as exascale systems. At the exascale level, MPI applications commonly trade determinism for asynchrony and concurrency. This shift enables improved performance and efficiency in large-scale parallel computing tasks.
Non-deterministic programs are fundamentally difficult to reason about because their outcomes cannot be predicted from their inputs alone. These challenges grow as the scale increases. Non-determinism can lead to two particularly urgent problems:
Lack of program correctness: bugs can be expensive to localize and diagnose.
Lack of scientific correctness: results can be difficult to replicate over multiple runs.
Impact on Program Correctness
In a case study [2] on HPC debugging, scientists encountered a non-deterministic bug that caused intermittent hangs after several hours of execution. The bug, found in the linear algebra package HYPRE 2.10.1 used by their research application Diablo, disrupted their work for 18 months. It took over 10,000 hours (about a year) of compute time to locate and diagnose the issue.
Impact on Scientific Correctness
A second study [3] illustrates how non-determinism affects the reproducibility of scientific results.
In a simulation of galaxy formation using the Enzo code [4], non-determinism led to major differences in the detection of a galactic halo across multiple runs. Such discrepancies in outputs undermine the reliability of scientific conclusions drawn from simulations.
MPI is a powerful tool for parallel programming, but it can also be non-deterministic,
which can impact scientific applications in several ways. This chapter introduces you
to MPI, non-determinism in MPI applications, and its impact on scientific applications. Chapter 2 shows how to model MPI communication events through event graphs.
You can test your understanding of Chapters 1 and 2 through Use Case A.