Message Passing Interface (MPI) is the de facto standard for programming parallel applications on distributed systems at petascale and exascale. MPI is not a programming language; rather, it is a library of functions that developers can use within C, C++, or Fortran code to write parallel programs.
MPI offers three communication methods for sharing data among computing nodes:
Point-to-Point Communication, where a message is passed between two processes. One process performs a send operation while the other performs a matching receive.
Collective Communication, where communication involves a group of processes.
One-Sided Communication, where a process can directly access the memory space of another process without that process's involvement.
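The point-to-point pattern can be sketched in a few lines of C. This is a minimal illustration, not a complete application; the rank numbers, message tag, and payload value are chosen arbitrarily for the example.

```c
#include <mpi.h>
#include <stdio.h>

/* Minimal point-to-point example: rank 0 sends one integer to rank 1,
 * which performs the matching receive.
 * Build with mpicc and run with: mpirun -np 2 ./a.out */
int main(int argc, char **argv) {
    int rank, value = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;
        /* Send one MPI_INT to rank 1 with tag 0. */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* Matching receive: same source, tag, and datatype. */
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}
```

Note that the send and receive must agree on datatype, tag, and communicator for the messages to match; the program requires an MPI installation and a launcher such as mpirun to execute.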
When a parallel program exhibits non-deterministic behavior, it means that multiple executions of the same program with the same inputs can produce different results.
In the context of MPI, non-determinism can occur due to factors such as message ordering, synchronization,
and race conditions between processes.
These examples demonstrate how non-determinism arises in MPI applications.
When thread or process order influences the order of arithmetic operations, results can vary unexpectedly across runs. An error on the order of 10^-9 may seem small, but it can become "not-so-innocent": if the sum later appears in a denominator, that tiny discrepancy can be the difference between a normal arithmetic result and an FPE (floating-point exception).
Petascale to Exascale
At the petascale level, systems involve a high degree of parallelism and concurrency, but they are not as massive or complex as exascale systems. At the exascale level, MPI applications commonly trade determinism for asynchrony and concurrency. This shift enables improved performance and efficiency in large-scale parallel computing tasks.
Non-deterministic programs are fundamentally difficult to reason about because their outcomes cannot be predicted from their inputs alone. These challenges grow as the scale increases. Non-determinism can lead to two particularly urgent problems:
Lack of program correctness: bugs can be expensive to localize and diagnose.
Lack of scientific correctness: results can be difficult to replicate over multiple runs.
Impact on Program Correctness
In a case study [2] on HPC debugging, scientists encountered a non-deterministic bug that caused intermittent hangs after several hours of execution. The bug, found in the linear algebra package HYPRE 2.10.1 used by their research application Diablo, disrupted their work for 18 months. It took over 10,000 hours (about a year) of compute time to locate and diagnose the issue.
Impact on Scientific Correctness
A second study [3] illustrates how non-determinism affects the reproducibility of scientific results.
In a simulation of galaxy formation using the Enzo code [4], non-determinism led to major differences in the detection of a galactic halo across multiple runs. Such discrepancies in outputs undermine the reliability of scientific conclusions drawn from simulations.
MPI is a powerful tool for parallel programming, but it can also be non-deterministic,
which can impact scientific applications in several ways. This chapter introduces you
to MPI, non-determinism in MPI applications, and its impact on scientific applications. Chapter 2 shows how to model MPI communication events through event graphs.
You can test your understanding of Chapters 1 and 2 through Use Case A.