Objectives

  • Define an MPI application, understand how non-determinism emerges in MPI applications, and articulate the impact of non-determinism on scientific code execution.

MPI and Non-Determinism

In this chapter, we introduce MPI, explain how non-determinism occurs in MPI applications, and describe its impact on scientific applications.

Message Passing Interface

Message Passing Interface (MPI) is the de facto standard for programming parallel applications on distributed systems at petascale and exascale. MPI is not a programming language; rather, it is a library of functions that developers call from C, C++, or Fortran code to write parallel programs.

MPI provides three different communication methods to share data among computing nodes (a minimal code sketch follows the figure below):

  • Point-to-Point Communication, where a message is passed between two processes: one process performs a send operation while the other performs a matching receive.
  • Collective Communication, where all processes in a group participate in the same operation, such as a broadcast or a reduction.
  • One-Sided Communication, where a process can directly access the memory space of another process without that process's active participation.
../images/mpi_nd_images/mpi_communications.png

Different communication methods in MPI. Source: [1]
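
To make these three styles concrete, here is a minimal C sketch that exercises each of them: a point-to-point send/receive pair, a broadcast collective, and a one-sided put into a memory window. The file name comm_styles.c and the specific values are illustrative, and the sketch assumes it is launched with at least two ranks (e.g., mpirun -np 2 ./comm_styles).

/* comm_styles.c -- minimal sketch of the three MPI communication styles.
 * Build with an MPI compiler wrapper, e.g.: mpicc comm_styles.c -o comm_styles
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* 1. Point-to-point: rank 0 sends one integer, rank 1 receives it. */
    int token = 42;
    if (rank == 0 && size > 1) {
        MPI_Send(&token, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&token, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    /* 2. Collective: every rank participates in the same broadcast. */
    int shared = (rank == 0) ? 7 : 0;
    MPI_Bcast(&shared, 1, MPI_INT, 0, MPI_COMM_WORLD);

    /* 3. One-sided: rank 0 writes into rank 1's memory window without
     *    rank 1 posting a matching receive. */
    int window_buf = 0;
    MPI_Win win;
    MPI_Win_create(&window_buf, sizeof(int), sizeof(int),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);
    MPI_Win_fence(0, win);
    if (rank == 0 && size > 1) {
        int value = 99;
        MPI_Put(&value, 1, MPI_INT, /*target rank*/ 1,
                /*target displacement*/ 0, 1, MPI_INT, win);
    }
    MPI_Win_fence(0, win);
    MPI_Win_free(&win);

    printf("rank %d: token=%d shared=%d window=%d\n",
           rank, token, shared, window_buf);

    MPI_Finalize();
    return 0;
}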

Non-Determinism in MPI Applications

When a parallel program exhibits non-deterministic behavior, it means that multiple executions of the same program with the same inputs can produce different results.

In the context of MPI, non-determinism can occur due to factors such as message ordering, synchronization, and race conditions between processes.

The figure and the code sketch below demonstrate how non-determinism arises in MPI applications.

../images/E01-pack_unpack.svg
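
As a concrete illustration of non-determinism caused by message ordering, the following C sketch (illustrative file name any_source.c) has rank 0 receive with MPI_ANY_SOURCE, so the order in which it consumes the workers' messages can change from run to run. It assumes at least three ranks, so that more than one arrival order is possible.

/* any_source.c -- rank 0 consumes worker messages in arrival order. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {
        /* Receive one message from every other rank, in whatever order
         * the messages happen to arrive. */
        for (int i = 1; i < size; i++) {
            int payload;
            MPI_Status status;
            MPI_Recv(&payload, 1, MPI_INT, MPI_ANY_SOURCE, 0,
                     MPI_COMM_WORLD, &status);
            printf("received %d from rank %d\n", payload, status.MPI_SOURCE);
        }
    } else {
        /* Every worker sends its rank as the payload. */
        MPI_Send(&rank, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}

Running this several times (e.g., mpirun -np 4 ./any_source) typically prints the sources in different orders, even though the program and its inputs are identical.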

An Example of Non-Determinism

When thread/process ordering influences the order of arithmetic operations, we can get unexpected results across runs. An error on the order of 10⁻⁹ may seem small, but it can become "not-so-innocent".

../images/mpi_nd_images/nd2.png

Non-Deterministic Sum Execution

Here, if the sum later appears as a denominator, that small discrepancy can be the difference between a normal arithmetic execution and a floating-point exception (FPE), for example if one run's sum happens to round to zero.
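
The following C sketch (illustrative file name nd_sum.c, with made-up partial values) mimics the situation in the figure above: rank 0 accumulates partial sums in arrival order, and because floating-point addition is not associative, the printed total can differ in its last digits between runs.

/* nd_sum.c -- a floating-point sum whose result depends on arrival order. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Each rank contributes a partial sum. Decimal fractions like 0.1 are
     * not exactly representable in binary floating point, so the rounding
     * of the total depends on the order in which the partials are added. */
    double partial = 0.1 * (rank + 1);

    if (rank == 0) {
        double total = partial;
        for (int i = 1; i < size; i++) {
            double incoming;
            MPI_Recv(&incoming, 1, MPI_DOUBLE, MPI_ANY_SOURCE, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            total += incoming;   /* the order of these additions varies per run */
        }
        printf("total = %.17g\n", total);
    } else {
        MPI_Send(&partial, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}

Run with several ranks (e.g., mpirun -np 4 ./nd_sum) a few times: the last digits of the printed total can change from run to run, which is the kind of discrepancy illustrated in the figure.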

Petascale to Exascale

At the petascale level, systems already involve a high degree of parallelism and concurrency, but they are not as massive or complex as exascale systems. At the exascale level, MPI applications commonly trade determinism for asynchrony and concurrency. This shift enables improved performance and efficiency in large-scale parallel computing tasks.

../images/mpi_nd_images/peta_exa.png

Message-Passing Applications: From Petascale to Exascale

Impact of Non-Determinism

Non-deterministic programs are fundamentally difficult to reason about because their outcomes cannot be predicted from their inputs alone. These challenges grow as the scale increases. Non-determinism leads to two particularly urgent problems:

  • Lack of program correctness: bugs can be expensive to localize and diagnose.
  • Lack of scientific correctness: results can be difficult to replicate over multiple runs.

Impact on Program Correctness

../images/mpi_nd_images/halo.png

Real case example: Impact on program correctness

In a case study [2] on HPC debugging, scientists encountered a non-deterministic bug causing intermittent hangs after several hours of execution. The bug, found in the linear algebra package HYPRE 2.10.1 used by their research application Diablo, disrupted their work for 18 months. It took over 10,000 hours (about a year) of compute time to locate and diagnose the issue.

Impact on Scientific Correctness

../images/mpi_nd_images/halo.png

Real case example: Impact on Scientific correctness

A second study [3] illustrates how non-determinism affects the reproducibility of scientific results. In simulations of galaxy formation using the Enzo code [4], non-determinism led to major differences in the detection of a galactic halo across multiple runs. Such discrepancies in outputs undermine the reliability of scientific conclusions drawn from simulations.

Conclusion

MPI is a powerful tool for parallel programming, but MPI applications can also be non-deterministic, which impacts scientific applications in several ways. This chapter introduced MPI, non-determinism in MPI applications, and its impact on scientific applications. The next chapter, Chapter 2, shows how to model MPI communication events through event graphs.

You can test your understanding of Chapters 1 and 2 through Use Case A.

References

  1. https://hpc.nmsu.edu/discovery/mpi/introduction
  2. Noise Injection Techniques to Expose Subtle and Unintended Message Races
  3. Assessing Reproducibility: An Astrophysical Example of Computational Uncertainty in the HPC Context
  4. Enzo: An adaptive mesh refinement code for astrophysics