ANACIN-X: Analysis and Modeling of Nondeterminism and Associated Costs in eXtreme Scale Applications

Funded by the National Science Foundation (NSF) under grant numbers 1900888 and 1900765

About ANACIN-X

Nondeterminism is a growing and often overlooked challenge in HPC execution. To address this problem, this project has five main goals:
Model non-deterministic executions by identifying sources of non-determinism in HPC applications, and detailing the characteristics of these sources in a given communication topology.
Improve understanding of non-deterministic features across various HPC applications via tracing techniques combined with modeling and comparison of event graphs.
Study the types and sources of communication non-determinism via both application of existing tools for graph comparison and development of new tools for graph alignment.
Provide lightweight, portable methods for modeling and identifying application non-determinism, allowing for efficient analysis and debugging of arbitrary HPC codes.
Foster an understanding of non-determinism in the next generation of HPC experts by developing and releasing interactive, educational software and media about non-determinism; collaborating with computer science and HPC organizations, such as the SIGHPC Resource Constrained Environments Chapter, to reach out to edge communities; and engaging students in research opportunities.

ANACIN-X Tutorial

Learn more about ANACIN-X with our Comprehensive Tutorial.

Selected Publications

Khan, A., Bhowmick, S., Taufer, M. Towards Scalable Identification of Motifs Representing Non-Determinism in HPC Simulations. In ACM Student Research Competition Posters Display, SC'23, Denver, Colorado. [link]
Tan, N., Luettgau, J., Marquez, J., Teranishi, K., Morales, N., Bhowmick, S., Cappello, F., Taufer, M. and Nicolae, B. Scalable Incremental Checkpointing using GPU-Accelerated De-Duplication. In Proceedings of the 52nd International Conference on Parallel Processing, August 2023, pp. 665-674. [link]
Alexander, M., Bhowmick, S., Bogale, B., Diaz, G., Elster, A.C., Ellsworth, D.A., Hernandez, C.J.B., Jaffe, E., Marquez, J., Melton, A. and Pandey, A. EduHPC Lightning Talk Summary. In Proceedings of the SC'23 Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis, November 2023. [link]
Bell P, Suarez K, Fossum B, Chapp D, Bhowmick S, Taufer M. A Research-Based Course Module to Study Non-determinism in High Performance Applications. IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW). (2022). doi: 10.1109/IPDPSW55747.2022.00067. [link forthcoming]
D. Chapp, N. Tan, S. Bhowmick and M. Taufer. Identifying Degree and Sources of Non-Determinism in MPI Applications Via Graph Kernels, in IEEE Transactions on Parallel and Distributed Systems, vol. 32, no. 12, pp. 2936-2952, 1 Dec. 2021, doi:10.1109/TPDS.2021.3081530. [link]
Bell P, Suarez K, Chapp D, Tan N, Bhowmick S, Taufer M. ANACIN-X: A Software Framework for Studying Non-determinism in MPI Applications. Software Impacts. 9 Oct. 2021, doi: 10.1016/j.simpa.2021.100151. [link]
D. Chapp, D. Rorabaugh (#), K. Sato, D. Ahn, and M. Taufer. A Three-phase Workflow for General and Expressive Representations of Nondeterminism in HPC Applications. International Journal of High-Performance Computing Applications (IJHPCA), 1175-1184 (2019). [link]
D. Chapp, K. Sato, D. Ahn, and M. Taufer. Record-and-Replay Techniques for HPC Systems: A Survey. Journal of Supercomputing Frontiers and Innovations, 5(1):11-30 (2018). [link]
D. Chapp, T. Johnston, and M. Taufer. On the Need for Reproducible Numerical Accuracy through Intelligent Runtime Selection of Reduction Algorithms at the Extreme Scale. In Proceedings of IEEE Cluster Conference, pp. 166–175. Chicago, Illinois, USA. September 8–11, 2015. [link]