Artifact Review Summary: GEMINI: Fast Failure Recovery in Distributed Training with In-Memory Checkpoints
Artifact Details
Badges Awarded
Artifact Available | Artifact Functional | Results Reproduced
Description of the Artifact
GEMINI is a failure recovery mechanism for distributed DNN training systems. It addresses the problem of handling failures efficiently when training large language models in a distributed manner. The artifact consists of the source code of GEMINI, built atop DeepSpeed. It includes the design introduced in the paper and examples of different models adopting GEMINI. It also provides a script that automates the artifact evaluation process and generates the necessary results.
Environment(s) Used for Testing
- AWS instances provided by the authors: a cluster of 4 AWS p3dn.24xlarge instances with 32 V100 GPUs in total. OS: Amazon Linux 2; Kernel: 4.14.200-155.322; Nvidia Driver: 450.80.02; CUDA Version: 10.0.
Step-By-Step Instructions to Exercise the Artifact
The authors have provided a well-organized working example for artifact evaluation. The reviewers followed the script described in the README file to run the programs.
How The Artifact Supports The Paper
Artifact Available
- The artifact is available on GitHub under the git commit 8450d6f.
- The artifact has a README file with a reference to the paper.
- The artifact has an associated MIT license.
Artifact Functional
The artifact has a README file that lists the supported environments and gives running instructions. Its description of the file structure meets the minimum requirement and could be improved further. The authors have provided a minimal working example comprising three language models and a dedicated script for the artifact evaluation process. The artifact contains the components introduced in the paper and functions properly.
Results Reproduced
The artifact evaluation example is a scaled-down version of the evaluation in the paper. The claims that can be tested on this testbed were reproduced correctly.
Additional Notes and Resources
It would be better if the authors added more comments to the code so that the implementation of GEMINI is easier to locate.
The artifact was publicly available during the evaluation. However, it was not available at the time this summary was prepared, for unknown reasons.