This article describes recent work performed at NERSC in collaboration with Red Hat to modify Podman (the pod manager tool) to run at a large scale, a key requirement for high-performance computing (HPC). Podman is an open source tool for developing, managing, and running containers on Linux systems. For more details about this work, please see our paper, which will be published in the CANOPIE-HPC Supercomputing 2022 proceedings.
In the demo video accompanying this post, we walk through pulling an image onto Perlmutter from the NERSC registry, generating a squashed version of the image using podman-hpc, and running the EXAALT benchmark at large scale (900 nodes, 3600 GPUs) via our podman-exec wrapper.
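As a rough illustration, that workflow looks something like the sketch below. This is a hedged example: the registry path, project, image name, and wrapper arguments are hypothetical placeholders, and the podman-hpc and podman-exec invocations are modeled on the interfaces described in this post rather than copied verbatim from the demo.

```bash
# 1. Pull the image from the NERSC registry onto Perlmutter
#    (image name is a placeholder).
podman-hpc pull registry.nersc.gov/myproject/exaalt:latest

# 2. Generate a squashed, read-only copy of the image so that container
#    startup scales well across many nodes.
podman-hpc migrate registry.nersc.gov/myproject/exaalt:latest

# 3. Run the EXAALT/LAMMPS benchmark at scale through the podman-exec
#    wrapper under Slurm: 900 nodes x 4 GPUs per node = 3600 GPUs.
srun -N 900 --ntasks-per-node=4 --gpus-per-node=4 \
  podman-exec registry.nersc.gov/myproject/exaalt:latest \
  lmp -in in.lammps
```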
NERSC's flagship supercomputing system is Perlmutter, currently number 7 on the Top500 list. It has a GPU partition with over 6000 NVIDIA A100 GPUs and a CPU partition with over 6000 AMD Milan CPUs. All of the work described in this blog post was carried out on Perlmutter.
NERSC, the National Energy Research Scientific Computing Center, is the US Department of Energy's production mission computing facility serving the DOE Office of Science, which funds a wide range of fundamental and applied research. In the first half of 2022, more than 700 unique users used Shifter, the current container solution at NERSC, and general user interest in containers is growing.
Although NERSC has demonstrated near bare-metal performance with Shifter at large scales, several shortcomings have motivated us to explore Podman. The major issue is that Shifter does not provide any build utilities; users must build containers on their own local system and transfer their images to NERSC via a registry. Another obstacle is that Shifter provides security by restricting the running container to the privileges of the user who launched it. Finally, Shifter is largely an "in-house" solution, so users must learn a new technology, and NERSC staff bear the additional burden of maintaining this software.
Podman offers a solution to all of these major pain points. Podman is an OCI-compliant framework that adheres to a set of community standards, so it will feel familiar to users who have worked with other OCI-compliant tools like Docker. It also has a large user and developer community, with more than 15k stars on GitHub as of October 2022. The major innovation that drew us to Podman is rootless containers. Rootless containers elegantly constrain privileges by using a subuid/subgid map to allow the container to run in the user namespace but with what appears to be full root privileges. Podman also provides container build functionality that can allow users to build images directly on the Perlmutter login nodes, removing a major roadblock in their development workflows.
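To make the rootless model concrete, here is a minimal sketch of how the subuid/subgid mapping appears in practice (the user name and ID ranges are illustrative):

```bash
# /etc/subuid and /etc/subgid grant each user a range of subordinate IDs:
#   $ grep alice /etc/subuid
#   alice:100000:65536

# Inside the user namespace that Podman creates, the user's own UID maps
# to root, and the subordinate range provides the remaining container IDs:
$ podman unshare cat /proc/self/uid_map
         0       1001          1      # container uid 0 = host uid 1001
         1     100000      65536      # container uids 1+ = subuid range
```

"Root in the container" is thus just an unprivileged ID range on the host, which is why this approach constrains privileges so cleanly.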
Enabling Podman at a large scale on Perlmutter with near-native performance required us to address site integration, scalability, and performance. Additionally, we have developed two wrapper scripts to achieve two modes of operation: Podman container-per-process and podman-exec. Podman container-per-process mode describes the situation in which many processes run on the node (usually in an MPI application), with one individual container running for each process. The podman-exec mode describes the situation in which a single container runs per node, even if there are multiple MPI processes.
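The contrast between the two modes can be sketched with hypothetical launch commands. The image and application names below are placeholders, podman-hpc is assumed to mirror Podman's run/exec interface, and the real wrappers handle considerably more detail:

```bash
# Container-per-process: every MPI rank launches its own container,
# so a node running 64 tasks starts 64 containers.
srun --ntasks-per-node=64 \
  podman-hpc run --rm myimage:latest ./mpi_app

# podman-exec: one long-lived container is started per node, and each
# local MPI rank is exec'd into it, so the same node starts 1 container.
# Roughly, the per-node wrapper does:
podman-hpc run --detach --name node_container myimage:latest sleep infinity
podman-hpc exec node_container ./mpi_app   # repeated for each local rank
```

Starting one container per node rather than one per rank is what avoids the startup overhead that shows up in the container-per-process benchmark results below.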
We ran several benchmarks with podman-hpc on Perlmutter to measure the performance of bare metal, Shifter, Podman container-per-process, and podman-exec configurations. The EXAALT benchmark runs the LAMMPS molecular dynamics application, the Pynamic benchmark simulates Python package imports and function invocations, and the DeepCAM benchmark is a climate data segmentation deep learning application. In general, the benchmarks suggest comparable performance between the bare metal, Shifter, and podman-exec cases. The startup overhead incurred by Podman container-per-process can be seen in the results of both Pynamic and DeepCAM. Overall, podman-exec was our best performing configuration, so this is the mode on which we will focus our future development efforts.
Results from our strong-scaling EXAALT benchmark at 32, 64, 128, and 256 nodes. The average of two bare metal runs is shown in red, Shifter runs in blue, Podman container-per-process runs in dark green, and podman-exec mode in light green, with corresponding error bars.
The results of the Pynamic benchmark for bare metal (red), Shifter (blue), podman-exec mode (green), and Podman container-per-process mode (light green) at two job sizes (128 and 256 nodes) using 64 tasks per node. All configurations were run three times.
The results of the MLPerf DeepCAM strong-scaling benchmark for Shifter (blue), Podman container-per-process (light green), and podman-exec mode (dark green) over a range of job sizes (16, 32, 64, and 128 Perlmutter GPU nodes). We separate the timing data into container startup, training startup, and training runtime.
We are excited about the results we have seen so far, but we still have work to do before we can open Podman to all NERSC users. To improve the user experience, we aim to explore adding Slurm integration to remove some of the complexity of working with nested wrapper scripts, especially for the podman-exec case. We also aim to get our podman-hpc scripts and binaries into the Perlmutter boot images of all nodes, so staging them onto each node will no longer be necessary. We hope to address some of the limitations of the OCI hook functionality (for example, the inability to set environment variables) with the OCI community. Finally, our goal is to get much of our work upstreamed into Podman itself so the larger Podman community can leverage it.