github aws/aws-parallelcluster v3.14.0
AWS ParallelCluster v3.14.0

We're excited to announce the release of AWS ParallelCluster 3.14.0.

Upgrade

How to upgrade?

sudo pip install --upgrade aws-parallelcluster
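
After upgrading, it can help to confirm which version the Python environment actually sees. A minimal sketch using only the standard library, assuming it runs under the same interpreter that pip installed into; installed_version is a hypothetical helper name, not part of ParallelCluster.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str = "aws-parallelcluster") -> Optional[str]:
    """Return the installed version of a package, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints the installed version string, or None when the package is missing.
print(installed_version())
```

If this prints a version older than 3.14.0, the upgrade likely went to a different Python environment than the one being queried.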

ENHANCEMENTS

  • Include drivers for P6e-GB200 and P6-B200 instances. ParallelCluster sets up the Slurm topology plugin to handle P6e-GB200 UltraServers. See the LIMITATIONS section for important additional setup requirements.
  • Support the prioritized and capacity-optimized-prioritized allocation strategies. These allow users to prioritize subnets for instance placement to optimize costs and performance.
  • Add build-image support for Amazon Linux 2023 AMIs based on kernel 6.12 (in addition to 6.1).
  • Support DCV on Amazon Linux 2023.
  • Echo chef-client logs in the instance console when a node fails to bootstrap. This helps with investigating bootstrap failures in cases where CloudWatch logs are not available.

LIMITATIONS

  • P6e-GB200 instances are only tested on Amazon Linux 2023, Ubuntu 22.04 and Ubuntu 24.04.
  • Using IMEX on P6e-GB200 requires additional setup. Please refer to the dedicated tutorial in our public documentation.
  • P6-B200 instances are only tested on Amazon Linux 2023, RHEL 8 & 9, Rocky 8 & 9, Ubuntu 22.04 and Ubuntu 24.04.
  • GPU HealthChecks are not recommended for instances with GPU memory above 320 GB (such as p6-b200.48xlarge). Health check duration can exceed 10 minutes, potentially causing job failures and significantly reducing job throughput.
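
The 320 GB guidance above can be expressed as a simple pre-flight check. A minimal sketch in Python, assuming you supply the instance's GPU memory yourself; the function name and the example memory figures are illustrative and are not part of ParallelCluster.

```python
# Threshold taken from the limitation above: GPU health checks are not
# recommended when an instance's GPU memory is above 320 GB.
GPU_MEMORY_LIMIT_GB = 320


def gpu_health_check_recommended(gpu_memory_gb: float) -> bool:
    """Return True when the GPU memory is at or below the 320 GB guidance."""
    if gpu_memory_gb < 0:
        raise ValueError("GPU memory must be non-negative")
    return gpu_memory_gb <= GPU_MEMORY_LIMIT_GB


# Hypothetical figures for illustration only; check the instance specifications.
for memory_gb in (96, 320, 1440):
    print(memory_gb, gpu_health_check_recommended(memory_gb))
```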

CHANGES

  • Install nvidia-imex for all OSs except Amazon Linux 2.
  • Remove UnkillableStepTimeout from slurm.conf and let Slurm set this value.
  • Upgrade Python runtime used by Lambda functions to Python 3.12 (from 3.9). See Lambda Documentation for important information about Python 3.9 EOL: https://docs.aws.amazon.com/lambda/latest/dg/lambda-runtimes.html
  • Support encryption of the EFS file system used for the head node internal shared storage via a new configuration parameter HeadNode/SharedStorageEfsSettings/Encrypted.
  • Add validator that warns against using non-GPU instances with DCV.
  • Upgrade Slurm to version 24.11.6 (from 24.05.8).
  • Upgrade EFA installer to 1.43.2 (from 1.41.0).
    • Efa-driver: efa-2.17.2-1
    • Efa-config: efa-config-1.18-1
    • Efa-profile: efa-profile-1.7-1
    • Libfabric-aws: libfabric-aws-2.1.0-5
    • Rdma-core: rdma-core-58.0-1
    • Open MPI: openmpi40-aws-4.1.7-2 and openmpi50-aws-5.0.6-11
  • Upgrade Cinc Client to version 18.4.12 (from 18.2.7).
  • Upgrade NVIDIA driver to version 570.172.08 (from 570.86.15) for all OSs except Amazon Linux 2.
  • Upgrade CUDA Toolkit to version 12.8.1 (from 12.8.0) for all OSs except Amazon Linux 2.
  • Upgrade DCGM to version 4.4.1 (from 3.3.6) for all OSs except Amazon Linux 2.
  • Upgrade Python to 3.12.11 (from 3.12.8) for all OSs except Amazon Linux 2.
  • Upgrade Python to 3.9.23 (from 3.9.20) for Amazon Linux 2.
  • Upgrade Intel MPI Library to 2021.16.0 (from 2021.13.1).
  • Upgrade DCV to version 2024.0-19030.
  • Upgrade the official ParallelCluster Amazon Linux 2023 AMIs to kernel 6.12 (from 6.1).
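
The slash-separated name HeadNode/SharedStorageEfsSettings/Encrypted in the list above denotes a nested key in the cluster configuration. A minimal Python sketch, assuming the configuration has been loaded into nested dictionaries and that the parameter takes a boolean; set_by_path is a hypothetical helper, not a ParallelCluster API, and the instance type is a placeholder.

```python
def set_by_path(config: dict, path: str, value) -> dict:
    """Set a nested key addressed by a slash-separated path, creating parents as needed."""
    keys = path.split("/")
    node = config
    for key in keys[:-1]:
        # Descend into the parent mapping, creating it when it is missing.
        node = node.setdefault(key, {})
    node[keys[-1]] = value
    return config


# Assumed minimal head node section; existing keys are left untouched.
cluster_config = {"HeadNode": {"InstanceType": "t3.medium"}}
set_by_path(cluster_config, "HeadNode/SharedStorageEfsSettings/Encrypted", True)
print(cluster_config)
```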

BUG FIXES

  • Prevent build-image stack deletion failures by deploying a global role that automatically deletes the build-image stack after the image build either succeeds or fails.
    The role is meant to exist even after the stack has been deleted. See #5914.
  • Fix an issue where Security Group validation failed when a rule contained both IPv4 ranges (IpRanges) and security group references (UserIdGroupPairs).
  • Fix a build-image failure on Rocky 9 that occurs when the parent image does not ship the latest kernel version on the latest Rocky minor version.
  • Fix a cluster ID mismatch issue that causes cluster update failures when Slurm accounting is used.
  • Fix a race condition in CloudWatch Agent startup that could cause node bootstrap failures.
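
For context on the Security Group fix above, a single permission entry can name CIDR ranges under IpRanges and group references under UserIdGroupPairs at the same time. A minimal Python sketch, assuming the rule dictionary follows the shape returned by the EC2 DescribeSecurityGroups API; rule_sources is a hypothetical helper and does not reproduce ParallelCluster's validator, and all identifiers are made up.

```python
def rule_sources(rule: dict) -> list[str]:
    """Collect every traffic source named by one permission entry."""
    cidrs = [r["CidrIp"] for r in rule.get("IpRanges", [])]
    groups = [p["GroupId"] for p in rule.get("UserIdGroupPairs", [])]
    return cidrs + groups


# Example rule mixing both source types in one entry.
mixed_rule = {
    "IpProtocol": "tcp",
    "FromPort": 22,
    "ToPort": 22,
    "IpRanges": [{"CidrIp": "10.0.0.0/16"}],
    "UserIdGroupPairs": [{"GroupId": "sg-0123456789abcdef0"}],
}
print(rule_sources(mixed_rule))
```

Code that inspects such rules should read both keys rather than assuming only one is populated.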

DEPRECATIONS

  • The configuration parameter LoginNodes/Pools/Ssh/KeyName has been deprecated, and it will be removed in future releases. The CLI now returns a warning message when it is used in the cluster configuration.
    See #6811.
  • Ubuntu 20.04 is no longer supported.
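
To find configurations that still use the deprecated LoginNodes/Pools/Ssh/KeyName parameter before it is removed, a small scan can help. A minimal Python sketch, assuming the cluster configuration has been loaded into nested dictionaries and that Pools is a list of pool mappings; the helper name, warning text, pool names, and key name are illustrative and do not reproduce the CLI's own warning.

```python
def find_deprecated_login_key_names(config: dict) -> list[str]:
    """Return one warning per login node pool that still sets Ssh/KeyName."""
    warnings = []
    pools = config.get("LoginNodes", {}).get("Pools", [])
    for index, pool in enumerate(pools):
        if "KeyName" in pool.get("Ssh", {}):
            # Fall back to the list position when the pool has no Name.
            name = pool.get("Name", f"#{index}")
            warnings.append(
                f"LoginNodes/Pools/Ssh/KeyName is deprecated (pool {name})"
            )
    return warnings


# Hypothetical configuration fragment with one affected pool.
config = {
    "LoginNodes": {
        "Pools": [
            {"Name": "login", "Ssh": {"KeyName": "my-key"}},
            {"Name": "login-2"},
        ]
    }
}
for warning in find_deprecated_login_key_names(config):
    print(warning)
```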
