We're excited to announce the release of AWS ParallelCluster 3.2.0.
Upgrade
How to upgrade?
sudo pip install --upgrade aws-parallelcluster
ENHANCEMENTS
- Add support for memory-based job scheduling in Slurm
  - Configure compute nodes real memory in the Slurm cluster configuration.
  - Add new configuration parameter Scheduling/SlurmSettings/EnableMemoryBasedScheduling to enable memory-based scheduling in Slurm.
  - Add new configuration parameter Scheduling/SlurmQueues/ComputeResources/SchedulableMemory to override default value of the memory seen by the scheduler on compute nodes.
- Improve flexibility on cluster configuration updates to avoid the stop and start of the entire cluster whenever possible.
  - Add new configuration parameter Scheduling/SlurmSettings/QueueUpdateStrategy to set the preferred strategy to adopt for compute nodes needing a configuration update and replacement.
- Improve failover mechanism over available compute resources when hitting insufficient capacity issues with EC2 instances. Disable compute nodes for a configurable amount of time (default 10 min) when a node launch fails due to insufficient capacity.
- Add support to mount existing FSx for ONTAP and FSx for OpenZFS file systems.
- Add support to mount multiple instances of existing EFS, FSx for Lustre, FSx for ONTAP, and FSx for OpenZFS file systems.
- Add support for FSx for Lustre Persistent_2 deployment type when creating a new file system.
- Prompt user to enable EFA for supported instance types when using pcluster configure wizard.
- Add support for rebooting compute nodes via Slurm.
- Improve handling of Slurm power states to also account for manual powering down of nodes.
- Add NVIDIA GDRCopy 2.3 into the product AMIs to enable low-latency GPU memory copy.
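As a rough illustration of how the new Slurm scheduling parameters listed above might fit together, here is a minimal configuration fragment. It is a sketch, not a complete cluster configuration: the queue name, compute resource name, instance type, counts, and the SchedulableMemory value are placeholders, the unit of SchedulableMemory is assumed to be MiB, and DRAIN is assumed to be one of the accepted QueueUpdateStrategy values. Check the official configuration reference before relying on any of these.

```yaml
# Sketch only: a fragment of a cluster configuration file, not a full one.
# Names, instance type, counts, and memory value are illustrative placeholders.
Scheduling:
  Scheduler: slurm
  SlurmSettings:
    # Turn on memory-based scheduling in Slurm.
    EnableMemoryBasedScheduling: true
    # Assumed value; consult the docs for the accepted strategies.
    QueueUpdateStrategy: DRAIN
  SlurmQueues:
    - Name: queue1
      ComputeResources:
        - Name: compute1
          InstanceType: c5.xlarge
          MinCount: 0
          MaxCount: 10
          # Overrides the default memory seen by the scheduler on these nodes.
          # Assumed to be expressed in MiB.
          SchedulableMemory: 7000
```

With memory-based scheduling enabled, a job would typically declare its memory needs through the standard Slurm options, for example `sbatch --mem=4G job.sh`.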
CHANGES
- Upgrade EFA installer to version 1.17.2.
  - EFA driver: efa-1.16.0-1
  - EFA configuration: efa-config-1.10-1
  - EFA profile: efa-profile-1.5-1
  - Libfabric: libfabric-aws-1.16.0~amzn2.0-1
  - RDMA core: rdma-core-41.0-2
  - Open MPI: openmpi40-aws-4.1.4-2
- Upgrade NICE DCV to version 2022.0-12760.
- Upgrade NVIDIA driver to version 470.129.06.
- Upgrade NVIDIA Fabric Manager to version 470.129.06.
- Change default EBS volume types from gp2 to gp3 for both the root and additional volumes.
- Changes to FSx for Lustre file systems created by ParallelCluster:
  - Change the default deployment type to Scratch_2.
  - Change the Lustre server version to 2.12.
- Do not require PlacementGroup/Enabled to be set to true when passing an existing PlacementGroup/Id.
- Add parallelcluster:cluster-name tag to all the resources created by ParallelCluster.
- Do not allow setting PlacementGroup/Id when PlacementGroup/Enabled is explicitly set to false.
- Add lambda:ListTags and lambda:UntagResource to ParallelClusterUserRole used by ParallelCluster API stack for cluster update.
- Restrict IPv6 access to IMDS to root and cluster admin users only, when configuration parameter HeadNode/Imds/Secured is true (the default).
- With a custom AMI, use the AMI root volume size instead of the ParallelCluster default of 35 GiB. The value can be changed in the cluster configuration file.
- Automatically disable the compute fleet when the configuration parameter Scheduling/SlurmQueues/ComputeResources/SpotPrice is lower than the minimum required Spot request fulfillment price.
- Show requested_value and current_value values in the change set when adding or removing a section during an update.
- Disable aws-ubuntu-eni-helper service in DLAMI to avoid conflicts with configure_nw_interface.sh when configuring instances with multiple network cards.
- Remove support for Python 3.6.
- Set MTU to 9001 for all the network interfaces when configuring instances with multiple network cards.
- Remove the trailing dot when configuring the compute node FQDN.
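The PlacementGroup changes above mean that an existing placement group can be referenced by Id alone. The fragment below sketches what that might look like. It assumes the PlacementGroup section sits under the queue's Networking section, and the queue name and placement group name are placeholders.

```yaml
# Sketch only: a fragment, assuming PlacementGroup is configured under
# Scheduling/SlurmQueues/Networking. Names are illustrative placeholders.
Scheduling:
  SlurmQueues:
    - Name: queue1
      Networking:
        PlacementGroup:
          # Enabled is omitted: it is no longer required to be true
          # when an existing placement group Id is passed.
          Id: my-existing-placement-group
```

Per the same set of changes, combining an Id with Enabled explicitly set to false is not allowed.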
BUG FIXES
- Fix the default behavior to skip the ParallelCluster validation and test steps when building a custom AMI.
- Fix file handle leak in computemgtd.
- Fix race condition that was sporadically causing launched instances to be immediately terminated because they were not yet available in the EC2 DescribeInstances response.
- Fix support for DisableSimultaneousMultithreading parameter on instance types with Arm processors.
- Fix ParallelCluster API stack update failure when upgrading from a previous version. Add resource pattern used for the ListImagePipelineImages action in the EcrImageDeletionLambdaRole.
- Fix ParallelCluster API by adding missing permissions needed to import/export from S3 when creating an FSx for Lustre storage.