github SchedMD/slurm slurm-26-05-0-1
v26.05.0


Changes in 26.05.0

  • slurmctld - Fix interactive jobs erroneously killed by InactivityLimit when slurmctld is congested.
  • data_parser/v0.0.45 - Remove fields that were deprecated in v0.0.44.
  • slurmd - Fix a potential crash during message forwarding.
  • slurmctld - Avoid a possible crash under heavy load caused by mismatched pointer comparisons.
  • Reject sbatch --external jobs when combined with --wrap.
  • Skip external nodes in _slurm_rpc_node_alias_addrs().
  • Fix out-of-bounds array errors by resizing leaf_usage when tres_cnt changes.
  • Add the option to set StorageHost or StorageBackupHost in slurmdbd.conf, or JobCompHost in slurm.conf, to a unix socket. To do so, prefix with "unix:", e.g., StorageHost=unix:/path/to/socket.
  • Logs now better reflect MySQL connection issues when connecting over a UNIX socket.
  • Cache uid lookups to speed controller/dbd startup/reconfigure in some cases.
  • All features will be tested before jobs are preempted.
  • Improve clarity of gres/shards in sinfo GRES_USED field.
  • Fix issue with SlurmctldParameters=max_powered_nodes affecting scontrol update nodename=... commands when it should not.
  • slurmctld - Return 303 See Other from GET /metrics/* when in backup standby, pointing to the configured primary controller.
  • slurmstepd - When the node running the batch step fails, don't deallocate the batch step until after the job completes or is requeued.
  • Reject the job if num_tasks is lower than the partition's min_nodes.
  • Reject a job if num_tasks is lower than min_nodes.
  • Reject a num_tasks update when num_tasks < min_nodes.
  • Set max_nodes from num_tasks when max_nodes is not explicitly set.
  • switch/hpe_slingshot - Fix memory leak when the fabric manager responds to a job-lookup GET with HTTP 404.
  • Fix node reboot with slurmd older than 26.05.
  • Prevent slurmctld's background thread from waiting on the purge files thread while holding the job and node write locks.
  • Prevent slurmctld from crashing after removing all of a pending job's licenses from the configuration.
  • Avoid purging reservations that reserved HRES when restarting or reconfiguring slurmctld.
  • When restoring jobs from state, hold pending jobs whose license or HRES requests are no longer valid.
  • Prevent slurmctld crash when a HetJob component fails to start.
  • Avoid reverse DNS lookups for connection logging unless DebugFlags=conmgr is configured.
  • Return a properly formatted value for DefMemPerCPU.
  • srun - Reject --async outside of an existing job allocation.
  • Fix HDF5 'Malformed file' error in sh5util -I extraction.
  • Fix memory leaks in sh5util.
  • scontrol show federation now brackets IPv6 control host literals so the address and port are no longer ambiguous.
  • Fix torus3d placement overlap detection across the torus wraparound.
  • Add a torus3d node_count overflow guard.
  • topology/torus3d - Add adaptive Morton encoding for large torus dimensions.
  • Fix torus3d and other topology parsers reporting DUMPING errors when parsing fails.
  • Requeue --no-requeue jobs when powering-up nodes are drained and requeue_on_resume_failure SchedulerParameter is set.
  • Set PrologFlags=Alloc automatically when PrologFlags=DeferBatch is set. Without Alloc, DeferBatch has no effect.
  • Improve slurmdbd hourly rollup performance on large clusters.
  • configure - Rename --with-http-parser to --with-libhttp-parser.
  • configure - Add --with-llhttp-parser option.
  • http_parser/libhttp_parser - If rpaths are enabled when configuring slurm, add rpath to libhttp_parser plugin.
  • Add new http_parser/llhttp_parser plugin.
  • Add new url_parser/internal plugin.
  • interfaces/http_parser - If HttpParserType is not specified, no longer default to using the http_parser/libhttp_parser plugin. Instead try to first load http_parser/libhttp_parser then http_parser/llhttp_parser.
  • interfaces/url_parser - If UrlParserType is not specified, no longer default to using the url_parser/libhttp_parser plugin. Instead try to first load url_parser/libhttp_parser then url_parser/internal.
  • slurmrestd - Fix pipelined HTTP/1.1 requests after the first message on a keep-alive connection.
  • http_parser/libhttp_parser - Prevent memory leak if a connection ends early.
  • srun - Add --parsable to emit the bare step id for easier scripting of --async steps.
  • Match srun_exclusive_allocation in LaunchParameters case-insensitively.
  • Fix JobAccountGather failing on glibc 2.43+ due to a sscanf() %Nc behavior change.
  • auth/slurm - Fix missing symbol issues with libjwt 2.1 caused by importing private base64 functions.
  • auth/jwt - Fix missing symbol issues with libjwt 2.1 caused by importing private base64 functions.
  • task/affinity - Support nodes with more than 1024 CPUs.
  • Document SlurmctldHttpAuthParameters in slurm.conf(5).
  • Document SlurmdHttpAuthParameters in slurm.conf(5).
  • Fix sdiag RPC-by-user and RPC-by-type output when the user stats table is full.
  • Fix a regression in Slurm 25.05 that caused requeued jobs to lose their license/HRES requests, allowing the job to run without sufficient licenses/HRES.
  • Set SLURM_JOB_SLUID environment variable.
  • Fix treating "topology" in slurmd's --conf= options as case sensitive.
  • Fix losing scontrol-set Extra, InstanceId, and InstanceType on nodes across subsequent slurmd registrations.
  • Allow a node's topology to be updated based on the dynamic slurmd's reported topology after a reboot.
  • Allow llhttp-devel as an alternative to http-parser-devel when building RPMs.
  • Fix steps in a resized job not being given an end time, and display them properly under the original SLUID in sacct.
  • When using stepmgr and a job is resized, avoid allocating new steps on removed nodes.
  • Fix not clearing node reasons on resume when not using an accounting storage plugin.
  • Enforce distribution requirements (-m/--distribution on allocation cli commands) if job requests CountOnly GRES.
  • Restrict libjwt to >= 1.10.0, < 3 at build and package time.
  • auth/jwt and auth/slurm - Fix JWT authentication to work around a regression in libjwt 2.1.1 (and later).
  • Fix JWT authentication failures on libjwt 2.x for parse-only credential paths.
  • Fix Slurm Lua string to JSON/YAML (slurm.to_json or slurm.to_yaml) rejecting empty strings.
  • Fix regression in 26.05.0 that caused scrun to exit with a fatal error before starting the container.
  • Fix slurmctld crash and shutdown/reconfigure deadlock caused by accounting_storage callers racing the plugin teardown.
  • Fix a potential deadlock when the controller starts, the dbd was not running during the controller's previous run, and there are jobs with a new script or env that need to be sent to the dbd.
  • Add ESLURM_FILE_UNREADABLE error code to distinguish "file exists but cannot be read" from ENOENT.
  • Avoid logging parsing warnings in CLI when topology.yaml does not strictly conform to OpenAPI specification.
  • Avoid logging parsing warnings in CLI when namespace.yaml does not strictly conform to OpenAPI specification.
  • Avoid logging parsing warnings in CLI when resources.yaml does not strictly conform to OpenAPI specification.
  • Add swait, a client command that blocks until all of a job's steps have completed.
  • Deprecated options ExclusiveUser and ExclusiveTopo are now mutually exclusive.
  • When creating or updating a partition with Oversubscribe set to YES or FORCE, warn that Exclusive=[NODE|TOPO] implies Oversubscribe=NO.
  • scontrol - 'EXCLUSIVE_USER' and 'EXCLUSIVE_TOPO' will no longer be dumped by the '.partitions[].flags' field of the following commands: 'scontrol show partition --json', 'scontrol show partition --yaml'.
  • slurmrestd - No longer parse or dump 'EXCLUSIVE_USER', 'EXC_USER_CLEAR', 'EXCLUSIVE_TOPO', or 'EXC_TOPO_CLEAR' as values for the '.partitions[].flags' field of the following endpoints: 'GET /slurm/v0.0.45/partition/{partition_name}', 'GET /slurm/v0.0.45/partitions', 'POST /slurm/v0.0.45/partitions'.
  • scontrol - Remove 'partitions[].maximums.oversubscribe.jobs' and 'partitions[].maximums.oversubscribe.flags' fields from the output of the following commands: 'scontrol show partition --json', 'scontrol show partition --yaml'.
  • slurmrestd - Remove 'partitions[].maximums.oversubscribe.jobs' and 'partitions[].maximums.oversubscribe.flags' fields from the following endpoints: 'GET /slurm/v0.0.45/partition/{partition_name}', 'GET /slurm/v0.0.45/partitions', 'POST /slurm/v0.0.45/partitions'.
  • slurmrestd - Enable parsing for 'partitions[].partition.exclusive' and 'partitions[].partition.oversubscribe' fields of the following endpoints: 'GET /slurm/v0.0.45/partition/{partition_name}', 'GET /slurm/v0.0.45/partitions', 'POST /slurm/v0.0.45/partitions'.
  • No longer override a partition's OverSubscribe count when updating the partition with Exclusive=[NO|USER].
  • Fix regression in 26.05.0rc1 that caused slurmscriptd to crash on receiving SIGPROF.
  • Properly complete Slurm <= 25.11 jobs with a 26.05 slurmdbd.
  • Add missing index on the sluid column of the job_table.
  • Add sluid in archive dump/load.
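
The "unix:" prefix now accepted by StorageHost, StorageBackupHost, and JobCompHost can be illustrated with a small parser. This is a minimal Python sketch, not Slurm's C implementation: the StorageEndpoint type and parse_storage_host function are hypothetical, and treating the prefix as a literal, case-sensitive match is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StorageEndpoint:
    """Hypothetical parsed form of a StorageHost-style value."""
    socket_path: Optional[str] = None  # set when the value uses the "unix:" prefix
    host: Optional[str] = None         # set for an ordinary hostname or address


UNIX_PREFIX = "unix:"  # assumption: matched literally and case-sensitively


def parse_storage_host(value: str) -> StorageEndpoint:
    """Split a host value into either a UNIX socket path or a network host."""
    if value.startswith(UNIX_PREFIX):
        path = value[len(UNIX_PREFIX):]  # everything after the prefix is the path
        if not path:
            raise ValueError("unix: prefix given without a socket path")
        return StorageEndpoint(socket_path=path)
    # No prefix: treat the value as a hostname or IP address, unchanged.
    return StorageEndpoint(host=value)


if __name__ == "__main__":
    print(parse_storage_host("unix:/path/to/socket"))
    print(parse_storage_host("db.example.com"))
```

A value without the prefix, such as a hostname, is passed through unchanged as a network host.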
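
The IPv6 change to scontrol show federation follows the bracketing convention used in URIs (RFC 3986). An IPv6 literal is wrapped in "[" and "]" so the colon before the port cannot be mistaken for part of the address. A minimal Python sketch, where format_host_port is a hypothetical helper and not something Slurm provides:

```python
import ipaddress


def format_host_port(host: str, port: int) -> str:
    """Join host and port, bracketing IPv6 literals so the port is unambiguous."""
    try:
        is_v6 = isinstance(ipaddress.ip_address(host), ipaddress.IPv6Address)
    except ValueError:
        is_v6 = False  # not an IP literal, e.g. a hostname
    if is_v6:
        # Without brackets, "2001:db8::1:6817" could be read as a longer address.
        return f"[{host}]:{port}"
    return f"{host}:{port}"


if __name__ == "__main__":
    for h in ("2001:db8::1", "192.0.2.10", "ctld.example.com"):
        print(format_host_port(h, 6817))
```

IPv4 addresses and hostnames contain no colons, so they are left unbracketed.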
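
The release notes do not describe how the adaptive Morton encoding in topology/torus3d works. As a generic illustration only, the sketch below interleaves the bits of a 3D coordinate into a Z-order (Morton) index and derives the per-axis bit width from the largest torus dimension. Reading "adaptive" as "code width sized to the dimensions" is an assumption, and bits_needed, code_bits, and morton3d are hypothetical names.

```python
from typing import Tuple


def bits_needed(n: int) -> int:
    """Bits required to represent coordinates 0..n-1 (at least 1)."""
    return max(1, (n - 1).bit_length())


def code_bits(dims: Tuple[int, int, int]) -> int:
    """Total code width when every axis uses the widest axis's bit count."""
    return 3 * bits_needed(max(dims))


def morton3d(x: int, y: int, z: int, dims: Tuple[int, int, int]) -> int:
    """Interleave the bits of (x, y, z) into a single Z-order (Morton) index.

    Bit i of x lands at position 3*i, bit i of y at 3*i + 1, and bit i of z
    at 3*i + 2, so nearby coordinates tend to get nearby indices.
    """
    for c, d in zip((x, y, z), dims):
        if not 0 <= c < d:
            raise ValueError(f"coordinate {c} outside dimension of size {d}")
    width = bits_needed(max(dims))  # per-axis width sized from the dimensions
    code = 0
    for bit in range(width):
        code |= ((x >> bit) & 1) << (3 * bit)
        code |= ((y >> bit) & 1) << (3 * bit + 1)
        code |= ((z >> bit) & 1) << (3 * bit + 2)
    return code


if __name__ == "__main__":
    dims = (4, 4, 4)
    print(code_bits(dims), morton3d(3, 3, 3, dims))
```

In a fixed-width language such as C, a value like code_bits() would determine whether the index fits in 32 or 64 bits (for example, a dimension of 2048 needs 33 bits). Python integers are unbounded, so here the width only bounds the loop.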
