This is the first release candidate for the 1.2.0 branch of runc. It includes
all patches and bugfixes included in runc 1.1 patch releases (up to and
including 1.1.12). A fair few new features have been added, and some changes
have been made which may affect users. Please help us thoroughly test this
release before we release 1.2.0.
runc now requires a minimum of Go 1.20 to compile.
NOTE: runc currently will not work properly when compiled with Go 1.22 or
newer. This is due to some unfortunate glibc behaviour that Go 1.22
exacerbates in a way that results in containers not being able to start on
some systems. See this issue for more information.
Breaking
- Several aspects of how mount options work have been adjusted in a way that
  could theoretically break users that have very strange mount option strings.
  This was necessary to fix glaring issues in how mount options were being
  treated. The key changes are:
  - Mount options on bind-mounts that clear a mount flag are now always
    applied. Previously, if a user requested a bind-mount with only clearing
    options (such as rw,exec,dev) the options would be ignored and the
    original bind-mount options would be set. Unfortunately this also means
    that container configurations which specified only clearing mount options
    will now actually get what they asked for, which could break existing
    containers (though it seems unlikely that a user who requested a specific
    mount option would consider it "broken" to get the mount options they
    asked for). This also allows us to silently add locked mount flags the
    user did not explicitly request to be cleared in rootless mode, allowing
    for easier use of bind-mounts for rootless containers. (#3967)
  - Container configurations using bind-mounts with superblock mount flags
    (i.e. filesystem-specific mount flags, referred to as "data" in mount(2),
    as opposed to VFS generic mount flags like MS_NODEV) will now return an
    error. This is because superblock mount flags will also affect the host
    mount (as the superblock is shared when bind-mounting), which is obviously
    not acceptable. Previously, these flags were silently ignored so this
    change simply tells users that runc cannot fulfil their request rather
    than just ignoring it. (#3990)
  If any of these changes cause problems in real-world workloads, please open
  an issue so we can adjust the behaviour to avoid compatibility issues.
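To illustrate what a "clearing" mount option is, here is a minimal sketch that splits an option list into flags to set and flags to clear. It is not runc's implementation: the function name splitOptions and the option table are hypothetical and cover only a handful of options. The flag values are the ones used for MS_RDONLY, MS_NOSUID, MS_NODEV and MS_NOEXEC in the Linux headers.

```go
package main

import "fmt"

// Illustrative subset of VFS mount flags. The values match MS_RDONLY,
// MS_NOSUID, MS_NODEV and MS_NOEXEC in the Linux UAPI headers.
const (
	msRdonly uint32 = 1 << iota
	msNosuid
	msNodev
	msNoexec
)

// flagOp describes what one option string does to one mount flag.
type flagOp struct {
	flag  uint32
	clear bool
}

// optionTable is a hypothetical, deliberately incomplete option table.
var optionTable = map[string]flagOp{
	"ro":     {msRdonly, false},
	"rw":     {msRdonly, true},
	"nosuid": {msNosuid, false},
	"suid":   {msNosuid, true},
	"nodev":  {msNodev, false},
	"dev":    {msNodev, true},
	"noexec": {msNoexec, false},
	"exec":   {msNoexec, true},
}

// splitOptions returns the flags the options ask to set and the flags they
// ask to clear. A later option overrides an earlier one for the same flag.
func splitOptions(opts []string) (toSet, toClear uint32, err error) {
	for _, opt := range opts {
		if opt == "bind" || opt == "rbind" {
			continue // selects the mount type, not a flag
		}
		op, ok := optionTable[opt]
		if !ok {
			return 0, 0, fmt.Errorf("option %q is not handled by this sketch", opt)
		}
		if op.clear {
			toClear |= op.flag
			toSet &^= op.flag
		} else {
			toSet |= op.flag
			toClear &^= op.flag
		}
	}
	return toSet, toClear, nil
}

func main() {
	toSet, toClear, err := splitOptions([]string{"bind", "rw", "exec", "dev"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("set=%#x clear=%#x\n", toSet, toClear)
}
```

For the option string rw,exec,dev nothing is set and three flags are requested to be cleared, which is the "only clearing options" case that was previously ignored for bind-mounts.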
Added
- runc has been updated to OCI runtime-spec 1.2.0, and supports all Linux
  features with a few minor exceptions. See docs/spec-conformance.md for more
  details.
- runc now supports id-mapped mounts for bind-mounts (with no restrictions on
  the mapping used for each mount). Other mount types are not currently
  supported. This feature requires MOUNT_ATTR_IDMAP kernel support (Linux
  5.12 or newer) as well as kernel support for the underlying filesystem used
  for the bind-mount. See mount_setattr(2) for a list of supported filesystems
  and other restrictions. (#3717, #3985, #3993)
- Two new mechanisms for reducing the memory usage of our protections against
  CVE-2019-5736 have been introduced:
  - runc-dmz is a minimal binary (~8K) which acts as an additional execve
    stage, allowing us to only need to protect the smaller binary. It should
    be noted that there have been several compatibility issues reported with
    the usage of runc-dmz (namely related to capabilities and SELinux). As
    such, this mechanism is opt-in and can be enabled by running runc with
    the environment variable RUNC_DMZ=true (setting this environment variable
    in config.json will have no effect). This feature can be disabled at
    build time using the runc_nodmz build tag. (#3983, #3987)
  - contrib/memfd-bind is a helper daemon which will bind-mount a memfd copy
    of /usr/bin/runc on top of /usr/bin/runc. This entirely eliminates
    per-container copies of the binary, but requires care to ensure that
    upgrades to runc are handled properly, and requires a long-running daemon
    (unfortunately memfds cannot be bind-mounted directly and thus require a
    daemon to keep them alive). (#3987)
- runc will now use cgroup.kill if available to kill all processes in a
  container (such as when doing runc kill). (#3135, #3825)
- Add support for setting the umask for runc exec. (#3661)
- libct/cg: support SCHED_IDLE for runc cgroupfs. (#3377)
- checkpoint/restore: implement --manage-cgroups-mode=ignore. (#3546)
- seccomp: refactor flags support; add flags to features, set SPEC_ALLOW by
  default. (#3588)
- libct/cg/sd: use systemd v240+ new MAJOR:* syntax. (#3843)
- Support CFS bandwidth burst for CPU. (#3749, #3145)
- Support time namespaces. (#3876)
- Reduce the runc binary size by ~11% by updating
  github.com/checkpoint-restore/go-criu. (#3652)
- Add --pidfd-socket to runc run and runc exec to allow for management
  processes to receive a pidfd for the new process, allowing them to avoid pid
  reuse attacks. (#4045)
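The cgroup.kill entry above relies on a cgroup v2 kernel interface: writing 1 to a cgroup's cgroup.kill file makes the kernel send SIGKILL to every process in that cgroup. The following is a minimal sketch of the "use it if available" logic, not runc's own code; the helper name killViaCgroup and the cgroup path in main are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"path/filepath"
)

// killViaCgroup asks the kernel to kill every process in the cgroup v2
// directory cgroupDir by writing "1" to its cgroup.kill file.
//
// It returns (false, nil) when cgroup.kill does not exist, so that the
// caller can fall back to signalling the processes individually.
func killViaCgroup(cgroupDir string) (bool, error) {
	// O_WRONLY without O_CREATE: only succeed if the kernel exposes the file.
	f, err := os.OpenFile(filepath.Join(cgroupDir, "cgroup.kill"), os.O_WRONLY, 0)
	if errors.Is(err, os.ErrNotExist) {
		return false, nil
	}
	if err != nil {
		return false, err
	}
	defer f.Close()

	if _, err := f.WriteString("1"); err != nil {
		return false, err
	}
	return true, nil
}

func main() {
	// Hypothetical cgroup path, used purely for illustration.
	used, err := killViaCgroup("/sys/fs/cgroup/example-container")
	if err != nil {
		fmt.Println("cgroup.kill failed:", err)
		return
	}
	if !used {
		fmt.Println("cgroup.kill not available, a fallback would be needed")
		return
	}
	fmt.Println("all processes in the cgroup were sent SIGKILL")
}
```

Opening the file without creating it is what makes the availability check work: on kernels or hierarchies that do not provide cgroup.kill the open fails with "not exist" and the caller can take another path.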
Deprecated
- runc option --criu is now ignored (with a warning), and the option will
  be removed entirely in a future release. Users who need a non-standard
  criu binary should rely on the standard way of looking up binaries in
  $PATH. (#3316)
- runc kill option -a is now deprecated. Previously, it had to be specified
  to kill a container (with SIGKILL) which does not have its own private PID
  namespace (so that runc would send SIGKILL to all processes). Now, this is
  done automatically. (#3864, #3825)
- github.com/opencontainers/runc/libcontainer/user is now deprecated, please
  use github.com/moby/sys/user instead. It will be removed in a future
  release. (#4017)
Changed
- When Intel RDT feature is not available, its initialization is skipped,
  resulting in slightly faster runc exec and runc run. (#3306)
- runc features is no longer experimental. (#3861)
- libcontainer users that create and kill containers from a daemon process
  (so that the container init is a child of that process) must now implement
  a proper child reaper in case a container does not have its own private PID
  namespace, as documented in container.Signal. (#3825)
- Sum anon and file from memory.stat for cgroupv2 root usage, as the root
  does not have memory.current for cgroupv2. This aligns cgroupv2 root usage
  more closely with cgroupv1 reporting. Additionally, report root swap usage
  as sum of swap and memory usage, aligned with v1 and existing non-root v2
  reporting. (#3933)
- Add swapOnlyUsage in MemoryStats. This field reports swap-only usage.
  For cgroupv1, Usage and Failcnt are set by subtracting memory usage
  from memory+swap usage. For cgroupv2, Usage, Limit, and MaxUsage
  are set. (#4010)
- libcontainer: container.Signal no longer takes an all argument. Whether
  or not it is necessary to kill all processes in the container individually
  is now determined automatically. (#3825, #3885)
- seccomp: enable seccomp binary tree optimization. (#3405)
- runc run / runc exec: ignore SIGURG. (#3368)
- Remove tun/tap from the default device allowlist. (#3468)
- runc --root non-existent-dir list now reports an error for non-existent
  root directory. (#3374)
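The cgroupv2 root usage calculation mentioned in the list above (summing anon and file from memory.stat because the root cgroup has no memory.current) can be sketched roughly as follows. This is an illustration under simplifying assumptions, not the libcontainer code: the function name rootMemoryUsage is hypothetical and the input is the text of memory.stat passed as a string.

```go
package main

import (
	"bufio"
	"errors"
	"fmt"
	"strconv"
	"strings"
)

// rootMemoryUsage approximates memory usage for the cgroup v2 root by adding
// the "anon" and "file" counters found in the contents of memory.stat.
func rootMemoryUsage(memoryStat string) (uint64, error) {
	var anon, file uint64
	var haveAnon, haveFile bool

	sc := bufio.NewScanner(strings.NewReader(memoryStat))
	for sc.Scan() {
		// Each line of memory.stat has the form "<key> <value>".
		fields := strings.Fields(sc.Text())
		if len(fields) != 2 {
			continue
		}
		switch fields[0] {
		case "anon", "file":
		default:
			continue // not one of the two counters we need
		}
		v, err := strconv.ParseUint(fields[1], 10, 64)
		if err != nil {
			return 0, fmt.Errorf("parsing %q: %w", sc.Text(), err)
		}
		if fields[0] == "anon" {
			anon, haveAnon = v, true
		} else {
			file, haveFile = v, true
		}
	}
	if err := sc.Err(); err != nil {
		return 0, err
	}
	if !haveAnon || !haveFile {
		return 0, errors.New("memory.stat is missing the anon or file entry")
	}
	return anon + file, nil
}

func main() {
	// Made-up sample contents, for illustration only.
	sample := "anon 4096\nfile 8192\nkernel_stack 16384\n"
	usage, err := rootMemoryUsage(sample)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("root memory usage (bytes):", usage)
}
```

Other counters in memory.stat are deliberately ignored here, since the entry above only names anon and file as inputs to the root usage figure.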
Fixed
- In case the runc binary resides on tmpfs, runc init no longer re-execs
  itself twice. (#3342)
- Our seccomp -ENOSYS stub now correctly handles multiplexed syscalls on
  s390 and s390x. This solves the issue where syscalls the host kernel did not
  support would return -EPERM despite the existence of the -ENOSYS stub
  code (this was due to how s390x does syscall multiplexing). (#3474)
- Remove tun/tap from the default device rules. (#3468)
- specconv: avoid mapping "acl" to MS_POSIXACL. (#3739)
- libcontainer: fix private PID namespace detection when killing the
  container. (#3866, #3825)
- systemd socket notification: fix race where runc exited before systemd
  properly handled the READY notification. (#3291, #3293)
- The -ENOSYS seccomp stub is now always generated for the native
  architecture that runc is running on. This is needed to work around some
  arguably specification-incompliant behaviour from Docker on architectures
  such as ppc64le, where the allowed architecture list is set to null. This
  ensures that we always generate at least one -ENOSYS stub for the native
  architecture even with these weird configs. (#4219)
Removed
- In order to fix performance issues in the "lightweight" bindfd protection
  against CVE-2019-5736, the temporary ro bind-mount of /proc/self/exe has
  been removed. runc now creates a binary copy in all cases. See the above
  notes about memfd-bind and runc-dmz as well as
  contrib/cmd/memfd-bind/README.md for more information about how this
  (minor) increase in memory usage can be reduced. (#3987, #3599, #2532,
  #3931)
- libct/cg: Remove EnterPid (a function with no users). (#3797)
- libcontainer: Remove {Pre,Post}MountCmds which were never used and are
  obsoleted by more generic container hooks. (#3350)
Static Linking Notices
The runc binary distributed with this release is statically linked with
the following GNU LGPL-2.1 licensed libraries, with runc acting
as a "work that uses the Library":
The versions of these libraries were not modified from their upstream versions,
but in order to comply with the LGPL-2.1 (§6(a)), we have attached the
complete source code for those libraries which (when combined with the attached
runc source code) may be used to exercise your rights under the LGPL-2.1.
However we strongly suggest that you make use of your distribution's packages
or download them from the authoritative upstream sources, especially since
these libraries are related to the security of your containers.
Thanks to the following contributors who made this release possible:
- Akihiro Suda akihiro.suda.cz@hco.ntt.co.jp
- Alban Crequy albancrequy@microsoft.com
- Aleksa Sarai cyphar@cyphar.com
- Alex Jia ajia@redhat.com
- Alexander Eldeib alexeldeib@gmail.com
- Andrey Tsygunka dreamsider@mail.ru
- Austin Vazquez macedonv@amazon.com
- Bjorn Neergaard bjorn.neergaard@docker.com
- Brian Goff cpuguy83@gmail.com
- Chengen, Du chengen.du@canonical.com
- Chethan Suresh chethan.suresh@sony.com
- Christian Happ Christian.Happ@jumo.net
- Cory Snider csnider@mirantis.com
- CrazyMax crazy-max@users.noreply.github.com
- Daniel, Dao Quang Minh dqminh89@gmail.com
- Danish Prakash grafitykoncept@gmail.com
- Davanum Srinivas davanum@gmail.com
- Eng Zer Jun engzerjun@gmail.com
- Eric Ernst eric_ernst@apple.com
- Erik Sjölund erik.sjolund@gmail.com
- Evan Phoenix evan@phx.io
- Francis Laniel flaniel@linux.microsoft.com
- Heran Yang heran55@126.com
- Irwin D'Souza dsouzai.gh@gmail.com
- Jaroslav Jindrak dzejrou@gmail.com
- Jonas Eschenburg jonas.eschenburg@kuka.com
- Jordan Rife jrife0@gmail.com
- Kailun Qin kailun.qin@intel.com
- Kang Chen kongchen28@gmail.com
- Kazuki Hasegawa nanasi880@gmail.com
- Kir Kolyshkin kolyshkin@gmail.com
- Markus Lehtonen markus.lehtonen@intel.com
- Masahiro Yamada masahiroy@kernel.org
- Mikko Ylinen mikko.ylinen@intel.com
- Mrunal Patel mrunalp@gmail.com
- Peter Hunt pehunt@redhat.com
- Prajwal S N prajwalnadig21@gmail.com
- Qiang Huang h.huangqiang@huawei.com
- Radostin Stoyanov rstoyanov@fedoraproject.org
- Rodrigo Campos rodrigoca@microsoft.com
- Ruediger Pluem ruediger.pluem@vodafone.com
- Sebastiaan van Stijn github@gone.nl
- Shengjing Zhu zhsj@debian.org
- Sjoerd van Leent sjoerd.van.leent@alliander.com
- SuperQ superq@gmail.com
- TTFISH jiongchiyu@gmail.com
- Tianon Gravi admwiggin@gmail.com
- Vipul Newaskar vipulnewaskar7@gmail.com
- Walt Chen godsarmycy@gmail.com
- Wang-squirrel 117961776+Wang-squirrel@users.noreply.github.com
- Wei Fu fuweid89@gmail.com
- Zheao Li me@manjusaka.me
- Zoe hi@zoe.im
- cdoern cdoern@redhat.com
- dharmicksai dharmicksaik@gmail.com
- guodong guodong9211@gmail.com
- hang.jiang hang.jiang@daocloud.io
- lengrongfu lengrongfu@lengrongfudeMacBook-Pro.local
- lifubang lifubang@acmcoder.com
- utam0k k0ma@utam0k.jp
- wineway wangyuweihx@gmail.com
- yanggang gang.yang@daocloud.io
- yaozhenxiu 946666800@qq.com
Signed-off-by: Aleksa Sarai cyphar@cyphar.com