Daniel Golle [Sat, 17 Jul 2021 13:06:38 +0000 (14:06 +0100)]
jail: make use of realpath() for rootfs and overlaydir
Use realpath() to resolve rootfs and read/write-overlay as they are
potentially (and likely, as we are going to use blockd with autofs)
symlinks.
Signed-off-by: Daniel Golle <daniel@makrotopia.org>
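The realpath() change above can be pictured with a minimal sketch. The helper name resolve_path() and its buffer handling are hypothetical and not taken from jail.c; only realpath() itself is the real libc call.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Resolve `path` (possibly a symlink, e.g. an autofs link below /mnt)
 * into its canonical absolute form. resolve_path() is a hypothetical
 * helper. Returns 0 on success, -1 on failure (errno set by realpath()).
 */
int resolve_path(const char *path, char *out, size_t outlen)
{
	char *resolved;

	assert(path && out && outlen > 0);

	/* passing NULL lets realpath() allocate a suitably sized buffer */
	resolved = realpath(path, NULL);
	if (!resolved)
		return -1;

	if (strlen(resolved) >= outlen) {
		free(resolved);
		return -1;
	}

	strcpy(out, resolved);
	free(resolved);
	return 0;
}
```

Resolving once up front means later mount calls work on the real directory rather than on the link.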
Daniel Golle [Wed, 14 Jul 2021 22:23:40 +0000 (23:23 +0100)]
uxc: check for required blockd mounts
When calling `uxc boot` it can happen that some required storage
volumes are not yet mounted. Make sure mountpoints exist for all
required volumes before starting a container using `uxc boot`.
(The uxc init script takes care of calling `uxc boot` every time
a new block mount is added.)
Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Daniel Golle [Thu, 15 Jul 2021 01:49:23 +0000 (02:49 +0100)]
jail: open() extroot folder before mounting
Use open() to trigger autofs mount and check extroot folder exists
before mount-binding it.
Signed-off-by: Daniel Golle <daniel@makrotopia.org>
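A rough sketch of the open()-before-mount idea described above. probe_dir() is a hypothetical name and the flag choice is an assumption; the actual code in jail.c may differ.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Open `path` as a directory. Below an autofs mount point the open()
 * itself triggers the mount; a failure means the folder is missing or
 * is not a directory. probe_dir() is a hypothetical helper.
 * Returns 0 if the folder is usable, otherwise a negative errno value.
 */
int probe_dir(const char *path)
{
	int fd;

	assert(path);

	fd = open(path, O_RDONLY | O_DIRECTORY | O_CLOEXEC);
	if (fd < 0)
		return -errno;

	close(fd);
	return 0;
}
```

A caller would only go on to bind-mount the folder when probe_dir() returns 0.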
Daniel Golle [Wed, 14 Jul 2021 16:47:22 +0000 (17:47 +0100)]
jail: allow rootfs to be a symbolic link
Follow symbolic link to rootfs so we can use autofs symlinks in /mnt
to reference volumes in config.json.
Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Daniel Golle [Tue, 13 Jul 2021 00:08:20 +0000 (01:08 +0100)]
jail: increase max additional env records to 64
In the Docker world, people pass a lot of things using env variables
it turns out. Increase to 64 for now as a hot fix, will have to be
created dynamically in future to support unlimited number of env
variables.
Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Daniel Golle [Mon, 12 Jul 2021 23:59:32 +0000 (00:59 +0100)]
jail: do not hack /etc/resolv.conf on container rootfs
While useful for slim containers, this violates OCI spec and breaks
containers like pihole.
Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Daniel Golle [Mon, 12 Jul 2021 20:22:04 +0000 (21:22 +0100)]
uxc: implement support for rootfs overlay in containers
ujail already supports having a (temporary) overlayfs on top of a
container's rootfs. This is very useful for "dirty" containers which
assume / is writable.
Support this in uxc at the time a container is created and keep the
settings on subsequent re-creates (or reboots).
Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Daniel Golle [Fri, 28 May 2021 16:17:35 +0000 (18:17 +0200)]
jail: add support for cgroup devices as in OCI run-time spec
Implement eBPF generator to emulate cgroup-v1 devices.{allow,deny},
as only cgroup-v2 is available here while the spec was written with
cgroup-v1 in mind.
Instead of literally emulating the legacy behavior, do what other
runtimes do when running on cgroup-v2: simply translate each
device rule into a bunch of eBPF instructions and then execute them
in reverse order, prepended by some default rules covering /dev/null,
/dev/random, /dev/tty, ...
Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Gaurav Pathak [Wed, 5 May 2021 11:32:45 +0000 (17:02 +0530)]
procd: Use /dev/console for serial console if exists
inittab.c: Use "/dev/console" if it is present, before trying
"/sys/class/tty/console/active", in case no console is given on the
kernel command line during boot, and to allow container environments
to use it as login PTY console.
Signed-off-by: Gaurav Pathak <gaurav.pathak@pantacor.com>
Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Gaurav Pathak [Sun, 21 Mar 2021 13:14:33 +0000 (18:44 +0530)]
procd: Adding support to detect Pantavisor Container Platform
Modified container.h to detect the pantavisor container platform,
as it runs a custom modified version of LXC. container.h is modified
to check if procd is running in a pantavisor container environment by
detecting the presence of pantavisor directory under /.
Signed-off-by: Gaurav Pathak <gaurav.pathak@pantacor.com>
Daniel Golle [Fri, 19 Mar 2021 22:22:44 +0000 (22:22 +0000)]
trace: fix build on aarch64
Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Daniel Golle [Thu, 28 Jan 2021 20:10:46 +0000 (20:10 +0000)]
jail/seccomp: add support for aarch64
Add support for Aarch64 in utrace and ujail.
Sort and unify architecture-specific definitions in headers.
Use new PTRACE_GET_SYSCALL_INFO call (available since Linux 5.3), for
now only for aarch64, but this may potentially unify things and get
rid of some #ifdef'ery for other platforms as well.
Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Mathew McBride [Fri, 5 Mar 2021 00:54:15 +0000 (00:54 +0000)]
inittab: detect active console from kernel if no console= specified
The default serial console can be set in the device tree
using the linux,stdout-path parameter (or equivalent from ACPI).
This is important for universal booting (EFI/EBBR) on ARM platforms
where the default console can be different (e.g ttyS0 vs ttyAMA0).
Signed-off-by: Mathew McBride <matt@traverse.com.au>
Daniel Golle [Sun, 7 Mar 2021 23:45:33 +0000 (23:45 +0000)]
utils: fix C style in header file
Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Rosen Penev [Tue, 2 Mar 2021 00:05:46 +0000 (16:05 -0800)]
procd: fix compilation with newer musl
An open bracket was missing.
Signed-off-by: Rosen Penev <rosenp@gmail.com>
Daniel Golle [Mon, 15 Feb 2021 07:06:42 +0000 (07:06 +0000)]
system: expose if system was booted from initramfs
It can be good for the UI to show the user that the system was booted
from initramfs, i.e. no writable permanent storage is available.
I imagine LuCI only serving applications which are explicitly marked
as being shown even in initramfs mode, i.e. nothing but status,
network->interfaces, network->wireless, system->upgrade,
system->backup, system->backuprestore tabs.
Also sysupgrade could take into account we are running on initramfs
and perform offline backup/restore of whatever is in the flash.
In that way OpenWrt-generated initramfs-images can serve as recovery
OS on devices with dual-boot in a meaningful way.
Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Daniel Golle [Sat, 13 Feb 2021 20:56:27 +0000 (20:56 +0000)]
cosmetics: provide compatible system info on Aarch64
Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Daniel Golle [Mon, 21 Dec 2020 21:51:01 +0000 (21:51 +0000)]
procd: add hotplug-call dispatcher
Add hotplug-call dispatcher ubus objects for each subsystem.
This will allow more services to run non-root and without
excessive permissions while still being able to trigger
(asynchronous) hotplug events.
Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Daniel Golle [Thu, 28 Jan 2021 23:46:16 +0000 (23:46 +0000)]
jail: cgroups: fix uninitialized variable
Make sure 'limit' is initialized to -1 (==max) when translating
cgroups-1 memory controller spec to cgroups-2.
Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Daniel Golle [Mon, 4 Jan 2021 21:52:33 +0000 (21:52 +0000)]
jail: only output BPF instr. table header if debugging
Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Daniel Golle [Mon, 28 Dec 2020 16:22:38 +0000 (16:22 +0000)]
jail: remove duplicate check for hook file permissions
Signed-off-by: Daniel Golle <daniel@makrotopia.org>
John Crispin [Tue, 26 Jan 2021 10:19:10 +0000 (11:19 +0100)]
procd: fix compiler warning
[ 37%] Building C object CMakeFiles/procd.dir/state.c.o
/projects/procd/state.c: In function ‘state_enter’:
/projects/procd/state.c:147:4: error: ignoring return value of ‘chown’, declared with attribute warn_unused_result [-Werror=unused-result]
147 | chown(p->pw_dir, p->pw_uid, p->pw_gid);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
cc1: all warnings being treated as errors
make[2]: *** [CMakeFiles/procd.dir/build.make:89: CMakeFiles/procd.dir/state.c.o] Error 1
make[1]: *** [CMakeFiles/Makefile2:241: CMakeFiles/procd.dir/all] Error 2
make: *** [Makefile:130: all] Error 2
Signed-off-by: John Crispin <john@phrozen.org>
Stefan Eichenberger [Sun, 24 Jan 2021 22:58:50 +0000 (23:58 +0100)]
hotplug.c: set nl_pid to zero
With the current solution where nl_pid is set through getpid we run into
problems when running procd in a different PID namespace (e.g.
container). The PID number inside the active PID namespace will be set
which doesn't match the global PID. Therefore, procd will never receive
any netlink messages.
By setting nl_pid to zero the kernel will assign the global PID
automatically and fixes the issue.
Signed-off-by: Stefan Eichenberger <eichest@gmail.com>
Acked-by: John Crispin <john@phrozen.org>
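The nl_pid change above might look roughly like the following sketch. The function names are hypothetical, and subscribing to multicast group 1 (the group carrying kernel uevents) is an assumption about what a hotplug listener needs; hotplug.c may use a different group mask.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <linux/netlink.h>

/*
 * Fill in the address used to bind a uevent netlink socket.
 * nl_pid stays 0 so the kernel assigns the port ID itself instead of
 * receiving a namespace-local getpid() value.
 */
void uevent_addr_init(struct sockaddr_nl *nls)
{
	assert(nls);

	memset(nls, 0, sizeof(*nls));
	nls->nl_family = AF_NETLINK;
	nls->nl_pid = 0;	/* not getpid() */
	nls->nl_groups = 1;	/* assumption: kernel uevent multicast group */
}

/* Open and bind the socket; returns the fd or -1 on error. */
int uevent_socket_open(void)
{
	struct sockaddr_nl nls;
	int fd;

	fd = socket(PF_NETLINK, SOCK_DGRAM | SOCK_CLOEXEC, NETLINK_KOBJECT_UEVENT);
	if (fd < 0)
		return -1;

	uevent_addr_init(&nls);
	if (bind(fd, (struct sockaddr *)&nls, sizeof(nls)) < 0) {
		close(fd);
		return -1;
	}

	return fd;
}
```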
Daniel Golle [Sat, 12 Dec 2020 22:59:54 +0000 (22:59 +0000)]
treewide: replace local mkdir_p implementations
Replace local implementations of mkdir_p in favour of using the
more robust implementation now added to libubox.
Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Daniel Golle [Wed, 9 Dec 2020 11:10:32 +0000 (11:10 +0000)]
jail: remove unreachable code
Replace unreachable error handling code in function setns_open with
a more appropriate assertion.
Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Daniel Golle [Fri, 4 Dec 2020 09:51:34 +0000 (09:51 +0000)]
early: fall-back to run ubus as root if user can't be found
Users have been reporting problems in case the ubus user is missing in
/etc/passwd. Run ubus as root in that case and display warning.
Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Daniel Golle [Tue, 1 Dec 2020 22:45:15 +0000 (22:45 +0000)]
jail: improve seccomp log output
Pass loglevel to preloaded seccomp handler, output generated program
along with unresolved syscalls if debugging output is requested.
Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Daniel Golle [Mon, 30 Nov 2020 00:44:53 +0000 (00:44 +0000)]
jail: seccomp: improve code readability
Break overly long line, add some comments.
No functional changes.
Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Daniel Golle [Sun, 29 Nov 2020 23:21:04 +0000 (23:21 +0000)]
jail: always call cgroups_free()
In commit
3019f50 ("jail: leak less memory") memory handling in cgroups
related code was refactored. That allows calling cgroups_free()
unconditionally and removing the child-branch in free_opts().
Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Daniel Golle [Sun, 29 Nov 2020 19:12:17 +0000 (19:12 +0000)]
jail: improve seccomp BPF generator
Restructure and add code to process rules based on syscall arguments as
defined in OCI run-time spec. Generated BPF code became more efficient
as now only one BPF instruction for each syscall is required.
Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Daniel Golle [Thu, 26 Nov 2020 16:34:38 +0000 (16:34 +0000)]
jail: properly initialize timens_fd
So we are safe for the future.
Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Daniel Golle [Thu, 26 Nov 2020 16:24:47 +0000 (16:24 +0000)]
jail: enter existing cgroups namespace if given
Call to enter an existing cgroups namespace was missing. Add it.
Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Daniel Golle [Thu, 26 Nov 2020 04:49:35 +0000 (04:49 +0000)]
jail: don't attempt to mount /sys with noatime
Because that won't work. Use relatime instead.
Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Daniel Golle [Thu, 26 Nov 2020 03:29:45 +0000 (03:29 +0000)]
jail: fix typo in usage output
'-j' is wrong, it should be '-i' (for _i_mmediately).
Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Daniel Golle [Thu, 26 Nov 2020 01:44:50 +0000 (01:44 +0000)]
jail: seteuid before clone(CLONE_NEWUSER)
Resolve the userid in parent namespace mapped to the root user of the
new user namespace. Before clone(), seteuid() to that user in the parent
namespace.
Use SECBIT_NO_SETUID_FIXUP so the parent process can later on switch
back using seteuid(0).
Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Daniel Golle [Thu, 26 Nov 2020 01:01:14 +0000 (01:01 +0000)]
jail: don't fail if can't mount-bind /etc/resolv.conf
Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Daniel Golle [Thu, 26 Nov 2020 00:55:20 +0000 (00:55 +0000)]
jail: don't use NULL arguments for mount syscall
Make valgrind more happy
Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Daniel Golle [Thu, 26 Nov 2020 00:26:43 +0000 (00:26 +0000)]
jail: relax /etc/resolv.conf creation
Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Daniel Golle [Wed, 25 Nov 2020 23:25:58 +0000 (23:25 +0000)]
jail: fix and simplify userns uid/gid maps from OCI
Pre-calculate allocation length more simply and make sure maps are
properly generated.
Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Daniel Golle [Wed, 25 Nov 2020 20:00:10 +0000 (20:00 +0000)]
jail: fix segfault on missing name and refactor
Move check for named jail up to main() function, and also add that
condition in case an OCI container is loaded as that would segfault
in case no name was given.
Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Daniel Golle [Tue, 24 Nov 2020 21:03:12 +0000 (21:03 +0000)]
jail: leak less memory
Always free everything before exiting, clean up dynamic structures,
add missing free() calls in various places, ...
Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Daniel Golle [Sun, 22 Nov 2020 22:50:22 +0000 (22:50 +0000)]
jail: add 'debug' extern variable to preload_seccomp
ujail's seccomp ld-preload support broke recently with
Error relocating /lib/libpreload-seccomp.so: debug: symbol not found
Fix that by adding a debug variable to seccomp.c.
Fixes: be6da62 ("seccomp: silence 'unknown syscall' warnings")
Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Daniel Golle [Sun, 22 Nov 2020 04:23:29 +0000 (04:23 +0000)]
uxc: also delete procd runtime state on 'delete'
Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Daniel Golle [Sun, 22 Nov 2020 03:16:31 +0000 (03:16 +0000)]
uxc: fix incomplete commit
Fixes: 04a2edd ("uxc: make force-delete kill container process")
Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Daniel Golle [Wed, 28 Oct 2020 13:06:07 +0000 (13:06 +0000)]
jail: cgroup hack: rewrite cgroup -> cgroup2
"I'm sure you said cgroup2"
Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Daniel Golle [Fri, 20 Nov 2020 23:56:13 +0000 (23:56 +0000)]
seccomp: silence 'unknown syscall' warnings
Output them as debugging messages instead.
Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Daniel Golle [Thu, 19 Nov 2020 17:12:54 +0000 (17:12 +0000)]
uxc: make force-delete kill container process
Don't allow deleting running containers unless '--force' is
specified. If '--force' is specified, send KILL signal to container
process before deleting it.
Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Daniel Golle [Sun, 15 Nov 2020 23:58:44 +0000 (23:58 +0000)]
trace: switch to OCI seccomp JSON output
Generate JSON as specified on OCI runtime spec for seccomp syscall
filter instead of our previous OpenWrt-specific format.
[1]: https://github.com/opencontainers/runtime-spec/blob/master/config-linux.md#seccomp
Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Daniel Golle [Sun, 15 Nov 2020 23:22:13 +0000 (23:22 +0000)]
seccomp: switch to new OCI compliant parser
Drop the old OpenWrt-specific seccomp rule parser in favour of reusing
the OCI compliant variant.
Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Daniel Golle [Sun, 15 Nov 2020 23:45:38 +0000 (23:45 +0000)]
seccomp: specifying architectures is optional
Specifying the architecture used for system calls is optional in OCI
spec. Make it optional in the parser.
Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Daniel Golle [Fri, 6 Nov 2020 18:42:25 +0000 (18:42 +0000)]
jail: fix capabilities
Allocate enough stack space for capget()/capset() which requires
2*sizeof(struct __user_cap_data_struct), each containing 32-bit fields,
where the 2nd struct contains the bits for high (>32) capabilities.
Failing to do that not only made those high capabilities
inaccessible but also overwrote the stack, resulting in ujail hanging
indefinitely instead of returning from applyOCIcapabilities().
Also adapt debugging output to 64-bit format.
Apart from that, don't set SECBIT_NO_SETUID_FIXUP when not actually
modifying capabilities explicitly, as that would result in ALL
capabilities retained in the subsequent setuid() call instead of
having them all dropped.
Signed-off-by: Daniel Golle <daniel@makrotopia.org>
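To illustrate the stack-space point made above, here is a small sketch that reads the effective capability set through the raw capget syscall. get_effective_caps() is a hypothetical helper, not the code in jail; the structures and constants come from <linux/capability.h>.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/capability.h>

/*
 * Read the effective capabilities of the calling process as one 64-bit
 * mask. With _LINUX_CAPABILITY_VERSION_3 the kernel writes TWO data
 * structs: data[0] holds capabilities 0..31, data[1] the higher ones.
 * Reserving room for only one struct lets the kernel write past it.
 */
int get_effective_caps(uint64_t *caps)
{
	struct __user_cap_header_struct hdr;
	struct __user_cap_data_struct data[_LINUX_CAPABILITY_U32S_3];

	assert(caps);

	memset(&hdr, 0, sizeof(hdr));
	memset(data, 0, sizeof(data));
	hdr.version = _LINUX_CAPABILITY_VERSION_3;
	hdr.pid = 0;	/* 0 means the calling process */

	if (syscall(SYS_capget, &hdr, data))
		return -1;

	*caps = ((uint64_t)data[1].effective << 32) | data[0].effective;
	return 0;
}
```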
Daniel Golle [Tue, 27 Oct 2020 16:34:06 +0000 (16:34 +0000)]
uxc: mimic runc cmdline by using getopt_long
Imitate runc (or crun) cmdline parameters. This allows using uxc as
runtime with podman.
Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Daniel Golle [Wed, 28 Oct 2020 13:01:52 +0000 (13:01 +0000)]
jail: don't fail if maskedPath cannot be found
Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Daniel Golle [Wed, 28 Oct 2020 11:59:10 +0000 (11:59 +0000)]
jail: add support for absolute root path in OCI spec
Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Daniel Golle [Wed, 28 Oct 2020 01:39:34 +0000 (01:39 +0000)]
jail: relax seccomp unknown syscall handling
Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Daniel Golle [Wed, 28 Oct 2020 00:30:03 +0000 (00:30 +0000)]
jail: handle mount propagation flags
Add support for propagation mount options (private, slave, shared,
unbindable, rprivate, rslave, rshared, runbindable).
Signed-off-by: Daniel Golle <daniel@makrotopia.org>
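The option names listed above map onto mount(2) flags roughly as in this sketch. propagation_flags() is a hypothetical helper; the MS_* constants are the ones from <sys/mount.h>, with the r-prefixed variants adding MS_REC.

```c
#include <assert.h>
#include <string.h>
#include <sys/mount.h>

/*
 * Translate an OCI mount option into mount(2) propagation flags.
 * Returns 0 if `opt` is not a propagation option.
 */
unsigned long propagation_flags(const char *opt)
{
	static const struct {
		const char *name;
		unsigned long flags;
	} map[] = {
		{ "private",     MS_PRIVATE },
		{ "slave",       MS_SLAVE },
		{ "shared",      MS_SHARED },
		{ "unbindable",  MS_UNBINDABLE },
		{ "rprivate",    MS_PRIVATE | MS_REC },
		{ "rslave",      MS_SLAVE | MS_REC },
		{ "rshared",     MS_SHARED | MS_REC },
		{ "runbindable", MS_UNBINDABLE | MS_REC },
	};
	size_t i;

	assert(opt);

	for (i = 0; i < sizeof(map) / sizeof(map[0]); i++)
		if (!strcmp(opt, map[i].name))
			return map[i].flags;

	return 0;
}
```

Propagation flags are applied in a separate mount() call on the target, after the filesystem itself has been mounted.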
Daniel Golle [Wed, 28 Oct 2020 00:09:51 +0000 (00:09 +0000)]
jail: add option for pidfile
Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Daniel Golle [Tue, 27 Oct 2020 22:15:09 +0000 (22:15 +0000)]
jail: guard boolean blobmsg attributes
ujail tried to parse boolean values in config.json even if they were
not present which led to segfaults.
Check if booleans are actually present before trying to parse them.
Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Daniel Golle [Thu, 22 Oct 2020 21:59:14 +0000 (22:59 +0100)]
ujail: elf: work around GCC bug on MIPS64
Work-around gcc bug which leads to segfault parsing ELF on MIPS64.
The codepath added in this commit gets triggered when parsing
/lib/ld-musl-mips64-sf.so.1 (a symlink to /lib/libc.so) on MIPS64
(built with gcc-8.4.0 and musl 1.1.24) in qemu-system-mips64 on the
malta/be64 target.
Include work-around outputting an error message, but preventing
segfault when building for MIPS64.
Tested-by: Roman Kuzmitskii <damex.pp@icloud.com>
[tested on edgerouter 4 and edgerouter lite]
Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Daniel Golle [Thu, 22 Oct 2020 01:44:14 +0000 (02:44 +0100)]
jail: mount more stuff read-only
Mount /etc/resolv.conf, /etc/passwd, /etc/group and /etc/nsswitch.conf
read-only in ujail slim-containers.
Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Daniel Golle [Mon, 19 Oct 2020 18:30:13 +0000 (19:30 +0100)]
jail: capabilities: apply in two phases
Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Daniel Golle [Mon, 19 Oct 2020 16:15:11 +0000 (17:15 +0100)]
jail: nuke old capabilities code in favour of reusing OCI code
Previously capabilities could be defined for slim-containers using
our own JSON format, only allowing to modify capabilities in the
bounding set. As apparently that was never used by even a single
package, drop that old parser and logic in favour of reusing the now
existing OCI capability handling functions.
Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Daniel Golle [Mon, 19 Oct 2020 16:50:19 +0000 (17:50 +0100)]
instance: actually wire up capabilities filename
Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Daniel Golle [Mon, 19 Oct 2020 16:00:26 +0000 (17:00 +0100)]
jail: adapt to new ubus socket path
The previous commit
3121467 ("early: run ubusd non-root as user ubus, group ubus")
changed the path of the ubus socket from /var/run/ubus.sock to
/var/run/ubus/ubus.sock. Adapt jail to also mount-bind that new
path for jails which include ubus access (eg. dnsmasq).
Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Daniel Golle [Mon, 19 Oct 2020 12:43:23 +0000 (13:43 +0100)]
early: run ubusd non-root as user ubus, group ubus
Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Daniel Golle [Thu, 13 Aug 2020 00:54:21 +0000 (01:54 +0100)]
cgroups: memory controller fixes
OCI 'swap' value encodes memory+swap, make the best out of that.
Ignore 'kernel' and 'kernelTCP' values rather than returning with
error as kernel memory is accounted in the existing limits in cgroup2.
Signed-off-by: Daniel Golle <daniel@makrotopia.org>
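One possible reading of the 'swap' conversion above, as a worked sketch: since the OCI value covers memory plus swap, a cgroup2 swap-only limit can be obtained by subtracting the memory limit. The helper name and the handling of corner cases are assumptions, not necessarily what the commit implements.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Derive a cgroup2 swap-only limit from OCI values.
 *   limit: OCI memory limit in bytes, -1 for unlimited
 *   swap:  OCI memory+swap limit in bytes, -1 for unlimited
 * On success stores the swap-only limit (-1 == "max") in *swap_max and
 * returns 0. Returns -1 if the combination cannot be expressed.
 */
int oci_swap_to_cgroup2(int64_t limit, int64_t swap, int64_t *swap_max)
{
	assert(swap_max);

	if (swap < 0) {
		*swap_max = -1;	/* unlimited */
		return 0;
	}

	/* memory+swap below the memory limit makes no sense */
	if (limit < 0 || swap < limit)
		return -1;

	*swap_max = swap - limit;
	return 0;
}
```

For example, a memory limit of 100 MiB with an OCI 'swap' of 150 MiB leaves 50 MiB for swap alone.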
Daniel Golle [Thu, 13 Aug 2020 00:22:11 +0000 (01:22 +0100)]
cgroups: restrict allowed keys in 'unified' section
Prevent specifying directories by banning the use of '/' characters
and disallow some internal cgroup.* files as suggested in [1].
[1]: https://github.com/opencontainers/runtime-spec/pull/1040
Signed-off-by: Daniel Golle <daniel@makrotopia.org>
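The key restriction above could be sketched as follows. unified_key_allowed() is a hypothetical name, and the deny list shown is only illustrative: the commit says "some internal cgroup.* files" without listing them here.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Decide whether a key from the OCI 'unified' section may be written
 * into the container's cgroup directory.
 */
bool unified_key_allowed(const char *key)
{
	/* illustrative deny list, an assumption for this sketch */
	static const char *const denied[] = {
		"cgroup.procs",
		"cgroup.threads",
		"cgroup.subtree_control",
	};
	size_t i;

	assert(key);

	/* an empty key or any '/' could address something outside the cgroup */
	if (!*key || strchr(key, '/'))
		return false;

	for (i = 0; i < sizeof(denied) / sizeof(denied[0]); i++)
		if (!strcmp(key, denied[i]))
			return false;

	return true;
}
```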
Thomas Petazzoni [Mon, 10 Aug 2020 01:15:20 +0000 (15:15 -1000)]
initd/init: add minimal SELinux policy loading support
In order to support SELinux in OpenWrt, this commit introduces minimal
support for loading the SELinux policy in the init code. The logic is
very much inspired from what Busybox is doing: call
selinux_init_load_policy() from libselinux, and then re-execute init
so that it runs with the SELinux policy in place and enforced.
Signed-off-by: Thomas Petazzoni <thomas.petazzoni at bootlin.com>
[fix spelling of OpenWrt]
Signed-off-by: Paul Spooren <mail@aparcar.org>
Daniel Golle [Thu, 6 Aug 2020 14:34:27 +0000 (15:34 +0100)]
jail: fix freeing cgroups avl
Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Daniel Golle [Thu, 6 Aug 2020 14:34:27 +0000 (15:34 +0100)]
jail: only free cgroups if they were allocated
Fixes segfault on shutdown with slim containers.
Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Daniel Golle [Wed, 5 Aug 2020 17:37:53 +0000 (18:37 +0100)]
jail: parse OCI cgroups resources
Start pure cgroup2 implementation with emulation of (some) cgroup1
properties.
Initially support converting cpu, memory, blockIO, pids to unified in
addition to directly specifying unified attributes as suggested in
https://github.com/opencontainers/runtime-spec/pull/1040
Support for converting devices and network into BPF programs is
planned.
Now that containers have their representation in the unified cgroup
hierarchy, make sure using cgroup namespaces also produces meaningful
results.
Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Daniel Golle [Wed, 5 Aug 2020 13:36:44 +0000 (14:36 +0100)]
instance: add instances into unified cgroup hierarchy
Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Daniel Golle [Tue, 4 Aug 2020 00:55:40 +0000 (01:55 +0100)]
jail: make use of BLOBMSG_CAST_INT64 for OCI rlimits
Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Daniel Golle [Sun, 2 Aug 2020 18:25:29 +0000 (19:25 +0100)]
jail: use pidns semantics also for timens
Just like pidns, timens is also only applied to children forked after
the setns() call, so use the same semantics here as well when joining
an existing time namespace.
Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Daniel Golle [Wed, 29 Jul 2020 13:26:51 +0000 (14:26 +0100)]
initd: attempt to mount cgroup2
Prepare for using cgroup2 in procd and ujail.
Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Daniel Golle [Wed, 29 Jul 2020 12:49:38 +0000 (13:49 +0100)]
service: add method to query available container features
Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Daniel Golle [Thu, 30 Jul 2020 11:58:42 +0000 (12:58 +0100)]
uxc: remove debugging left-over
Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Daniel Golle [Wed, 29 Jul 2020 21:17:05 +0000 (22:17 +0100)]
instance: make sure values are not inherited from previous runs
Code to update and move instance attributes has been neglected when
new instance and jail options were added.
Add the ones which were missing.
Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Daniel Golle [Tue, 28 Jul 2020 23:41:32 +0000 (00:41 +0100)]
uxc: use new container.%s kill ubus API
Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Daniel Golle [Tue, 28 Jul 2020 23:36:19 +0000 (00:36 +0100)]
jail: add 'kill' method to container.%s object
Using the current container signal method to send a signal to the
jailed process works fine, as signals are being forwarded by the
ujail parent process. However, in case of KILL (==9) signal, both,
parent and jailed process are killed immediately which results in the
'poststop' OCI hook being skipped.
Add new 'kill' method to ujail's container object to allow sending
signals to the jailed process directly instead of having to send
signals to the parent.
Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Daniel Golle [Tue, 28 Jul 2020 23:13:27 +0000 (00:13 +0100)]
uxc: fix create operation
The 'create' operation needs uxc to reload its configuration, so after
adding the container to uxc's persistent state tracking the follow-up
call to create the run-time can find it.
Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Daniel Golle [Tue, 28 Jul 2020 08:06:39 +0000 (09:06 +0100)]
uxc: behave more like a compliant OCI run-time
Follow CLI syntax as described in OCI run-time spec[1].
In addition, allow 'create' call also without 'path' parameter to
re-create previously created containers, also after reboot.
Usual workflow:
uxc create debian /mnt/sda3/debian
uxc start debian
uxc kill debian 1
uxc create debian
uxc start debian
...
To create a container and have it automatically launched at boot:
uxc create debian /mnt/sda3/debian true
[1]: https://github.com/opencontainers/runtime-spec/blob/master/runtime.md#operations
Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Daniel Golle [Tue, 28 Jul 2020 08:05:24 +0000 (09:05 +0100)]
jail: add some remaining OCI features
* register ubus object for container to query state
* wait on 'created' state until 'start' command is issued via ubus
* have a way to bypass waiting on 'created' state
* support OCI annotations pass-through
Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Daniel Golle [Sat, 25 Jul 2020 17:28:25 +0000 (18:28 +0100)]
jail: serialize hook execution
Make sure hook execution is completed before continuing with any
further actions. This involves a major refactoring of ujail to use a
single uloop mainloop for each process to avoid concurrency issues.
Also fix other remaining problems in code for OCI hooks, such as making
sure memory allocated to store hook information is zeroed.
Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Daniel Golle [Sat, 25 Jul 2020 15:30:29 +0000 (16:30 +0100)]
jail: fix build on glibc and uclibc
Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Daniel Golle [Mon, 20 Jul 2020 14:00:23 +0000 (15:00 +0100)]
jail: add support for referencing existing namespaces
Allow OCI containers to specify paths to existing namespaces.
Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Rosen Penev [Mon, 20 Jul 2020 22:35:27 +0000 (15:35 -0700)]
jail: fix wrong format for 32-bit
The proper format for size_t is %zu .
Signed-off-by: Rosen Penev <rosenp@gmail.com>
Rosen Penev [Mon, 20 Jul 2020 22:35:26 +0000 (15:35 -0700)]
rcS: cast format string to int64_t
musl 1.2.0 turns time_t into a 64-bit value, even on 32-bit. This makes it
compatible.
Signed-off-by: Rosen Penev <rosenp@gmail.com>
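The cast described above follows a common pattern for printing time_t portably. A minimal sketch, with format_time() being a hypothetical helper rather than the code in rcS.c:

```c
#include <assert.h>
#include <inttypes.h>
#include <stdio.h>
#include <time.h>

/*
 * Print a time_t regardless of its width on the target: cast to
 * int64_t and use the matching PRId64 format macro.
 * Returns the number of characters snprintf() would have written.
 */
int format_time(char *buf, size_t len, time_t t)
{
	assert(buf && len > 0);

	return snprintf(buf, len, "%" PRId64, (int64_t)t);
}
```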
Daniel Golle [Mon, 20 Jul 2020 00:37:15 +0000 (01:37 +0100)]
jail: re-implement /proc/sys/net read-write in netns hack
Hack to make /proc/sys/net read-write while the rest of /proc/sys is
read-only which cannot be expressed with OCI spec, but happens to be
very useful. Only apply it if '/proc/sys' is not already listed as
mount, maskedPath or readonlyPath.
Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Daniel Golle [Sun, 19 Jul 2020 23:12:44 +0000 (00:12 +0100)]
jail: refactor default mounts into new structure
Add default mounts of /dev, /dev/pts, /dev/shm, /proc and /sys to
the restructured mounts AVL list instead of calling mount directly.
While for slim containers this change shouldn't make any difference,
it allows OCI containers to override options of those default
filesystems.
The previous hack keeping /proc/sys/net mounted read-write if inside
a new network namespace while all the rest of /proc/sys is read-only
cannot easily be translated and is removed for now.
Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Daniel Golle [Sun, 19 Jul 2020 23:30:06 +0000 (00:30 +0100)]
jail: actually apply filesystem-specific mount options
OCI-supplied filesystem-specific mount options have not been stored
in the add_mount() function. strdup() them there and free the original
string in the OCI function.
Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Daniel Golle [Sun, 19 Jul 2020 20:45:53 +0000 (21:45 +0100)]
jail: add support for defining devices
OCI run-time spec allows containers to specify devices to be created
in /dev in addition to the default devices.
Parse devices from linux section in config.json; clean-up and refactor
default entries in /dev into the same function using a similar scheme.
Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Daniel Golle [Sun, 19 Jul 2020 19:21:33 +0000 (20:21 +0100)]
jail: move /tmp/resolv.conf.d to /dev/resolv.conf.d
OCI spec implicitly intends /dev to be used as tmpfs mounted by
default while /tmp may not be mounted or may not even exist.
Hence move /tmp/resolv.conf.d to /dev/resolv.conf.d inside
container.
Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Daniel Golle [Sun, 19 Jul 2020 18:09:34 +0000 (19:09 +0100)]
jail: /proc/$pid/oom_score_adj to OCI defined oomScoreAdj
Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Daniel Golle [Sun, 19 Jul 2020 16:31:42 +0000 (17:31 +0100)]
jail: parse and apply POSIX rlimits
Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Daniel Golle [Sun, 19 Jul 2020 00:32:55 +0000 (01:32 +0100)]
jail: read and apply umask from OCI if defined
Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Daniel Golle [Sat, 18 Jul 2020 23:35:16 +0000 (00:35 +0100)]
jail: implement OCI user additionalGIDs
Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Daniel Golle [Sat, 18 Jul 2020 21:58:22 +0000 (22:58 +0100)]
jail: parse and apply OCI sysctl values
Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Daniel Golle [Sat, 18 Jul 2020 11:31:09 +0000 (12:31 +0100)]
jail: fix hooks
Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Daniel Golle [Thu, 16 Jul 2020 22:00:35 +0000 (23:00 +0100)]
jail: add support for maskedPaths and readonlyPaths
Parse maskedPaths and readonlyPaths string arrays if defined in OCI
container linux section. readonlyPaths are implemented by adding a
recursive read-only bind-mount on the path, maskedPaths are done by
mounting a zero-sized tmpfs with mode 000 for directories and mount-
binding an empty file having mode 000 for non-directories.
Mounts of both, maskedPaths and readonlyPaths, may fail silently if
the path doesn't exist.
Signed-off-by: Daniel Golle <daniel@makrotopia.org>
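The maskedPaths rule above picks a strategy based on what is found at the path. A sketch of that decision only, leaving out the mounting itself, which needs privileges; the enum and function names are hypothetical.

```c
#include <assert.h>
#include <sys/stat.h>

enum mask_action {
	MASK_SKIP,		/* path doesn't exist: fail silently */
	MASK_TMPFS,		/* directory: cover with an empty tmpfs, mode 000 */
	MASK_BIND_EMPTY_FILE,	/* anything else: bind-mount an empty file, mode 000 */
};

/* Decide how a maskedPaths entry would be hidden. */
enum mask_action masked_path_action(const char *path)
{
	struct stat s;

	assert(path);

	if (stat(path, &s))
		return MASK_SKIP;

	return S_ISDIR(s.st_mode) ? MASK_TMPFS : MASK_BIND_EMPTY_FILE;
}
```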