https://github.com/systemd/systemd/pull/23192 caused breakage in
Arch Linux's build tooling. Let's give users an opt-out aside from
reverting the patch. It's hardly any maintenance work on our side
and gives users an easy way to revert the locale change if needed.
Of course, by default we still pick C.UTF-8 if the option is not
specified.
This is a follow-up for f470cb6d13 which in
turn is a follow-up for a068aceafb.
The latter started to honour hidden files when deciding whether a
directory is empty. The former reverted to the old behaviour to fix
issue #23220.
It introduced a bug though: when a directory contains a larger number of
hidden entries the getdents64() buffer will not suffice to read them,
since we just allocate three entries for it (which is definitely enough
if we just ignore the . + .. entries, but not ig we ignore more).
I think it's a bit confusing that dir_is_empty() can return true even if
rmdir() on the dir would return ENOTEMPTY. Hence, let's rework the
function to make it optional whether hidden files are ignored or not.
After all, I looking at the users of this function I am pretty sure in
more cases we want to honour hidden files.
Follow-up for 2362fdde1b
When --machine is specified with --ephemeral, no random suffix is added, so
the recently added assert would fail.
Add a top-level variable with the expected file name for nspawn files, and
compute it when the rest of the names are computed.
When --ephemeral is used, a random 16 characters suffix is added to the image
name, so matching on .nspawn files based on the image name no longer works.
Fixes https://github.com/systemd/systemd/issues/13297
So here's something we should always keep in mind:
systemd-udevd actually does *two* things with BSD file locks on block
devices:
1. While it probes a device it takes a LOCK_SH lock. Thus everyone else
taking a LOCK_EX lock will temporarily block udev from probing
devices, which is good when making changes to it.
2. Whenever a device is closed after write (detected via inotify), udevd
will issue BLKRRPART (requesting the kernel to reread the partition
table). It does this while holding a LOCK_EX lock on the block
device. Thus anyone else taking LOCK_SH or LOCK_EX will temporarily
block udevd from issuing that ioctl. And that's quite relevant, since
the kernel will temporarily flush out all partitions while re-reading
the partition table and then create them anew. Thus it is smart to
take LOCK_SH when dissecting a block device to ensure that no
BLKRRPART is issued in the background, until we mounted the devices.
See a2b0cd3f5a. When -Dshared-lib-tag is used,
libsystemd-shared.so and libsystemd-core.so get a suffix which breaks the
parsing done by systemd_installation_has_version(). We can assume that the
tag will be something like "251-rc1-1.fc37" that is currently used in Fedora.
(Anything that does *not* start with the version would be completely crazy.)
By switching to strverscmp_improved() we simplify the code and fix comparisons
with such versions.
$ build/test-nspawn-util /var/lib/machines/rawhide
...
Found libsystemd shared at "/var/lib/machines/rawhide/lib/systemd/libsystemd-shared-251-rc1-1.fc37.so.so", version 251-rc1-1.fc37 (OK).
/var/lib/machines/rawhide has systemd >= 251: yes
...
I noticed this when I started a systemd-nspawn container with Redora rawhide
and got the message "Not running with unified cgroup hierarchy, LSM BPF is not
supported". I thought the message is in error, but it was actually correct:
nspawn was misdetecting that the container does not sport new-enough systemd
to support cgroups-v2.
This function implements a heuristic that is only used by nspawn. It doesn't
belong in basic. I opted for a new file "nspawn-utils.c", because it seems
likely that we'll need some other new utilities like that in the future.
No functional change.
Defaults to /bin/bash, no changes in the default configuration
The fallback shell for non-root users is as-specified,
and the interactive shell for nspawn sessions is started as
exec(default-user-shell, "-" + basename(default-user-shell), ...)
before falling through to bash and sh
Same idea as 03677889f0.
No functional change intended. The type of the iterator is generally changed to
be 'const char*' instead of 'char*'. Despite the type commonly used, modifying
the string was not allowed.
I adjusted the naming of some short variables for clarity and reduced the scope
of some variable declarations in code that was being touched anyway.
When using user namespaces in conjunction with uidmapped mounts, nspawn
so far set up two uidmappings:
1. One that is used for the uidmapped mount and that maps the UID range
0…65535 on the backing fs to some high UID range X…X+65535 on the
uidmapped fs. (Let's call this mapping the "mount mapping")
2. One that is used for the userns namespace the container payload
processes run in, that maps X…X+65535 back to 0…65535. (Let's call
this one the "process mapping").
These mappings hence are pretty much identical, one just moves things up
and one back down. (Reminder: we do all this so that the processes can
run under high UIDs while running off file systems that require no
recursive chown()ing, i.e. we want processes with high UID range but
files with low UID range.)
This creates one problem, i.e. issue #20989: if nspawn (which runs as
host root, i.e. host UID 0) wants to add inodes to the uidmapped mount
it can't do that, since host UID 0 is not defined in the mount mapping
(only the X…X+65536 range is, after all, and X > 0), and processes whose
UID is not mapped in a uidmapped fs cannot create inodes in it since
those would be owned by an unmapped UID, which then triggers
the famous EOVERFLOW error.
Let's fix this, by explicitly including an entry for the host UID 0 in
the mount mapping. Specifically, we'll extend the mount mapping to map
UID 2147483646 (which is INT32_MAX-1, see code for an explanation why I
picked this one) of the backing fs to UID 0 on the uidmapped fs. This
way nspawn can creates inode on the uidmapped as it likes (which will
then actually be owned by UID 2147483646 on the backing fs), and as it
always did. Note that we do *not* create a similar entry in the process
mapping. Thus any files created by nspawn that way (and not chown()ed to
something better) will appear as unmapped (i.e. as overflowuid/"nobody")
in the container payload. And that's good. Of course, the latter is
mostly theoretic, as nspawn should generally chown() the inodes it
creates to UID ranges that actually make sense for the container (and we
generally already do this correctly), but it#s good to know that we are
safe here, given we might accidentally forget to chown() some inodes we
create.
Net effect: the two mappings will not be identical anymore. The mount
mapping has one entry more, and the only reason it exists is so that
nspawn can access the uidmapped fs reasonably independently from any
process mapping.
Fixes: #20989
This mirrors a similar check in Linux kernel 5.16
(9dcc38e2813e0cd3b195940c98b181ce6ede8f20) that raised the
RLIMIT_MEMLOCK to 8M.
This change does two things: raise the default limit for nspawn
containers (where we try to mimic closely what the kernel does), and
bump it when running on old kernels which still have the lower setting.
Fixes: #16300
See: https://lwn.net/Articles/876288/
We expose various other forms of UUID helpers already, i.e.
SD_ID128_UUID_FORMAT_STR and SD_ID128_MAKE_UUID_STR(), and we parse
UUIDs, hence add a high-level helper for formatting UUIDs too.
This doesn't add any new code, it just moves some helpers
id128-util.[ch] → sd-id128.[ch], to make them public.
Our goal here (as in the previous commits) is to ensure that a settings
file loaded in --settings=override mode is truly a NOP. Previously this
was not the case as we'd drop CAP_NET_ADMIN from the caps if the
settings file didn't enable networking.
With this change we'll drop it only if explicitly turned off in the
settings file, and otherwise let the built-in defaults and cmdline
params reign supreme as documented.
Fixes: #20055
Let's only pick this up from the settings if actually set.
As in the previous commit this makes sure that an empty settings file in
--settings=override mode is really a NOP.
Let's turn these three fields into tristates, so that we can distinguish
whether they are not configured at all from explicitly turned off.
Let#s then use this to ensure that we only copy the settings fields into
our execution environment if they are actually configured.
We already do this for some of the boolean settings, this adds it for
the missing ones.
The goal here is to ensure that an empty settings file used in
--settings=override mode (i.e. the default mode used in the
systemd-nspawn@.service unit) is truly a NOP.
The new helper returns whether the settings file had *any* networking
setting configured at all. We already have a similar helper
settings_private_network() which returns a similar result. The
difference is that the new helper will return true when the private
network was explicitly turned off, while the old one will only return
true if configured and enabled.
We'll reuse the helper a 2nd time later on, but even without it it makes
things a bit more readable.
Most sd_notify() calls are like log_info() — the result is only informative
and if they fail, it's best ignore this. But if a call with READY=1 fails,
the unit may enter a failed state, so we should warn about this. Similarly
for FSTOREREMOVE=1: the manager may be left with a stale fd, at least wasting
resources.
We try to pass containers roughly the same rlimits as the host gets from
the kernel. However, this means we'd set the RLIMIT_NOFILE to 4K. Which
is quite limiting though, and is something we actually departed from in
PID1: since 52d6207578 we raise the limit
substantially for all userspace.
Given that nspawn is quite often invoked without proper PID1, let's raise the
limits for container payloads the same way as we do from the real PID1
to its service payloads.
This is supposed to be used by package/image builders such as mkosi to
speed up building, since it allows us to suppress sync() inside a
container.
This does what Debian's eatmydata tool does, but for a container, and
via seccomp (instead of LD_PRELOAD).
LLVM 13 introduced `-Wunused-but-set-variable` diagnostic flag, which
trips over some intentionally set-but-not-used variables or variables
attached to cleanup handlers with side effects (`_cleanup_umask_`,
`_cleanup_(notify_on_cleanup)`, `_cleanup_(restore_sigsetp)`, etc.):
```
../src/basic/process-util.c:1257:46: error: variable 'saved_ssp' set but not used [-Werror,-Wunused-but-set-variable]
_cleanup_(restore_sigsetp) sigset_t *saved_ssp = NULL;
^
1 error generated.
```
Let's make the booleans indicating verity state a bit more descriptive.
Let's rename:
can_verity → has_verity: because that's really what this about
whether verity data is included in the image. Whether we actually
can use it is a different story.
verity → verity_ready: this one should tell us if we have everything
need to actually set it up, hence explicitly say "ready to use" in
the name.
No change in behaviour. Just a bit of renaming.
It expects a generic "struct sockaddr", not a "struct sockaddr_un".
Pass the right member of the union.
Not sure why gcc/llvm never complained about this...