The goal of this change is to delay getty services until after
systemd-vconsole-setup has finished. systemd-vconsole-setup starts loadkeys,
and it seems that when loadkeys is interrupted by the TTY hangup call we do
when starting tty services [1], so that loadkeys starts getting EIO from the
ioctl("/dev/tty1", KDSKBENT) syscall it does.
Fixes#26908.
[1] https://github.com/legionus/kbd/issues/92#issuecomment-1554451788
Initially I wanted to add ordering dependencies to individual units, but
TTYVHangup= can be added to other various external units too. The solution with
an implicit dependency should cover those cases too.
This is closely related to the previous commit: if the credentials dir
is empty and nothing mounted on it, let's remove it again.
This will in particular happen if we decided to not actually install the
mount we prepared for the credentials because it is empty. In that case
the mount point inode is already there, and with this we'll remove it.
Primary effect, users will see ENOENT rather than EACCESS when trying to
access it, which should be preferable, given we already handle that
nicely in our credential consumption code.
This should also be useful on systems where we lack any privs to create
mounts, and thus operate on a regular dir anyway.
Let's avoid creating another mount in the system if it's empty anyway.
This is mostl a cosmetic thing in one (pretty common) special case: if
creds settings are used in a unit but no creds actually available to be
passed.
(While we are at it this also does one more minor optimization: it
adjusts the MS_RDONLY/MS_NOSUID/… flags of the source mount we are about
to MS_MOVE into the right place only if we actually really move it, and
if we instead unmount it again we won't bother with the flags either)
If we create a subcroup (regardless if the '.control' subgroup we
always created or one configured via DelegateSubgroup=) it's inside of
the delegated territory of the cgroup tree, hence it should be owned
fully by the unit's users. Hence do so.
We don't need to apply the journal/oomd xattrs to the subcgroups we add,
since those daemons already look for the xattrs up the tree anyway.
Hence remove this.
This is in particular relevant as it means later changes to the xattr
don#t need to be replicated on the subcgroup either.
This implements a minimal subset of #24961, but in a lot more
restrictive way: we only allow one level of subcgroup (as that's enough
to address the no-processes in inner cgroups rule), and does not change
anything about threaded cgroup logic or similar, or make any of this new
behaviour mandatory.
All this does is this: all non-control processes we invoke for a unit
we'll invoke in a subgroup by the specified name.
We'll later port all our current services that use cgroup delegation
over to this, i.e. user@.service, systemd-nspawn@.service and
systemd-udevd.service.
Enabling these options when not running as root requires a user
namespace, so implicitly enable PrivateUsers=.
This has a side effect as it changes which users are visible to the unit.
However until now these options did not work at all for user units, and
in practice just a handful of user units in Fedora, Debian and Ubuntu
mistakenly used them (and they have been all fixed since).
This fixes the long-standing confusing issue that the user and system
units take the same options but the behaviour is wildly (and sometimes
silently) different depending on which is which, with user units
requiring manually specifiying PrivateUsers= in order for sandboxing
options to actually work and not be silently ignored.
Now that we have a potentially pinned fdstore let's add a concept for
cleaning it explicitly on user requested. Let's expose this via
"systemctl clean", i.e. the same way as user directories are cleaned.
After 'if (DEBUG_LOGGING)' is added, the two call sites are almost identical,
except that we forgot LOG_UNIT_INVOCATION_ID(unit).
I removed the handling of the log_oom(). It's a debug message only after all,
and it's unlikely to fail.
Currently, exec runtimes can be shared between units (using
JoinsNamespaceOf=). Let's introduce a concept of a private exec
runtime that isn't shared with JoinsNamespaceOf=. The existing
ExecRuntime struct is renamed to ExecRuntimeShared and becomes a
private member of the new private ExecRuntime.
Chasing symlinks is a core function that's used in a lot of places
so it deservers a less verbose names so let's rename it to chase()
and chaseat().
We also slightly change the pattern used for the chaseat() helpers
so we get chase_and_openat() and similar.
Whenever we're going to close all file descriptors, we tend to close
the log and set it into open when needed mode. When this is done with
the logging target set to LOG_TARGET_AUTO, we run into issues because
for every logging call, we'll check if stderr is connected to the
journal to determine where to send the logging message. This check
obviously stops working when we close stderr, so we settle the log
target before we do that so that we keep using the same logging
target even after stderr is closed.
Let's allow configuring tty term and size using kernel cmdline arguments
so that when running in a VM we can communicate the terminal TERM and size
from the host via SMBIOS extra kernel cmdline arguments.
The interface requires services to write to the cgroup file to activate notifications,
but with ProtectControlGroups=yes we make it read-only. Add a writable bind mount.
Follow-up for 6bb0084204
ExecContext has a field that controls the mount propagation flag of the
mounts in the resulting namespace. This is exposed as "MountFlags="
which is super confusing, as it suggests one could control more than
propagation, and that it was actually a flags field. It's an enum
though only, and nothing else.
We might want to rename this externally one day, but given the compat
kludges this requires and the fact this is somewhat nichey it might not
be worth it. But internally let's rename it, as it makes things much
easier to grok, in particular as part of the codebase already exposed
the concept as mount_propagation_flag.
No actual code flow changes, just some renaming.
On some ARM platforms, the dynamic linker could use PROT_BTI memory protection
flag with `mprotect(..., PROT_BTI | PROT_EXEC)` to enable additional memory
protection for executable pages. But `MemoryDenyWriteExecute=yes` blocks this
with seccomp filter denying all `mprotect(..., x | PROT_EXEC)`.
Newly preferred method is to use prctl(PR_SET_MDWE) on supported kernels. Then
in-kernel implementation can allow PROT_BTI as necessary, without weakening
MDWE. In-kernel version may also be extended to more sophisticated protections
in the future.
In various tools and services we have a per-system and per-user concept.
So far we sometimes used a boolean indicating whether we are in system
mode, or a reversed boolean indicating whether we are in user mode, or
the LookupScope enum used by the lookup path logic.
Let's address that, in introduce a common enum for this, we can use all
across the board.
This is mostly just search/replace, no actual code changes.
If a PAM service sets some ambient caps, we should honour that, hence
query it, and merge it with our own ambient settings.
This needs to be done manually since otherwise dropping privs via
setresuid() will undo all such caps, and we need to manually tweak
things to keep them.
Even when a mount namespace is created, previously host's sysfs is used,
especially with RootDirectory= or RootImage=, thus service processes can
still access the properties of the network interfaces in the main network
namespace through sysfs.
This makes, sysfs is remounted with the new network namespace tag, except
when PrivateMounts= is explicitly disabled. Hence, the properties of the
network interfaces in the main network namespace cannot be accessed by
service processes through sysfs.
Fixes#26422.
We should be more careful with distinguishing the cases "all bits set in
caps mask" from "cap mask invalid". We so far mostly used UINT64_MAX for
both, which is not correct though (as it would mean
AmbientCapabilities=~0 followed by AmbientCapabilities=0) would result
in capability 63 to be set (which we don't really allow, since that
means unset).