So far the idea was that the default is 'auto', and if appropriate, the
distribution will create /var/log/journal/ to tell journald to use persistent
mode. This doesn't work well with factory resets, because after a factory reset
obviously /var/log is gone. That old default was useful when journald was new
and people were reluctant to enable persistent mode and instead relied on
rsyslog and such for the persistent storage. But nowadays that is rarer, and
anyway various features like user journals only work with persistent storage,
so we want people to enable this by default. Add an option to flip the default
so that distributions can opt in. The default default value remains unchanged.
(I also tested using tmpfiles to instead change this, since we already set
access mode for /var/log/journal through tmpfiles. Unfortunately, tmpfiles runs
too late, after journald has already started, so if tmpfiles creates the
directory, it'll only be used after a reboot. This probably could be made to
work by adding a new service to flush the journal, but that becomes complicated
and we lose the main advantage of simplicity.)
Resolves https://bugzilla.redhat.com/show_bug.cgi?id=1387796.
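For illustration, the resolution of the storage mode described above can be
sketched as follows. This is a simplified model, not journald's code; the
function name and the build_default parameter are hypothetical stand-ins for
the new build-time default.

```python
import os
from typing import Optional


def effective_storage(configured: Optional[str],
                      build_default: str = "auto",
                      journal_dir: str = "/var/log/journal") -> str:
    """Return the storage mode that would actually be used.

    configured    -- value set by the admin, or None if unset
    build_default -- hypothetical name for the compiled-in default
    """
    mode = configured if configured is not None else build_default
    if mode == "auto":
        # 'auto' only becomes persistent if the directory already exists,
        # which is exactly what is lost after a factory reset.
        return "persistent" if os.path.isdir(journal_dir) else "volatile"
    return mode
```

With build_default set to "persistent", the result no longer depends on
whether the directory survived a factory reset.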
A --empower session is effectively root without being UID 0, so it
doesn't make sense to enforce polkit authentication in those. Let's
add the empower group, add --empower sessions to that group and ship
a polkit rule to skip authentication for all users in the empower
group.
(As a side-effect this will also allow users to add themselves to this
group outside of 'run0 --empower' to mimic NOPASSWD from sudo.)
We'd like to measure various additional things into PCRs, but all
available ones to the OS are already used for various purposes. Hence,
let's introduce a new concept of "NV Index based PCRs", i.e. let's use
TPM2 nv indexes of type TPM2_NT_EXTEND that mostly behave like real
PCRs, but which we can allocate relatively freely from the nv index
space. Let's call these "fake" PCRs "NvPCRs".
My original intention was to get a fixed NV index range assigned from
the TCG, either for Linux or for systemd as a project, but this stalled
with no further updates from the TCG for more than a year and a half
now. I was told an NV index range to use though, even if it never was
officially assigned, hence this PR uses this by default. But the range
is configurable at build time, on purpose, so that downstreams have some
flexibility to change this if they want. To abstract the actual nvindex
number away we introduce a naming concept, so that nvindexes are
referenced by name string rather than number.
NvPCRs are defined in little JSON snippets in /usr/lib/nvpcr/*.nvpcr,
that match up index number and name, as well as pick a hash algorithm.
There's one complication: these nvindex (like any nvindex) can be
deleted by anyone with access to the TPM, and then be recreated. This
could be used to reset the NvPCRs to zero during runtime, which defeats
the whole point of them. Our way out: we measure a secret as first thing
after creation into the NvPCRs. (Or actually, we measure a per-NvPCR
secret we derive from a system secret via an HMAC of the NvPCR name and
the nvindex handle.) This "anchoring" secret is stored in /run/ +
/var/lib/ + ESP/XBOOTLDR (the latter encrypted as credential, locked to
the TPM), to make it available during the whole runtime of the OS.
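A rough model of the anchoring idea, as a sketch only: the exact HMAC input
encoding, the NvPCR name "hardware" and the index value used in the usage note
are assumptions made for illustration, and the extend operation is modelled in
software rather than performed on a TPM.

```python
import hashlib
import hmac


def derive_anchor_secret(system_secret: bytes, name: str, nv_index: int) -> bytes:
    """Derive a per-NvPCR secret from the system secret.

    Assumed encoding: UTF-8 name followed by the 32-bit big-endian handle.
    """
    msg = name.encode("utf-8") + nv_index.to_bytes(4, "big")
    return hmac.new(system_secret, msg, hashlib.sha256).digest()


def extend(old: bytes, measured: bytes) -> bytes:
    """Model of an extend operation: new = H(old || measured)."""
    return hashlib.sha256(old + measured).digest()


def fresh_nvpcr() -> bytes:
    """A newly created index starts out as all zeroes."""
    return bytes(hashlib.sha256().digest_size)
```

An index that was deleted and recreated without knowledge of the secret ends
up with a different value for the same later measurements:

    anchor = derive_anchor_secret(b"system secret", "hardware", 0x01D10200)
    good = extend(extend(fresh_nvpcr(), anchor), b"measurement")
    bad = extend(fresh_nvpcr(), b"measurement")   # good != bad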
I noticed in our NixOS packaging that we were working around the fact
that core/swap.c looks for swapon and swapoff in /sbin.
Let's make it configurable just like all the other util-linux binaries
through meson and make it default to /usr/sbin/{swapon,swapoff}.
This way mounts work on a system without the /sbin -> /usr/sbin
compatibility symlink. (As a side-effect, this also lets NixOS
keep them in /nix/store like the other util-linux tools.)
In multi-seat scenarios, a display manager might need to start multiple
greeter sessions. But systemd allows at most one graphical session per
user. So, display managers now have a range of UIDs to dynamically
allocate users for their greeter sessions.
Currently, when fuzzers are enabled, we run meson from within meson
to build the fuzzer executables with sanitizers. The idea is that
we can build the fuzzers with different kinds of sanitizers
independently from the main build.
The issue with this setup is that we don't actually make use of it.
We only build the fuzzers with one set of sanitizers (address,undefined)
so we're adding a bunch of extra complexity without any benefit as we
can just set up the top level meson build with these sanitizers and get
the same result.
The other issue with this setup is that we don't pass on all the options
passed to the top level meson build to the nested meson build. The only things
we pass on are extra compiler arguments and the value of the auto_features
option, but none of the individual feature options if overridden are passed on,
which can lead to very hard to debug issues as an option enabled in the top
level build is not enabled in the nested build.
Since we're not getting anything useful out of this setup, let's simplify
and get rid of the nested meson build. Instead, sanitizers should be enabled
for the top level meson.build. This previously didn't work as we were overriding
the sanitizers passed to the meson build with the fuzzer sanitizer, so we
fix that as well by making sure we combine the fuzzer sanitizer with the ones
passed in by the user.
We also drop support for looking up libFuzzer as a separate library as
it has been shipped builtin in clang since clang 6.0, so we can assume
that -fsanitize=fuzzer is available.
To make sure we still run the fuzzing tests, we now enable the fuzz-tests option
by default (without instrumentation unless one of llvm-fuzz or oss-fuzz is
enabled).
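The sanitizer combination mentioned above amounts to merging two
comma-separated lists instead of replacing one with the other. A minimal
sketch of that merge (the function name is hypothetical, and the real logic
lives in meson.build rather than in Python):

```python
def combine_sanitizers(user: str, fuzzer: str = "fuzzer") -> str:
    """Merge the user's sanitizer list with the fuzzer sanitizer.

    'user' is a comma-separated list such as "address,undefined";
    the value "none" is treated as an empty list.
    """
    parts = [s for s in user.split(",") if s and s != "none"]
    if fuzzer not in parts:
        # Append instead of overriding, so the user's choice is preserved.
        parts.append(fuzzer)
    return ",".join(parts)
```

For example, combine_sanitizers("address,undefined") gives
"address,undefined,fuzzer", and combine_sanitizers("none") gives "fuzzer".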
We add a default test setup that excludes the integration-tests suite
so that the integration tests don't run by default. This allows us to
get rid of $SYSTEMD_INTEGRATION_TESTS. Then, we add two extra setups:
'integration' and 'shell'. The 'integration' setup does not exclude the
integration-tests suite, and so can be used to run the integration tests.
The 'shell' setup does the same, but additionally sets $TEST_SHELL=1,
allowing us to get rid of $TEST_SHELL in the docs.
- default-hierarchy meson option was deprecated by
31323f21bb (v256).
- nscd meson option was deprecated by
28f1f1a5e6 (v257).
Let's completely remove them now.
I ran into the limit with ParticleOS, with 6 profiles, hence I think the
current default value is a bit low. Let's bump it 4x, to 120. This is
still a lot lower than 500 or so which Debian uses downstream.
We can look into raising this further should we collide with this again,
but for now, let's try 120 and see how it goes in practice.
This makes the UID range configurable via build time options, but of
course it really shouldn't be changed. The default range I picked is
outside even of IPA's current (ridiculously large) allocation ranges,
hence hopefully minimizes conflicts.
This commit introduces a build-time option to enable/disable sysupdated
separately from sysupdate. 'auto' translates to enabled by default in
developer builds.
IPE is a new LSM being introduced in 6.12. Like IMA, it works based on a
policy file that has to be loaded at boot, the earlier the better. So
like IMA, if such a policy is present, load it and activate it.
If there are any .p7b files in /etc/ipe/, load them as policies.
The files have to be inline signed in DER format as per IPE documentation.
For more information on the details of IPE:
https://microsoft.github.io/ipe/
Now that we have multi-profile UKIs people likely want to stick more PE
sections into them than before. Hence, bump the number of available PE
section slots to 30 (up from 15). Also, make this configurable at build
time since some folks probably want even more, and others don't want
this at all.
(pre-allocating too many shouldn't matter too much btw, I'd advise
everyone to overshoot, except maybe on the tiniest of embedded boards)
The new link-executor-shared option is similar to the existing
link-udev-shared: when set to false, we link to the static versions of our
internal libraries.
The resulting executor binary is fairly large, about as large as libsystemd-core
(14 MB without lto, 8 with lto).
This is intended as a workaround for the fuckup with the pinned executor
binary:
when an upgrade is performed, the package manager will install new version of
the libraries and new version of the code, and some time later reexecute the
managers. This creates a window when the pinned executor binary will fail to
execute. There are two factors which make the issue easier to hit:
- when the distribution uses a finely-grained shared-lib-tag. E.g. Fedora
uses version-release as the tag, which means that the issue occurs on
every package upgrade. This is the right thing to do, because the
ABI of our internal libraries is not stable at all, so replacing the
library from a different version in place creates a window where our
programs may crash or misbehave.
- when the distribution doesn't immediately reexec all the managers after
upgrade. In early versions of systemd, we used to hammer the machine during
upgrade, doing daemon-reexecs repeatedly. This works, but is ugly and
wasteful. Doing the reexecs while the upgrade is in progress also creates a
window where a mix of old and new configs is loaded. Users are
particularly annoyed by those reloads if there is some issue in the
configuration causing us to emit warnings on every reexec. Doing the
reexecs once after the new configuration and libraries have been put
in place is nicer.
The pinning of the executor binary breaks upgrades and in particular
it penalizes the distributions which make use of the features which
were previously added to avoid bugs and inefficiency during upgrades.
When the executor is linked statically, there is a smaller chance that it'll
fail to load libraries. The issue can still occur because other libraries, not
our own, are linked dynamically.
nscd is known to be racy [1] and it was already deprecated and later dropped in
Fedora a while back [1,2]. We don't need to support obsolete stuff in systemd,
and the cache in systemd-resolved provides a better solution anyway.
We announced the plan to drop nscd in d44934f378.
[1] https://fedoraproject.org/wiki/Changes/DeprecateNSCD
[2] https://fedoraproject.org/wiki/Changes/RemoveNSCD
The option is kept as a stub without any effect to make the transition easier.
This adds a small, socket-activated Varlink daemon that can delegate UID
ranges for user namespaces to clients asking for it.
The primary call is AllocateUserRange() where the user passes in an
uninitialized userns fd, which is then set up.
There are other calls that allow assigning a mount fd to a userns
allocated that way, to set up permissions for a cgroup subtree, and to
allocate a veth for such a user namespace.
Since the UID assignments are supposed to be transient, i.e. not
permanent, care is taken to ensure that users cannot create inodes owned
by these UIDs, so that persistency cannot be acquired. This is
implemented via a BPF-LSM module that ensures that any member of a
userns allocated that way cannot create files unless the mount it
operates on is owned by the userns itself, or is explicitly
allowlisted.
BPF LSM program with contributions from Alexei Starovoitov.
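The policy enforced by the BPF-LSM module can be summarized as a small
decision function. This is a sketch of the rule as described above, not the
BPF program itself; the Mount type, the per-userns allowlist layout and all
names are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Set


@dataclass(frozen=True)
class Mount:
    mount_id: int
    owner_userns: int  # inode number of the userns owning the mount


def may_create_inode(caller_userns: int,
                     mount: Mount,
                     managed_userns: Set[int],
                     allowlist: Dict[int, Set[int]]) -> bool:
    """Decide whether a process may create a file on the given mount."""
    if caller_userns not in managed_userns:
        # The restriction only applies to namespaces allocated by the daemon.
        return True
    if mount.owner_userns == caller_userns:
        # Mounts owned by the userns itself are always fine.
        return True
    # Anything else must have been explicitly allowlisted for this userns.
    return mount.mount_id in allowlist.get(caller_userns, set())
```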
To make it easy to have a workable ssh-generator on various distros,
let's optionally generate the ssh privsep dir via tmpfiles.d/ drop-in.
This enables the concept with a path of /run/sshd/ as default. This is
the path Debian/Ubuntu uses, and means that we just work on those
distros. Debian/Ubuntu is the only distro (apparently?) that puts the
privsep dir under /run/, hence always needs the dir to be created
manually. Other distros don't need it that much, because they place the
dir in /usr/ (fedora, best choice!) or /var/ (others, not ideal, because
still mutable).
Also adds a longer explanation about this in NEWS, in the hope that
distro maintainers read that and maybe start cleaning this up.
Alternative to: #31543
Partially reverts commit b0d3095fd6.
While it is generally worthwhile for systemd to drop split-usr support,
these options are NOT about split-usr support. The universal location of
POSIX sh is always /bin/sh. Bash is pretty reasonably standardized there
too.
This happens irrespective of /bin being a symlink to /usr/bin.
Ramifications of this change include things like:
- portably running shell scripts that might run very nearly anywhere
- /etc/shells support
For standardization and compatibility reasons, these commands with these
paths need to be consistently found on any system, and thus distros make
sure this works, although even on split-usr systems /usr/bin/bash may be
a symlink to /bin/bash.
Embedding the *access path* of bash as /usr/bin/bash in systemd, for
example in libnss_systemd.so, means that login shells must agree with
systemd on how they invoke the shell. End result: users fail to login
because of access violations.
This cannot be fixed by "fixing PAM" because PAM does not follow
symlinks by design: one example is that it needs to treat rbash as
different from bash.
Fixes: https://bugs.gentoo.org/919749
Signed-off-by: Eli Schwartz <eschwartz93@gmail.com>
Most of our kernel cmdline options use underscores as word separators,
but there were some exceptions. Let's fix those, and use underscores
there too.
Since our /proc/cmdline parsers don't distinguish between the two
characters anyway this should not break anything, but makes sure our own
codebase (and in particular docs and log messages) is internally
consistent.
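To illustrate why the rename is safe, here is a simplified model of a parser
that treats "-" and "_" in option names as equivalent. It is not the actual
parser: quoting is ignored, and the helper names are hypothetical.

```python
from typing import Dict, Optional


def normalize_key(key: str) -> str:
    """Treat dashes and underscores in option names as the same character."""
    return key.replace("-", "_")


def parse_cmdline(cmdline: str) -> Dict[str, Optional[str]]:
    """Split a kernel command line into {normalized key: value or None}."""
    result: Dict[str, Optional[str]] = {}
    for word in cmdline.split():
        key, sep, value = word.partition("=")
        # Only the key is normalized; the value is kept verbatim.
        result[normalize_key(key)] = value if sep else None
    return result
```

Both spellings of an option end up under the same key, so an existing command
line that still uses dashes keeps working.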
Let's split off a new vcs-tag option from version-tag that configures whether
the current commit should be appended to the version tag. Doing this saves
us from having to fiddle around with generating git versions in packaging
specs and instead lets meson do it for us, even if we pass in a custom
version tag.
With this approach there's no more need for tools/meson-vcs-tag.sh so
we remove it.
This functionality relied on telinit being available in a different path
than the compat symlink shipped by systemd itself. This is no longer the
case for any known distro, so remove that code.
Fixes: #31220
Replaces: #31249
This adds a tiny binary that is hooked into SSH client config via
ProxyCommand and which simply connects to an AF_UNIX or AF_VSOCK socket
of choice.
The syntax is as simple as this:
ssh unix/some/path # (this connects to AF_UNIX socket /some/path)
or:
ssh vsock/4711
I used "/" as separator of the protocol ID and the value since ":" is
already taken by SSH itself when doing sftp. And "@" is already taken
for separating the user name.
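The host name syntax above can be parsed along these lines. This is a sketch
of the syntax only, with a hypothetical function name; it does not connect to
anything and is not the code of the binary.

```python
from typing import Tuple, Union


def parse_target(host: str) -> Tuple[str, Union[str, int]]:
    """Split 'unix/some/path' or 'vsock/4711' into (protocol, address)."""
    proto, sep, value = host.partition("/")
    if not sep or not value:
        raise ValueError(f"missing address in {host!r}")
    if proto == "unix":
        # The remainder is an absolute path: unix/some/path -> /some/path
        return ("unix", "/" + value)
    if proto == "vsock":
        # The remainder is the numeric CID of the peer.
        return ("vsock", int(value))
    raise ValueError(f"unknown protocol {proto!r}")
```

Since only the first "/" acts as the separator, socket paths with further
slashes pass through unchanged.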
sshd now supports config file drop-ins, hence let's install one to hook
up "userdb ssh-authorized-keys", so that things just work.
We put the drop-in relatively early, so that other drop-ins generally
will override this.
Ideally sshd would support such drop-ins in /usr/ rather than /etc/, but
let's take what we can get. It's not that sshd's upstream was
particularly open to weird ideas from Linux people.
This should also implicitly enable vmspawn in CI. It wasn't passing even the
basic tests, which we didn't see, because it needs to be explicitly enabled.
Also this renames 80-ethernet.network.example -> 89-ethernet.network.example,
to make it have lower precedence over other default .network files for
Ethernet interfaces.
Closes #29765.
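The effect of the rename follows from the fact that .network files are
considered in lexical order and the first one matching an interface wins.
A small sketch, assuming the example file has been activated by dropping the
".example" suffix and using a made-up competing file name:

```python
from typing import Iterable


def winning_file(matching_files: Iterable[str]) -> str:
    """Return the file that applies: the first match in lexical order."""
    return sorted(matching_files)[0]


# "85-hypothetical.network" is a made-up name standing in for any other
# default .network file that also matches the Ethernet interface.
before = winning_file(["80-ethernet.network", "85-hypothetical.network"])
after = winning_file(["89-ethernet.network", "85-hypothetical.network"])
```

Here "before" is the generic Ethernet file, while "after" is the other file,
which is the intended lower precedence.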