Currently `systemd-sysctl` binary is used in `systemd-sysctl.service`
which is mostly configured as `oneshot`. There are situations where one
would like to use systemd to maintain Sysctl configurations on a host,
using a configuration managers such as Chef or Puppet, by apply
configurations every X duration.
The problem with using `systemd-sysctl` is that it writes all the Sysctl
settings, even if the values for those settings have not changed. From
experience, we have observed that some Sysctl settings cause actions in
the kernel upon writing(like dropping caches) which in turn cause
undesired side effects.
This patch tries to minimize such side effects by comparing values
before writing.
Also,
- drop unnecessary +1 from buffer size, as IF_NAMESIZE or IFNAMSIZ
includes the nul at the end.
- format_ifname() does not update buffer on failure,
- introduces format_ifname_alloc(), FORMAT_IFNAME(), and their friends.
Currently there does not exist a way to specify a path relative to which
all binaries executed by Exec should be found. The only way is to
specify the absolute path.
This change implements the functionality to specify a path relative to which
binaries executed by Exec*= can be found.
Closes#6308
loadavg.h is an internal header of the Linux source repository, and as
such it is licensed as GPLv2-only, without syscall exception.
We use it only for 4 macros, which are simply doing some math calculations
that cannot thus be subject to copyright.
Reimplement the same calculations in another internal header and delete
loadavg.h from our tree.
As far as I can see, we use this to get a list of ARPHRD_* defines (used in
particular for Type= in .link files). If we drop our copy, and build against
old kernel headers, the user will have a shorter list of types available. This
seems OK, and I don't think it's worth carrying our own version of this file
just to have newest possible entries.
7c5b9952c4 recently updated this file, but we'd
have to update it every time the kernel adds new entries. But if we look at
the failure carefully:
src/basic/arphrd-from-name.gperf:65:16: error: ‘ARPHRD_MCTP’ undeclared (first use in this function); did you mean ‘ARPHRD_FCPP’?
65 | MCTP, ARPHRD_MCTP
| ^~
| ARPHRD_FCPP
we see that the list we were generating was from the system headers, so it was
only as good as the system headers anyway, without the newer entries in our
bundled copy, if there were any. So let's make things simpler by always using
system headers.
And if somebody wants to fix things so that we always have the newest list,
then we should just generate and store the converted list, not the full header.
Compared to PID1 where systemd-oomd has to be the client to PID1
because PID1 is a more privileged process than systemd-oomd, systemd-oomd
is the more privileged process compared to a user manager so we have
user managers be the client whereas systemd-oomd is now the server.
The same varlink protocol is used between user managers and systemd-oomd
to deliver ManagedOOM property updates. systemd-oomd now sets up a varlink
server that user managers connect to to send ManagedOOM property updates.
We also add extra validation to make sure that non-root senders don't
send updates for cgroups they don't own.
The integration test was extended to repeat the chill/bloat test using
a user manager instead of PID1.
We mishandled the case where the size we read from the file actually
matched the maximum size fully. In that case we cannot really make a
determination whether the file was fully read or only partially. In that
case let's do another loop, so that we operate with a buffer, and
we can detect the EOF (which will be signalled to us via a short read).
There's a very gradual increase of anonymous memory in systemd-journald that
blames to 2ac67221bb.
systemd-journald makes many calls to read /proc/PID/cmdline and
/proc/PID/status, both of which tend to be well under 4K. However the
combination of allocating 4M read buffers, then using `realloc()` to
shrink the buffer in `read_virtual_file()` appears to be creating
fragmentation in the heap (when combined with the other allocations
systemd-journald is doing).
To help mitigate this, try reading /proc with a 4K buffer as
`read_virtual_file()` did before 2ac67221bb.
If it isn't big enough then try again with the larger buffers.
Currently `nulstr_contains` returns a boolean, making it difficult to
identify which of the input strings matches the "needle".
Adding a new `nulstr_get()` function, returning a const pointer to the
matching string, eases this process and allows us to directly re-use the
result of a call to this function without additional processing or
memory allocation.
Signed-off-by: Arnaud Ferraris <arnaud.ferraris@collabora.com>
let's do what we did for sysctl_write()/sysctl_write_ip_property() also
for the read paths: i.e. make one a wrapper of the other, and add more
careful input validation.
Let's add similar path validation to sysctl_read() as we already have in
sysctl_write().
Let's also drop the trailing newline from the returned string, like
sysctl_read_ip_property() already does it.
(I checked all users of this, they don't care)
It does the same stuff, let's use the same codepaths as much as we can.
And while we are at it, let's generate good error codes in case we are
called with unsupported parameters/let's validate stuff more that might
originate from user input.
The sysctl_write_ip_property() call already uses write_string_file(), so
let's do so here, too, to make the codepaths more uniform.
While we are at it, let's also validate the passed path a bit, since we
shouldn't allow sysctls with /../ or such in the name. Hence simplify
the path first, and then check if it is normalized, and refuse if not.
When reading virtual files (i.e. procfs, sysfs, …) we currently put a
limit of 4M-1 on that. We have to pick something, and we have to read
these files in a single read() (since the kernel generally doesn't
support continuation read()s for them). 4M-1 is actually the maximum
size the kernel allows for reads from files in /proc/sys/, all larger
reads will result in an ENOMEM error (which is really weird, but the
kernel does what the kernel does). Hence 4M-1 sounds like a smart
choice.
However, we made one mistake here: in order to be able to detect EOFs
properly we actually read one byte more than we actually intend to
return: if that extra byte can be read, then we know the file is
actually larger than our limit and we can generate an EFBIG error from
that. However, if it cannot be read then we know EOF was hit, and we are
good. So ultimately after all we issued a single 4M read, which the
kernel then responds with ENOMEM to. And that means read_virtual_file()
actually doesn't work properly right now on /proc/sys/. Let's fix that.
The fix is simple, lower the limit of the the buffer we intend to return
by one, i.e. 4M-2. That way, the read() we'll issue is exactly as large
as the limit the kernel allows, and we still get safely detect EOF from
it.
LLVM 13 introduced `-Wunused-but-set-variable` diagnostic flag, which
trips over some intentionally set-but-not-used variables or variables
attached to cleanup handlers with side effects (`_cleanup_umask_`,
`_cleanup_(notify_on_cleanup)`, `_cleanup_(restore_sigsetp)`, etc.):
```
../src/basic/process-util.c:1257:46: error: variable 'saved_ssp' set but not used [-Werror,-Wunused-but-set-variable]
_cleanup_(restore_sigsetp) sigset_t *saved_ssp = NULL;
^
1 error generated.
```
The current detection code relies on /sys/firmware/dmi/entries/0-0/raw
to disambiguate Amazon EC2 virtualized from metal instances.
Unfortunately this file is root only. Thus on a c6g.metal instance
(aarch64), we observe something like this:
$ systemd-detect-virt
amazon
$ sudo systemd-detect-virt
none
Only the latter is correct.
The right long term fix is to extend the kernel to expose the SMBIOS BIOS
Characteristics properly via /sys/class/dmi, but until this happens (and
for backwards compatibility when it does), we need a plan B.
This change implements such a workaround by falling back to using the
instance type from DMI and looking at the ".metal" string present on
metal instances.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>