glibc does not provide clone() on ia64, only clone2. But only as a
symbol in the shared library, there's no prototype in the gblic
headers, so we have to define it, copied from the manpage.
This reenables epoll_pwait2() use, i.e. undoes the effect of
39f756d3ae.
Instead of just reverting that, this PR will change things so that we
strictly rely on glibc's new epoll_pwait2() wrapper (which was added
earlier this year), and drop our own manual fallback syscall wrapper.
That should nicely side-step any issues with correct syscall wrapping
definitions (which on some arch seem not to be easy, given the sigset_t
size final argument), by making this a glibc problem, not ours.
Given that the only benefit this delivers are time-outs more granular
than msec, it shouldn't really matter that we'll miss out on support
for this on systems with older glibcs.
Using fsopen()/fsconfig(), we can check if hidepid/subset are supported to
avoid the noisy logs from the kernel if they aren't supported. This works
on centos/redhat 8 as well since they've backported fsopen()/fsconfig().
glibc 2.30 (Aug 2019) added a wrapper for getdents64(). For older
versions let's define our own.
(This syscall exists since Linux 2.4, hence should be safe to use for
us)
Getting the numbers right for all architectures has proven to be a
constant chore. Let's autogenerate the header from the tables that
were imported in one of the previous commits.
Fixes#18074. (Hopefully. I cannot verify this on all architectures.)
To update the lists, or to update the header after template changes:
ninja -C build update-syscall-tables update-syscall-header
Note: the generated file is saved in git. Initially I wanted to only
store the tables in git, and generate the header during each build.
Generation is quick enough, but the header is used in many many
places (wherever missing_syscall.h is included, directly or indirectly),
which means that we would need to declare the dependency in meson, so
the header would be generated early enough. This turned out to be very
noisy. Storing the generated header in version control avoids the hassle.
For scripts, when we call fexecve(), on new kernels glibc calls execveat(),
which fails with ENOENT, and then we fall back to execve() which succeeds:
[pid 63039] execveat(3, "", ["/home/zbyszek/src/systemd/test/test-path-util/script.sh", "--version"], 0x7ffefa3633f0 /* 0 vars */, AT_EMPTY_PATH) = -1 ENOENT (No such file or directory)
[pid 63039] execve("/home/zbyszek/src/systemd/test/test-path-util/script.sh", ["/home/zbyszek/src/systemd/test/test-path-util/script.sh", "--version"], 0x7ffefa3633f0 /* 0 vars */) = 0
But on older kernels glibc (some versions?) implement a fallback which falls
into the same trap with bash $0:
[pid 13534] execve("/proc/self/fd/3", ["/home/test/systemd/test/test-path-util/script.sh", "--version"], 0x7fff84995870 /* 0 vars */) = 0
We don't want that, so let's call execveat() ourselves. Then we can do the
execve() fallback as we want.
Instead of defining the numbers only as fallback, always define our fallback
number, and if we have the real __NR_foo define, assert that our number matches
the real one.
This will result in warnings when our fallback number is not defined, even if
the kernel headers are new enough to define __NR_foo. This will probably annoy
people compiling for seldom-used architectures, but hopefully it'll provide
motivation to add the missing fallback defines.
The upside is that we have a higher chance of catching the cases where we got
the number wrong. Calling the wrong syscall is quite problematic, and with some
back luck, it might take us a long time to notice that we got the number wrong
on some rarely used architecture.
Also, rework some of the fallback wrappers to not call the syscall with a
negative number (that'd fail, but we'd got to the kernel and back). It seems
nicer to let the compiler know that this can never succeed.
The kernel finally has a proper API to determine the mnt_id of a file.
Let's use it.
This adds support for the STATX_MNT_ID field of statx(), added in
kernel 5.8.
This is not a new system call at all (since kernel 2.2), however it's
not exposed in glibc (a wrapper is exposed however in sigqueue(), but it
substantially simplifies the system call). Since we want a nice fallback
for sending signals on non-pidfd systems for pidfd_send_signal() let's
wrap rt_sigqueueinfo() since it takes the same siginfo_t parameter.
Add a comment line explaining that the syscall defines might be
defined to invalid negative numbers, as libseccomp redefines them
to negative numbers if not defined by the kernel headers, which is
not obvious just from reading the code checking for defined && > 0
The #ifndef check used to work for missing __NR_* syscall defines, but
unfortunately libseccomp now redefines missing syscall number to negative
numbers, in their public header file, e.g.:
https://github.com/seccomp/libseccomp/blob/master/include/seccomp.h.in#L801
When systemd is built, since it includes <seccomp.h>, it pulls in the
incorrect negative value for any __NR_* syscall define that's included in
the seccomp.h header (for those syscalls that the kernel headers don't
yet define, e.g. when built with older/stable-distro kernels). This leads
to bugs like:
https://bugs.launchpad.net/ubuntu/+source/systemd/+bug/1821625
This changes the check so that it can override the negative number that
libseccomp defines, instead of trying to use the negative syscall number.
To avoid gcc warnings (which are failures with meson --werror), this checks
without generating a redefinition gcc warning.
I have no idea why libseccomp decided to define missing syscalls
to negative numbers inside their *public* header file, causing
problems like this.
Make possible to set NUMA allocation policy for manager. Manager's
policy is by default inherited to all forked off processes. However, it
is possible to override the policy on per-service basis. Currently we
support, these policies: default, prefer, bind, interleave, local.
See man 2 set_mempolicy for details on each policy.
Overall NUMA policy actually consists of two parts. Policy itself and
bitmask representing NUMA nodes where is policy effective. Node mask can
be specified using related option, NUMAMask. Default mask can be
overwritten on per-service level.