Following discussions with some kernel folks at All Systems Go! it
appears that file descriptors are not really as expensive as they used
to be (both memory and performance-wise) and it should thus be OK to allow
programs (including unprivileged ones) to have more of them without ill
effects.
Unfortunately we can't just raise the RLIMIT_NOFILE soft limit
globally for all processes, as select() and friends can't handle fds
>= 1024, and thus unexpecting programs might fail if they accidently get
an fd outside of that range. We can however raise the hard limit, so
that programs that need a lot of fds can opt-in into getting fds beyond
the 1024 boundary, simply by bumping the soft limit to the now higher
hard limit.
This is useful for all our client code that accesses the journal, as the
journal merging logic might need a lot of fds. Let's add a unified
function for bumping the limit in a robust way.
This simply adds a new constant we can use for bumping RLIMIT_NOFILE to
a "high" value. It default to 256K for now, which is pretty high, but
smaller than the kernel built-in limit of 1M.
Previously, some tools that needed a higher RLIMIT_NOFILE bumped it to
16K. This new define goes substantially higher than this, following the
discussion with the kernel folks.
This makes use of assert_cc() to guard against missing CASE macros,
instead of a manual implementation that might result in a static
variable to be allocated.
More importantly though this changes the base type for the array used to
determine the number of arguments for the compile time check from "int"
to "long double". This is done in order to avoid warnings from "ubsan"
that possibly large constants are assigned to small types. "long double"
hopefully isn't vulnerable to that.
Fixes: #10332
Mempool use is enabled or disabled based on the mempool_use_allowed symbol that
is linked in.
Should fix assert crashes in external programs caused by #9792.
Replaces #10286.
v2:
- use two different source files instead of a gcc constructor
linux/capability.h's CAP_TO_MASK potentially shifts a signed int "1"
(i.e. 32bit wide) left by 31 which means it becomes negative. That's
just weird, and ubsan complains about it. Let's introduce our own macro
CAP_TO_MASK_CORRECTED which doesn't fall into this trap, and make use of
it.
Fixes: #10347
As preparation for OCI support in nspawn, let's add a JSON parser.
The json.h file contains an explanation why this is new code instead of
just us linking against an existing JSON library.
Cgroup v2 provides the eBPF-based device controller, which isn't currently
supported by systemd. This commit aims to provide such support.
There are no user-visible changes, just the device policy and whitelist
start working if cgroup v2 is used.
Let's make sure the integers we parse out are not larger than USHRT_MAX.
This is a good idea as the kernel's TIOCSWINSZ ioctl for sizing
terminals can't take larger values, and we shouldn't risk an overflow.
Comes with tests.
Also add direct test for $SYSTEMD_PROC_CMDLINE.
In test-proc-cmdline, "true" was masquerading as PROC_CMDLINE_STRIP_RD_PREFIX,
fix that. Also, reorder functions to match call order.
Previously, together with kill_dots true, patch like
".", "./.", ".//.//" would all return an empty string.
That is wrong. There must be one "." left to reference
the current directory.
Also, the comment with examples was wrong.
Apparently FAT on some recent kernels can't do RENAME_NOREPLACE, and of
course cannot do linkat()/unlinkat() either (as the hard link concept
does not exist on FAT). Add a fallback using an explicit beforehand
faccessat() check. This sucks, but what we can do if the safe operations
are not available?
Fixes: #10063
LGTM was complaining:
> Multiplication result may overflow 'int' before it is converted to 'long'.
Fix this by changing all types to ssize_t and add a check for overflow
while at it.
v2: fix error in free_and_strndup()
When the orignal and copied message were the same, but shorter than specified
length l, memory read past the end of the buffer would be performed. A test
case is included: a string that had an embedded NUL ("q\0") is used to replace
"q".
v3: Fix one more bug in free_and_strndup and add tests.
v4: Some style fixed based on review, one more use of free_and_replace, and
make the tests more comprehensive.
Allows configuring the watchdog signal (with a default of SIGABRT).
This allows an alternative to SIGABRT when coredumps are not desirable.
Appropriate references to SIGABRT or aborting were renamed to reflect
more liberal watchdog signals.
Closes#8658