From 860cc6df6db113697767ae7c3bc374f26820620c Mon Sep 17 00:00:00 2001 From: Lennart Poettering Date: Mon, 29 Oct 2018 19:55:27 +0100 Subject: [PATCH 1/7] man: document that "systemctl reset-failed" also reset the start limit counters Fixes: #10529 --- man/systemctl.xml | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/man/systemctl.xml b/man/systemctl.xml index e32ee6cacb..09cb03948a 100644 --- a/man/systemctl.xml +++ b/man/systemctl.xml @@ -1058,6 +1058,12 @@ Jan 12 10:46:45 example.com bluetoothd[8900]: gatt-time-server: Input/output err terminating abnormally or timing out), it will automatically enter the failed state and its exit code and status is recorded for introspection by the administrator until the service is stopped/re-started or reset with this command. + + In addition to resetting the failed state of a unit it also resets various other + per-unit properties: the start rate limit counter of all unit types is reset to zero, as is the restart + counter of service units. Thus, if a unit's start limit (as configured with + StartLimitIntervalSec=/StartLimitBurst=) is hit and the unit refuses + to be started again, use this command to make it startable again. From 53bd20ea065fdd881a9308ace9b2dc96bd0b1c8d Mon Sep 17 00:00:00 2001 From: Lennart Poettering Date: Mon, 29 Oct 2018 20:07:22 +0100 Subject: [PATCH 2/7] man: don't claim that AssertXYZ= expressions failing had an effect on unit state In the documentation for ConditionXYZ= we claimed that AssertXYZ= would have an effect on unit state (which is wrong), while at the documentation for AssertXYZ= we said it only has an effect on the job, but not the unit (which is right). Let's fix this contradiction, and only claim the latter. Also, fix a couple of other things (for example, stop talking about a "failure state", but let's just expressly call it "the 'failed' state", as that's the actual name of that state).
Finally, let's emphasize again when the conditions/assertions are executed, and that they hence are not useful to conditionalize deps. Fixes: #10433 --- man/systemd.unit.xml | 26 +++++++++++++++++--------- 1 file changed, 17 insertions(+), 9 deletions(-) diff --git a/man/systemd.unit.xml b/man/systemd.unit.xml index 467b905f14..ed7a91ecf2 100644 --- a/man/systemd.unit.xml +++ b/man/systemd.unit.xml @@ -990,12 +990,13 @@ Before starting a unit, verify that the specified condition is true. If it is not true, the starting of the unit will be (mostly silently) skipped, however all ordering dependencies of it are still - respected. A failing condition will not result in the unit being moved into a failure state. The condition is - checked at the time the queued start job is to be executed. Use condition expressions in order to silently skip - units that do not apply to the local running system, for example because the kernel or runtime environment - doesn't require its functionality. Use the various AssertArchitecture=, - AssertVirtualization=, … options for a similar mechanism that puts the unit in a failure - state and logs about the failed check (see below). + respected. A failing condition will not result in the unit being moved into the failed + state. The condition is checked at the time the queued start job is to be executed. Use condition expressions + in order to silently skip units that do not apply to the local running system, for example because the kernel + or runtime environment doesn't require their functionality. Use the various + AssertArchitecture=, AssertVirtualization=, … options for a similar + mechanism that causes the job to fail (instead of being skipped) and results in logging about the failed check + (instead of being silently processed). For details about assertion conditions see below. 
ConditionArchitecture= may be used to check whether the system is running on a specific @@ -1276,9 +1277,16 @@ Similar to the ConditionArchitecture=, ConditionVirtualization=, …, condition settings described above, these settings add assertion checks to the start-up of the unit. However, unlike the conditions settings, any assertion setting - that is not met results in failure of the start job (which means this is logged loudly). Use assertion - expressions for units that cannot operate when specific requirements are not met, and when this is something - the administrator or user should look into. + that is not met results in failure of the start job (which means this is logged loudly). Note that hitting a + configured assertion does not cause the unit to enter the failed state (or in fact result in + any state change of the unit), it affects only the job queued for it. Use assertion expressions for units that + cannot operate when specific requirements are not met, and when this is something the administrator or user + should look into. + + Note that neither assertion nor condition expressions result in unit state changes. Also note that both + are checked at the time the job is to be executed, i.e. long after the job itself and the jobs depending on it were + queued. Thus, neither condition nor assertion expressions are suitable for conditionalizing unit + dependencies. From 48e6dd376313c92db06558e061121af8205b55ca Mon Sep 17 00:00:00 2001 From: Lennart Poettering Date: Mon, 29 Oct 2018 20:20:37 +0100 Subject: [PATCH 3/7] man: document relationship of .socket units and network namespaces Fixes: #10018 --- man/systemd.socket.xml | 12 ++++++++++++ 1 file changed, 12 insertions(+) diff --git a/man/systemd.socket.xml b/man/systemd.socket.xml index 72807be7b6..fb51ef6658 100644 --- a/man/systemd.socket.xml +++ b/man/systemd.socket.xml @@ -94,6 +94,18 @@ socket passing (i.e. sockets passed in via standard input and output, using StandardInput=socket in the service file).
+ + All network sockets allocated through .socket units are allocated in the host's network + namespace (see network_namespaces(7)). This + does not mean however that the service activated by a configured socket unit has to be part of the host's network + namespace as well. It is supported and even good practice to run services in their own network namespace (for + example through PrivateNetwork=, see + systemd.exec(5)), receiving only + the sockets configured through socket-activation from the host's namespace. In such a set-up communication within + the host's network namespace is only permitted through the activation sockets passed in while all sockets allocated + from the service code itself will be associated with the service's own namespace, and thus possibly subject to a + much more restrictive configuration. From d287820dec4e6608348256642e991a89b0cc9007 Mon Sep 17 00:00:00 2001 From: Lennart Poettering Date: Mon, 29 Oct 2018 20:24:06 +0100 Subject: [PATCH 4/7] man: document that various sandboxing settings are not available in --user services This is brief and doesn't go into detail, but should at least indicate to those searching for it that some stuff is not available. Fixes: #9870 --- man/systemd.exec.xml | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/man/systemd.exec.xml b/man/systemd.exec.xml index 5c043497bb..d6f1427dcc 100644 --- a/man/systemd.exec.xml +++ b/man/systemd.exec.xml @@ -759,6 +759,11 @@ CapabilityBoundingSet=~CAP_B CAP_C RestrictRealtime= has no effect on systems that lack support for SECCOMP system call filtering, or in containers where support for this is turned off. + Also note that some sandboxing functionality is generally not available in user services (i.e. services run + by the per-user service manager). Specifically, the various settings requiring file system namespacing support + (such as ProtectSystem=) are not available, as the underlying kernel functionality is only + accessible to privileged processes.
+ From 0e18724eb1baf251c3fa66144ef721726a00cbe9 Mon Sep 17 00:00:00 2001 From: Lennart Poettering Date: Mon, 29 Oct 2018 20:45:04 +0100 Subject: [PATCH 5/7] man: emphasize the ReadOnlyPaths= mount propagation "hole" This changes the ProtectSystem= documentation to refer in more explicit words to the restrictions of ReadOnlyPaths=, as suggested in #9857. This also extends the paragraph in ReadOnlyPaths= that explains the hole. Fixes: #9857 --- man/systemd.exec.xml | 37 ++++++++++++++++++++++--------------- 1 file changed, 22 insertions(+), 15 deletions(-) diff --git a/man/systemd.exec.xml b/man/systemd.exec.xml index d6f1427dcc..3f0535726b 100644 --- a/man/systemd.exec.xml +++ b/man/systemd.exec.xml @@ -781,9 +781,9 @@ CapabilityBoundingSet=~CAP_B CAP_C recommended to enable this setting for all long-running services, unless they are involved with system updates or need to modify the operating system in other ways. If this option is used, ReadWritePaths= may be used to exclude specific directories from being made read-only. This - setting is implied if DynamicUser= is set. For this setting the same restrictions regarding - mount propagation and privileges apply as for ReadOnlyPaths= and related calls, see - below. Defaults to off. + setting is implied if DynamicUser= is set. This setting cannot ensure protection in all + cases. In general it has the same limitations as ReadOnlyPaths=, see below. Defaults to + off. @@ -802,11 +802,11 @@ CapabilityBoundingSet=~CAP_B CAP_C ReadOnlyPaths=, and tmpfs is mostly equivalent to TemporaryFileSystem=. - It is recommended to enable this setting for all long-running services (in particular network-facing ones), - to ensure they cannot get access to private user data, unless the services actually require access to the user's - private data. This setting is implied if DynamicUser= is set. For this setting the same - restrictions regarding mount propagation and privileges apply as for ReadOnlyPaths= and related - calls, see below.
+ It is recommended to enable this setting for all long-running services (in particular network-facing + ones), to ensure they cannot get access to private user data, unless the services actually require access to + the user's private data. This setting is implied if DynamicUser= is set. This setting cannot + ensure protection in all cases. In general it has the same limitations as ReadOnlyPaths=, + see below. @@ -974,8 +974,7 @@ StateDirectory=aaa/bbb ccc BindPaths=, or BindReadOnlyPaths= inside it. For a more flexible option, see TemporaryFileSystem=. - Note that restricting access with these options does not extend to submounts of a directory that are - created later on. Non-directory paths may be specified as well. These options may be specified more than once, + Non-directory paths may be specified as well. These options may be specified more than once, in which case all paths listed will have limited access from within the namespace. If the empty string is assigned to this option, the specific list is reset, and all prior assignments have no effect. @@ -987,11 +986,19 @@ StateDirectory=aaa/bbb ccc + on the same path make sure to specify - first, and + second. - Note that using this setting will disconnect propagation of mounts from the service to the host - (propagation in the opposite direction continues to work). This means that this setting may not be used for - services which shall be able to install mount points in the main mount namespace. Note that the effect of these - settings may be undone by privileged processes. In order to set up an effective sandboxed environment for a - unit it is thus recommended to combine these settings with either + Note that these settings will disconnect propagation of mounts from the unit's processes to the + host. This means that this setting may not be used for services which shall be able to install mount points in + the main mount namespace. 
For ReadWritePaths= and ReadOnlyPaths= + propagation in the other direction is not affected, i.e. mounts created on the host generally appear in the + unit processes' namespace, and mounts removed on the host disappear there too. In particular, note that + mount propagation from host to unit will result in unmodified mounts being created in the unit's namespace, + i.e. writable mounts appearing on the host will be writable in the unit's namespace too, even when propagated + below a path marked with ReadOnlyPaths=! Restricting access with these options hence does + not extend to submounts of a directory that are created later on. This means the lock-down offered by that + setting is not complete, and does not offer full protection. + + Note that the effect of these settings may be undone by privileged processes. In order to set up an + effective sandboxed environment for a unit it is thus recommended to combine these settings with either CapabilityBoundingSet=~CAP_SYS_ADMIN or SystemCallFilter=~@mount. From ff5bd14bb490134f5d73c8040c163dbc7d18448b Mon Sep 17 00:00:00 2001 From: Lennart Poettering Date: Mon, 29 Oct 2018 20:49:41 +0100 Subject: [PATCH 6/7] man: document that "list-dependencies --reverse" is pretty incomplete Fixes: #9681 --- man/systemctl.xml | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/man/systemctl.xml b/man/systemctl.xml index 09cb03948a..2be5a838e0 100644 --- a/man/systemctl.xml +++ b/man/systemctl.xml @@ -1091,6 +1091,10 @@ Jan 12 10:46:45 example.com bluetoothd[8900]: gatt-time-server: Input/output err , may be used to change what types of dependencies are shown. + + Note that this command only lists units currently loaded into memory by the service manager. In + particular, this command is not suitable to get a comprehensive list of all reverse dependencies on a + specific unit, as it won't list the dependencies declared by units currently not loaded.
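Reviewer note on patch 6/7: since "list-dependencies --reverse" only sees loaded units, one way to approximate the missing part is to scan unit files on disk for references to the unit in question. The following is only a rough sketch under stated assumptions, not something systemd ships: the function name units_referencing and the set of settings checked are illustrative choices, and the directory to scan is whatever the caller passes in. It looks only at the [Unit] section of plain files, and ignores drop-ins, .wants/ and .requires/ symlinks, [Install] settings, line continuations, template instances, and implicit dependencies.

```python
from pathlib import Path

# Settings in the [Unit] section whose values are space-separated unit names.
# This selection is an assumption made for illustration; it is not exhaustive.
DEPENDENCY_KEYS = {
    "Requires", "Requisite", "Wants", "BindsTo",
    "PartOf", "Before", "After", "Conflicts",
}


def units_referencing(target, unit_dir):
    """Map each unit file name in unit_dir to the sorted list of [Unit]
    dependency settings that name `target`. Files without a match are omitted."""
    hits = {}
    for path in sorted(Path(unit_dir).iterdir()):
        if not path.is_file():
            continue
        section = None
        found = set()
        text = path.read_text(encoding="utf-8", errors="replace")
        for raw in text.splitlines():
            line = raw.strip()
            # Skip blank lines and comments.
            if not line or line[0] in "#;":
                continue
            # Track which section we are in.
            if line.startswith("[") and line.endswith("]"):
                section = line[1:-1]
                continue
            if section != "Unit" or "=" not in line:
                continue
            key, _, value = line.partition("=")
            key = key.strip()
            # Match whole unit names only, not substrings.
            if key in DEPENDENCY_KEYS and target in value.split():
                found.add(key)
        if found:
            hits[path.name] = sorted(found)
    return hits
```

A caller would pass a unit name and one unit directory per call, and compare the result with the output of "systemctl list-dependencies --reverse" for the same unit to spot units that declare a dependency but are not currently loaded.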
From e5b62c9bf187d05b2bd28ff73e4db63649e00467 Mon Sep 17 00:00:00 2001 From: Lennart Poettering Date: Mon, 29 Oct 2018 21:09:57 +0100 Subject: [PATCH 7/7] man: document what "in-memory" units means Fixes: #10338 --- man/systemd.xml | 19 +++++++++++++++++++ 1 file changed, 19 insertions(+) diff --git a/man/systemd.xml b/man/systemd.xml index 77aade8158..b166e534a5 100644 --- a/man/systemd.xml +++ b/man/systemd.xml @@ -392,6 +392,25 @@ systemd.special7 for details about these target units. + systemd only keeps a minimal set of units loaded into memory. Specifically, the only units that are kept + loaded into memory are those for which at least one of the following conditions is true: + + + It is in an active, activating, deactivating or failed state (i.e. in any unit state except for dead) + It has a job queued for it + It is a dependency of some sort of at least one other unit that is loaded into memory + It has some form of resource still allocated (e.g. a service unit that is inactive but for which + a process is still lingering that ignored the request to be terminated) + It has been pinned into memory programmatically by a D-Bus call + + + systemd will automatically and implicitly load units from disk — if they are not loaded yet — as soon as + operations are requested for them. Thus, in many respects, the fact whether a unit is loaded or not is invisible to + clients. Use systemctl list-units --all to comprehensively list all units currently loaded. Any + unit for which none of the conditions above applies is promptly unloaded. Note that when a unit is unloaded from + memory its accounting data is flushed out too. However, this data is generally not lost, as a journal log record + is generated declaring the consumed resources whenever a unit shuts down. + Processes systemd spawns are placed in individual Linux control groups named after the unit which they belong to in the private systemd hierarchy. (see