summaryrefslogtreecommitdiff
Commit message (Collapse)AuthorAge
...
* | | cgroup: fix compile warningFelipe Balbi2022-04-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 4f41fc59620f ("cgroup, kernfs: make mountinfo show properly scoped path for cgroup namespaces") added the following compile warning: kernel/cgroup.c: In function ‘cgroup_show_path’: kernel/cgroup.c:1634:15: warning: unused variable ‘ret’ [-Wunused-variable] int len = 0, ret = 0; ^ fix it. Fixes: 4f41fc59620f ("cgroup, kernfs: make mountinfo show properly scoped path for cgroup namespaces") Signed-off-by: Felipe Balbi <felipe.balbi@linux.intel.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
* | | cgroup, kernfs: make mountinfo show properly scoped path for cgroup namespacesSerge E. Hallyn2022-04-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Patch summary: When showing a cgroupfs entry in mountinfo, show the path of the mount root dentry relative to the reader's cgroup namespace root. Short explanation (courtesy of mkerrisk): If we create a new cgroup namespace, then we want both /proc/self/cgroup and /proc/self/mountinfo to show cgroup paths that are correctly virtualized with respect to the cgroup mount point. Previous to this patch, /proc/self/cgroup shows the right info, but /proc/self/mountinfo does not. Long version: When a uid 0 task which is in freezer cgroup /a/b, unshares a new cgroup namespace, and then mounts a new instance of the freezer cgroup, the new mount will be rooted at /a/b. The root dentry field of the mountinfo entry will show '/a/b'. cat > /tmp/do1 << EOF mount -t cgroup -o freezer freezer /mnt grep freezer /proc/self/mountinfo EOF unshare -Gm bash /tmp/do1 > 330 160 0:34 / /sys/fs/cgroup/freezer rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,freezer > 355 133 0:34 /a/b /mnt rw,relatime - cgroup freezer rw,freezer The task's freezer cgroup entry in /proc/self/cgroup will simply show '/': grep freezer /proc/self/cgroup 9:freezer:/ If instead the same task simply bind mounts the /a/b cgroup directory, the resulting mountinfo entry will again show /a/b for the dentry root. However in this case the task will find its own cgroup at /mnt/a/b, not at /mnt: mount --bind /sys/fs/cgroup/freezer/a/b /mnt 130 25 0:34 /a/b /mnt rw,nosuid,nodev,noexec,relatime shared:21 - cgroup cgroup rw,freezer In other words, there is no way for the task to know, based on what is in mountinfo, which cgroup directory is its own. Example (by mkerrisk): First, a little script to save some typing and verbiage: echo -e "\t/proc/self/cgroup:\t$(cat /proc/self/cgroup | grep freezer)" cat /proc/self/mountinfo | grep freezer | awk '{print "\tmountinfo:\t\t" $4 "\t" $5}' Create cgroup, place this shell into the cgroup, and look at the state of the /proc files: 2653 2653 # Our shell 14254 # cat(1) /proc/self/cgroup: 10:freezer:/a/b mountinfo: / /sys/fs/cgroup/freezer Create a shell in new cgroup and mount namespaces. The act of creating a new cgroup namespace causes the process's current cgroups directories to become its cgroup root directories. (Here, I'm using my own version of the "unshare" utility, which takes the same options as the util-linux version): Look at the state of the /proc files: /proc/self/cgroup: 10:freezer:/ mountinfo: / /sys/fs/cgroup/freezer The third entry in /proc/self/cgroup (the pathname of the cgroup inside the hierarchy) is correctly virtualized w.r.t. the cgroup namespace, which is rooted at /a/b in the outer namespace. However, the info in /proc/self/mountinfo is not for this cgroup namespace, since we are seeing a duplicate of the mount from the old mount namespace, and the info there does not correspond to the new cgroup namespace. However, trying to create a new mount still doesn't show us the right information in mountinfo: # propagating to other mountns /proc/self/cgroup: 7:freezer:/ mountinfo: /a/b /mnt/freezer The act of creating a new cgroup namespace caused the process's current freezer directory, "/a/b", to become its cgroup freezer root directory. In other words, the pathname directory of the directory within the newly mounted cgroup filesystem should be "/", but mountinfo wrongly shows us "/a/b". The consequence of this is that the process in the cgroup namespace cannot correctly construct the pathname of its cgroup root directory from the information in /proc/PID/mountinfo. With this patch, the dentry root field in mountinfo is shown relative to the reader's cgroup namespace. So the same steps as above: /proc/self/cgroup: 10:freezer:/a/b mountinfo: / /sys/fs/cgroup/freezer /proc/self/cgroup: 10:freezer:/ mountinfo: /../.. /sys/fs/cgroup/freezer /proc/self/cgroup: 10:freezer:/ mountinfo: / /mnt/freezer cgroup.clone_children freezer.parent_freezing freezer.state tasks cgroup.procs freezer.self_freezing notify_on_release 3164 2653 # First shell that placed in this cgroup 3164 # Shell started by 'unshare' 14197 # cat(1) Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com> Tested-by: Michael Kerrisk <mtk.manpages@gmail.com> Acked-by: Michael Kerrisk <mtk.manpages@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
* | | cgroup: implement cgroup_subsys->implicit_on_dflTejun Heo2022-04-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Some controllers, perf_event for now and possibly freezer in the future, don't really make sense to control explicitly through "cgroup.subtree_control". For example, the primary role of perf_event is identifying the cgroups of tasks; however, because the controller also keeps a small amount of state per cgroup, it can't be replaced with simple cgroup membership tests. This patch implements cgroup_subsys->implicit_on_dfl flag. When set, the controller is implicitly enabled on all cgroups on the v2 hierarchy so that utility type controllers such as perf_event can be enabled and function transparently. An implicit controller doesn't show up in "cgroup.controllers" or "cgroup.subtree_control", is exempt from no internal process rule and can be stolen from the default hierarchy even if there are non-root csses. v2: Reimplemented on top of the recent updates to css handling and subsystem rebinding. Rebinding implicit subsystems is now a simple matter of exempting it from the busy subsystem check. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
* | | cgroup: use css_set->mg_dst_cgrp for the migration target cgroupTejun Heo2022-04-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Migration can be multi-target on the default hierarchy when a controller is enabled - processes belonging to each child cgroup have to be moved to the child cgroup itself to refresh css association. This isn't a problem for cgroup_migrate_add_src() as each source css_set still maps to single source and target cgroups; however, cgroup_migrate_prepare_dst() is called once after all source css_sets are added and thus might not have a single destination cgroup. This is currently worked around by specifying NULL for @dst_cgrp and using the source's default cgroup as destination as the only multi-target migration in use is self-targetting. While this works, it's subtle and clunky. As all taget cgroups are already specified while preparing the source css_sets, this clunkiness can easily be removed by recording the target cgroup in each source css_set. This patch adds css_set->mg_dst_cgrp which is recorded on cgroup_migrate_src() and used by cgroup_migrate_prepare_dst(). This also makes migration code ready for arbitrary multi-target migration. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
* | | cgroup: make cgroup[_taskset]_migrate() take cgroup_root instead of cgroupTejun Heo2022-04-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | On the default hierarchy, a migration can be multi-source and/or multi-destination. cgroup_taskest_migrate() used to incorrectly assume single destination cgroup but the bug has been fixed by 1f7dd3e5a6e4 ("cgroup: fix handling of multi-destination migration from subtree_control enabling"). Since the commit, @dst_cgrp to cgroup[_taskset]_migrate() is only used to determine which subsystems are affected or which cgroup_root the migration is taking place in. As such, @dst_cgrp is misleading. This patch replaces @dst_cgrp with @root. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
* | | cgroup: move migration destination verification out of ↵Tejun Heo2022-04-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | cgroup_migrate_prepare_dst() cgroup_migrate_prepare_dst() verifies whether the destination cgroup is allowable; however, the test doesn't really belong there. It's too deep and common in the stack and as a result the test itself is gated by another test. Separate the test out into cgroup_may_migrate_to() and update cgroup_attach_task() and cgroup_transfer_tasks() to perform the test directly. This doesn't cause any behavior differences. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
* | | cgroup: fix incorrect destination cgroup in cgroup_update_dfl_csses()Tejun Heo2022-04-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | cgroup_update_dfl_csses() should move each task in the subtree to self; however, it was incorrectly calling cgroup_migrate_add_src() with the root of the subtree as @dst_cgrp. Fortunately, cgroup_migrate_add_src() currently uses @dst_cgrp only to determine the hierarchy and the bug doesn't cause any actual breakages. Fix it. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
* | | cgroup: update css iteration in cgroup_update_dfl_csses()Tejun Heo2022-04-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The existing sequences of operations ensure that the offlining csses are drained before cgroup_update_dfl_csses(), so even though cgroup_update_dfl_csses() uses css_for_each_descendant_pre() to walk the target cgroups, it doesn't end up operating on dead cgroups. Also, the function explicitly excludes the subtree root from operation. This is fragile and inconsistent with the rest of css update operations. This patch updates cgroup_update_dfl_csses() to use cgroup_for_each_live_descendant_pre() instead and include the subtree root. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Zefan Li <lizefan@huawei.com> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
* | | cgroup: allocate 2x cgrp_cset_links when setting up a new rootTejun Heo2022-04-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | During prep, cgroup_setup_root() allocates cgrp_cset_links matching the number of existing css_sets to later link the new root. This is fine for now as the only operation which can happen inbetween is rebind_subsystems() and rebinding of empty subsystems doesn't create new css_sets. However, while not yet allowed, with the recent reimplementation, rebind_subsystems() can rebind subsystems with descendant csses and thus can create new css_sets. This patch makes cgroup_setup_root() allocate 2x of the existing css_sets so that later use of live subsystem rebinding doesn't blow up. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Zefan Li <lizefan@huawei.com> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
* | | cgroup: make cgroup_calc_subtree_ss_mask() take @this_ss_maskTejun Heo2022-04-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | cgroup_calc_subtree_ss_mask() currently takes @cgrp and @subtree_control. @cgrp is used for two purposes - to decide whether it's for default hierarchy and the mask of available subsystems. The former doesn't matter as the results are the same regardless. The latter can be specified directly through a subsystem mask. This patch makes cgroup_calc_subtree_ss_mask() perform the same calculations for both default and legacy hierarchies and take @this_ss_mask for available subsystems. @cgrp is no longer used and dropped. This is to allow using the function in contexts where available controllers can't be decided from the cgroup. v2: cgroup_refres_subtree_ss_mask() is removed by a previous patch. Updated accordingly. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Zefan Li <lizefan@huawei.com> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
* | | cgroup: reimplement rebind_subsystems() using cgroup_apply_control() and friendsTejun Heo2022-04-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | rebind_subsystem() open codes quite a bit of css and interface file manipulations. It tries to be fail-safe but doesn't quite achieve it. It can be greatly simplified by using the new css management helpers. This patch reimplements rebind_subsytsems() using cgroup_apply_control() and friends. * The half-baked rollback on file creation failure is dropped. It is an extremely cold path, failure isn't critical, and, aside from kernel bugs, the only reason it can fail is memory allocation failure which pretty much doesn't happen for small allocations. * As cgroup_apply_control_disable() is now used to clean up root cgroup on rebind, make sure that it doesn't end up killing root csses. * All callers of rebind_subsystems() are updated to use cgroup_lock_and_drain_offline() as the apply_control functions require drained subtree. * This leaves cgroup_refresh_subtree_ss_mask() without any user. Removed. * css_populate_dir() and css_clear_dir() no longer needs @cgrp_override parameter. Dropped. * While at it, add WARN_ON() to rebind_subsystem() calls which are expected to always succeed just in case. While the rules visible to userland aren't changed, this reimplementation not only simplifies rebind_subsystems() but also allows it to disable and enable csses recursively. This can be used to implement more flexible rebinding. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Zefan Li <lizefan@huawei.com> Signed-off-by: Chatur27 <jasonbright2709@gmail.com> Change-Id: I23d1a815cc4c412e83331aebd323f111d1046dd0
* | | cgroup: use cgroup_apply_enable_control() in cgroup creation pathTejun Heo2022-04-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | cgroup_create() manually updates control masks and creates child csses which cgroup_mkdir() then manually populates. Both can be simplified by using cgroup_apply_enable_control() and friends. The only catch is that it calls css_populate_dir() with NULL cgroup->kn during cgroup_create(). This is worked around by making the function noop on NULL kn. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Zefan Li <lizefan@huawei.com> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
* | | cgroup: combine cgroup_mutex locking and offline css drainingTejun Heo2022-04-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | cgroup_drain_offline() is used to wait for csses being offlined to uninstall itself from cgroup->subsys[] array so that new csses can be installed. The function's only user, cgroup_subtree_control_write(), calls it after performing some checks and restarts the whole process via restart_syscall() if draining has to release cgroup_mutex to wait. This can be simplified by draining before other synchronized operations so that there's nothing to restart. This patch converts cgroup_drain_offline() to cgroup_lock_and_drain_offline() which performs both locking and draining and updates cgroup_kn_lock_live() use it instead of cgroup_mutex() if requested. This combined locking and draining operations are easier to use and less error-prone. While at it, add WARNs in control_apply functions which triggers if the subtree isn't properly drained. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Zefan Li <lizefan@huawei.com> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
* | | cgroup: factor out cgroup_{apply|finalize}_control() from ↵Tejun Heo2022-04-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | cgroup_subtree_control_write() Factor out cgroup_{apply|finalize}_control() so that control mask update can be done in several simple steps. This patch doesn't introduce behavior changes. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Zefan Li <lizefan@huawei.com> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
* | | cgroup: introduce cgroup_{save|propagate|restore}_control()Tejun Heo2022-04-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | While controllers are being enabled and disabled in cgroup_subtree_control_write(), the original subsystem masks are stashed in local variables so that they can be restored if the operation fails in the middle. This patch adds dedicated fields to struct cgroup to be used instead of the local variables and implements functions to stash the current values, propagate the changes and restore them recursively. Combined with the previous changes, this makes subsystem management operations fully recursive and modularlized. This will be used to expand cgroup core functionalities. While at it, remove now unused @css_enable and @css_disable from cgroup_subtree_control_write(). Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Zefan Li <lizefan@huawei.com> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
* | | cgroup: make cgroup_drain_offline() and ↵Tejun Heo2022-04-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | cgroup_apply_control_{disable|enable}() recursive The three factored out css management operations - cgroup_drain_offline() and cgroup_apply_control_{disable|enable}() - only depend on the current state of the target cgroups and idempotent and thus can be easily made to operate on the subtree instead of the immediate children. This patch introduces the iterators which walk live subtree and converts the three functions to operate on the subtree including self instead of the children. While this leads to spurious walking and be slightly more expensive, it will allow them to be used for wider scope of operations. Note that cgroup_drain_offline() now tests for whether a css is dying before trying to drain it. This is to avoid trying to drain live csses as there can be mix of live and dying csses in a subtree unlike children of the same parent. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Zefan Li <lizefan@huawei.com> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
* | | cgroup: factor out cgroup_apply_control_enable() from ↵Tejun Heo2022-04-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | cgroup_subtree_control_write() Factor out css enabling and showing into cgroup_apply_control_enable(). * Nest subsystem walk inside child walk. The child walk will later be converted to subtree walk which is a bit more expensive. * Instead of operating on the differential masks @css_enable, simply enable or show csses according to the current cgroup_control() and cgroup_ss_mask(). This leads to the same result and is simpler and more robust. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Zefan Li <lizefan@huawei.com> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
* | | cgroup: factor out cgroup_apply_control_disable() from ↵Tejun Heo2022-04-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | cgroup_subtree_control_write() Factor out css disabling and hiding into cgroup_apply_control_disable(). * Nest subsystem walk inside child walk. The child walk will later be converted to subtree walk which is a bit more expensive. * Instead of operating on the differential masks @css_enable and @css_disable, simply disable or hide csses according to the current cgroup_control() and cgroup_ss_mask(). This leads to the same result and is simpler and more robust. * This allows error handling path to share the same code. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Zefan Li <lizefan@huawei.com> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
* | | cgroup: factor out cgroup_drain_offline() from cgroup_subtree_control_write()Tejun Heo2022-04-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Factor out async css offline draining into cgroup_drain_offline(). * Nest subsystem walk inside child walk. The child walk will later be converted to subtree walk which is a bit more expensive. * Relocate the draining above subsystem mask preparation, which doesn't create any behavior differences but helps further refactoring. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Zefan Li <lizefan@huawei.com> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
* | | cgroup: introduce cgroup_control() and cgroup_ss_mask()Tejun Heo2022-04-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When a controller is enabled and visible on a non-root cgroup is determined by subtree_control and subtree_ss_mask of the parent cgroup. For a root cgroup, by the type of the hierarchy and which controllers are attached to it. Deciding the above on each usage is fragile and unnecessarily complicates the users. This patch introduces cgroup_control() and cgroup_ss_mask() which calculate and return the [visibly] enabled subsyste mask for the specified cgroup and conver the existing usages. * cgroup_e_css() is restructured for simplicity. * cgroup_calc_subtree_ss_mask() and cgroup_subtree_control_write() no longer need to distinguish root and non-root cases. * With cgroup_control(), cgroup_controllers_show() can now handle both root and non-root cases. cgroup_root_controllers_show() is removed. v2: cgroup_control() updated to yield the correct result on v1 hierarchies too. cgroup_subtree_control_write() converted. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Zefan Li <lizefan@huawei.com> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
* | | cgroup: factor out cgroup_create() out of cgroup_mkdir()Tejun Heo2022-04-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We're in the process of refactoring cgroup and css management paths to separate them out to eventually allow cgroups which aren't visible through cgroup fs. This patch factors out cgroup_create() out of cgroup_mkdir(). cgroup_create() contains all internal object creation and initialization. cgroup_mkdir() uses cgroup_create() to create the internal cgroup and adds interface directory and file creation. This patch doesn't cause any behavior differences. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Zefan Li <lizefan@huawei.com> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
* | | cgroup: reorder operations in cgroup_mkdir()Tejun Heo2022-04-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, operations to initialize internal objects and create interface directory and files are intermixed in cgroup_mkdir(). We're in the process of refactoring cgroup and css management paths to separate them out to eventually allow cgroups which aren't visible through cgroup fs. This patch reorders operations inside cgroup_mkdir() so that interface directory and file handling comes after internal object initialization. This will enable further refactoring. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Zefan Li <lizefan@huawei.com> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
* | | cgroup: explicitly track whether a cgroup_subsys_state is visible to userlandTejun Heo2022-04-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, whether a css (cgroup_subsys_state) has its interface files created is not tracked and assumed to change together with the owning cgroup's lifecycle. cgroup directory and interface creation is being separated out from internal object creation to help refactoring and eventually allow cgroups which are not visible through cgroupfs. This patch adds CSS_VISIBLE to track whether a css has its interface files created and perform management operations only when necessary which helps decoupling interface file handling from internal object lifecycle. After this patch, all css interface file management functions can be called regardless of the current state and will achieve the expected result. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Zefan Li <lizefan@huawei.com> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
* | | cgroup: separate out interface file creation from css creationTejun Heo2022-04-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, interface files are created when a css is created depending on whether @visible is set. This patch separates out the two into separate steps to help code refactoring and eventually allow cgroups which aren't visible through cgroup fs. Move css_populate_dir() out of create_css() and drop @visible. While at it, rename the function to css_create() for consistency. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Zefan Li <lizefan@huawei.com> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
* | | cgroup: suppress spurious de-populated eventsTejun Heo2022-04-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | During task migration, tasks may transfer between two css_sets which are associated with the same cgroup. If those tasks are the only tasks in the cgroup, this currently triggers a spurious de-populated event on the cgroup. Fix it by bumping up populated count before bumping it down during migration to ensure that it doesn't reach zero spuriously. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Zefan Li <lizefan@huawei.com> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
* | | cgroup: re-hash init_css_set after subsystems are initializedTejun Heo2022-04-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | css_sets are hashed by their subsys[] contents and in cgroup_init() init_css_set is hashed early, before subsystem inits, when all entries in its subsys[] are NULL, so that cgroup_dfl_root initialization can find and link to it. As subsystems are initialized, init_css_set.subsys[] is filled up but the hashing is never updated making init_css_set hashed in the wrong place. While incorrect, this doesn't cause a critical failure as css_set management code would create an identical css_set dynamically. Fix it by rehashing init_css_set after subsystems are initialized. While at it, drop unnecessary @key local variable. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Zefan Li <lizefan@huawei.com> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
* | | cgroup: reset css on destructionVladimir Davydov2022-04-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | An associated css can be around for quite a while after a cgroup directory has been removed. In general, it makes sense to reset it to defaults so as not to worry about any remnants. For instance, memory cgroup needs to reset memory.low, otherwise pages charged to a dead cgroup might never get reclaimed. There's ->css_reset callback, which would fit perfectly for the purpose. Currently, it's only called when a subsystem is disabled in the unified hierarchy and there are other subsystems dependant on it. Let's call it on css destruction as well. Suggested-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
* | | cgroup: fix and restructure error handling in copy_cgroup_ns()Tejun Heo2022-04-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | copy_cgroup_ns()'s error handling was broken and the attempt to fix it d22025570e2e ("cgroup: fix alloc_cgroup_ns() error handling in copy_cgroup_ns()") was broken too in that it ended up trying an ERR_PTR() value. There's only one place where copy_cgroup_ns() needs to perform cleanup after failure. Simplify and fix the error handling by removing the goto's. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
* | | cgroup: fix a mistake in warning messageXiubo Li2022-04-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There is a mistake about the print format name:id <--> %d:%s, which the name is 'char *' type and id is 'int' type. Change "name:id" to "id:name" instead to be consistent with "cgroup_subsys %d:%s". Signed-off-by: Xiubo Li <lixiubo@cmss.chinamobile.com> Acked-by: Zefan Li <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
* | | cgroup: use ->subtree_control when testing no internal process ruleTejun Heo2022-04-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | No internal process rule is enforced by cgroup_migrate_prepare_dst() during process migration. It tests whether the target cgroup's ->child_subsys_mask is zero which is different from "subtree_control" write path which tests ->subtree_control. This hasn't mattered because up until now, both ->child_subsys_mask and ->subtree_control are zero or non-zero at the same time. However, with the planned addition of implicit controllers, this will no longer be true. This patch prepares for the change by making cgorup_migrate_prepare_dst() test ->subtree_control instead. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
* | | cgroup: make css_tryget_online_from_dir() also recognize cgroup2 fsTejun Heo2022-04-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | The function currently returns -EBADF for a directory on the default hierarchy. Make it also recognize cgroup2_fs_type. This will be used for perf_event cgroup2 support. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
* | | cgroup: s/cgrp_dfl_root_/cgrp_dfl_/Tejun Heo2022-04-19
| | | | | | | | | | | | | | | | | | | | | | | | These var names are unnecessarily unwiedly and another similar variable will be added. Let's shorten them. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
* | | cgroup: make cgroup subsystem masks u16Tejun Heo2022-04-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | After the recent do_each_subsys_mask() conversion, there's no reason to use ulong for subsystem masks. We'll be adding more subsystem masks to persistent data structures, let's reduce its size to u16 which should be enough for now and the foreseeable future. This doesn't create any noticeable behavior differences. v2: Johannes spotted that the initial patch missed cgroup_no_v1_mask. Converted. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
* | | cgroup: use do_each_subsys_mask() where applicableTejun Heo2022-04-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There are several places in cgroup_subtree_control_write() which can use do_each_subsys_mask() instead of manual mask testing. Use it. No functional changes. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
* | | cgroup: convert for_each_subsys_which() to do-while styleTejun Heo2022-04-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | for_each_subsys_which() allows iterating subsystems specified in a subsystem bitmask; unfortunately, it requires the mask to be an unsigned long l-value which can be inconvenient and makes it awkward to use a smaller type for subsystem masks. This patch converts for_each_subsy_which() to do-while style which allows it to drop the l-value requirement. The new iterator is named do_each_subsys_mask() / while_each_subsys_mask(). Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Aleksa Sarai <cyphar@cyphar.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
* | | cgroup: s/child_subsys_mask/subtree_ss_mask/Tejun Heo2022-04-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | For consistency with cgroup->subtree_control. * cgroup->child_subsys_mask -> cgroup->subtree_ss_mask * cgroup_calc_child_subsys_mask() -> cgroup_calc_subtree_ss_mask() * cgroup_refresh_child_subsys_mask() -> cgroup_refresh_subtree_ss_mask() No functional changes. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
* | | Revert "cgroup: add cgroup_subsys->css_e_css_changed()"Tejun Heo2022-04-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This reverts commit 56c807ba4e91f0980567b6a69de239677879b17f. cgroup_subsys->css_e_css_changed() was supposed to be used by cgroup writeback support; however, the change to per-inode cgroup association made it unnecessary and the callback doesn't have any user. Remove it. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
* | | cgroup: fix error return value of cgroup_addrm_files()Tejun Heo2022-04-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | cgroup_addrm_files() incorrectly returned 0 after add failure. Fix it. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
* | | cgroup: fix alloc_cgroup_ns() error handling in copy_cgroup_ns()Tejun Heo2022-04-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | alloc_cgroup_ns() returns an ERR_PTR value on error but copy_cgroup_ns() was checking for NULL for error. Fix it. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
* | | Add FS_USERNS_FLAG to cgroup fsSerge Hallyn2022-04-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | allowing root in a non-init user namespace to mount it. This should now be safe, because 1. non-init-root cannot mount a previously unbound subsystem 2. the task doing the mount must be privileged with respect to the user namespace owning the cgroup namespace 3. the mounted subsystem will have its current cgroup as the root dentry. the permissions will be unchanged, so tasks will receive no new privilege over the cgroups which they did not have on the original mounts. Signed-off-by: Serge Hallyn <serge.hallyn@canonical.com> Signed-off-by: Chatur27 <jasonbright2709@gmail.com> Change-Id: I9079d9fd23cae67ce7fc223ad8d6fc3db97536db
* | | cgroup: mount cgroupns-root when inside non-init cgroupnsSerge Hallyn2022-04-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch enables cgroup mounting inside userns when a process as appropriate privileges. The cgroup filesystem mounted is rooted at the cgroupns-root. Thus, in a container-setup, only the hierarchy under the cgroupns-root is exposed inside the container. This allows container management tools to run inside the containers without depending on any global state. Signed-off-by: Serge Hallyn <serge.hallyn@canonical.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Tejun Heo <tj@kernel.org> Change-Id: I2e1d1c127f4600e43f29877326fbf7e6a84f1459
* | | cgroup: cgroup namespace setns supportAditya Kali2022-04-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | setns on a cgroup namespace is allowed only if task has CAP_SYS_ADMIN in its current user-namespace and over the user-namespace associated with target cgroupns. No implicit cgroup changes happen with attaching to another cgroupns. It is expected that the somone moves the attaching process under the target cgroupns-root. Signed-off-by: Aditya Kali <adityakali@google.com> Signed-off-by: Serge E. Hallyn <serge.hallyn@canonical.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
* | | cgroup: introduce cgroup namespacesAditya Kali2022-04-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Introduce the ability to create new cgroup namespace. The newly created cgroup namespace remembers the cgroup of the process at the point of creation of the cgroup namespace (referred as cgroupns-root). The main purpose of cgroup namespace is to virtualize the contents of /proc/self/cgroup file. Processes inside a cgroup namespace are only able to see paths relative to their namespace root (unless they are moved outside of their cgroupns-root, at which point they will see a relative path from their cgroupns-root). For a correctly setup container this enables container-tools (like libcontainer, lxc, lmctfy, etc.) to create completely virtualized containers without leaking system level cgroup hierarchy to the task. This patch only implements the 'unshare' part of the cgroupns. Signed-off-by: Aditya Kali <adityakali@google.com> Signed-off-by: Serge Hallyn <serge.hallyn@canonical.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Chatur27 <jasonbright2709@gmail.com> Change-Id: Ifd2df9f562baa90b0fe7c986f86967602657c640
* | | cgroup: provide cgroup_nov1= to disable controllers in v1 mountsJohannes Weiner2022-04-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Testing cgroup2 can be painful with system software automatically mounting and populating all cgroup controllers in v1 mode. Sometimes they can be unmounted from rc.local, sometimes even that is too late. Provide a commandline option to disable certain controllers in v1 mounts, so that they remain available for cgroup2 mounts. Example use: cgroup_no_v1=memory,cpu cgroup_no_v1=all Disabling will be confirmed at boot-time as such: [ 0.013770] Disabling cpu control group subsystem in v1 mounts [ 0.016004] Disabling memory control group subsystem in v1 mounts Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
* | | cgroup: demote subsystem init messages to KERN_DEBUGTejun Heo2022-04-19
| | | | | | | | | | | | | | | | | | | | | These are noisy during boot and not all that interesting. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
* | | cgroup: fix a typo.Rami Rosen2022-04-19
| | | | | | | | | | | | | | | | | | | | | | | | This patch fixes a typo in a comment in cgroup.c. Signed-off-by: Rami Rosen <rami.rosen@intel.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
* | | net, cgroup: cgroup_sk_updat_lock was missing initializerTejun Heo2022-04-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | bd1060a1d671 ("sock, cgroup: add sock->sk_cgroup") added global spinlock cgroup_sk_update_lock but erroneously skipped initializer leading to uninitialized spinlock warning. Fix it by using DEFINE_SPINLOCK(). Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Dexuan Cui <decui@microsoft.com> Fixes: bd1060a1d671 ("sock, cgroup: add sock->sk_cgroup") Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
* | | sock, cgroup: add sock->sk_cgroupTejun Heo2022-04-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In cgroup v1, dealing with cgroup membership was difficult because the number of membership associations was unbound. As a result, cgroup v1 grew several controllers whose primary purpose is either tagging membership or pull in configuration knobs from other subsystems so that cgroup membership test can be avoided. net_cls and net_prio controllers are examples of the latter. They allow configuring network-specific attributes from cgroup side so that network subsystem can avoid testing cgroup membership; unfortunately, these are not only cumbersome but also problematic. Both net_cls and net_prio aren't properly hierarchical. Both inherit configuration from the parent on creation but there's no interaction afterwards. An ancestor doesn't restrict the behavior in its subtree in anyway and configuration changes aren't propagated downwards. Especially when combined with cgroup delegation, this is problematic because delegatees can mess up whatever network configuration implemented at the system level. net_prio would allow the delegatees to set whatever priority value regardless of CAP_NET_ADMIN and net_cls the same for classid. While it is possible to solve these issues from controller side by implementing hierarchical allowable ranges in both controllers, it would involve quite a bit of complexity in the controllers and further obfuscate network configuration as it becomes even more difficult to tell what's actually being configured looking from the network side. While not much can be done for v1 at this point, as membership handling is sane on cgroup v2, it'd be better to make cgroup matching behave like other network matches and classifiers than introducing further complications. In preparation, this patch updates sock->sk_cgrp_data handling so that it points to the v2 cgroup that sock was created in until either net_prio or net_cls is used. Once either of the two is used, sock->sk_cgrp_data reverts to its previous role of carrying prioidx and classid. This is to avoid adding yet another cgroup related field to struct sock. As the mode switching can happen at most once per boot, the switching mechanism is aimed at lowering hot path overhead. It may leak a finite, likely small, number of cgroup refs and report spurious prioidx or classid on switching; however, dynamic updates of prioidx and classid have always been racy and lossy - socks between creation and fd installation are never updated, config changes don't update existing sockets at all, and prioidx may index with dead and recycled cgroup IDs. Non-critical inaccuracies from small race windows won't make any noticeable difference. This patch doesn't make use of the pointer yet. The following patch will implement netfilter match for cgroup2 membership. v2: Use sock_cgroup_data to avoid inflating struct sock w/ another cgroup specific field. v3: Add comments explaining why sock_data_prioidx() and sock_data_classid() use different fallback values. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Daniel Wagner <daniel.wagner@bmw-carit.de> CC: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
* | | net: wrap sock->sk_cgrp_prioidx and ->sk_classid inside a structTejun Heo2022-04-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Introduce sock->sk_cgrp_data which is a struct sock_cgroup_data. ->sk_cgroup_prioidx and ->sk_classid are moved into it. The struct and its accessors are defined in cgroup-defs.h. This is to prepare for overloading the fields with a cgroup pointer. This patch mostly performs equivalent conversions but the followings are noteworthy. * Equality test before updating classid is removed from sock_update_classid(). This shouldn't make any noticeable difference and a similar test will be implemented on the helper side later. * sock_update_netprioidx() now takes struct sock_cgroup_data and can be moved to netprio_cgroup.h without causing include dependency loop. Moved. * The dummy version of sock_update_netprioidx() converted to a static inline function while at it. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
* | | netprio_cgroup: limit the maximum css->id to USHRT_MAXTejun Heo2022-04-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | netprio builds per-netdev contiguous priomap array which is indexed by css->id. The array is allocated using kzalloc() effectively limiting the maximum ID supported to some thousand range. This patch caps the maximum supported css->id to USHRT_MAX which should be way above what is actually useable. This allows reducing sock->sk_cgrp_prioidx to u16 from u32. The freed up part will be used to overload the cgroup related fields. sock->sk_cgrp_prioidx's position is swapped with sk_mark so that the two cgroup related fields are adjacent. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Daniel Wagner <daniel.wagner@bmw-carit.de> Cc: Daniel Borkmann <daniel@iogearbox.net> CC: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>