
[RFC,4/6] misc cgroup: introduce an fd counter

Message ID: 20231108002647.73784-5-tycho@tycho.pizza (mailing list archive)
State: New
Series: tracking fd counts per cgroup

Commit Message

Tycho Andersen Nov. 8, 2023, 12:26 a.m. UTC
From: Tycho Andersen <tandersen@netflix.com>

This idea is not new [1], but I'm hoping that reusing the "new" misc
controller, rather than introducing a completely new controller, will
make it more tenable.

This patch introduces cgroup migration primitives to the misc cgroup,
which didn't have them before. I didn't know what to do about the KVM
(SEV ASID) resources when a task migrates, so I left those as no-ops
for now. I also did not do any abstraction of the migration primitives,
although we are interested in adding other rlimit-y things to the misc
cgroup if this approach looks reasonable, for the same reasons
described above. I tried writing both dynamic-dispatch and
static-dispatch versions of such an abstraction, but they introduced a
lot of noise that seemed unnecessary for this first draft; I'm happy to
take suggestions here.

One oddity here is when the charging happens: in the
can_attach()/can_fork() family, rather than in the actual
attach()/fork() functions. This saves us from having to track a delta,
or walk the fd table twice, at the expense of a more costly revert
path.

As a result, the cancel_fork()/cancel_attach() functions do nontrivial
things, which are hard to test from userspace as far as I can tell. I
did some manual, hacky fault injection, but if there is an official way
to test these paths I'm happy to add that.

Finally, this exposes misc.current at the root level. This was useful
for testing, but is perhaps not something we need/want in the final
version.
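
For the curious, the user-visible interface then looks roughly like
this (cgroup path and values illustrative, untested):

    # cat /sys/fs/cgroup/misc.capacity
    nofile 18446744073709551615
    # echo "nofile 1024" > /sys/fs/cgroup/mygroup/misc.max
    # cat /sys/fs/cgroup/mygroup/misc.current
    nofile 42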

[1]: https://lore.kernel.org/all/1404311407-4851-1-git-send-email-merimus@google.com/

Signed-off-by: Tycho Andersen <tandersen@netflix.com>
---
 fs/file.c                   |  68 +++++++++++-
 include/linux/fdtable.h     |   4 +
 include/linux/misc_cgroup.h |   1 +
 kernel/cgroup/misc.c        | 200 +++++++++++++++++++++++++++++++++++-
 4 files changed, 270 insertions(+), 3 deletions(-)

Comments

Al Viro Nov. 8, 2023, 4:57 p.m. UTC | #1
On Tue, Nov 07, 2023 at 05:26:45PM -0700, Tycho Andersen wrote:

> +	if (!charge_current_fds(newf, count_open_files(new_fdt)))
> +		return newf;

Are you sure that on configs that are not cgroup-infested compiler
will figure out that count_open_files() would have no side effects
and doesn't need to be evaluated?

Incidentally, since you are adding your charge/uncharge stuff on each
allocation/freeing, why not simply maintain an accurate counter, cgroup or
no cgroup?  IDGI...  Make it an inlined helper right there in fs/file.c,
doing increment/decrement and, conditional upon config, calling
the cgroup side of things.  No need to look at fdt, etc. outside
of fs/file.c either - the counter can be picked right from the
files_struct...
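
A minimal sketch of what I mean (hypothetical names; assumes a plain
counter field added to files_struct, updated under ->file_lock):

    /* hypothetical sketch, not part of the patch */
    static inline int charge_fds(struct files_struct *files, unsigned int n)
    {
    #ifdef CONFIG_CGROUP_MISC
            int ret = misc_cg_try_charge(MISC_CG_RES_NOFILE, files->mcg, n);

            if (ret)
                    return ret;
    #endif
            files->nr_fds += n;     /* assumed new field */
            return 0;
    }

    static inline void uncharge_fds(struct files_struct *files, unsigned int n)
    {
    #ifdef CONFIG_CGROUP_MISC
            misc_cg_uncharge(MISC_CG_RES_NOFILE, files->mcg, n);
    #endif
            files->nr_fds -= n;
    }

dup_fd() could then charge the count it already knows instead of
calling count_open_files(), which also makes the dead-call question
above moot.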

>  static void __put_unused_fd(struct files_struct *files, unsigned int fd)
>  {
>  	struct fdtable *fdt = files_fdtable(files);
> +	if (test_bit(fd, fdt->open_fds))
> +		uncharge_current_fds(files, 1);

Umm...  Just where do we call it without the bit in ->open_fds set?
Any such caller would be a serious bug; suppose you are trying to
call __put_unused_fd(files, N) while N is not in open_fds.  Just before
your call another thread grabs a descriptor and picks N.  Resulting
state won't be pretty, especially if right *after* your call the
third thread also asks for a descriptor - and also gets N.

Sure, you have an exclusion on ->file_lock, but AFAICS all callers
are under it and in all callers except for put_unused_fd() we
have just observed a non-NULL file reference in ->fd[N]; that
would *definitely* be a hard constraint violation if it ever
happened with N not in ->open_fds at that moment.

So the only possibility would be a broken caller of put_unused_fd(),
and any such would be a serious bug.
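
Spelled out, the hypothetical interleaving would be:

    /*
     *  B: alloc_fd()                   returns N, sets N in ->open_fds
     *  A: __put_unused_fd(files, N)    N was never A's to put back
     *     -> __clear_open_fd(N, fdt)   clears B's bit
     *  C: alloc_fd()                   also returns N
     *
     * and now two threads believe they own descriptor N.
     */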

Details, please - have you ever observed that?

BTW, what about the locking hierarchy?  In the current tree ->file_lock
nests inside of everything; what happens with your patches in place?
Tycho Andersen Nov. 8, 2023, 9:01 p.m. UTC | #2
Hi Al,

Thanks for looking. Somehow I also missed CCing you, whoops.

On Wed, Nov 08, 2023 at 04:57:49PM +0000, Al Viro wrote:
> On Tue, Nov 07, 2023 at 05:26:45PM -0700, Tycho Andersen wrote:
> 
> > +	if (!charge_current_fds(newf, count_open_files(new_fdt)))
> > +		return newf;
> 
> Are you sure that on configs that are not cgroup-infested compiler
> will figure out that count_open_files() would have no side effects
> and doesn't need to be evaluated?
> 
> Incidentally, since you are adding your charge/uncharge stuff on each
> allocation/freeing, why not simply maintain an accurate counter, cgroup or
> no cgroup?  IDGI...  Make it an inlined helper right there in fs/file.c,
> doing increment/decrement and, conditional upon config, calling
> the cgroup side of things.  No need to look at fdt, etc. outside
> of fs/file.c either - the counter can be picked right from the
> files_struct...

Thanks, I can re-work it to look like this.

> >  static void __put_unused_fd(struct files_struct *files, unsigned int fd)
> >  {
> >  	struct fdtable *fdt = files_fdtable(files);
> > +	if (test_bit(fd, fdt->open_fds))
> > +		uncharge_current_fds(files, 1);
> 
> Umm...  Just where do we call it without the bit in ->open_fds set?
> Any such caller would be a serious bug; suppose you are trying to
> call __put_unused_fd(files, N) while N is not in open_fds.  Just before
> your call another thread grabs a descriptor and picks N.  Resulting
> state won't be pretty, especially if right *after* your call the
> third thread also asks for a descriptor - and also gets N.
> 
> Sure, you have an exclusion on ->file_lock, but AFAICS all callers
> are under it and in all callers except for put_unused_fd() we
> have just observed a non-NULL file reference in ->fd[N]; that
> would *definitely* be a hard constraint violation if it ever
> happened with N not in ->open_fds at that moment.
> 
> So the only possibility would be a broken caller of put_unused_fd(),
> and any such would be a serious bug.
> 
> Details, please - have you ever observed that?

No, I just kept it from the original series. I agree that it should be
safe to drop.

> BTW, what about the locking hierarchy?  In the current tree ->file_lock
> nests inside of everything; what happens with your patches in place?

If I understand correctly, you're asking about ->file_lock nesting
inside of task_lock()? I tried to make the cgroup side in this patch
take the same locks in the same order. Or am I misunderstanding?
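
That is, misc_cg_can_attach() and friends (in the patch below) nest
them as:

    task_lock(task);                    /* outer */
    spin_lock(&files->file_lock);       /* inner */
    ...
    spin_unlock(&files->file_lock);
    task_unlock(task);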

I did test this with some production container traffic and didn't see
anything too strange, but no doubt there are complicated edge cases
here.

Thanks,

Tycho
Christian Brauner Nov. 9, 2023, 9:53 a.m. UTC | #3
> @@ -411,9 +453,22 @@ struct files_struct *dup_fd(struct files_struct *oldf, unsigned int max_fds, int
>  
>  	rcu_assign_pointer(newf->fdt, new_fdt);
>  
> -	return newf;
> +	if (!charge_current_fds(newf, count_open_files(new_fdt)))
> +		return newf;


> @@ -542,6 +600,10 @@ static int alloc_fd(unsigned start, unsigned end, unsigned flags)
>  	if (error)
>  		goto repeat;
>  
> +	error = -EMFILE;
> +	if (charge_current_fds(files, 1) < 0)
> +		goto out;

Whoops, I had that message ready to fire but didn't send it.

This may have a noticeable performance impact, as charge_current_fds()
calls misc_cg_try_charge(), which looks pretty expensive in this
codepath.

We're constantly getting patches to tweak performance during file open
and close, and adding a function that requires multiple atomics and
spinlocks won't exactly improve things.
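
For context, the charge path is roughly the following (paraphrased
from kernel/cgroup/misc.c, details elided):

    /* one atomic RMW per ancestor level, walked back on failure */
    for (i = cg; i; i = parent_misc(i)) {
            res = &i->res[type];
            new_usage = atomic64_add_return(amount, &res->usage);
            if (new_usage > READ_ONCE(res->max) ||
                new_usage > READ_ONCE(misc_res_capacity[type]))
                    goto err_charge;    /* uncharge what we took */
    }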

On top of that I really dislike that we're pulling cgroups into this
code here at all.

Can you get a similar effect through a bpf program somehow, so that
you don't even tie this to cgroups?
Tycho Andersen Nov. 9, 2023, 2:58 p.m. UTC | #4
On Thu, Nov 09, 2023 at 10:53:17AM +0100, Christian Brauner wrote:
> > @@ -411,9 +453,22 @@ struct files_struct *dup_fd(struct files_struct *oldf, unsigned int max_fds, int
> >  
> >  	rcu_assign_pointer(newf->fdt, new_fdt);
> >  
> > -	return newf;
> > +	if (!charge_current_fds(newf, count_open_files(new_fdt)))
> > +		return newf;
> 
> 
> > @@ -542,6 +600,10 @@ static int alloc_fd(unsigned start, unsigned end, unsigned flags)
> >  	if (error)
> >  		goto repeat;
> >  
> > +	error = -EMFILE;
> > +	if (charge_current_fds(files, 1) < 0)
> > +		goto out;
> 
> Whoops, I had that message ready to fire but didn't send it.
> 
> This may have a noticeable performance impact, as charge_current_fds()
> calls misc_cg_try_charge(), which looks pretty expensive in this
> codepath.
> 
> We're constantly getting patches to tweak performance during file open
> and close, and adding a function that requires multiple atomics and
> spinlocks won't exactly improve things.

I don't see any spinlocks in misc_cg_try_charge(), but it does walk
up the tree, resulting in multiple atomic writes.

If we didn't walk up the tree it would change the semantics, but
Netflix probably wouldn't delegate this, so at least for our purposes
having only one atomic would be sufficient. Is that more tenable?
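
i.e. something like this instead (hypothetical):

    /* charge only the task's own cgroup: one atomic RMW, no walk */
    new_usage = atomic64_add_return(n, &cg->res[type].usage);
    if (new_usage > READ_ONCE(cg->res[type].max)) {
            atomic64_sub(n, &cg->res[type].usage);
            return -EBUSY;
    }
    return 0;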

> On top of that I really dislike that we're pulling cgroups into this
> code here at all.
> 
> Can you get a similar effect through a bpf program somehow, so that
> you don't even tie this to cgroups?

Possibly, I can look into it.
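
A first guess at the shape of it (hypothetical and untested): a BPF
LSM program on the file_open hook, keyed by cgroup id, something like:

    /* hypothetical, untested: refuse opens past a per-cgroup budget */
    #include "vmlinux.h"
    #include <bpf/bpf_helpers.h>
    #include <bpf/bpf_tracing.h>

    #define FD_BUDGET 1024      /* arbitrary */

    struct {
            __uint(type, BPF_MAP_TYPE_HASH);
            __uint(max_entries, 10240);
            __type(key, __u64);         /* cgroup id */
            __type(value, __u64);       /* opens charged */
    } charged SEC(".maps");

    char LICENSE[] SEC("license") = "GPL";

    SEC("lsm/file_open")
    int BPF_PROG(nofile_open, struct file *file)
    {
            __u64 cgid = bpf_get_current_cgroup_id();
            __u64 zero = 0, *cnt;

            cnt = bpf_map_lookup_elem(&charged, &cgid);
            if (!cnt) {
                    bpf_map_update_elem(&charged, &cgid, &zero, BPF_NOEXIST);
                    cnt = bpf_map_lookup_elem(&charged, &cgid);
                    if (!cnt)
                            return 0;
            }
            if (__sync_fetch_and_add(cnt, 1) >= FD_BUDGET)
                    return -EMFILE;
            return 0;
    }

but that counts opens rather than live fds; I don't see an obvious
hook for fd release, which may be the sticking point.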

Tycho

Patch

diff --git a/fs/file.c b/fs/file.c
index 539bead2364e..b27ec5c9d77e 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -20,6 +20,7 @@ 
 #include <linux/spinlock.h>
 #include <linux/rcupdate.h>
 #include <linux/close_range.h>
+#include <linux/misc_cgroup.h>
 #include <net/sock.h>
 
 #include "internal.h"
@@ -318,6 +319,45 @@  static unsigned int sane_fdtable_size(struct fdtable *fdt, unsigned int max_fds)
 	return ALIGN(min(count, max_fds), BITS_PER_LONG);
 }
 
+#ifdef CONFIG_CGROUP_MISC
+static int charge_current_fds(struct files_struct *files, unsigned int count)
+{
+	return misc_cg_try_charge(MISC_CG_RES_NOFILE, files->mcg, count);
+}
+
+static void uncharge_current_fds(struct files_struct *files, unsigned int count)
+{
+	misc_cg_uncharge(MISC_CG_RES_NOFILE, files->mcg, count);
+}
+
+static void files_get_misc_cg(struct files_struct *newf)
+{
+	newf->mcg = get_current_misc_cg();
+}
+
+static void files_put_misc_cg(struct files_struct *newf)
+{
+	put_misc_cg(newf->mcg);
+}
+#else
+static int charge_current_fds(struct files_struct *files, unsigned int count)
+{
+	return 0;
+}
+
+static void uncharge_current_fds(struct files_struct *files, unsigned int count)
+{
+}
+
+static void files_get_misc_cg(struct files_struct *newf)
+{
+}
+
+static void files_put_misc_cg(struct files_struct *newf)
+{
+}
+#endif
+
 /*
  * Allocate a new files structure and copy contents from the
  * passed in files structure.
@@ -341,6 +381,7 @@  struct files_struct *dup_fd(struct files_struct *oldf, unsigned int max_fds, int
 	newf->resize_in_progress = false;
 	init_waitqueue_head(&newf->resize_wait);
 	newf->next_fd = 0;
+	files_get_misc_cg(newf);
 	new_fdt = &newf->fdtab;
 	new_fdt->max_fds = NR_OPEN_DEFAULT;
 	new_fdt->close_on_exec = newf->close_on_exec_init;
@@ -350,6 +391,7 @@  struct files_struct *dup_fd(struct files_struct *oldf, unsigned int max_fds, int
 
 	spin_lock(&oldf->file_lock);
 	old_fdt = files_fdtable(oldf);
+
 	open_files = sane_fdtable_size(old_fdt, max_fds);
 
 	/*
@@ -411,9 +453,22 @@  struct files_struct *dup_fd(struct files_struct *oldf, unsigned int max_fds, int
 
 	rcu_assign_pointer(newf->fdt, new_fdt);
 
-	return newf;
+	if (!charge_current_fds(newf, count_open_files(new_fdt)))
+		return newf;
+
+	new_fds = new_fdt->fd;
+	for (i = open_files; i != 0; i--) {
+		struct file *f = *new_fds++;
+
+		if (f)
+			fput(f);
+	}
+	if (new_fdt != &newf->fdtab)
+		__free_fdtable(new_fdt);
+	*errorp = -EMFILE;
 
 out_release:
+	files_put_misc_cg(newf);
 	kmem_cache_free(files_cachep, newf);
 out:
 	return NULL;
@@ -439,6 +494,7 @@  static struct fdtable *close_files(struct files_struct * files)
 			if (set & 1) {
 				struct file * file = xchg(&fdt->fd[i], NULL);
 				if (file) {
+					uncharge_current_fds(files, 1);
 					filp_close(file, files);
 					cond_resched();
 				}
@@ -448,6 +504,8 @@  static struct fdtable *close_files(struct files_struct * files)
 		}
 	}
 
+	files_put_misc_cg(files);
+
 	return fdt;
 }
 
@@ -542,6 +600,10 @@  static int alloc_fd(unsigned start, unsigned end, unsigned flags)
 	if (error)
 		goto repeat;
 
+	error = -EMFILE;
+	if (charge_current_fds(files, 1) < 0)
+		goto out;
+
 	if (start <= files->next_fd)
 		files->next_fd = fd + 1;
 
@@ -578,6 +640,8 @@  EXPORT_SYMBOL(get_unused_fd_flags);
 static void __put_unused_fd(struct files_struct *files, unsigned int fd)
 {
 	struct fdtable *fdt = files_fdtable(files);
+	if (test_bit(fd, fdt->open_fds))
+		uncharge_current_fds(files, 1);
 	__clear_open_fd(fd, fdt);
 	if (fd < files->next_fd)
 		files->next_fd = fd;
@@ -1248,7 +1312,7 @@  __releases(&files->file_lock)
 	 */
 	fdt = files_fdtable(files);
 	tofree = fdt->fd[fd];
-	if (!tofree && fd_is_open(fd, fdt))
+	if (!tofree && (fd_is_open(fd, fdt) || charge_current_fds(files, 1) < 0))
 		goto Ebusy;
 	get_file(file);
 	rcu_assign_pointer(fdt->fd[fd], file);
diff --git a/include/linux/fdtable.h b/include/linux/fdtable.h
index d74234c5d4e9..b8783fa0f63f 100644
--- a/include/linux/fdtable.h
+++ b/include/linux/fdtable.h
@@ -14,6 +14,7 @@ 
 #include <linux/types.h>
 #include <linux/init.h>
 #include <linux/fs.h>
+#include <linux/misc_cgroup.h>
 
 #include <linux/atomic.h>
 
@@ -65,6 +66,9 @@  struct files_struct {
 	unsigned long open_fds_init[1];
 	unsigned long full_fds_bits_init[1];
 	struct file __rcu * fd_array[NR_OPEN_DEFAULT];
+#ifdef CONFIG_CGROUP_MISC
+	struct misc_cg *mcg;
+#endif
 };
 
 struct file_operations;
diff --git a/include/linux/misc_cgroup.h b/include/linux/misc_cgroup.h
index 6ddffeeb6f97..a675be53c875 100644
--- a/include/linux/misc_cgroup.h
+++ b/include/linux/misc_cgroup.h
@@ -18,6 +18,7 @@  enum misc_res_type {
 	/* AMD SEV-ES ASIDs resource */
 	MISC_CG_RES_SEV_ES,
 #endif
+	MISC_CG_RES_NOFILE,
 	MISC_CG_RES_TYPES
 };
 
diff --git a/kernel/cgroup/misc.c b/kernel/cgroup/misc.c
index bbce097270cf..a28f97307b3e 100644
--- a/kernel/cgroup/misc.c
+++ b/kernel/cgroup/misc.c
@@ -12,6 +12,8 @@ 
 #include <linux/atomic.h>
 #include <linux/slab.h>
 #include <linux/misc_cgroup.h>
+#include <linux/mm.h>
+#include <linux/fdtable.h>
 
 #define MAX_STR "max"
 #define MAX_NUM U64_MAX
@@ -24,6 +26,7 @@  static const char *const misc_res_name[] = {
 	/* AMD SEV-ES ASIDs resource */
 	"sev_es",
 #endif
+	"nofile",
 };
 
 /* Root misc cgroup */
@@ -37,7 +40,9 @@  static struct misc_cg root_cg;
  * more than the actual capacity. We are using Limits resource distribution
  * model of cgroup for miscellaneous controller.
  */
-static u64 misc_res_capacity[MISC_CG_RES_TYPES];
+static u64 misc_res_capacity[MISC_CG_RES_TYPES] = {
+	[MISC_CG_RES_NOFILE] = MAX_NUM,
+};
 
 /**
  * parent_misc() - Get the parent of the passed misc cgroup.
@@ -445,10 +450,203 @@  static void misc_cg_free(struct cgroup_subsys_state *css)
 	kfree(css_misc(css));
 }
 
+static void revert_attach_until(struct cgroup_taskset *tset, struct task_struct *stop)
+{
+	struct task_struct *task;
+	struct cgroup_subsys_state *dst_css;
+
+	cgroup_taskset_for_each(task, dst_css, tset) {
+		struct misc_cg *misc, *old_misc;
+		struct cgroup_subsys_state *old_css;
+		struct files_struct *files;
+		struct fdtable *fdt;
+		unsigned long nofile;
+
+		if (task == stop)
+			break;
+
+		misc = css_misc(dst_css);
+		old_css = task_css(task, misc_cgrp_id);
+		old_misc = css_misc(old_css);
+
+		if (misc == old_misc)
+			continue;
+
+		task_lock(task);
+		files = task->files;
+		spin_lock(&files->file_lock);
+		fdt = files_fdtable(files);
+
+		if (old_misc == files->mcg)
+			goto done;
+
+		WARN_ON_ONCE(misc != files->mcg);
+
+		nofile = count_open_files(fdt);
+		misc_cg_charge(MISC_CG_RES_NOFILE, old_misc, nofile);
+		misc_cg_uncharge(MISC_CG_RES_NOFILE, misc, nofile);
+
+		put_misc_cg(files->mcg);
+		css_get(old_css);
+		files->mcg = old_misc;
+
+done:
+		spin_unlock(&files->file_lock);
+		task_unlock(task);
+	}
+}
+
+static int misc_cg_can_attach(struct cgroup_taskset *tset)
+{
+	struct task_struct *task;
+	struct cgroup_subsys_state *dst_css;
+
+	cgroup_taskset_for_each(task, dst_css, tset) {
+		struct misc_cg *misc, *old_misc;
+		struct cgroup_subsys_state *old_css;
+		unsigned long nofile;
+		struct files_struct *files;
+		struct fdtable *fdt;
+		int ret;
+
+		misc = css_misc(dst_css);
+		old_css = task_css(task, misc_cgrp_id);
+		old_misc = css_misc(old_css);
+
+		if (misc == old_misc)
+			continue;
+
+		task_lock(task);
+		files = task->files;
+		spin_lock(&files->file_lock);
+		fdt = files_fdtable(files);
+
+		/*
+		 * If this task->files was already in the right place (either
+		 * because of dup_fd() or because some other thread had already
+		 * migrated it), we don't need to do anything.
+		 */
+		if (misc == files->mcg)
+			goto done;
+
+		WARN_ON_ONCE(old_misc != files->mcg);
+
+		nofile = count_open_files(fdt);
+		ret = misc_cg_try_charge(MISC_CG_RES_NOFILE, misc, nofile);
+		if (ret < 0) {
+			spin_unlock(&files->file_lock);
+			task_unlock(task);
+			revert_attach_until(tset, task);
+			return ret;
+		}
+		misc_cg_uncharge(MISC_CG_RES_NOFILE, old_misc, nofile);
+
+		/*
+		 * let's ref the new table, install it, and
+		 * deref the old one.
+		 */
+		put_misc_cg(files->mcg);
+		css_get(dst_css);
+		files->mcg = misc;
+
+done:
+		spin_unlock(&files->file_lock);
+		task_unlock(task);
+
+	}
+
+	return 0;
+}
+
+static void misc_cg_cancel_attach(struct cgroup_taskset *tset)
+{
+	revert_attach_until(tset, NULL);
+}
+
+static int misc_cg_can_fork(struct task_struct *task, struct css_set *cset)
+{
+	struct misc_cg *dst_misc, *init_misc;
+	struct files_struct *files;
+	struct fdtable *fdt;
+	unsigned long nofile;
+	struct cgroup_subsys_state *dst_css, *cur_css;
+	int ret;
+
+	init_misc = css_misc(init_css_set.subsys[misc_cgrp_id]);
+	cur_css = task_get_css(task, misc_cgrp_id);
+
+	WARN_ON_ONCE(init_misc != css_misc(cur_css));
+
+	dst_css = cset->subsys[misc_cgrp_id];
+	dst_misc = css_misc(dst_css);
+
+	/*
+	 * When forking, tasks are initially put into the init_css_set (see
+	 * cgroup_fork()). Then, we do a dup_fd() and charge init_css_set for
+	 * the new task's fds. We need to migrate from the init_css_set to the
+	 * target one so we can charge the right place.
+	 */
+	task_lock(task);
+	files = task->files;
+	spin_lock(&files->file_lock);
+	fdt = files_fdtable(files);
+
+	ret = 0;
+	if (files->mcg == dst_misc)
+		goto out;
+
+	nofile = count_open_files(fdt);
+	ret = misc_cg_try_charge(MISC_CG_RES_NOFILE, dst_misc, nofile);
+	if (ret < 0)
+		goto out;
+
+	misc_cg_uncharge(MISC_CG_RES_NOFILE, init_misc, nofile);
+
+	put_misc_cg(files->mcg);
+	css_get(dst_css);
+	files->mcg = dst_misc;
+	ret = 0;
+
+out:
+	spin_unlock(&files->file_lock);
+	task_unlock(task);
+
+	return ret;
+}
+
+static void misc_cg_cancel_fork(struct task_struct *task, struct css_set *cset)
+{
+	struct misc_cg *dst_misc;
+	struct files_struct *files;
+	struct fdtable *fdt;
+	unsigned long nofile;
+	struct cgroup_subsys_state *dst_css;
+
+	dst_css = cset->subsys[misc_cgrp_id];
+	dst_misc = css_misc(dst_css);
+
+	task_lock(task);
+	files = task->files;
+	spin_lock(&files->file_lock);
+	fdt = files_fdtable(files);
+
+	/*
+	 * we don't need to re-charge anyone, since this fork is going away.
+	 */
+	nofile = count_open_files(fdt);
+	misc_cg_uncharge(MISC_CG_RES_NOFILE, dst_misc, nofile);
+	spin_unlock(&files->file_lock);
+	task_unlock(task);
+}
+
 /* Cgroup controller callbacks */
 struct cgroup_subsys misc_cgrp_subsys = {
 	.css_alloc = misc_cg_alloc,
 	.css_free = misc_cg_free,
 	.legacy_cftypes = misc_cg_files,
 	.dfl_cftypes = misc_cg_files,
+	.can_attach = misc_cg_can_attach,
+	.cancel_attach = misc_cg_cancel_attach,
+	.can_fork = misc_cg_can_fork,
+	.cancel_fork = misc_cg_cancel_fork,
 };