Message ID | 20220217153620.4607bc28@imladris.surriel.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | [RFC] ipc,fs: use rcu_work to free struct ipc_namespace | expand |
On Thu, Feb 17, 2022 at 03:36:20PM -0500, Rik van Riel wrote: > The patch works, but a cleanup question for Al Viro: > > How do we get rid of #include "../fs/mount.h" and the raw ->mnt_ns = NULL thing > in the cleanest way? > > ---8<--- > Currently freeing ipc_namespace structures is done through a > workqueue, with every single item on the queue waiting in > synchronize_rcu before it is freed, limiting the rate at which > ipc_namespace structures can be freed to something on the order > of 100 a second. > > Getting rid of that workqueue and just using rcu_work instead > allows a whole batch of ipc_namespace frees to wait one single > RCU grace period, after which they can all get freed quickly. > > Without this patch, a test program that simply calls > unshare(CLONE_NEWIPC) a million times in a loop eventually > gets -ENOSPC as the total number of ipc_namespace structures > exceeds the limit, due to slow freeing. > > With this patch, the test program runs successfully every time. > > Reported-by: Chris Mason <clm@fb.com> > Signed-off-by: Rik van Riel <riel@surriel.com> From an RCU perspective, with the ever-dangerous presumption that the prior code was functionally correct: Reviewed-by: Paul E. McKenney <paulmck@kernel.org> > --- > include/linux/ipc_namespace.h | 2 +- > ipc/namespace.c | 30 ++++++++---------------------- > 2 files changed, 9 insertions(+), 23 deletions(-) > > diff --git a/include/linux/ipc_namespace.h b/include/linux/ipc_namespace.h > index b75395ec8d52..ee26fdbb2ce4 100644 > --- a/include/linux/ipc_namespace.h > +++ b/include/linux/ipc_namespace.h > @@ -67,7 +67,7 @@ struct ipc_namespace { > struct user_namespace *user_ns; > struct ucounts *ucounts; > > - struct llist_node mnt_llist; > + struct rcu_work free_rwork; > > struct ns_common ns; > } __randomize_layout; > diff --git a/ipc/namespace.c b/ipc/namespace.c > index ae83f0f2651b..3d151bc5f723 100644 > --- a/ipc/namespace.c > +++ b/ipc/namespace.c > @@ -17,6 +17,7 @@ > #include <linux/proc_ns.h> > #include <linux/sched/task.h> > > +#include "../fs/mount.h" > #include "util.h" > > static struct ucounts *inc_ipc_namespaces(struct user_namespace *ns) > @@ -115,12 +116,11 @@ void free_ipcs(struct ipc_namespace *ns, struct ipc_ids *ids, > up_write(&ids->rwsem); > } > > -static void free_ipc_ns(struct ipc_namespace *ns) > +static void free_ipc_ns(struct work_struct *work) > { > - /* mq_put_mnt() waits for a grace period as kern_unmount() > - * uses synchronize_rcu(). > - */ > - mq_put_mnt(ns); > + struct ipc_namespace *ns = container_of(to_rcu_work(work), > + struct ipc_namespace, free_rwork); > + mntput(ns->mq_mnt); > sem_exit_ns(ns); > msg_exit_ns(ns); > shm_exit_ns(ns); > @@ -131,21 +131,6 @@ static void free_ipc_ns(struct ipc_namespace *ns) > kfree(ns); > } > > -static LLIST_HEAD(free_ipc_list); > -static void free_ipc(struct work_struct *unused) > -{ > - struct llist_node *node = llist_del_all(&free_ipc_list); > - struct ipc_namespace *n, *t; > - > - llist_for_each_entry_safe(n, t, node, mnt_llist) > - free_ipc_ns(n); > -} > - > -/* > - * The work queue is used to avoid the cost of synchronize_rcu in kern_unmount. > - */ > -static DECLARE_WORK(free_ipc_work, free_ipc); > - > /* > * put_ipc_ns - drop a reference to an ipc namespace. > * @ns: the namespace to put > @@ -166,10 +151,11 @@ void put_ipc_ns(struct ipc_namespace *ns) > { > if (refcount_dec_and_lock(&ns->ns.count, &mq_lock)) { > mq_clear_sbinfo(ns); > + real_mount(ns->mq_mnt)->mnt_ns = NULL; > spin_unlock(&mq_lock); > > - if (llist_add(&ns->mnt_llist, &free_ipc_list)) > - schedule_work(&free_ipc_work); > + INIT_RCU_WORK(&ns->free_rwork, free_ipc_ns); > + queue_rcu_work(system_wq, &ns->free_rwork); > } > } > > -- > 2.34.1 > >
On Thu, Feb 17, 2022 at 03:36:20PM -0500, Rik van Riel wrote: > The patch works, but a cleanup question for Al Viro: > > How do we get rid of #include "../fs/mount.h" and the raw ->mnt_ns = NULL thing > in the cleanest way? Hell knows... mnt_make_shortterm(mnt) with big, fat warning along the lines of "YOU MUST HAVE AN RCU GRACE PERIOD BEFORE YOU DROP THAT REFERENCE!!!", perhaps?
On Fri, Feb 18, 2022 at 03:29:56AM +0000, Al Viro wrote: > On Thu, Feb 17, 2022 at 03:36:20PM -0500, Rik van Riel wrote: > > The patch works, but a cleanup question for Al Viro: > > > > How do we get rid of #include "../fs/mount.h" and the raw ->mnt_ns = NULL thing > > in the cleanest way? > > Hell knows... mnt_make_shortterm(mnt) with big, fat warning along the lines of > "YOU MUST HAVE AN RCU GRACE PERIOD BEFORE YOU DROP THAT REFERENCE!!!", perhaps? Rik's patch uses queue_rcu_work(), which always uses a normal grace period. Therefore, one way of checking that an RCU grace period has elapsed is as follows: Step 1: Get a snapshot of the normal grace-period state: rcuseq = get_state_synchronize_rcu(); // In mainline Step 2: Verify that a normal grace period has elapsed since step 1: WARN_ON_ONCE(!poll_state_synchronize_rcu(rcuseq)); These functions are both in mainline. And apologies for my answer on IRC being unhelpful. Here is hoping that this is more to the point. Thanx, Paul ------------------------------------------------------------------------ PS. Just in case it ever becomes relevant, if Rik's patch were instead to use synchronize_rcu() or synchronize_rcu_expedited() to wait for the grace period, it would be necessary to capture both the normal and expedited grace-period state: Step 1: Get a snapshot of both the normal and the expedited state: rcuseq = get_state_synchronize_rcu(); rcuxseq = get_state_synchronize_rcu_expedited(); Step 2: Verify that either a normal or expedited grace period has elapsed since step 1: WARN_ON_ONCE(!poll_state_synchronize_rcu(rcuseq) && !poll_state_synchronize_rcu_expedited(rcuxseq)); The reason for doing both is that synchronize_rcu_expedited() and synchronize_rcu() can both be switched between using normal and expedited grace periods. Not just at boot time, but also at runtime. Fun. The two expedited functions are in -rcu rather than mainline, so if someone does ever need them, please let me know.
On Thu, Feb 17, 2022 at 03:36:20PM -0500, Rik van Riel wrote: > The patch works, but a cleanup question for Al Viro: > > How do we get rid of #include "../fs/mount.h" and the raw ->mnt_ns = NULL thing > in the cleanest way? Maybe add a tiny helper to include/linux/mount.h or ipc_namespace.h, slap two underscores in front of it, mention in the comment that this is i) not for general use, ii) requires - as Al said - clean rcu handling, iii) should never ever be exported? To prevent it from being reusable you could make it take struct ipc_namespace as arg instead of struct (vfs)mnt? That might mean you'd have to include it in ipc_namespace.h, I guess. Not sure what's best there. > > ---8<--- > Currently freeing ipc_namespace structures is done through a > workqueue, with every single item on the queue waiting in > synchronize_rcu before it is freed, limiting the rate at which > ipc_namespace structures can be freed to something on the order > of 100 a second. > > Getting rid of that workqueue and just using rcu_work instead > allows a whole batch of ipc_namespace frees to wait one single > RCU grace period, after which they can all get freed quickly. > > Without this patch, a test program that simply calls > unshare(CLONE_NEWIPC) a million times in a loop eventually > gets -ENOSPC as the total number of ipc_namespace structures > exceeds the limit, due to slow freeing. > > With this patch, the test program runs successfully every time. > > Reported-by: Chris Mason <clm@fb.com> > Signed-off-by: Rik van Riel <riel@surriel.com> > --- > include/linux/ipc_namespace.h | 2 +- > ipc/namespace.c | 30 ++++++++---------------------- > 2 files changed, 9 insertions(+), 23 deletions(-) > > diff --git a/include/linux/ipc_namespace.h b/include/linux/ipc_namespace.h > index b75395ec8d52..ee26fdbb2ce4 100644 > --- a/include/linux/ipc_namespace.h > +++ b/include/linux/ipc_namespace.h > @@ -67,7 +67,7 @@ struct ipc_namespace { > struct user_namespace *user_ns; > struct ucounts *ucounts; > > - struct llist_node mnt_llist; > + struct rcu_work free_rwork; > > struct ns_common ns; > } __randomize_layout; > diff --git a/ipc/namespace.c b/ipc/namespace.c > index ae83f0f2651b..3d151bc5f723 100644 > --- a/ipc/namespace.c > +++ b/ipc/namespace.c > @@ -17,6 +17,7 @@ > #include <linux/proc_ns.h> > #include <linux/sched/task.h> > > +#include "../fs/mount.h" > #include "util.h" > > static struct ucounts *inc_ipc_namespaces(struct user_namespace *ns) > @@ -115,12 +116,11 @@ void free_ipcs(struct ipc_namespace *ns, struct ipc_ids *ids, > up_write(&ids->rwsem); > } > > -static void free_ipc_ns(struct ipc_namespace *ns) > +static void free_ipc_ns(struct work_struct *work) > { > - /* mq_put_mnt() waits for a grace period as kern_unmount() > - * uses synchronize_rcu(). > - */ > - mq_put_mnt(ns); > + struct ipc_namespace *ns = container_of(to_rcu_work(work), > + struct ipc_namespace, free_rwork); > + mntput(ns->mq_mnt); > sem_exit_ns(ns); > msg_exit_ns(ns); > shm_exit_ns(ns); > @@ -131,21 +131,6 @@ static void free_ipc_ns(struct ipc_namespace *ns) > kfree(ns); > } > > -static LLIST_HEAD(free_ipc_list); > -static void free_ipc(struct work_struct *unused) > -{ > - struct llist_node *node = llist_del_all(&free_ipc_list); > - struct ipc_namespace *n, *t; > - > - llist_for_each_entry_safe(n, t, node, mnt_llist) > - free_ipc_ns(n); > -} > - > -/* > - * The work queue is used to avoid the cost of synchronize_rcu in kern_unmount. > - */ > -static DECLARE_WORK(free_ipc_work, free_ipc); > - > /* > * put_ipc_ns - drop a reference to an ipc namespace. > * @ns: the namespace to put > @@ -166,10 +151,11 @@ void put_ipc_ns(struct ipc_namespace *ns) > { > if (refcount_dec_and_lock(&ns->ns.count, &mq_lock)) { > mq_clear_sbinfo(ns); > + real_mount(ns->mq_mnt)->mnt_ns = NULL; > spin_unlock(&mq_lock); > > - if (llist_add(&ns->mnt_llist, &free_ipc_list)) > - schedule_work(&free_ipc_work); > + INIT_RCU_WORK(&ns->free_rwork, free_ipc_ns); > + queue_rcu_work(system_wq, &ns->free_rwork); > } > } > > -- > 2.34.1 > >
Maybe I am reading the lifetimes wrong but is there any chance the code can just do something like the diff below? AKA have a special version of kern_umount that does the call_rcu? Looking at rcu_reclaim_tiny I think this use of mnt_rcu is valid. AKA reusing the rcu_head in the rcu callback. diff --git a/fs/namespace.c b/fs/namespace.c index 40b994a29e90..7d7aaef1592e 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -4395,6 +4395,22 @@ void kern_unmount(struct vfsmount *mnt) } EXPORT_SYMBOL(kern_unmount); +static void rcu_mntput(struct rcu_head *head) +{ + struct mount *mnt = container_of(head, struct mount, mnt_rcu); + mntput(&mnt->mnt); +} + +void kern_rcu_unmount(struct vfsmount *mnt) +{ + /* release long term mount so mount point can be released */ + if (!IS_ERR_OR_NULL(mnt)) { + struct mount *m = real_mount(mnt); + m->mnt_ns = NULL; + call_rcu(&m->mnt_rcu, rcu_mntput); + } +} + void kern_unmount_array(struct vfsmount *mnt[], unsigned int num) { unsigned int i; diff --git a/ipc/mqueue.c b/ipc/mqueue.c index 5becca9be867..e54742f82e7d 100644 --- a/ipc/mqueue.c +++ b/ipc/mqueue.c @@ -1700,7 +1700,7 @@ void mq_clear_sbinfo(struct ipc_namespace *ns) void mq_put_mnt(struct ipc_namespace *ns) { - kern_unmount(ns->mq_mnt); + kern_rcu_unmount(ns->mq_mnt); } static int __init init_mqueue_fs(void) Eric
On Fri, Feb 18, 2022 at 10:08:05AM -0600, Eric W. Biederman wrote: > > Maybe I am reading the lifetimes wrong but is there > any chance the code can just do something like the diff below? > > AKA have a special version of kern_umount that does the call_rcu? > > Looking at rcu_reclaim_tiny I think this use of mnt_rcu is valid. > AKA reusing the rcu_head in the rcu callback. As long as you don't try to pass a given rcu_head structure to call_rcu() before some previous call_rcu() has invoked the corresponding callback, this can work. Careful, though, because rcu_reclaim_tiny() is part of Tiny RCU, which assumes NR_CPUS=1. ;-) The DEBUG_OBJECTS_RCU_HEAD Kconfig option checks for double-call_rcu() of a single rcu_head structure. Of course, you would need a test that actually forces a race between the other uses of ->mnt_rcu. Except that it looks like mntput_no_expire() is using ->mnt_rcu for other purposes, which DEBUG_OBJECTS_RCU_HEAD is blissfully unaware of. But doesn't mntput() call mntput_no_expire(), which in turn calls lock_mount_hash(), which calls write_seqlock(), which is not going to be happy in an RCU callback's BH-disabled execution context? Or did I miss a turn in there somewhere? Thanx, Paul > diff --git a/fs/namespace.c b/fs/namespace.c > index 40b994a29e90..7d7aaef1592e 100644 > --- a/fs/namespace.c > +++ b/fs/namespace.c > @@ -4395,6 +4395,22 @@ void kern_unmount(struct vfsmount *mnt) > } > EXPORT_SYMBOL(kern_unmount); > > +static void rcu_mntput(struct rcu_head *head) > +{ > + struct mount *mnt = container_of(head, struct mount, mnt_rcu); > + mntput(&mnt->mnt); > +} > + > +void kern_rcu_unmount(struct vfsmount *mnt) > +{ > + /* release long term mount so mount point can be released */ > + if (!IS_ERR_OR_NULL(mnt)) { > + struct mount *m = real_mount(mnt); > + m->mnt_ns = NULL; > + call_rcu(&m->mnt_rcu, rcu_mntput); > + } > +} > + > void kern_unmount_array(struct vfsmount *mnt[], unsigned int num) > { > unsigned int i; > diff --git a/ipc/mqueue.c b/ipc/mqueue.c > index 5becca9be867..e54742f82e7d 100644 > --- a/ipc/mqueue.c > +++ b/ipc/mqueue.c > @@ -1700,7 +1700,7 @@ void mq_clear_sbinfo(struct ipc_namespace *ns) > > void mq_put_mnt(struct ipc_namespace *ns) > { > - kern_unmount(ns->mq_mnt); > + kern_rcu_unmount(ns->mq_mnt); > } > > static int __init init_mqueue_fs(void) > > Eric
On Fri, 2022-02-18 at 10:08 -0600, Eric W. Biederman wrote: > > Maybe I am reading the lifetimes wrong but is there > any chance the code can just do something like the diff below? > > AKA have a special version of kern_umount that does the call_rcu? > > Looking at rcu_reclaim_tiny I think this use of mnt_rcu is valid. > AKA reusing the rcu_head in the rcu callback. > > > diff --git a/fs/namespace.c b/fs/namespace.c > index 40b994a29e90..7d7aaef1592e 100644 > --- a/fs/namespace.c > +++ b/fs/namespace.c > @@ -4395,6 +4395,22 @@ void kern_unmount(struct vfsmount *mnt) > } > EXPORT_SYMBOL(kern_unmount); > > +static void rcu_mntput(struct rcu_head *head) > +{ > + struct mount *mnt = container_of(head, struct mount, > mnt_rcu); > + mntput(&mnt->mnt); > +} > + > +void kern_rcu_unmount(struct vfsmount *mnt) > +{ > + /* release long term mount so mount point can be released */ > + if (!IS_ERR_OR_NULL(mnt)) { > + struct mount *m = real_mount(mnt); > + m->mnt_ns = NULL; > + call_rcu(&m->mnt_rcu, rcu_mntput); > + } > +} OK, two comments here: 1) As Paul pointed out, we need to use call_rcu_work here, because rcu_mntput needs to run from a work item (or well, any process context) because of the rwlock. 2) No user of kern_unmount can use the vfsmount structure after kern_unmount returns. That means they could all use the RCU version, and no special version is needed. Let me go test a patch that does that...
diff --git a/include/linux/ipc_namespace.h b/include/linux/ipc_namespace.h index b75395ec8d52..ee26fdbb2ce4 100644 --- a/include/linux/ipc_namespace.h +++ b/include/linux/ipc_namespace.h @@ -67,7 +67,7 @@ struct ipc_namespace { struct user_namespace *user_ns; struct ucounts *ucounts; - struct llist_node mnt_llist; + struct rcu_work free_rwork; struct ns_common ns; } __randomize_layout; diff --git a/ipc/namespace.c b/ipc/namespace.c index ae83f0f2651b..3d151bc5f723 100644 --- a/ipc/namespace.c +++ b/ipc/namespace.c @@ -17,6 +17,7 @@ #include <linux/proc_ns.h> #include <linux/sched/task.h> +#include "../fs/mount.h" #include "util.h" static struct ucounts *inc_ipc_namespaces(struct user_namespace *ns) @@ -115,12 +116,11 @@ void free_ipcs(struct ipc_namespace *ns, struct ipc_ids *ids, up_write(&ids->rwsem); } -static void free_ipc_ns(struct ipc_namespace *ns) +static void free_ipc_ns(struct work_struct *work) { - /* mq_put_mnt() waits for a grace period as kern_unmount() - * uses synchronize_rcu(). - */ - mq_put_mnt(ns); + struct ipc_namespace *ns = container_of(to_rcu_work(work), + struct ipc_namespace, free_rwork); + mntput(ns->mq_mnt); sem_exit_ns(ns); msg_exit_ns(ns); shm_exit_ns(ns); @@ -131,21 +131,6 @@ static void free_ipc_ns(struct ipc_namespace *ns) kfree(ns); } -static LLIST_HEAD(free_ipc_list); -static void free_ipc(struct work_struct *unused) -{ - struct llist_node *node = llist_del_all(&free_ipc_list); - struct ipc_namespace *n, *t; - - llist_for_each_entry_safe(n, t, node, mnt_llist) - free_ipc_ns(n); -} - -/* - * The work queue is used to avoid the cost of synchronize_rcu in kern_unmount. - */ -static DECLARE_WORK(free_ipc_work, free_ipc); - /* * put_ipc_ns - drop a reference to an ipc namespace. * @ns: the namespace to put @@ -166,10 +151,11 @@ void put_ipc_ns(struct ipc_namespace *ns) { if (refcount_dec_and_lock(&ns->ns.count, &mq_lock)) { mq_clear_sbinfo(ns); + real_mount(ns->mq_mnt)->mnt_ns = NULL; spin_unlock(&mq_lock); - if (llist_add(&ns->mnt_llist, &free_ipc_list)) - schedule_work(&free_ipc_work); + INIT_RCU_WORK(&ns->free_rwork, free_ipc_ns); + queue_rcu_work(system_wq, &ns->free_rwork); } }
The patch works, but a cleanup question for Al Viro: How do we get rid of #include "../fs/mount.h" and the raw ->mnt_ns = NULL thing in the cleanest way? ---8<--- Currently freeing ipc_namespace structures is done through a workqueue, with every single item on the queue waiting in synchronize_rcu before it is freed, limiting the rate at which ipc_namespace structures can be freed to something on the order of 100 a second. Getting rid of that workqueue and just using rcu_work instead allows a whole batch of ipc_namespace frees to wait one single RCU grace period, after which they can all get freed quickly. Without this patch, a test program that simply calls unshare(CLONE_NEWIPC) a million times in a loop eventually gets -ENOSPC as the total number of ipc_namespace structures exceeds the limit, due to slow freeing. With this patch, the test program runs successfully every time. Reported-by: Chris Mason <clm@fb.com> Signed-off-by: Rik van Riel <riel@surriel.com> --- include/linux/ipc_namespace.h | 2 +- ipc/namespace.c | 30 ++++++++---------------------- 2 files changed, 9 insertions(+), 23 deletions(-)