
RFC: call_rcu_outstanding (was Re: WARNING in __mmdrop)

Message ID 20190721081933-mutt-send-email-mst@kernel.org (mailing list archive)
State New, archived
Series RFC: call_rcu_outstanding (was Re: WARNING in __mmdrop)

Commit Message

Michael S. Tsirkin July 21, 2019, 12:28 p.m. UTC
Hi Paul, others,

So it seems that vhost needs to call kfree_rcu from an ioctl. My worry
is what happens if userspace starts cycling through lots of these
ioctls.  Given we actually use rcu as an optimization, we could just
disable the optimization temporarily - but the question would be how to
detect an excessive rate without working too hard :) .

I guess we could define as excessive any rate where a callback is
outstanding at the time when a new structure is allocated.  I have very
little understanding of rcu internals - so I wanted to check that the
following more or less implements this heuristic before I spend time
actually testing it.

Could others pls take a look and let me know?

Thanks!

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

Comments

Paul E. McKenney July 21, 2019, 1:17 p.m. UTC | #1
On Sun, Jul 21, 2019 at 08:28:05AM -0400, Michael S. Tsirkin wrote:
> Hi Paul, others,
> 
> So it seems that vhost needs to call kfree_rcu from an ioctl. My worry
> is what happens if userspace starts cycling through lots of these
> ioctls.  Given we actually use rcu as an optimization, we could just
> disable the optimization temporarily - but the question would be how to
> detect an excessive rate without working too hard :) .
> 
> I guess we could define as excessive any rate where a callback is
> outstanding at the time when a new structure is allocated.  I have very
> little understanding of rcu internals - so I wanted to check that the
> following more or less implements this heuristic before I spend time
> actually testing it.
> 
> Could others pls take a look and let me know?

These look good as a way of seeing if there are any outstanding callbacks,
but in the case of Tree RCU, call_rcu_outstanding() would almost never
return false on a busy system.

Here are some alternatives:

o	RCU uses some pieces of Rao Shoaib's kfree_rcu() patches.
	The idea is to make kfree_rcu() locally buffer requests into
	batches of (say) 1,000, but process smaller batches when RCU
	is idle, or when some smallish amount of time has passed with
	no more kfree_rcu() requests from that CPU.  RCU then takes in
	the batch using not call_rcu(), but rather queue_rcu_work().
	The resulting batch of kfree() calls would therefore execute in
	workqueue context rather than in softirq context, which should
	be much easier on the system.

	In theory, this would allow people to use kfree_rcu() without
	worrying quite so much about overload.  It would also not be
	that hard to implement.

o	Subsystems vulnerable to user-induced kfree_rcu() flooding use
	call_rcu() instead of kfree_rcu().  Keep a count of the number
	of things waiting for a grace period, and when this gets too
	large, disable the optimization.  It will then drain down, at
	which point the optimization can be re-enabled.

	But please note that callbacks are -not- guaranteed to run on
	the CPU that queued them.  So yes, you would need a per-CPU
	counter, but you would need to periodically sum it up to check
	against the global state.  Or keep track of the CPU that
	did the call_rcu() so that you can atomically decrement in
	the callback the same counter that was atomically incremented
	just before the call_rcu().  Or any number of other approaches.
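
	A minimal sketch of that second alternative, using a single
	global counter for brevity (a per-CPU counter summed periodically
	would scale better, as noted above; all names are hypothetical):

		struct my_obj {
			/* ... payload ... */
			struct rcu_head rcu;
		};

		static atomic_t my_pending = ATOMIC_INIT(0);
		#define MY_PENDING_LIMIT 1000	/* Tune by experiment. */

		static void my_free_cb(struct rcu_head *rhp)
		{
			atomic_dec(&my_pending);
			kfree(container_of(rhp, struct my_obj, rcu));
		}

		static void my_free(struct my_obj *p)
		{
			if (atomic_inc_return(&my_pending) > MY_PENDING_LIMIT) {
				/* Overloaded: disable the optimization. */
				atomic_dec(&my_pending);
				synchronize_rcu();	/* Wait for a grace period... */
				kfree(p);		/* ...then free synchronously. */
				return;
			}
			call_rcu(&p->rcu, my_free_cb);
		}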

Also, the overhead is important.  For example, as far as I know,
current RCU gracefully handles close(open(...)) in a tight userspace
loop.  But there might be trouble due to tight userspace loops around
lighter-weight operations.

So an important question is "Just how fast is your ioctl?"  If it takes
(say) 100 microseconds to execute, there should be absolutely no problem.
On the other hand, if it can execute in 50 nanoseconds, this very likely
does need serious attention.

Other thoughts?

							Thanx, Paul

> Thanks!
> 
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> 
> 
> diff --git a/kernel/rcu/tiny.c b/kernel/rcu/tiny.c
> index 477b4eb44af5..067909521d72 100644
> --- a/kernel/rcu/tiny.c
> +++ b/kernel/rcu/tiny.c
> @@ -125,6 +125,23 @@ void synchronize_rcu(void)
>  }
>  EXPORT_SYMBOL_GPL(synchronize_rcu);
> 
> +/*
> + * Helpful for rate-limiting kfree_rcu/call_rcu callbacks.
> + */
> +bool call_rcu_outstanding(void)
> +{
> +	unsigned long flags;
> +	bool outstanding;
> +
> +	local_irq_save(flags);
> +	/* Any callbacks still waiting for a grace period? */
> +	outstanding = rcu_ctrlblk.donetail != rcu_ctrlblk.curtail;
> +	local_irq_restore(flags);
> +
> +	return outstanding;
> +}
> +EXPORT_SYMBOL_GPL(call_rcu_outstanding);
> +
>  /*
>   * Post an RCU callback to be invoked after the end of an RCU grace
>   * period.  But since we have but one CPU, that would be after any
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index a14e5fbbea46..d4b9d61e637d 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -2482,6 +2482,24 @@ static void rcu_leak_callback(struct rcu_head *rhp)
>  {
>  }
> 
> +/*
> + * Helpful for rate-limiting kfree_rcu/call_rcu callbacks.
> + */
> +bool call_rcu_outstanding(void)
> +{
> +	unsigned long flags;
> +	struct rcu_data *rdp;
> +	bool outstanding;
> +
> +	local_irq_save(flags);
> +	rdp = this_cpu_ptr(&rcu_data);
> +	outstanding = !rcu_segcblist_empty(&rdp->cblist);
> +	local_irq_restore(flags);
> +
> +	return outstanding;
> +}
> +EXPORT_SYMBOL_GPL(call_rcu_outstanding);
> +
>  /*
>   * Helper function for call_rcu() and friends.  The cpu argument will
>   * normally be -1, indicating "currently running CPU".  It may specify
Michael S. Tsirkin July 21, 2019, 5:53 p.m. UTC | #2
On Sun, Jul 21, 2019 at 06:17:25AM -0700, Paul E. McKenney wrote:
> On Sun, Jul 21, 2019 at 08:28:05AM -0400, Michael S. Tsirkin wrote:
> > Hi Paul, others,
> > 
> > So it seems that vhost needs to call kfree_rcu from an ioctl. My worry
> > is what happens if userspace starts cycling through lots of these
> > ioctls.  Given we actually use rcu as an optimization, we could just
> > disable the optimization temporarily - but the question would be how to
> > detect an excessive rate without working too hard :) .
> > 
> > I guess we could define as excessive any rate where a callback is
> > outstanding at the time when a new structure is allocated.  I have very
> > little understanding of rcu internals - so I wanted to check that the
> > following more or less implements this heuristic before I spend time
> > actually testing it.
> > 
> > Could others pls take a look and let me know?
> 
> These look good as a way of seeing if there are any outstanding callbacks,
> but in the case of Tree RCU, call_rcu_outstanding() would almost never
> return false on a busy system.


Hmm, ok. Maybe I could rename this to e.g. call_rcu_busy
and change the tree one to do rcu_segcblist_n_lazy_cbs > 1000?

> 
> Here are some alternatives:
> 
> o	RCU uses some pieces of Rao Shoaib's kfree_rcu() patches.
> 	The idea is to make kfree_rcu() locally buffer requests into
> 	batches of (say) 1,000, but process smaller batches when RCU
> 	is idle, or when some smallish amount of time has passed with
> 	no more kfree_rcu() requests from that CPU.  RCU then takes in
> 	the batch using not call_rcu(), but rather queue_rcu_work().
> 	The resulting batch of kfree() calls would therefore execute in
> 	workqueue context rather than in softirq context, which should
> 	be much easier on the system.
> 
> 	In theory, this would allow people to use kfree_rcu() without
> 	worrying quite so much about overload.  It would also not be
> 	that hard to implement.
> 
> o	Subsystems vulnerable to user-induced kfree_rcu() flooding use
> 	call_rcu() instead of kfree_rcu().  Keep a count of the number
> 	of things waiting for a grace period, and when this gets too
> 	large, disable the optimization.  It will then drain down, at
> 	which point the optimization can be re-enabled.
> 
> 	But please note that callbacks are -not- guaranteed to run on
> 	the CPU that queued them.  So yes, you would need a per-CPU
> 	counter, but you would need to periodically sum it up to check
> 	against the global state.  Or keep track of the CPU that
> 	did the call_rcu() so that you can atomically decrement in
> 	the callback the same counter that was atomically incremented
> 	just before the call_rcu().  Or any number of other approaches.

I'm really looking for something we can do in this merge window
without adding too much code, and kfree_rcu is intended to fix a bug.
Adding call_rcu and careful accounting is something that I'm not
happy adding with the merge window already open.

> 
> Also, the overhead is important.  For example, as far as I know,
> current RCU gracefully handles close(open(...)) in a tight userspace
> loop.  But there might be trouble due to tight userspace loops around
> lighter-weight operations.
> 
> So an important question is "Just how fast is your ioctl?"  If it takes
> (say) 100 microseconds to execute, there should be absolutely no problem.
> On the other hand, if it can execute in 50 nanoseconds, this very likely
> does need serious attention.
> 
> Other thoughts?
> 
> 							Thanx, Paul

Hmm, the answer to this would be: I'm not sure.
It's setup-time stuff; we never tested it.

> > Thanks!
> > 
> > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> > 
> > 
> > diff --git a/kernel/rcu/tiny.c b/kernel/rcu/tiny.c
> > index 477b4eb44af5..067909521d72 100644
> > --- a/kernel/rcu/tiny.c
> > +++ b/kernel/rcu/tiny.c
> > @@ -125,6 +125,23 @@ void synchronize_rcu(void)
> >  }
> >  EXPORT_SYMBOL_GPL(synchronize_rcu);
> > 
> > +/*
> > + * Helpful for rate-limiting kfree_rcu/call_rcu callbacks.
> > + */
> > +bool call_rcu_outstanding(void)
> > +{
> > +	unsigned long flags;
> > +	bool outstanding;
> > +
> > +	local_irq_save(flags);
> > +	/* Any callbacks still waiting for a grace period? */
> > +	outstanding = rcu_ctrlblk.donetail != rcu_ctrlblk.curtail;
> > +	local_irq_restore(flags);
> > +
> > +	return outstanding;
> > +}
> > +EXPORT_SYMBOL_GPL(call_rcu_outstanding);
> > +
> >  /*
> >   * Post an RCU callback to be invoked after the end of an RCU grace
> >   * period.  But since we have but one CPU, that would be after any
> > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > index a14e5fbbea46..d4b9d61e637d 100644
> > --- a/kernel/rcu/tree.c
> > +++ b/kernel/rcu/tree.c
> > @@ -2482,6 +2482,24 @@ static void rcu_leak_callback(struct rcu_head *rhp)
> >  {
> >  }
> > 
> > +/*
> > + * Helpful for rate-limiting kfree_rcu/call_rcu callbacks.
> > + */
> > +bool call_rcu_outstanding(void)
> > +{
> > +	unsigned long flags;
> > +	struct rcu_data *rdp;
> > +	bool outstanding;
> > +
> > +	local_irq_save(flags);
> > +	rdp = this_cpu_ptr(&rcu_data);
> > +	outstanding = !rcu_segcblist_empty(&rdp->cblist);
> > +	local_irq_restore(flags);
> > +
> > +	return outstanding;
> > +}
> > +EXPORT_SYMBOL_GPL(call_rcu_outstanding);
> > +
> >  /*
> >   * Helper function for call_rcu() and friends.  The cpu argument will
> >   * normally be -1, indicating "currently running CPU".  It may specify
Paul E. McKenney July 21, 2019, 7:28 p.m. UTC | #3
On Sun, Jul 21, 2019 at 01:53:23PM -0400, Michael S. Tsirkin wrote:
> On Sun, Jul 21, 2019 at 06:17:25AM -0700, Paul E. McKenney wrote:
> > On Sun, Jul 21, 2019 at 08:28:05AM -0400, Michael S. Tsirkin wrote:
> > > Hi Paul, others,
> > > 
> > > So it seems that vhost needs to call kfree_rcu from an ioctl. My worry
> > > is what happens if userspace starts cycling through lots of these
> > > ioctls.  Given we actually use rcu as an optimization, we could just
> > > disable the optimization temporarily - but the question would be how to
> > > detect an excessive rate without working too hard :) .
> > > 
> > > I guess we could define as excessive any rate where a callback is
> > > outstanding at the time when a new structure is allocated.  I have very
> > > little understanding of rcu internals - so I wanted to check that the
> > > following more or less implements this heuristic before I spend time
> > > actually testing it.
> > > 
> > > Could others pls take a look and let me know?
> > 
> > These look good as a way of seeing if there are any outstanding callbacks,
> > but in the case of Tree RCU, call_rcu_outstanding() would almost never
> > return false on a busy system.
> 
> Hmm, ok. Maybe I could rename this to e.g. call_rcu_busy
> and change the tree one to do rcu_segcblist_n_lazy_cbs > 1000?

Or the function could simply return the number of callbacks queued
on the current CPU, and let the caller decide how many is too many.
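
For Tree RCU, such a counting variant of the posted helper might look
like the following sketch (call_rcu_queued() is a hypothetical name;
rcu_segcblist_n_cbs() reports the length of a segmented callback list):

	/* Return the number of callbacks queued on the current CPU. */
	long call_rcu_queued(void)
	{
		unsigned long flags;
		long n;

		local_irq_save(flags);
		n = rcu_segcblist_n_cbs(&this_cpu_ptr(&rcu_data)->cblist);
		local_irq_restore(flags);

		return n;
	}

The caller would then compare the returned count against whatever
cutoff its experiments suggest.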

> > Here are some alternatives:
> > 
> > o	RCU uses some pieces of Rao Shoaib's kfree_rcu() patches.
> > 	The idea is to make kfree_rcu() locally buffer requests into
> > 	batches of (say) 1,000, but process smaller batches when RCU
> > 	is idle, or when some smallish amount of time has passed with
> > 	no more kfree_rcu() requests from that CPU.  RCU then takes in
> > 	the batch using not call_rcu(), but rather queue_rcu_work().
> > 	The resulting batch of kfree() calls would therefore execute in
> > 	workqueue context rather than in softirq context, which should
> > 	be much easier on the system.
> > 
> > 	In theory, this would allow people to use kfree_rcu() without
> > 	worrying quite so much about overload.  It would also not be
> > 	that hard to implement.
> > 
> > o	Subsystems vulnerable to user-induced kfree_rcu() flooding use
> > 	call_rcu() instead of kfree_rcu().  Keep a count of the number
> > 	of things waiting for a grace period, and when this gets too
> > 	large, disable the optimization.  It will then drain down, at
> > 	which point the optimization can be re-enabled.
> > 
> > 	But please note that callbacks are -not- guaranteed to run on
> > 	the CPU that queued them.  So yes, you would need a per-CPU
> > 	counter, but you would need to periodically sum it up to check
> > 	against the global state.  Or keep track of the CPU that
> > 	did the call_rcu() so that you can atomically decrement in
> > 	the callback the same counter that was atomically incremented
> > 	just before the call_rcu().  Or any number of other approaches.
> 
> I'm really looking for something we can do in this merge window
> without adding too much code, and kfree_rcu is intended to fix a bug.
> Adding call_rcu and careful accounting is something that I'm not
> happy adding with the merge window already open.

OK, then I suggest having the interface return you the number of
callbacks.  That allows you to experiment with the cutoff.

Give or take the ioctl overhead...

> > Also, the overhead is important.  For example, as far as I know,
> > current RCU gracefully handles close(open(...)) in a tight userspace
> > loop.  But there might be trouble due to tight userspace loops around
> > lighter-weight operations.
> > 
> > So an important question is "Just how fast is your ioctl?"  If it takes
> > (say) 100 microseconds to execute, there should be absolutely no problem.
> > On the other hand, if it can execute in 50 nanoseconds, this very likely
> > does need serious attention.
> > 
> > Other thoughts?
> > 
> > 							Thanx, Paul
> 
> Hmm, the answer to this would be: I'm not sure.
> It's setup-time stuff; we never tested it.

Is it possible to measure it easily?

							Thanx, Paul

> > > Thanks!
> > > 
> > > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> > > 
> > > 
> > > diff --git a/kernel/rcu/tiny.c b/kernel/rcu/tiny.c
> > > index 477b4eb44af5..067909521d72 100644
> > > --- a/kernel/rcu/tiny.c
> > > +++ b/kernel/rcu/tiny.c
> > > @@ -125,6 +125,23 @@ void synchronize_rcu(void)
> > >  }
> > >  EXPORT_SYMBOL_GPL(synchronize_rcu);
> > > 
> > > +/*
> > > + * Helpful for rate-limiting kfree_rcu/call_rcu callbacks.
> > > + */
> > > +bool call_rcu_outstanding(void)
> > > +{
> > > +	unsigned long flags;
> > > +	bool outstanding;
> > > +
> > > +	local_irq_save(flags);
> > > +	/* Any callbacks still waiting for a grace period? */
> > > +	outstanding = rcu_ctrlblk.donetail != rcu_ctrlblk.curtail;
> > > +	local_irq_restore(flags);
> > > +
> > > +	return outstanding;
> > > +}
> > > +EXPORT_SYMBOL_GPL(call_rcu_outstanding);
> > > +
> > >  /*
> > >   * Post an RCU callback to be invoked after the end of an RCU grace
> > >   * period.  But since we have but one CPU, that would be after any
> > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > > index a14e5fbbea46..d4b9d61e637d 100644
> > > --- a/kernel/rcu/tree.c
> > > +++ b/kernel/rcu/tree.c
> > > @@ -2482,6 +2482,24 @@ static void rcu_leak_callback(struct rcu_head *rhp)
> > >  {
> > >  }
> > > 
> > > +/*
> > > + * Helpful for rate-limiting kfree_rcu/call_rcu callbacks.
> > > + */
> > > +bool call_rcu_outstanding(void)
> > > +{
> > > +	unsigned long flags;
> > > +	struct rcu_data *rdp;
> > > +	bool outstanding;
> > > +
> > > +	local_irq_save(flags);
> > > +	rdp = this_cpu_ptr(&rcu_data);
> > > +	outstanding = !rcu_segcblist_empty(&rdp->cblist);
> > > +	local_irq_restore(flags);
> > > +
> > > +	return outstanding;
> > > +}
> > > +EXPORT_SYMBOL_GPL(call_rcu_outstanding);
> > > +
> > >  /*
> > >   * Helper function for call_rcu() and friends.  The cpu argument will
> > >   * normally be -1, indicating "currently running CPU".  It may specify
>
Matthew Wilcox (Oracle) July 21, 2019, 9:08 p.m. UTC | #4
On Sun, Jul 21, 2019 at 06:17:25AM -0700, Paul E. McKenney wrote:
> Also, the overhead is important.  For example, as far as I know,
> current RCU gracefully handles close(open(...)) in a tight userspace
> loop.  But there might be trouble due to tight userspace loops around
> lighter-weight operations.

I thought you believed that RCU was antifragile, in that it would scale
better as it was used more heavily?

Would it make sense to have call_rcu() check to see if there are many
outstanding requests on this CPU and if so process them before returning?
That would ensure that frequent callers usually ended up doing their
own processing.
Paul E. McKenney July 21, 2019, 11:31 p.m. UTC | #5
On Sun, Jul 21, 2019 at 02:08:37PM -0700, Matthew Wilcox wrote:
> On Sun, Jul 21, 2019 at 06:17:25AM -0700, Paul E. McKenney wrote:
> > Also, the overhead is important.  For example, as far as I know,
> > current RCU gracefully handles close(open(...)) in a tight userspace
> > loop.  But there might be trouble due to tight userspace loops around
> > lighter-weight operations.
> 
> I thought you believed that RCU was antifragile, in that it would scale
> better as it was used more heavily?

You are referring to this?  https://paulmck.livejournal.com/47933.html

If so, the last few paragraphs might be worth re-reading.   ;-)

And in this case, the heuristics RCU uses to decide when to schedule
invocation of the callbacks need some help.  One component of that help
is a time-based limit to the number of consecutive callback invocations
(see my crude prototype and Eric Dumazet's more polished patch).  Another
component is an overload warning.
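
The time-based limit amounts to something like the following inside
RCU's callback-invocation loop, where rcl, bl, and count are the local
callback list, batch limit, and running invocation count (a rough
sketch of the idea, not the actual patch):

	/* Stop invoking callbacks once the time budget is exhausted. */
	tlimit = local_clock() + 3 * NSEC_PER_MSEC;
	while ((rhp = rcu_cblist_dequeue(&rcl))) {
		rhp->func(rhp);
		if (++count >= bl && local_clock() >= tlimit)
			break;	/* Leave the rest for a later pass. */
	}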

Why would an overload warning be needed if RCU's callback-invocation
scheduling heuristics were upgraded?  Because someone could boot a
100-CPU system with rcu_nocbs=0-99, bind all of the resulting
rcuo kthreads to (say) CPU 0, and then run a callback-heavy workload
on all of the CPUs.  Given the constraints, CPU 0 cannot keep up.

So warnings are required as well.

> Would it make sense to have call_rcu() check to see if there are many
> outstanding requests on this CPU and if so process them before returning?
> That would ensure that frequent callers usually ended up doing their
> own processing.

Unfortunately, no.  Here is a code fragment illustrating why:

	void my_cb(struct rcu_head *rhp)
	{
		unsigned long flags;

		spin_lock_irqsave(&my_lock, flags);
		handle_cb(rhp);
		spin_unlock_irqrestore(&my_lock, flags);
	}

	. . .

	spin_lock_irqsave(&my_lock, flags);
	p = look_something_up();
	remove_that_something(p);
	call_rcu(p, my_cb);
	spin_unlock_irqrestore(&my_lock, flags);

Invoking the extra callbacks directly from call_rcu() would thus result
in self-deadlock.  Documentation/RCU/UP.txt contains a few more examples
along these lines.
Michael S. Tsirkin July 22, 2019, 7:52 a.m. UTC | #6
On Sun, Jul 21, 2019 at 04:31:13PM -0700, Paul E. McKenney wrote:
> On Sun, Jul 21, 2019 at 02:08:37PM -0700, Matthew Wilcox wrote:
> > On Sun, Jul 21, 2019 at 06:17:25AM -0700, Paul E. McKenney wrote:
> > > Also, the overhead is important.  For example, as far as I know,
> > > current RCU gracefully handles close(open(...)) in a tight userspace
> > > loop.  But there might be trouble due to tight userspace loops around
> > > lighter-weight operations.
> > 
> > I thought you believed that RCU was antifragile, in that it would scale
> > better as it was used more heavily?
> 
> You are referring to this?  https://paulmck.livejournal.com/47933.html
> 
> If so, the last few paragraphs might be worth re-reading.   ;-)
> 
> And in this case, the heuristics RCU uses to decide when to schedule
> invocation of the callbacks need some help.  One component of that help
> is a time-based limit to the number of consecutive callback invocations
> (see my crude prototype and Eric Dumazet's more polished patch).  Another
> component is an overload warning.
> 
> Why would an overload warning be needed if RCU's callback-invocation
> scheduling heuristics were upgraded?  Because someone could boot a
> 100-CPU system with rcu_nocbs=0-99, bind all of the resulting
> rcuo kthreads to (say) CPU 0, and then run a callback-heavy workload
> on all of the CPUs.  Given the constraints, CPU 0 cannot keep up.
> 
> So warnings are required as well.
> 
> > Would it make sense to have call_rcu() check to see if there are many
> > outstanding requests on this CPU and if so process them before returning?
> > That would ensure that frequent callers usually ended up doing their
> > own processing.
> 
> Unfortunately, no.  Here is a code fragment illustrating why:
> 
> 	void my_cb(struct rcu_head *rhp)
> 	{
> 		unsigned long flags;
> 
> 		spin_lock_irqsave(&my_lock, flags);
> 		handle_cb(rhp);
> 		spin_unlock_irqrestore(&my_lock, flags);
> 	}
> 
> 	. . .
> 
> 	spin_lock_irqsave(&my_lock, flags);
> 	p = look_something_up();
> 	remove_that_something(p);
> 	call_rcu(p, my_cb);
> 	spin_unlock_irqrestore(&my_lock, flags);
> 
> Invoking the extra callbacks directly from call_rcu() would thus result
> in self-deadlock.  Documentation/RCU/UP.txt contains a few more examples
> along these lines.

We could add an option that simply fails if overloaded, right?
Have the caller recover...
Michael S. Tsirkin July 22, 2019, 7:56 a.m. UTC | #7
On Sun, Jul 21, 2019 at 12:28:41PM -0700, Paul E. McKenney wrote:
> On Sun, Jul 21, 2019 at 01:53:23PM -0400, Michael S. Tsirkin wrote:
> > On Sun, Jul 21, 2019 at 06:17:25AM -0700, Paul E. McKenney wrote:
> > > On Sun, Jul 21, 2019 at 08:28:05AM -0400, Michael S. Tsirkin wrote:
> > > > Hi Paul, others,
> > > > 
> > > > So it seems that vhost needs to call kfree_rcu from an ioctl. My worry
> > > > is what happens if userspace starts cycling through lots of these
> > > > ioctls.  Given we actually use rcu as an optimization, we could just
> > > > disable the optimization temporarily - but the question would be how to
> > > > detect an excessive rate without working too hard :) .
> > > > 
> > > > I guess we could define as excessive any rate where a callback is
> > > > outstanding at the time when a new structure is allocated.  I have very
> > > > little understanding of rcu internals - so I wanted to check that the
> > > > following more or less implements this heuristic before I spend time
> > > > actually testing it.
> > > > 
> > > > Could others pls take a look and let me know?
> > > 
> > > These look good as a way of seeing if there are any outstanding callbacks,
> > > but in the case of Tree RCU, call_rcu_outstanding() would almost never
> > > return false on a busy system.
> > 
> > Hmm, ok. Maybe I could rename this to e.g. call_rcu_busy
> > and change the tree one to do rcu_segcblist_n_lazy_cbs > 1000?
> 
> Or the function could simply return the number of callbacks queued
> on the current CPU, and let the caller decide how many is too many.
> 
> > > Here are some alternatives:
> > > 
> > > o	RCU uses some pieces of Rao Shoaib's kfree_rcu() patches.
> > > 	The idea is to make kfree_rcu() locally buffer requests into
> > > 	batches of (say) 1,000, but process smaller batches when RCU
> > > 	is idle, or when some smallish amount of time has passed with
> > > 	no more kfree_rcu() requests from that CPU.  RCU then takes in
> > > 	the batch using not call_rcu(), but rather queue_rcu_work().
> > > 	The resulting batch of kfree() calls would therefore execute in
> > > 	workqueue context rather than in softirq context, which should
> > > 	be much easier on the system.
> > > 
> > > 	In theory, this would allow people to use kfree_rcu() without
> > > 	worrying quite so much about overload.  It would also not be
> > > 	that hard to implement.
> > > 
> > > o	Subsystems vulnerable to user-induced kfree_rcu() flooding use
> > > 	call_rcu() instead of kfree_rcu().  Keep a count of the number
> > > 	of things waiting for a grace period, and when this gets too
> > > 	large, disable the optimization.  It will then drain down, at
> > > 	which point the optimization can be re-enabled.
> > > 
> > > 	But please note that callbacks are -not- guaranteed to run on
> > > 	the CPU that queued them.  So yes, you would need a per-CPU
> > > 	counter, but you would need to periodically sum it up to check
> > > 	against the global state.  Or keep track of the CPU that
> > > 	did the call_rcu() so that you can atomically decrement in
> > > 	the callback the same counter that was atomically incremented
> > > 	just before the call_rcu().  Or any number of other approaches.
> > 
> > I'm really looking for something we can do in this merge window
> > without adding too much code, and kfree_rcu is intended to fix a bug.
> > Adding call_rcu and careful accounting is something that I'm not
> > happy adding with the merge window already open.
> 
> OK, then I suggest having the interface return you the number of
> callbacks.  That allows you to experiment with the cutoff.
> 
> Give or take the ioctl overhead...

OK - and for tiny just assume 1 is too much?


> > > Also, the overhead is important.  For example, as far as I know,
> > > current RCU gracefully handles close(open(...)) in a tight userspace
> > > loop.  But there might be trouble due to tight userspace loops around
> > > lighter-weight operations.
> > > 
> > > So an important question is "Just how fast is your ioctl?"  If it takes
> > > (say) 100 microseconds to execute, there should be absolutely no problem.
> > > On the other hand, if it can execute in 50 nanoseconds, this very likely
> > > does need serious attention.
> > > 
> > > Other thoughts?
> > > 
> > > 							Thanx, Paul
> > 
> > Hmm, the answer to this would be: I'm not sure.
> > It's setup-time stuff; we never tested it.
> 
> Is it possible to measure it easily?
> 
> 							Thanx, Paul
> 
> > > > Thanks!
> > > > 
> > > > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> > > > 
> > > > 
> > > > diff --git a/kernel/rcu/tiny.c b/kernel/rcu/tiny.c
> > > > index 477b4eb44af5..067909521d72 100644
> > > > --- a/kernel/rcu/tiny.c
> > > > +++ b/kernel/rcu/tiny.c
> > > > @@ -125,6 +125,23 @@ void synchronize_rcu(void)
> > > >  }
> > > >  EXPORT_SYMBOL_GPL(synchronize_rcu);
> > > > 
> > > > +/*
> > > > + * Helpful for rate-limiting kfree_rcu/call_rcu callbacks.
> > > > + */
> > > > +bool call_rcu_outstanding(void)
> > > > +{
> > > > +	unsigned long flags;
> > > > +	bool outstanding;
> > > > +
> > > > +	local_irq_save(flags);
> > > > +	/* Any callbacks still waiting for a grace period? */
> > > > +	outstanding = rcu_ctrlblk.donetail != rcu_ctrlblk.curtail;
> > > > +	local_irq_restore(flags);
> > > > +
> > > > +	return outstanding;
> > > > +}
> > > > +EXPORT_SYMBOL_GPL(call_rcu_outstanding);
> > > > +
> > > >  /*
> > > >   * Post an RCU callback to be invoked after the end of an RCU grace
> > > >   * period.  But since we have but one CPU, that would be after any
> > > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > > > index a14e5fbbea46..d4b9d61e637d 100644
> > > > --- a/kernel/rcu/tree.c
> > > > +++ b/kernel/rcu/tree.c
> > > > @@ -2482,6 +2482,24 @@ static void rcu_leak_callback(struct rcu_head *rhp)
> > > >  {
> > > >  }
> > > > 
> > > > +/*
> > > > + * Helpful for rate-limiting kfree_rcu/call_rcu callbacks.
> > > > + */
> > > > +bool call_rcu_outstanding(void)
> > > > +{
> > > > +	unsigned long flags;
> > > > +	struct rcu_data *rdp;
> > > > +	bool outstanding;
> > > > +
> > > > +	local_irq_save(flags);
> > > > +	rdp = this_cpu_ptr(&rcu_data);
> > > > +	outstanding = !rcu_segcblist_empty(&rdp->cblist);
> > > > +	local_irq_restore(flags);
> > > > +
> > > > +	return outstanding;
> > > > +}
> > > > +EXPORT_SYMBOL_GPL(call_rcu_outstanding);
> > > > +
> > > >  /*
> > > >   * Helper function for call_rcu() and friends.  The cpu argument will
> > > >   * normally be -1, indicating "currently running CPU".  It may specify
> >
Paul E. McKenney July 22, 2019, 11:51 a.m. UTC | #8
On Mon, Jul 22, 2019 at 03:52:05AM -0400, Michael S. Tsirkin wrote:
> On Sun, Jul 21, 2019 at 04:31:13PM -0700, Paul E. McKenney wrote:
> > On Sun, Jul 21, 2019 at 02:08:37PM -0700, Matthew Wilcox wrote:
> > > On Sun, Jul 21, 2019 at 06:17:25AM -0700, Paul E. McKenney wrote:
> > > > Also, the overhead is important.  For example, as far as I know,
> > > > current RCU gracefully handles close(open(...)) in a tight userspace
> > > > loop.  But there might be trouble due to tight userspace loops around
> > > > lighter-weight operations.
> > > 
> > > I thought you believed that RCU was antifragile, in that it would scale
> > > better as it was used more heavily?
> > 
> > You are referring to this?  https://paulmck.livejournal.com/47933.html
> > 
> > If so, the last few paragraphs might be worth re-reading.   ;-)
> > 
> > And in this case, the heuristics RCU uses to decide when to schedule
> > invocation of the callbacks need some help.  One component of that help
> > is a time-based limit to the number of consecutive callback invocations
> > (see my crude prototype and Eric Dumazet's more polished patch).  Another
> > component is an overload warning.
> > 
> > Why would an overload warning be needed if RCU's callback-invocation
> > scheduling heuristics were upgraded?  Because someone could boot a
> > 100-CPU system with rcu_nocbs=0-99, bind all of the resulting
> > rcuo kthreads to (say) CPU 0, and then run a callback-heavy workload
> > on all of the CPUs.  Given the constraints, CPU 0 cannot keep up.
> > 
> > So warnings are required as well.
> > 
> > > Would it make sense to have call_rcu() check to see if there are many
> > > outstanding requests on this CPU and if so process them before returning?
> > > That would ensure that frequent callers usually ended up doing their
> > > own processing.
> > 
> > Unfortunately, no.  Here is a code fragment illustrating why:
> > 
> > 	void my_cb(struct rcu_head *rhp)
> > 	{
> > 		unsigned long flags;
> > 
> > 		spin_lock_irqsave(&my_lock, flags);
> > 		handle_cb(rhp);
> > 		spin_unlock_irqrestore(&my_lock, flags);
> > 	}
> > 
> > 	. . .
> > 
> > 	spin_lock_irqsave(&my_lock, flags);
> > 	p = look_something_up();
> > 	remove_that_something(p);
> > 	call_rcu(p, my_cb);
> > 	spin_unlock_irqrestore(&my_lock, flags);
> > 
> > Invoking the extra callbacks directly from call_rcu() would thus result
> > in self-deadlock.  Documentation/RCU/UP.txt contains a few more examples
> > along these lines.
> 
> We could add an option that simply fails if overloaded, right?
> Have the caller recover...

For example, return EBUSY from your ioctl?  That should work.  You could
also sleep for a jiffy or two to let things catch up in this BUSY (or
similar) case.  Or try three times, waiting a jiffy between each try,
and return EBUSY if all three tries failed.

Or just keep it simple and return EBUSY on the first try.  ;-)
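
In the ioctl, the try-three-times variant could look like this sketch,
with call_rcu_busy() standing in for whatever overload predicate gets
settled on (both names are hypothetical at this point):

	/* Try a few times, then push recovery back to the caller. */
	int i;

	for (i = 0; i < 3; i++) {
		if (!call_rcu_busy()) {
			kfree_rcu(p, rcu);
			return 0;
		}
		schedule_timeout_uninterruptible(1);	/* Wait a jiffy. */
	}
	return -EBUSY;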

All of this assumes that this ioctl is the cause of the overload, which
during early boot seems to me to be a safe assumption.

							Thanx, Paul
Paul E. McKenney July 22, 2019, 11:57 a.m. UTC | #9
On Mon, Jul 22, 2019 at 03:56:22AM -0400, Michael S. Tsirkin wrote:
> On Sun, Jul 21, 2019 at 12:28:41PM -0700, Paul E. McKenney wrote:
> > On Sun, Jul 21, 2019 at 01:53:23PM -0400, Michael S. Tsirkin wrote:
> > > On Sun, Jul 21, 2019 at 06:17:25AM -0700, Paul E. McKenney wrote:
> > > > On Sun, Jul 21, 2019 at 08:28:05AM -0400, Michael S. Tsirkin wrote:
> > > > > Hi Paul, others,
> > > > > 
> > > > > So it seems that vhost needs to call kfree_rcu from an ioctl. My worry
> > > > > is what happens if userspace starts cycling through lots of these
> > > > > ioctls.  Given we actually use rcu as an optimization, we could just
> > > > > disable the optimization temporarily - but the question would be how to
> > > > > detect an excessive rate without working too hard :) .
> > > > > 
> > > > > I guess we could define as excessive any rate where a callback is
> > > > > outstanding at the time when a new structure is allocated.  I have very
> > > > > little understanding of rcu internals - so I wanted to check that the
> > > > > following more or less implements this heuristic before I spend time
> > > > > actually testing it.
> > > > > 
> > > > > Could others pls take a look and let me know?
> > > > 
> > > > These look good as a way of seeing if there are any outstanding callbacks,
> > > > but in the case of Tree RCU, call_rcu_outstanding() would almost never
> > > > return false on a busy system.
> > > 
> > > Hmm, ok. Maybe I could rename this to e.g. call_rcu_busy
> > > and change the tree one to do rcu_segcblist_n_lazy_cbs > 1000?
> > 
> > Or the function could simply return the number of callbacks queued
> > on the current CPU, and let the caller decide how many is too many.
> > 
> > > > Here are some alternatives:
> > > > 
> > > > o	RCU uses some pieces of Rao Shoaib's kfree_rcu() patches.
> > > > 	The idea is to make kfree_rcu() locally buffer requests into
> > > > 	batches of (say) 1,000, but process smaller batches when RCU
> > > > 	is idle, or when some smallish amount of time has passed with
> > > > 	no more kfree_rcu() requests from that CPU.  RCU then takes in
> > > > 	the batch using not call_rcu(), but rather queue_rcu_work().
> > > > 	The resulting batch of kfree() calls would therefore execute in
> > > > 	workqueue context rather than in softirq context, which should
> > > > 	be much easier on the system.
> > > > 
> > > > 	In theory, this would allow people to use kfree_rcu() without
> > > > 	worrying quite so much about overload.  It would also not be
> > > > 	that hard to implement.
> > > > 
> > > > o	Subsystems vulnerable to user-induced kfree_rcu() flooding use
> > > > 	call_rcu() instead of kfree_rcu().  Keep a count of the number
> > > > 	of things waiting for a grace period, and when this gets too
> > > > 	large, disable the optimization.  It will then drain down, at
> > > > 	which point the optimization can be re-enabled.
> > > > 
> > > > 	But please note that callbacks are -not- guaranteed to run on
> > > > 	the CPU that queued them.  So yes, you would need a per-CPU
> > > > 	counter, but you would need to periodically sum it up to check
> > > > 	against the global state.  Or keep track of the CPU that
> > > > 	did the call_rcu() so that you can atomically decrement in
> > > > 	the callback the same counter that was atomically incremented
> > > > 	just before the call_rcu().  Or any number of other approaches.
> > > 
> > > I'm really looking for something we can do in this merge window
> > > without adding too much code, and kfree_rcu is intended to fix a bug.
> > > Adding call_rcu and careful accounting is something that I'm not
> > > happy adding with the merge window already open.
> > 
> > OK, then I suggest having the interface return you the number of
> > callbacks.  That allows you to experiment with the cutoff.
> > 
> > Give or take the ioctl overhead...
> 
> OK - and for tiny just assume 1 is too much?

I bet that for tiny you won't need to rate-limit at all.  The reason
is that grace periods are quite short.

In fact, for TINY (that is, !SMP && !PREEMPT), synchronize_rcu() is a
no-op.  So in TINY, given that your ioctl is executing at process level,
you could just invoke synchronize_rcu() and then kfree():

#ifdef CONFIG_TINY_RCU
	synchronize_rcu();  /* No other CPUs, so a QS is a GP! */
	kfree(whatever);
	return; /* Or whatever control flow is appropriate. */
#endif
	/* More complicated stuff for !TINY. */

							Thanx, Paul

> > > > Also, the overhead is important.  For example, as far as I know,
> > > > current RCU gracefully handles close(open(...)) in a tight userspace
> > > > loop.  But there might be trouble due to tight userspace loops around
> > > > lighter-weight operations.
> > > > 
> > > > So an important question is "Just how fast is your ioctl?"  If it takes
> > > > (say) 100 microseconds to execute, there should be absolutely no problem.
> > > > On the other hand, if it can execute in 50 nanoseconds, this very likely
> > > > does need serious attention.
> > > > 
> > > > Other thoughts?
> > > > 
> > > > 							Thanx, Paul
> > > 
> > > Hmm, the answer to this would be: I'm not sure.
> > > It's setup-time stuff; we never tested it.
> > 
> > Is it possible to measure it easily?
> > 
> > 							Thanx, Paul
> > 
> > > > > Thanks!
> > > > > 
> > > > > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> > > > > 
> > > > > 
> > > > > diff --git a/kernel/rcu/tiny.c b/kernel/rcu/tiny.c
> > > > > index 477b4eb44af5..067909521d72 100644
> > > > > --- a/kernel/rcu/tiny.c
> > > > > +++ b/kernel/rcu/tiny.c
> > > > > @@ -125,6 +125,23 @@ void synchronize_rcu(void)
> > > > >  }
> > > > >  EXPORT_SYMBOL_GPL(synchronize_rcu);
> > > > > 
> > > > > +/*
> > > > > + * Helpful for rate-limiting kfree_rcu/call_rcu callbacks.
> > > > > + */
> > > > > +bool call_rcu_outstanding(void)
> > > > > +{
> > > > > +	unsigned long flags;
> > > > > +	bool outstanding;
> > > > > +
> > > > > +	local_irq_save(flags);
> > > > > +	/* Any callbacks still waiting for a grace period? */
> > > > > +	outstanding = rcu_ctrlblk.donetail != rcu_ctrlblk.curtail;
> > > > > +	local_irq_restore(flags);
> > > > > +
> > > > > +	return outstanding;
> > > > > +}
> > > > > +EXPORT_SYMBOL_GPL(call_rcu_outstanding);
> > > > > +
> > > > >  /*
> > > > >   * Post an RCU callback to be invoked after the end of an RCU grace
> > > > >   * period.  But since we have but one CPU, that would be after any
> > > > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > > > > index a14e5fbbea46..d4b9d61e637d 100644
> > > > > --- a/kernel/rcu/tree.c
> > > > > +++ b/kernel/rcu/tree.c
> > > > > @@ -2482,6 +2482,24 @@ static void rcu_leak_callback(struct rcu_head *rhp)
> > > > >  {
> > > > >  }
> > > > > 
> > > > > +/*
> > > > > + * Helpful for rate-limiting kfree_rcu/call_rcu callbacks.
> > > > > + */
> > > > > +bool call_rcu_outstanding(void)
> > > > > +{
> > > > > +	unsigned long flags;
> > > > > +	struct rcu_data *rdp;
> > > > > +	bool outstanding;
> > > > > +
> > > > > +	local_irq_save(flags);
> > > > > +	rdp = this_cpu_ptr(&rcu_data);
> > > > > +	outstanding = !rcu_segcblist_empty(&rdp->cblist);
> > > > > +	local_irq_restore(flags);
> > > > > +
> > > > > +	return outstanding;
> > > > > +}
> > > > > +EXPORT_SYMBOL_GPL(call_rcu_outstanding);
> > > > > +
> > > > >  /*
> > > > >   * Helper function for call_rcu() and friends.  The cpu argument will
> > > > >   * normally be -1, indicating "currently running CPU".  It may specify
> > > 
>
Jason Gunthorpe July 22, 2019, 1:41 p.m. UTC | #10
On Mon, Jul 22, 2019 at 04:51:49AM -0700, Paul E. McKenney wrote:

> > > > Would it make sense to have call_rcu() check to see if there are many
> > > > outstanding requests on this CPU and if so process them before returning?
> > > > That would ensure that frequent callers usually ended up doing their
> > > > own processing.
> > > 
> > > Unfortunately, no.  Here is a code fragment illustrating why:

That is only true in the general case, though; kfree_rcu() doesn't have
this problem, since we know what the callback is doing.  In general a
caller of kfree_rcu() should not need to hold any locks while calling
it.

We could apply the same idea more generally and have some
'call_immediate_or_rcu()' which has restrictions on the caller's
context.

I think if we have some kind of problem here it would be better to
handle it inside the core code and only require that callers use the
correct RCU API.

I can think of many places where kfree_rcu() is being used under user
control..

Jason
Joel Fernandes July 22, 2019, 3:14 p.m. UTC | #11
[snip]
> > Would it make sense to have call_rcu() check to see if there are many
> > outstanding requests on this CPU and if so process them before returning?
> > That would ensure that frequent callers usually ended up doing their
> > own processing.

Other than what Paul already mentioned about deadlocks, I am not sure if this
would even work for all cases since call_rcu() has to wait for a grace
period.

So, if the number of outstanding requests is higher than a certain amount,
then you *still* have to wait for some RCU configurations for the grace
period duration and cannot just execute the callback in-line. Did I miss
something?

Can waiting in-line for a grace period duration be tolerated in the vhost case?

thanks,

 - Joel
Michael S. Tsirkin July 22, 2019, 3:47 p.m. UTC | #12
On Mon, Jul 22, 2019 at 11:14:39AM -0400, Joel Fernandes wrote:
> [snip]
> > > Would it make sense to have call_rcu() check to see if there are many
> > > outstanding requests on this CPU and if so process them before returning?
> > > That would ensure that frequent callers usually ended up doing their
> > > own processing.
> 
> Other than what Paul already mentioned about deadlocks, I am not sure if this
> would even work for all cases since call_rcu() has to wait for a grace
> period.
> 
> So, if the number of outstanding requests is higher than a certain amount,
> then you *still* have to wait for some RCU configurations for the grace
> period duration and cannot just execute the callback in-line. Did I miss
> something?
> 
> Can waiting in-line for a grace period duration be tolerated in the vhost case?
> 
> thanks,
> 
>  - Joel

No, but it has many other ways to recover (try again later, drop a
packet, use a slower copy to/from user).
Paul E. McKenney July 22, 2019, 3:52 p.m. UTC | #13
On Mon, Jul 22, 2019 at 10:41:52AM -0300, Jason Gunthorpe wrote:
> On Mon, Jul 22, 2019 at 04:51:49AM -0700, Paul E. McKenney wrote:
> 
> > > > > Would it make sense to have call_rcu() check to see if there are many
> > > > > outstanding requests on this CPU and if so process them before returning?
> > > > > That would ensure that frequent callers usually ended up doing their
> > > > > own processing.
> > > > 
> > > > Unfortunately, no.  Here is a code fragment illustrating why:
> 
> That is only true in the general case, though; kfree_rcu() doesn't have
> this problem, since we know what the callback is doing.  In general a
> caller of kfree_rcu() should not need to hold any locks while calling
> it.

Good point, at least as long as the slab allocators don't call kfree_rcu()
while holding any of the slab locks.

However, that would require a separate list for the kfree_rcu() callbacks,
and concurrent access to those lists of kfree_rcu() callbacks.  So this
might work, but would add some complexity and also yet another restriction
between RCU and another kernel subsystem.  So I would like to try the
other approaches first, for example, the time-based approach in my
prototype and Eric Dumazet's more polished patch.

But the immediate-invocation possibility is still there if needed.

> We could apply the same idea more generally and have some
> 'call_immediate_or_rcu()' which has restrictions on the caller's
> context.
> 
> I think if we have some kind of problem here it would be better to
> handle it inside the core code and only require that callers use the
> correct RCU API.

Agreed.  Especially given that there are a number of things that can
be done within RCU.

> I can think of many places where kfree_rcu() is being used under user
> control..

And same for call_rcu().

And this is not the first time we have run into this.  The last time
was about 15 years ago, if I remember correctly, and that one led to
some of the quiescent-state forcing and callback-invocation batch size
tricks still in use today.  My only real surprise is that it took so
long for this to come up again.  ;-)

Please note also that in the common case on default configurations,
callback invocation is done on the CPU that posted the callback.
This means that callback invocation normally applies backpressure
to the callback-happy workload.

So why then is there a problem?

The problem is not the lack of backpressure, but rather that the
scheduling of callback invocation needs to be a bit more considerate
of the needs of the rest of the system.  In the common case, that is.
Except that the uncommon case is real-time configurations, in which care
is needed anyway.  But I am in the midst of helping those out as well,
details on the "dev" branch of -rcu.

							Thanx, Paul
Paul E. McKenney July 22, 2019, 3:55 p.m. UTC | #14
On Mon, Jul 22, 2019 at 11:47:24AM -0400, Michael S. Tsirkin wrote:
> On Mon, Jul 22, 2019 at 11:14:39AM -0400, Joel Fernandes wrote:
> > [snip]
> > > > Would it make sense to have call_rcu() check to see if there are many
> > > > outstanding requests on this CPU and if so process them before returning?
> > > > That would ensure that frequent callers usually ended up doing their
> > > > own processing.
> > 
> > Other than what Paul already mentioned about deadlocks, I am not sure if this
> > would even work for all cases since call_rcu() has to wait for a grace
> > period.
> > 
> > So, if the number of outstanding requests is higher than a certain amount,
> > then you *still* have to wait for some RCU configurations for the grace
> > period duration and cannot just execute the callback in-line. Did I miss
> > something?
> > 
> > Can waiting in-line for a grace period duration be tolerated in the vhost case?
> > 
> > thanks,
> > 
> >  - Joel
> 
> No, but it has many other ways to recover (try again later, drop a
> packet, use a slower copy to/from user).

True enough!  And your idea of taking recovery action based on the number
of callbacks seems like a good one while we are getting RCU's callback
scheduling improved.

By the way, was this a real problem that you could make happen on real
hardware?  If not, I would suggest just letting RCU get improved over
the next couple of releases.

If it is something that you actually made happen, please let me know
what (if anything) you need from me for your callback-counting EBUSY
scheme.

							Thanx, Paul
Jason Gunthorpe July 22, 2019, 4:04 p.m. UTC | #15
On Mon, Jul 22, 2019 at 08:52:35AM -0700, Paul E. McKenney wrote:
> So why then is there a problem?

I'm not sure there is a real problem; I thought Michael was just
asking how to design with RCU in the case where the user controls the
kfree_rcu??

Sounds like the answer is "don't worry about it" ?

Thanks,
Jason
Michael S. Tsirkin July 22, 2019, 4:13 p.m. UTC | #16
On Mon, Jul 22, 2019 at 08:55:34AM -0700, Paul E. McKenney wrote:
> On Mon, Jul 22, 2019 at 11:47:24AM -0400, Michael S. Tsirkin wrote:
> > On Mon, Jul 22, 2019 at 11:14:39AM -0400, Joel Fernandes wrote:
> > > [snip]
> > > > > Would it make sense to have call_rcu() check to see if there are many
> > > > > outstanding requests on this CPU and if so process them before returning?
> > > > > That would ensure that frequent callers usually ended up doing their
> > > > > own processing.
> > > 
> > > Other than what Paul already mentioned about deadlocks, I am not sure if this
> > > would even work for all cases since call_rcu() has to wait for a grace
> > > period.
> > > 
> > > So, if the number of outstanding requests is higher than a certain amount,
> > > then you *still* have to wait for some RCU configurations for the grace
> > > period duration and cannot just execute the callback in-line. Did I miss
> > > something?
> > > 
> > > Can waiting in-line for a grace period duration be tolerated in the vhost case?
> > > 
> > > thanks,
> > > 
> > >  - Joel
> > 
> > No, but it has many other ways to recover (try again later, drop a
> > packet, use a slower copy to/from user).
> 
> True enough!  And your idea of taking recovery action based on the number
> of callbacks seems like a good one while we are getting RCU's callback
> scheduling improved.
> 
> By the way, was this a real problem that you could make happen on real
> hardware?


>  If not, I would suggest just letting RCU get improved over
> the next couple of releases.


So basically use kfree_rcu but add a comment saying e.g. "WARNING:
in the future callers of kfree_rcu might need to check that
not too many callbacks get queued. In that case, we can
disable the optimization, or recover in some other way.
Watch this space."


> If it is something that you actually made happen, please let me know
> what (if anything) you need from me for your callback-counting EBUSY
> scheme.
> 
> 							Thanx, Paul

If you mean kfree_rcu causing OOM then no, it's all theoretical.
If you mean synchronize_rcu stalling to the point where the guest will oops,
then yes, that's not too hard to trigger.
Michael S. Tsirkin July 22, 2019, 4:15 p.m. UTC | #17
On Mon, Jul 22, 2019 at 01:04:48PM -0300, Jason Gunthorpe wrote:
> On Mon, Jul 22, 2019 at 08:52:35AM -0700, Paul E. McKenney wrote:
> > So why then is there a problem?
> 
> I'm not sure there is a real problem; I thought Michael was just
> asking how to design with RCU in the case where the user controls the
> kfree_rcu??


Right, it's all based on documentation saying we should worry :)

> Sounds like the answer is "don't worry about it" ?
> 
> Thanks,
> Jason
Paul E. McKenney July 22, 2019, 4:15 p.m. UTC | #18
On Mon, Jul 22, 2019 at 01:04:48PM -0300, Jason Gunthorpe wrote:
> On Mon, Jul 22, 2019 at 08:52:35AM -0700, Paul E. McKenney wrote:
> > So why then is there a problem?
> 
> I'm not sure there is a real problem; I thought Michael was just
> asking how to design with RCU in the case where the user controls the
> kfree_rcu??
> 
> Sounds like the answer is "don't worry about it" ?

Unless you can force failures, you should be good.

And either way, improvements to RCU's handling of this sort of situation
are in the works.  And rcutorture has gained tests of this stuff in the
last year or so as well, see its "fwd_progress" module parameter and
the related code.

							Thanx, Paul
Paul E. McKenney July 22, 2019, 4:25 p.m. UTC | #19
On Mon, Jul 22, 2019 at 12:13:40PM -0400, Michael S. Tsirkin wrote:
> On Mon, Jul 22, 2019 at 08:55:34AM -0700, Paul E. McKenney wrote:
> > On Mon, Jul 22, 2019 at 11:47:24AM -0400, Michael S. Tsirkin wrote:
> > > On Mon, Jul 22, 2019 at 11:14:39AM -0400, Joel Fernandes wrote:
> > > > [snip]
> > > > > > Would it make sense to have call_rcu() check to see if there are many
> > > > > > outstanding requests on this CPU and if so process them before returning?
> > > > > > That would ensure that frequent callers usually ended up doing their
> > > > > > own processing.
> > > > 
> > > > Other than what Paul already mentioned about deadlocks, I am not sure if this
> > > > would even work for all cases since call_rcu() has to wait for a grace
> > > > period.
> > > > 
> > > > So, if the number of outstanding requests is higher than a certain amount,
> > > > then you *still* have to wait for some RCU configurations for the grace
> > > > period duration and cannot just execute the callback in-line. Did I miss
> > > > something?
> > > > 
> > > > Can waiting in-line for a grace period duration be tolerated in the vhost case?
> > > > 
> > > > thanks,
> > > > 
> > > >  - Joel
> > > 
> > > No, but it has many other ways to recover (try again later, drop a
> > > packet, use a slower copy to/from user).
> > 
> > True enough!  And your idea of taking recovery action based on the number
> > of callbacks seems like a good one while we are getting RCU's callback
> > scheduling improved.
> > 
> > By the way, was this a real problem that you could make happen on real
> > hardware?
> 
> >  If not, I would suggest just letting RCU get improved over
> > the next couple of releases.
> 
> So basically use kfree_rcu but add a comment saying e.g. "WARNING:
> in the future callers of kfree_rcu might need to check that
> not too many callbacks get queued. In that case, we can
> disable the optimization, or recover in some other way.
> Watch this space."

That sounds fair.

> > If it is something that you actually made happen, please let me know
> > what (if anything) you need from me for your callback-counting EBUSY
> > scheme.
> > 
> > 							Thanx, Paul
> 
> If you mean kfree_rcu causing OOM then no, it's all theoretical.
> If you mean synchronize_rcu stalling to the point where the guest will oops,
> then yes, that's not too hard to trigger.

Is synchronize_rcu() being stalled by the userspace loop that is invoking
your ioctl that does kfree_rcu()?  Or instead by the resulting callback
invocation?

							Thanx, Paul
Michael S. Tsirkin July 22, 2019, 4:32 p.m. UTC | #20
On Mon, Jul 22, 2019 at 09:25:51AM -0700, Paul E. McKenney wrote:
> On Mon, Jul 22, 2019 at 12:13:40PM -0400, Michael S. Tsirkin wrote:
> > On Mon, Jul 22, 2019 at 08:55:34AM -0700, Paul E. McKenney wrote:
> > > On Mon, Jul 22, 2019 at 11:47:24AM -0400, Michael S. Tsirkin wrote:
> > > > On Mon, Jul 22, 2019 at 11:14:39AM -0400, Joel Fernandes wrote:
> > > > > [snip]
> > > > > > > Would it make sense to have call_rcu() check to see if there are many
> > > > > > > outstanding requests on this CPU and if so process them before returning?
> > > > > > > That would ensure that frequent callers usually ended up doing their
> > > > > > > own processing.
> > > > > 
> > > > > Other than what Paul already mentioned about deadlocks, I am not sure if this
> > > > > would even work for all cases since call_rcu() has to wait for a grace
> > > > > period.
> > > > > 
> > > > > So, if the number of outstanding requests is higher than a certain amount,
> > > > > then you *still* have to wait for some RCU configurations for the grace
> > > > > period duration and cannot just execute the callback in-line. Did I miss
> > > > > something?
> > > > > 
> > > > > Can waiting in-line for a grace period duration be tolerated in the vhost case?
> > > > > 
> > > > > thanks,
> > > > > 
> > > > >  - Joel
> > > > 
> > > > No, but it has many other ways to recover (try again later, drop a
> > > > packet, use a slower copy to/from user).
> > > 
> > > True enough!  And your idea of taking recovery action based on the number
> > > of callbacks seems like a good one while we are getting RCU's callback
> > > scheduling improved.
> > > 
> > > By the way, was this a real problem that you could make happen on real
> > > hardware?
> > 
> > >  If not, I would suggest just letting RCU get improved over
> > > the next couple of releases.
> > 
> > So basically use kfree_rcu but add a comment saying e.g. "WARNING:
> > in the future callers of kfree_rcu might need to check that
> > not too many callbacks get queued. In that case, we can
> > disable the optimization, or recover in some other way.
> > Watch this space."
> 
> That sounds fair.
> 
> > > If it is something that you actually made happen, please let me know
> > > what (if anything) you need from me for your callback-counting EBUSY
> > > scheme.
> > > 
> > > 							Thanx, Paul
> > 
> > If you mean kfree_rcu causing OOM then no, it's all theoretical.
> > If you mean synchronize_rcu stalling to the point where the guest will oops,
> > then yes, that's not too hard to trigger.
> 
> Is synchronize_rcu() being stalled by the userspace loop that is invoking
> your ioctl that does kfree_rcu()?  Or instead by the resulting callback
> invocation?
> 
> 							Thanx, Paul

Sorry, let me clarify.  We currently have synchronize_rcu in a userspace
loop. I have a patch replacing that with kfree_rcu.  This isn't the
first time synchronize_rcu has stalled a VM for a long while, so I didn't
investigate further.
Paul E. McKenney July 22, 2019, 6:58 p.m. UTC | #21
On Mon, Jul 22, 2019 at 12:32:17PM -0400, Michael S. Tsirkin wrote:
> On Mon, Jul 22, 2019 at 09:25:51AM -0700, Paul E. McKenney wrote:
> > On Mon, Jul 22, 2019 at 12:13:40PM -0400, Michael S. Tsirkin wrote:
> > > On Mon, Jul 22, 2019 at 08:55:34AM -0700, Paul E. McKenney wrote:
> > > > On Mon, Jul 22, 2019 at 11:47:24AM -0400, Michael S. Tsirkin wrote:
> > > > > On Mon, Jul 22, 2019 at 11:14:39AM -0400, Joel Fernandes wrote:
> > > > > > [snip]
> > > > > > > > Would it make sense to have call_rcu() check to see if there are many
> > > > > > > > outstanding requests on this CPU and if so process them before returning?
> > > > > > > > That would ensure that frequent callers usually ended up doing their
> > > > > > > > own processing.
> > > > > > 
> > > > > > Other than what Paul already mentioned about deadlocks, I am not sure if this
> > > > > > would even work for all cases since call_rcu() has to wait for a grace
> > > > > > period.
> > > > > > 
> > > > > > So, if the number of outstanding requests is higher than a certain amount,
> > > > > > then you *still* have to wait for some RCU configurations for the grace
> > > > > > period duration and cannot just execute the callback in-line. Did I miss
> > > > > > something?
> > > > > > 
> > > > > > Can waiting in-line for a grace period duration be tolerated in the vhost case?
> > > > > > 
> > > > > > thanks,
> > > > > > 
> > > > > >  - Joel
> > > > > 
> > > > > No, but it has many other ways to recover (try again later, drop a
> > > > > packet, use a slower copy to/from user).
> > > > 
> > > > True enough!  And your idea of taking recovery action based on the number
> > > > of callbacks seems like a good one while we are getting RCU's callback
> > > > scheduling improved.
> > > > 
> > > > By the way, was this a real problem that you could make happen on real
> > > > hardware?
> > > 
> > > >  If not, I would suggest just letting RCU get improved over
> > > > the next couple of releases.
> > > 
> > > So basically use kfree_rcu but add a comment saying e.g. "WARNING:
> > > in the future callers of kfree_rcu might need to check that
> > > not too many callbacks get queued. In that case, we can
> > > disable the optimization, or recover in some other way.
> > > Watch this space."
> > 
> > That sounds fair.
> > 
> > > > If it is something that you actually made happen, please let me know
> > > > what (if anything) you need from me for your callback-counting EBUSY
> > > > scheme.
> > > 
> > > If you mean kfree_rcu causing OOM then no, it's all theoretical.
> > > If you mean synchronize_rcu stalling to the point where the guest will oops,
> > > then yes, that's not too hard to trigger.
> > 
> > Is synchronize_rcu() being stalled by the userspace loop that is invoking
> > your ioctl that does kfree_rcu()?  Or instead by the resulting callback
> > invocation?
> 
> Sorry, let me clarify.  We currently have synchronize_rcu in a userspace
> loop. I have a patch replacing that with kfree_rcu.  This isn't the
> first time synchronize_rcu has stalled a VM for a long while, so I didn't
> investigate further.

Ah, so a bunch of synchronize_rcu() calls within a single system call
inside the host is stalling the guest, correct?

If so, one straightforward approach is to do an rcu_barrier() every
(say) 1000 kfree_rcu() calls within that loop in the system call.
This will decrease the overhead by almost a factor of 1000 compared to
a synchronize_rcu() on each trip through that loop, and will prevent
callback overload.
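
In that loop, the batching would look something like this sketch (the
counter name and the modulus are illustrative):

	n_freed++;
	kfree_rcu(p, rcu);	/* Instead of synchronize_rcu() + kfree(). */
	if (!(n_freed % 1000))
		rcu_barrier();	/* Let the queued callbacks drain. */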

Or if the situation is different (for example, the guest does a long
sequence of system calls, each of which does a single kfree_rcu() or
some such), please let me know what the situation is.

							Thanx, Paul

Patch

diff --git a/kernel/rcu/tiny.c b/kernel/rcu/tiny.c
index 477b4eb44af5..067909521d72 100644
--- a/kernel/rcu/tiny.c
+++ b/kernel/rcu/tiny.c
@@ -125,6 +125,23 @@  void synchronize_rcu(void)
 }
 EXPORT_SYMBOL_GPL(synchronize_rcu);
 
+/*
+ * Helpful for rate-limiting kfree_rcu/call_rcu callbacks.
+ */
+bool call_rcu_outstanding(void)
+{
+	unsigned long flags;
+	bool outstanding;
+
+	local_irq_save(flags);
+	/* Any callbacks still waiting for a grace period? */
+	outstanding = rcu_ctrlblk.donetail != rcu_ctrlblk.curtail;
+	local_irq_restore(flags);
+
+	return outstanding;
+}
+EXPORT_SYMBOL_GPL(call_rcu_outstanding);
+
 /*
  * Post an RCU callback to be invoked after the end of an RCU grace
  * period.  But since we have but one CPU, that would be after any
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index a14e5fbbea46..d4b9d61e637d 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -2482,6 +2482,24 @@  static void rcu_leak_callback(struct rcu_head *rhp)
 {
 }
 
+/*
+ * Helpful for rate-limiting kfree_rcu/call_rcu callbacks.
+ */
+bool call_rcu_outstanding(void)
+{
+	unsigned long flags;
+	struct rcu_data *rdp;
+	bool outstanding;
+
+	local_irq_save(flags);
+	rdp = this_cpu_ptr(&rcu_data);
+	outstanding = !rcu_segcblist_empty(&rdp->cblist);
+	local_irq_restore(flags);
+
+	return outstanding;
+}
+EXPORT_SYMBOL_GPL(call_rcu_outstanding);
+
 /*
  * Helper function for call_rcu() and friends.  The cpu argument will
  * normally be -1, indicating "currently running CPU".  It may specify