[1/1] rcu/tree: Drop the lock before entering the page allocator

Message ID 20200715183537.4010-1-urezki@gmail.com
State New
Series
  • [1/1] rcu/tree: Drop the lock before entering the page allocator

Commit Message

Uladzislau Rezki July 15, 2020, 6:35 p.m. UTC
If the kernel is built with the CONFIG_PROVE_RAW_LOCK_NESTING
option, lockdep will complain about a violation of the
nesting rules. It performs the raw_spinlock vs. spinlock
nesting checks.

Internally, kfree_rcu() uses a raw_spinlock_t, whereas the
page allocator internally deals with a spinlock_t to access
its zones.

In order to prevent the violation in question, we can drop
the internal raw_spinlock_t before entering the page
allocator.

Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
---
 kernel/rcu/tree.c | 36 +++++++++++++++++++-----------------
 1 file changed, 19 insertions(+), 17 deletions(-)

Comments

Sebastian Andrzej Siewior July 15, 2020, 6:56 p.m. UTC | #1
On 2020-07-15 20:35:37 [+0200], Uladzislau Rezki (Sony) wrote:
> @@ -3306,6 +3307,9 @@ kvfree_call_rcu_add_ptr_to_bulk(struct kfree_rcu_cpu *krcp, void *ptr)
>  			if (IS_ENABLED(CONFIG_PREEMPT_RT))
>  				return false;
>  
> +			preempt_disable();
> +			krc_this_cpu_unlock(*krcp, *flags);

Now you enter memory allocator with disabled preemption. This isn't any
better but we don't have a warning for this yet.
What happened to the part where I asked for a spinlock_t?

> +
>  			/*
>  			 * NOTE: For one argument of kvfree_rcu() we can
>  			 * drop the lock and get the page in sleepable

Sebastian
Uladzislau Rezki July 15, 2020, 7:02 p.m. UTC | #2
On Wed, Jul 15, 2020 at 08:56:28PM +0200, Sebastian Andrzej Siewior wrote:
> On 2020-07-15 20:35:37 [+0200], Uladzislau Rezki (Sony) wrote:
> > @@ -3306,6 +3307,9 @@ kvfree_call_rcu_add_ptr_to_bulk(struct kfree_rcu_cpu *krcp, void *ptr)
> >  			if (IS_ENABLED(CONFIG_PREEMPT_RT))
> >  				return false;
> >  
> > +			preempt_disable();
> > +			krc_this_cpu_unlock(*krcp, *flags);
> 
> Now you enter memory allocator with disabled preemption. This isn't any
> better but we don't have a warning for this yet.
>
Please elaborate why it is not allowed? Consider the code:

<snip>
    spin_lock();
    __get_free_page(GFP_NOWAIT | __GFP_NOWARN);
    spin_unlock();
<snip>

Also, please note we do it for regular kernel.

>
> What happened to the part where I asked for a spinlock_t?
> 
What do you mean?

--
Vlad Rezki
Sebastian Andrzej Siewior July 15, 2020, 7:32 p.m. UTC | #3
On 2020-07-15 21:02:43 [+0200], Uladzislau Rezki wrote:
> 
> <snip>
>     spin_lock();
>     __get_free_page(GFP_NOWAIT | __GFP_NOWARN);
>     spin_unlock();
> <snip>
> 
> Also, please note we do it for regular kernel.

ach right okay then.

> >
> > What happened to the part where I asked for a spinlock_t?
> > 
> What do you mean?

Please drop that raw_spinlock_t for the kfree_rcu() based locking and
use just a plain spinlock_t for the locking. Then you can keep the same
code flow for RT and !RT without any special cases and everything.

> --
> Vlad Rezki

Sebastian
Uladzislau Rezki July 15, 2020, 7:36 p.m. UTC | #4
On Wed, Jul 15, 2020 at 09:32:50PM +0200, Sebastian Andrzej Siewior wrote:
> On 2020-07-15 21:02:43 [+0200], Uladzislau Rezki wrote:
> > 
> > <snip>
> >     spin_lock();
> >     __get_free_page(GFP_NOWAIT | __GFP_NOWARN);
> >     spin_unlock();
> > <snip>
> > 
> > Also, please note we do it for regular kernel.
> 
> ach right okay then.
> 
> > >
> > > What happened to the part where I asked for a spinlock_t?
> > > 
> > What do you mean?
> 
> Please drop that raw_spinlock_t for the kfree_rcu() based locking and
> use just a plain spinlock_t for the locking. Then you can keep the same
> code flow for RT and !RT without any special cases and everything.
> 
Ahhh... That i see :)

I guess it is up to Paul. I wrote my thoughts. I have only one concern
that i explained in my previous mails.

Thank you, Sebastian, for your comments!

--
Vlad Rezki
Paul E. McKenney July 15, 2020, 10:14 p.m. UTC | #5
On Wed, Jul 15, 2020 at 09:32:50PM +0200, Sebastian Andrzej Siewior wrote:
> On 2020-07-15 21:02:43 [+0200], Uladzislau Rezki wrote:
> > 
> > <snip>
> >     spin_lock();
> >     __get_free_page(GFP_NOWAIT | __GFP_NOWARN);
> >     spin_unlock();
> > <snip>
> > 
> > Also, please note we do it for regular kernel.
> 
> ach right okay then.
> 
> > >
> > > What happened to the part where I asked for a spinlock_t?
> > > 
> > What do you mean?
> 
> Please drop that raw_spinlock_t for the kfree_rcu() based locking and
> use just a plain spinlock_t for the locking. Then you can keep the same
> code flow for RT and !RT without any special cases and everything.

My concern is that some critical bug will show up at some point
that requires double-argument kfree_rcu() be invoked while holding
a raw spinlock.  (Single-argument kfree_rcu() must sometimes invoke
synchronize_rcu(), so it can never be invoked in any state forbidding
invoking schedule().)

Yes, dropping to a plain spinlock would be simple in the here and now,
but experience indicates that it is only a matter of time, and that when
that time comes it will come as an emergency.

One approach would be to replace the "IS_ENABLED(CONFIG_PREEMPT_RT)"
with some sort of check for being in a context where spinlock acquisition
is not legal.  What could be done along those lines?

							Thanx, Paul
Joel Fernandes July 15, 2020, 11:13 p.m. UTC | #6
On Wed, Jul 15, 2020 at 2:56 PM Sebastian Andrzej Siewior
<bigeasy@linutronix.de> wrote:
>
> On 2020-07-15 20:35:37 [+0200], Uladzislau Rezki (Sony) wrote:
> > @@ -3306,6 +3307,9 @@ kvfree_call_rcu_add_ptr_to_bulk(struct kfree_rcu_cpu *krcp, void *ptr)
> >                       if (IS_ENABLED(CONFIG_PREEMPT_RT))
> >                               return false;
> >
> > +                     preempt_disable();
> > +                     krc_this_cpu_unlock(*krcp, *flags);
>
> Now you enter memory allocator with disabled preemption. This isn't any
> better but we don't have a warning for this yet.
> What happened to the part where I asked for a spinlock_t?

Ulad,
Wouldn't the replacing of preempt_disable() with migrate_disable()
above resolve Sebastian's issue?

Or which scenario breaks?

thanks,

 - Joel


>
> > +
> >                       /*
> >                        * NOTE: For one argument of kvfree_rcu() we can
> >                        * drop the lock and get the page in sleepable
>
> Sebastian
Uladzislau Rezki July 16, 2020, 9:19 a.m. UTC | #7
On Wed, Jul 15, 2020 at 07:13:33PM -0400, Joel Fernandes wrote:
> On Wed, Jul 15, 2020 at 2:56 PM Sebastian Andrzej Siewior
> <bigeasy@linutronix.de> wrote:
> >
> > On 2020-07-15 20:35:37 [+0200], Uladzislau Rezki (Sony) wrote:
> > > @@ -3306,6 +3307,9 @@ kvfree_call_rcu_add_ptr_to_bulk(struct kfree_rcu_cpu *krcp, void *ptr)
> > >                       if (IS_ENABLED(CONFIG_PREEMPT_RT))
> > >                               return false;
> > >
> > > +                     preempt_disable();
> > > +                     krc_this_cpu_unlock(*krcp, *flags);
> >
> > Now you enter memory allocator with disabled preemption. This isn't any
> > better but we don't have a warning for this yet.
> > What happened to the part where I asked for a spinlock_t?
> 
> Ulad,
> Wouldn't the replacing of preempt_disable() with migrate_disable()
> above resolve Sebastian's issue?
>
This for regular kernel only. That means that migrate_disable() is
equal to preempt_disable(). So, no difference.

> 
> Or which scenario breaks?
> 
The proposed patch fixes Sebastian's finding about the
CONFIG_PROVE_RAW_LOCK_NESTING kernel option, which checks nesting rules
and forbids mixing raw_spinlock with spinlock.

Sebastian, could you please confirm whether the patch in
question fixes it?

It would be appreciated!

--
Vlad Rezki
Joel Fernandes July 16, 2020, 1:36 p.m. UTC | #8
On Thu, Jul 16, 2020 at 11:19:13AM +0200, Uladzislau Rezki wrote:
> On Wed, Jul 15, 2020 at 07:13:33PM -0400, Joel Fernandes wrote:
> > On Wed, Jul 15, 2020 at 2:56 PM Sebastian Andrzej Siewior
> > <bigeasy@linutronix.de> wrote:
> > >
> > > On 2020-07-15 20:35:37 [+0200], Uladzislau Rezki (Sony) wrote:
> > > > @@ -3306,6 +3307,9 @@ kvfree_call_rcu_add_ptr_to_bulk(struct kfree_rcu_cpu *krcp, void *ptr)
> > > >                       if (IS_ENABLED(CONFIG_PREEMPT_RT))
> > > >                               return false;
> > > >
> > > > +                     preempt_disable();
> > > > +                     krc_this_cpu_unlock(*krcp, *flags);
> > >
> > > Now you enter memory allocator with disabled preemption. This isn't any
> > > better but we don't have a warning for this yet.
> > > What happened to the part where I asked for a spinlock_t?
> > 
> > Ulad,
> > Wouldn't the replacing of preempt_disable() with migrate_disable()
> > above resolve Sebastian's issue?
> >
> This for regular kernel only. That means that migrate_disable() is
> equal to preempt_disable(). So, no difference.

But this will force preempt_disable() context into the low-level page
allocator on -RT kernels which I believe is not what Sebastian wants. The
whole reason why the spinlock vs raw-spinlock ordering matters is, because on
RT, the spinlock is sleeping. So if you have:

raw_spin_lock(..);
spin_lock(..);   <-- can sleep on RT, so Sleep while atomic (SWA) violation.

That's the main reason you are dropping the lock before calling the
allocator.

But if you call preempt_disable() and drop the lock, then that may fix the
lock ordering violation only to reintroduce the SWA issue, which is why you
wanted to drop the lock in the first place.

migrate_disable() on -RT kernels will not disable preemption which is where
Sebastian is coming from I think:
https://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-stable-rt.git/tree/kernel/sched/core.c?h=v5.4-rt#n8178

In your defense, the "don't disable preemption on migrate_disable()" on -RT
part is not upstream yet. So maybe this will not work on upstream PREEMPT_RT,
but I'm not sure whether folks are really running upstream PREEMPT_RT without
going through the RT tree.

I did not see the drawback of just keeping it as migrate_disable() which
should make everyone happy but Sebastian may have other opinions such as
maybe that converting the lock to normal spinlock makes the code work on
upstream's version of PREEMPT_RT without requiring the migrate_disable()
change. But he can elaborate more.

Paul, how does all this sound to you?

thanks,

 - Joel
Sebastian Andrzej Siewior July 16, 2020, 2:14 p.m. UTC | #9
On 2020-07-15 15:14:49 [-0700], Paul E. McKenney wrote:
> 
> My concern is that some critical bug will show up at some point
> that requires double-argument kfree_rcu() be invoked while holding
> a raw spinlock.  (Single-argument kfree_rcu() must sometimes invoke
> synchronize_rcu(), so it can never be invoked in any state forbidding
> invoking schedule().)

So you are saying as of today we are good but in near future the
following
   synchronize_rcu() -> kfree_rcu()

may be needed?

> Yes, dropping to a plain spinlock would be simple in the here and now,
> but experience indicates that it is only a matter of time, and that when
> that time comes it will come as an emergency.

Hmmm.

> One approach would be to replace the "IS_ENABLED(CONFIG_PREEMPT_RT)"
> with some sort of check for being in a context where spinlock acquisition
> is not legal.  What could be done along those lines?

I would rethink the whole concept how this is implemented now and give
it another try. The code does not look pretty and is looking
complicated. The RT covering of this part then just added a simple
return because nothing else seemed to be possible. This patch here
looks like another duct tape attempt to avoid a warning.

> 							Thanx, Paul

Sebastian
Sebastian Andrzej Siewior July 16, 2020, 2:25 p.m. UTC | #10
On 2020-07-16 11:19:13 [+0200], Uladzislau Rezki wrote:
> Sebastian, could you please confirm whether the patch in
> question fixes it?
> 
> It would be appreciated!

So the preempt disable should avoid any warnings. However I don't
think that it is strictly needed, and from a scheduling point of view
you forbid a CPU migration, which might be good otherwise.
Also if interrupts and everything is enabled then someone else might
invoke kfree_rcu() from BH context for instance.

Sebastian
Uladzislau Rezki July 16, 2020, 2:37 p.m. UTC | #11
On Thu, Jul 16, 2020 at 09:36:47AM -0400, Joel Fernandes wrote:
> On Thu, Jul 16, 2020 at 11:19:13AM +0200, Uladzislau Rezki wrote:
> > On Wed, Jul 15, 2020 at 07:13:33PM -0400, Joel Fernandes wrote:
> > > On Wed, Jul 15, 2020 at 2:56 PM Sebastian Andrzej Siewior
> > > <bigeasy@linutronix.de> wrote:
> > > >
> > > > On 2020-07-15 20:35:37 [+0200], Uladzislau Rezki (Sony) wrote:
> > > > > @@ -3306,6 +3307,9 @@ kvfree_call_rcu_add_ptr_to_bulk(struct kfree_rcu_cpu *krcp, void *ptr)
> > > > >                       if (IS_ENABLED(CONFIG_PREEMPT_RT))
> > > > >                               return false;
> > > > >
> > > > > +                     preempt_disable();
> > > > > +                     krc_this_cpu_unlock(*krcp, *flags);
> > > >
> > > > Now you enter memory allocator with disabled preemption. This isn't any
> > > > better but we don't have a warning for this yet.
> > > > What happened to the part where I asked for a spinlock_t?
> > > 
> > > Ulad,
> > > Wouldn't the replacing of preempt_disable() with migrate_disable()
> > > above resolve Sebastian's issue?
> > >
> > This for regular kernel only. That means that migrate_disable() is
> > equal to preempt_disable(). So, no difference.
> 
> But this will force preempt_disable() context into the low-level page
> allocator on -RT kernels which I believe is not what Sebastian wants. The
> whole reason why the spinlock vs raw-spinlock ordering matters is, because on
> RT, the spinlock is sleeping. So if you have:
> 
> raw_spin_lock(..);
> spin_lock(..);   <-- can sleep on RT, so Sleep while atomic (SWA) violation.
> 
> That's the main reason you are dropping the lock before calling the
> allocator.
> 
No. Please read the commit message of this patch. This is for regular kernel.

You did a patch:

<snip>
   if (IS_ENABLED(CONFIG_PREEMPT_RT))
       return false;
<snip>

--
Vlad Rezki
Uladzislau Rezki July 16, 2020, 2:47 p.m. UTC | #12
On Thu, Jul 16, 2020 at 04:25:37PM +0200, Sebastian Andrzej Siewior wrote:
> On 2020-07-16 11:19:13 [+0200], Uladzislau Rezki wrote:
> > Sebastian, could you please confirm whether the patch in
> > question fixes it?
> > 
> > It would be appreciated!
> 
> So the preempt disable should avoid any warnings. However I don't
> think that it is strictly needed, and from a scheduling point of view
> you forbid a CPU migration, which might be good otherwise.
>
Please elaborate your point regarding "i do not think it is strictly needed".

Actually i can rework the patch to remove even such preempt_enable/disable
to stay on the same CPU, but i do not see the point of doing it.

Do you see the point?

As for scheduling point of view. Well, there are many places when there
is a demand in memory or pages from atomic context. Also, getting a page
is not considered as a hot path in the kfree_rcu(). 

>
> Also if interrupts and everything is enabled then someone else might
> invoke kfree_rcu() from BH context for instance.
> 
And what? What is a problem here, please elaborate if you see any
issues.

--
Vlad Rezki
Sebastian Andrzej Siewior July 16, 2020, 3:04 p.m. UTC | #13
On 2020-07-16 16:47:28 [+0200], Uladzislau Rezki wrote:
> On Thu, Jul 16, 2020 at 04:25:37PM +0200, Sebastian Andrzej Siewior wrote:
> > On 2020-07-16 11:19:13 [+0200], Uladzislau Rezki wrote:
> > > Sebastian, could you please confirm whether the patch in
> > > question fixes it?
> > > 
> > > It would be appreciated!
> > 
> > So the preempt disable should avoid any warnings. However I don't
> > think that it is strictly needed, and from a scheduling point of view
> > you forbid a CPU migration, which might be good otherwise.
> >
> Please elaborate your point regarding "i do not think it is strictly needed".
> 
> Actually i can rework the patch to remove even such preempt_enable/disable
> to stay on the same CPU, but i do not see the point of doing it.
> 
> Do you see the point?

You disable preemption for what reason? It is not documented, it is not
obvious - why is it required?

> As for scheduling point of view. Well, there are many places when there
> is a demand in memory or pages from atomic context. Also, getting a page
> is not considered as a hot path in the kfree_rcu(). 

If you disable preemption then you assume that you wouldn't be atomic
otherwise. You say that at this point it is not a hot path so if this is
not *that* important why not allow preemption and allow the scheduler to
place you somewhere else if the scheduler decides that it is a good idea.

> > Also if interrupts and everything is enabled then someone else might
> > invoke kfree_rcu() from BH context for instance.
> > 
> And what? What is a problem here, please elaborate if you see any
> issues.

That the kfree_rcu() caller from BH context will end up here as well,
asking for a page.

Sebastian
Paul E. McKenney July 16, 2020, 3:20 p.m. UTC | #14
On Thu, Jul 16, 2020 at 04:14:21PM +0200, Sebastian Andrzej Siewior wrote:
> On 2020-07-15 15:14:49 [-0700], Paul E. McKenney wrote:
> > 
> > My concern is that some critical bug will show up at some point
> > that requires double-argument kfree_rcu() be invoked while holding
> > a raw spinlock.  (Single-argument kfree_rcu() must sometimes invoke
> > synchronize_rcu(), so it can never be invoked in any state forbidding
> > invoking schedule().)
> 
> So you are saying as of today we are good but in near future the
> following
>    synchronize_rcu() -> kfree_rcu()
> 
> may be needed?

You lost me on this one.  I am instead concerned that something like this
might be needed on short notice:

	raw_spin_lock(&some_lock);
	kfree_rcu(some_pointer, some_field_offset);

In contrast, single-argument kfree_rcu() cannot be invoked from any
environment where synchronize_rcu() cannot be invoked.

> > Yes, dropping to a plain spinlock would be simple in the here and now,
> > but experience indicates that it is only a matter of time, and that when
> > that time comes it will come as an emergency.
> 
> Hmmm.

I point out the call_rcu() experience.

> > One approach would be to replace the "IS_ENABLED(CONFIG_PREEMPT_RT)"
> > with some sort of check for being in a context where spinlock acquisition
> > is not legal.  What could be done along those lines?
> 
> I would rethink the whole concept how this is implemented now and give
> it another try. The code does not look pretty and is looking
> complicated. The RT covering of this part then just added a simple
> return because nothing else seemed to be possible. This patch here
> looks like another duct tape attempt to avoid a warning.

In addition to the possibility of invocation from BH?

							Thanx, Paul
Uladzislau Rezki July 16, 2020, 3:34 p.m. UTC | #15
On Thu, Jul 16, 2020 at 05:04:14PM +0200, Sebastian Andrzej Siewior wrote:
> On 2020-07-16 16:47:28 [+0200], Uladzislau Rezki wrote:
> > On Thu, Jul 16, 2020 at 04:25:37PM +0200, Sebastian Andrzej Siewior wrote:
> > > On 2020-07-16 11:19:13 [+0200], Uladzislau Rezki wrote:
> > > > Sebastian, could you please confirm that if that patch that is in
> > > > question fixes it?
> > > > 
> > > > It would be appreciated!
> > > 
> > > So that preempt disable should in terms any warnings. However I don't
> > > think that it is strictly needed and from scheduling point of view you
> > > forbid a CPU migration which might be good otherwise.
> > >
> > Please elaborate your point regarding "i do not think it is strictly needed".
> > 
> > Actually i can rework the patch to remove even such preempt_enable/disable
> > to stay on the same CPU, but i do not see the point of doing it.
> > 
> > Do you see the point?
> 
> You disable preemption for what reason? It is not documented, it is not
> obvious - why is it required?
> 
I can document it. Will that work for you? Actually, i can get rid of it,
but there can be other side effects, which can also be addressed; i do
not see any issue with doing just "preemption off". Please have
a look at the sources across the kernel to see how many times memory is
requested in atomic context:

<snip>
preempt_disable() or any spinlock or raw_lock, etc..
__get_free_page(GFP_NOWAIT | __GFP_NOWARN);
or
kmalloc(GFP_ATOMIC);
<snip>

all those flags tell the page allocator or SLAB that sleeping is not
allowed.

> > As for scheduling point of view. Well, there are many places when there
> > is a demand in memory or pages from atomic context. Also, getting a page
> > is not considered as a hot path in the kfree_rcu(). 
> 
> If you disable preemption then you assume that you wouldn't be atomic
> otherwise. You say that at this point it is not a hot path so if this is
> not *that* important why not allow preemption and allow the scheduler to
>
If i disable preemption, it means that an atomic section begins. Let me
explain why i disable preemption.

If i invoke the page allocator in a fully preemptible context, it can
happen that we get a page but end up on another CPU. That other CPU may
not need it, because it has some free spots in its internal array for
collecting pointers. If we stay on the same CPU we eliminate that.

The question is what to do with that page. I see two solutions.

1) Queue it to that CPU's page cache for further reuse or freeing.
2) Just proceed with the newly allocated page, keeping in mind that the
   previous "bulk array" may be only partly populated, i.e. it can be
   that the previous one has plenty of free spots.

Actually that is why i want to stay on the same CPU.

>
> place you somewhere else if the scheduler decides that it is a good idea.
> 
It is not a hot path, really. I do not consider it critical, since the
page allocator will not do much work (we use restricted flags); on a
high level it is limited to just asking for a page and returning it. If
there is no page, it checks the watermark, wakes up kswapd if it is
low, and returns NULL; that is it.

> > > Also if interrupts and everything is enabled then someone else might
> > > invoke kfree_rcu() from BH context for instance.
> > > 
> > And what? What is a problem here, please elaborate if you see any
> > issues.
> 
> That the kfree_rcu() caller from BH context will end up here as well,
> asking for a page.
> 
Please consider that CPU0 is somewhere in __get_free_page(); while it
is still there, an interrupt arrives that also calls kfree_rcu() and
ends up somewhere in __get_free_page(). To protect such internal
critical sections, the code usually disables irqs and does the critical
work under that protection to prevent breaking something.

So, basically, __get_free_page() can be interrupted and invoked one
more time on the same CPU. It uses spin_lock_irqsave() for such
scenarios.

Our internal lock is dropped.

--
Vlad Rezki
Sebastian Andrzej Siewior July 16, 2020, 3:36 p.m. UTC | #16
On 2020-07-16 08:20:27 [-0700], Paul E. McKenney wrote:
> You lost me on this one.  I am instead concerned that something like this
> might be needed on short notice:
> 
> 	raw_spin_lock(&some_lock);
> 	kfree_rcu(some_pointer, some_field_offset);
> 
> In contrast, single-argument kfree_rcu() cannot be invoked from any
> environment where synchronize_rcu() cannot be invoked.

I see. We don't have any kfree() in that context as far as I remember.
We had a few cases in "resize" where you allocate memory, copy content
and free old memory while under the lock but they are gone.

> > > Yes, dropping to a plain spinlock would be simple in the here and now,
> > > but experience indicates that it is only a matter of time, and that when
> > > that time comes it will come as an emergency.
> > 
> > Hmmm.
> 
> I point out the call_rcu() experience.
> 
> > > One approach would be to replace the "IS_ENABLED(CONFIG_PREEMPT_RT)"
> > > with some sort of check for being in a context where spinlock acquisition
> > > is not legal.  What could be done along those lines?
> > 
> > I would rethink the whole concept how this is implemented now and give
> > it another try. The code does not look pretty and is looking
> > complicated. The RT covering of this part then just added a simple
> > return because nothing else seemed to be possible. This patch here
> > looks like another duct tape attempt to avoid a warning.
> 
> In addition to the possibility of invocation from BH?

Invocation from BH should be possible because network would probably be
the first user. I don't remember anything wrong with BH if I remember
correctly.

> 							Thanx, Paul

Sebastian
Paul E. McKenney July 16, 2020, 4:36 p.m. UTC | #17
On Thu, Jul 16, 2020 at 05:36:38PM +0200, Sebastian Andrzej Siewior wrote:
> On 2020-07-16 08:20:27 [-0700], Paul E. McKenney wrote:
> > You lost me on this one.  I am instead concerned that something like this
> > might be needed on short notice:
> > 
> > 	raw_spin_lock(&some_lock);
> > 	kfree_rcu(some_pointer, some_field_offset);
> > 
> > In contrast, single-argument kfree_rcu() cannot be invoked from any
> > environment where synchronize_rcu() cannot be invoked.
> 
> I see. We don't have any kfree() in that context as far as I remember.
> We had a few cases in "resize" where you allocate memory, copy content
> and free old memory while under the lock but they are gone.

True, but we also didn't have any calls to call_rcu() prior to the call
to rcu_init() until suddenly we did.  (Yeah, I could have put my foot
down and prohibited that practice, but the workarounds were quite a bit
more complicated than just making call_rcu() work during very early boot.)

And last I checked, there really were calls to call_rcu() under raw
spinlocks, so the potential for calls to double-argument kfree_rcu()
clearly exists and is very real.

> > > > Yes, dropping to a plain spinlock would be simple in the here and now,
> > > > but experience indicates that it is only a matter of time, and that when
> > > > that time comes it will come as an emergency.
> > > 
> > > Hmmm.
> > 
> > I point out the call_rcu() experience.
> > 
> > > > One approach would be to replace the "IS_ENABLED(CONFIG_PREEMPT_RT)"
> > > > with some sort of check for being in a context where spinlock acquisition
> > > > is not legal.  What could be done along those lines?
> > > 
> > > I would rethink the whole concept how this is implemented now and give
> > > it another try. The code does not look pretty and is looking
> > > complicated. The RT covering of this part then just added a simple
> > > return because nothing else seemed to be possible. This patch here
> > > looks like another duct tape attempt to avoid a warning.
> > 
> > In addition to the possibility of invocation from BH?
> 
> Invocation from BH should be possible because network would probably be
> the first user. I don't remember anything wrong with BH if I remember
> correctly.

OK, that is reassuring.  Here is hoping!

							Thanx, Paul
Joel Fernandes July 16, 2020, 6:27 p.m. UTC | #18
On Thu, Jul 16, 2020 at 04:37:14PM +0200, Uladzislau Rezki wrote:
> On Thu, Jul 16, 2020 at 09:36:47AM -0400, Joel Fernandes wrote:
> > On Thu, Jul 16, 2020 at 11:19:13AM +0200, Uladzislau Rezki wrote:
> > > On Wed, Jul 15, 2020 at 07:13:33PM -0400, Joel Fernandes wrote:
> > > > On Wed, Jul 15, 2020 at 2:56 PM Sebastian Andrzej Siewior
> > > > <bigeasy@linutronix.de> wrote:
> > > > >
> > > > > On 2020-07-15 20:35:37 [+0200], Uladzislau Rezki (Sony) wrote:
> > > > > > @@ -3306,6 +3307,9 @@ kvfree_call_rcu_add_ptr_to_bulk(struct kfree_rcu_cpu *krcp, void *ptr)
> > > > > >                       if (IS_ENABLED(CONFIG_PREEMPT_RT))
> > > > > >                               return false;
> > > > > >
> > > > > > +                     preempt_disable();
> > > > > > +                     krc_this_cpu_unlock(*krcp, *flags);
> > > > >
> > > > > Now you enter memory allocator with disabled preemption. This isn't any
> > > > > better but we don't have a warning for this yet.
> > > > > What happened to the part where I asked for a spinlock_t?
> > > > 
> > > > Ulad,
> > > > Wouldn't the replacing of preempt_disable() with migrate_disable()
> > > > above resolve Sebastian's issue?
> > > >
> > > This for regular kernel only. That means that migrate_disable() is
> > > equal to preempt_disable(). So, no difference.
> > 
> > But this will force preempt_disable() context into the low-level page
> > allocator on -RT kernels which I believe is not what Sebastian wants. The
> > whole reason why the spinlock vs raw-spinlock ordering matters is, because on
> > RT, the spinlock is sleeping. So if you have:
> > 
> > raw_spin_lock(..);
> > spin_lock(..);   <-- can sleep on RT, so Sleep while atomic (SWA) violation.
> > 
> > That's the main reason you are dropping the lock before calling the
> > allocator.
> > 
> No. Please read the commit message of this patch. This is for regular kernel.

Wait, so what is the hesitation to put migrate_disable() here? It is even
further documentation (annotation) that the goal here is to stay on the same
CPU - as you indicated in later emails.

And the documentation aspect is also something Sebastian brought up. A plain
preempt_disable() is frowned upon if there are alternative APIs that document
the usage.

> You did a patch:
> 
> <snip>
>    if (IS_ENABLED(CONFIG_PREEMPT_RT))
>        return false;
> <snip>

I know, that's what we're discussing.

So again, why the hatred for migrate_disable() ? :)

thanks,

 - Joel
Uladzislau Rezki July 16, 2020, 7:03 p.m. UTC | #19
On Thu, Jul 16, 2020 at 02:27:07PM -0400, Joel Fernandes wrote:
> On Thu, Jul 16, 2020 at 04:37:14PM +0200, Uladzislau Rezki wrote:
> > On Thu, Jul 16, 2020 at 09:36:47AM -0400, Joel Fernandes wrote:
> > > On Thu, Jul 16, 2020 at 11:19:13AM +0200, Uladzislau Rezki wrote:
> > > > On Wed, Jul 15, 2020 at 07:13:33PM -0400, Joel Fernandes wrote:
> > > > > On Wed, Jul 15, 2020 at 2:56 PM Sebastian Andrzej Siewior
> > > > > <bigeasy@linutronix.de> wrote:
> > > > > >
> > > > > > On 2020-07-15 20:35:37 [+0200], Uladzislau Rezki (Sony) wrote:
> > > > > > > @@ -3306,6 +3307,9 @@ kvfree_call_rcu_add_ptr_to_bulk(struct kfree_rcu_cpu *krcp, void *ptr)
> > > > > > >                       if (IS_ENABLED(CONFIG_PREEMPT_RT))
> > > > > > >                               return false;
> > > > > > >
> > > > > > > +                     preempt_disable();
> > > > > > > +                     krc_this_cpu_unlock(*krcp, *flags);
> > > > > >
> > > > > > Now you enter memory allocator with disabled preemption. This isn't any
> > > > > > better but we don't have a warning for this yet.
> > > > > > What happened to the part where I asked for a spinlock_t?
> > > > > 
> > > > > Ulad,
> > > > > Wouldn't the replacing of preempt_disable() with migrate_disable()
> > > > > above resolve Sebastian's issue?
> > > > >
> > > > This for regular kernel only. That means that migrate_disable() is
> > > > equal to preempt_disable(). So, no difference.
> > > 
> > > But this will force preempt_disable() context into the low-level page
> > > allocator on -RT kernels which I believe is not what Sebastian wants. The
> > > whole reason why the spinlock vs raw-spinlock ordering matters is, because on
> > > RT, the spinlock is sleeping. So if you have:
> > > 
> > > raw_spin_lock(..);
> > > spin_lock(..);   <-- can sleep on RT, so Sleep while atomic (SWA) violation.
> > > 
> > > That's the main reason you are dropping the lock before calling the
> > > allocator.
> > > 
> > No. Please read the commit message of this patch. This is for regular kernel.
> 
> Wait, so what is the hesitation to put migrate_disable() here? It is even
> further documentation (annotation) that the goal here is to stay on the same
> CPU - as you indicated in later emails.
> 
Actually preempt_disable() does the same for !RT. I agree that
migrate_disable() annotation looks better from the point you
mentioned.

> And the documentation aspect is also something Sebastian brought up. A plain
> preempt_disable() is frowned upon if there are alternative APIs that document
> the usage.
> 
> > You did a patch:
> > 
> > <snip>
> >    if (IS_ENABLED(CONFIG_PREEMPT_RT))
> >        return false;
> > <snip>
> 
> I know, that's what we're discussing.
> 
> So again, why the hatred for migrate_disable() ? :)
> 
Let's do migrate_disable(), i do not mind :)

--
Vlad Rezki

Patch

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 21c2fa5bd8c3..7469bd1e5c2c 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -3278,21 +3278,22 @@  static void kfree_rcu_monitor(struct work_struct *work)
 }
 
 static inline bool
-kvfree_call_rcu_add_ptr_to_bulk(struct kfree_rcu_cpu *krcp, void *ptr)
+add_ptr_to_bulk_krc_lock(struct kfree_rcu_cpu **krcp,
+	unsigned long *flags, void *ptr)
 {
 	struct kvfree_rcu_bulk_data *bnode;
 	int idx;
 
-	if (unlikely(!krcp->initialized))
+	*krcp = krc_this_cpu_lock(flags);
+	if (unlikely(!(*krcp)->initialized))
 		return false;
 
-	lockdep_assert_held(&krcp->lock);
 	idx = !!is_vmalloc_addr(ptr);
 
 	/* Check if a new block is required. */
-	if (!krcp->bkvhead[idx] ||
-			krcp->bkvhead[idx]->nr_records == KVFREE_BULK_MAX_ENTR) {
-		bnode = get_cached_bnode(krcp);
+	if (!(*krcp)->bkvhead[idx] ||
+			(*krcp)->bkvhead[idx]->nr_records == KVFREE_BULK_MAX_ENTR) {
+		bnode = get_cached_bnode(*krcp);
 		if (!bnode) {
 			/*
 			 * To keep this path working on raw non-preemptible
@@ -3306,6 +3307,9 @@  kvfree_call_rcu_add_ptr_to_bulk(struct kfree_rcu_cpu *krcp, void *ptr)
 			if (IS_ENABLED(CONFIG_PREEMPT_RT))
 				return false;
 
+			preempt_disable();
+			krc_this_cpu_unlock(*krcp, *flags);
+
 			/*
 			 * NOTE: For one argument of kvfree_rcu() we can
 			 * drop the lock and get the page in sleepable
@@ -3315,6 +3319,9 @@  kvfree_call_rcu_add_ptr_to_bulk(struct kfree_rcu_cpu *krcp, void *ptr)
 			 */
 			bnode = (struct kvfree_rcu_bulk_data *)
 				__get_free_page(GFP_NOWAIT | __GFP_NOWARN);
+
+			*krcp = krc_this_cpu_lock(flags);
+			preempt_enable();
 		}
 
 		/* Switch to emergency path. */
@@ -3323,15 +3330,15 @@  kvfree_call_rcu_add_ptr_to_bulk(struct kfree_rcu_cpu *krcp, void *ptr)
 
 		/* Initialize the new block. */
 		bnode->nr_records = 0;
-		bnode->next = krcp->bkvhead[idx];
+		bnode->next = (*krcp)->bkvhead[idx];
 
 		/* Attach it to the head. */
-		krcp->bkvhead[idx] = bnode;
+		(*krcp)->bkvhead[idx] = bnode;
 	}
 
 	/* Finally insert. */
-	krcp->bkvhead[idx]->records
-		[krcp->bkvhead[idx]->nr_records++] = ptr;
+	(*krcp)->bkvhead[idx]->records
+		[(*krcp)->bkvhead[idx]->nr_records++] = ptr;
 
 	return true;
 }
@@ -3369,24 +3376,19 @@  void kvfree_call_rcu(struct rcu_head *head, rcu_callback_t func)
 		ptr = (unsigned long *) func;
 	}
 
-	krcp = krc_this_cpu_lock(&flags);
-
 	// Queue the object but don't yet schedule the batch.
 	if (debug_rcu_head_queue(ptr)) {
 		// Probable double kfree_rcu(), just leak.
 		WARN_ONCE(1, "%s(): Double-freed call. rcu_head %p\n",
 			  __func__, head);
-
-		// Mark as success and leave.
-		success = true;
-		goto unlock_return;
+		return;
 	}
 
 	/*
 	 * Under high memory pressure GFP_NOWAIT can fail,
 	 * in that case the emergency path is maintained.
 	 */
-	success = kvfree_call_rcu_add_ptr_to_bulk(krcp, ptr);
+	success = add_ptr_to_bulk_krc_lock(&krcp, &flags, ptr);
 	if (!success) {
 		if (head == NULL)
 			// Inline if kvfree_rcu(one_arg) call.