
[v2,1/2] seqlock: Do the lockdep annotation before locking in do_write_seqcount_begin_nested()

Message ID 20230623171232.892937-2-bigeasy@linutronix.de (mailing list archive)
State New
Series seqlock,mm: lockdep annotation + write_seqlock_irqsave()

Commit Message

Sebastian Andrzej Siewior June 23, 2023, 5:12 p.m. UTC
It was brought up by Tetsuo that the following sequence
   write_seqlock_irqsave()
   printk_deferred_enter()

could lead to a deadlock if the lockdep annotation within
write_seqlock_irqsave() triggers. The problem is that the sequence
counter is incremented before the lockdep annotation is performed. The
lockdep splat would then attempt to invoke printk(), but the reader side of
the same seqcount could have tty_port::lock acquired while waiting for the
sequence number to become even again.

The other lockdep annotations come before the actual locking because "we
want to see the locking error before it happens". There is no reason why
seqcount should be different here.

Do the lockdep annotation first, then perform the locking operation (the
sequence increment).
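
For illustration, a minimal sketch of the patched helper in both orderings
(taken from the diff below; the inline comments are added here for
explanation only):

  /* Before: the counter goes odd before lockdep can report anything. */
  static inline void do_write_seqcount_begin_nested(seqcount_t *s, int subclass)
  {
  	do_raw_write_seqcount_begin(s);	/* readers now spin until even */
  	seqcount_acquire(&s->dep_map, subclass, 0, _RET_IP_); /* splat => printk() */
  }

  /* After: a lockdep splat (and its printk()) happens while readers of the
   * seqcount can still make progress, because the counter is not odd yet. */
  static inline void do_write_seqcount_begin_nested(seqcount_t *s, int subclass)
  {
  	seqcount_acquire(&s->dep_map, subclass, 0, _RET_IP_);
  	do_raw_write_seqcount_begin(s);
  }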

Fixes: 1ca7d67cf5d5a ("seqcount: Add lockdep functionality to seqcount/seqlock structures")
Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Link: https://lore.kernel.org/20230621130641.-5iueY1I@linutronix.de
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 include/linux/seqlock.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Tetsuo Handa June 24, 2023, 6:54 a.m. UTC | #1
On 2023/06/24 2:12, Sebastian Andrzej Siewior wrote:
>  include/linux/seqlock.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/include/linux/seqlock.h b/include/linux/seqlock.h
> index 3926e90279477..d778af83c8f36 100644
> --- a/include/linux/seqlock.h
> +++ b/include/linux/seqlock.h
> @@ -512,8 +512,8 @@ do {									\
>  
>  static inline void do_write_seqcount_begin_nested(seqcount_t *s, int subclass)
>  {
> -	do_raw_write_seqcount_begin(s);
>  	seqcount_acquire(&s->dep_map, subclass, 0, _RET_IP_);
> +	do_raw_write_seqcount_begin(s);
>  }
>  
>  /**

Why not do the same on the end side?

 static inline void do_write_seqcount_end(seqcount_t *s)
 {
- 	seqcount_release(&s->dep_map, _RET_IP_);
 	do_raw_write_seqcount_end(s);
+	seqcount_release(&s->dep_map, _RET_IP_);
 }
Sebastian Andrzej Siewior June 26, 2023, 8:12 a.m. UTC | #2
On 2023-06-24 15:54:12 [+0900], Tetsuo Handa wrote:
> Why not to do the same on the end side?
> 
>  static inline void do_write_seqcount_end(seqcount_t *s)
>  {
> - 	seqcount_release(&s->dep_map, _RET_IP_);
>  	do_raw_write_seqcount_end(s);
> +	seqcount_release(&s->dep_map, _RET_IP_);
>  }

I don't have a compelling argument for doing it. It is probably better
to release the lock from lockdep's point of view and then really release
it (so it can't be acquired before it is released).

Looking at other locking primitives (spin_unlock(),
mutex_unlock(), …), that is what they do in the unlock path: the lockdep
annotation followed by the actual operation. Therefore I would keep the
current ordering to remain in sync with the other primitives.
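
For reference, a simplified sketch of the spinlock unlock path (along the
lines of include/linux/spinlock_api_smp.h):

  static inline void __raw_spin_unlock(raw_spinlock_t *lock)
  {
  	spin_release(&lock->dep_map, _RET_IP_);	/* lockdep annotation first */
  	do_raw_spin_unlock(lock);		/* then the actual unlock */
  	preempt_enable();
  }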

Sebastian
Tetsuo Handa June 26, 2023, 9:25 a.m. UTC | #3
On 2023/06/26 17:12, Sebastian Andrzej Siewior wrote:
> On 2023-06-24 15:54:12 [+0900], Tetsuo Handa wrote:
>> Why not to do the same on the end side?
>>
>>  static inline void do_write_seqcount_end(seqcount_t *s)
>>  {
>> - 	seqcount_release(&s->dep_map, _RET_IP_);
>>  	do_raw_write_seqcount_end(s);
>> +	seqcount_release(&s->dep_map, _RET_IP_);
>>  }
> 
> I don't have a compelling argument for doing it. It is probably better
> to release the lock from lockdep's point of view and then really release
> it (so it can't be acquired before it is released).

We must do it because this is a source of a possible printk() deadlock.
Otherwise, I will nack PATCH 2/2.

> 
> Looking at other locking primitives (spin_unlock(),
> mutex_unlock(),…) that is what they do in the unlock path: lockdep
> annotation followed by the actual operation. Therefore I would keep the
> current ordering to remain in-sync with the other primitives.
Peter Zijlstra June 26, 2023, 10:48 a.m. UTC | #4
On Mon, Jun 26, 2023 at 06:25:56PM +0900, Tetsuo Handa wrote:
> On 2023/06/26 17:12, Sebastian Andrzej Siewior wrote:
> > On 2023-06-24 15:54:12 [+0900], Tetsuo Handa wrote:
> >> Why not to do the same on the end side?
> >>
> >>  static inline void do_write_seqcount_end(seqcount_t *s)
> >>  {
> >> - 	seqcount_release(&s->dep_map, _RET_IP_);
> >>  	do_raw_write_seqcount_end(s);
> >> +	seqcount_release(&s->dep_map, _RET_IP_);
> >>  }
> > 
> > I don't have a compelling argument for doing it. It is probably better
> > to release the lock from lockdep's point of view and then really release
> > it (so it can't be acquired before it is released).
> 
> We must do it because this is a source of possible printk() deadlock.
> Otherwise, I will nack on PATCH 2/2.

Don't be like that... just hate on printk like the rest of us. In fact,
I've been patching out the actual printk code for years because it's
unusable garbage.

Will this actually still be a problem once all the fancy printk stuff
lands? That shouldn't do synchronous prints except to 'atomic' consoles
by default IIRC.
Tetsuo Handa June 26, 2023, 11:26 a.m. UTC | #5
On 2023/06/26 19:48, Peter Zijlstra wrote:
> On Mon, Jun 26, 2023 at 06:25:56PM +0900, Tetsuo Handa wrote:
>> On 2023/06/26 17:12, Sebastian Andrzej Siewior wrote:
>>> On 2023-06-24 15:54:12 [+0900], Tetsuo Handa wrote:
>>>> Why not to do the same on the end side?
>>>>
>>>>  static inline void do_write_seqcount_end(seqcount_t *s)
>>>>  {
>>>> - 	seqcount_release(&s->dep_map, _RET_IP_);
>>>>  	do_raw_write_seqcount_end(s);
>>>> +	seqcount_release(&s->dep_map, _RET_IP_);
>>>>  }
>>>
>>> I don't have a compelling argument for doing it. It is probably better
>>> to release the lock from lockdep's point of view and then really release
>>> it (so it can't be acquired before it is released).
>>
>> We must do it because this is a source of possible printk() deadlock.
>> Otherwise, I will nack on PATCH 2/2.
> 
> Don't be like that... just hate on prink like the rest of us. In fact,
> i've been patching out the actual printk code for years because its
> unusable garbage.
> 
> Will this actually still be a problem once all the fancy printk stuff
> lands? That shouldn't do synchronous prints except to 'atomic' consoles
> by default IIRC.

Commit 1007843a9190 ("mm/page_alloc: fix potential deadlock on zonelist_update_seq
seqlock") was applied to 4.14-stable trees, and CONFIG_PREEMPT_RT is available
since 5.3. Thus, we want a fix which can be applied to 5.4-stable and later.
This means that we can't count on all the fancy printk stuff being available.
Michal Hocko June 26, 2023, 11:35 a.m. UTC | #6
On Mon 26-06-23 20:26:02, Tetsuo Handa wrote:
> On 2023/06/26 19:48, Peter Zijlstra wrote:
> > On Mon, Jun 26, 2023 at 06:25:56PM +0900, Tetsuo Handa wrote:
> >> On 2023/06/26 17:12, Sebastian Andrzej Siewior wrote:
> >>> On 2023-06-24 15:54:12 [+0900], Tetsuo Handa wrote:
> >>>> Why not to do the same on the end side?
> >>>>
> >>>>  static inline void do_write_seqcount_end(seqcount_t *s)
> >>>>  {
> >>>> - 	seqcount_release(&s->dep_map, _RET_IP_);
> >>>>  	do_raw_write_seqcount_end(s);
> >>>> +	seqcount_release(&s->dep_map, _RET_IP_);
> >>>>  }
> >>>
> >>> I don't have a compelling argument for doing it. It is probably better
> >>> to release the lock from lockdep's point of view and then really release
> >>> it (so it can't be acquired before it is released).
> >>
> >> We must do it because this is a source of possible printk() deadlock.
> >> Otherwise, I will nack on PATCH 2/2.
> > 
> > Don't be like that... just hate on prink like the rest of us. In fact,
> > i've been patching out the actual printk code for years because its
> > unusable garbage.
> > 
> > Will this actually still be a problem once all the fancy printk stuff
> > lands? That shouldn't do synchronous prints except to 'atomic' consoles
> > by default IIRC.
> 
> Commit 1007843a9190 ("mm/page_alloc: fix potential deadlock on zonelist_update_seq
> seqlock") was applied to 4.14-stable trees, and CONFIG_PREEMPT_RT is available
> since 5.3. Thus, we want a fix which can be applied to 5.4-stable and later.
> This means that we can't count on all the fancy printk stuff being available.

Is there any reason to backport an RT-specific fixup to stable trees? I
mean seriously, is there any actual memory hotplug user using
PREEMPT_RT? I would be more than curious to hear the use case.
Tetsuo Handa June 26, 2023, 12:27 p.m. UTC | #7
On 2023/06/26 20:35, Michal Hocko wrote:
> On Mon 26-06-23 20:26:02, Tetsuo Handa wrote:
>> On 2023/06/26 19:48, Peter Zijlstra wrote:
>>> On Mon, Jun 26, 2023 at 06:25:56PM +0900, Tetsuo Handa wrote:
>>>> On 2023/06/26 17:12, Sebastian Andrzej Siewior wrote:
>>>>> On 2023-06-24 15:54:12 [+0900], Tetsuo Handa wrote:
>>>>>> Why not to do the same on the end side?
>>>>>>
>>>>>>  static inline void do_write_seqcount_end(seqcount_t *s)
>>>>>>  {
>>>>>> - 	seqcount_release(&s->dep_map, _RET_IP_);
>>>>>>  	do_raw_write_seqcount_end(s);
>>>>>> +	seqcount_release(&s->dep_map, _RET_IP_);
>>>>>>  }
>>>>>
>>>>> I don't have a compelling argument for doing it. It is probably better
>>>>> to release the lock from lockdep's point of view and then really release
>>>>> it (so it can't be acquired before it is released).
>>>>
>>>> We must do it because this is a source of possible printk() deadlock.
>>>> Otherwise, I will nack on PATCH 2/2.
>>>
>>> Don't be like that... just hate on prink like the rest of us. In fact,
>>> i've been patching out the actual printk code for years because its
>>> unusable garbage.
>>>
>>> Will this actually still be a problem once all the fancy printk stuff
>>> lands? That shouldn't do synchronous prints except to 'atomic' consoles
>>> by default IIRC.
>>
>> Commit 1007843a9190 ("mm/page_alloc: fix potential deadlock on zonelist_update_seq
>> seqlock") was applied to 4.14-stable trees, and CONFIG_PREEMPT_RT is available
>> since 5.3. Thus, we want a fix which can be applied to 5.4-stable and later.
>> This means that we can't count on all the fancy printk stuff being available.
> 
> Is there any reason to backport RT specific fixup to stable trees? I
> mean seriously, is there any actual memory hotplug user using
> PREEMPT_RT? I would be more than curious to hear the usecase.

Even if we don't backport the RT-specific fixup to stable trees, [PATCH 2/2] requires
that [PATCH 1/2] guarantees that synchronous printk() never happens (for whatever
reason) between write_seqlock_irqsave(&zonelist_update_seq, flags) and
write_sequnlock_irqrestore(&zonelist_update_seq, flags).

If [PATCH 1/2] cannot guarantee it, [PATCH 2/2] will be automatically rejected.

If [PATCH 2/2] cannot be applied, we have several alternatives.

Alternative 1:

  Revert both commit 3d36424b3b58 ("mm/page_alloc: fix race condition between build_all_zonelists and page allocation")
  and commit 1007843a9190 ("mm/page_alloc: fix potential deadlock on zonelist_update_seq seqlock").
  I don't think this will happen, for nobody will be happy.

Alternative 2:

  Revert commit 1007843a9190 ("mm/page_alloc: fix potential deadlock on zonelist_update_seq seqlock")
  and apply "mm/page_alloc: don't check zonelist_update_seq from atomic allocations" at
  https://lkml.kernel.org/r/dfdb9da6-ca8f-7a81-bfdd-d74b4c401f11@I-love.SAKURA.ne.jp .
  I think this is reasonable, for this reduces locking dependency. But Michal Hocko did not like it.

Alternative 3:

  Somehow preserve printk_deferred_enter() => write_seqlock(&zonelist_update_seq) and
  write_sequnlock(&zonelist_update_seq) => printk_deferred_exit() pattern. Something like below?

----------------------------------------
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 47421bedc12b..ded3ac3856e7 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5805,6 +5805,7 @@ static void __build_all_zonelists(void *data)
 	int nid;
 	int __maybe_unused cpu;
 	pg_data_t *self = data;
+#ifndef CONFIG_PREEMPT_RT
 	unsigned long flags;
 
 	/*
@@ -5813,6 +5814,9 @@ static void __build_all_zonelists(void *data)
 	 * (e.g. GFP_ATOMIC) that could hit zonelist_iter_begin and livelock.
 	 */
 	local_irq_save(flags);
+#else
+	migrate_disable();
+#endif
 	/*
 	 * Explicitly disable this CPU's synchronous printk() before taking
 	 * seqlock to prevent any printk() from trying to hold port->lock, for
@@ -5859,7 +5863,11 @@ static void __build_all_zonelists(void *data)
 
 	write_sequnlock(&zonelist_update_seq);
 	printk_deferred_exit();
+#ifndef CONFIG_PREEMPT_RT
 	local_irq_restore(flags);
+#else
+	migrate_enable();
+#endif
 }
 
 static noinline void __init
----------------------------------------
Sebastian Andrzej Siewior June 26, 2023, 12:46 p.m. UTC | #8
On 2023-06-26 13:35:39 [+0200], Michal Hocko wrote:
> Is there any reason to backport RT specific fixup to stable trees? I
> mean seriously, is there any actual memory hotplug user using
> PREEMPT_RT? I would be more than curious to hear the usecase.

There is no need for stable backports for RT-only fixes. We have
RT-stable trees for that.
The reason why we fix it in the RT-stable tree is to not have something
broken that was known to work.

Sebastian
Mel Gorman June 26, 2023, 12:56 p.m. UTC | #9
On Fri, Jun 23, 2023 at 07:12:31PM +0200, Sebastian Andrzej Siewior wrote:
> It was brought up by Tetsuo that the following sequence
>    write_seqlock_irqsave()
>    printk_deferred_enter()
> 
> could lead to a deadlock if the lockdep annotation within
> write_seqlock_irqsave() triggers. The problem is that the sequence
> counter is incremented before the lockdep annotation is performed. The
> lockdep splat would then attempt to invoke printk() but the reader side,
> of the same seqcount, could have a tty_port::lock acquired waiting for
> the sequence number to become even again.
> 
> The other lockdep annotations come before the actual locking because "we
> want to see the locking error before it happens". There is no reason why
> seqcount should be different here.
> 
> Do the lockdep annotation first then perform the locking operation (the
> sequence increment).
> 
> Fixes: 1ca7d67cf5d5a ("seqcount: Add lockdep functionality to seqcount/seqlock structures")
> Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Link: https://lore.kernel.org/20230621130641.-5iueY1I@linutronix.de
> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>

Acked-by: Mel Gorman <mgorman@techsingularity.net>
Sebastian Andrzej Siewior June 26, 2023, 1:13 p.m. UTC | #10
On 2023-06-26 12:48:31 [+0200], Peter Zijlstra wrote:
> 
> Don't be like that... just hate on prink like the rest of us. In fact,
> i've been patching out the actual printk code for years because its
> unusable garbage.

:)

> Will this actually still be a problem once all the fancy printk stuff
> lands? That shouldn't do synchronous prints except to 'atomic' consoles
> by default IIRC.

The fancy printk stuff should have synchronous printing in emergency
situations and threaded printing by default. But then I hear John saying
that there might be push back because atomic consoles may not be
available everywhere…

Anyway, the requirements for the deadlock that Tetsuo Handa is concerned
about here to trigger:
- lockdep enabled
- console and tty in use.
- tty_insert_flip_string_and_push_buffer() on one CPU with
  tty_port::lock acquired, followed by a memory allocation blocking on
  read_seqbegin(&zonelist_update_seq) due to the writer
- memory hotplug => __build_all_zonelists() acquiring
  write_seqlock(&zonelist_update_seq), and now lockdep creates a splat.
  This is accounted for if the lockdep annotation comes first (1/2 of
  the series). But the unlock annotation of the seq unlock operation may
  still create a splat, so a possibility for a deadlock remains (see the
  sketch below).

The bar is _high_ and it hardly justifies ugly code.
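
Put as an interleaving (reconstructed from the points above; simplified,
pre-series ordering):

  CPU0 (memory hotplug)                    CPU1 (tty output)
  __build_all_zonelists()                  tty_insert_flip_string_and_push_buffer()
    write_seqlock(&zonelist_update_seq)      spin_lock(&port->lock)
    lockdep splat -> printk()                kmalloc(GFP_ATOMIC | __GFP_NOWARN)
      -> console output waits for              -> zonelist_iter_begin() spins in
         port->lock (held by CPU1)                read_seqbegin() until the
                                                  seqcount becomes even again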

Sebastian
Michal Hocko June 26, 2023, 1:16 p.m. UTC | #11
On Mon 26-06-23 21:27:05, Tetsuo Handa wrote:
> On 2023/06/26 20:35, Michal Hocko wrote:
> > On Mon 26-06-23 20:26:02, Tetsuo Handa wrote:
> >> On 2023/06/26 19:48, Peter Zijlstra wrote:
> >>> On Mon, Jun 26, 2023 at 06:25:56PM +0900, Tetsuo Handa wrote:
> >>>> On 2023/06/26 17:12, Sebastian Andrzej Siewior wrote:
> >>>>> On 2023-06-24 15:54:12 [+0900], Tetsuo Handa wrote:
> >>>>>> Why not to do the same on the end side?
> >>>>>>
> >>>>>>  static inline void do_write_seqcount_end(seqcount_t *s)
> >>>>>>  {
> >>>>>> - 	seqcount_release(&s->dep_map, _RET_IP_);
> >>>>>>  	do_raw_write_seqcount_end(s);
> >>>>>> +	seqcount_release(&s->dep_map, _RET_IP_);
> >>>>>>  }
> >>>>>
> >>>>> I don't have a compelling argument for doing it. It is probably better
> >>>>> to release the lock from lockdep's point of view and then really release
> >>>>> it (so it can't be acquired before it is released).
> >>>>
> >>>> We must do it because this is a source of possible printk() deadlock.
> >>>> Otherwise, I will nack on PATCH 2/2.
> >>>
> >>> Don't be like that... just hate on prink like the rest of us. In fact,
> >>> i've been patching out the actual printk code for years because its
> >>> unusable garbage.
> >>>
> >>> Will this actually still be a problem once all the fancy printk stuff
> >>> lands? That shouldn't do synchronous prints except to 'atomic' consoles
> >>> by default IIRC.
> >>
> >> Commit 1007843a9190 ("mm/page_alloc: fix potential deadlock on zonelist_update_seq
> >> seqlock") was applied to 4.14-stable trees, and CONFIG_PREEMPT_RT is available
> >> since 5.3. Thus, we want a fix which can be applied to 5.4-stable and later.
> >> This means that we can't count on all the fancy printk stuff being available.
> > 
> > Is there any reason to backport RT specific fixup to stable trees? I
> > mean seriously, is there any actual memory hotplug user using
> > PREEMPT_RT? I would be more than curious to hear the usecase.
> 
> Even if we don't backport RT specific fixup to stable trees, [PATCH 2/2] requires
> that [PATCH 1/2] guarantees that synchronous printk() never happens (for whatever
> reasons) between write_seqlock_irqsave(&zonelist_update_seq, flags) and
> write_sequnlock_irqrestore(&zonelist_update_seq, flags).

I suspect you are overcomplicating this. I do understand that you want
to have this 100% airtight but I would argue that this is actually not
really necessary. I would be perfectly fine living in the world where
this particular path could trigger an unintended printk. IIUC we are
mostly talking about the lockup detector only, right? AFAIK there is no
such issue _now_, so we are talking about a potential _risk_ only.
 
> If [PATCH 1/2] cannot guarantee it, [PATCH 2/2] will be automatically rejected.
> 
> If [PATCH 2/2] cannot be applied, we have several alternatives.
> 
> Alternative 1:
> 
>   Revert both commit 3d36424b3b58 ("mm/page_alloc: fix race condition between build_all_zonelists and page allocation")
>   and commit 1007843a9190 ("mm/page_alloc: fix potential deadlock on zonelist_update_seq seqlock").
>   I don't think this will happen, for nobody will be happy.
> 
> Alternative 2:
> 
>   Revert commit 1007843a9190 ("mm/page_alloc: fix potential deadlock on zonelist_update_seq seqlock")
>   and apply "mm/page_alloc: don't check zonelist_update_seq from atomic allocations" at
>   https://lkml.kernel.org/r/dfdb9da6-ca8f-7a81-bfdd-d74b4c401f11@I-love.SAKURA.ne.jp .
>   I think this is reasonable, for this reduces locking dependency. But Michal Hocko did not like it.
> 
> Alternative 3:
> 
>   Somehow preserve printk_deferred_enter() => write_seqlock(&zonelist_update_seq) and
>   write_sequnlock(&zonelist_update_seq) => printk_deferred_exit() pattern. Something like below?
> 

Alternative 4:
stop chasing shadows and deal with the fact that this code won't be
perfect. Seriously, you are trying to address a non-existent problem and
blocking a working RT solution which doesn't clutter the code with
RT-specific baggage.
Petr Mladek June 26, 2023, 2:44 p.m. UTC | #12
On Mon 2023-06-26 10:12:54, Sebastian Andrzej Siewior wrote:
> On 2023-06-24 15:54:12 [+0900], Tetsuo Handa wrote:
> > Why not to do the same on the end side?
> > 
> >  static inline void do_write_seqcount_end(seqcount_t *s)
> >  {
> > - 	seqcount_release(&s->dep_map, _RET_IP_);
> >  	do_raw_write_seqcount_end(s);
> > +	seqcount_release(&s->dep_map, _RET_IP_);
> >  }
> 
> I don't have a compelling argument for doing it. It is probably better
> to release the lock from lockdep's point of view and then really release
> it (so it can't be acquired before it is released).

If this is true then we should not change the ordering on the _begin
side either. I mean that we should call the lockdep code only
after the lock is taken. Anyway, both sides should be symmetric.

That said, lockdep is about chains of locks and not about timing.
We must not call the lockdep annotation while the lock is still available
to a nested context. So the ordering is probably important only when
the lock might be taken from both normal and interrupt context.

Anyway, please do not do this change only because of printk().
IMHO, the current ordering is more logical and the printk() problem
should be solved another way.

Best Regards,
Petr
Tetsuo Handa June 28, 2023, 12:14 p.m. UTC | #13
On 2023/06/26 23:44, Petr Mladek wrote:
> On Mon 2023-06-26 10:12:54, Sebastian Andrzej Siewior wrote:
>> On 2023-06-24 15:54:12 [+0900], Tetsuo Handa wrote:
>>> Why not to do the same on the end side?
>>>
>>>  static inline void do_write_seqcount_end(seqcount_t *s)
>>>  {
>>> - 	seqcount_release(&s->dep_map, _RET_IP_);
>>>  	do_raw_write_seqcount_end(s);
>>> +	seqcount_release(&s->dep_map, _RET_IP_);
>>>  }
>>
>> I don't have a compelling argument for doing it. It is probably better
>> to release the lock from lockdep's point of view and then really release
>> it (so it can't be acquired before it is released).
> 
> If this is true then we should not change the ordering on the _begin
> side either. I mean that we should call the lockdep code only
> after the lock is taken. Anyway, both sides should be symmetric.
> 
> That said, lockdep is about chains of locks and not about timing.
> We must not call lockdep annotation when the lock is still available
> for a nested context. So the ordering is probably important only when
> the lock might be taken from both normal and interrupt context.
> 
> Anyway, please do not do this change only because of printk().
> IMHO, the current ordering is more logical and the printk() problem
> should be solved another way.

Then, since [PATCH 1/2] cannot be applied, [PATCH 2/2] is automatically
rejected.



I found

  /*
   * Locking a pcp requires a PCP lookup followed by a spinlock. To avoid
   * a migration causing the wrong PCP to be locked and remote memory being
   * potentially allocated, pin the task to the CPU for the lookup+lock.
   * preempt_disable is used on !RT because it is faster than migrate_disable.
   * migrate_disable is used on RT because otherwise RT spinlock usage is
   * interfered with and a high priority task cannot preempt the allocator.
   */
  #ifndef CONFIG_PREEMPT_RT
  #define pcpu_task_pin()         preempt_disable()
  #define pcpu_task_unpin()       preempt_enable()
  #else
  #define pcpu_task_pin()         migrate_disable()
  #define pcpu_task_unpin()       migrate_enable()
  #endif

in mm/page_alloc.c. Thus, I think that calling migrate_disable() if CONFIG_PREEMPT_RT=y
and calling local_irq_save() if CONFIG_PREEMPT_RT=n (i.e. Alternative 3) will work.

But thinking again, since CONFIG_PREEMPT_RT=y uses a special printk() approach where
messages are printed from a dedicated kernel thread, do we need to call
printk_deferred_enter() at all if CONFIG_PREEMPT_RT=y? That is, isn't the fix as
straightforward as below?

----------------------------------------
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 47421bedc12b..a2a3bfa69a12 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5805,6 +5805,7 @@ static void __build_all_zonelists(void *data)
 	int nid;
 	int __maybe_unused cpu;
 	pg_data_t *self = data;
+#ifndef CONFIG_PREEMPT_RT
 	unsigned long flags;
 
 	/*
@@ -5820,6 +5821,7 @@ static void __build_all_zonelists(void *data)
 	 * calling kmalloc(GFP_ATOMIC | __GFP_NOWARN) with port->lock held.
 	 */
 	printk_deferred_enter();
+#endif
 	write_seqlock(&zonelist_update_seq);
 
 #ifdef CONFIG_NUMA
@@ -5858,8 +5860,10 @@ static void __build_all_zonelists(void *data)
 	}
 
 	write_sequnlock(&zonelist_update_seq);
+#ifndef CONFIG_PREEMPT_RT
 	printk_deferred_exit();
 	local_irq_restore(flags);
+#endif
 }
 
 static noinline void __init
----------------------------------------
Sebastian Andrzej Siewior July 27, 2023, 3:10 p.m. UTC | #14
On 2023-06-28 21:14:16 [+0900], Tetsuo Handa wrote:
> > Anyway, please do not do this change only because of printk().
> > IMHO, the current ordering is more logical and the printk() problem
> > should be solved another way.
> 
> Then, since [PATCH 1/2] cannot be applied, [PATCH 2/2] is automatically
> rejected.

My understanding is that this patch gets applied and your objection will
be noted.

> I found
> 
>   /*
>    * Locking a pcp requires a PCP lookup followed by a spinlock. To avoid
>    * a migration causing the wrong PCP to be locked and remote memory being
>    * potentially allocated, pin the task to the CPU for the lookup+lock.
>    * preempt_disable is used on !RT because it is faster than migrate_disable.
>    * migrate_disable is used on RT because otherwise RT spinlock usage is
>    * interfered with and a high priority task cannot preempt the allocator.
>    */
>   #ifndef CONFIG_PREEMPT_RT
>   #define pcpu_task_pin()         preempt_disable()
>   #define pcpu_task_unpin()       preempt_enable()
>   #else
>   #define pcpu_task_pin()         migrate_disable()
>   #define pcpu_task_unpin()       migrate_enable()
>   #endif
> 
> in mm/page_alloc.c . Thus, I think that calling migrate_disable() if CONFIG_PREEMPT_RT=y
> and calling local_irq_save() if CONFIG_PREEMPT_RT=n (i.e. Alternative 3) will work.
> 
> But thinking again, since CONFIG_PREEMPT_RT=y uses special printk() approach where messages
> are printed from a dedicated kernel thread, do we need to call printk_deferred_enter() if
> CONFIG_PREEMPT_RT=y ? That is, isn't the fix as straightforward as below?

The patch below will cause a splat with CONFIG_PROVE_RAW_LOCK_NESTING. That
is because seqlock_t::lock is acquired without disabling interrupts.
Additionally, it is a bad example because the seqcount API is bypassed
due to printk's limitations, and the problems that are caused on
PREEMPT_RT are "ifdefed away". None of this is documented/explained.

Let me summarize your remaining problem:
- With (and only with) CONFIG_PROVE_LOCKING there can be a printk splat
  caused by a lock validation error noticed by lockdep during
  write_sequnlock_irqrestore().

- This can deadlock if there is a printing output on the tty which is
  using the same console as printk and memory hotplug is active at the
  same time.
  That is because the tty layer acquires the same lock as printk's
  console during memory allocation (of the tty layer).

Now:
- before this deadlocks (with CONFIG_PROVE_LOCKING) chances are high
  that a splat is seen first.

- printk is being reworked and the printk output should either happen from a
  dedicated thread or directly via a different console driver which is
  not using uart_port::lock, thus avoiding the deadlock.

Sebastian
Tetsuo Handa July 29, 2023, 5:31 a.m. UTC | #15
On 2023/07/28 0:10, Sebastian Andrzej Siewior wrote:
> On 2023-06-28 21:14:16 [+0900], Tetsuo Handa wrote:
>>> Anyway, please do not do this change only because of printk().
>>> IMHO, the current ordering is more logical and the printk() problem
>>> should be solved another way.
>>
>> Then, since [PATCH 1/2] cannot be applied, [PATCH 2/2] is automatically
>> rejected.
> 
> My understanding is that this patch gets applied and your objection will
> be noted.

My preference is that zonelist_update_seq is not checked by !__GFP_DIRECT_RECLAIM
allocations, which is a low-hanging fruit towards GFP_LOCKLESS mentioned at
https://lkml.kernel.org/r/ZG3+l4qcCWTPtSMD@dhcp22.suse.cz and
https://lkml.kernel.org/r/ZJWWpGZMJIADQvRS@dhcp22.suse.cz .

Maybe we can defer checking zonelist_update_seq until the retry check, like
below, for this is really an infrequent event.

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 7d3460c7a480..2f7b82af2590 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3642,22 +3642,27 @@ EXPORT_SYMBOL_GPL(fs_reclaim_release);
  * retries the allocation if zonelist changes. Writer side is protected by the
  * embedded spin_lock.
  */
-static DEFINE_SEQLOCK(zonelist_update_seq);
+static unsigned int zonelist_update_seq;
 
 static unsigned int zonelist_iter_begin(void)
 {
 	if (IS_ENABLED(CONFIG_MEMORY_HOTREMOVE))
-		return read_seqbegin(&zonelist_update_seq);
+		return data_race(READ_ONCE(zonelist_update_seq));
 
 	return 0;
 }
 
-static unsigned int check_retry_zonelist(unsigned int seq)
+static unsigned int check_retry_zonelist(gfp_t gfp, unsigned int seq)
 {
-	if (IS_ENABLED(CONFIG_MEMORY_HOTREMOVE))
-		return read_seqretry(&zonelist_update_seq, seq);
+	if (IS_ENABLED(CONFIG_MEMORY_HOTREMOVE) && (gfp & __GFP_DIRECT_RECLAIM)) {
+		unsigned int seq2;
+
+		smp_rmb();
+		seq2 = data_race(READ_ONCE(zonelist_update_seq));
+		return unlikely(seq != seq2 || (seq2 & 1));
+	}
 
-	return seq;
+	return 0;
 }
 
 /* Perform direct synchronous page reclaim */
@@ -4146,7 +4151,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	 * a unnecessary OOM kill.
 	 */
 	if (check_retry_cpuset(cpuset_mems_cookie, ac) ||
-	    check_retry_zonelist(zonelist_iter_cookie))
+	    check_retry_zonelist(gfp_mask, zonelist_iter_cookie))
 		goto restart;
 
 	/* Reclaim has failed us, start killing things */
@@ -4172,7 +4177,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	 * a unnecessary OOM kill.
 	 */
 	if (check_retry_cpuset(cpuset_mems_cookie, ac) ||
-	    check_retry_zonelist(zonelist_iter_cookie))
+	    check_retry_zonelist(gfp_mask, zonelist_iter_cookie))
 		goto restart;
 
 	/*
@@ -5136,22 +5141,12 @@ static void __build_all_zonelists(void *data)
 	int nid;
 	int __maybe_unused cpu;
 	pg_data_t *self = data;
+	static DEFINE_SPINLOCK(lock);
 	unsigned long flags;
 
-	/*
-	 * Explicitly disable this CPU's interrupts before taking seqlock
-	 * to prevent any IRQ handler from calling into the page allocator
-	 * (e.g. GFP_ATOMIC) that could hit zonelist_iter_begin and livelock.
-	 */
-	local_irq_save(flags);
-	/*
-	 * Explicitly disable this CPU's synchronous printk() before taking
-	 * seqlock to prevent any printk() from trying to hold port->lock, for
-	 * tty_insert_flip_string_and_push_buffer() on other CPU might be
-	 * calling kmalloc(GFP_ATOMIC | __GFP_NOWARN) with port->lock held.
-	 */
-	printk_deferred_enter();
-	write_seqlock(&zonelist_update_seq);
+	spin_lock_irqsave(&lock, flags);
+	data_race(zonelist_update_seq++);
+	smp_wmb();
 
 #ifdef CONFIG_NUMA
 	memset(node_load, 0, sizeof(node_load));
@@ -5188,9 +5183,9 @@ static void __build_all_zonelists(void *data)
 #endif
 	}
 
-	write_sequnlock(&zonelist_update_seq);
-	printk_deferred_exit();
-	local_irq_restore(flags);
+	smp_wmb();
+	data_race(zonelist_update_seq++);
+	spin_unlock_irqrestore(&lock, flags);
 }
 
 static noinline void __init
Tetsuo Handa July 29, 2023, 11:05 a.m. UTC | #16
On 2023/07/29 14:31, Tetsuo Handa wrote:
> On 2023/07/28 0:10, Sebastian Andrzej Siewior wrote:
>> On 2023-06-28 21:14:16 [+0900], Tetsuo Handa wrote:
>>>> Anyway, please do not do this change only because of printk().
>>>> IMHO, the current ordering is more logical and the printk() problem
>>>> should be solved another way.
>>>
>>> Then, since [PATCH 1/2] cannot be applied, [PATCH 2/2] is automatically
>>> rejected.
>>
>> My understanding is that this patch gets applied and your objection will
>> be noted.
> 
> My preference is that zonelist_update_seq is not checked by !__GFP_DIRECT_RECLAIM
> allocations, which is a low-hanging fruit towards GFP_LOCKLESS mentioned at
> https://lkml.kernel.org/r/ZG3+l4qcCWTPtSMD@dhcp22.suse.cz and
> https://lkml.kernel.org/r/ZJWWpGZMJIADQvRS@dhcp22.suse.cz .
> 
> Maybe we can defer checking zonelist_update_seq till retry check like below,
> for this is really an infrequent event.
> 

An updated version with comments added.

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 7d3460c7a480..92ecf5f2f76b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3638,26 +3638,44 @@ EXPORT_SYMBOL_GPL(fs_reclaim_release);
 
 /*
  * Zonelists may change due to hotplug during allocation. Detect when zonelists
- * have been rebuilt so allocation retries. Reader side does not lock and
- * retries the allocation if zonelist changes. Writer side is protected by the
- * embedded spin_lock.
+ * have been rebuilt so __GFP_DIRECT_RECLAIM allocation retries. Writer side
+ * makes this sequence odd before rebuilding zonelists and makes this sequence
+ * even after rebuilding zonelists. Since writer side disables context switching
+ * when rebuilding zonelists, and !__GFP_DIRECT_RECLAIM allocation will not
+ * retry when zonelists changed, reader side does not need to hold a lock (but
+ * needs to use data_race() annotation), making locking dependency simpler.
  */
-static DEFINE_SEQLOCK(zonelist_update_seq);
+static unsigned int zonelist_update_seq;
 
 static unsigned int zonelist_iter_begin(void)
 {
 	if (IS_ENABLED(CONFIG_MEMORY_HOTREMOVE))
-		return read_seqbegin(&zonelist_update_seq);
+		/* See comment above. */
+		return data_race(READ_ONCE(zonelist_update_seq));
 
 	return 0;
 }
 
-static unsigned int check_retry_zonelist(unsigned int seq)
+static unsigned int check_retry_zonelist(gfp_t gfp, unsigned int seq)
 {
-	if (IS_ENABLED(CONFIG_MEMORY_HOTREMOVE))
-		return read_seqretry(&zonelist_update_seq, seq);
+	if (IS_ENABLED(CONFIG_MEMORY_HOTREMOVE) && (gfp & __GFP_DIRECT_RECLAIM)) {
+		/* See comment above. */
+		unsigned int seq2 = data_race(READ_ONCE(zonelist_update_seq));
 
-	return seq;
+		/*
+		 * "seq != seq2" indicates that __build_all_zonelists() has
+		 * started or has finished rebuilding zonelists, hence retry.
+		 * "seq == seq2 && (seq2 & 1)" indicates that
+		 * __build_all_zonelists() is still rebuilding zonelists
+		 * with context switching disabled, hence retry.
+		 * "seq == seq2 && !(seq2 & 1)" indicates that
+		 * __build_all_zonelists() did not rebuild zonelists, hence
+		 * no retry.
+		 */
+		return unlikely(seq != seq2 || (seq2 & 1));
+	}
+
+	return 0;
 }
 
 /* Perform direct synchronous page reclaim */
@@ -4146,7 +4164,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	 * a unnecessary OOM kill.
 	 */
 	if (check_retry_cpuset(cpuset_mems_cookie, ac) ||
-	    check_retry_zonelist(zonelist_iter_cookie))
+	    check_retry_zonelist(gfp_mask, zonelist_iter_cookie))
 		goto restart;
 
 	/* Reclaim has failed us, start killing things */
@@ -4172,7 +4190,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	 * a unnecessary OOM kill.
 	 */
 	if (check_retry_cpuset(cpuset_mems_cookie, ac) ||
-	    check_retry_zonelist(zonelist_iter_cookie))
+	    check_retry_zonelist(gfp_mask, zonelist_iter_cookie))
 		goto restart;
 
 	/*
@@ -5136,22 +5154,17 @@ static void __build_all_zonelists(void *data)
 	int nid;
 	int __maybe_unused cpu;
 	pg_data_t *self = data;
+	static DEFINE_SPINLOCK(lock);
 	unsigned long flags;
 
-	/*
-	 * Explicitly disable this CPU's interrupts before taking seqlock
-	 * to prevent any IRQ handler from calling into the page allocator
-	 * (e.g. GFP_ATOMIC) that could hit zonelist_iter_begin and livelock.
-	 */
-	local_irq_save(flags);
-	/*
-	 * Explicitly disable this CPU's synchronous printk() before taking
-	 * seqlock to prevent any printk() from trying to hold port->lock, for
-	 * tty_insert_flip_string_and_push_buffer() on other CPU might be
-	 * calling kmalloc(GFP_ATOMIC | __GFP_NOWARN) with port->lock held.
-	 */
-	printk_deferred_enter();
-	write_seqlock(&zonelist_update_seq);
+#ifdef CONFIG_PREEMPT_RT
+	migrate_disable();
+#endif
+	/* Serialize increments of zonelist_update_seq. */
+	spin_lock_irqsave(&lock, flags);
+	zonelist_update_seq++;
+	/* Tell check_retry_zonelist() that we started rebuilding zonelists. */
+	smp_wmb();
 
 #ifdef CONFIG_NUMA
 	memset(node_load, 0, sizeof(node_load));
@@ -5188,9 +5201,13 @@ static void __build_all_zonelists(void *data)
 #endif
 	}
 
-	write_sequnlock(&zonelist_update_seq);
-	printk_deferred_exit();
-	local_irq_restore(flags);
+	/* Tell check_retry_zonelist() that we finished rebuilding zonelists. */
+	smp_wmb();
+	zonelist_update_seq++;
+	spin_unlock_irqrestore(&lock, flags);
+#ifdef CONFIG_PREEMPT_RT
+	migrate_enable();
+#endif
 }
 
 static noinline void __init
Michal Hocko July 31, 2023, 2:25 p.m. UTC | #17
On Sat 29-07-23 20:05:43, Tetsuo Handa wrote:
> On 2023/07/29 14:31, Tetsuo Handa wrote:
> > On 2023/07/28 0:10, Sebastian Andrzej Siewior wrote:
> >> On 2023-06-28 21:14:16 [+0900], Tetsuo Handa wrote:
> >>>> Anyway, please do not do this change only because of printk().
> >>>> IMHO, the current ordering is more logical and the printk() problem
> >>>> should be solved another way.
> >>>
> >>> Then, since [PATCH 1/2] cannot be applied, [PATCH 2/2] is automatically
> >>> rejected.
> >>
> >> My understanding is that this patch gets applied and your objection will
> >> be noted.
> > 
> > My preference is that zonelist_update_seq is not checked by !__GFP_DIRECT_RECLAIM
> > allocations, which is a low-hanging fruit towards GFP_LOCKLESS mentioned at
> > https://lkml.kernel.org/r/ZG3+l4qcCWTPtSMD@dhcp22.suse.cz and
> > https://lkml.kernel.org/r/ZJWWpGZMJIADQvRS@dhcp22.suse.cz .
> > 
> > Maybe we can defer checking zonelist_update_seq till retry check like below,
> > for this is really an infrequent event.
> > 
> 
> An updated version with comments added.

Seriously, don't you see how hairy all this is? And for what? Nitpicking
something that doesn't seem to be a real problem in the first place?
Tetsuo Handa Aug. 3, 2023, 1:18 p.m. UTC | #18
On 2023/07/31 23:25, Michal Hocko wrote:
> On Sat 29-07-23 20:05:43, Tetsuo Handa wrote:
>> On 2023/07/29 14:31, Tetsuo Handa wrote:
>>> On 2023/07/28 0:10, Sebastian Andrzej Siewior wrote:
>>>> On 2023-06-28 21:14:16 [+0900], Tetsuo Handa wrote:
>>>>>> Anyway, please do not do this change only because of printk().
>>>>>> IMHO, the current ordering is more logical and the printk() problem
>>>>>> should be solved another way.
>>>>>
>>>>> Then, since [PATCH 1/2] cannot be applied, [PATCH 2/2] is automatically
>>>>> rejected.
>>>>
>>>> My understanding is that this patch gets applied and your objection will
>>>> be noted.
>>>
>>> My preference is that zonelist_update_seq is not checked by !__GFP_DIRECT_RECLAIM
>>> allocations, which is a low-hanging fruit towards GFP_LOCKLESS mentioned at
>>> https://lkml.kernel.org/r/ZG3+l4qcCWTPtSMD@dhcp22.suse.cz and
>>> https://lkml.kernel.org/r/ZJWWpGZMJIADQvRS@dhcp22.suse.cz .
>>>
>>> Maybe we can defer checking zonelist_update_seq till retry check like below,
>>> for this is really an infrequent event.
>>>
>>
>> An updated version with comments added.
> 
> Seriously, don't you see how hairy all this is? And for what? Nitpicking
> something that doesn't seem to be a real problem in the first place?

Seriously, can't you find "zonelist_update_seq is not checked by !__GFP_DIRECT_RECLAIM
allocations, which is a low-hanging fruit towards GFP_LOCKLESS" !?

My initial proposal was
"[PATCH] mm/page_alloc: don't check zonelist_update_seq from atomic allocations"
at https://lkml.kernel.org/r/dfdb9da6-ca8f-7a81-bfdd-d74b4c401f11@I-love.SAKURA.ne.jp .
Compared to that version, this what-you-call-hairy version has an improvement in that

-	return read_seqbegin(&zonelist_update_seq);
+	return data_race(READ_ONCE(zonelist_update_seq));

can eliminate the

	while ((__seq = seqprop_sequence(s)) & 1)
		cpu_relax();

path. There is no need to wait for completion of rebuilding zonelists, for
a rebuild of the zonelists being in flight (indicated by zonelist_update_seq
being odd) does not mean that the allocation can never succeed. When the
allocation does not fail, this "while" loop becomes nothing but a waste of
CPU time, and it is very likely that no rebuild is in flight in the first place.

We can make zonelist_iter_begin() (which is always called as long as
__alloc_pages_slowpath() is called) faster and simpler, which is an improvement
even without considering printk() and lockdep/KCSAN related problems.
Michal Hocko Aug. 3, 2023, 2:49 p.m. UTC | #19
On Thu 03-08-23 22:18:10, Tetsuo Handa wrote:
> On 2023/07/31 23:25, Michal Hocko wrote:
> > On Sat 29-07-23 20:05:43, Tetsuo Handa wrote:
> >> On 2023/07/29 14:31, Tetsuo Handa wrote:
> >>> On 2023/07/28 0:10, Sebastian Andrzej Siewior wrote:
> >>>> On 2023-06-28 21:14:16 [+0900], Tetsuo Handa wrote:
> >>>>>> Anyway, please do not do this change only because of printk().
> >>>>>> IMHO, the current ordering is more logical and the printk() problem
> >>>>>> should be solved another way.
> >>>>>
> >>>>> Then, since [PATCH 1/2] cannot be applied, [PATCH 2/2] is automatically
> >>>>> rejected.
> >>>>
> >>>> My understanding is that this patch gets applied and your objection will
> >>>> be noted.
> >>>
> >>> My preference is that zonelist_update_seq is not checked by !__GFP_DIRECT_RECLAIM
> >>> allocations, which is a low-hanging fruit towards GFP_LOCKLESS mentioned at
> >>> https://lkml.kernel.org/r/ZG3+l4qcCWTPtSMD@dhcp22.suse.cz and
> >>> https://lkml.kernel.org/r/ZJWWpGZMJIADQvRS@dhcp22.suse.cz .
> >>>
> >>> Maybe we can defer checking zonelist_update_seq till retry check like below,
> >>> for this is really an infrequent event.
> >>>
> >>
> >> An updated version with comments added.
> > 
> > Seriously, don't you see how hairy all this is? And for what? Nitpicking
> > something that doesn't seem to be a real problem in the first place?
> 
> Seriously, can't you find "zonelist_update_seq is not checked by !__GFP_DIRECT_RECLAIM
> allocations, which is a low-hanging fruit towards GFP_LOCKLESS" !?

I do not think we have concluded that we want to support GFP_LOCKLESS.
This might be trivially straightforward now but it imposes some constraints
on future maintainability. So far we haven't heard about many use cases
where this would be needed, and a single one is not sufficient IMHO.
Tetsuo Handa Aug. 4, 2023, 1:27 p.m. UTC | #20
On 2023/08/03 23:49, Michal Hocko wrote:
> On Thu 03-08-23 22:18:10, Tetsuo Handa wrote:
>> On 2023/07/31 23:25, Michal Hocko wrote:
>>> On Sat 29-07-23 20:05:43, Tetsuo Handa wrote:
>>>> On 2023/07/29 14:31, Tetsuo Handa wrote:
>>>>> On 2023/07/28 0:10, Sebastian Andrzej Siewior wrote:
>>>>>> On 2023-06-28 21:14:16 [+0900], Tetsuo Handa wrote:
>>>>>>>> Anyway, please do not do this change only because of printk().
>>>>>>>> IMHO, the current ordering is more logical and the printk() problem
>>>>>>>> should be solved another way.
>>>>>>>
>>>>>>> Then, since [PATCH 1/2] cannot be applied, [PATCH 2/2] is automatically
>>>>>>> rejected.
>>>>>>
>>>>>> My understanding is that this patch gets applied and your objection will
>>>>>> be noted.
>>>>>
>>>>> My preference is that zonelist_update_seq is not checked by !__GFP_DIRECT_RECLAIM
>>>>> allocations, which is a low-hanging fruit towards GFP_LOCKLESS mentioned at
>>>>> https://lkml.kernel.org/r/ZG3+l4qcCWTPtSMD@dhcp22.suse.cz and
>>>>> https://lkml.kernel.org/r/ZJWWpGZMJIADQvRS@dhcp22.suse.cz .
>>>>>
>>>>> Maybe we can defer checking zonelist_update_seq till retry check like below,
>>>>> for this is really an infrequent event.
>>>>>
>>>>
>>>> An updated version with comments added.
>>>
>>> Seriously, don't you see how hairy all this is? And for what? Nitpicking
>>> something that doesn't seem to be a real problem in the first place?
>>
>> Seriously, can't you find "zonelist_update_seq is not checked by !__GFP_DIRECT_RECLAIM
>> allocations, which is a low-hanging fruit towards GFP_LOCKLESS" !?
> 
> I do not think we have concluded that we want to support GFP_LOCKLESS.
> This might be trivial straightforward now but it imposes some constrains
> for future maintainability. So far we haven't heard about many usecases
> where this would be needed and a single one is not sufficient IMHO.

When you introduced the word GFP_LOCKLESS in the link above, I was wondering about
the meaning of the "LESS" part. Since we know that it is difficult to achieve "hold 0
locks during memory allocation", "hold least locks during memory allocation" is the
best we can do. Therefore, GFP_LOCKLESS is as misleading a name as GFP_ATOMIC.
GFP_LOCK_LEAST or GFP_LEAST_LOCKS would represent the real behavior better.

Like I said

  I consider that memory allocations which do not do direct reclaim should be geared
  towards less locking dependency.

in the thread above, I still believe that this what-you-call-hairy version (which
matches "hold least locks during memory allocation" direction) is better than
"[PATCH v3 2/2] mm/page_alloc: Use write_seqlock_irqsave() instead write_seqlock() + local_irq_save()."
(which does not match "hold least locks during memory allocation"). My version not
only avoids the possibility of a deadlock, but also makes zonelist_iter_begin() faster and
simpler.

Not holding the zonelist_update_seq lock is trivially doable compared to removing
__GFP_KSWAPD_RECLAIM from GFP_ATOMIC. Please give me feedback about which line of my
proposal is technically unsafe, instead of discarding my proposal with negative words
like "hairy".
Michal Hocko Aug. 7, 2023, 8:20 a.m. UTC | #21
On Fri 04-08-23 22:27:22, Tetsuo Handa wrote:
> On 2023/08/03 23:49, Michal Hocko wrote:
> > On Thu 03-08-23 22:18:10, Tetsuo Handa wrote:
> >> On 2023/07/31 23:25, Michal Hocko wrote:
> >>> On Sat 29-07-23 20:05:43, Tetsuo Handa wrote:
> >>>> On 2023/07/29 14:31, Tetsuo Handa wrote:
> >>>>> On 2023/07/28 0:10, Sebastian Andrzej Siewior wrote:
> >>>>>> On 2023-06-28 21:14:16 [+0900], Tetsuo Handa wrote:
> >>>>>>>> Anyway, please do not do this change only because of printk().
> >>>>>>>> IMHO, the current ordering is more logical and the printk() problem
> >>>>>>>> should be solved another way.
> >>>>>>>
> >>>>>>> Then, since [PATCH 1/2] cannot be applied, [PATCH 2/2] is automatically
> >>>>>>> rejected.
> >>>>>>
> >>>>>> My understanding is that this patch gets applied and your objection will
> >>>>>> be noted.
> >>>>>
> >>>>> My preference is that zonelist_update_seq is not checked by !__GFP_DIRECT_RECLAIM
> >>>>> allocations, which is a low-hanging fruit towards GFP_LOCKLESS mentioned at
> >>>>> https://lkml.kernel.org/r/ZG3+l4qcCWTPtSMD@dhcp22.suse.cz and
> >>>>> https://lkml.kernel.org/r/ZJWWpGZMJIADQvRS@dhcp22.suse.cz .
> >>>>>
> >>>>> Maybe we can defer checking zonelist_update_seq till retry check like below,
> >>>>> for this is really an infrequent event.
> >>>>>
> >>>>
> >>>> An updated version with comments added.
> >>>
> >>> Seriously, don't you see how hairy all this is? And for what? Nitpicking
> >>> something that doesn't seem to be a real problem in the first place?
> >>
> >> Seriously, can't you find "zonelist_update_seq is not checked by !__GFP_DIRECT_RECLAIM
> >> allocations, which is a low-hanging fruit towards GFP_LOCKLESS" !?
> > 
> > I do not think we have concluded that we want to support GFP_LOCKLESS.
> > This might be trivial straightforward now but it imposes some constrains
> > for future maintainability. So far we haven't heard about many usecases
> > where this would be needed and a single one is not sufficient IMHO.
> 
> When you introduced a word GFP_LOCKLESS in the link above, I was wondering the meaning
> of "LESS" part. Since we know that it is difficult to achieve "hold 0 lock during memory
> allocation", "hold least locks during memory allocation" will be at best. Therefore,
> GFP_LOCKLESS is as misleading name as GFP_ATOMIC. GFP_LOCK_LEAST or GFP_LEAST_LOCKS will
> represent the real behavior better.

I am not sure I understand what "least locks" actually means. I guess what
you wanted to say is that there are no locks or other synchronization
means with external visibility/dependencies used. In other words, a mode
which can be called from any locking context. That would certainly be
possible, and whether any internal locks are used or not is an
implementation detail as long as the no-external-visibility/dependencies
rule is held. I do not really want to start the naming discussion as it
is not really clear we want/need to provide such a strong guarantee.

Patch

diff --git a/include/linux/seqlock.h b/include/linux/seqlock.h
index 3926e90279477..d778af83c8f36 100644
--- a/include/linux/seqlock.h
+++ b/include/linux/seqlock.h
@@ -512,8 +512,8 @@  do {									\
 
 static inline void do_write_seqcount_begin_nested(seqcount_t *s, int subclass)
 {
-	do_raw_write_seqcount_begin(s);
 	seqcount_acquire(&s->dep_map, subclass, 0, _RET_IP_);
+	do_raw_write_seqcount_begin(s);
 }
 
 /**