net/skbuff: silence warnings under memory pressure
diff mbox series

Message ID 1567177025-11016-1-git-send-email-cai@lca.pw
State New
Headers show
Series
  • net/skbuff: silence warnings under memory pressure
Related show

Commit Message

Qian Cai Aug. 30, 2019, 2:57 p.m. UTC
When running heavy memory pressure workloads, the system is throwing
endless warnings below due to the allocation could fail from
__build_skb(), and the volume of this call could be huge which may
generate a lot of serial console output and cosumes all CPUs as
warn_alloc() could be expensive by calling dump_stack() and then
show_mem().

Fix it by silencing the warning in this call site. Also, it seems
unnecessary to even print a warning at all if the allocation failed in
__build_skb(), as it may just retransmit the packet and retry.

NMI watchdog: Watchdog detected hard LOCKUP on cpu 7
Hardware name: HP ProLiant XL420 Gen9/ProLiant XL420 Gen9, BIOS U19
12/27/2015
RIP: 0010:dump_stack+0xd/0x9a
Code: 5d c3 48 c7 c2 c0 ce aa bb 4c 89 ee 48 c7 c7 60 32 e3 ae e8 b3 3e
81 ff e9 ab ff ff ff 55 48 89 e5 41 55 41 83 cd ff 41 54 53 <9c> 41 5c
fa be 04 00 00 00 48 c7 c7 80 42 63 af 65 8b 1d f6 7c 8c
RSP: 0018:ffff888452389670 EFLAGS: 00000086
RAX: 000000000000000b RBX: 0000000000000007 RCX: ffffffffae75143f
RDX: 0000000000000001 RSI: 0000000000000004 RDI: ffffffffaf634280
RBP: ffff888452389688 R08: 0000000000000004 R09: fffffbfff5ec6850
R10: fffffbfff5ec6850 R11: 0000000000000003 R12: 0000000000000086
R13: 00000000ffffffff R14: ffff88820547c040 R15: ffffffffaecc86a0
FS:  0000000000000000(0000) GS:ffff888452380000(0000)
knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f94f0537000 CR3: 00000005c8012006 CR4: 00000000001606a0
Call Trace:
 <IRQ>
 warn_alloc.cold.43+0x8a/0x148
 __alloc_pages_nodemask+0x1a5c/0x1bb0
 alloc_pages_current+0x9c/0x110
 allocate_slab+0x34a/0x11f0
 new_slab+0x46/0x70
 ___slab_alloc+0x604/0x950
 __slab_alloc+0x12/0x20
 kmem_cache_alloc+0x32a/0x400
 __build_skb+0x23/0x60
 build_skb+0x1a/0xb0
 igb_clean_rx_irq+0xafc/0x1010 [igb]
 igb_poll+0x4bb/0xe30 [igb]
 net_rx_action+0x244/0x7a0
 __do_softirq+0x1a0/0x60a
 irq_exit+0xb5/0xd0
 do_IRQ+0x81/0x170
 common_interrupt+0xf/0xf
 </IRQ>
RIP: 0010:cpuidle_enter_state+0x151/0x8d0
Code: 64 af e8 62 22 c2 ff 8b 05 04 f6 0b 01 85 c0 0f 8f 18 04 00 00 31
ff e8 9d 9e 97 ff 80 7d d0 00 0f 85 06 02 00 00 fb 45 85 ed <0f> 88 2d
02 00 00 4d 63 fd 49 83 ff 09 0f 87 8c 06 00 00 4b 8d 04
RSP: 0018:ffff888205487cf8 EFLAGS: 00000202 ORIG_RAX: ffffffffffffffdc
RAX: 0000000000000000 RBX: ffffe8fc09596f98 RCX: ffffffffadf093da
RDX: dffffc0000000000 RSI: dffffc0000000000 RDI: ffff8884523c2128
RBP: ffff888205487d48 R08: fffffbfff5ec9d62 R09: fffffbfff5ec9d62
R10: fffffbfff5ec9d61 R11: ffffffffaf64eb0b R12: ffffffffaf459580
R13: 0000000000000004 R14: 0000101fb3d64a9b R15: ffffe8fc09596f9c
 cpuidle_enter+0x41/0x70
 call_cpuidle+0x5e/0x90
 do_idle+0x313/0x340
 cpu_startup_entry+0x1d/0x1f
 start_secondary+0x28b/0x330
 secondary_startup_64+0xb6/0xc0

Signed-off-by: Qian Cai <cai@lca.pw>
---
 net/core/skbuff.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

Comments

Eric Dumazet Aug. 30, 2019, 3:11 p.m. UTC | #1
On 8/30/19 4:57 PM, Qian Cai wrote:
> When running heavy memory pressure workloads, the system is throwing
> endless warnings below due to the allocation could fail from
> __build_skb(), and the volume of this call could be huge which may
> generate a lot of serial console output and cosumes all CPUs as
> warn_alloc() could be expensive by calling dump_stack() and then
> show_mem().
> 
> Fix it by silencing the warning in this call site. Also, it seems
> unnecessary to even print a warning at all if the allocation failed in
> __build_skb(), as it may just retransmit the packet and retry.
> 

Same patches are showing up there and there from time to time.

Why is this particular spot interesting, against all others not adding __GFP_NOWARN ?

Are we going to have hundred of patches adding __GFP_NOWARN at various points,
or should we get something generic to not flood the syslog in case of memory pressure ?
Qian Cai Aug. 30, 2019, 3:25 p.m. UTC | #2
On Fri, 2019-08-30 at 17:11 +0200, Eric Dumazet wrote:
> 
> On 8/30/19 4:57 PM, Qian Cai wrote:
> > When running heavy memory pressure workloads, the system is throwing
> > endless warnings below due to the allocation could fail from
> > __build_skb(), and the volume of this call could be huge which may
> > generate a lot of serial console output and cosumes all CPUs as
> > warn_alloc() could be expensive by calling dump_stack() and then
> > show_mem().
> > 
> > Fix it by silencing the warning in this call site. Also, it seems
> > unnecessary to even print a warning at all if the allocation failed in
> > __build_skb(), as it may just retransmit the packet and retry.
> > 
> 
> Same patches are showing up there and there from time to time.
> 
> Why is this particular spot interesting, against all others not adding
> __GFP_NOWARN ?
> 
> Are we going to have hundred of patches adding __GFP_NOWARN at various points,
> or should we get something generic to not flood the syslog in case of memory
> pressure ?
> 

From my testing which uses LTP oom* tests. There are only 3 places need to be
patched. The other two are in IOMMU code for both Intel and AMD. The place is
particular interesting because it could cause the system with floating serial
console output for days without making progress in OOM. I suppose it ends up in
a looping condition that warn_alloc() would end up generating more calls into
__build_skb() via ksoftirqd.
Eric Dumazet Aug. 30, 2019, 4:15 p.m. UTC | #3
On 8/30/19 5:25 PM, Qian Cai wrote:
> On Fri, 2019-08-30 at 17:11 +0200, Eric Dumazet wrote:
>>
>> On 8/30/19 4:57 PM, Qian Cai wrote:
>>> When running heavy memory pressure workloads, the system is throwing
>>> endless warnings below due to the allocation could fail from
>>> __build_skb(), and the volume of this call could be huge which may
>>> generate a lot of serial console output and cosumes all CPUs as
>>> warn_alloc() could be expensive by calling dump_stack() and then
>>> show_mem().
>>>
>>> Fix it by silencing the warning in this call site. Also, it seems
>>> unnecessary to even print a warning at all if the allocation failed in
>>> __build_skb(), as it may just retransmit the packet and retry.
>>>
>>
>> Same patches are showing up there and there from time to time.
>>
>> Why is this particular spot interesting, against all others not adding
>> __GFP_NOWARN ?
>>
>> Are we going to have hundred of patches adding __GFP_NOWARN at various points,
>> or should we get something generic to not flood the syslog in case of memory
>> pressure ?
>>
> 
> From my testing which uses LTP oom* tests. There are only 3 places need to be
> patched. The other two are in IOMMU code for both Intel and AMD. The place is
> particular interesting because it could cause the system with floating serial
> console output for days without making progress in OOM. I suppose it ends up in
> a looping condition that warn_alloc() would end up generating more calls into
> __build_skb() via ksoftirqd.
> 

Yes, but what about other tests done by other people ?

You do not really answer my last question, which was really the point I tried
to make.

If there is a risk of flooding the syslog, we should fix this generically
in mm layer, not adding hundred of __GFP_NOWARN all over the places.

Maybe just make __GFP_NOWARN the default, I dunno.
Qian Cai Aug. 30, 2019, 6:06 p.m. UTC | #4
On Fri, 2019-08-30 at 18:15 +0200, Eric Dumazet wrote:
> 
> On 8/30/19 5:25 PM, Qian Cai wrote:
> > On Fri, 2019-08-30 at 17:11 +0200, Eric Dumazet wrote:
> > > 
> > > On 8/30/19 4:57 PM, Qian Cai wrote:
> > > > When running heavy memory pressure workloads, the system is throwing
> > > > endless warnings below due to the allocation could fail from
> > > > __build_skb(), and the volume of this call could be huge which may
> > > > generate a lot of serial console output and cosumes all CPUs as
> > > > warn_alloc() could be expensive by calling dump_stack() and then
> > > > show_mem().
> > > > 
> > > > Fix it by silencing the warning in this call site. Also, it seems
> > > > unnecessary to even print a warning at all if the allocation failed in
> > > > __build_skb(), as it may just retransmit the packet and retry.
> > > > 
> > > 
> > > Same patches are showing up there and there from time to time.
> > > 
> > > Why is this particular spot interesting, against all others not adding
> > > __GFP_NOWARN ?
> > > 
> > > Are we going to have hundred of patches adding __GFP_NOWARN at various
> > > points,
> > > or should we get something generic to not flood the syslog in case of
> > > memory
> > > pressure ?
> > > 
> > 
> > From my testing which uses LTP oom* tests. There are only 3 places need to
> > be
> > patched. The other two are in IOMMU code for both Intel and AMD. The place
> > is
> > particular interesting because it could cause the system with floating
> > serial
> > console output for days without making progress in OOM. I suppose it ends up
> > in
> > a looping condition that warn_alloc() would end up generating more calls
> > into
> > __build_skb() via ksoftirqd.
> > 
> 
> Yes, but what about other tests done by other people ?

Sigh, I don't know what tests do you have in mind. I tried many memory pressure
tests including LTP, stress-ng, and mmtests etc running for years. I don't
recall see other places that could loop like this for days.

> 
> You do not really answer my last question, which was really the point I tried
> to make.
> 
> If there is a risk of flooding the syslog, we should fix this generically
> in mm layer, not adding hundred of __GFP_NOWARN all over the places.
> 
> Maybe just make __GFP_NOWARN the default, I dunno.

I don't really see how it could end up with adding hundred of _GFP_NOWARN in the
kernel code. If there is really a hundred places could loop like this, it may
make more sense looking into a general solution.
Vlastimil Babka Sept. 2, 2019, 2:24 p.m. UTC | #5
On 8/30/19 5:25 PM, Qian Cai wrote:
> On Fri, 2019-08-30 at 17:11 +0200, Eric Dumazet wrote:
>>
>> On 8/30/19 4:57 PM, Qian Cai wrote:
>>> When running heavy memory pressure workloads, the system is throwing
>>> endless warnings below due to the allocation could fail from
>>> __build_skb(), and the volume of this call could be huge which may
>>> generate a lot of serial console output and cosumes all CPUs as
>>> warn_alloc() could be expensive by calling dump_stack() and then
>>> show_mem().
>>>
>>> Fix it by silencing the warning in this call site. Also, it seems
>>> unnecessary to even print a warning at all if the allocation failed in
>>> __build_skb(), as it may just retransmit the packet and retry.
>>>

Well, __GFP_NOWARN would save me from explaining this warning to users
many times. OTOH usually it's an indication that min_free_kbytes should
be raised to better cope with network traffic.

>>
>> Same patches are showing up there and there from time to time.
>>
>> Why is this particular spot interesting, against all others not adding
>> __GFP_NOWARN ?

This one is interesting that it's a GFP_ATOMIC allocation triggered by
incoming packets, and has a fallback mechanism. I don't recall other so
notoric ones.

>> Are we going to have hundred of patches adding __GFP_NOWARN at various points,
>> or should we get something generic to not flood the syslog in case of memory
>> pressure ?
>>
> 
> From my testing which uses LTP oom* tests. There are only 3 places need to be
> patched. The other two are in IOMMU code for both Intel and AMD. The place is
> particular interesting because it could cause the system with floating serial
> console output for days without making progress in OOM. I suppose it ends up in
> a looping condition that warn_alloc() would end up generating more calls into
> __build_skb() via ksoftirqd.

Regardless of this particular allocation, if the reporting itself makes
the conditions so much worse, then at least some kind of general
ratelimit would make sense indeed.
Qian Cai Sept. 3, 2019, 10:23 a.m. UTC | #6
On Monday, September 2, 2019, Vlastimil Babka <vbabka@suse.cz> wrote:

> On 8/30/19 5:25 PM, Qian Cai wrote:
> > On Fri, 2019-08-30 at 17:11 +0200, Eric Dumazet wrote:
> >>
> >> On 8/30/19 4:57 PM, Qian Cai wrote:
> >>> When running heavy memory pressure workloads, the system is throwing
> >>> endless warnings below due to the allocation could fail from
> >>> __build_skb(), and the volume of this call could be huge which may
> >>> generate a lot of serial console output and cosumes all CPUs as
> >>> warn_alloc() could be expensive by calling dump_stack() and then
> >>> show_mem().
> >>>
> >>> Fix it by silencing the warning in this call site. Also, it seems
> >>> unnecessary to even print a warning at all if the allocation failed in
> >>> __build_skb(), as it may just retransmit the packet and retry.
> >>>
>
> Well, __GFP_NOWARN would save me from explaining this warning to users
> many times. OTOH usually it's an indication that min_free_kbytes should
> be raised to better cope with network traffic.


I think it is just a matter of time that the continuous memory pressure
will trigger the issue again, so raising min_free_kbytes does not sound a
solution in this case.


>
> >>
> >> Same patches are showing up there and there from time to time.
> >>
> >> Why is this particular spot interesting, against all others not adding
> >> __GFP_NOWARN ?
>
> This one is interesting that it's a GFP_ATOMIC allocation triggered by
> incoming packets, and has a fallback mechanism. I don't recall other so
> notoric ones.
>
> >> Are we going to have hundred of patches adding __GFP_NOWARN at various
> points,
> >> or should we get something generic to not flood the syslog in case of
> memory
> >> pressure ?
> >>
> >
> > From my testing which uses LTP oom* tests. There are only 3 places need
> to be
> > patched. The other two are in IOMMU code for both Intel and AMD. The
> place is
> > particular interesting because it could cause the system with floating
> serial
> > console output for days without making progress in OOM. I suppose it
> ends up in
> > a looping condition that warn_alloc() would end up generating more calls
> into
> > __build_skb() via ksoftirqd.
>
> Regardless of this particular allocation, if the reporting itself makes
> the conditions so much worse, then at least some kind of general
> ratelimit would make sense indeed.


There is a ratelimit in warn_alloc(), but that does not help in this case.
It occurs to me it is not the rate of this allocation failure causes the
issue, but rather the possible recursive and pure volume of __build_skb()
is the issue.
Michal Hocko Sept. 3, 2019, 1:22 p.m. UTC | #7
On Fri 30-08-19 18:15:22, Eric Dumazet wrote:
> If there is a risk of flooding the syslog, we should fix this generically
> in mm layer, not adding hundred of __GFP_NOWARN all over the places.

We do already ratelimit in warn_alloc. If it isn't sufficient then we
can think of a different parameters. Or maybe it is the ratelimiting
which doesn't work here. Hard to tell and something to explore.
Qian Cai Sept. 3, 2019, 3:42 p.m. UTC | #8
On Tue, 2019-09-03 at 15:22 +0200, Michal Hocko wrote:
> On Fri 30-08-19 18:15:22, Eric Dumazet wrote:
> > If there is a risk of flooding the syslog, we should fix this generically
> > in mm layer, not adding hundred of __GFP_NOWARN all over the places.
> 
> We do already ratelimit in warn_alloc. If it isn't sufficient then we
> can think of a different parameters. Or maybe it is the ratelimiting
> which doesn't work here. Hard to tell and something to explore.

The time-based ratelimit won't work for skb_build() as when a system under
memory pressure, and the CPU is fast and IO is so slow, it could take a long
time to swap and trigger OOM.

I suppose what happens is those skb_build() allocations are from softirq, and
once one of them failed, it calls printk() which generates more interrupts.
Hence, the infinite loop.
Michal Hocko Sept. 3, 2019, 6:53 p.m. UTC | #9
On Tue 03-09-19 11:42:22, Qian Cai wrote:
> On Tue, 2019-09-03 at 15:22 +0200, Michal Hocko wrote:
> > On Fri 30-08-19 18:15:22, Eric Dumazet wrote:
> > > If there is a risk of flooding the syslog, we should fix this generically
> > > in mm layer, not adding hundred of __GFP_NOWARN all over the places.
> > 
> > We do already ratelimit in warn_alloc. If it isn't sufficient then we
> > can think of a different parameters. Or maybe it is the ratelimiting
> > which doesn't work here. Hard to tell and something to explore.
> 
> The time-based ratelimit won't work for skb_build() as when a system under
> memory pressure, and the CPU is fast and IO is so slow, it could take a long
> time to swap and trigger OOM.

I really do not understand what does OOM and swapping have to do with
the ratelimiting here. The sole purpose of the ratelimit is to reduce
the amount of warnings to be printed. Slow IO might have an effect on
when the OOM killer is invoked but atomic allocations are not directly
dependent on IO.

> I suppose what happens is those skb_build() allocations are from softirq, and
> once one of them failed, it calls printk() which generates more interrupts.
> Hence, the infinite loop.

Please elaborate more.
Qian Cai Sept. 3, 2019, 9:42 p.m. UTC | #10
On Tue, 2019-09-03 at 20:53 +0200, Michal Hocko wrote:
> On Tue 03-09-19 11:42:22, Qian Cai wrote:
> > On Tue, 2019-09-03 at 15:22 +0200, Michal Hocko wrote:
> > > On Fri 30-08-19 18:15:22, Eric Dumazet wrote:
> > > > If there is a risk of flooding the syslog, we should fix this
> > > > generically
> > > > in mm layer, not adding hundred of __GFP_NOWARN all over the places.
> > > 
> > > We do already ratelimit in warn_alloc. If it isn't sufficient then we
> > > can think of a different parameters. Or maybe it is the ratelimiting
> > > which doesn't work here. Hard to tell and something to explore.
> > 
> > The time-based ratelimit won't work for skb_build() as when a system under
> > memory pressure, and the CPU is fast and IO is so slow, it could take a long
> > time to swap and trigger OOM.
> 
> I really do not understand what does OOM and swapping have to do with
> the ratelimiting here. The sole purpose of the ratelimit is to reduce
> the amount of warnings to be printed. Slow IO might have an effect on
> when the OOM killer is invoked but atomic allocations are not directly
> dependent on IO.

When there is a heavy memory pressure, the system is trying hard to reclaim
memory to fill up the watermark. However, the IO is slow to page out, but the
memory pressure keep draining atomic reservoir, and some of those skb_build()
will fail eventually.

Only if there is a fast IO, it will finish swapping sooner and then invoke the
OOM to end the memory pressure.

> 
> > I suppose what happens is those skb_build() allocations are from softirq,
> > and
> > once one of them failed, it calls printk() which generates more interrupts.
> > Hence, the infinite loop.
> 
> Please elaborate more.
> 

If you look at the original report, the failed allocation dump_stack() is,

 <IRQ>
 warn_alloc.cold.43+0x8a/0x148
 __alloc_pages_nodemask+0x1a5c/0x1bb0
 alloc_pages_current+0x9c/0x110
 allocate_slab+0x34a/0x11f0
 new_slab+0x46/0x70
 ___slab_alloc+0x604/0x950
 __slab_alloc+0x12/0x20
 kmem_cache_alloc+0x32a/0x400
 __build_skb+0x23/0x60
 build_skb+0x1a/0xb0
 igb_clean_rx_irq+0xafc/0x1010 [igb]
 igb_poll+0x4bb/0xe30 [igb]
 net_rx_action+0x244/0x7a0
 __do_softirq+0x1a0/0x60a
 irq_exit+0xb5/0xd0
 do_IRQ+0x81/0x170
 common_interrupt+0xf/0xf
 </IRQ>

Since it has no __GFP_NOWARN to begin with, it will call,

printk
  vprintk_default
    vprintk_emit
      wake_up_klogd
        irq_work_queue
          __irq_work_queue_local
            arch_irq_work_raise
              apic->send_IPI_self(IRQ_WORK_VECTOR)
                smp_irq_work_interrupt
                  exiting_irq
                    irq_exit

and end up processing pending net_rx_action softirqs again which are plenty due
to connected via ssh etc.
Michal Hocko Sept. 4, 2019, 6:15 a.m. UTC | #11
Cc printk maintainers

On Tue 03-09-19 17:42:28, Qian Cai wrote:
> > > I suppose what happens is those skb_build() allocations are from softirq,
> > > and
> > > once one of them failed, it calls printk() which generates more interrupts.
> > > Hence, the infinite loop.
> > 
> > Please elaborate more.
> > 
> 
> If you look at the original report, the failed allocation dump_stack() is,
> 
>  <IRQ>
>  warn_alloc.cold.43+0x8a/0x148
>  __alloc_pages_nodemask+0x1a5c/0x1bb0
>  alloc_pages_current+0x9c/0x110
>  allocate_slab+0x34a/0x11f0
>  new_slab+0x46/0x70
>  ___slab_alloc+0x604/0x950
>  __slab_alloc+0x12/0x20
>  kmem_cache_alloc+0x32a/0x400
>  __build_skb+0x23/0x60
>  build_skb+0x1a/0xb0
>  igb_clean_rx_irq+0xafc/0x1010 [igb]
>  igb_poll+0x4bb/0xe30 [igb]
>  net_rx_action+0x244/0x7a0
>  __do_softirq+0x1a0/0x60a
>  irq_exit+0xb5/0xd0
>  do_IRQ+0x81/0x170
>  common_interrupt+0xf/0xf
>  </IRQ>
> 
> Since it has no __GFP_NOWARN to begin with, it will call,
> 
> printk
>   vprintk_default
>     vprintk_emit
>       wake_up_klogd
>         irq_work_queue
>           __irq_work_queue_local
>             arch_irq_work_raise
>               apic->send_IPI_self(IRQ_WORK_VECTOR)
>                 smp_irq_work_interrupt
>                   exiting_irq
>                     irq_exit
> 
> and end up processing pending net_rx_action softirqs again which are plenty due
> to connected via ssh etc.
Michal Hocko Sept. 4, 2019, 6:15 a.m. UTC | #12
> On Tue, 2019-09-03 at 20:53 +0200, Michal Hocko wrote:
> > On Tue 03-09-19 11:42:22, Qian Cai wrote:
> > > On Tue, 2019-09-03 at 15:22 +0200, Michal Hocko wrote:
> > > > On Fri 30-08-19 18:15:22, Eric Dumazet wrote:
> > > > > If there is a risk of flooding the syslog, we should fix this
> > > > > generically
> > > > > in mm layer, not adding hundred of __GFP_NOWARN all over the places.
> > > > 
> > > > We do already ratelimit in warn_alloc. If it isn't sufficient then we
> > > > can think of a different parameters. Or maybe it is the ratelimiting
> > > > which doesn't work here. Hard to tell and something to explore.
> > > 
> > > The time-based ratelimit won't work for skb_build() as when a system under
> > > memory pressure, and the CPU is fast and IO is so slow, it could take a long
> > > time to swap and trigger OOM.
> > 
> > I really do not understand what does OOM and swapping have to do with
> > the ratelimiting here. The sole purpose of the ratelimit is to reduce
> > the amount of warnings to be printed. Slow IO might have an effect on
> > when the OOM killer is invoked but atomic allocations are not directly
> > dependent on IO.
> 
> When there is a heavy memory pressure, the system is trying hard to reclaim
> memory to fill up the watermark. However, the IO is slow to page out, but the
> memory pressure keep draining atomic reservoir, and some of those skb_build()
> will fail eventually.

Yes this is true but this has nothing to do with the ratelimitted
warn_alloc AFAICS. It is natural that atomic allocations are going
to fail more likely under extreme memory pressure but we are talking
about an excessive amount of debugging output that is generated and
that should be throttled. And that's why we have ratelimit there. If it
doesn't work well then we should look into why.

> Only if there is a fast IO, it will finish swapping sooner and then invoke the
> OOM to end the memory pressure.
Sergey Senozhatsky Sept. 4, 2019, 6:41 a.m. UTC | #13
On (09/04/19 08:15), Michal Hocko wrote:
> > If you look at the original report, the failed allocation dump_stack() is,
> > 
> >  <IRQ>
> >  warn_alloc.cold.43+0x8a/0x148
> >  __alloc_pages_nodemask+0x1a5c/0x1bb0
> >  alloc_pages_current+0x9c/0x110
> >  allocate_slab+0x34a/0x11f0
> >  new_slab+0x46/0x70
> >  ___slab_alloc+0x604/0x950
> >  __slab_alloc+0x12/0x20
> >  kmem_cache_alloc+0x32a/0x400
> >  __build_skb+0x23/0x60
> >  build_skb+0x1a/0xb0
> >  igb_clean_rx_irq+0xafc/0x1010 [igb]
> >  igb_poll+0x4bb/0xe30 [igb]
> >  net_rx_action+0x244/0x7a0
> >  __do_softirq+0x1a0/0x60a
> >  irq_exit+0xb5/0xd0
> >  do_IRQ+0x81/0x170
> >  common_interrupt+0xf/0xf
> >  </IRQ>
> > 
> > Since it has no __GFP_NOWARN to begin with, it will call,

I think that DEFAULT_RATELIMIT_INTERVAL and DEFAULT_RATELIMIT_BURST
are good when we ratelimit just a single printk() call, so the ratelimit
is "max 10 kernel log lines in 5 seconds".

But the thing is different in case of dump_stack() + show_mem() +
some other output. Because now we ratelimit not a single printk() line,
but hundreds of them. The ratelimit becomes - 10 * $$$ lines in 5 seconds
(IOW, now we talk about thousands of lines). Significantly more permissive
ratelimiting.

	-ss
Michal Hocko Sept. 4, 2019, 6:54 a.m. UTC | #14
On Wed 04-09-19 15:41:44, Sergey Senozhatsky wrote:
> On (09/04/19 08:15), Michal Hocko wrote:
> > > If you look at the original report, the failed allocation dump_stack() is,
> > > 
> > >  <IRQ>
> > >  warn_alloc.cold.43+0x8a/0x148
> > >  __alloc_pages_nodemask+0x1a5c/0x1bb0
> > >  alloc_pages_current+0x9c/0x110
> > >  allocate_slab+0x34a/0x11f0
> > >  new_slab+0x46/0x70
> > >  ___slab_alloc+0x604/0x950
> > >  __slab_alloc+0x12/0x20
> > >  kmem_cache_alloc+0x32a/0x400
> > >  __build_skb+0x23/0x60
> > >  build_skb+0x1a/0xb0
> > >  igb_clean_rx_irq+0xafc/0x1010 [igb]
> > >  igb_poll+0x4bb/0xe30 [igb]
> > >  net_rx_action+0x244/0x7a0
> > >  __do_softirq+0x1a0/0x60a
> > >  irq_exit+0xb5/0xd0
> > >  do_IRQ+0x81/0x170
> > >  common_interrupt+0xf/0xf
> > >  </IRQ>
> > > 
> > > Since it has no __GFP_NOWARN to begin with, it will call,
> 
> I think that DEFAULT_RATELIMIT_INTERVAL and DEFAULT_RATELIMIT_BURST
> are good when we ratelimit just a single printk() call, so the ratelimit
> is "max 10 kernel log lines in 5 seconds".

I am sorry, I could have been more explicit when CCing you. Sure the
ratelimit is part of the problem. But I was more interested in the
potential livelock (infinite loop) mentioned by Qian Cai. It is not
important whether we generate one or more lines of output from the
softirq context as long as the printk generates more irq processing
which might end up doing the same. Is this really possible?
Sergey Senozhatsky Sept. 4, 2019, 7 a.m. UTC | #15
On (09/04/19 15:41), Sergey Senozhatsky wrote:
> But the thing is different in case of dump_stack() + show_mem() +
> some other output. Because now we ratelimit not a single printk() line,
> but hundreds of them. The ratelimit becomes - 10 * $$$ lines in 5 seconds
> (IOW, now we talk about thousands of lines).

And on devices with slow serial consoles this can be somewhat close to
"no ratelimit". *Suppose* that warn_alloc() adds 700 lines each time.
Within 5 seconds we can call warn_alloc() 10 times, which will add 7000
lines to the logbuf. If printk() can evict only 6000 lines in 5 seconds
then we have a growing number of pending logbuf messages.

	-ss
Sergey Senozhatsky Sept. 4, 2019, 7:19 a.m. UTC | #16
On (09/04/19 08:54), Michal Hocko wrote:
> I am sorry, I could have been more explicit when CCing you.

Oh, sorry! My bad!

> Sure the ratelimit is part of the problem. But I was more interested
> in the potential livelock (infinite loop) mentioned by Qian Cai. It
> is not important whether we generate one or more lines of output from
> the softirq context as long as the printk generates more irq processing
> which might end up doing the same. Is this really possible?

Hmm. I need to look at this more... wake_up_klogd() queues work only once
on particular CPU: irq_work_queue(this_cpu_ptr(&wake_up_klogd_work));

bool irq_work_queue()
{
	/* Only queue if not already pending */
	if (!irq_work_claim(work))
		return false;

	 __irq_work_queue_local(work);
}

softirqs are processed in batches, right? The softirq batch can add XXXX
lines to printk logbuf, but there will be only one PRINTK_PENDING_WAKEUP
queued. Qian Cai mentioned that "net_rx_action softirqs again which are
plenty due to connected via ssh etc." so the proportion still seems to be
N:1 - we process N softirqs, add 1 printk irq_work.

But need to think more.
Interesting question.

	-ss
Sergey Senozhatsky Sept. 4, 2019, 7:43 a.m. UTC | #17
On (09/04/19 16:19), Sergey Senozhatsky wrote:
> Hmm. I need to look at this more... wake_up_klogd() queues work only once
> on particular CPU: irq_work_queue(this_cpu_ptr(&wake_up_klogd_work));
> 
> bool irq_work_queue()
> {
> 	/* Only queue if not already pending */
> 	if (!irq_work_claim(work))
> 		return false;
> 
> 	 __irq_work_queue_local(work);
> }

Plus one more check - waitqueue_active(&log_wait). printk() adds
pending irq_work only if there is a user-space process sleeping on
log_wait and irq_work is not already scheduled. If the syslog is
active or there is noone to wakeup then we don't queue irq_work.

	-ss
Michal Hocko Sept. 4, 2019, 8:25 a.m. UTC | #18
On Wed 04-09-19 16:00:42, Sergey Senozhatsky wrote:
> On (09/04/19 15:41), Sergey Senozhatsky wrote:
> > But the thing is different in case of dump_stack() + show_mem() +
> > some other output. Because now we ratelimit not a single printk() line,
> > but hundreds of them. The ratelimit becomes - 10 * $$$ lines in 5 seconds
> > (IOW, now we talk about thousands of lines).
> 
> And on devices with slow serial consoles this can be somewhat close to
> "no ratelimit". *Suppose* that warn_alloc() adds 700 lines each time.
> Within 5 seconds we can call warn_alloc() 10 times, which will add 7000
> lines to the logbuf. If printk() can evict only 6000 lines in 5 seconds
> then we have a growing number of pending logbuf messages.

Yes, ratelimit is problematic when the ratelimited operation is slow. I
guess that is a well known problem and we would need to rework both the
api and the implementation to make it work in those cases as well.
Essentially we need to make the ratelimit act as a gatekeeper to an
operation section - something like a critical section except you can
tolerate more code executions but not too many. So effectively

	start_throttle(rate, number);
	/* here goes your operation */
	end_throttle();

one operation is not considered done until the whole section ends.
Or something along those lines.

In this particular case we can increase the rate limit parameters of
course but I think that longterm we need a better api.
Qian Cai Sept. 4, 2019, 11:59 a.m. UTC | #19
On Wed, 2019-09-04 at 10:25 +0200, Michal Hocko wrote:
> On Wed 04-09-19 16:00:42, Sergey Senozhatsky wrote:
> > On (09/04/19 15:41), Sergey Senozhatsky wrote:
> > > But the thing is different in case of dump_stack() + show_mem() +
> > > some other output. Because now we ratelimit not a single printk() line,
> > > but hundreds of them. The ratelimit becomes - 10 * $$$ lines in 5 seconds
> > > (IOW, now we talk about thousands of lines).
> > 
> > And on devices with slow serial consoles this can be somewhat close to
> > "no ratelimit". *Suppose* that warn_alloc() adds 700 lines each time.
> > Within 5 seconds we can call warn_alloc() 10 times, which will add 7000
> > lines to the logbuf. If printk() can evict only 6000 lines in 5 seconds
> > then we have a growing number of pending logbuf messages.
> 
> Yes, ratelimit is problematic when the ratelimited operation is slow. I
> guess that is a well known problem and we would need to rework both the
> api and the implementation to make it work in those cases as well.
> Essentially we need to make the ratelimit act as a gatekeeper to an
> operation section - something like a critical section except you can
> tolerate more code executions but not too many. So effectively
> 
> 	start_throttle(rate, number);
> 	/* here goes your operation */
> 	end_throttle();
> 
> one operation is not considered done until the whole section ends.
> Or something along those lines.
> 
> In this particular case we can increase the rate limit parameters of
> course but I think that longterm we need a better api.

The problem is when a system is under heavy memory pressure, everything is
becoming slower, so I don't know how to come up with a sane default for rate
limit parameters as a generic solution that would work for every machine out
there. Sure, it is possible to set a limit as low as possible that would work
for the majority of systems apart from people may complain that they are now
missing important warnings, but using __GFP_NOWARN in this code would work for
all systems. You could even argument there is even a separate benefit that it
could reduce the noise-level overall from those build_skb() allocation failures
as it has a fall-back mechanism anyway.
Michal Hocko Sept. 4, 2019, 12:07 p.m. UTC | #20
On Wed 04-09-19 07:59:17, Qian Cai wrote:
> On Wed, 2019-09-04 at 10:25 +0200, Michal Hocko wrote:
> > On Wed 04-09-19 16:00:42, Sergey Senozhatsky wrote:
> > > On (09/04/19 15:41), Sergey Senozhatsky wrote:
> > > > But the thing is different in case of dump_stack() + show_mem() +
> > > > some other output. Because now we ratelimit not a single printk() line,
> > > > but hundreds of them. The ratelimit becomes - 10 * $$$ lines in 5 seconds
> > > > (IOW, now we talk about thousands of lines).
> > > 
> > > And on devices with slow serial consoles this can be somewhat close to
> > > "no ratelimit". *Suppose* that warn_alloc() adds 700 lines each time.
> > > Within 5 seconds we can call warn_alloc() 10 times, which will add 7000
> > > lines to the logbuf. If printk() can evict only 6000 lines in 5 seconds
> > > then we have a growing number of pending logbuf messages.
> > 
> > Yes, ratelimit is problematic when the ratelimited operation is slow. I
> > guess that is a well known problem and we would need to rework both the
> > api and the implementation to make it work in those cases as well.
> > Essentially we need to make the ratelimit act as a gatekeeper to an
> > operation section - something like a critical section except you can
> > tolerate more code executions but not too many. So effectively
> > 
> > 	start_throttle(rate, number);
> > 	/* here goes your operation */
> > 	end_throttle();
> > 
> > one operation is not considered done until the whole section ends.
> > Or something along those lines.
> > 
> > In this particular case we can increase the rate limit parameters of
> > course but I think that longterm we need a better api.
> 
> The problem is when a system is under heavy memory pressure, everything is
> becoming slower, so I don't know how to come up with a sane default for rate
> limit parameters as a generic solution that would work for every machine out
> there. Sure, it is possible to set a limit as low as possible that would work
> for the majority of systems apart from people may complain that they are now
> missing important warnings, but using __GFP_NOWARN in this code would work for
> all systems. You could even argument there is even a separate benefit that it
> could reduce the noise-level overall from those build_skb() allocation failures
> as it has a fall-back mechanism anyway.

As Vlastimil already pointed out, __GFP_NOWARN would hide that reserves
might be configured too low.
Qian Cai Sept. 4, 2019, 12:14 p.m. UTC | #21
On Wed, 2019-09-04 at 16:43 +0900, Sergey Senozhatsky wrote:
> On (09/04/19 16:19), Sergey Senozhatsky wrote:
> > Hmm. I need to look at this more... wake_up_klogd() queues work only once
> > on particular CPU: irq_work_queue(this_cpu_ptr(&wake_up_klogd_work));
> > 
> > bool irq_work_queue()
> > {
> > 	/* Only queue if not already pending */
> > 	if (!irq_work_claim(work))
> > 		return false;
> > 
> > 	 __irq_work_queue_local(work);
> > }
> 
> Plus one more check - waitqueue_active(&log_wait). printk() adds
> pending irq_work only if there is a user-space process sleeping on
> log_wait and irq_work is not already scheduled. If the syslog is
> active or there is noone to wakeup then we don't queue irq_work.

Another possibility for this potential livelock is that those printk() from
warn_alloc(), dump_stack() and show_mem() increase the time it needs to process
build_skb() allocation failures significantly under memory pressure. As the
result, ksoftirqd() could be rescheduled during that time via a different CPU
(this is a large x86 NUMA system anyway),

[83605.577256][   C31]  run_ksoftirqd+0x1f/0x40
[83605.577256][   C31]  smpboot_thread_fn+0x255/0x440
[83605.577256][   C31]  kthread+0x1df/0x200
[83605.577256][   C31]  ret_from_fork+0x35/0x40

In addition, those printk() will deal with console drivers or even a networking
console, so it is probably not unusual that it could call irq_exit()-
>__do_softirq() at one point and then this livelock.
Qian Cai Sept. 4, 2019, 12:28 p.m. UTC | #22
On Wed, 2019-09-04 at 14:07 +0200, Michal Hocko wrote:
> On Wed 04-09-19 07:59:17, Qian Cai wrote:
> > On Wed, 2019-09-04 at 10:25 +0200, Michal Hocko wrote:
> > > On Wed 04-09-19 16:00:42, Sergey Senozhatsky wrote:
> > > > On (09/04/19 15:41), Sergey Senozhatsky wrote:
> > > > > But the thing is different in case of dump_stack() + show_mem() +
> > > > > some other output. Because now we ratelimit not a single printk()
> > > > > line,
> > > > > but hundreds of them. The ratelimit becomes - 10 * $$$ lines in 5
> > > > > seconds
> > > > > (IOW, now we talk about thousands of lines).
> > > > 
> > > > And on devices with slow serial consoles this can be somewhat close to
> > > > "no ratelimit". *Suppose* that warn_alloc() adds 700 lines each time.
> > > > Within 5 seconds we can call warn_alloc() 10 times, which will add 7000
> > > > lines to the logbuf. If printk() can evict only 6000 lines in 5 seconds
> > > > then we have a growing number of pending logbuf messages.
> > > 
> > > Yes, ratelimit is problematic when the ratelimited operation is slow. I
> > > guess that is a well known problem and we would need to rework both the
> > > api and the implementation to make it work in those cases as well.
> > > Essentially we need to make the ratelimit act as a gatekeeper to an
> > > operation section - something like a critical section except you can
> > > tolerate more code executions but not too many. So effectively
> > > 
> > > 	start_throttle(rate, number);
> > > 	/* here goes your operation */
> > > 	end_throttle();
> > > 
> > > one operation is not considered done until the whole section ends.
> > > Or something along those lines.
> > > 
> > > In this particular case we can increase the rate limit parameters of
> > > course but I think that longterm we need a better api.
> > 
> > The problem is when a system is under heavy memory pressure, everything is
> > becoming slower, so I don't know how to come up with a sane default for rate
> > limit parameters as a generic solution that would work for every machine out
> > there. Sure, it is possible to set a limit as low as possible that would
> > work
> > for the majority of systems apart from people may complain that they are now
> > missing important warnings, but using __GFP_NOWARN in this code would work
> > for
> > all systems. You could even argument there is even a separate benefit that
> > it
> > could reduce the noise-level overall from those build_skb() allocation
> > failures
> > as it has a fall-back mechanism anyway.
> 
> As Vlastimil already pointed out, __GFP_NOWARN would hide that reserves
> might be configured too low.

Tune "min_free_kbytes" is also an unreliable solution and situational as the
same reason mentioned previously. It may also need a lot of testing to find out
the right value of it on one particular system.

"
When there is a heavy memory pressure, the system is trying hard to reclaim
memory to fill up the watermark. However, the IO is slow to page out, but the
memory pressure keep draining atomic reservoir, and some of those skb_build()
will fail eventually.

Only if there is a fast IO, it will finish swapping sooner and then invoke the
OOM to end the memory pressure.
"

It also have a drawback that "waste" precious memory resources, as allocations
other than GPF_ATOMIC are unable to use those reserved memory anymore.
Sergey Senozhatsky Sept. 4, 2019, 2:48 p.m. UTC | #23
On (09/04/19 08:14), Qian Cai wrote:
> > Plus one more check - waitqueue_active(&log_wait). printk() adds
> > pending irq_work only if there is a user-space process sleeping on
> > log_wait and irq_work is not already scheduled. If the syslog is
> > active or there is noone to wakeup then we don't queue irq_work.
> 
> Another possibility for this potential livelock is that those printk() from
> warn_alloc(), dump_stack() and show_mem() increase the time it needs to process
> build_skb() allocation failures significantly under memory pressure. As the
> result, ksoftirqd() could be rescheduled during that time via a different CPU
> (this is a large x86 NUMA system anyway),
> 
> [83605.577256][   C31]  run_ksoftirqd+0x1f/0x40
> [83605.577256][   C31]  smpboot_thread_fn+0x255/0x440
> [83605.577256][   C31]  kthread+0x1df/0x200
> [83605.577256][   C31]  ret_from_fork+0x35/0x40

Hum hum hum...

So I can, _probably_, think of several patches.

First, move wake_up_klogd() back to console_unlock().

Second, move `printk_pending' out of per-CPU region and make it global.
So we will have just one printk irq_work scheduled across all CPUs;
currently we have one irq_work per CPU. I think I sent a patch a long
long time ago, but we never discussed it, as far as I remember.

> In addition, those printk() will deal with console drivers or even a networking
> console, so it is probably not unusual that it could call irq_exit()-
>__do_softirq() at one point and then this livelock.

Do you use netcon? Because this, theoretically, can open up one more
vector. netcon allocates skbs from ->write() path. We call con drivers'
->write() from printk_safe context, so should netcon skb allocation
warn we will scedule one more irq_work on that CPU to flush per-CPU
printk_safe buffer.

If this is the case, then we can stop calling console_driver() under
printk_safe. I sent a patch a while ago, but we agreed to keep the
things the way they are, fot the time being.

Let me think more.

	-ss
Qian Cai Sept. 4, 2019, 3:07 p.m. UTC | #24
On Wed, 2019-09-04 at 23:48 +0900, Sergey Senozhatsky wrote:
> On (09/04/19 08:14), Qian Cai wrote:
> > > Plus one more check - waitqueue_active(&log_wait). printk() adds
> > > pending irq_work only if there is a user-space process sleeping on
> > > log_wait and irq_work is not already scheduled. If the syslog is
> > > active or there is noone to wakeup then we don't queue irq_work.
> > 
> > Another possibility for this potential livelock is that those printk() from
> > warn_alloc(), dump_stack() and show_mem() increase the time it needs to
> > process
> > build_skb() allocation failures significantly under memory pressure. As the
> > result, ksoftirqd() could be rescheduled during that time via a different
> > CPU
> > (this is a large x86 NUMA system anyway),
> > 
> > [83605.577256][   C31]  run_ksoftirqd+0x1f/0x40
> > [83605.577256][   C31]  smpboot_thread_fn+0x255/0x440
> > [83605.577256][   C31]  kthread+0x1df/0x200
> > [83605.577256][   C31]  ret_from_fork+0x35/0x40
> 
> Hum hum hum...
> 
> So I can, _probably_, think of several patches.
> 
> First, move wake_up_klogd() back to console_unlock().
> 
> Second, move `printk_pending' out of per-CPU region and make it global.
> So we will have just one printk irq_work scheduled across all CPUs;
> currently we have one irq_work per CPU. I think I sent a patch a long
> long time ago, but we never discussed it, as far as I remember.
> 
> > In addition, those printk() will deal with console drivers or even a
> > networking
> > console, so it is probably not unusual that it could call irq_exit()-
> > __do_softirq() at one point and then this livelock.
> 
> Do you use netcon? Because this, theoretically, can open up one more
> vector. netcon allocates skbs from ->write() path. We call con drivers'
> ->write() from printk_safe context, so should netcon skb allocation
> warn we will scedule one more irq_work on that CPU to flush per-CPU
> printk_safe buffer.

No, I don't use netcon. Just thought to mention it anyway since there could
other people use it.

> 
> If this is the case, then we can stop calling console_driver() under
> printk_safe. I sent a patch a while ago, but we agreed to keep the
> things the way they are, fot the time being.
> 
> Let me think more.
> 
> 	-ss
Qian Cai Sept. 4, 2019, 8:42 p.m. UTC | #25
On Wed, 2019-09-04 at 23:48 +0900, Sergey Senozhatsky wrote:
> On (09/04/19 08:14), Qian Cai wrote:
> > > Plus one more check - waitqueue_active(&log_wait). printk() adds
> > > pending irq_work only if there is a user-space process sleeping on
> > > log_wait and irq_work is not already scheduled. If the syslog is
> > > active or there is noone to wakeup then we don't queue irq_work.
> > 
> > Another possibility for this potential livelock is that those printk() from
> > warn_alloc(), dump_stack() and show_mem() increase the time it needs to
> > process
> > build_skb() allocation failures significantly under memory pressure. As the
> > result, ksoftirqd() could be rescheduled during that time via a different
> > CPU
> > (this is a large x86 NUMA system anyway),
> > 
> > [83605.577256][   C31]  run_ksoftirqd+0x1f/0x40
> > [83605.577256][   C31]  smpboot_thread_fn+0x255/0x440
> > [83605.577256][   C31]  kthread+0x1df/0x200
> > [83605.577256][   C31]  ret_from_fork+0x35/0x40
> 
> Hum hum hum...
> 
> So I can, _probably_, think of several patches.
> 
> First, move wake_up_klogd() back to console_unlock().
> 
> Second, move `printk_pending' out of per-CPU region and make it global.
> So we will have just one printk irq_work scheduled across all CPUs;
> currently we have one irq_work per CPU. I think I sent a patch a long
> long time ago, but we never discussed it, as far as I remember.
> 
> > In addition, those printk() will deal with console drivers or even a
> > networking
> > console, so it is probably not unusual that it could call irq_exit()-
> > __do_softirq() at one point and then this livelock.
> 
> Do you use netcon? Because this, theoretically, can open up one more
> vector. netcon allocates skbs from ->write() path. We call con drivers'
> ->write() from printk_safe context, so should netcon skb allocation
> warn we will scedule one more irq_work on that CPU to flush per-CPU
> printk_safe buffer.
> 
> If this is the case, then we can stop calling console_driver() under
> printk_safe. I sent a patch a while ago, but we agreed to keep the
> things the way they are, fot the time being.
> 
> Let me think more.

To summary, those look to me are all good long-term improvement that would
reduce the likelihood of this kind of livelock in general especially for other
unknown allocations that happen while processing softirqs, but it is still up to
the air if it fixes it 100% in all situations as printk() is going to take more
time and could deal with console hardware that involve irq_exit() anyway.

On the other hand, adding __GPF_NOWARN in the build_skb() allocation will fix
this known NET_TX_SOFTIRQ case which is common when softirqd involved at least
in short-term. It even have a benefit to reduce the overall warn_alloc() noise
out there.

I can resubmit with an update changelog. Does it make any sense?
Eric Dumazet Sept. 5, 2019, 8:32 a.m. UTC | #26
On 9/4/19 10:42 PM, Qian Cai wrote:

> To summary, those look to me are all good long-term improvement that would
> reduce the likelihood of this kind of livelock in general especially for other
> unknown allocations that happen while processing softirqs, but it is still up to
> the air if it fixes it 100% in all situations as printk() is going to take more
> time and could deal with console hardware that involve irq_exit() anyway.
> 
> On the other hand, adding __GPF_NOWARN in the build_skb() allocation will fix
> this known NET_TX_SOFTIRQ case which is common when softirqd involved at least
> in short-term. It even have a benefit to reduce the overall warn_alloc() noise
> out there.
> 
> I can resubmit with an update changelog. Does it make any sense?

It does not make sense.

We have thousands other GFP_ATOMIC allocations in the networking stacks.

Soon you will have to send more and more patches adding __GFP_NOWARN once
your workloads/tests can hit all these various points.

It is really time to fix this problem generically, instead of having
to review hundreds of patches.

This was my initial feedback really, nothing really has changed since.

The ability to send a warning with a stack trace, holding the cpu
for many milliseconds should not be decided case by case, otherwise
every call points will decide to opt-out from the harmful warnings.
Sergey Senozhatsky Sept. 5, 2019, 11:32 a.m. UTC | #27
On (09/04/19 16:42), Qian Cai wrote:
> > Let me think more.
> 
> To summary, those look to me are all good long-term improvement that would
> reduce the likelihood of this kind of livelock in general especially for other
> unknown allocations that happen while processing softirqs, but it is still up to
> the air if it fixes it 100% in all situations as printk() is going to take more
> time

Well. So. I guess that we don't need irq_work most of the time.

We need to queue irq_work for "safe" wake_up_interruptible(), when we
know that we can deadlock in scheduler. IOW, only when we are invoked
from the scheduler. Scheduler has printk_deferred(), which tells printk()
that it cannot do wake_up_interruptible(). Otherwise we can just use
normal wake_up_process() and don't need that irq_work->wake_up_interruptible()
indirection. The parts of the scheduler, which by mistake call plain printk()
from under pi_lock or rq_lock have chances to deadlock anyway and should
be switched to printk_deferred().

I think we can queue significantly much less irq_work-s from printk().

Petr, Steven, what do you think?

Something like this. Call wake_up_interruptible(), switch to
wake_up_klogd() only when called from sched code.

---
diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index cd51aa7d08a9..89cb47882254 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -2027,8 +2027,11 @@ asmlinkage int vprintk_emit(int facility, int level,
 	pending_output = (curr_log_seq != log_next_seq);
 	logbuf_unlock_irqrestore(flags);
 
+	if (!pending_output)
+		return printed_len;
+
 	/* If called from the scheduler, we can not call up(). */
-	if (!in_sched && pending_output) {
+	if (!in_sched) {
 		/*
 		 * Disable preemption to avoid being preempted while holding
 		 * console_sem which would prevent anyone from printing to
@@ -2043,10 +2046,11 @@ asmlinkage int vprintk_emit(int facility, int level,
 		if (console_trylock_spinning())
 			console_unlock();
 		preempt_enable();
-	}
 
-	if (pending_output)
+		wake_up_interruptible(&log_wait);
+	} else {
 		wake_up_klogd();
+	}
 	return printed_len;
 }
 EXPORT_SYMBOL(vprintk_emit);
---

> and could deal with console hardware that involve irq_exit() anyway.

printk->console_driver->write() does not involve irq.

> On the other hand, adding __GPF_NOWARN in the build_skb() allocation will fix
> this known NET_TX_SOFTIRQ case which is common when softirqd involved at least
> in short-term. It even have a benefit to reduce the overall warn_alloc() noise
> out there.

That's not up to me to decide.

	-ss
Qian Cai Sept. 5, 2019, 2:09 p.m. UTC | #28
On Thu, 2019-09-05 at 10:32 +0200, Eric Dumazet wrote:
> 
> On 9/4/19 10:42 PM, Qian Cai wrote:
> 
> > To summary, those look to me are all good long-term improvement that would
> > reduce the likelihood of this kind of livelock in general especially for
> > other
> > unknown allocations that happen while processing softirqs, but it is still
> > up to
> > the air if it fixes it 100% in all situations as printk() is going to take
> > more
> > time and could deal with console hardware that involve irq_exit() anyway.
> > 
> > On the other hand, adding __GPF_NOWARN in the build_skb() allocation will
> > fix
> > this known NET_TX_SOFTIRQ case which is common when softirqd involved at
> > least
> > in short-term. It even have a benefit to reduce the overall warn_alloc()
> > noise
> > out there.
> > 
> > I can resubmit with an update changelog. Does it make any sense?
> 
> It does not make sense.
> 
> We have thousands other GFP_ATOMIC allocations in the networking stacks.

Instead of repeatedly make generalize statements, could you enlighten me with
some concrete examples that have the similar properties which would trigger a
livelock,

- guaranteed GFP_ATOMIC allocations when processing softirq batches.
- the allocation has a fallback mechanism that is unnecessary to warn a failure.

I thought "skb" is a special-case here as every packet sent or received is
handled using this data structure.

> 
> Soon you will have to send more and more patches adding __GFP_NOWARN once
> your workloads/tests can hit all these various points.

I doubt so.

> 
> It is really time to fix this problem generically, instead of having
> to review hundreds of patches.
> 
> This was my initial feedback really, nothing really has changed since.

I feel like you may not follow the thread closely. There are more details
uncovered in the last few days and narrowed down to the culprits.

> 
> The ability to send a warning with a stack trace, holding the cpu
> for many milliseconds should not be decided case by case, otherwise
> every call points will decide to opt-out from the harmful warnings.

That is not really the reasons anymore why I asked to add a __GPF_NOWARN here.
Eric Dumazet Sept. 5, 2019, 3:06 p.m. UTC | #29
On 9/5/19 4:09 PM, Qian Cai wrote:
> 
> I feel like you may not follow the thread closely. There are more details
> uncovered in the last few days and narrowed down to the culprits.
> 

I have followed the thread closely, thank you very much.

I am happy that the problem is addressed as I suggested.
Ie not individual patches adding selected __GFP_NOWARN.
Eric Dumazet Sept. 5, 2019, 3:14 p.m. UTC | #30
On 9/5/19 4:09 PM, Qian Cai wrote:

> Instead of repeatedly make generalize statements, could you enlighten me with
> some concrete examples that have the similar properties which would trigger a
> livelock,
> 
> - guaranteed GFP_ATOMIC allocations when processing softirq batches.
> - the allocation has a fallback mechanism that is unnecessary to warn a failure.
> 
> I thought "skb" is a special-case here as every packet sent or received is
> handled using this data structure.
>

Just  'git grep GFP_ATOMIC -- net' and carefully study all the places.

You will discover many allocations done for incoming packets.

All of them can fail and trigger a trace.

Please fix the problem for good, do not pretend addressing the skb allocations
will solve it.

The skb allocation can succeed, then the following allocation might fail.

skb are one of the many objects that networking need to allocate dynamically.
Qian Cai Sept. 5, 2019, 4:03 p.m. UTC | #31
On Thu, 2019-09-05 at 20:32 +0900, Sergey Senozhatsky wrote:
> On (09/04/19 16:42), Qian Cai wrote:
> > > Let me think more.
> > 
> > To summary, those look to me are all good long-term improvement that would
> > reduce the likelihood of this kind of livelock in general especially for
> > other
> > unknown allocations that happen while processing softirqs, but it is still
> > up to
> > the air if it fixes it 100% in all situations as printk() is going to take
> > more
> > time
> 
> Well. So. I guess that we don't need irq_work most of the time.
> 
> We need to queue irq_work for "safe" wake_up_interruptible(), when we
> know that we can deadlock in scheduler. IOW, only when we are invoked
> from the scheduler. Scheduler has printk_deferred(), which tells printk()
> that it cannot do wake_up_interruptible(). Otherwise we can just use
> normal wake_up_process() and don't need that irq_work->wake_up_interruptible()
> indirection. The parts of the scheduler, which by mistake call plain printk()
> from under pi_lock or rq_lock have chances to deadlock anyway and should
> be switched to printk_deferred().
> 
> I think we can queue significantly much less irq_work-s from printk().
> 
> Petr, Steven, what do you think?
> 
> Something like this. Call wake_up_interruptible(), switch to
> wake_up_klogd() only when called from sched code.
> 
> ---
> diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
> index cd51aa7d08a9..89cb47882254 100644
> --- a/kernel/printk/printk.c
> +++ b/kernel/printk/printk.c
> @@ -2027,8 +2027,11 @@ asmlinkage int vprintk_emit(int facility, int level,
>  	pending_output = (curr_log_seq != log_next_seq);
>  	logbuf_unlock_irqrestore(flags);
>  
> +	if (!pending_output)
> +		return printed_len;
> +
>  	/* If called from the scheduler, we can not call up(). */
> -	if (!in_sched && pending_output) {
> +	if (!in_sched) {
>  		/*
>  		 * Disable preemption to avoid being preempted while holding
>  		 * console_sem which would prevent anyone from printing to
> @@ -2043,10 +2046,11 @@ asmlinkage int vprintk_emit(int facility, int level,
>  		if (console_trylock_spinning())
>  			console_unlock();
>  		preempt_enable();
> -	}
>  
> -	if (pending_output)
> +		wake_up_interruptible(&log_wait);
> +	} else {
>  		wake_up_klogd();
> +	}
>  	return printed_len;
>  }
>  EXPORT_SYMBOL(vprintk_emit);
> ---
> 
> > and could deal with console hardware that involve irq_exit() anyway.
> 
> printk->console_driver->write() does not involve irq.

Hmm, from the article,

https://en.wikipedia.org/wiki/Universal_asynchronous_receiver-transmitter

"Since transmission of a single or multiple characters may take a long time
relative to CPU speeds, a UART maintains a flag showing busy status so that the
host system knows if there is at least one character in the transmit buffer or
shift register; "ready for next character(s)" may also be signaled with an
interrupt."
Steven Rostedt Sept. 5, 2019, 5:14 p.m. UTC | #32
On Thu, 05 Sep 2019 12:03:13 -0400
Qian Cai <cai@lca.pw> wrote:

> > > and could deal with console hardware that involve irq_exit() anyway.  
> > 
> > printk->console_driver->write() does not involve irq.  
> 
> Hmm, from the article,
> 
> https://en.wikipedia.org/wiki/Universal_asynchronous_receiver-transmitter
> 
> "Since transmission of a single or multiple characters may take a long time
> relative to CPU speeds, a UART maintains a flag showing busy status so that the
> host system knows if there is at least one character in the transmit buffer or
> shift register; "ready for next character(s)" may also be signaled with an
> interrupt."

I'm pretty sure all serial consoles do a busy loop on the UART and not
use interrupts to notify when it's available. That would require an
asynchronous implementation of printk() which would be quite complex to
implement.

-- Steve
Steven Rostedt Sept. 5, 2019, 5:23 p.m. UTC | #33
On Thu, 5 Sep 2019 20:32:08 +0900
Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com> wrote:

> I think we can queue significantly much less irq_work-s from printk().
> 
> Petr, Steven, what do you think?

What if we just rate limit the wake ups of klogd? I mean, really, do we
need to keep calling wake up if it probably never even executed?

-- Steve
Sergey Senozhatsky Sept. 6, 2019, 2:50 a.m. UTC | #34
On (09/05/19 13:14), Steven Rostedt wrote:
> > Hmm, from the article,
> > 
> > https://en.wikipedia.org/wiki/Universal_asynchronous_receiver-transmitter
> > 
> > "Since transmission of a single or multiple characters may take a long time
> > relative to CPU speeds, a UART maintains a flag showing busy status so that the
> > host system knows if there is at least one character in the transmit buffer or
> > shift register; "ready for next character(s)" may also be signaled with an
> > interrupt."
> 
> I'm pretty sure all serial consoles do a busy loop on the UART and not
> use interrupts to notify when it's available.

Yes. Besides, we call console drivers with local IRQs disabled.

	-ss
Sergey Senozhatsky Sept. 6, 2019, 3:39 a.m. UTC | #35
On (09/05/19 13:23), Steven Rostedt wrote:
> > I think we can queue significantly much less irq_work-s from printk().
> > 
> > Petr, Steven, what do you think?

[..]
> I mean, really, do we need to keep calling wake up if it
> probably never even executed?

I guess ratelimiting you are talking about ("if it probably never even
executed") would be to check if we have already called wake up on the
log_wait ->head. For that we need to, at least, take log_wait spin_lock
and check that ->head is still in TASK_INTERRUPTIBLE; which is (quite,
but not exactly) close to what wake_up_interruptible() does - it doesn't
wake up the same task twice, it bails out on `p->state & state' check.

Or did I miss something?

	-ss
Sergey Senozhatsky Sept. 6, 2019, 4:32 a.m. UTC | #36
On (09/05/19 12:03), Qian Cai wrote:
> > ---
> > diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
> > index cd51aa7d08a9..89cb47882254 100644
> > --- a/kernel/printk/printk.c
> > +++ b/kernel/printk/printk.c
> > @@ -2027,8 +2027,11 @@ asmlinkage int vprintk_emit(int facility, int level,
> >  	pending_output = (curr_log_seq != log_next_seq);
> >  	logbuf_unlock_irqrestore(flags);
> >  
> > +	if (!pending_output)
> > +		return printed_len;
> > +
> >  	/* If called from the scheduler, we can not call up(). */
> > -	if (!in_sched && pending_output) {
> > +	if (!in_sched) {
> >  		/*
> >  		 * Disable preemption to avoid being preempted while holding
> >  		 * console_sem which would prevent anyone from printing to
> > @@ -2043,10 +2046,11 @@ asmlinkage int vprintk_emit(int facility, int level,
> >  		if (console_trylock_spinning())
> >  			console_unlock();
> >  		preempt_enable();
> > -	}
> >  
> > -	if (pending_output)
> > +		wake_up_interruptible(&log_wait);
> > +	} else {
> >  		wake_up_klogd();
> > +	}
> >  	return printed_len;
> >  }
> >  EXPORT_SYMBOL(vprintk_emit);
> > ---

Qian Cai, any chance you can test that patch?

	-ss
Petr Mladek Sept. 6, 2019, 2:55 p.m. UTC | #37
On Thu 2019-09-05 20:32:08, Sergey Senozhatsky wrote:
> On (09/04/19 16:42), Qian Cai wrote:
> > > Let me think more.
> > 
> > To summary, those look to me are all good long-term improvement that would
> > reduce the likelihood of this kind of livelock in general especially for other
> > unknown allocations that happen while processing softirqs, but it is still up to
> > the air if it fixes it 100% in all situations as printk() is going to take more
> > time
> 
> Well. So. I guess that we don't need irq_work most of the time.
> 
> We need to queue irq_work for "safe" wake_up_interruptible(), when we
> know that we can deadlock in scheduler. IOW, only when we are invoked
> from the scheduler. Scheduler has printk_deferred(), which tells printk()
> that it cannot do wake_up_interruptible(). Otherwise we can just use
> normal wake_up_process() and don't need that irq_work->wake_up_interruptible()
> indirection. The parts of the scheduler, which by mistake call plain printk()
> from under pi_lock or rq_lock have chances to deadlock anyway and should
> be switched to printk_deferred().
> 
> I think we can queue significantly much less irq_work-s from printk().
> 
> Petr, Steven, what do you think?
> 
> Something like this. Call wake_up_interruptible(), switch to
> wake_up_klogd() only when called from sched code.

Replacing irq_work_queue() with wake_up_interruptible() looks
dangerous to me.

As a result, all "normal" printk() calls from the scheduler
code will deadlock. There is almost always a userspace
logger registered.

By "normal" I mean anything that is not printk_deferred(). For
example, any WARN() from sheduler will cause a deadlock.
We will not even have chance to catch these problems in
advance by lockdep.

The difference is that console_unlock() calls wake_up_process()
only when there is a waiter. And the hard console_lock() is not
called that often.


Honestly, scheduling IRQ looks like the most lightweight and reliable
solution for offloading. We are in big troubles if we could not use
it in printk() code.

IMHO, the best solution is to ratelimit the warnings about the
allocation failures. It does not make sense to repeat the same
warning again and again. We might need a better ratelimiting API
if the current one is not reliable.

Best Regards,
Petr
Petr Mladek Sept. 6, 2019, 3:32 p.m. UTC | #38
On Fri 2019-09-06 12:39:00, Sergey Senozhatsky wrote:
> On (09/05/19 13:23), Steven Rostedt wrote:
> > > I think we can queue significantly much less irq_work-s from printk().
> > > 
> > > Petr, Steven, what do you think?
> 
> [..]
> > I mean, really, do we need to keep calling wake up if it
> > probably never even executed?
> 
> I guess ratelimiting you are talking about ("if it probably never even
> executed") would be to check if we have already called wake up on the
> log_wait ->head. For that we need to, at least, take log_wait spin_lock
> and check that ->head is still in TASK_INTERRUPTIBLE; which is (quite,
> but not exactly) close to what wake_up_interruptible() does - it doesn't
> wake up the same task twice, it bails out on `p->state & state' check.

I have just realized that only sleeping tasks are in the waitqueue.
It is already handled by waitqueue_active() check.

I am afraid that we could not ratelimit the wakeups. The userspace
loggers might then miss the last lines for a long.

We could move wake_up_klogd() back to console_unlock(). But it might
end up with a back-and-forth games according to who is currently
complaining.

Sigh, I still suggest to ratelimit the warning about failed
allocation.

Best Regards,
Petr
Sergey Senozhatsky Sept. 6, 2019, 7:51 p.m. UTC | #39
On (09/06/19 16:55), Petr Mladek wrote:
> > I think we can queue significantly much less irq_work-s from printk().
> > 
> > Petr, Steven, what do you think?
> > 
> > Something like this. Call wake_up_interruptible(), switch to
> > wake_up_klogd() only when called from sched code.
> 
> Replacing irq_work_queue() with wake_up_interruptible() looks
> dangerous to me.
> 
> As a result, all "normal" printk() calls from the scheduler
> code will deadlock. There is almost always a userspace
> logger registered.

I don't see why all printk()-s should deadlock.

A "normal" printk() call will deadlock only when scheduler calls
"normal" printk() under rq or pi locks. But this is illegal anyway,
because console_sem up() calls wake_up_process() - the same function
wake_up_interruptible() calls. IOW "normal" printk() calls from
scheduler end up in scheduler, via console_sem->sched chain. We
already execute wake_up_process()->try_to_wake_up() in printk(),
even when a non-LOGLEVEL_SCHED printk() comes from scheduler.

What am I missing something?

	-ss
Qian Cai Sept. 6, 2019, 9:17 p.m. UTC | #40
On Fri, 2019-09-06 at 13:32 +0900, Sergey Senozhatsky wrote:
> On (09/05/19 12:03), Qian Cai wrote:
> > > ---
> > > diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
> > > index cd51aa7d08a9..89cb47882254 100644
> > > --- a/kernel/printk/printk.c
> > > +++ b/kernel/printk/printk.c
> > > @@ -2027,8 +2027,11 @@ asmlinkage int vprintk_emit(int facility, int level,
> > >  	pending_output = (curr_log_seq != log_next_seq);
> > >  	logbuf_unlock_irqrestore(flags);
> > >  
> > > +	if (!pending_output)
> > > +		return printed_len;
> > > +
> > >  	/* If called from the scheduler, we can not call up(). */
> > > -	if (!in_sched && pending_output) {
> > > +	if (!in_sched) {
> > >  		/*
> > >  		 * Disable preemption to avoid being preempted while holding
> > >  		 * console_sem which would prevent anyone from printing to
> > > @@ -2043,10 +2046,11 @@ asmlinkage int vprintk_emit(int facility, int level,
> > >  		if (console_trylock_spinning())
> > >  			console_unlock();
> > >  		preempt_enable();
> > > -	}
> > >  
> > > -	if (pending_output)
> > > +		wake_up_interruptible(&log_wait);
> > > +	} else {
> > >  		wake_up_klogd();
> > > +	}
> > >  	return printed_len;
> > >  }
> > >  EXPORT_SYMBOL(vprintk_emit);
> > > ---
> 
> Qian Cai, any chance you can test that patch?

So far as good, but it is hard to tell if this really nail the issue down. I'll
leave it running over the weekend and report back if it occurs again.
Sergey Senozhatsky Sept. 9, 2019, 1:10 a.m. UTC | #41
On (09/06/19 17:32), Petr Mladek wrote:
> > [..]
> > > I mean, really, do we need to keep calling wake up if it
> > > probably never even executed?
> > 
> > I guess ratelimiting you are talking about ("if it probably never even
> > executed") would be to check if we have already called wake up on the
> > log_wait ->head. For that we need to, at least, take log_wait spin_lock
> > and check that ->head is still in TASK_INTERRUPTIBLE; which is (quite,
> > but not exactly) close to what wake_up_interruptible() does - it doesn't
> > wake up the same task twice, it bails out on `p->state & state' check.
> 
> I have just realized that only sleeping tasks are in the waitqueue.
> It is already handled by waitqueue_active() check.

Yes.

> I am afraid that we could not ratelimit the wakeups. The userspace
> loggers might then miss the last lines for a long.

That's my concern as well.

> We could move wake_up_klogd() back to console_unlock(). But it might
> end up with a back-and-forth games according to who is currently
> complaining.

We still don't need irq_work, tho.

If we can do
	printk()->console_unlock()->up()->try_to_wake_up()
then we can also do
	printk()           ->             try_to_wake_up()

It's LOGLEVEL_SCHED which tells us if we can try_to_wake_up()
or cannot.

> Sigh, I still suggest to ratelimit the warning about failed
> allocation.

Hard to imagine how many printk()-s we will have to ratelimit.
To imagine NET maintainers being OK with this is even harder.

	-ss
Qian Cai Nov. 14, 2019, 5:12 p.m. UTC | #42
On Thu, 2019-09-05 at 20:32 +0900, Sergey Senozhatsky wrote:
> On (09/04/19 16:42), Qian Cai wrote:
> > > Let me think more.
> > 
> > To summary, those look to me are all good long-term improvement that would
> > reduce the likelihood of this kind of livelock in general especially for other
> > unknown allocations that happen while processing softirqs, but it is still up to
> > the air if it fixes it 100% in all situations as printk() is going to take more
> > time
> 
> Well. So. I guess that we don't need irq_work most of the time.
> 
> We need to queue irq_work for "safe" wake_up_interruptible(), when we
> know that we can deadlock in scheduler. IOW, only when we are invoked
> from the scheduler. Scheduler has printk_deferred(), which tells printk()
> that it cannot do wake_up_interruptible(). Otherwise we can just use
> normal wake_up_process() and don't need that irq_work->wake_up_interruptible()
> indirection. The parts of the scheduler, which by mistake call plain printk()
> from under pi_lock or rq_lock have chances to deadlock anyway and should
> be switched to printk_deferred().
> 
> I think we can queue significantly much less irq_work-s from printk().
> 
> Petr, Steven, what do you think?

Sergey, do you still plan to get this patch merged?

> 
> Something like this. Call wake_up_interruptible(), switch to
> wake_up_klogd() only when called from sched code.
> 
> ---
> diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
> index cd51aa7d08a9..89cb47882254 100644
> --- a/kernel/printk/printk.c
> +++ b/kernel/printk/printk.c
> @@ -2027,8 +2027,11 @@ asmlinkage int vprintk_emit(int facility, int level,
>  	pending_output = (curr_log_seq != log_next_seq);
>  	logbuf_unlock_irqrestore(flags);
>  
> +	if (!pending_output)
> +		return printed_len;
> +
>  	/* If called from the scheduler, we can not call up(). */
> -	if (!in_sched && pending_output) {
> +	if (!in_sched) {
>  		/*
>  		 * Disable preemption to avoid being preempted while holding
>  		 * console_sem which would prevent anyone from printing to
> @@ -2043,10 +2046,11 @@ asmlinkage int vprintk_emit(int facility, int level,
>  		if (console_trylock_spinning())
>  			console_unlock();
>  		preempt_enable();
> -	}
>  
> -	if (pending_output)
> +		wake_up_interruptible(&log_wait);
> +	} else {
>  		wake_up_klogd();
> +	}
>  	return printed_len;
>  }
>  EXPORT_SYMBOL(vprintk_emit);
> ---
> 
> > and could deal with console hardware that involve irq_exit() anyway.
> 
> printk->console_driver->write() does not involve irq.
> 
> > On the other hand, adding __GPF_NOWARN in the build_skb() allocation will fix
> > this known NET_TX_SOFTIRQ case which is common when softirqd involved at least
> > in short-term. It even have a benefit to reduce the overall warn_alloc() noise
> > out there.
> 
> That's not up to me to decide.
> 
> 	-ss
Petr Mladek Nov. 18, 2019, 3:27 p.m. UTC | #43
On Thu 2019-11-14 12:12:50, Qian Cai wrote:
> On Thu, 2019-09-05 at 20:32 +0900, Sergey Senozhatsky wrote:
> > On (09/04/19 16:42), Qian Cai wrote:
> > > > Let me think more.
> > > 
> > > To summary, those look to me are all good long-term improvement that would
> > > reduce the likelihood of this kind of livelock in general especially for other
> > > unknown allocations that happen while processing softirqs, but it is still up to
> > > the air if it fixes it 100% in all situations as printk() is going to take more
> > > time
> > 
> > Well. So. I guess that we don't need irq_work most of the time.
> > 
> > We need to queue irq_work for "safe" wake_up_interruptible(), when we
> > know that we can deadlock in scheduler. IOW, only when we are invoked
> > from the scheduler. Scheduler has printk_deferred(), which tells printk()
> > that it cannot do wake_up_interruptible(). Otherwise we can just use
> > normal wake_up_process() and don't need that irq_work->wake_up_interruptible()
> > indirection. The parts of the scheduler, which by mistake call plain printk()
> > from under pi_lock or rq_lock have chances to deadlock anyway and should
> > be switched to printk_deferred().
> > 
> > I think we can queue significantly much less irq_work-s from printk().
> > 
> > Petr, Steven, what do you think?
> 
> Sergey, do you still plan to get this patch merged?
> 
> > 
> > Something like this. Call wake_up_interruptible(), switch to
> > wake_up_klogd() only when called from sched code.
> > 
> > ---
> > diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
> > index cd51aa7d08a9..89cb47882254 100644
> > --- a/kernel/printk/printk.c
> > +++ b/kernel/printk/printk.c
> > @@ -2027,8 +2027,11 @@ asmlinkage int vprintk_emit(int facility, int level,
> >  	pending_output = (curr_log_seq != log_next_seq);
> >  	logbuf_unlock_irqrestore(flags);
> >  
> > +	if (!pending_output)
> > +		return printed_len;
> > +
> >  	/* If called from the scheduler, we can not call up(). */
> > -	if (!in_sched && pending_output) {
> > +	if (!in_sched) {
> >  		/*
> >  		 * Disable preemption to avoid being preempted while holding
> >  		 * console_sem which would prevent anyone from printing to
> > @@ -2043,10 +2046,11 @@ asmlinkage int vprintk_emit(int facility, int level,
> >  		if (console_trylock_spinning())
> >  			console_unlock();
> >  		preempt_enable();
> > -	}
> >  
> > -	if (pending_output)
> > +		wake_up_interruptible(&log_wait);

I do not like this. As a result, normal printk() will always deadlock
in the scheduler code, including WARN() calls. The chance of the
deadlock is small now. It happens only when there is another
process waiting for console_sem.

We want to remove locks from printk() and not add them.

Best Regards,
Petr
Sergey Senozhatsky Nov. 19, 2019, 12:41 a.m. UTC | #44
On (19/11/18 16:27), Petr Mladek wrote:
> > > @@ -2027,8 +2027,11 @@ asmlinkage int vprintk_emit(int facility, int level,
> > >  	pending_output = (curr_log_seq != log_next_seq);
> > >  	logbuf_unlock_irqrestore(flags);
> > >  
> > > +	if (!pending_output)
> > > +		return printed_len;
> > > +
> > >  	/* If called from the scheduler, we can not call up(). */
> > > -	if (!in_sched && pending_output) {
> > > +	if (!in_sched) {
> > >  		/*
> > >  		 * Disable preemption to avoid being preempted while holding
> > >  		 * console_sem which would prevent anyone from printing to
> > > @@ -2043,10 +2046,11 @@ asmlinkage int vprintk_emit(int facility, int level,
> > >  		if (console_trylock_spinning())
> > >  			console_unlock();
> > >  		preempt_enable();
> > > -	}
> > >  
> > > -	if (pending_output)
> > > +		wake_up_interruptible(&log_wait);
> 
> I do not like this. As a result, normal printk() will always deadlock
> in the scheduler code, including WARN() calls. The chance of the
> deadlock is small now. It happens only when there is another
> process waiting for console_sem.

Why would it *always* deadlock? If this is the case, why we don't *always*
deadlock doing the very same wake_up_process() from console_unlock()?

	-ss
Petr Mladek Nov. 19, 2019, 9:41 a.m. UTC | #45
On Tue 2019-11-19 09:41:19, Sergey Senozhatsky wrote:
> On (19/11/18 16:27), Petr Mladek wrote:
> > > > @@ -2027,8 +2027,11 @@ asmlinkage int vprintk_emit(int facility, int level,
> > > >  	pending_output = (curr_log_seq != log_next_seq);
> > > >  	logbuf_unlock_irqrestore(flags);
> > > >  
> > > > +	if (!pending_output)
> > > > +		return printed_len;
> > > > +
> > > >  	/* If called from the scheduler, we can not call up(). */
> > > > -	if (!in_sched && pending_output) {
> > > > +	if (!in_sched) {
> > > >  		/*
> > > >  		 * Disable preemption to avoid being preempted while holding
> > > >  		 * console_sem which would prevent anyone from printing to
> > > > @@ -2043,10 +2046,11 @@ asmlinkage int vprintk_emit(int facility, int level,
> > > >  		if (console_trylock_spinning())
> > > >  			console_unlock();
> > > >  		preempt_enable();
> > > > -	}
> > > >  
> > > > -	if (pending_output)
> > > > +		wake_up_interruptible(&log_wait);
> > 
> > I do not like this. As a result, normal printk() will always deadlock
> > in the scheduler code, including WARN() calls. The chance of the
> > deadlock is small now. It happens only when there is another
> > process waiting for console_sem.
> 
> Why would it *always* deadlock? If this is the case, why we don't *always*
> deadlock doing the very same wake_up_process() from console_unlock()?

I speak about _normal_ printk() and not about printk_deferred().

wake_up_process() is called in console_unlock() only when
sem->wait_list is not empty, see up() in kernel/locking/semaphore.c.
printk() itself uses console_trylock() and does not wait.

I believe that this is the rason why printk_sched() was added
so late in 2012. It was more than 10 years after adding
the semaphore into console_unlock(). IMHO, the deadlock
was rare. Of course, it was also hard to debug but it
would not take 10 years.

Best Regards,
Petr
Qian Cai Nov. 19, 2019, 3:58 p.m. UTC | #46
On Tue, 2019-11-19 at 10:41 +0100, Petr Mladek wrote:
> On Tue 2019-11-19 09:41:19, Sergey Senozhatsky wrote:
> > On (19/11/18 16:27), Petr Mladek wrote:
> > > > > @@ -2027,8 +2027,11 @@ asmlinkage int vprintk_emit(int facility, int level,
> > > > >  	pending_output = (curr_log_seq != log_next_seq);
> > > > >  	logbuf_unlock_irqrestore(flags);
> > > > >  
> > > > > +	if (!pending_output)
> > > > > +		return printed_len;
> > > > > +
> > > > >  	/* If called from the scheduler, we can not call up(). */
> > > > > -	if (!in_sched && pending_output) {
> > > > > +	if (!in_sched) {
> > > > >  		/*
> > > > >  		 * Disable preemption to avoid being preempted while holding
> > > > >  		 * console_sem which would prevent anyone from printing to
> > > > > @@ -2043,10 +2046,11 @@ asmlinkage int vprintk_emit(int facility, int level,
> > > > >  		if (console_trylock_spinning())
> > > > >  			console_unlock();
> > > > >  		preempt_enable();
> > > > > -	}
> > > > >  
> > > > > -	if (pending_output)
> > > > > +		wake_up_interruptible(&log_wait);
> > > 
> > > I do not like this. As a result, normal printk() will always deadlock
> > > in the scheduler code, including WARN() calls. The chance of the
> > > deadlock is small now. It happens only when there is another
> > > process waiting for console_sem.
> > 
> > Why would it *always* deadlock? If this is the case, why we don't *always*
> > deadlock doing the very same wake_up_process() from console_unlock()?
> 
> I speak about _normal_ printk() and not about printk_deferred().
> 
> wake_up_process() is called in console_unlock() only when
> sem->wait_list is not empty, see up() in kernel/locking/semaphore.c.
> printk() itself uses console_trylock() and does not wait.
> 
> I believe that this is the rason why printk_sched() was added
> so late in 2012. It was more than 10 years after adding
> the semaphore into console_unlock(). IMHO, the deadlock
> was rare. Of course, it was also hard to debug but it
> would not take 10 years.

I would not be surprise that those potential deadlocks have been existed even
for 10 years. Not only that it is difficult to debug, but also when eventually
someone had reported them, subsystem developers could still "kick balls" like
where it had been observed for the last a few months, and no progress could be
done for those as eventually life is too short and the reporters have to give
up.
Sergey Senozhatsky Nov. 20, 2019, 1:30 a.m. UTC | #47
On (19/11/19 10:41), Petr Mladek wrote:
[..]
> > > I do not like this. As a result, normal printk() will always deadlock
> > > in the scheduler code, including WARN() calls. The chance of the
> > > deadlock is small now. It happens only when there is another
> > > process waiting for console_sem.
> > 
> > Why would it *always* deadlock? If this is the case, why we don't *always*
> > deadlock doing the very same wake_up_process() from console_unlock()?
> 
> I speak about _normal_ printk() and not about printk_deferred().
> 
> wake_up_process() is called in console_unlock() only when
> sem->wait_list is not empty, see up() in kernel/locking/semaphore.c.
> printk() itself uses console_trylock() and does not wait.

Sorry, I should have spent more time on this. So here's what I should
have written instead.

console_sem is not for kthreads or kernel only. User space apps can sleep
on it. And on some setups console_sem becomes really congested, with often
times non-empty console_sem->wait_list. All because of user space, we don't
even need printk()-s. For example, when user space uses tty for I/O.

Significantly trimmed logs from my laptop (a simple patch below). Just
to show that user space sleeps on console_sem (IOW, non-empty ->wait_list).

Notice - ls, bash, journalctl, etc. All competing for console_sem.

[..]
 printk: >>> bash WOKEUP console_sem() waiter
 printk: task journalctl couldn't lock console_sem and had to schedule out
 Call Trace:
  dump_stack+0x50/0x70
  console_lock.cold+0x21/0x26
  con_write+0x28/0x50
  n_tty_write+0x12f/0x4c0
  ? wait_woken+0xa0/0xa0
  tty_write+0x1a9/0x2f0
  ? n_tty_read+0x930/0x930
  vfs_write+0xb7/0x1e0

printk: >>> ls WOKEUP console_sem() waiter
printk: task kworker/3:1 couldn't lock console_sem and had to schedule out
Workqueue: events console_callback
Call Trace:
 dump_stack+0x50/0x70
 console_lock.cold+0x21/0x26
 console_callback+0xa/0x150
 ? _cond_resched+0x15/0x50
 process_one_work+0x18b/0x2e0
 worker_thread+0x4e/0x3d0
 ? process_one_work+0x2e0/0x2e0
 kthread+0x119/0x130

printk: >>> kworker/3:1 WOKEUP console_sem() waiter
printk: task ls couldn't lock console_sem and had to schedule out
Call Trace:
 dump_stack+0x50/0x70
 console_lock.cold+0x21/0x26
 con_write+0x28/0x50
 do_output_char+0x191/0x1f0
 n_tty_write+0x17e/0x4c0
 ? __switch_to_asm+0x24/0x60
 ? wait_woken+0xa0/0xa0
 tty_write+0x1a9/0x2f0
 ? n_tty_read+0x930/0x930
 vfs_write+0xb7/0x1e0
[..]

> I believe that this is the rason why printk_sched() was added
> so late in 2012. It was more than 10 years after adding
> the semaphore into console_unlock(). IMHO, the deadlock
> was rare. Of course, it was also hard to debug but it
> would not take 10 years.

Right. I also think scheduler people do pretty nice work avoiding printk
calls under ->rq lock.

What I tried to say - it's really not that hard to have a non-empty
console_sem ->wait_list, any "wrong" printk() call from scheduler
will deadlock us, because we have something to wake_up().

///////////////////////////////////////////////////////////////////////
// A simple patch

---

diff --git a/include/linux/semaphore.h b/include/linux/semaphore.h
index 6694d0019a68..c80a631f5f96 100644
--- a/include/linux/semaphore.h
+++ b/include/linux/semaphore.h
@@ -35,11 +35,11 @@ static inline void sema_init(struct semaphore *sem, int val)
 	lockdep_init_map(&sem->lock.dep_map, "semaphore->lock", &__key, 0);
 }
 
-extern void down(struct semaphore *sem);
+extern bool down(struct semaphore *sem);
 extern int __must_check down_interruptible(struct semaphore *sem);
 extern int __must_check down_killable(struct semaphore *sem);
 extern int __must_check down_trylock(struct semaphore *sem);
 extern int __must_check down_timeout(struct semaphore *sem, long jiffies);
-extern void up(struct semaphore *sem);
+extern bool up(struct semaphore *sem);
 
 #endif /* __LINUX_SEMAPHORE_H */
diff --git a/kernel/locking/semaphore.c b/kernel/locking/semaphore.c
index d9dd94defc0a..daefcbe08147 100644
--- a/kernel/locking/semaphore.c
+++ b/kernel/locking/semaphore.c
@@ -50,16 +50,20 @@ static noinline void __up(struct semaphore *sem);
  * Use of this function is deprecated, please use down_interruptible() or
  * down_killable() instead.
  */
-void down(struct semaphore *sem)
+bool down(struct semaphore *sem)
 {
 	unsigned long flags;
+	bool ret = false;
 
 	raw_spin_lock_irqsave(&sem->lock, flags);
 	if (likely(sem->count > 0))
 		sem->count--;
-	else
+	else {
 		__down(sem);
+		ret = true;
+	}
 	raw_spin_unlock_irqrestore(&sem->lock, flags);
+	return ret;
 }
 EXPORT_SYMBOL(down);
 
@@ -175,16 +179,20 @@ EXPORT_SYMBOL(down_timeout);
  * Release the semaphore.  Unlike mutexes, up() may be called from any
  * context and even by tasks which have never called down().
  */
-void up(struct semaphore *sem)
+bool up(struct semaphore *sem)
 {
 	unsigned long flags;
+	bool ret = false;
 
 	raw_spin_lock_irqsave(&sem->lock, flags);
-	if (likely(list_empty(&sem->wait_list)))
+	if (likely(list_empty(&sem->wait_list))) {
 		sem->count++;
-	else
+	} else {
 		__up(sem);
+		ret = true;
+	}
 	raw_spin_unlock_irqrestore(&sem->lock, flags);
+	return ret;
 }
 EXPORT_SYMBOL(up);
 
diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index ca65327a6de8..42ffaaa01859 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -214,12 +214,21 @@ int devkmsg_sysctl_set_loglvl(struct ctl_table *table, int write,
 /* Number of registered extended console drivers. */
 static int nr_ext_console_drivers;
 
+static void ____down(void)
+{
+	bool sched_out = down(&console_sem);
+	if (sched_out) {
+		pr_err("task %s couldn't lock console_sem and had to schedule out\n", current->comm);
+		dump_stack();
+	}
+}
+
 /*
  * Helper macros to handle lockdep when locking/unlocking console_sem. We use
  * macros instead of functions so that _RET_IP_ contains useful information.
  */
 #define down_console_sem() do { \
-	down(&console_sem);\
+	____down();		\
 	mutex_acquire(&console_lock_dep_map, 0, 0, _RET_IP_);\
 } while (0)
 
@@ -244,15 +253,19 @@ static int __down_trylock_console_sem(unsigned long ip)
 }
 #define down_trylock_console_sem() __down_trylock_console_sem(_RET_IP_)
 
-static void __up_console_sem(unsigned long ip)
+static bool __up_console_sem(unsigned long ip)
 {
 	unsigned long flags;
+	bool ret;
 
 	mutex_release(&console_lock_dep_map, 1, ip);
 
 	printk_safe_enter_irqsave(flags);
-	up(&console_sem);
+	ret = up(&console_sem);
+	if (ret)
+		pr_err(">>> %s WOKEUP console_sem() waiter\n", current->comm);
 	printk_safe_exit_irqrestore(flags);
+	return ret;
 }
 #define up_console_sem() __up_console_sem(_RET_IP_)
Petr Mladek Nov. 20, 2019, 4:13 p.m. UTC | #48
On Wed 2019-11-20 10:30:05, Sergey Senozhatsky wrote:
> On (19/11/19 10:41), Petr Mladek wrote:
> [..]
> > > > I do not like this. As a result, normal printk() will always deadlock
> > > > in the scheduler code, including WARN() calls. The chance of the
> > > > deadlock is small now. It happens only when there is another
> > > > process waiting for console_sem.
> > > 
> > > Why would it *always* deadlock? If this is the case, why we don't *always*
> > > deadlock doing the very same wake_up_process() from console_unlock()?
> > 
> > I speak about _normal_ printk() and not about printk_deferred().
> > 
> > wake_up_process() is called in console_unlock() only when
> > sem->wait_list is not empty, see up() in kernel/locking/semaphore.c.
> > printk() itself uses console_trylock() and does not wait.
> 
> > I believe that this is the rason why printk_sched() was added
> > so late in 2012.
> 
> Right. I also think scheduler people do pretty nice work avoiding printk
> calls under ->rq lock.
> 
> What I tried to say - it's really not that hard to have a non-empty
> console_sem ->wait_list, any "wrong" printk() call from scheduler
> will deadlock us, because we have something to wake_up().

I am sorry but I do not take this as an argument that it would be
acceptable to replace irq_work_queue() with wake_up_interruptible().

It is the first time that I hear about problem caused by the
irq_work(). But we deal with deadlocks caused by wake_up() for years.
It would be like replacing a lightly dripping tap with a heavily
dripping one.

I see reports with WARN() from scheduler code from time to time.
I would get reports about silent death instead.

RT guys are going to make printk() fully lockless. It would be
really great achievement. irq_work is lockless. While wake_up()
is not.

There must be a better way how to break the infinite loop caused
by the irq_work.

Best Regards,
Petr
Sergey Senozhatsky Nov. 21, 2019, 1:05 a.m. UTC | #49
On (19/11/20 17:13), Petr Mladek wrote:
[..]
> It is the first time that I hear about problem caused by the
> irq_work(). But we deal with deadlocks caused by wake_up() for years.
> It would be like replacing a lightly dripping tap with a heavily
> dripping one.
> 
> I see reports with WARN() from scheduler code from time to time.
> I would get reports about silent death instead.

Just curious, how many of those WARN() come under rq lock or pi_lock?
// this is real question

> RT guys are going to make printk() fully lockless. It would be
> really great achievement. irq_work is lockless. While wake_up()
> is not.
>
> There must be a better way how to break the infinite loop caused
> by the irq_work.

A lockless wake_up() would do :)

	-ss
Petr Mladek Nov. 21, 2019, 9:15 a.m. UTC | #50
On Thu 2019-11-21 10:05:27, Sergey Senozhatsky wrote:
> On (19/11/20 17:13), Petr Mladek wrote:
> [..]
> > It is the first time that I hear about problem caused by the
> > irq_work(). But we deal with deadlocks caused by wake_up() for years.
> > It would be like replacing a lightly dripping tap with a heavily
> > dripping one.
> > 
> > I see reports with WARN() from scheduler code from time to time.
> > I would get reports about silent death instead.
> 
> Just curious, how many of those WARN() come under rq lock or pi_lock?
> // this is real question

I guess that all SCHED_WARN_ON() would stop working.

I am not 100% sure but I think that all WARN_ON*() in
set_task_cpu(), finish_task_switch(), migrate_tasks()
are affected.

I have seen many reports with the WARN() from
native_smp_send_reschedule() about offline CPU.

Best Regards,
Petr

Patch
diff mbox series

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 0338820ee0ec..9cc4148deddd 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -307,7 +307,9 @@  struct sk_buff *__build_skb(void *data, unsigned int frag_size)
 {
 	struct sk_buff *skb;
 
-	skb = kmem_cache_alloc(skbuff_head_cache, GFP_ATOMIC);
+	skb = kmem_cache_alloc(skbuff_head_cache,
+			       GFP_ATOMIC | __GFP_NOWARN);
+
 	if (unlikely(!skb))
 		return NULL;