issue with inflight pages from page_pool

Message ID ZD2HjZZSOjtsnQaf@lore-desk (mailing list archive)
State RFC
Series issue with inflight pages from page_pool

Checks

Context Check Description
netdev/tree_selection success Guessing tree name failed - patch did not apply, async

Commit Message

Lorenzo Bianconi April 17, 2023, 5:53 p.m. UTC
Hi all,

I am triggering an issue with a device running the page_pool allocator.
In particular, the device is running an iperf tcp server receiving traffic
from a remote client. On the driver side I loaded a simple xdp program returning
xdp_pass. When I remove the ebpf program and destroy the pool, the page_pool
allocator starts complaining in page_pool_release_retry() that not all the pages
have been returned to the allocator. In fact, the pool is not really destroyed
in this case.
Debugging the code, it seems the pages are stuck in the softnet_data defer_list and
they are never freed in skb_defer_free_flush() since there is no more tcp
traffic. To prove it, I tried to set sysctl_skb_defer_max to 0 and the issue
does not occur.
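
For reference, this is roughly what skb_defer_free_flush() looks like
(net/core/dev.c, ~v6.3); it is only invoked from net_rx_action() on the CPU
that owns the list, so with no further rx traffic (and no NAPI scheduled
there) nothing ever drains it:

static void skb_defer_free_flush(struct softnet_data *sd)
{
	struct sk_buff *skb, *next;
	unsigned long flags;

	/* Paired with WRITE_ONCE() in skb_attempt_defer_free() */
	if (!READ_ONCE(sd->defer_list))
		return;

	spin_lock_irqsave(&sd->defer_lock, flags);
	skb = sd->defer_list;
	sd->defer_list = NULL;
	sd->defer_count = 0;
	spin_unlock_irqrestore(&sd->defer_lock, flags);

	while (skb != NULL) {
		next = skb->next;
		napi_consume_skb(skb, 1); /* page_pool pages are returned here */
		skb = next;
	}
}
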
I developed the poc patch below and the issue seems to be fixed:


Is it ok or do you think there is a better solution for this issue?
@Felix: I think we faced a similar issue in mt76 unloading the module, right?

Regards,
Lorenzo

Comments

Eric Dumazet April 17, 2023, 6:08 p.m. UTC | #1
On Mon, Apr 17, 2023 at 7:53 PM Lorenzo Bianconi <lorenzo@kernel.org> wrote:
>
> Hi all,
>
> I am triggering an issue with a device running the page_pool allocator.
> In particular, the device is running an iperf tcp server receiving traffic
> from a remote client. On the driver I loaded a simple xdp program returning
> xdp_pass. When I remove the ebpf program and destroy the pool, page_pool
> allocator starts complaining in page_pool_release_retry() that not all the pages
> have been returned to the allocator. In fact, the pool is not really destroyed
> in this case.
> Debugging the code it seems the pages are stuck softnet_data defer_list and
> they are never freed in skb_defer_free_flush() since I do not have any more tcp
> traffic. To prove it, I tried to set sysctl_skb_defer_max to 0 and the issue
> does not occur.
> I developed the poc patch below and the issue seems to be fixed:

I do not see why this would be different than having buffers sitting
in some tcp receive (or out-of-order) queues for a few minutes?

Or buffers transferred to another socket or pipe (splice() and friends)
Lorenzo Bianconi April 17, 2023, 6:17 p.m. UTC | #2
> On Mon, Apr 17, 2023 at 7:53 PM Lorenzo Bianconi <lorenzo@kernel.org> wrote:
> >
> > Hi all,
> >
> > I am triggering an issue with a device running the page_pool allocator.
> > In particular, the device is running an iperf tcp server receiving traffic
> > from a remote client. On the driver I loaded a simple xdp program returning
> > xdp_pass. When I remove the ebpf program and destroy the pool, page_pool
> > allocator starts complaining in page_pool_release_retry() that not all the pages
> > have been returned to the allocator. In fact, the pool is not really destroyed
> > in this case.
> > Debugging the code it seems the pages are stuck softnet_data defer_list and
> > they are never freed in skb_defer_free_flush() since I do not have any more tcp
> > traffic. To prove it, I tried to set sysctl_skb_defer_max to 0 and the issue
> > does not occur.
> > I developed the poc patch below and the issue seems to be fixed:
> 
> I do not see why this would be different than having buffers sitting
> in some tcp receive
> (or out or order) queues for a few minutes ?

The main issue in my tests (and even in mt76 I think) is that the pages are not
returned to the pool for a very long time (even hours), and as a result the pool
sits in a 'limbo' state where it is not actually deallocated and
page_pool_release_retry() keeps complaining about it. I think this is because
there is no more tcp traffic to free them, but I am not so familiar with the tcp
codebase :)

Regards,
Lorenzo

> 
> Or buffers transferred to another socket or pipe (splice() and friends)
Jakub Kicinski April 17, 2023, 6:23 p.m. UTC | #3
On Mon, 17 Apr 2023 20:17:45 +0200 Lorenzo Bianconi wrote:
> > I do not see why this would be different than having buffers sitting
> > in some tcp receive
> > (or out or order) queues for a few minutes ?  
> 
> The main issue in my tests (and even in mt76 I think) is the pages are not returned
> to the pool for a very long time (even hours) and doing so the pool is like in a
> 'limbo' state where it is not actually deallocated and page_pool_release_retry
> continues complaining about it. I think this is because we do not have more tcp
> traffic to deallocate them, but I am not so familiar with tcp codebase :)

I've seen the page leaks too in my recent testing but I just assumed 
I fumbled the quick bnxt conversion to page pool. Could it be something
with page frags? It happened a lot if I used page frags, IIRC mt76 is
using page frags, too.

Is drgn available for your target? You could try to scan the pages on
the system and see if you can find what's still pointing to the page
pool (assuming they are indeed leaked and not returned to the page
allocator without releasing :()
Gerhard Engleder April 17, 2023, 6:24 p.m. UTC | #4
On 17.04.23 20:17, Lorenzo Bianconi wrote:
>> On Mon, Apr 17, 2023 at 7:53 PM Lorenzo Bianconi <lorenzo@kernel.org> wrote:
>>>
>>> Hi all,
>>>
>>> I am triggering an issue with a device running the page_pool allocator.
>>> In particular, the device is running an iperf tcp server receiving traffic
>>> from a remote client. On the driver I loaded a simple xdp program returning
>>> xdp_pass. When I remove the ebpf program and destroy the pool, page_pool
>>> allocator starts complaining in page_pool_release_retry() that not all the pages
>>> have been returned to the allocator. In fact, the pool is not really destroyed
>>> in this case.
>>> Debugging the code it seems the pages are stuck softnet_data defer_list and
>>> they are never freed in skb_defer_free_flush() since I do not have any more tcp
>>> traffic. To prove it, I tried to set sysctl_skb_defer_max to 0 and the issue
>>> does not occur.
>>> I developed the poc patch below and the issue seems to be fixed:
>>
>> I do not see why this would be different than having buffers sitting
>> in some tcp receive
>> (or out or order) queues for a few minutes ?
> 
> The main issue in my tests (and even in mt76 I think) is the pages are not returned
> to the pool for a very long time (even hours) and doing so the pool is like in a
> 'limbo' state where it is not actually deallocated and page_pool_release_retry
> continues complaining about it. I think this is because we do not have more tcp
> traffic to deallocate them, but I am not so familiar with tcp codebase :)
> 
> Regards,
> Lorenzo
> 
>>
>> Or buffers transferred to another socket or pipe (splice() and friends)

I'm not absolutely sure that it is the same problem, but I also saw some
problems with page_pool destroying and page_pool_release_retry(). I did
post it, but I did not get any reply:

https://lore.kernel.org/bpf/20230311213709.42625-1-gerhard@engleder-embedded.com/T/

Could this be a similar issue?

Gerhard
Lorenzo Bianconi April 17, 2023, 6:42 p.m. UTC | #5
> On Mon, 17 Apr 2023 20:17:45 +0200 Lorenzo Bianconi wrote:
> > > I do not see why this would be different than having buffers sitting
> > > in some tcp receive
> > > (or out or order) queues for a few minutes ?  
> > 
> > The main issue in my tests (and even in mt76 I think) is the pages are not returned
> > to the pool for a very long time (even hours) and doing so the pool is like in a
> > 'limbo' state where it is not actually deallocated and page_pool_release_retry
> > continues complaining about it. I think this is because we do not have more tcp
> > traffic to deallocate them, but I am not so familiar with tcp codebase :)
> 
> I've seen the page leaks too in my recent testing but I just assumed 
> I fumbled the quick bnxt conversion to page pool. Could it be something
> with page frags? It happened a lot if I used page frags, IIRC mt76 is
> using page frags, too.

my device is allocating a full order-0 page from the pool, so it is not related
to fragmented pages as fixed for mt76 by Alex and Felix.

> 
> Is drgn available for your target? You could try to scan the pages on
> the system and see if you can find what's still pointing to the page
> pool (assuming they are indeed leaked and not returned to the page
> allocator without releasing :()

I will test it, but since setting sysctl_skb_defer_max to 0 fixes the issue,
I think the pages are still properly linked to the pool, they are just not
returned to it. I proved it using the other patch I posted [0] where I can see
the counter of returned pages incrementing from time to time (over a very long
time window...).

Unrelated to this issue, but debugging it I think I found a page_pool leak in
skb_condense() [1] where we can reallocate the skb data using kmalloc for a
page_pool recycled skb.

Regards,
Lorenzo

[0] https://lore.kernel.org/netdev/20230417111204.08f19827@kernel.org/T/#t
[1] https://github.com/torvalds/linux/blob/master/net/core/skbuff.c#L6602
Lorenzo Bianconi April 17, 2023, 7 p.m. UTC | #6
> On 17.04.23 20:17, Lorenzo Bianconi wrote:
> > > On Mon, Apr 17, 2023 at 7:53 PM Lorenzo Bianconi <lorenzo@kernel.org> wrote:
> > > > 
> > > > Hi all,
> > > > 
> > > > I am triggering an issue with a device running the page_pool allocator.
> > > > In particular, the device is running an iperf tcp server receiving traffic
> > > > from a remote client. On the driver I loaded a simple xdp program returning
> > > > xdp_pass. When I remove the ebpf program and destroy the pool, page_pool
> > > > allocator starts complaining in page_pool_release_retry() that not all the pages
> > > > have been returned to the allocator. In fact, the pool is not really destroyed
> > > > in this case.
> > > > Debugging the code it seems the pages are stuck softnet_data defer_list and
> > > > they are never freed in skb_defer_free_flush() since I do not have any more tcp
> > > > traffic. To prove it, I tried to set sysctl_skb_defer_max to 0 and the issue
> > > > does not occur.
> > > > I developed the poc patch below and the issue seems to be fixed:
> > > 
> > > I do not see why this would be different than having buffers sitting
> > > in some tcp receive
> > > (or out or order) queues for a few minutes ?
> > 
> > The main issue in my tests (and even in mt76 I think) is the pages are not returned
> > to the pool for a very long time (even hours) and doing so the pool is like in a
> > 'limbo' state where it is not actually deallocated and page_pool_release_retry
> > continues complaining about it. I think this is because we do not have more tcp
> > traffic to deallocate them, but I am not so familiar with tcp codebase :)
> > 
> > Regards,
> > Lorenzo
> > 
> > > 
> > > Or buffers transferred to another socket or pipe (splice() and friends)
> 
> I'm not absolutely sure that it is the same problem, but I also saw some
> problems with page_pool destroying and page_pool_release_retry(). I did
> post it, but I did not get any reply:
> 
> https://lore.kernel.org/bpf/20230311213709.42625-1-gerhard@engleder-embedded.com/T/
> 
> Could this be a similar issue?

I am not sure either. To check whether it is the same issue, you can try to
run the test applying my poc patch or setting sysctl_skb_defer_max to 0.

Regards,
Lorenzo

> 
> Gerhard
Jakub Kicinski April 17, 2023, 7:08 p.m. UTC | #7
On Mon, 17 Apr 2023 20:42:39 +0200 Lorenzo Bianconi wrote:
> > Is drgn available for your target? You could try to scan the pages on
> > the system and see if you can find what's still pointing to the page
> > pool (assuming they are indeed leaked and not returned to the page
> > allocator without releasing :()  
> 
> I will test it but since setting sysctl_skb_defer_max to 0 fixes the issue,
> I think the pages are still properly linked to the pool, they are just not
> returned to it. I proved it using the other patch I posted [0] where I can see
> the counter of returned pages incrementing from time to time (in a very long
> time slot..).

If it's that then I'm with Eric. There are many ways to keep the pages
in use, no point working around one of them and not the rest :(

> Unrelated to this issue, but debugging it I think a found a page_pool leak in
> skb_condense() [1] where we can reallocate the skb data using kmalloc for a
> page_pool recycled skb.

I don't see a problem - having pp_recycle = 1 and the head in slab is legal.
pp_recycle just means that *if* a page is from the page pool we own
the recycling reference. A page from slab will not be treated as a PP
page because it doesn't have pp_magic set to the correct pattern.
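
Roughly, the check looks like this (simplified from
page_pool_return_skb_page() in net/core/page_pool.c, ~v6.3):

bool page_pool_return_skb_page(struct page *page)
{
	struct page_pool *pp;

	page = compound_head(page);

	/* The low bits of pp_magic carry other flags, mask them off before
	 * comparing against the page_pool signature.
	 */
	if (unlikely((page->pp_magic & ~0x3UL) != PP_SIGNATURE))
		return false;	/* not a page_pool page, e.g. a slab head */

	pp = page->pp;
	page_pool_put_full_page(pp, page, false);

	return true;
}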
Lorenzo Bianconi April 17, 2023, 9:31 p.m. UTC | #8
> On Mon, 17 Apr 2023 20:42:39 +0200 Lorenzo Bianconi wrote:
> > > Is drgn available for your target? You could try to scan the pages on
> > > the system and see if you can find what's still pointing to the page
> > > pool (assuming they are indeed leaked and not returned to the page
> > > allocator without releasing :()  
> > 
> > I will test it but since setting sysctl_skb_defer_max to 0 fixes the issue,
> > I think the pages are still properly linked to the pool, they are just not
> > returned to it. I proved it using the other patch I posted [0] where I can see
> > the counter of returned pages incrementing from time to time (in a very long
> > time slot..).
> 
> If it's that then I'm with Eric. There are many ways to keep the pages
> in use, no point working around one of them and not the rest :(

I was not clear here, my fault. What I mean is that I can see the returned
pages counter increasing from time to time, but in most of the tests,
even 2h after the tcp traffic has stopped, page_pool_release_retry()
still complains that not all the pages have been returned to the pool and
so the pool has not been deallocated yet.
The chunk of code in my first email is just to demonstrate the issue
and I am completely fine with a better solution :) I guess we just
need a way to free the pool in a reasonable amount of time. Agree?

> 
> > Unrelated to this issue, but debugging it I think a found a page_pool leak in
> > skb_condense() [1] where we can reallocate the skb data using kmalloc for a
> > page_pool recycled skb.
> 
> I don't see a problem having pp_recycle = 1 and head in slab is legal.
> pp_recycle just means that *if* a page is from the page pool we own 
> the recycling reference. A page from slab will not be treated as a PP
> page cause it doesn't have pp_magic set to the correct pattern.

ack, right. Thx for pointing this out.

Regards,
Lorenzo
Jakub Kicinski April 17, 2023, 11:32 p.m. UTC | #9
On Mon, 17 Apr 2023 23:31:01 +0200 Lorenzo Bianconi wrote:
> > If it's that then I'm with Eric. There are many ways to keep the pages
> > in use, no point working around one of them and not the rest :(  
> 
> I was not clear here, my fault. What I mean is I can see the returned
> pages counter increasing from time to time, but during most of tests,
> even after 2h the tcp traffic has stopped, page_pool_release_retry()
> still complains not all the pages are returned to the pool and so the
> pool has not been deallocated yet.
> The chunk of code in my first email is just to demonstrate the issue
> and I am completely fine to get a better solution :) 

Your problem is perhaps made worse by threaded NAPI, you have
defer-free skbs sprayed across all cores and no NAPI there to 
flush them :(

> I guess we just need a way to free the pool in a reasonable amount 
> of time. Agree?

Whether we need to guarantee the release is the real question.
Maybe it's more of a false-positive warning.

Flushing the defer list is probably fine as a hack, but it's not
a full fix as Eric explained. False positive can still happen.

I'm ambivalent. My only real request would be to make the flushing
a helper in net/core/dev.c rather than open-coded in page_pool.c.
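
Something along these lines, just a sketch - the helper name and call sites
are made up, not existing kernel API:

/* Hypothetical helper for net/core/dev.c: poke every online CPU that has
 * deferred skbs queued, reusing the IPI machinery skb_attempt_defer_free()
 * already relies on.
 */
void skb_defer_list_flush_all(void)
{
	int cpu;

	for_each_online_cpu(cpu) {
		struct softnet_data *sd = &per_cpu(softnet_data, cpu);

		/* nothing queued on this CPU */
		if (!READ_ONCE(sd->defer_list))
			continue;

		if (!cmpxchg(&sd->defer_ipi_scheduled, 0, 1))
			smp_call_function_single_async(cpu, &sd->defer_csd);
	}
}

page_pool_release_retry() (or a netdev unregister hook) could then call it
instead of open coding the loop.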

Somewhat related - Eric, do we need to handle defer_list in dev_cpu_dead()?
Lorenzo Bianconi April 18, 2023, 7:36 a.m. UTC | #10
> On Mon, 17 Apr 2023 23:31:01 +0200 Lorenzo Bianconi wrote:
> > > If it's that then I'm with Eric. There are many ways to keep the pages
> > > in use, no point working around one of them and not the rest :(  
> > 
> > I was not clear here, my fault. What I mean is I can see the returned
> > pages counter increasing from time to time, but during most of tests,
> > even after 2h the tcp traffic has stopped, page_pool_release_retry()
> > still complains not all the pages are returned to the pool and so the
> > pool has not been deallocated yet.
> > The chunk of code in my first email is just to demonstrate the issue
> > and I am completely fine to get a better solution :) 
> 
> Your problem is perhaps made worse by threaded NAPI, you have
> defer-free skbs sprayed across all cores and no NAPI there to 
> flush them :(

yes, exactly :)

> 
> > I guess we just need a way to free the pool in a reasonable amount 
> > of time. Agree?
> 
> Whether we need to guarantee the release is the real question.

yes, this is the main goal of my email. The defer-free skbs behaviour seems to
conflict with the page_pool pending-pages monitoring mechanism, or at least
they do not work well together.

@Jesper, Ilias: any input on it?

> Maybe it's more of a false-positive warning.
> 
> Flushing the defer list is probably fine as a hack, but it's not
> a full fix as Eric explained. False positive can still happen.

agree, it was just a way to give an idea of the issue, not a proper solution.

Regards,
Lorenzo

> 
> I'm ambivalent. My only real request wold be to make the flushing 
> a helper in net/core/dev.c rather than open coded in page_pool.c.
> 
> Somewhat related - Eric, do we need to handle defer_list in dev_cpu_dead()?
Jesper Dangaard Brouer April 19, 2023, 11:08 a.m. UTC | #11
On 18/04/2023 09.36, Lorenzo Bianconi wrote:
>> On Mon, 17 Apr 2023 23:31:01 +0200 Lorenzo Bianconi wrote:
>>>> If it's that then I'm with Eric. There are many ways to keep the pages
>>>> in use, no point working around one of them and not the rest :(
>>>
>>> I was not clear here, my fault. What I mean is I can see the returned
>>> pages counter increasing from time to time, but during most of tests,
>>> even after 2h the tcp traffic has stopped, page_pool_release_retry()
>>> still complains not all the pages are returned to the pool and so the
>>> pool has not been deallocated yet.
>>> The chunk of code in my first email is just to demonstrate the issue
>>> and I am completely fine to get a better solution :)
>>
>> Your problem is perhaps made worse by threaded NAPI, you have
>> defer-free skbs sprayed across all cores and no NAPI there to
>> flush them :(
> 
> yes, exactly :)
> 
>>
>>> I guess we just need a way to free the pool in a reasonable amount
>>> of time. Agree?
>>
>> Whether we need to guarantee the release is the real question.
> 
> yes, this is the main goal of my email. The defer-free skbs behaviour seems in
> contrast with the page_pool pending pages monitor mechanism or at least they
> do not work well together.
> 
> @Jesper, Ilias: any input on it?
> 
>> Maybe it's more of a false-positive warning.
>>
>> Flushing the defer list is probably fine as a hack, but it's not
>> a full fix as Eric explained. False positive can still happen.
> 
> agree, it was just a way to give an idea of the issue, not a proper solution.
> 
> Regards,
> Lorenzo
> 
>>
>> I'm ambivalent. My only real request wold be to make the flushing
>> a helper in net/core/dev.c rather than open coded in page_pool.c.

I agree. We need a central defer_list flushing helper.

It is too easy to say this is a false-positive warning.
IMHO this exposes an issue with the sd->defer_list system.

Lorenzo's test is adding+removing veth devices, which creates and runs
NAPI processing on random CPUs.  After veth netdevices (+NAPI) are
removed, nothing will naturally invoke net_rx_softirq on this CPU.
Thus, we have SKBs waiting on these CPUs' sd->defer_list.  Furthermore, we
will not create new SKBs with this skb->alloc_cpu to trigger the RX softirq
IPI call (trigger_rx_softirq), even if this CPU processes and frees SKBs.
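
The relevant part, simplified from skb_attempt_defer_free() in
net/core/skbuff.c (~v6.3): the remote CPU is only IPI'ed when its defer
queue crosses half of sysctl_skb_defer_max, so without new traffic the
deferred SKBs just sit there:

void skb_attempt_defer_free(struct sk_buff *skb)
{
	int cpu = skb->alloc_cpu;
	struct softnet_data *sd;
	unsigned int defer_max;
	unsigned long flags;
	bool kick;

	if (WARN_ON_ONCE(cpu >= nr_cpu_ids) || !cpu_online(cpu) ||
	    cpu == raw_smp_processor_id()) {
		__kfree_skb(skb);
		return;
	}

	sd = &per_cpu(softnet_data, cpu);
	defer_max = READ_ONCE(sysctl_skb_defer_max);
	if (READ_ONCE(sd->defer_count) >= defer_max) {
		__kfree_skb(skb);
		return;
	}

	spin_lock_irqsave(&sd->defer_lock, flags);
	/* Send an IPI every time the queue reaches half capacity. */
	kick = sd->defer_count == (defer_max >> 1);
	WRITE_ONCE(sd->defer_count, sd->defer_count + 1);

	skb->next = sd->defer_list;
	/* Paired with READ_ONCE() in skb_defer_free_flush() */
	WRITE_ONCE(sd->defer_list, skb);
	spin_unlock_irqrestore(&sd->defer_lock, flags);

	if (unlikely(kick) && !cmpxchg(&sd->defer_ipi_scheduled, 0, 1))
		smp_call_function_single_async(cpu, &sd->defer_csd);
}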

I see two solutions:

  (1) When a netdevice/NAPI unregister happens, call the defer_list flushing
helper.

  (2) Use the napi_watchdog to detect that the defer_list is (many jiffies)
old, and then call the defer_list flushing helper.


>>
>> Somewhat related - Eric, do we need to handle defer_list in dev_cpu_dead()?

Looks to me like dev_cpu_dead() also needs this flushing helper for
sd->defer_list, or at least to move the sd->defer_list to an sd that will
run eventually.

--Jesper
Eric Dumazet April 19, 2023, 12:09 p.m. UTC | #12
On Wed, Apr 19, 2023 at 1:08 PM Jesper Dangaard Brouer
<jbrouer@redhat.com> wrote:
>
>
> On 18/04/2023 09.36, Lorenzo Bianconi wrote:
> >> On Mon, 17 Apr 2023 23:31:01 +0200 Lorenzo Bianconi wrote:
> >>>> If it's that then I'm with Eric. There are many ways to keep the pages
> >>>> in use, no point working around one of them and not the rest :(
> >>>
> >>> I was not clear here, my fault. What I mean is I can see the returned
> >>> pages counter increasing from time to time, but during most of tests,
> >>> even after 2h the tcp traffic has stopped, page_pool_release_retry()
> >>> still complains not all the pages are returned to the pool and so the
> >>> pool has not been deallocated yet.
> >>> The chunk of code in my first email is just to demonstrate the issue
> >>> and I am completely fine to get a better solution :)
> >>
> >> Your problem is perhaps made worse by threaded NAPI, you have
> >> defer-free skbs sprayed across all cores and no NAPI there to
> >> flush them :(
> >
> > yes, exactly :)
> >
> >>
> >>> I guess we just need a way to free the pool in a reasonable amount
> >>> of time. Agree?
> >>
> >> Whether we need to guarantee the release is the real question.
> >
> > yes, this is the main goal of my email. The defer-free skbs behaviour seems in
> > contrast with the page_pool pending pages monitor mechanism or at least they
> > do not work well together.
> >
> > @Jesper, Ilias: any input on it?
> >
> >> Maybe it's more of a false-positive warning.
> >>
> >> Flushing the defer list is probably fine as a hack, but it's not
> >> a full fix as Eric explained. False positive can still happen.
> >
> > agree, it was just a way to give an idea of the issue, not a proper solution.
> >
> > Regards,
> > Lorenzo
> >
> >>
> >> I'm ambivalent. My only real request wold be to make the flushing
> >> a helper in net/core/dev.c rather than open coded in page_pool.c.
>
> I agree. We need a central defer_list flushing helper
>
> It is too easy to say this is a false-positive warning.
> IHMO this expose an issue with the sd->defer_list system.
>
> Lorenzo's test is adding+removing veth devices, which creates and runs
> NAPI processing on random CPUs.  After veth netdevices (+NAPI) are
> removed, nothing will naturally invoking net_rx_softirq on this CPU.
> Thus, we have SKBs waiting on CPUs sd->defer_list.  Further more we will
> not create new SKB with this skb->alloc_cpu, to trigger RX softirq IPI
> call (trigger_rx_softirq), even if this CPU process and frees SKBs.
>
> I see two solutions:
>
>   (1) When netdevice/NAPI unregister happens call defer_list flushing
> helper.
>
>   (2) Use napi_watchdog to detect if defer_list is (many jiffies) old,
> and then call defer_list flushing helper.
>
>
> >>
> >> Somewhat related - Eric, do we need to handle defer_list in dev_cpu_dead()?
>
> Looks to me like dev_cpu_dead() also need this flushing helper for
> sd->defer_list, or at least moving the sd->defer_list to an sd that will
> run eventually.

I think I just considered that having a few skbs in the per-cpu list would not
be an issue, especially considering skbs can sit for hours in tcp receive queues.

Do we expect to have some kind of callback/shrinker to instruct TCP or
pipes to release all pages that prevent a page_pool from being freed?

Here, we are talking of hundreds of thousands of skbs, compared to at
most 32 skbs per cpu.

Perhaps set sysctl_skb_defer_max to zero by default, so that admins can opt in.
Jesper Dangaard Brouer April 19, 2023, 2:02 p.m. UTC | #13
On 19/04/2023 14.09, Eric Dumazet wrote:
> On Wed, Apr 19, 2023 at 1:08 PM Jesper Dangaard Brouer
>>
>>
>> On 18/04/2023 09.36, Lorenzo Bianconi wrote:
>>>> On Mon, 17 Apr 2023 23:31:01 +0200 Lorenzo Bianconi wrote:
>>>>>> If it's that then I'm with Eric. There are many ways to keep the pages
>>>>>> in use, no point working around one of them and not the rest :(
>>>>>
>>>>> I was not clear here, my fault. What I mean is I can see the returned
>>>>> pages counter increasing from time to time, but during most of tests,
>>>>> even after 2h the tcp traffic has stopped, page_pool_release_retry()
>>>>> still complains not all the pages are returned to the pool and so the
>>>>> pool has not been deallocated yet.
>>>>> The chunk of code in my first email is just to demonstrate the issue
>>>>> and I am completely fine to get a better solution :)
>>>>
>>>> Your problem is perhaps made worse by threaded NAPI, you have
>>>> defer-free skbs sprayed across all cores and no NAPI there to
>>>> flush them :(
>>>
>>> yes, exactly :)
>>>
>>>>
>>>>> I guess we just need a way to free the pool in a reasonable amount
>>>>> of time. Agree?
>>>>
>>>> Whether we need to guarantee the release is the real question.
>>>
>>> yes, this is the main goal of my email. The defer-free skbs behaviour seems in
>>> contrast with the page_pool pending pages monitor mechanism or at least they
>>> do not work well together.
>>>
>>> @Jesper, Ilias: any input on it?
>>>
>>>> Maybe it's more of a false-positive warning.
>>>>
>>>> Flushing the defer list is probably fine as a hack, but it's not
>>>> a full fix as Eric explained. False positive can still happen.
>>>
>>> agree, it was just a way to give an idea of the issue, not a proper solution.
>>>
>>> Regards,
>>> Lorenzo
>>>
>>>>
>>>> I'm ambivalent. My only real request wold be to make the flushing
>>>> a helper in net/core/dev.c rather than open coded in page_pool.c.
>>
>> I agree. We need a central defer_list flushing helper
>>
>> It is too easy to say this is a false-positive warning.
>> IHMO this expose an issue with the sd->defer_list system.
>>
>> Lorenzo's test is adding+removing veth devices, which creates and runs
>> NAPI processing on random CPUs.  After veth netdevices (+NAPI) are
>> removed, nothing will naturally invoking net_rx_softirq on this CPU.
>> Thus, we have SKBs waiting on CPUs sd->defer_list.  Further more we will
>> not create new SKB with this skb->alloc_cpu, to trigger RX softirq IPI
>> call (trigger_rx_softirq), even if this CPU process and frees SKBs.
>>
>> I see two solutions:
>>
>>    (1) When netdevice/NAPI unregister happens call defer_list flushing
>> helper.
>>
>>    (2) Use napi_watchdog to detect if defer_list is (many jiffies) old,
>> and then call defer_list flushing helper.
>>
>>
>>>>
>>>> Somewhat related - Eric, do we need to handle defer_list in dev_cpu_dead()?
>>
>> Looks to me like dev_cpu_dead() also need this flushing helper for
>> sd->defer_list, or at least moving the sd->defer_list to an sd that will
>> run eventually.
> 
> I think I just considered having a few skbs in per-cpu list would not
> be an issue,
> especially considering skbs can sit hours in tcp receive queues.
>

It was the first thing I said to Lorenzo when he first reported the
problem to me (over chat): It is likely packets sitting in a TCP queue.
Then I instructed him to look at output from netstat to see queues and
look for TIME-WAIT, FIN-WAIT etc.


> Do we expect hacing some kind of callback/shrinker to instruct TCP or
> pipes to release all pages that prevent
> a page_pool to be freed ?
> 

This is *not* what I'm asking for.

With TCP sockets (pipes etc) we can take care of closing the sockets
(and programs etc) to free up the SKBs (and perhaps wait for timeouts)
to make sure the page_pool shutdown doesn't hang.

The problem arises for all the selftests that use veth and bpf_test_run
(using bpf_test_run_xdp_live / xdp_test_run_setup).  For the selftests
we obviously take care of closing sockets and removing veth interfaces
again.  Problem: the defer_list corner case isn't under our control.


> Here, we are talking of hundreds of thousands of skbs, compared to at
> most 32 skbs per cpu.
> 

It is not a memory usage concern.

> Perhaps sets sysctl_skb_defer_max to zero by default, so that admins
> can opt-in
> 

I really like the sd->defer_list system and I think it should be enabled
by default.  Even if it is disabled by default, we still need to handle these
corner cases, as the selftests shouldn't start to cause issues when this
gets enabled.

The simple solution is: (1) When a netdevice/NAPI unregister happens, call
the defer_list flushing helper.  And perhaps we also need to call it in
xdp_test_run_teardown().  How do you feel about that?

--Jesper
Eric Dumazet April 19, 2023, 2:18 p.m. UTC | #14
On Wed, Apr 19, 2023 at 4:02 PM Jesper Dangaard Brouer
<jbrouer@redhat.com> wrote:

> With TCP sockets (pipes etc) we can take care of closing the sockets
> (and programs etc) to free up the SKBs (and perhaps wait for timeouts)
> to make sure the page_pool shutdown doesn't hang.

This cannot happen in many cases, like pages now being mapped to user
space programs, or nfsd, or whatever.

I think that fundamentally, page pool should handle this case gracefully.

For instance, when a TCP socket is closed(), user space can die, but
many resources in the kernel are freed later.

We do not block a close() just because a qdisc decided to hold a
buffer for a few minutes.
Lorenzo Bianconi April 19, 2023, 2:21 p.m. UTC | #15
> 
> On 19/04/2023 14.09, Eric Dumazet wrote:
> > On Wed, Apr 19, 2023 at 1:08 PM Jesper Dangaard Brouer
> > > 
> > > 
> > > On 18/04/2023 09.36, Lorenzo Bianconi wrote:
> > > > > On Mon, 17 Apr 2023 23:31:01 +0200 Lorenzo Bianconi wrote:
> > > > > > > If it's that then I'm with Eric. There are many ways to keep the pages
> > > > > > > in use, no point working around one of them and not the rest :(
> > > > > > 
> > > > > > I was not clear here, my fault. What I mean is I can see the returned
> > > > > > pages counter increasing from time to time, but during most of tests,
> > > > > > even after 2h the tcp traffic has stopped, page_pool_release_retry()
> > > > > > still complains not all the pages are returned to the pool and so the
> > > > > > pool has not been deallocated yet.
> > > > > > The chunk of code in my first email is just to demonstrate the issue
> > > > > > and I am completely fine to get a better solution :)
> > > > > 
> > > > > Your problem is perhaps made worse by threaded NAPI, you have
> > > > > defer-free skbs sprayed across all cores and no NAPI there to
> > > > > flush them :(
> > > > 
> > > > yes, exactly :)
> > > > 
> > > > > 
> > > > > > I guess we just need a way to free the pool in a reasonable amount
> > > > > > of time. Agree?
> > > > > 
> > > > > Whether we need to guarantee the release is the real question.
> > > > 
> > > > yes, this is the main goal of my email. The defer-free skbs behaviour seems in
> > > > contrast with the page_pool pending pages monitor mechanism or at least they
> > > > do not work well together.
> > > > 
> > > > @Jesper, Ilias: any input on it?
> > > > 
> > > > > Maybe it's more of a false-positive warning.
> > > > > 
> > > > > Flushing the defer list is probably fine as a hack, but it's not
> > > > > a full fix as Eric explained. False positive can still happen.
> > > > 
> > > > agree, it was just a way to give an idea of the issue, not a proper solution.
> > > > 
> > > > Regards,
> > > > Lorenzo
> > > > 
> > > > > 
> > > > > I'm ambivalent. My only real request wold be to make the flushing
> > > > > a helper in net/core/dev.c rather than open coded in page_pool.c.
> > > 
> > > I agree. We need a central defer_list flushing helper
> > > 
> > > It is too easy to say this is a false-positive warning.
> > > IHMO this expose an issue with the sd->defer_list system.
> > > 
> > > Lorenzo's test is adding+removing veth devices, which creates and runs
> > > NAPI processing on random CPUs.  After veth netdevices (+NAPI) are
> > > removed, nothing will naturally invoking net_rx_softirq on this CPU.
> > > Thus, we have SKBs waiting on CPUs sd->defer_list.  Further more we will
> > > not create new SKB with this skb->alloc_cpu, to trigger RX softirq IPI
> > > call (trigger_rx_softirq), even if this CPU process and frees SKBs.
> > > 
> > > I see two solutions:
> > > 
> > >    (1) When netdevice/NAPI unregister happens call defer_list flushing
> > > helper.
> > > 
> > >    (2) Use napi_watchdog to detect if defer_list is (many jiffies) old,
> > > and then call defer_list flushing helper.
> > > 
> > > 
> > > > > 
> > > > > Somewhat related - Eric, do we need to handle defer_list in dev_cpu_dead()?
> > > 
> > > Looks to me like dev_cpu_dead() also need this flushing helper for
> > > sd->defer_list, or at least moving the sd->defer_list to an sd that will
> > > run eventually.
> > 
> > I think I just considered having a few skbs in per-cpu list would not
> > be an issue,
> > especially considering skbs can sit hours in tcp receive queues.
> > 
> 
> It was the first thing I said to Lorenzo when he first reported the
> problem to me (over chat): It is likely packets sitting in a TCP queue.
> Then I instructed him to look at output from netstat to see queues and
> look for TIME-WAIT, FIN-WAIT etc.
> 
> 
> > Do we expect hacing some kind of callback/shrinker to instruct TCP or
> > pipes to release all pages that prevent
> > a page_pool to be freed ?
> > 
> 
> This is *not* what I'm asking for.
> 
> With TCP sockets (pipes etc) we can take care of closing the sockets
> (and programs etc) to free up the SKBs (and perhaps wait for timeouts)
> to make sure the page_pool shutdown doesn't hang.
> 
> The problem arise for all the selftests that uses veth and bpf_test_run
> (using bpf_test_run_xdp_live / xdp_test_run_setup).  For the selftests
> we obviously take care of closing sockets and removing veth interfaces
> again.  Problem: The defer_list corner-case isn't under our control.
> 
> 
> > Here, we are talking of hundreds of thousands of skbs, compared to at
> > most 32 skbs per cpu.
> > 
> 
> It is not a memory usage concern.
> 
> > Perhaps sets sysctl_skb_defer_max to zero by default, so that admins
> > can opt-in
> > 
> 
> I really like the sd->defer_list system and I think is should be enabled
> by default.  Even if disabled by default, we still need to handle these
> corner cases, as the selftests shouldn't start to cause-issues when this
> gets enabled.
> 
> The simple solution is: (1) When netdevice/NAPI unregister happens call
> defer_list flushing helper.  And perhaps we also need to call it in
> xdp_test_run_teardown().  How do you feel about that?
> 
> --Jesper
> 

Today I was discussing this issue with Toke, and we were wondering:
if we just consider the page_pool use-case, what about moving the real
pool-destroy steps to the point where we return a page to the pool in
page_pool_put_full_page(), if the pool has been marked to be destroyed and
there are no inflight pages, instead of assuming we have all the pages in
the pool when we run page_pool_destroy()?
Maybe this just means getting rid of the warning in page_pool_release_retry() :)
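
Very roughly something like this (just a sketch, not a real patch:
'destroy_pending' is a made-up flag that page_pool_destroy() would set when
inflight pages remain, and the extra branch sits in the page-return path, so
its fast-path cost is an open question):

static void page_pool_maybe_release(struct page_pool *pool)
{
	/* single branch in the common case */
	if (likely(!READ_ONCE(pool->destroy_pending)))
		return;

	/* page_pool_release() scrubs the ring and frees the pool resources
	 * once the inflight count drops to zero.
	 */
	page_pool_release(pool);
}

with page_pool_maybe_release(pool) called at the end of the return path,
e.g. from page_pool_put_full_page().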

Regards,
Lorenzo
Jesper Dangaard Brouer April 19, 2023, 3:36 p.m. UTC | #16
On 19/04/2023 16.21, Lorenzo Bianconi wrote:
>>
>> On 19/04/2023 14.09, Eric Dumazet wrote:
>>> On Wed, Apr 19, 2023 at 1:08 PM Jesper Dangaard Brouer
>>>>
>>>>
>>>> On 18/04/2023 09.36, Lorenzo Bianconi wrote:
>>>>>> On Mon, 17 Apr 2023 23:31:01 +0200 Lorenzo Bianconi wrote:
>>>>>>>> If it's that then I'm with Eric. There are many ways to keep the pages
>>>>>>>> in use, no point working around one of them and not the rest :(
>>>>>>>
>>>>>>> I was not clear here, my fault. What I mean is I can see the returned
>>>>>>> pages counter increasing from time to time, but during most of tests,
>>>>>>> even after 2h the tcp traffic has stopped, page_pool_release_retry()
>>>>>>> still complains not all the pages are returned to the pool and so the
>>>>>>> pool has not been deallocated yet.
>>>>>>> The chunk of code in my first email is just to demonstrate the issue
>>>>>>> and I am completely fine to get a better solution :)
>>>>>>
>>>>>> Your problem is perhaps made worse by threaded NAPI, you have
>>>>>> defer-free skbs sprayed across all cores and no NAPI there to
>>>>>> flush them :(
>>>>>
>>>>> yes, exactly :)
>>>>>
>>>>>>
>>>>>>> I guess we just need a way to free the pool in a reasonable amount
>>>>>>> of time. Agree?
>>>>>>
>>>>>> Whether we need to guarantee the release is the real question.
>>>>>
>>>>> yes, this is the main goal of my email. The defer-free skbs behaviour seems in
>>>>> contrast with the page_pool pending pages monitor mechanism or at least they
>>>>> do not work well together.
>>>>>
>>>>> @Jesper, Ilias: any input on it?
>>>>>
>>>>>> Maybe it's more of a false-positive warning.
>>>>>>
>>>>>> Flushing the defer list is probably fine as a hack, but it's not
>>>>>> a full fix as Eric explained. False positive can still happen.
>>>>>
>>>>> agree, it was just a way to give an idea of the issue, not a proper solution.
>>>>>
>>>>> Regards,
>>>>> Lorenzo
>>>>>
>>>>>>
>>>>>> I'm ambivalent. My only real request wold be to make the flushing
>>>>>> a helper in net/core/dev.c rather than open coded in page_pool.c.
>>>>
>>>> I agree. We need a central defer_list flushing helper
>>>>
>>>> It is too easy to say this is a false-positive warning.
>>>> IHMO this expose an issue with the sd->defer_list system.
>>>>
>>>> Lorenzo's test is adding+removing veth devices, which creates and runs
>>>> NAPI processing on random CPUs.  After veth netdevices (+NAPI) are
>>>> removed, nothing will naturally invoking net_rx_softirq on this CPU.
>>>> Thus, we have SKBs waiting on CPUs sd->defer_list.  Further more we will
>>>> not create new SKB with this skb->alloc_cpu, to trigger RX softirq IPI
>>>> call (trigger_rx_softirq), even if this CPU process and frees SKBs.
>>>>
>>>> I see two solutions:
>>>>
>>>>     (1) When netdevice/NAPI unregister happens call defer_list flushing
>>>> helper.
>>>>
>>>>     (2) Use napi_watchdog to detect if defer_list is (many jiffies) old,
>>>> and then call defer_list flushing helper.
>>>>
>>>>
>>>>>>
>>>>>> Somewhat related - Eric, do we need to handle defer_list in dev_cpu_dead()?
>>>>
>>>> Looks to me like dev_cpu_dead() also need this flushing helper for
>>>> sd->defer_list, or at least moving the sd->defer_list to an sd that will
>>>> run eventually.
>>>
>>> I think I just considered having a few skbs in per-cpu list would not
>>> be an issue,
>>> especially considering skbs can sit hours in tcp receive queues.
>>>
>>
>> It was the first thing I said to Lorenzo when he first reported the
>> problem to me (over chat): It is likely packets sitting in a TCP queue.
>> Then I instructed him to look at output from netstat to see queues and
>> look for TIME-WAIT, FIN-WAIT etc.
>>
>>
>>> Do we expect hacing some kind of callback/shrinker to instruct TCP or
>>> pipes to release all pages that prevent
>>> a page_pool to be freed ?
>>>
>>
>> This is *not* what I'm asking for.
>>
>> With TCP sockets (pipes etc) we can take care of closing the sockets
>> (and programs etc) to free up the SKBs (and perhaps wait for timeouts)
>> to make sure the page_pool shutdown doesn't hang.
>>
>> The problem arise for all the selftests that uses veth and bpf_test_run
>> (using bpf_test_run_xdp_live / xdp_test_run_setup).  For the selftests
>> we obviously take care of closing sockets and removing veth interfaces
>> again.  Problem: The defer_list corner-case isn't under our control.
>>
>>
>>> Here, we are talking of hundreds of thousands of skbs, compared to at
>>> most 32 skbs per cpu.
>>>
>>
>> It is not a memory usage concern.
>>
>>> Perhaps sets sysctl_skb_defer_max to zero by default, so that admins
>>> can opt-in
>>>
>>
>> I really like the sd->defer_list system and I think is should be enabled
>> by default.  Even if disabled by default, we still need to handle these
>> corner cases, as the selftests shouldn't start to cause-issues when this
>> gets enabled.
>>
>> The simple solution is: (1) When netdevice/NAPI unregister happens call
>> defer_list flushing helper.  And perhaps we also need to call it in
>> xdp_test_run_teardown().  How do you feel about that?
>>
>> --Jesper
>>
> 
> Today I was discussing with Toke about this issue, and we were wondering,
> if we just consider the page_pool use-case, what about moving the real pool
> destroying steps when we return a page to the pool in page_pool_put_full_page()
> if the pool has marked to be destroyed and there are no inflight pages instead
> of assuming we have all the pages in the pool when we run page_pool_destroy()?

It sounds like you want to add a runtime check to the fast-path to
handle these corner cases?

For performance reasons we should not call page_pool_inflight() in the
fast-path, please!

Details: You hopefully mean running/calling page_pool_release(pool) and 
not page_pool_destroy().

I'm not totally against the idea, as long as someone is willing to do
extensive benchmarking showing that it doesn't affect fast-path performance.
Given we already read pool->p.flags in the fast-path, it might be possible
to hide the extra branch (in the CPU pipeline).


> Maybe this means just get rid of the warn in page_pool_release_retry() :)
> 

Sure, we can remove the print statement, but it feels like closing our
eyes and ignoring the problem.  We can remove the print statement and
still debug the problem, as I have added tracepoints (to debug this).
But users will not report these issues early... on the other hand most of
these reports will likely be false positives.

This reminds me that Jakub's recent defer patches, returning pages
'directly' to the page_pool alloc-cache, will actually result in this
kind of bug.  This is because page_pool_destroy() assumes that pages
cannot be returned to the alloc-cache, as the driver will have
"disconnected" the RX side.  We need to address this bug separately.
Lorenzo, you didn't happen to use a kernel with Jakub's patches included,
did you?

--Jesper
Jesper Dangaard Brouer April 19, 2023, 4:10 p.m. UTC | #17
On 19/04/2023 16.18, Eric Dumazet wrote:
> On Wed, Apr 19, 2023 at 4:02 PM Jesper Dangaard Brouer
> <jbrouer@redhat.com> wrote:
> 
>> With TCP sockets (pipes etc) we can take care of closing the sockets
>> (and programs etc) to free up the SKBs (and perhaps wait for timeouts)
>> to make sure the page_pool shutdown doesn't hang.
> 
> This can not happen in many cases, like pages being now mapped to user
> space programs,
> or nfsd or whatever.
> 
> I think that fundamentally, page pool should handle this case gracefully.
> 
> For instance, when a TCP socket is closed(), user space can die, but
> many resources in the kernel are freed later.
> 
> We do not block a close() just because a qdisc decided to hold a
> buffer for few minutes.
> 

But page_pool does handle this gracefully, by scheduling a workqueue.

--Jesper
Lorenzo Bianconi April 19, 2023, 4:40 p.m. UTC | #18
> 
> 
> On 19/04/2023 16.21, Lorenzo Bianconi wrote:
> > > 
> > > On 19/04/2023 14.09, Eric Dumazet wrote:
> > > > On Wed, Apr 19, 2023 at 1:08 PM Jesper Dangaard Brouer
> > > > > 
> > > > > 
> > > > > On 18/04/2023 09.36, Lorenzo Bianconi wrote:
> > > > > > > On Mon, 17 Apr 2023 23:31:01 +0200 Lorenzo Bianconi wrote:
> > > > > > > > > If it's that then I'm with Eric. There are many ways to keep the pages
> > > > > > > > > in use, no point working around one of them and not the rest :(
> > > > > > > > 
> > > > > > > > I was not clear here, my fault. What I mean is I can see the returned
> > > > > > > > pages counter increasing from time to time, but during most of tests,
> > > > > > > > even after 2h the tcp traffic has stopped, page_pool_release_retry()
> > > > > > > > still complains not all the pages are returned to the pool and so the
> > > > > > > > pool has not been deallocated yet.
> > > > > > > > The chunk of code in my first email is just to demonstrate the issue
> > > > > > > > and I am completely fine to get a better solution :)
> > > > > > > 
> > > > > > > Your problem is perhaps made worse by threaded NAPI, you have
> > > > > > > defer-free skbs sprayed across all cores and no NAPI there to
> > > > > > > flush them :(
> > > > > > 
> > > > > > yes, exactly :)
> > > > > > 
> > > > > > > 
> > > > > > > > I guess we just need a way to free the pool in a reasonable amount
> > > > > > > > of time. Agree?
> > > > > > > 
> > > > > > > Whether we need to guarantee the release is the real question.
> > > > > > 
> > > > > > yes, this is the main goal of my email. The defer-free skbs behaviour seems in
> > > > > > contrast with the page_pool pending pages monitor mechanism or at least they
> > > > > > do not work well together.
> > > > > > 
> > > > > > @Jesper, Ilias: any input on it?
> > > > > > 
> > > > > > > Maybe it's more of a false-positive warning.
> > > > > > > 
> > > > > > > Flushing the defer list is probably fine as a hack, but it's not
> > > > > > > a full fix as Eric explained. False positive can still happen.
> > > > > > 
> > > > > > agree, it was just a way to give an idea of the issue, not a proper solution.
> > > > > > 
> > > > > > Regards,
> > > > > > Lorenzo
> > > > > > 
> > > > > > > 
> > > > > > > I'm ambivalent. My only real request wold be to make the flushing
> > > > > > > a helper in net/core/dev.c rather than open coded in page_pool.c.
> > > > > 
> > > > > I agree. We need a central defer_list flushing helper
> > > > > 
> > > > > It is too easy to say this is a false-positive warning.
> > > > > IHMO this expose an issue with the sd->defer_list system.
> > > > > 
> > > > > Lorenzo's test is adding+removing veth devices, which creates and runs
> > > > > NAPI processing on random CPUs.  After veth netdevices (+NAPI) are
> > > > > removed, nothing will naturally invoking net_rx_softirq on this CPU.
> > > > > Thus, we have SKBs waiting on CPUs sd->defer_list.  Further more we will
> > > > > not create new SKB with this skb->alloc_cpu, to trigger RX softirq IPI
> > > > > call (trigger_rx_softirq), even if this CPU process and frees SKBs.
> > > > > 
> > > > > I see two solutions:
> > > > > 
> > > > >     (1) When netdevice/NAPI unregister happens call defer_list flushing
> > > > > helper.
> > > > > 
> > > > >     (2) Use napi_watchdog to detect if defer_list is (many jiffies) old,
> > > > > and then call defer_list flushing helper.
> > > > > 
> > > > > 
> > > > > > > 
> > > > > > > Somewhat related - Eric, do we need to handle defer_list in dev_cpu_dead()?
> > > > > 
> > > > > Looks to me like dev_cpu_dead() also need this flushing helper for
> > > > > sd->defer_list, or at least moving the sd->defer_list to an sd that will
> > > > > run eventually.
> > > > 
> > > > I think I just considered having a few skbs in per-cpu list would not
> > > > be an issue,
> > > > especially considering skbs can sit hours in tcp receive queues.
> > > > 
> > > 
> > > It was the first thing I said to Lorenzo when he first reported the
> > > problem to me (over chat): It is likely packets sitting in a TCP queue.
> > > Then I instructed him to look at output from netstat to see queues and
> > > look for TIME-WAIT, FIN-WAIT etc.
> > > 
> > > 
> > > > Do we expect hacing some kind of callback/shrinker to instruct TCP or
> > > > pipes to release all pages that prevent
> > > > a page_pool to be freed ?
> > > > 
> > > 
> > > This is *not* what I'm asking for.
> > > 
> > > With TCP sockets (pipes etc) we can take care of closing the sockets
> > > (and programs etc) to free up the SKBs (and perhaps wait for timeouts)
> > > to make sure the page_pool shutdown doesn't hang.
> > > 
> > > The problem arise for all the selftests that uses veth and bpf_test_run
> > > (using bpf_test_run_xdp_live / xdp_test_run_setup).  For the selftests
> > > we obviously take care of closing sockets and removing veth interfaces
> > > again.  Problem: The defer_list corner-case isn't under our control.
> > > 
> > > 
> > > > Here, we are talking of hundreds of thousands of skbs, compared to at
> > > > most 32 skbs per cpu.
> > > > 
> > > 
> > > It is not a memory usage concern.
> > > 
> > > > Perhaps sets sysctl_skb_defer_max to zero by default, so that admins
> > > > can opt-in
> > > > 
> > > 
> > > I really like the sd->defer_list system and I think is should be enabled
> > > by default.  Even if disabled by default, we still need to handle these
> > > corner cases, as the selftests shouldn't start to cause-issues when this
> > > gets enabled.
> > > 
> > > The simple solution is: (1) When netdevice/NAPI unregister happens call
> > > defer_list flushing helper.  And perhaps we also need to call it in
> > > xdp_test_run_teardown().  How do you feel about that?
> > > 
> > > --Jesper
> > > 
> > 
> > Today I was discussing with Toke about this issue, and we were wondering,
> > if we just consider the page_pool use-case, what about moving the real pool
> > destroying steps when we return a page to the pool in page_pool_put_full_page()
> > if the pool has marked to be destroyed and there are no inflight pages instead
> > of assuming we have all the pages in the pool when we run page_pool_destroy()?
> 
> It sounds like you want to add a runtime check to the fast-path to
> handle these corner cases?
> 
> For performance reason we should not call page_pool_inflight() check in
> fast-path, please!

ack, right.

> 
> Details: You hopefully mean running/calling page_pool_release(pool) and not
> page_pool_destroy().

yes, I mean page_pool_release()

> 
> I'm not totally against the idea, as long as someone is willing to do
> extensive benchmarking that it doesn't affect fast-path performance.
> Given we already read pool->p.flags in fast-path, it might be possible
> to hide the extra branch (in the CPU pipeline).
> 
> 
> > Maybe this means just get rid of the warn in page_pool_release_retry() :)
> > 
> 
> Sure, we can remove the print statement, but it feels like closing our
> eyes and ignoring the problem.  We can remove the print statement, and
> still debug the problem, as I have added tracepoints (to debug this).
> But users will not report these issue early... on the other hand most of
> these reports will likely be false-positives.
> 
> This reminds me that Jakub's recent defer patches returning pages
> 'directly' to the page_pool alloc-cache, will actually result in this
> kind of bug.  This is because page_pool_destroy() assumes that pages
> cannot be returned to alloc-cache, as driver will have "disconnected" RX
> side.  We need to address this bug separately.  Lorenzo you didn't
> happen to use a kernel with Jakub's patches included, do you?

nope, I did not test them.

Regards,
Lorenzo

> 
> --Jesper
> 
> 
>
Jakub Kicinski April 19, 2023, 5:12 p.m. UTC | #19
On Wed, 19 Apr 2023 17:36:20 +0200 Jesper Dangaard Brouer wrote:
> This reminds me that Jakub's recent defer patches returning pages
> 'directly' to the page_pool alloc-cache, will actually result in this
> kind of bug.  This is because page_pool_destroy() assumes that pages
> cannot be returned to alloc-cache, as driver will have "disconnected" RX
> side.  We need to address this bug separately. 

Mm, I think you're right. The NAPI can get restarted with a different
page pool and the old page pool will still think it's in use.

Patch

diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index 193c18799865..160f45c4e3a5 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -19,6 +19,7 @@ 
 #include <linux/mm.h> /* for put_page() */
 #include <linux/poison.h>
 #include <linux/ethtool.h>
+#include <linux/netdevice.h>
 
 #include <trace/events/page_pool.h>
 
@@ -810,12 +811,23 @@  static void page_pool_release_retry(struct work_struct *wq)
 {
 	struct delayed_work *dwq = to_delayed_work(wq);
 	struct page_pool *pool = container_of(dwq, typeof(*pool), release_dw);
-	int inflight;
+	int cpu, inflight;
 
 	inflight = page_pool_release(pool);
 	if (!inflight)
 		return;
 
+	/* Run NET_RX_SOFTIRQ in order to free pending skbs in softnet_data
+	 * defer_list that can stay in the list until we have enough queued
+	 * traffic.
+	 */
+	for_each_online_cpu(cpu) {
+		struct softnet_data *sd = &per_cpu(softnet_data, cpu);
+
+		if (!cmpxchg(&sd->defer_ipi_scheduled, 0, 1))
+			smp_call_function_single_async(cpu, &sd->defer_csd);
+	}
+
 	/* Periodic warning */
 	if (time_after_eq(jiffies, pool->defer_warn)) {
 		int sec = (s32)((u32)jiffies - (u32)pool->defer_start) / HZ;