Message ID | ZD2HjZZSOjtsnQaf@lore-desk (mailing list archive) |
---|---|
State | RFC |
Series | issue with inflight pages from page_pool |
Context | Check | Description |
---|---|---|
netdev/tree_selection | success | Guessing tree name failed - patch did not apply, async |
On Mon, Apr 17, 2023 at 7:53 PM Lorenzo Bianconi <lorenzo@kernel.org> wrote: > > Hi all, > > I am triggering an issue with a device running the page_pool allocator. > In particular, the device is running an iperf tcp server receiving traffic > from a remote client. On the driver I loaded a simple xdp program returning > xdp_pass. When I remove the ebpf program and destroy the pool, page_pool > allocator starts complaining in page_pool_release_retry() that not all the pages > have been returned to the allocator. In fact, the pool is not really destroyed > in this case. > Debugging the code it seems the pages are stuck in softnet_data defer_list and > they are never freed in skb_defer_free_flush() since I do not have any more tcp > traffic. To prove it, I tried to set sysctl_skb_defer_max to 0 and the issue > does not occur. > I developed the poc patch below and the issue seems to be fixed: I do not see why this would be different than having buffers sitting in some tcp receive (or out of order) queues for a few minutes ? Or buffers transferred to another socket or pipe (splice() and friends)
> On Mon, Apr 17, 2023 at 7:53 PM Lorenzo Bianconi <lorenzo@kernel.org> wrote: > > > > Hi all, > > > > I am triggering an issue with a device running the page_pool allocator. > > In particular, the device is running an iperf tcp server receiving traffic > > from a remote client. On the driver I loaded a simple xdp program returning > > xdp_pass. When I remove the ebpf program and destroy the pool, page_pool > > allocator starts complaining in page_pool_release_retry() that not all the pages > > have been returned to the allocator. In fact, the pool is not really destroyed > > in this case. > > Debugging the code it seems the pages are stuck softnet_data defer_list and > > they are never freed in skb_defer_free_flush() since I do not have any more tcp > > traffic. To prove it, I tried to set sysctl_skb_defer_max to 0 and the issue > > does not occur. > > I developed the poc patch below and the issue seems to be fixed: > > I do not see why this would be different than having buffers sitting > in some tcp receive > (or out or order) queues for a few minutes ? The main issue in my tests (and even in mt76 I think) is the pages are not returned to the pool for a very long time (even hours) and doing so the pool is like in a 'limbo' state where it is not actually deallocated and page_pool_release_retry continues complaining about it. I think this is because we do not have more tcp traffic to deallocate them, but I am not so familiar with tcp codebase :) Regards, Lorenzo > > Or buffers transferred to another socket or pipe (splice() and friends)
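For context, the deferred-free path being discussed works roughly as follows. This is a simplified sketch closely modelled on skb_defer_free_flush() in net/core/dev.c around v6.3 (field names may differ between kernel versions); it illustrates why skbs parked on a CPU's defer_list only go away when NAPI runs again on that CPU.

```c
/* Simplified sketch of the per-CPU deferred skb free, modelled on
 * skb_defer_free_flush() in net/core/dev.c (~v6.3). It is only called
 * from net_rx_action()/NAPI on the CPU that owns @sd, so if no further
 * RX activity happens on that CPU, the queued skbs - and the page_pool
 * pages they hold - are never released.
 */
static void skb_defer_free_flush(struct softnet_data *sd)
{
	struct sk_buff *skb, *next;
	unsigned long flags;

	/* Paired with WRITE_ONCE() in skb_attempt_defer_free() */
	if (!READ_ONCE(sd->defer_list))
		return;

	spin_lock_irqsave(&sd->defer_lock, flags);
	skb = sd->defer_list;
	sd->defer_list = NULL;
	sd->defer_count = 0;
	spin_unlock_irqrestore(&sd->defer_lock, flags);

	while (skb != NULL) {
		next = skb->next;
		napi_consume_skb(skb, 1);
		skb = next;
	}
}
```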
On Mon, 17 Apr 2023 20:17:45 +0200 Lorenzo Bianconi wrote: > > I do not see why this would be different than having buffers sitting > > in some tcp receive > > (or out or order) queues for a few minutes ? > > The main issue in my tests (and even in mt76 I think) is the pages are not returned > to the pool for a very long time (even hours) and doing so the pool is like in a > 'limbo' state where it is not actually deallocated and page_pool_release_retry > continues complaining about it. I think this is because we do not have more tcp > traffic to deallocate them, but I am not so familiar with tcp codebase :) I've seen the page leaks too in my recent testing but I just assumed I fumbled the quick bnxt conversion to page pool. Could it be something with page frags? It happened a lot if I used page frags, IIRC mt76 is using page frags, too. Is drgn available for your target? You could try to scan the pages on the system and see if you can find what's still pointing to the page pool (assuming they are indeed leaked and not returned to the page allocator without releasing :()
On 17.04.23 20:17, Lorenzo Bianconi wrote: >> On Mon, Apr 17, 2023 at 7:53 PM Lorenzo Bianconi <lorenzo@kernel.org> wrote: >>> >>> Hi all, >>> >>> I am triggering an issue with a device running the page_pool allocator. >>> In particular, the device is running an iperf tcp server receiving traffic >>> from a remote client. On the driver I loaded a simple xdp program returning >>> xdp_pass. When I remove the ebpf program and destroy the pool, page_pool >>> allocator starts complaining in page_pool_release_retry() that not all the pages >>> have been returned to the allocator. In fact, the pool is not really destroyed >>> in this case. >>> Debugging the code it seems the pages are stuck softnet_data defer_list and >>> they are never freed in skb_defer_free_flush() since I do not have any more tcp >>> traffic. To prove it, I tried to set sysctl_skb_defer_max to 0 and the issue >>> does not occur. >>> I developed the poc patch below and the issue seems to be fixed: >> >> I do not see why this would be different than having buffers sitting >> in some tcp receive >> (or out or order) queues for a few minutes ? > > The main issue in my tests (and even in mt76 I think) is the pages are not returned > to the pool for a very long time (even hours) and doing so the pool is like in a > 'limbo' state where it is not actually deallocated and page_pool_release_retry > continues complaining about it. I think this is because we do not have more tcp > traffic to deallocate them, but I am not so familiar with tcp codebase :) > > Regards, > Lorenzo > >> >> Or buffers transferred to another socket or pipe (splice() and friends) I'm not absolutely sure that it is the same problem, but I also saw some problems with page_pool destroying and page_pool_release_retry(). I did post it, but I did not get any reply: https://lore.kernel.org/bpf/20230311213709.42625-1-gerhard@engleder-embedded.com/T/ Could this be a similar issue? Gerhard
> On Mon, 17 Apr 2023 20:17:45 +0200 Lorenzo Bianconi wrote: > > > I do not see why this would be different than having buffers sitting > > > in some tcp receive > > > (or out or order) queues for a few minutes ? > > > > The main issue in my tests (and even in mt76 I think) is the pages are not returned > > to the pool for a very long time (even hours) and doing so the pool is like in a > > 'limbo' state where it is not actually deallocated and page_pool_release_retry > > continues complaining about it. I think this is because we do not have more tcp > > traffic to deallocate them, but I am not so familiar with tcp codebase :) > > I've seen the page leaks too in my recent testing but I just assumed > I fumbled the quick bnxt conversion to page pool. Could it be something > with page frags? It happened a lot if I used page frags, IIRC mt76 is > using page frags, too. my device is allocating a full order 0 page from the pool so it is not related to fragmented pages as fixed for mt76 by Alex and Felix. > > Is drgn available for your target? You could try to scan the pages on > the system and see if you can find what's still pointing to the page > pool (assuming they are indeed leaked and not returned to the page > allocator without releasing :() I will test it but since setting sysctl_skb_defer_max to 0 fixes the issue, I think the pages are still properly linked to the pool, they are just not returned to it. I proved it using the other patch I posted [0] where I can see the counter of returned pages incrementing from time to time (in a very long time slot..). Unrelated to this issue, but debugging it I think I found a page_pool leak in skb_condense() [1] where we can reallocate the skb data using kmalloc for a page_pool recycled skb. Regards, Lorenzo [0] https://lore.kernel.org/netdev/20230417111204.08f19827@kernel.org/T/#t [1] https://github.com/torvalds/linux/blob/master/net/core/skbuff.c#L6602
> On 17.04.23 20:17, Lorenzo Bianconi wrote: > > > On Mon, Apr 17, 2023 at 7:53 PM Lorenzo Bianconi <lorenzo@kernel.org> wrote: > > > > > > > > Hi all, > > > > > > > > I am triggering an issue with a device running the page_pool allocator. > > > > In particular, the device is running an iperf tcp server receiving traffic > > > > from a remote client. On the driver I loaded a simple xdp program returning > > > > xdp_pass. When I remove the ebpf program and destroy the pool, page_pool > > > > allocator starts complaining in page_pool_release_retry() that not all the pages > > > > have been returned to the allocator. In fact, the pool is not really destroyed > > > > in this case. > > > > Debugging the code it seems the pages are stuck softnet_data defer_list and > > > > they are never freed in skb_defer_free_flush() since I do not have any more tcp > > > > traffic. To prove it, I tried to set sysctl_skb_defer_max to 0 and the issue > > > > does not occur. > > > > I developed the poc patch below and the issue seems to be fixed: > > > > > > I do not see why this would be different than having buffers sitting > > > in some tcp receive > > > (or out or order) queues for a few minutes ? > > > > The main issue in my tests (and even in mt76 I think) is the pages are not returned > > to the pool for a very long time (even hours) and doing so the pool is like in a > > 'limbo' state where it is not actually deallocated and page_pool_release_retry > > continues complaining about it. I think this is because we do not have more tcp > > traffic to deallocate them, but I am not so familiar with tcp codebase :) > > > > Regards, > > Lorenzo > > > > > > > > Or buffers transferred to another socket or pipe (splice() and friends) > > I'm not absolutely sure that it is the same problem, but I also saw some > problems with page_pool destroying and page_pool_release_retry(). I did > post it, but I did not get any reply: > > https://lore.kernel.org/bpf/20230311213709.42625-1-gerhard@engleder-embedded.com/T/ > > Could this be a similar issue? I am not sure too. In order to prove it is the same issue, I would say you can try to run the test applying my poc patch or setting sysctl_skb_defer_max to 0. Regards, Lorenzo > > Gerhard
On Mon, 17 Apr 2023 20:42:39 +0200 Lorenzo Bianconi wrote: > > Is drgn available for your target? You could try to scan the pages on > > the system and see if you can find what's still pointing to the page > > pool (assuming they are indeed leaked and not returned to the page > > allocator without releasing :() > > I will test it but since setting sysctl_skb_defer_max to 0 fixes the issue, > I think the pages are still properly linked to the pool, they are just not > returned to it. I proved it using the other patch I posted [0] where I can see > the counter of returned pages incrementing from time to time (in a very long > time slot..). If it's that then I'm with Eric. There are many ways to keep the pages in use, no point working around one of them and not the rest :( > Unrelated to this issue, but debugging it I think a found a page_pool leak in > skb_condense() [1] where we can reallocate the skb data using kmalloc for a > page_pool recycled skb. I don't see a problem having pp_recycle = 1 and head in slab is legal. pp_recycle just means that *if* a page is from the page pool we own the recycling reference. A page from slab will not be treated as a PP page cause it doesn't have pp_magic set to the correct pattern.
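For reference, the check Jakub describes looks roughly like this. The sketch is modelled on page_pool_return_skb_page() and skb_pp_recycle() from net/core/page_pool.c and net/core/skbuff.c around v6.3; exact signatures may differ between versions.

```c
/* Sketch of the recycling decision (~v6.3). skb->pp_recycle only means
 * "try page_pool recycling"; a kmalloc'ed (slab) head fails the pp_magic
 * test below and is freed through the normal path instead.
 */
bool page_pool_return_skb_page(struct page *page)
{
	struct page_pool *pp;

	page = compound_head(page);

	/* Slab pages never carry PP_SIGNATURE, so they are not treated
	 * as page_pool pages even when skb->pp_recycle is set.
	 */
	if (unlikely(page->pp_magic != PP_SIGNATURE))
		return false;

	pp = page->pp;
	page_pool_put_full_page(pp, page, false);

	return true;
}

static bool skb_pp_recycle(struct sk_buff *skb, void *data)
{
	if (!IS_ENABLED(CONFIG_PAGE_POOL) || !skb->pp_recycle)
		return false;
	return page_pool_return_skb_page(virt_to_page(data));
}
```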
> On Mon, 17 Apr 2023 20:42:39 +0200 Lorenzo Bianconi wrote: > > > Is drgn available for your target? You could try to scan the pages on > > > the system and see if you can find what's still pointing to the page > > > pool (assuming they are indeed leaked and not returned to the page > > > allocator without releasing :() > > > > I will test it but since setting sysctl_skb_defer_max to 0 fixes the issue, > > I think the pages are still properly linked to the pool, they are just not > > returned to it. I proved it using the other patch I posted [0] where I can see > > the counter of returned pages incrementing from time to time (in a very long > > time slot..). > > If it's that then I'm with Eric. There are many ways to keep the pages > in use, no point working around one of them and not the rest :( I was not clear here, my fault. What I mean is I can see the returned pages counter increasing from time to time, but during most of the tests, even 2h after the tcp traffic has stopped, page_pool_release_retry() still complains not all the pages are returned to the pool and so the pool has not been deallocated yet. The chunk of code in my first email is just to demonstrate the issue and I am completely fine to get a better solution :) I guess we just need a way to free the pool in a reasonable amount of time. Agree? > > > Unrelated to this issue, but debugging it I think a found a page_pool leak in > > skb_condense() [1] where we can reallocate the skb data using kmalloc for a > > page_pool recycled skb. > > I don't see a problem having pp_recycle = 1 and head in slab is legal. > pp_recycle just means that *if* a page is from the page pool we own > the recycling reference. A page from slab will not be treated as a PP > page cause it doesn't have pp_magic set to the correct pattern. ack, right. Thx for pointing this out. Regards, Lorenzo
On Mon, 17 Apr 2023 23:31:01 +0200 Lorenzo Bianconi wrote: > > If it's that then I'm with Eric. There are many ways to keep the pages > > in use, no point working around one of them and not the rest :( > > I was not clear here, my fault. What I mean is I can see the returned > pages counter increasing from time to time, but during most of tests, > even after 2h the tcp traffic has stopped, page_pool_release_retry() > still complains not all the pages are returned to the pool and so the > pool has not been deallocated yet. > The chunk of code in my first email is just to demonstrate the issue > and I am completely fine to get a better solution :) Your problem is perhaps made worse by threaded NAPI, you have defer-free skbs sprayed across all cores and no NAPI there to flush them :( > I guess we just need a way to free the pool in a reasonable amount > of time. Agree? Whether we need to guarantee the release is the real question. Maybe it's more of a false-positive warning. Flushing the defer list is probably fine as a hack, but it's not a full fix as Eric explained. False positives can still happen. I'm ambivalent. My only real request would be to make the flushing a helper in net/core/dev.c rather than open coded in page_pool.c. Somewhat related - Eric, do we need to handle defer_list in dev_cpu_dead()?
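A minimal sketch of what such a helper in net/core/dev.c could look like, assuming it simply factors out the IPI-kick loop from the PoC patch at the end of this thread. The name kick_defer_list_purge() and the export are illustrative only, not an existing API at the time of this discussion.

```c
/* Hypothetical helper for net/core/dev.c: ask every CPU that has skbs
 * parked on its softnet_data defer_list to flush them. defer_csd runs
 * trigger_rx_softirq() on the target CPU, which raises NET_RX_SOFTIRQ
 * and ends up calling skb_defer_free_flush().
 */
void kick_defer_list_purge(void)
{
	int cpu;

	for_each_online_cpu(cpu) {
		struct softnet_data *sd = &per_cpu(softnet_data, cpu);

		if (!READ_ONCE(sd->defer_list))
			continue;

		if (!cmpxchg(&sd->defer_ipi_scheduled, 0, 1))
			smp_call_function_single_async(cpu, &sd->defer_csd);
	}
}
EXPORT_SYMBOL_GPL(kick_defer_list_purge);
```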
> On Mon, 17 Apr 2023 23:31:01 +0200 Lorenzo Bianconi wrote: > > > If it's that then I'm with Eric. There are many ways to keep the pages > > > in use, no point working around one of them and not the rest :( > > > > I was not clear here, my fault. What I mean is I can see the returned > > pages counter increasing from time to time, but during most of tests, > > even after 2h the tcp traffic has stopped, page_pool_release_retry() > > still complains not all the pages are returned to the pool and so the > > pool has not been deallocated yet. > > The chunk of code in my first email is just to demonstrate the issue > > and I am completely fine to get a better solution :) > > Your problem is perhaps made worse by threaded NAPI, you have > defer-free skbs sprayed across all cores and no NAPI there to > flush them :( yes, exactly :) > > > I guess we just need a way to free the pool in a reasonable amount > > of time. Agree? > > Whether we need to guarantee the release is the real question. yes, this is the main goal of my email. The defer-free skbs behaviour seems in contrast with the page_pool pending pages monitor mechanism or at least they do not work well together. @Jesper, Ilias: any input on it? > Maybe it's more of a false-positive warning. > > Flushing the defer list is probably fine as a hack, but it's not > a full fix as Eric explained. False positive can still happen. agree, it was just a way to give an idea of the issue, not a proper solution. Regards, Lorenzo > > I'm ambivalent. My only real request wold be to make the flushing > a helper in net/core/dev.c rather than open coded in page_pool.c. > > Somewhat related - Eric, do we need to handle defer_list in dev_cpu_dead()?
On 18/04/2023 09.36, Lorenzo Bianconi wrote: >> On Mon, 17 Apr 2023 23:31:01 +0200 Lorenzo Bianconi wrote: >>>> If it's that then I'm with Eric. There are many ways to keep the pages >>>> in use, no point working around one of them and not the rest :( >>> >>> I was not clear here, my fault. What I mean is I can see the returned >>> pages counter increasing from time to time, but during most of tests, >>> even after 2h the tcp traffic has stopped, page_pool_release_retry() >>> still complains not all the pages are returned to the pool and so the >>> pool has not been deallocated yet. >>> The chunk of code in my first email is just to demonstrate the issue >>> and I am completely fine to get a better solution :) >> >> Your problem is perhaps made worse by threaded NAPI, you have >> defer-free skbs sprayed across all cores and no NAPI there to >> flush them :( > > yes, exactly :) > >> >>> I guess we just need a way to free the pool in a reasonable amount >>> of time. Agree? >> >> Whether we need to guarantee the release is the real question. > > yes, this is the main goal of my email. The defer-free skbs behaviour seems in > contrast with the page_pool pending pages monitor mechanism or at least they > do not work well together. > > @Jesper, Ilias: any input on it? > >> Maybe it's more of a false-positive warning. >> >> Flushing the defer list is probably fine as a hack, but it's not >> a full fix as Eric explained. False positive can still happen. > > agree, it was just a way to give an idea of the issue, not a proper solution. > > Regards, > Lorenzo > >> >> I'm ambivalent. My only real request wold be to make the flushing >> a helper in net/core/dev.c rather than open coded in page_pool.c. I agree. We need a central defer_list flushing helper It is too easy to say this is a false-positive warning. IHMO this expose an issue with the sd->defer_list system. Lorenzo's test is adding+removing veth devices, which creates and runs NAPI processing on random CPUs. After veth netdevices (+NAPI) are removed, nothing will naturally invoking net_rx_softirq on this CPU. Thus, we have SKBs waiting on CPUs sd->defer_list. Further more we will not create new SKB with this skb->alloc_cpu, to trigger RX softirq IPI call (trigger_rx_softirq), even if this CPU process and frees SKBs. I see two solutions: (1) When netdevice/NAPI unregister happens call defer_list flushing helper. (2) Use napi_watchdog to detect if defer_list is (many jiffies) old, and then call defer_list flushing helper. >> >> Somewhat related - Eric, do we need to handle defer_list in dev_cpu_dead()? Looks to me like dev_cpu_dead() also need this flushing helper for sd->defer_list, or at least moving the sd->defer_list to an sd that will run eventually. --Jesper
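To make option (2) concrete, below is a sketch of a staleness check that a watchdog could run per CPU. Note that sd->defer_list_jiffies is a made-up field that would have to be added to struct softnet_data; it does not exist upstream, and the threshold is arbitrary.

```c
/* Hypothetical sketch of option (2): flush a CPU's defer_list if it has
 * not been serviced for a while. sd->defer_list_jiffies (timestamp of
 * the oldest deferred skb) is an invented field for illustration.
 */
#define DEFER_LIST_MAX_AGE	(2 * HZ)

static void defer_list_check_stale(unsigned int cpu)
{
	struct softnet_data *sd = &per_cpu(softnet_data, cpu);

	if (!READ_ONCE(sd->defer_list))
		return;

	if (!time_after(jiffies,
			READ_ONCE(sd->defer_list_jiffies) + DEFER_LIST_MAX_AGE))
		return;

	/* Same kick as in the flushing helper sketched above */
	if (!cmpxchg(&sd->defer_ipi_scheduled, 0, 1))
		smp_call_function_single_async(cpu, &sd->defer_csd);
}
```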
On Wed, Apr 19, 2023 at 1:08 PM Jesper Dangaard Brouer <jbrouer@redhat.com> wrote: > > > On 18/04/2023 09.36, Lorenzo Bianconi wrote: > >> On Mon, 17 Apr 2023 23:31:01 +0200 Lorenzo Bianconi wrote: > >>>> If it's that then I'm with Eric. There are many ways to keep the pages > >>>> in use, no point working around one of them and not the rest :( > >>> > >>> I was not clear here, my fault. What I mean is I can see the returned > >>> pages counter increasing from time to time, but during most of tests, > >>> even after 2h the tcp traffic has stopped, page_pool_release_retry() > >>> still complains not all the pages are returned to the pool and so the > >>> pool has not been deallocated yet. > >>> The chunk of code in my first email is just to demonstrate the issue > >>> and I am completely fine to get a better solution :) > >> > >> Your problem is perhaps made worse by threaded NAPI, you have > >> defer-free skbs sprayed across all cores and no NAPI there to > >> flush them :( > > > > yes, exactly :) > > > >> > >>> I guess we just need a way to free the pool in a reasonable amount > >>> of time. Agree? > >> > >> Whether we need to guarantee the release is the real question. > > > > yes, this is the main goal of my email. The defer-free skbs behaviour seems in > > contrast with the page_pool pending pages monitor mechanism or at least they > > do not work well together. > > > > @Jesper, Ilias: any input on it? > > > >> Maybe it's more of a false-positive warning. > >> > >> Flushing the defer list is probably fine as a hack, but it's not > >> a full fix as Eric explained. False positive can still happen. > > > > agree, it was just a way to give an idea of the issue, not a proper solution. > > > > Regards, > > Lorenzo > > > >> > >> I'm ambivalent. My only real request wold be to make the flushing > >> a helper in net/core/dev.c rather than open coded in page_pool.c. > > I agree. We need a central defer_list flushing helper > > It is too easy to say this is a false-positive warning. > IHMO this expose an issue with the sd->defer_list system. > > Lorenzo's test is adding+removing veth devices, which creates and runs > NAPI processing on random CPUs. After veth netdevices (+NAPI) are > removed, nothing will naturally invoking net_rx_softirq on this CPU. > Thus, we have SKBs waiting on CPUs sd->defer_list. Further more we will > not create new SKB with this skb->alloc_cpu, to trigger RX softirq IPI > call (trigger_rx_softirq), even if this CPU process and frees SKBs. > > I see two solutions: > > (1) When netdevice/NAPI unregister happens call defer_list flushing > helper. > > (2) Use napi_watchdog to detect if defer_list is (many jiffies) old, > and then call defer_list flushing helper. > > > >> > >> Somewhat related - Eric, do we need to handle defer_list in dev_cpu_dead()? > > Looks to me like dev_cpu_dead() also need this flushing helper for > sd->defer_list, or at least moving the sd->defer_list to an sd that will > run eventually. I think I just considered having a few skbs in per-cpu list would not be an issue, especially considering skbs can sit hours in tcp receive queues. Do we expect hacing some kind of callback/shrinker to instruct TCP or pipes to release all pages that prevent a page_pool to be freed ? Here, we are talking of hundreds of thousands of skbs, compared to at most 32 skbs per cpu. Perhaps sets sysctl_skb_defer_max to zero by default, so that admins can opt-in
On 19/04/2023 14.09, Eric Dumazet wrote: > On Wed, Apr 19, 2023 at 1:08 PM Jesper Dangaard Brouer >> >> >> On 18/04/2023 09.36, Lorenzo Bianconi wrote: >>>> On Mon, 17 Apr 2023 23:31:01 +0200 Lorenzo Bianconi wrote: >>>>>> If it's that then I'm with Eric. There are many ways to keep the pages >>>>>> in use, no point working around one of them and not the rest :( >>>>> >>>>> I was not clear here, my fault. What I mean is I can see the returned >>>>> pages counter increasing from time to time, but during most of tests, >>>>> even after 2h the tcp traffic has stopped, page_pool_release_retry() >>>>> still complains not all the pages are returned to the pool and so the >>>>> pool has not been deallocated yet. >>>>> The chunk of code in my first email is just to demonstrate the issue >>>>> and I am completely fine to get a better solution :) >>>> >>>> Your problem is perhaps made worse by threaded NAPI, you have >>>> defer-free skbs sprayed across all cores and no NAPI there to >>>> flush them :( >>> >>> yes, exactly :) >>> >>>> >>>>> I guess we just need a way to free the pool in a reasonable amount >>>>> of time. Agree? >>>> >>>> Whether we need to guarantee the release is the real question. >>> >>> yes, this is the main goal of my email. The defer-free skbs behaviour seems in >>> contrast with the page_pool pending pages monitor mechanism or at least they >>> do not work well together. >>> >>> @Jesper, Ilias: any input on it? >>> >>>> Maybe it's more of a false-positive warning. >>>> >>>> Flushing the defer list is probably fine as a hack, but it's not >>>> a full fix as Eric explained. False positive can still happen. >>> >>> agree, it was just a way to give an idea of the issue, not a proper solution. >>> >>> Regards, >>> Lorenzo >>> >>>> >>>> I'm ambivalent. My only real request wold be to make the flushing >>>> a helper in net/core/dev.c rather than open coded in page_pool.c. >> >> I agree. We need a central defer_list flushing helper >> >> It is too easy to say this is a false-positive warning. >> IHMO this expose an issue with the sd->defer_list system. >> >> Lorenzo's test is adding+removing veth devices, which creates and runs >> NAPI processing on random CPUs. After veth netdevices (+NAPI) are >> removed, nothing will naturally invoking net_rx_softirq on this CPU. >> Thus, we have SKBs waiting on CPUs sd->defer_list. Further more we will >> not create new SKB with this skb->alloc_cpu, to trigger RX softirq IPI >> call (trigger_rx_softirq), even if this CPU process and frees SKBs. >> >> I see two solutions: >> >> (1) When netdevice/NAPI unregister happens call defer_list flushing >> helper. >> >> (2) Use napi_watchdog to detect if defer_list is (many jiffies) old, >> and then call defer_list flushing helper. >> >> >>>> >>>> Somewhat related - Eric, do we need to handle defer_list in dev_cpu_dead()? >> >> Looks to me like dev_cpu_dead() also need this flushing helper for >> sd->defer_list, or at least moving the sd->defer_list to an sd that will >> run eventually. > > I think I just considered having a few skbs in per-cpu list would not > be an issue, > especially considering skbs can sit hours in tcp receive queues. > It was the first thing I said to Lorenzo when he first reported the problem to me (over chat): It is likely packets sitting in a TCP queue. Then I instructed him to look at output from netstat to see queues and look for TIME-WAIT, FIN-WAIT etc. 
> Do we expect hacing some kind of callback/shrinker to instruct TCP or > pipes to release all pages that prevent > a page_pool to be freed ? > This is *not* what I'm asking for. With TCP sockets (pipes etc) we can take care of closing the sockets (and programs etc) to free up the SKBs (and perhaps wait for timeouts) to make sure the page_pool shutdown doesn't hang. The problem arises for all the selftests that use veth and bpf_test_run (using bpf_test_run_xdp_live / xdp_test_run_setup). For the selftests we obviously take care of closing sockets and removing veth interfaces again. Problem: The defer_list corner-case isn't under our control. > Here, we are talking of hundreds of thousands of skbs, compared to at > most 32 skbs per cpu. > It is not a memory usage concern. > Perhaps sets sysctl_skb_defer_max to zero by default, so that admins > can opt-in > I really like the sd->defer_list system and I think it should be enabled by default. Even if disabled by default, we still need to handle these corner cases, as the selftests shouldn't start to cause issues when this gets enabled. The simple solution is: (1) When netdevice/NAPI unregister happens call the defer_list flushing helper. And perhaps we also need to call it in xdp_test_run_teardown(). How do you feel about that? --Jesper
On Wed, Apr 19, 2023 at 4:02 PM Jesper Dangaard Brouer <jbrouer@redhat.com> wrote: > With TCP sockets (pipes etc) we can take care of closing the sockets > (and programs etc) to free up the SKBs (and perhaps wait for timeouts) > to make sure the page_pool shutdown doesn't hang. This cannot happen in many cases, like pages being now mapped to user space programs, or nfsd or whatever. I think that fundamentally, page pool should handle this case gracefully. For instance, when a TCP socket is closed(), user space can die, but many resources in the kernel are freed later. We do not block a close() just because a qdisc decided to hold a buffer for a few minutes.
> > On 19/04/2023 14.09, Eric Dumazet wrote: > > On Wed, Apr 19, 2023 at 1:08 PM Jesper Dangaard Brouer > > > > > > > > > On 18/04/2023 09.36, Lorenzo Bianconi wrote: > > > > > On Mon, 17 Apr 2023 23:31:01 +0200 Lorenzo Bianconi wrote: > > > > > > > If it's that then I'm with Eric. There are many ways to keep the pages > > > > > > > in use, no point working around one of them and not the rest :( > > > > > > > > > > > > I was not clear here, my fault. What I mean is I can see the returned > > > > > > pages counter increasing from time to time, but during most of tests, > > > > > > even after 2h the tcp traffic has stopped, page_pool_release_retry() > > > > > > still complains not all the pages are returned to the pool and so the > > > > > > pool has not been deallocated yet. > > > > > > The chunk of code in my first email is just to demonstrate the issue > > > > > > and I am completely fine to get a better solution :) > > > > > > > > > > Your problem is perhaps made worse by threaded NAPI, you have > > > > > defer-free skbs sprayed across all cores and no NAPI there to > > > > > flush them :( > > > > > > > > yes, exactly :) > > > > > > > > > > > > > > > I guess we just need a way to free the pool in a reasonable amount > > > > > > of time. Agree? > > > > > > > > > > Whether we need to guarantee the release is the real question. > > > > > > > > yes, this is the main goal of my email. The defer-free skbs behaviour seems in > > > > contrast with the page_pool pending pages monitor mechanism or at least they > > > > do not work well together. > > > > > > > > @Jesper, Ilias: any input on it? > > > > > > > > > Maybe it's more of a false-positive warning. > > > > > > > > > > Flushing the defer list is probably fine as a hack, but it's not > > > > > a full fix as Eric explained. False positive can still happen. > > > > > > > > agree, it was just a way to give an idea of the issue, not a proper solution. > > > > > > > > Regards, > > > > Lorenzo > > > > > > > > > > > > > > I'm ambivalent. My only real request wold be to make the flushing > > > > > a helper in net/core/dev.c rather than open coded in page_pool.c. > > > > > > I agree. We need a central defer_list flushing helper > > > > > > It is too easy to say this is a false-positive warning. > > > IHMO this expose an issue with the sd->defer_list system. > > > > > > Lorenzo's test is adding+removing veth devices, which creates and runs > > > NAPI processing on random CPUs. After veth netdevices (+NAPI) are > > > removed, nothing will naturally invoking net_rx_softirq on this CPU. > > > Thus, we have SKBs waiting on CPUs sd->defer_list. Further more we will > > > not create new SKB with this skb->alloc_cpu, to trigger RX softirq IPI > > > call (trigger_rx_softirq), even if this CPU process and frees SKBs. > > > > > > I see two solutions: > > > > > > (1) When netdevice/NAPI unregister happens call defer_list flushing > > > helper. > > > > > > (2) Use napi_watchdog to detect if defer_list is (many jiffies) old, > > > and then call defer_list flushing helper. > > > > > > > > > > > > > > > > Somewhat related - Eric, do we need to handle defer_list in dev_cpu_dead()? > > > > > > Looks to me like dev_cpu_dead() also need this flushing helper for > > > sd->defer_list, or at least moving the sd->defer_list to an sd that will > > > run eventually. > > > > I think I just considered having a few skbs in per-cpu list would not > > be an issue, > > especially considering skbs can sit hours in tcp receive queues. 
> > > > It was the first thing I said to Lorenzo when he first reported the > problem to me (over chat): It is likely packets sitting in a TCP queue. > Then I instructed him to look at output from netstat to see queues and > look for TIME-WAIT, FIN-WAIT etc. > > > > Do we expect hacing some kind of callback/shrinker to instruct TCP or > > pipes to release all pages that prevent > > a page_pool to be freed ? > > > > This is *not* what I'm asking for. > > With TCP sockets (pipes etc) we can take care of closing the sockets > (and programs etc) to free up the SKBs (and perhaps wait for timeouts) > to make sure the page_pool shutdown doesn't hang. > > The problem arise for all the selftests that uses veth and bpf_test_run > (using bpf_test_run_xdp_live / xdp_test_run_setup). For the selftests > we obviously take care of closing sockets and removing veth interfaces > again. Problem: The defer_list corner-case isn't under our control. > > > > Here, we are talking of hundreds of thousands of skbs, compared to at > > most 32 skbs per cpu. > > > > It is not a memory usage concern. > > > Perhaps sets sysctl_skb_defer_max to zero by default, so that admins > > can opt-in > > > > I really like the sd->defer_list system and I think is should be enabled > by default. Even if disabled by default, we still need to handle these > corner cases, as the selftests shouldn't start to cause-issues when this > gets enabled. > > The simple solution is: (1) When netdevice/NAPI unregister happens call > defer_list flushing helper. And perhaps we also need to call it in > xdp_test_run_teardown(). How do you feel about that? > > --Jesper > Today I was discussing with Toke about this issue, and we were wondering, if we just consider the page_pool use-case, what about moving the real pool destroying steps when we return a page to the pool in page_pool_put_full_page() if the pool has marked to be destroyed and there are no inflight pages instead of assuming we have all the pages in the pool when we run page_pool_destroy()? Maybe this means just get rid of the warn in page_pool_release_retry() :) Regards, Lorenzo
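A very rough sketch of that idea, to make the trade-off explicit: pool->destroy_pending, the function name and the call site are invented for illustration, and page_pool_inflight()/page_pool_release() are today static helpers inside net/core/page_pool.c where such a change would have to live.

```c
/* Rough sketch only: after page_pool_destroy() has marked the pool,
 * every page returned through the put path checks whether it was the
 * last in-flight page and, if so, performs the final release instead
 * of waiting for the page_pool_release_retry() work. destroy_pending
 * is an invented flag; the extra inflight check on every put is
 * exactly the fast-path cost discussed in the follow-up below.
 */
static void page_pool_put_page_destroyed(struct page_pool *pool,
					 struct page *page)
{
	/* Normal return of the page first */
	page_pool_put_full_page(pool, page, false);

	if (unlikely(READ_ONCE(pool->destroy_pending)) &&
	    page_pool_inflight(pool) == 0)
		page_pool_release(pool);	/* free the pool's resources now */
}
```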
On 19/04/2023 16.21, Lorenzo Bianconi wrote: >> >> On 19/04/2023 14.09, Eric Dumazet wrote: >>> On Wed, Apr 19, 2023 at 1:08 PM Jesper Dangaard Brouer >>>> >>>> >>>> On 18/04/2023 09.36, Lorenzo Bianconi wrote: >>>>>> On Mon, 17 Apr 2023 23:31:01 +0200 Lorenzo Bianconi wrote: >>>>>>>> If it's that then I'm with Eric. There are many ways to keep the pages >>>>>>>> in use, no point working around one of them and not the rest :( >>>>>>> >>>>>>> I was not clear here, my fault. What I mean is I can see the returned >>>>>>> pages counter increasing from time to time, but during most of tests, >>>>>>> even after 2h the tcp traffic has stopped, page_pool_release_retry() >>>>>>> still complains not all the pages are returned to the pool and so the >>>>>>> pool has not been deallocated yet. >>>>>>> The chunk of code in my first email is just to demonstrate the issue >>>>>>> and I am completely fine to get a better solution :) >>>>>> >>>>>> Your problem is perhaps made worse by threaded NAPI, you have >>>>>> defer-free skbs sprayed across all cores and no NAPI there to >>>>>> flush them :( >>>>> >>>>> yes, exactly :) >>>>> >>>>>> >>>>>>> I guess we just need a way to free the pool in a reasonable amount >>>>>>> of time. Agree? >>>>>> >>>>>> Whether we need to guarantee the release is the real question. >>>>> >>>>> yes, this is the main goal of my email. The defer-free skbs behaviour seems in >>>>> contrast with the page_pool pending pages monitor mechanism or at least they >>>>> do not work well together. >>>>> >>>>> @Jesper, Ilias: any input on it? >>>>> >>>>>> Maybe it's more of a false-positive warning. >>>>>> >>>>>> Flushing the defer list is probably fine as a hack, but it's not >>>>>> a full fix as Eric explained. False positive can still happen. >>>>> >>>>> agree, it was just a way to give an idea of the issue, not a proper solution. >>>>> >>>>> Regards, >>>>> Lorenzo >>>>> >>>>>> >>>>>> I'm ambivalent. My only real request wold be to make the flushing >>>>>> a helper in net/core/dev.c rather than open coded in page_pool.c. >>>> >>>> I agree. We need a central defer_list flushing helper >>>> >>>> It is too easy to say this is a false-positive warning. >>>> IHMO this expose an issue with the sd->defer_list system. >>>> >>>> Lorenzo's test is adding+removing veth devices, which creates and runs >>>> NAPI processing on random CPUs. After veth netdevices (+NAPI) are >>>> removed, nothing will naturally invoking net_rx_softirq on this CPU. >>>> Thus, we have SKBs waiting on CPUs sd->defer_list. Further more we will >>>> not create new SKB with this skb->alloc_cpu, to trigger RX softirq IPI >>>> call (trigger_rx_softirq), even if this CPU process and frees SKBs. >>>> >>>> I see two solutions: >>>> >>>> (1) When netdevice/NAPI unregister happens call defer_list flushing >>>> helper. >>>> >>>> (2) Use napi_watchdog to detect if defer_list is (many jiffies) old, >>>> and then call defer_list flushing helper. >>>> >>>> >>>>>> >>>>>> Somewhat related - Eric, do we need to handle defer_list in dev_cpu_dead()? >>>> >>>> Looks to me like dev_cpu_dead() also need this flushing helper for >>>> sd->defer_list, or at least moving the sd->defer_list to an sd that will >>>> run eventually. >>> >>> I think I just considered having a few skbs in per-cpu list would not >>> be an issue, >>> especially considering skbs can sit hours in tcp receive queues. >>> >> >> It was the first thing I said to Lorenzo when he first reported the >> problem to me (over chat): It is likely packets sitting in a TCP queue. 
>> Then I instructed him to look at output from netstat to see queues and >> look for TIME-WAIT, FIN-WAIT etc. >> >> >>> Do we expect hacing some kind of callback/shrinker to instruct TCP or >>> pipes to release all pages that prevent >>> a page_pool to be freed ? >>> >> >> This is *not* what I'm asking for. >> >> With TCP sockets (pipes etc) we can take care of closing the sockets >> (and programs etc) to free up the SKBs (and perhaps wait for timeouts) >> to make sure the page_pool shutdown doesn't hang. >> >> The problem arise for all the selftests that uses veth and bpf_test_run >> (using bpf_test_run_xdp_live / xdp_test_run_setup). For the selftests >> we obviously take care of closing sockets and removing veth interfaces >> again. Problem: The defer_list corner-case isn't under our control. >> >> >>> Here, we are talking of hundreds of thousands of skbs, compared to at >>> most 32 skbs per cpu. >>> >> >> It is not a memory usage concern. >> >>> Perhaps sets sysctl_skb_defer_max to zero by default, so that admins >>> can opt-in >>> >> >> I really like the sd->defer_list system and I think is should be enabled >> by default. Even if disabled by default, we still need to handle these >> corner cases, as the selftests shouldn't start to cause-issues when this >> gets enabled. >> >> The simple solution is: (1) When netdevice/NAPI unregister happens call >> defer_list flushing helper. And perhaps we also need to call it in >> xdp_test_run_teardown(). How do you feel about that? >> >> --Jesper >> > > Today I was discussing with Toke about this issue, and we were wondering, > if we just consider the page_pool use-case, what about moving the real pool > destroying steps when we return a page to the pool in page_pool_put_full_page() > if the pool has marked to be destroyed and there are no inflight pages instead > of assuming we have all the pages in the pool when we run page_pool_destroy()? It sounds like you want to add a runtime check to the fast-path to handle these corner cases? For performance reason we should not call page_pool_inflight() check in fast-path, please! Details: You hopefully mean running/calling page_pool_release(pool) and not page_pool_destroy(). I'm not totally against the idea, as long as someone is willing to do extensive benchmarking that it doesn't affect fast-path performance. Given we already read pool->p.flags in fast-path, it might be possible to hide the extra branch (in the CPU pipeline). > Maybe this means just get rid of the warn in page_pool_release_retry() :) > Sure, we can remove the print statement, but it feels like closing our eyes and ignoring the problem. We can remove the print statement, and still debug the problem, as I have added tracepoints (to debug this). But users will not report these issue early... on the other hand most of these reports will likely be false-positives. This reminds me that Jakub's recent defer patches returning pages 'directly' to the page_pool alloc-cache, will actually result in this kind of bug. This is because page_pool_destroy() assumes that pages cannot be returned to alloc-cache, as driver will have "disconnected" RX side. We need to address this bug separately. Lorenzo you didn't happen to use a kernel with Jakub's patches included, do you? --Jesper
On 19/04/2023 16.18, Eric Dumazet wrote: > On Wed, Apr 19, 2023 at 4:02 PM Jesper Dangaard Brouer > <jbrouer@redhat.com> wrote: > >> With TCP sockets (pipes etc) we can take care of closing the sockets >> (and programs etc) to free up the SKBs (and perhaps wait for timeouts) >> to make sure the page_pool shutdown doesn't hang. > > This can not happen in many cases, like pages being now mapped to user > space programs, > or nfsd or whatever. > > I think that fundamentally, page pool should handle this case gracefully. > > For instance, when a TCP socket is closed(), user space can die, but > many resources in the kernel are freed later. > > We do not block a close() just because a qdisc decided to hold a > buffer for few minutes. > But page pool does handle this gracefully via scheduling a workqueue. --Jesper
> > > On 19/04/2023 16.21, Lorenzo Bianconi wrote: > > > > > > On 19/04/2023 14.09, Eric Dumazet wrote: > > > > On Wed, Apr 19, 2023 at 1:08 PM Jesper Dangaard Brouer > > > > > > > > > > > > > > > On 18/04/2023 09.36, Lorenzo Bianconi wrote: > > > > > > > On Mon, 17 Apr 2023 23:31:01 +0200 Lorenzo Bianconi wrote: > > > > > > > > > If it's that then I'm with Eric. There are many ways to keep the pages > > > > > > > > > in use, no point working around one of them and not the rest :( > > > > > > > > > > > > > > > > I was not clear here, my fault. What I mean is I can see the returned > > > > > > > > pages counter increasing from time to time, but during most of tests, > > > > > > > > even after 2h the tcp traffic has stopped, page_pool_release_retry() > > > > > > > > still complains not all the pages are returned to the pool and so the > > > > > > > > pool has not been deallocated yet. > > > > > > > > The chunk of code in my first email is just to demonstrate the issue > > > > > > > > and I am completely fine to get a better solution :) > > > > > > > > > > > > > > Your problem is perhaps made worse by threaded NAPI, you have > > > > > > > defer-free skbs sprayed across all cores and no NAPI there to > > > > > > > flush them :( > > > > > > > > > > > > yes, exactly :) > > > > > > > > > > > > > > > > > > > > > I guess we just need a way to free the pool in a reasonable amount > > > > > > > > of time. Agree? > > > > > > > > > > > > > > Whether we need to guarantee the release is the real question. > > > > > > > > > > > > yes, this is the main goal of my email. The defer-free skbs behaviour seems in > > > > > > contrast with the page_pool pending pages monitor mechanism or at least they > > > > > > do not work well together. > > > > > > > > > > > > @Jesper, Ilias: any input on it? > > > > > > > > > > > > > Maybe it's more of a false-positive warning. > > > > > > > > > > > > > > Flushing the defer list is probably fine as a hack, but it's not > > > > > > > a full fix as Eric explained. False positive can still happen. > > > > > > > > > > > > agree, it was just a way to give an idea of the issue, not a proper solution. > > > > > > > > > > > > Regards, > > > > > > Lorenzo > > > > > > > > > > > > > > > > > > > > I'm ambivalent. My only real request wold be to make the flushing > > > > > > > a helper in net/core/dev.c rather than open coded in page_pool.c. > > > > > > > > > > I agree. We need a central defer_list flushing helper > > > > > > > > > > It is too easy to say this is a false-positive warning. > > > > > IHMO this expose an issue with the sd->defer_list system. > > > > > > > > > > Lorenzo's test is adding+removing veth devices, which creates and runs > > > > > NAPI processing on random CPUs. After veth netdevices (+NAPI) are > > > > > removed, nothing will naturally invoking net_rx_softirq on this CPU. > > > > > Thus, we have SKBs waiting on CPUs sd->defer_list. Further more we will > > > > > not create new SKB with this skb->alloc_cpu, to trigger RX softirq IPI > > > > > call (trigger_rx_softirq), even if this CPU process and frees SKBs. > > > > > > > > > > I see two solutions: > > > > > > > > > > (1) When netdevice/NAPI unregister happens call defer_list flushing > > > > > helper. > > > > > > > > > > (2) Use napi_watchdog to detect if defer_list is (many jiffies) old, > > > > > and then call defer_list flushing helper. > > > > > > > > > > > > > > > > > > > > > > > > Somewhat related - Eric, do we need to handle defer_list in dev_cpu_dead()? 
> > > > > > > > > > Looks to me like dev_cpu_dead() also need this flushing helper for > > > > > sd->defer_list, or at least moving the sd->defer_list to an sd that will > > > > > run eventually. > > > > > > > > I think I just considered having a few skbs in per-cpu list would not > > > > be an issue, > > > > especially considering skbs can sit hours in tcp receive queues. > > > > > > > > > > It was the first thing I said to Lorenzo when he first reported the > > > problem to me (over chat): It is likely packets sitting in a TCP queue. > > > Then I instructed him to look at output from netstat to see queues and > > > look for TIME-WAIT, FIN-WAIT etc. > > > > > > > > > > Do we expect hacing some kind of callback/shrinker to instruct TCP or > > > > pipes to release all pages that prevent > > > > a page_pool to be freed ? > > > > > > > > > > This is *not* what I'm asking for. > > > > > > With TCP sockets (pipes etc) we can take care of closing the sockets > > > (and programs etc) to free up the SKBs (and perhaps wait for timeouts) > > > to make sure the page_pool shutdown doesn't hang. > > > > > > The problem arise for all the selftests that uses veth and bpf_test_run > > > (using bpf_test_run_xdp_live / xdp_test_run_setup). For the selftests > > > we obviously take care of closing sockets and removing veth interfaces > > > again. Problem: The defer_list corner-case isn't under our control. > > > > > > > > > > Here, we are talking of hundreds of thousands of skbs, compared to at > > > > most 32 skbs per cpu. > > > > > > > > > > It is not a memory usage concern. > > > > > > > Perhaps sets sysctl_skb_defer_max to zero by default, so that admins > > > > can opt-in > > > > > > > > > > I really like the sd->defer_list system and I think is should be enabled > > > by default. Even if disabled by default, we still need to handle these > > > corner cases, as the selftests shouldn't start to cause-issues when this > > > gets enabled. > > > > > > The simple solution is: (1) When netdevice/NAPI unregister happens call > > > defer_list flushing helper. And perhaps we also need to call it in > > > xdp_test_run_teardown(). How do you feel about that? > > > > > > --Jesper > > > > > > > Today I was discussing with Toke about this issue, and we were wondering, > > if we just consider the page_pool use-case, what about moving the real pool > > destroying steps when we return a page to the pool in page_pool_put_full_page() > > if the pool has marked to be destroyed and there are no inflight pages instead > > of assuming we have all the pages in the pool when we run page_pool_destroy()? > > It sounds like you want to add a runtime check to the fast-path to > handle these corner cases? > > For performance reason we should not call page_pool_inflight() check in > fast-path, please! ack, right. > > Details: You hopefully mean running/calling page_pool_release(pool) and not > page_pool_destroy(). yes, I mean page_pool_release() > > I'm not totally against the idea, as long as someone is willing to do > extensive benchmarking that it doesn't affect fast-path performance. > Given we already read pool->p.flags in fast-path, it might be possible > to hide the extra branch (in the CPU pipeline). > > > > Maybe this means just get rid of the warn in page_pool_release_retry() :) > > > > Sure, we can remove the print statement, but it feels like closing our > eyes and ignoring the problem. We can remove the print statement, and > still debug the problem, as I have added tracepoints (to debug this). 
> But users will not report these issues early... on the other hand most of > these reports will likely be false-positives. > > This reminds me that Jakub's recent defer patches returning pages > 'directly' to the page_pool alloc-cache, will actually result in this > kind of bug. This is because page_pool_destroy() assumes that pages > cannot be returned to alloc-cache, as driver will have "disconnected" RX > side. We need to address this bug separately. Lorenzo you didn't > happen to use a kernel with Jakub's patches included, do you? nope, I did not test them. Regards, Lorenzo > > --Jesper > > >
On Wed, 19 Apr 2023 17:36:20 +0200 Jesper Dangaard Brouer wrote: > This reminds me that Jakub's recent defer patches returning pages > 'directly' to the page_pool alloc-cache, will actually result in this > kind of bug. This is because page_pool_destroy() assumes that pages > cannot be returned to alloc-cache, as driver will have "disconnected" RX > side. We need to address this bug separately. Mm, I think you're right. The NAPI can get restarted with a different page pool and the old page pool will still think it's in use.
diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index 193c18799865..160f45c4e3a5 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -19,6 +19,7 @@
 #include <linux/mm.h> /* for put_page() */
 #include <linux/poison.h>
 #include <linux/ethtool.h>
+#include <linux/netdevice.h>
 
 #include <trace/events/page_pool.h>
 
@@ -810,12 +811,23 @@ static void page_pool_release_retry(struct work_struct *wq)
 {
 	struct delayed_work *dwq = to_delayed_work(wq);
 	struct page_pool *pool = container_of(dwq, typeof(*pool), release_dw);
-	int inflight;
+	int cpu, inflight;
 
 	inflight = page_pool_release(pool);
 	if (!inflight)
 		return;
 
+	/* Run NET_RX_SOFTIRQ in order to free pending skbs in softnet_data
+	 * defer_list that can stay in the list until we have enough queued
+	 * traffic.
+	 */
+	for_each_online_cpu(cpu) {
+		struct softnet_data *sd = &per_cpu(softnet_data, cpu);
+
+		if (!cmpxchg(&sd->defer_ipi_scheduled, 0, 1))
+			smp_call_function_single_async(cpu, &sd->defer_csd);
+	}
+
 	/* Periodic warning */
 	if (time_after_eq(jiffies, pool->defer_warn)) {
 		int sec = (s32)((u32)jiffies - (u32)pool->defer_start) / HZ;