diff mbox series

[net,v2,2/2] page_pool: fix IOMMU crash when driver has already unbound

Message ID 20240925075707.3970187-3-linyunsheng@huawei.com (mailing list archive)
State New
Headers show
Series fix two bugs related to page_pool | expand

Commit Message

Yunsheng Lin Sept. 25, 2024, 7:57 a.m. UTC
Networking driver with page_pool support may hand over page
still with dma mapping to network stack and try to reuse that
page after network stack is done with it and passes it back
to page_pool to avoid the penalty of dma mapping/unmapping.
With all the caching in the network stack, some pages may be
held in the network stack without returning to the page_pool
soon enough, and with VF disable causing the driver unbound,
the page_pool does not stop the driver from doing it's
unbounding work, instead page_pool uses workqueue to check
if there is some pages coming back from the network stack
periodically, if there is any, it will do the dma unmmapping
related cleanup work.

As mentioned in [1], attempting DMA unmaps after the driver
has already unbound may leak resources or at worst corrupt
memory. Fundamentally, the page pool code cannot allow DMA
mappings to outlive the driver they belong to.

Currently it seems there are at least two cases that the page
is not released fast enough causing dma unmmapping done after
driver has already unbound:
1. ipv4 packet defragmentation timeout: this seems to cause
   delay up to 30 secs.
2. skb_defer_free_flush(): this may cause infinite delay if
   there is no triggering for net_rx_action().

In order not to do the dma unmmapping after driver has already
unbound and stall the unloading of the networking driver, add
the pool->items array to record all the pages including the ones
which are handed over to network stack, so the page_pool can
do the dma unmmapping for those pages when page_pool_destroy()
is called. As the pool->items need to be large enough to avoid
performance degradation, add a 'item_full' stat to indicate the
allocation failure due to unavailability of pool->items.

Note, the devmem patchset seems to make the bug harder to fix,
and may make backporting harder too. As there is no actual user
for the devmem and the fixing for devmem is unclear for now,
this patch does not consider fixing the case for devmem yet.

1. https://lore.kernel.org/lkml/8067f204-1380-4d37-8ffd-007fc6f26738@kernel.org/T/

Fixes: f71fec47c2df ("page_pool: make sure struct device is stable")
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
CC: Robin Murphy <robin.murphy@arm.com>
CC: Alexander Duyck <alexander.duyck@gmail.com>
CC: IOMMU <iommu@lists.linux.dev>
---
 drivers/net/ethernet/freescale/fec_main.c     |   8 +-
 drivers/net/ethernet/intel/iavf/iavf_txrx.c   |   6 +-
 drivers/net/ethernet/intel/idpf/idpf_txrx.c   |  14 +-
 drivers/net/ethernet/intel/libeth/rx.c        |   2 +-
 .../net/ethernet/mellanox/mlx5/core/en/xdp.c  |   3 +-
 drivers/net/netdevsim/netdev.c                |   6 +-
 drivers/net/wireless/mediatek/mt76/mt76.h     |   2 +-
 include/linux/mm_types.h                      |   2 +-
 include/linux/skbuff.h                        |   1 +
 include/net/libeth/rx.h                       |   3 +-
 include/net/netmem.h                          |  10 +-
 include/net/page_pool/helpers.h               |   7 +
 include/net/page_pool/types.h                 |  17 +-
 net/core/devmem.c                             |   4 +-
 net/core/netmem_priv.h                        |   5 +-
 net/core/page_pool.c                          | 163 +++++++++++++++---
 net/core/page_pool_priv.h                     |  10 +-
 net/core/skbuff.c                             |   3 +-
 net/core/xdp.c                                |   3 +-
 19 files changed, 212 insertions(+), 57 deletions(-)

Comments

Mina Almasry Sept. 26, 2024, 6:15 p.m. UTC | #1
On Wed, Sep 25, 2024 at 1:03 AM Yunsheng Lin <linyunsheng@huawei.com> wrote:
>
> Networking driver with page_pool support may hand over page
> still with dma mapping to network stack and try to reuse that
> page after network stack is done with it and passes it back
> to page_pool to avoid the penalty of dma mapping/unmapping.
> With all the caching in the network stack, some pages may be
> held in the network stack without returning to the page_pool
> soon enough, and with VF disable causing the driver unbound,
> the page_pool does not stop the driver from doing it's
> unbounding work, instead page_pool uses workqueue to check
> if there is some pages coming back from the network stack
> periodically, if there is any, it will do the dma unmmapping
> related cleanup work.
>
> As mentioned in [1], attempting DMA unmaps after the driver
> has already unbound may leak resources or at worst corrupt
> memory. Fundamentally, the page pool code cannot allow DMA
> mappings to outlive the driver they belong to.
>
> Currently it seems there are at least two cases that the page
> is not released fast enough causing dma unmmapping done after
> driver has already unbound:
> 1. ipv4 packet defragmentation timeout: this seems to cause
>    delay up to 30 secs.
> 2. skb_defer_free_flush(): this may cause infinite delay if
>    there is no triggering for net_rx_action().
>

I think additionally this is dependent on user behavior, right? AFAIU,
frags allocated by the page_pool will remain in the socket receive
queue until the user calls recvmsg(), and AFAIU they are stuck there
arbitrarily long.

> In order not to do the dma unmmapping after driver has already
> unbound and stall the unloading of the networking driver, add
> the pool->items array to record all the pages including the ones
> which are handed over to network stack, so the page_pool can
> do the dma unmmapping for those pages when page_pool_destroy()
> is called.

One thing I could not understand from looking at the code: if the
items array is in the struct page_pool, why do you need to modify the
page_pool entry in the struct page and in the struct net_iov? I think
the code could be made much simpler if you can remove these changes,
and you wouldn't need to modify the public api of the page_pool.

> As the pool->items need to be large enough to avoid
> performance degradation, add a 'item_full' stat to indicate the
> allocation failure due to unavailability of pool->items.
>

I'm not sure there is any way to size the pool->items array correctly.
Can you use a data structure here that can grow? Linked list or
xarray?

AFAIU what we want is when the page pool allocates a netmem it will
add the netmem to the items array, and when the pp releases a netmem
it will remove it from the array. Both of these operations are slow
paths, right? So the performance of a data structure more complicated
than an array may be ok. bench_page_pool_simple will tell for sure.

> Note, the devmem patchset seems to make the bug harder to fix,
> and may make backporting harder too. As there is no actual user
> for the devmem and the fixing for devmem is unclear for now,
> this patch does not consider fixing the case for devmem yet.
>

net_iovs don't hit this bug, dma_unmap_page_attrs() is never called on
them, so no special handling is needed really. However for code
quality reasons lets try to minimize the number of devmem or memory
provider checks in the code, if possible.
Yunsheng Lin Sept. 27, 2024, 3:57 a.m. UTC | #2
On 2024/9/27 2:15, Mina Almasry wrote:
> 
>> In order not to do the dma unmmapping after driver has already
>> unbound and stall the unloading of the networking driver, add
>> the pool->items array to record all the pages including the ones
>> which are handed over to network stack, so the page_pool can
>> do the dma unmmapping for those pages when page_pool_destroy()
>> is called.
> 
> One thing I could not understand from looking at the code: if the
> items array is in the struct page_pool, why do you need to modify the
> page_pool entry in the struct page and in the struct net_iov? I think
> the code could be made much simpler if you can remove these changes,
> and you wouldn't need to modify the public api of the page_pool.

As mentioned in [1]:
"There is no space in 'struct page' to track the inflight pages, so
'pp' in 'struct page' is renamed to 'pp_item' to enable the tracking
of inflight page"

As we still need pp for "struct page_pool" for page_pool_put_page()
related API, the container_of() trick is used to get the pp from the
pp_item.

As you had changed 'struct net_iov' to be mirroring the 'struct page',
so change 'struct net_iov' part accordingly.

1. https://lore.kernel.org/all/50a463d5-a5a1-422f-a4f7-d3587b12c265@huawei.com/

> 
>> As the pool->items need to be large enough to avoid
>> performance degradation, add a 'item_full' stat to indicate the
>> allocation failure due to unavailability of pool->items.
>>
> 
> I'm not sure there is any way to size the pool->items array correctly.

Currently the size of pool->items is calculated in page_pool_create_percpu()
as below, to make sure the size of pool->items is somewhat twice of the
size of pool->ring so that the number of page sitting in the driver's rx
ring waiting for the new packet is the similar to the number of page that is
still being handled in the network stack as most drivers seems to set the
pool->pool_size according to their rx ring size:

+#define PAGE_POOL_MIN_INFLIGHT_ITEMS		512
+	unsigned int item_cnt = (params->pool_size ? : 1024) +
+				PP_ALLOC_CACHE_SIZE + PAGE_POOL_MIN_INFLIGHT_ITEMS;
+	item_cnt = roundup_pow_of_two(item_cnt);

> Can you use a data structure here that can grow? Linked list or
> xarray?
> 
> AFAIU what we want is when the page pool allocates a netmem it will
> add the netmem to the items array, and when the pp releases a netmem
> it will remove it from the array. Both of these operations are slow
> paths, right? So the performance of a data structure more complicated
> than an array may be ok. bench_page_pool_simple will tell for sure.

The question would be why do we need the pool->items to grow with the
additional overhead and complication by dynamic allocation of item, using
complicated data structure and concurrent handling?

As mentioned in [2], it was the existing semantics, but it does not means
we need to keep it. The changing of semantics seems like an advantage
to me, as we are able to limit how many pages is allowed to be used by
a page_pool instance.

2. https://lore.kernel.org/all/2fb8d278-62e0-4a81-a537-8f601f61e81d@huawei.com/

> 
>> Note, the devmem patchset seems to make the bug harder to fix,
>> and may make backporting harder too. As there is no actual user
>> for the devmem and the fixing for devmem is unclear for now,
>> this patch does not consider fixing the case for devmem yet.
>>
> 
> net_iovs don't hit this bug, dma_unmap_page_attrs() is never called on
> them, so no special handling is needed really. However for code

I am really doubtful about your above claim. As at least the below
implementaion of dma_buf_unmap_attachment_unlocked() called in
__net_devmem_dmabuf_binding_free() seems be using the DMA API directly:

https://elixir.bootlin.com/linux/v6.7-rc8/source/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c#L215

Or am I missing something obvious here?

> quality reasons lets try to minimize the number of devmem or memory
> provider checks in the code, if possible.
>
Mina Almasry Sept. 27, 2024, 5:54 a.m. UTC | #3
On Thu, Sep 26, 2024 at 8:58 PM Yunsheng Lin <linyunsheng@huawei.com> wrote:
>
> On 2024/9/27 2:15, Mina Almasry wrote:
> >
> >> In order not to do the dma unmmapping after driver has already
> >> unbound and stall the unloading of the networking driver, add
> >> the pool->items array to record all the pages including the ones
> >> which are handed over to network stack, so the page_pool can
> >> do the dma unmmapping for those pages when page_pool_destroy()
> >> is called.
> >
> > One thing I could not understand from looking at the code: if the
> > items array is in the struct page_pool, why do you need to modify the
> > page_pool entry in the struct page and in the struct net_iov? I think
> > the code could be made much simpler if you can remove these changes,
> > and you wouldn't need to modify the public api of the page_pool.
>
> As mentioned in [1]:
> "There is no space in 'struct page' to track the inflight pages, so
> 'pp' in 'struct page' is renamed to 'pp_item' to enable the tracking
> of inflight page"
>
> As we still need pp for "struct page_pool" for page_pool_put_page()
> related API, the container_of() trick is used to get the pp from the
> pp_item.
>
> As you had changed 'struct net_iov' to be mirroring the 'struct page',
> so change 'struct net_iov' part accordingly.
>
> 1. https://lore.kernel.org/all/50a463d5-a5a1-422f-a4f7-d3587b12c265@huawei.com/
>

I'm not sure we need the pages themselves to have the list of pages
that need to be dma unmapped on page_pool_destroy. The pool can have
the list of pages that need to be unmapped on page_pool_destroy, and
the individual pages need not track them, unless I'm missing
something.

> >
> >> As the pool->items need to be large enough to avoid
> >> performance degradation, add a 'item_full' stat to indicate the
> >> allocation failure due to unavailability of pool->items.
> >>
> >
> > I'm not sure there is any way to size the pool->items array correctly.
>
> Currently the size of pool->items is calculated in page_pool_create_percpu()
> as below, to make sure the size of pool->items is somewhat twice of the
> size of pool->ring so that the number of page sitting in the driver's rx
> ring waiting for the new packet is the similar to the number of page that is
> still being handled in the network stack as most drivers seems to set the
> pool->pool_size according to their rx ring size:
>
> +#define PAGE_POOL_MIN_INFLIGHT_ITEMS           512
> +       unsigned int item_cnt = (params->pool_size ? : 1024) +
> +                               PP_ALLOC_CACHE_SIZE + PAGE_POOL_MIN_INFLIGHT_ITEMS;
> +       item_cnt = roundup_pow_of_two(item_cnt);
>

I'm not sure it's OK to add a limitation to the page_pool that it can
only allocate N pages. At the moment, AFAIU, N is unlimited and it may
become a regression if we add a limitation.

> > Can you use a data structure here that can grow? Linked list or
> > xarray?
> >
> > AFAIU what we want is when the page pool allocates a netmem it will
> > add the netmem to the items array, and when the pp releases a netmem
> > it will remove it from the array. Both of these operations are slow
> > paths, right? So the performance of a data structure more complicated
> > than an array may be ok. bench_page_pool_simple will tell for sure.
>
> The question would be why do we need the pool->items to grow with the
> additional overhead and complication by dynamic allocation of item, using
> complicated data structure and concurrent handling?
>
> As mentioned in [2], it was the existing semantics, but it does not means
> we need to keep it. The changing of semantics seems like an advantage
> to me, as we are able to limit how many pages is allowed to be used by
> a page_pool instance.
>
> 2. https://lore.kernel.org/all/2fb8d278-62e0-4a81-a537-8f601f61e81d@huawei.com/
>
> >
> >> Note, the devmem patchset seems to make the bug harder to fix,
> >> and may make backporting harder too. As there is no actual user
> >> for the devmem and the fixing for devmem is unclear for now,
> >> this patch does not consider fixing the case for devmem yet.
> >>
> >
> > net_iovs don't hit this bug, dma_unmap_page_attrs() is never called on
> > them, so no special handling is needed really. However for code
>
> I am really doubtful about your above claim. As at least the below
> implementaion of dma_buf_unmap_attachment_unlocked() called in
> __net_devmem_dmabuf_binding_free() seems be using the DMA API directly:
>
> https://elixir.bootlin.com/linux/v6.7-rc8/source/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c#L215
>
> Or am I missing something obvious here?
>

I mean currently net_iovs don't hit the __page_pool_release_page_dma
function that causes the crash in the stack trace. The dmabuf layer
handles the unmapping when the dmabuf dies (I assume correctly).
Yunsheng Lin Sept. 27, 2024, 7:25 a.m. UTC | #4
adding Sumit & Christian & dma-buf maillist

On 2024/9/27 13:54, Mina Almasry wrote:
> On Thu, Sep 26, 2024 at 8:58 PM Yunsheng Lin <linyunsheng@huawei.com> wrote:
>>
>> On 2024/9/27 2:15, Mina Almasry wrote:
>>>
>>>> In order not to do the dma unmmapping after driver has already
>>>> unbound and stall the unloading of the networking driver, add
>>>> the pool->items array to record all the pages including the ones
>>>> which are handed over to network stack, so the page_pool can
>>>> do the dma unmmapping for those pages when page_pool_destroy()
>>>> is called.
>>>
>>> One thing I could not understand from looking at the code: if the
>>> items array is in the struct page_pool, why do you need to modify the
>>> page_pool entry in the struct page and in the struct net_iov? I think
>>> the code could be made much simpler if you can remove these changes,
>>> and you wouldn't need to modify the public api of the page_pool.
>>
>> As mentioned in [1]:
>> "There is no space in 'struct page' to track the inflight pages, so
>> 'pp' in 'struct page' is renamed to 'pp_item' to enable the tracking
>> of inflight page"
>>
>> As we still need pp for "struct page_pool" for page_pool_put_page()
>> related API, the container_of() trick is used to get the pp from the
>> pp_item.
>>
>> As you had changed 'struct net_iov' to be mirroring the 'struct page',
>> so change 'struct net_iov' part accordingly.
>>
>> 1. https://lore.kernel.org/all/50a463d5-a5a1-422f-a4f7-d3587b12c265@huawei.com/
>>
> 
> I'm not sure we need the pages themselves to have the list of pages
> that need to be dma unmapped on page_pool_destroy. The pool can have
> the list of pages that need to be unmapped on page_pool_destroy, and
> the individual pages need not track them, unless I'm missing
> something.

It is about the pool having the list of pages that need to be unmapped.
The point is that the list of pages that need to be unmapped is dynamic,
it is not a static list:
1. How to find a empty space in the list and add a page to the list?
2. How to find a page in the list and delete it from the list?
3. How to do the about two steps concurrently without obvious overhead?

I am not sure how it is possible to do the above without something like
the 'pp_item' added in this patch? Even the lockless list in the
include/linux/llist.h need a 'struct llist_node' for that to work.
But if it is possible, please share the idea in your mind.

> 
>>>
>>>> As the pool->items need to be large enough to avoid
>>>> performance degradation, add a 'item_full' stat to indicate the
>>>> allocation failure due to unavailability of pool->items.
>>>>
>>>
>>> I'm not sure there is any way to size the pool->items array correctly.
>>
>> Currently the size of pool->items is calculated in page_pool_create_percpu()
>> as below, to make sure the size of pool->items is somewhat twice of the
>> size of pool->ring so that the number of page sitting in the driver's rx
>> ring waiting for the new packet is the similar to the number of page that is
>> still being handled in the network stack as most drivers seems to set the
>> pool->pool_size according to their rx ring size:
>>
>> +#define PAGE_POOL_MIN_INFLIGHT_ITEMS           512
>> +       unsigned int item_cnt = (params->pool_size ? : 1024) +
>> +                               PP_ALLOC_CACHE_SIZE + PAGE_POOL_MIN_INFLIGHT_ITEMS;
>> +       item_cnt = roundup_pow_of_two(item_cnt);
>>
> 
> I'm not sure it's OK to add a limitation to the page_pool that it can
> only allocate N pages. At the moment, AFAIU, N is unlimited and it may
> become a regression if we add a limitation.

Maybe, let's see if there is some stronger argument that it is not ok
to add the limitation or some testing that does show the limitation
does bring a regression.

> 
>>> Can you use a data structure here that can grow? Linked list or
>>> xarray?
>>>
>>> AFAIU what we want is when the page pool allocates a netmem it will
>>> add the netmem to the items array, and when the pp releases a netmem
>>> it will remove it from the array. Both of these operations are slow
>>> paths, right? So the performance of a data structure more complicated
>>> than an array may be ok. bench_page_pool_simple will tell for sure.
>>
>> The question would be why do we need the pool->items to grow with the
>> additional overhead and complication by dynamic allocation of item, using
>> complicated data structure and concurrent handling?
>>
>> As mentioned in [2], it was the existing semantics, but it does not means
>> we need to keep it. The changing of semantics seems like an advantage
>> to me, as we are able to limit how many pages is allowed to be used by
>> a page_pool instance.
>>
>> 2. https://lore.kernel.org/all/2fb8d278-62e0-4a81-a537-8f601f61e81d@huawei.com/
>>
>>>
>>>> Note, the devmem patchset seems to make the bug harder to fix,
>>>> and may make backporting harder too. As there is no actual user
>>>> for the devmem and the fixing for devmem is unclear for now,
>>>> this patch does not consider fixing the case for devmem yet.
>>>>
>>>
>>> net_iovs don't hit this bug, dma_unmap_page_attrs() is never called on
>>> them, so no special handling is needed really. However for code
>>
>> I am really doubtful about your above claim. As at least the below
>> implementaion of dma_buf_unmap_attachment_unlocked() called in
>> __net_devmem_dmabuf_binding_free() seems be using the DMA API directly:
>>
>> https://elixir.bootlin.com/linux/v6.7-rc8/source/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c#L215
>>
>> Or am I missing something obvious here?
>>
> 
> I mean currently net_iovs don't hit the __page_pool_release_page_dma
> function that causes the crash in the stack trace. The dmabuf layer
> handles the unmapping when the dmabuf dies (I assume correctly).

It seems like the similar assumption made about the normal page.
How is dmabuf layer able to handles the unmapping when the driver
which creates the page_pool with the devmem pages has unbound and
the 'struct device' behind the driver has became invalid?

If dmabuf layer is able to handle that, it seems the page_pool may
be able to handle that too. Adding the maintainers of Dma-buf to see
if there is some clarifying from them.

>
Ilias Apalodimas Sept. 27, 2024, 9:21 a.m. UTC | #5
Hi Yunsheng

On Fri, 27 Sept 2024 at 06:58, Yunsheng Lin <linyunsheng@huawei.com> wrote:
>
> On 2024/9/27 2:15, Mina Almasry wrote:
> >
> >> In order not to do the dma unmmapping after driver has already
> >> unbound and stall the unloading of the networking driver, add
> >> the pool->items array to record all the pages including the ones
> >> which are handed over to network stack, so the page_pool can
> >> do the dma unmmapping for those pages when page_pool_destroy()
> >> is called.
> >
> > One thing I could not understand from looking at the code: if the
> > items array is in the struct page_pool, why do you need to modify the
> > page_pool entry in the struct page and in the struct net_iov? I think
> > the code could be made much simpler if you can remove these changes,
> > and you wouldn't need to modify the public api of the page_pool.
>
> As mentioned in [1]:
> "There is no space in 'struct page' to track the inflight pages, so
> 'pp' in 'struct page' is renamed to 'pp_item' to enable the tracking
> of inflight page"

I have the same feeling as Mina here. First of all, we do have an
unsigned long in struct page we use for padding IIRC. More
importantly, though, why does struct page need to know about this?
Can't we have the same information in page pool?
When the driver allocates pages it does via page_pool_dev_alloc_XXXXX
or something similar. Cant we do what you suggest here ? IOW when we
allocate a page we put it in a list, and when that page returns to
page_pool (and it's mapped) we remove it.

Thanks
/Ilias
>
> As we still need pp for "struct page_pool" for page_pool_put_page()
> related API, the container_of() trick is used to get the pp from the
> pp_item.
>
> As you had changed 'struct net_iov' to be mirroring the 'struct page',
> so change 'struct net_iov' part accordingly.
>
> 1. https://lore.kernel.org/all/50a463d5-a5a1-422f-a4f7-d3587b12c265@huawei.com/
>
> >
> >> As the pool->items need to be large enough to avoid
> >> performance degradation, add a 'item_full' stat to indicate the
> >> allocation failure due to unavailability of pool->items.
> >>
> >
> > I'm not sure there is any way to size the pool->items array correctly.
>
> Currently the size of pool->items is calculated in page_pool_create_percpu()
> as below, to make sure the size of pool->items is somewhat twice of the
> size of pool->ring so that the number of page sitting in the driver's rx
> ring waiting for the new packet is the similar to the number of page that is
> still being handled in the network stack as most drivers seems to set the
> pool->pool_size according to their rx ring size:
>
> +#define PAGE_POOL_MIN_INFLIGHT_ITEMS           512
> +       unsigned int item_cnt = (params->pool_size ? : 1024) +
> +                               PP_ALLOC_CACHE_SIZE + PAGE_POOL_MIN_INFLIGHT_ITEMS;
> +       item_cnt = roundup_pow_of_two(item_cnt);
>
> > Can you use a data structure here that can grow? Linked list or
> > xarray?
> >
> > AFAIU what we want is when the page pool allocates a netmem it will
> > add the netmem to the items array, and when the pp releases a netmem
> > it will remove it from the array. Both of these operations are slow
> > paths, right? So the performance of a data structure more complicated
> > than an array may be ok. bench_page_pool_simple will tell for sure.
>
> The question would be why do we need the pool->items to grow with the
> additional overhead and complication by dynamic allocation of item, using
> complicated data structure and concurrent handling?
>
> As mentioned in [2], it was the existing semantics, but it does not means
> we need to keep it. The changing of semantics seems like an advantage
> to me, as we are able to limit how many pages is allowed to be used by
> a page_pool instance.
>
> 2. https://lore.kernel.org/all/2fb8d278-62e0-4a81-a537-8f601f61e81d@huawei.com/
>
> >
> >> Note, the devmem patchset seems to make the bug harder to fix,
> >> and may make backporting harder too. As there is no actual user
> >> for the devmem and the fixing for devmem is unclear for now,
> >> this patch does not consider fixing the case for devmem yet.
> >>
> >
> > net_iovs don't hit this bug, dma_unmap_page_attrs() is never called on
> > them, so no special handling is needed really. However for code
>
> I am really doubtful about your above claim. As at least the below
> implementaion of dma_buf_unmap_attachment_unlocked() called in
> __net_devmem_dmabuf_binding_free() seems be using the DMA API directly:
>
> https://elixir.bootlin.com/linux/v6.7-rc8/source/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c#L215
>
> Or am I missing something obvious here?
>
> > quality reasons lets try to minimize the number of devmem or memory
> > provider checks in the code, if possible.
> >
Yunsheng Lin Sept. 27, 2024, 9:49 a.m. UTC | #6
On 2024/9/27 17:21, Ilias Apalodimas wrote:
> Hi Yunsheng
> 
> On Fri, 27 Sept 2024 at 06:58, Yunsheng Lin <linyunsheng@huawei.com> wrote:
>>
>> On 2024/9/27 2:15, Mina Almasry wrote:
>>>
>>>> In order not to do the dma unmmapping after driver has already
>>>> unbound and stall the unloading of the networking driver, add
>>>> the pool->items array to record all the pages including the ones
>>>> which are handed over to network stack, so the page_pool can
>>>> do the dma unmmapping for those pages when page_pool_destroy()
>>>> is called.
>>>
>>> One thing I could not understand from looking at the code: if the
>>> items array is in the struct page_pool, why do you need to modify the
>>> page_pool entry in the struct page and in the struct net_iov? I think
>>> the code could be made much simpler if you can remove these changes,
>>> and you wouldn't need to modify the public api of the page_pool.
>>
>> As mentioned in [1]:
>> "There is no space in 'struct page' to track the inflight pages, so
>> 'pp' in 'struct page' is renamed to 'pp_item' to enable the tracking
>> of inflight page"
> 
> I have the same feeling as Mina here. First of all, we do have an
> unsigned long in struct page we use for padding IIRC. More

I am assuming you are referring to '_pp_mapping_pad' in 'struct page',
unfortunately the field might be used when a page is mmap'ed to user
space as my understanding.

https://elixir.bootlin.com/linux/v6.7-rc8/source/include/linux/mm_types.h#L126

> importantly, though, why does struct page need to know about this?
> Can't we have the same information in page pool?
> When the driver allocates pages it does via page_pool_dev_alloc_XXXXX
> or something similar. Cant we do what you suggest here ? IOW when we
> allocate a page we put it in a list, and when that page returns to
> page_pool (and it's mapped) we remove it.

Yes, that is the basic idea, but the important part is how to do that
with less performance impact.
Ilias Apalodimas Sept. 27, 2024, 9:58 a.m. UTC | #7
On Fri, 27 Sept 2024 at 12:50, Yunsheng Lin <linyunsheng@huawei.com> wrote:
>
> On 2024/9/27 17:21, Ilias Apalodimas wrote:
> > Hi Yunsheng
> >
> > On Fri, 27 Sept 2024 at 06:58, Yunsheng Lin <linyunsheng@huawei.com> wrote:
> >>
> >> On 2024/9/27 2:15, Mina Almasry wrote:
> >>>
> >>>> In order not to do the dma unmmapping after driver has already
> >>>> unbound and stall the unloading of the networking driver, add
> >>>> the pool->items array to record all the pages including the ones
> >>>> which are handed over to network stack, so the page_pool can
> >>>> do the dma unmmapping for those pages when page_pool_destroy()
> >>>> is called.
> >>>
> >>> One thing I could not understand from looking at the code: if the
> >>> items array is in the struct page_pool, why do you need to modify the
> >>> page_pool entry in the struct page and in the struct net_iov? I think
> >>> the code could be made much simpler if you can remove these changes,
> >>> and you wouldn't need to modify the public api of the page_pool.
> >>
> >> As mentioned in [1]:
> >> "There is no space in 'struct page' to track the inflight pages, so
> >> 'pp' in 'struct page' is renamed to 'pp_item' to enable the tracking
> >> of inflight page"
> >
> > I have the same feeling as Mina here. First of all, we do have an
> > unsigned long in struct page we use for padding IIRC. More
>
> I am assuming you are referring to '_pp_mapping_pad' in 'struct page',
> unfortunately the field might be used when a page is mmap'ed to user
> space as my understanding.
>

Ah good point, I just grepped for it and didn't look at the surrounding unions.

> https://elixir.bootlin.com/linux/v6.7-rc8/source/include/linux/mm_types.h#L126
>
> > importantly, though, why does struct page need to know about this?
> > Can't we have the same information in page pool?
> > When the driver allocates pages it does via page_pool_dev_alloc_XXXXX
> > or something similar. Cant we do what you suggest here ? IOW when we
> > allocate a page we put it in a list, and when that page returns to
> > page_pool (and it's mapped) we remove it.
>
> Yes, that is the basic idea, but the important part is how to do that
> with less performance impact.

Yes, but do you think that keeping that list of allocated pages in
struct page_pool will end up being more costly somehow compared to
struct page?

Thanks
/Ilias
Yunsheng Lin Sept. 27, 2024, 11:29 a.m. UTC | #8
On 2024/9/27 17:58, Ilias Apalodimas wrote:

...

>>
>>> importantly, though, why does struct page need to know about this?
>>> Can't we have the same information in page pool?
>>> When the driver allocates pages it does via page_pool_dev_alloc_XXXXX
>>> or something similar. Cant we do what you suggest here ? IOW when we
>>> allocate a page we put it in a list, and when that page returns to
>>> page_pool (and it's mapped) we remove it.
>>
>> Yes, that is the basic idea, but the important part is how to do that
>> with less performance impact.
> 
> Yes, but do you think that keeping that list of allocated pages in
> struct page_pool will end up being more costly somehow compared to
> struct page?

I am not sure if I understand your above question here.
I am supposing the question is about what's the cost between using
single/doubly linked list for the inflight pages or using a array
for the inflight pages like this patch does using pool->items?
If I understand question correctly, the single/doubly linked list
is more costly than array as the page_pool case as my understanding.

For single linked list, it doesn't allow deleting a specific entry but
only support deleting the first entry and all the entries. It does support
lockless operation using llist, but have limitation as below:
https://elixir.bootlin.com/linux/v6.7-rc8/source/include/linux/llist.h#L13

For doubly linked list, it needs two pointer to support deleting a specific
entry and it does not support lockless operation.

For pool->items, as the alloc side is protected by NAPI context, and the
free side use item->pp_idx to ensure there is only one producer for each
item, which means for each item in pool->items, there is only one consumer
and one producer, which seems much like the case when the page is not
recyclable in __page_pool_put_page, we don't need a lock protection when
calling page_pool_return_page(), the 'struct page' is also one consumer
and one producer as the pool->items[item->pp_idx] does:
https://elixir.bootlin.com/linux/v6.7-rc8/source/net/core/page_pool.c#L645

We only need a lock protection when page_pool_destroy() is called to
check if there is inflight page to be unmapped as a consumer, and the
__page_pool_put_page() may also called to unmapped the inflight page as
another consumer, there is why the 'destroy_lock' is added for protection
when pool->destroy_cnt > 0.

> 
> Thanks
> /Ilias
Ilias Apalodimas Sept. 28, 2024, 7:34 a.m. UTC | #9
Hi Yunsheng,

Overall this is a patch in the right direction. I want to get feedback
from others since Jakub and Jesper seemed to prefer the stalling idea.

On Fri, 27 Sept 2024 at 14:29, Yunsheng Lin <linyunsheng@huawei.com> wrote:
>
> On 2024/9/27 17:58, Ilias Apalodimas wrote:
>
> ...
>
> >>
> >>> importantly, though, why does struct page need to know about this?
> >>> Can't we have the same information in page pool?
> >>> When the driver allocates pages it does via page_pool_dev_alloc_XXXXX
> >>> or something similar. Cant we do what you suggest here ? IOW when we
> >>> allocate a page we put it in a list, and when that page returns to
> >>> page_pool (and it's mapped) we remove it.
> >>
> >> Yes, that is the basic idea, but the important part is how to do that
> >> with less performance impact.
> >
> > Yes, but do you think that keeping that list of allocated pages in
> > struct page_pool will end up being more costly somehow compared to
> > struct page?
>
> I am not sure if I understand your above question here.
> I am supposing the question is about what's the cost between using
> single/doubly linked list for the inflight pages or using a array
> for the inflight pages like this patch does using pool->items?

Yes, that wasn't very clear indeed, apologies for any confusion. I was
trying to ask on a linked list that only lives in struct page_pool.
But I now realize this was a bad idea since the lookup would be way
slower.

> If I understand question correctly, the single/doubly linked list
> is more costly than array as the page_pool case as my understanding.
>
> For single linked list, it doesn't allow deleting a specific entry but
> only support deleting the first entry and all the entries. It does support
> lockless operation using llist, but have limitation as below:
> https://elixir.bootlin.com/linux/v6.7-rc8/source/include/linux/llist.h#L13
>
> For doubly linked list, it needs two pointer to support deleting a specific
> entry and it does not support lockless operation.

I didn't look at the patch too carefully at first. Looking a bit
closer now, the array is indeed better, since the lookup is faster.
You just need the stored index in struct page to find the page we need
to unmap. Do you remember if we can reduce the atomic pp_ref_count to
32bits? If so we can reuse that space for the index. Looking at it
requires a bit more work in netmem, but that's mostly swapping all the
atomic64 calls to atomic ones.

>
> For pool->items, as the alloc side is protected by NAPI context, and the
> free side use item->pp_idx to ensure there is only one producer for each
> item, which means for each item in pool->items, there is only one consumer
> and one producer, which seems much like the case when the page is not
> recyclable in __page_pool_put_page, we don't need a lock protection when
> calling page_pool_return_page(), the 'struct page' is also one consumer
> and one producer as the pool->items[item->pp_idx] does:
> https://elixir.bootlin.com/linux/v6.7-rc8/source/net/core/page_pool.c#L645
>
> We only need a lock protection when page_pool_destroy() is called to
> check if there is inflight page to be unmapped as a consumer, and the
> __page_pool_put_page() may also called to unmapped the inflight page as
> another consumer,

Thanks for the explanation. On the locking side, page_pool_destroy is
called once from the driver and then it's either the workqueue for
inflight packets or an SKB that got freed and tried to recycle right?
But do we still need to do all the unmapping etc from the delayed
work? Since the new function will unmap all packets in
page_pool_destroy, we can just skip unmapping when the delayed work
runs

Thanks
/Ilias





> there is why the 'destroy_lock' is added for protection
> when pool->destroy_cnt > 0.
>
> >
> > Thanks
> > /Ilias
Yunsheng Lin Sept. 29, 2024, 2:44 a.m. UTC | #10
On 2024/9/28 15:34, Ilias Apalodimas wrote:

...

> 
> Yes, that wasn't very clear indeed, apologies for any confusion. I was
> trying to ask on a linked list that only lives in struct page_pool.
> But I now realize this was a bad idea since the lookup would be way
> slower.
> 
>> If I understand question correctly, the single/doubly linked list
>> is more costly than array as the page_pool case as my understanding.
>>
>> For single linked list, it doesn't allow deleting a specific entry but
>> only support deleting the first entry and all the entries. It does support
>> lockless operation using llist, but have limitation as below:
>> https://elixir.bootlin.com/linux/v6.7-rc8/source/include/linux/llist.h#L13
>>
>> For doubly linked list, it needs two pointer to support deleting a specific
>> entry and it does not support lockless operation.
> 
> I didn't look at the patch too carefully at first. Looking a bit
> closer now, the array is indeed better, since the lookup is faster.
> You just need the stored index in struct page to find the page we need
> to unmap. Do you remember if we can reduce the atomic pp_ref_count to
> 32bits? If so we can reuse that space for the index. Looking at it

For 64 bits system, yes, we can reuse that.
But for 32 bits system, we may have only 16 bits for each of them, and it
seems that there is no atomic operation for variable that is less than 32
bits.

> requires a bit more work in netmem, but that's mostly swapping all the
> atomic64 calls to atomic ones.
> 
>>
>> For pool->items, as the alloc side is protected by NAPI context, and the
>> free side use item->pp_idx to ensure there is only one producer for each
>> item, which means for each item in pool->items, there is only one consumer
>> and one producer, which seems much like the case when the page is not
>> recyclable in __page_pool_put_page, we don't need a lock protection when
>> calling page_pool_return_page(), the 'struct page' is also one consumer
>> and one producer as the pool->items[item->pp_idx] does:
>> https://elixir.bootlin.com/linux/v6.7-rc8/source/net/core/page_pool.c#L645
>>
>> We only need a lock protection when page_pool_destroy() is called to
>> check if there is inflight page to be unmapped as a consumer, and the
>> __page_pool_put_page() may also called to unmapped the inflight page as
>> another consumer,
> 
> Thanks for the explanation. On the locking side, page_pool_destroy is
> called once from the driver and then it's either the workqueue for
> inflight packets or an SKB that got freed and tried to recycle right?
> But do we still need to do all the unmapping etc from the delayed
> work? Since the new function will unmap all packets in
> page_pool_destroy, we can just skip unmapping when the delayed work
> runs

Yes, the pool->dma_map is clear in page_pool_item_uninit() after it does
the unmapping for all inflight pages with the protection of pool->destroy_lock,
so that the unmapping is skipped in page_pool_return_page() when those inflight
pages are returned back to page_pool.

>
Ilias Apalodimas Sept. 30, 2024, 8:09 a.m. UTC | #11
On Sun, 29 Sept 2024 at 05:44, Yunsheng Lin <linyunsheng@huawei.com> wrote:
>
> On 2024/9/28 15:34, Ilias Apalodimas wrote:
>
> ...
>
> >
> > Yes, that wasn't very clear indeed, apologies for any confusion. I was
> > trying to ask on a linked list that only lives in struct page_pool.
> > But I now realize this was a bad idea since the lookup would be way
> > slower.
> >
> >> If I understand question correctly, the single/doubly linked list
> >> is more costly than array as the page_pool case as my understanding.
> >>
> >> For single linked list, it doesn't allow deleting a specific entry but
> >> only support deleting the first entry and all the entries. It does support
> >> lockless operation using llist, but have limitation as below:
> >> https://elixir.bootlin.com/linux/v6.7-rc8/source/include/linux/llist.h#L13
> >>
> >> For doubly linked list, it needs two pointer to support deleting a specific
> >> entry and it does not support lockless operation.
> >
> > I didn't look at the patch too carefully at first. Looking a bit
> > closer now, the array is indeed better, since the lookup is faster.
> > You just need the stored index in struct page to find the page we need
> > to unmap. Do you remember if we can reduce the atomic pp_ref_count to
> > 32bits? If so we can reuse that space for the index. Looking at it
>
> For 64 bits system, yes, we can reuse that.
> But for 32 bits system, we may have only 16 bits for each of them, and it
> seems that there is no atomic operation for variable that is less than 32
> bits.
>
> > requires a bit more work in netmem, but that's mostly swapping all the
> > atomic64 calls to atomic ones.
> >
> >>
> >> For pool->items, as the alloc side is protected by NAPI context, and the
> >> free side use item->pp_idx to ensure there is only one producer for each
> >> item, which means for each item in pool->items, there is only one consumer
> >> and one producer, which seems much like the case when the page is not
> >> recyclable in __page_pool_put_page, we don't need a lock protection when
> >> calling page_pool_return_page(), the 'struct page' is also one consumer
> >> and one producer as the pool->items[item->pp_idx] does:
> >> https://elixir.bootlin.com/linux/v6.7-rc8/source/net/core/page_pool.c#L645
> >>
> >> We only need a lock protection when page_pool_destroy() is called to
> >> check if there is inflight page to be unmapped as a consumer, and the
> >> __page_pool_put_page() may also called to unmapped the inflight page as
> >> another consumer,
> >
> > Thanks for the explanation. On the locking side, page_pool_destroy is
> > called once from the driver and then it's either the workqueue for
> > inflight packets or an SKB that got freed and tried to recycle right?
> > But do we still need to do all the unmapping etc from the delayed
> > work? Since the new function will unmap all packets in
> > page_pool_destroy, we can just skip unmapping when the delayed work
> > runs
>
> Yes, the pool->dma_map is clear in page_pool_item_uninit() after it does
> the unmapping for all inflight pages with the protection of pool->destroy_lock,
> so that the unmapping is skipped in page_pool_return_page() when those inflight
> pages are returned back to page_pool.

Ah yes, the entire destruction path is protected which seems correct.
Instead of that WARN_ONCE in page_pool_item_uninit() can we instead
check the number of inflight packets vs what we just unmapped? IOW
check 'mask' against what page_pool_inflight() gives you and warn if
those aren't equal.


Thanks
/Ilias
>
> >
Yunsheng Lin Sept. 30, 2024, 8:38 a.m. UTC | #12
On 2024/9/30 16:09, Ilias Apalodimas wrote:
> On Sun, 29 Sept 2024 at 05:44, Yunsheng Lin <linyunsheng@huawei.com> wrote:
>>
>> On 2024/9/28 15:34, Ilias Apalodimas wrote:
>>
>> ...
>>
>>>
>>> Yes, that wasn't very clear indeed, apologies for any confusion. I was
>>> trying to ask on a linked list that only lives in struct page_pool.
>>> But I now realize this was a bad idea since the lookup would be way
>>> slower.
>>>
>>>> If I understand question correctly, the single/doubly linked list
>>>> is more costly than array as the page_pool case as my understanding.
>>>>
>>>> For single linked list, it doesn't allow deleting a specific entry but
>>>> only support deleting the first entry and all the entries. It does support
>>>> lockless operation using llist, but have limitation as below:
>>>> https://elixir.bootlin.com/linux/v6.7-rc8/source/include/linux/llist.h#L13
>>>>
>>>> For doubly linked list, it needs two pointer to support deleting a specific
>>>> entry and it does not support lockless operation.
>>>
>>> I didn't look at the patch too carefully at first. Looking a bit
>>> closer now, the array is indeed better, since the lookup is faster.
>>> You just need the stored index in struct page to find the page we need
>>> to unmap. Do you remember if we can reduce the atomic pp_ref_count to
>>> 32bits? If so we can reuse that space for the index. Looking at it
>>
>> For 64 bits system, yes, we can reuse that.
>> But for 32 bits system, we may have only 16 bits for each of them, and it
>> seems that there is no atomic operation for variable that is less than 32
>> bits.
>>
>>> requires a bit more work in netmem, but that's mostly swapping all the
>>> atomic64 calls to atomic ones.
>>>
>>>>
>>>> For pool->items, as the alloc side is protected by NAPI context, and the
>>>> free side use item->pp_idx to ensure there is only one producer for each
>>>> item, which means for each item in pool->items, there is only one consumer
>>>> and one producer, which seems much like the case when the page is not
>>>> recyclable in __page_pool_put_page, we don't need a lock protection when
>>>> calling page_pool_return_page(), the 'struct page' is also one consumer
>>>> and one producer as the pool->items[item->pp_idx] does:
>>>> https://elixir.bootlin.com/linux/v6.7-rc8/source/net/core/page_pool.c#L645
>>>>
>>>> We only need a lock protection when page_pool_destroy() is called to
>>>> check if there is inflight page to be unmapped as a consumer, and the
>>>> __page_pool_put_page() may also called to unmapped the inflight page as
>>>> another consumer,
>>>
>>> Thanks for the explanation. On the locking side, page_pool_destroy is
>>> called once from the driver and then it's either the workqueue for
>>> inflight packets or an SKB that got freed and tried to recycle right?
>>> But do we still need to do all the unmapping etc from the delayed
>>> work? Since the new function will unmap all packets in
>>> page_pool_destroy, we can just skip unmapping when the delayed work
>>> runs
>>
>> Yes, the pool->dma_map is clear in page_pool_item_uninit() after it does
>> the unmapping for all inflight pages with the protection of pool->destroy_lock,
>> so that the unmapping is skipped in page_pool_return_page() when those inflight
>> pages are returned back to page_pool.
> 
> Ah yes, the entire destruction path is protected which seems correct.
> Instead of that WARN_ONCE in page_pool_item_uninit() can we instead
> check the number of inflight packets vs what we just unmapped? IOW
> check 'mask' against what page_pool_inflight() gives you and warn if
> those aren't equal.
Yes, it seems it is quite normal to trigger the warning from testing,
it makes sense to check it against page_pool_inflight() to catch some
bug of tracking/calculating inflight pages.

> 
> 
> Thanks
> /Ilias
>>
>>>
Paolo Abeni Oct. 1, 2024, 1:32 p.m. UTC | #13
On 9/25/24 09:57, Yunsheng Lin wrote:
> Networking driver with page_pool support may hand over page
> still with dma mapping to network stack and try to reuse that
> page after network stack is done with it and passes it back
> to page_pool to avoid the penalty of dma mapping/unmapping.
> With all the caching in the network stack, some pages may be
> held in the network stack without returning to the page_pool
> soon enough, and with VF disable causing the driver unbound,
> the page_pool does not stop the driver from doing it's
> unbounding work, instead page_pool uses workqueue to check
> if there is some pages coming back from the network stack
> periodically, if there is any, it will do the dma unmmapping
> related cleanup work.
> 
> As mentioned in [1], attempting DMA unmaps after the driver
> has already unbound may leak resources or at worst corrupt
> memory. Fundamentally, the page pool code cannot allow DMA
> mappings to outlive the driver they belong to.
> 
> Currently it seems there are at least two cases that the page
> is not released fast enough causing dma unmmapping done after
> driver has already unbound:
> 1. ipv4 packet defragmentation timeout: this seems to cause
>     delay up to 30 secs.
> 2. skb_defer_free_flush(): this may cause infinite delay if
>     there is no triggering for net_rx_action().
> 
> In order not to do the dma unmmapping after driver has already
> unbound and stall the unloading of the networking driver, add
> the pool->items array to record all the pages including the ones
> which are handed over to network stack, so the page_pool can
> do the dma unmmapping for those pages when page_pool_destroy()
> is called. As the pool->items need to be large enough to avoid
> performance degradation, add a 'item_full' stat to indicate the
> allocation failure due to unavailability of pool->items.

This looks really invasive, with room for potentially large performance 
regressions or worse. At very least it does not look suitable for net.

Is the problem only tied to VFs drivers? It's a pity all the page_pool 
users will have to pay a bill for it...

/P
Yunsheng Lin Oct. 2, 2024, 2:34 a.m. UTC | #14
On 10/1/2024 9:32 PM, Paolo Abeni wrote:
> On 9/25/24 09:57, Yunsheng Lin wrote:
>> Networking driver with page_pool support may hand over page
>> still with dma mapping to network stack and try to reuse that
>> page after network stack is done with it and passes it back
>> to page_pool to avoid the penalty of dma mapping/unmapping.
>> With all the caching in the network stack, some pages may be
>> held in the network stack without returning to the page_pool
>> soon enough, and with VF disable causing the driver unbound,
>> the page_pool does not stop the driver from doing it's
>> unbounding work, instead page_pool uses workqueue to check
>> if there is some pages coming back from the network stack
>> periodically, if there is any, it will do the dma unmmapping
>> related cleanup work.
>>
>> As mentioned in [1], attempting DMA unmaps after the driver
>> has already unbound may leak resources or at worst corrupt
>> memory. Fundamentally, the page pool code cannot allow DMA
>> mappings to outlive the driver they belong to.
>>
>> Currently it seems there are at least two cases that the page
>> is not released fast enough causing dma unmmapping done after
>> driver has already unbound:
>> 1. ipv4 packet defragmentation timeout: this seems to cause
>>     delay up to 30 secs.
>> 2. skb_defer_free_flush(): this may cause infinite delay if
>>     there is no triggering for net_rx_action().
>>
>> In order not to do the dma unmmapping after driver has already
>> unbound and stall the unloading of the networking driver, add
>> the pool->items array to record all the pages including the ones
>> which are handed over to network stack, so the page_pool can
>> do the dma unmmapping for those pages when page_pool_destroy()
>> is called. As the pool->items need to be large enough to avoid
>> performance degradation, add a 'item_full' stat to indicate the
>> allocation failure due to unavailability of pool->items.
> 
> This looks really invasive, with room for potentially large performance 
> regressions or worse. At very least it does not look suitable for net.

I am open to targetting this to net-next, it can be backported when some
testing is done through one or two kernel versions and there is still
some interest to backport it too.

Or if there is some non-invasive way to fix this.

> 
> Is the problem only tied to VFs drivers? It's a pity all the page_pool 
> users will have to pay a bill for it...

I am afraid it is not only tied to VFs drivers, as:
attempting DMA unmaps after the driver has already unbound may leak
resources or at worst corrupt memory.

Unloading PFs driver might cause the above problems too, I guess the
probability of crashing is low for the PF as PF can not be disable
unless it can be hot-unplug'ed, but the probability of leaking resources
behind the dma mapping might be similar.

> 
> /P
> 
>
Ilias Apalodimas Oct. 2, 2024, 6:46 a.m. UTC | #15
Hi Paolo,

Thanks for taking the time.

On Tue, 1 Oct 2024 at 16:32, Paolo Abeni <pabeni@redhat.com> wrote:
>
> On 9/25/24 09:57, Yunsheng Lin wrote:
> > Networking driver with page_pool support may hand over page
> > still with dma mapping to network stack and try to reuse that
> > page after network stack is done with it and passes it back
> > to page_pool to avoid the penalty of dma mapping/unmapping.
> > With all the caching in the network stack, some pages may be
> > held in the network stack without returning to the page_pool
> > soon enough, and with VF disable causing the driver unbound,
> > the page_pool does not stop the driver from doing it's
> > unbounding work, instead page_pool uses workqueue to check
> > if there is some pages coming back from the network stack
> > periodically, if there is any, it will do the dma unmmapping
> > related cleanup work.
> >
> > As mentioned in [1], attempting DMA unmaps after the driver
> > has already unbound may leak resources or at worst corrupt
> > memory. Fundamentally, the page pool code cannot allow DMA
> > mappings to outlive the driver they belong to.
> >
> > Currently it seems there are at least two cases that the page
> > is not released fast enough causing dma unmmapping done after
> > driver has already unbound:
> > 1. ipv4 packet defragmentation timeout: this seems to cause
> >     delay up to 30 secs.
> > 2. skb_defer_free_flush(): this may cause infinite delay if
> >     there is no triggering for net_rx_action().
> >
> > In order not to do the dma unmmapping after driver has already
> > unbound and stall the unloading of the networking driver, add
> > the pool->items array to record all the pages including the ones
> > which are handed over to network stack, so the page_pool can
> > do the dma unmmapping for those pages when page_pool_destroy()
> > is called. As the pool->items need to be large enough to avoid
> > performance degradation, add a 'item_full' stat to indicate the
> > allocation failure due to unavailability of pool->items.
>
> This looks really invasive, with room for potentially large performance
> regressions or worse. At very least it does not look suitable for net.

Perhaps, and you are right we need to measure performance before
pulling it but...

>
> Is the problem only tied to VFs drivers? It's a pity all the page_pool
> users will have to pay a bill for it...

It's not. The problem happens when an SKB has been scheduled for
recycling and has already been mapped via page_pool. If the driver
disappears in the meantime, page_pool will free all the packets it
holds in its private rings (both slow and fast), but is not in control
of the SKB anymore. So any packets coming back for recycling *after*
that point cannot unmap memory properly.

As discussed this can either lead to memory corruption and resource
leaking, or worse as seen in the bug report panics. I am fine with
this going into -next, but it really is a bugfix, although I am not
100% sure that the Fixes: tag in the current patch is correct.

Thanks
/Ilias
>
> /P
>
Ilias Apalodimas Oct. 2, 2024, 6:51 a.m. UTC | #16
On Wed, 2 Oct 2024 at 09:46, Ilias Apalodimas
<ilias.apalodimas@linaro.org> wrote:
>
> Hi Paolo,
>
> Thanks for taking the time.
>
> On Tue, 1 Oct 2024 at 16:32, Paolo Abeni <pabeni@redhat.com> wrote:
> >
> > On 9/25/24 09:57, Yunsheng Lin wrote:
> > > Networking driver with page_pool support may hand over page
> > > still with dma mapping to network stack and try to reuse that
> > > page after network stack is done with it and passes it back
> > > to page_pool to avoid the penalty of dma mapping/unmapping.
> > > With all the caching in the network stack, some pages may be
> > > held in the network stack without returning to the page_pool
> > > soon enough, and with VF disable causing the driver unbound,
> > > the page_pool does not stop the driver from doing it's
> > > unbounding work, instead page_pool uses workqueue to check
> > > if there is some pages coming back from the network stack
> > > periodically, if there is any, it will do the dma unmmapping
> > > related cleanup work.
> > >
> > > As mentioned in [1], attempting DMA unmaps after the driver
> > > has already unbound may leak resources or at worst corrupt
> > > memory. Fundamentally, the page pool code cannot allow DMA
> > > mappings to outlive the driver they belong to.
> > >
> > > Currently it seems there are at least two cases that the page
> > > is not released fast enough causing dma unmmapping done after
> > > driver has already unbound:
> > > 1. ipv4 packet defragmentation timeout: this seems to cause
> > >     delay up to 30 secs.
> > > 2. skb_defer_free_flush(): this may cause infinite delay if
> > >     there is no triggering for net_rx_action().
> > >
> > > In order not to do the dma unmmapping after driver has already
> > > unbound and stall the unloading of the networking driver, add
> > > the pool->items array to record all the pages including the ones
> > > which are handed over to network stack, so the page_pool can
> > > do the dma unmmapping for those pages when page_pool_destroy()
> > > is called. As the pool->items need to be large enough to avoid
> > > performance degradation, add a 'item_full' stat to indicate the
> > > allocation failure due to unavailability of pool->items.
> >
> > This looks really invasive, with room for potentially large performance
> > regressions or worse. At very least it does not look suitable for net.
>
> Perhaps, and you are right we need to measure performance before
> pulling it but...
>
> >
> > Is the problem only tied to VFs drivers? It's a pity all the page_pool
> > users will have to pay a bill for it...
>
> It's not. The problem happens when an SKB has been scheduled for
> recycling and has already been mapped via page_pool. If the driver
> disappears in the meantime,

Apologies, this wasn't correct. It's the device that has to disappear
not the driver

> page_pool will free all the packets it
> holds in its private rings (both slow and fast), but is not in control
> of the SKB anymore. So any packets coming back for recycling *after*
> that point cannot unmap memory properly.
>
> As discussed this can either lead to memory corruption and resource
> leaking, or worse as seen in the bug report panics. I am fine with
> this going into -next, but it really is a bugfix, although I am not
> 100% sure that the Fixes: tag in the current patch is correct.
>
> Thanks
> /Ilias
> >
> > /P
> >
Paolo Abeni Oct. 2, 2024, 7:37 a.m. UTC | #17
Hi,

On 10/2/24 04:34, Yunsheng Lin wrote:
> On 10/1/2024 9:32 PM, Paolo Abeni wrote:
>> Is the problem only tied to VFs drivers? It's a pity all the page_pool
>> users will have to pay a bill for it...
> 
> I am afraid it is not only tied to VFs drivers, as:
> attempting DMA unmaps after the driver has already unbound may leak
> resources or at worst corrupt memory.
> 
> Unloading PFs driver might cause the above problems too, I guess the
> probability of crashing is low for the PF as PF can not be disable
> unless it can be hot-unplug'ed, but the probability of leaking resources
> behind the dma mapping might be similar.

Out of sheer ignorance, why/how the refcount acquired by the page pool 
on the device does not prevent unloading?

I fear the performance impact could be very high: AFICS, if the item 
array become fragmented, insertion will take linar time, with the quite 
large item_count/pool size. If so, it looks like a no-go.

I fear we should consider blocking the device removal until all the 
pages are returned/unmapped ?!? (I hope that could be easier/faster)

/P
Ilias Apalodimas Oct. 2, 2024, 8:23 a.m. UTC | #18
Hi Paolo,

On Wed, 2 Oct 2024 at 10:38, Paolo Abeni <pabeni@redhat.com> wrote:
>
> Hi,
>
> On 10/2/24 04:34, Yunsheng Lin wrote:
> > On 10/1/2024 9:32 PM, Paolo Abeni wrote:
> >> Is the problem only tied to VFs drivers? It's a pity all the page_pool
> >> users will have to pay a bill for it...
> >
> > I am afraid it is not only tied to VFs drivers, as:
> > attempting DMA unmaps after the driver has already unbound may leak
> > resources or at worst corrupt memory.
> >
> > Unloading PFs driver might cause the above problems too, I guess the
> > probability of crashing is low for the PF as PF can not be disable
> > unless it can be hot-unplug'ed, but the probability of leaking resources
> > behind the dma mapping might be similar.
>
> Out of sheer ignorance, why/how the refcount acquired by the page pool
> on the device does not prevent unloading?
>
> I fear the performance impact could be very high: AFICS, if the item
> array become fragmented, insertion will take linar time, with the quite
> large item_count/pool size. If so, it looks like a no-go.

It would be could if someone could test that. I'll look around in case
we have any test machines with cards that run on page pool.

>
> I fear we should consider blocking the device removal until all the
> pages are returned/unmapped ?!? (I hope that could be easier/faster)

Jakub send an RFC doing that [0]. Yes, this is far far simpler and
does not affect performance, but aren't we implicitly breaking
userspace?

[0] https://lore.kernel.org/netdev/20240806151618.1373008-1-kuba@kernel.org/

Thanks
/Ilias
>
> /P
>
Yunsheng Lin Oct. 5, 2024, 12:38 p.m. UTC | #19
On 10/2/2024 3:37 PM, Paolo Abeni wrote:
> Hi,
> 
> On 10/2/24 04:34, Yunsheng Lin wrote:
>> On 10/1/2024 9:32 PM, Paolo Abeni wrote:
>>> Is the problem only tied to VFs drivers? It's a pity all the page_pool
>>> users will have to pay a bill for it...
>>
>> I am afraid it is not only tied to VFs drivers, as:
>> attempting DMA unmaps after the driver has already unbound may leak
>> resources or at worst corrupt memory.
>>
>> Unloading PFs driver might cause the above problems too, I guess the
>> probability of crashing is low for the PF as PF can not be disable
>> unless it can be hot-unplug'ed, but the probability of leaking resources
>> behind the dma mapping might be similar.
> 
> Out of sheer ignorance, why/how the refcount acquired by the page pool 
> on the device does not prevent unloading?

I am not sure if I understand the reasoning behind that, but it seems
the driver unloading does not check on the refcount of the device from
the implementation of __device_release_driver().

> 
> I fear the performance impact could be very high: AFICS, if the item 
> array become fragmented, insertion will take linar time, with the quite 
> large item_count/pool size. If so, it looks like a no-go.

The last checked index is recorded in pool->item_idx, so the insertion
mostly will not take linear, unless pool->items is almost full and the
old item came back to page_pool is just checked. The thought is that if
it comes to this point, the page_pool is likely not the bottleneck
anymore, and adding infinite pool->items might not make any difference.

If the insertion does turn out to be a bottleneck, 'struct llist_head'
can be used to records the old items lockless for the freeing side, and
llist_del_all() can be used to refill the old items for the allocing
side from freeing side, which is kind of like the pool->ring and
pool->alloc used currently in page_pool. As this patchset is already
complicated, doing this makes it more complicated, I am not sure it is
worth the effort right now as benefit does not seem obvious yet.

> 
> I fear we should consider blocking the device removal until all the 
> pages are returned/unmapped ?!? (I hope that could be easier/faster)

As Ilias pointed out, blocking the device removal until all the pages
are returned/unmapped might cause infinite delay in our testing:

https://lore.kernel.org/netdev/d50ac1a9-f1e2-49ee-b89b-05dac9bc6ee1@huawei.com/

> 
> /P
>
diff mbox series

Patch

diff --git a/drivers/net/ethernet/freescale/fec_main.c b/drivers/net/ethernet/freescale/fec_main.c
index acbb627d51bf..c00f8c460759 100644
--- a/drivers/net/ethernet/freescale/fec_main.c
+++ b/drivers/net/ethernet/freescale/fec_main.c
@@ -1009,7 +1009,8 @@  static void fec_enet_bd_init(struct net_device *dev)
 				struct page *page = txq->tx_buf[i].buf_p;
 
 				if (page)
-					page_pool_put_page(page->pp, page, 0, false);
+					page_pool_put_page(page_pool_to_pp(page),
+							   page, 0, false);
 			}
 
 			txq->tx_buf[i].buf_p = NULL;
@@ -1538,7 +1539,7 @@  fec_enet_tx_queue(struct net_device *ndev, u16 queue_id, int budget)
 			xdp_return_frame_rx_napi(xdpf);
 		} else { /* recycle pages of XDP_TX frames */
 			/* The dma_sync_size = 0 as XDP_TX has already synced DMA for_device */
-			page_pool_put_page(page->pp, page, 0, true);
+			page_pool_put_page(page_pool_to_pp(page), page, 0, true);
 		}
 
 		txq->tx_buf[index].buf_p = NULL;
@@ -3300,7 +3301,8 @@  static void fec_enet_free_buffers(struct net_device *ndev)
 			} else {
 				struct page *page = txq->tx_buf[i].buf_p;
 
-				page_pool_put_page(page->pp, page, 0, false);
+				page_pool_put_page(page_pool_to_pp(page),
+						   page, 0, false);
 			}
 
 			txq->tx_buf[i].buf_p = NULL;
diff --git a/drivers/net/ethernet/intel/iavf/iavf_txrx.c b/drivers/net/ethernet/intel/iavf/iavf_txrx.c
index 26b424fd6718..658d8f9a6abb 100644
--- a/drivers/net/ethernet/intel/iavf/iavf_txrx.c
+++ b/drivers/net/ethernet/intel/iavf/iavf_txrx.c
@@ -1050,7 +1050,8 @@  static void iavf_add_rx_frag(struct sk_buff *skb,
 			     const struct libeth_fqe *rx_buffer,
 			     unsigned int size)
 {
-	u32 hr = rx_buffer->page->pp->p.offset;
+	struct page_pool *pool = page_pool_to_pp(rx_buffer->page);
+	u32 hr = pool->p.offset;
 
 	skb_add_rx_frag(skb, skb_shinfo(skb)->nr_frags, rx_buffer->page,
 			rx_buffer->offset + hr, size, rx_buffer->truesize);
@@ -1067,7 +1068,8 @@  static void iavf_add_rx_frag(struct sk_buff *skb,
 static struct sk_buff *iavf_build_skb(const struct libeth_fqe *rx_buffer,
 				      unsigned int size)
 {
-	u32 hr = rx_buffer->page->pp->p.offset;
+	struct page_pool *pool = page_pool_to_pp(rx_buffer->page);
+	u32 hr = pool->p.offset;
 	struct sk_buff *skb;
 	void *va;
 
diff --git a/drivers/net/ethernet/intel/idpf/idpf_txrx.c b/drivers/net/ethernet/intel/idpf/idpf_txrx.c
index d4e6f0e10487..e3389f1a215f 100644
--- a/drivers/net/ethernet/intel/idpf/idpf_txrx.c
+++ b/drivers/net/ethernet/intel/idpf/idpf_txrx.c
@@ -385,7 +385,8 @@  static void idpf_rx_page_rel(struct libeth_fqe *rx_buf)
 	if (unlikely(!rx_buf->page))
 		return;
 
-	page_pool_put_full_page(rx_buf->page->pp, rx_buf->page, false);
+	page_pool_put_full_page(page_pool_to_pp(rx_buf->page), rx_buf->page,
+				false);
 
 	rx_buf->page = NULL;
 	rx_buf->offset = 0;
@@ -3097,7 +3098,8 @@  idpf_rx_process_skb_fields(struct idpf_rx_queue *rxq, struct sk_buff *skb,
 void idpf_rx_add_frag(struct idpf_rx_buf *rx_buf, struct sk_buff *skb,
 		      unsigned int size)
 {
-	u32 hr = rx_buf->page->pp->p.offset;
+	struct page_pool *pool = page_pool_to_pp(rx_buf->page);
+	u32 hr = pool->p.offset;
 
 	skb_add_rx_frag(skb, skb_shinfo(skb)->nr_frags, rx_buf->page,
 			rx_buf->offset + hr, size, rx_buf->truesize);
@@ -3129,8 +3131,10 @@  static u32 idpf_rx_hsplit_wa(const struct libeth_fqe *hdr,
 	if (!libeth_rx_sync_for_cpu(buf, copy))
 		return 0;
 
-	dst = page_address(hdr->page) + hdr->offset + hdr->page->pp->p.offset;
-	src = page_address(buf->page) + buf->offset + buf->page->pp->p.offset;
+	dst = page_address(hdr->page) + hdr->offset +
+		page_pool_to_pp(hdr->page)->p.offset;
+	src = page_address(buf->page) + buf->offset +
+		page_pool_to_pp(buf->page)->p.offset;
 	memcpy(dst, src, LARGEST_ALIGN(copy));
 
 	buf->offset += copy;
@@ -3148,7 +3152,7 @@  static u32 idpf_rx_hsplit_wa(const struct libeth_fqe *hdr,
  */
 struct sk_buff *idpf_rx_build_skb(const struct libeth_fqe *buf, u32 size)
 {
-	u32 hr = buf->page->pp->p.offset;
+	u32 hr = page_pool_to_pp(buf->page)->p.offset;
 	struct sk_buff *skb;
 	void *va;
 
diff --git a/drivers/net/ethernet/intel/libeth/rx.c b/drivers/net/ethernet/intel/libeth/rx.c
index f20926669318..385afca0e61d 100644
--- a/drivers/net/ethernet/intel/libeth/rx.c
+++ b/drivers/net/ethernet/intel/libeth/rx.c
@@ -207,7 +207,7 @@  EXPORT_SYMBOL_NS_GPL(libeth_rx_fq_destroy, LIBETH);
  */
 void libeth_rx_recycle_slow(struct page *page)
 {
-	page_pool_recycle_direct(page->pp, page);
+	page_pool_recycle_direct(page_pool_to_pp(page), page);
 }
 EXPORT_SYMBOL_NS_GPL(libeth_rx_recycle_slow, LIBETH);
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
index 4610621a340e..83511a45a6dc 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
@@ -716,7 +716,8 @@  static void mlx5e_free_xdpsq_desc(struct mlx5e_xdpsq *sq,
 				/* No need to check ((page->pp_magic & ~0x3UL) == PP_SIGNATURE)
 				 * as we know this is a page_pool page.
 				 */
-				page_pool_recycle_direct(page->pp, page);
+				page_pool_recycle_direct(page_pool_to_pp(page),
+							 page);
 			} while (++n < num);
 
 			break;
diff --git a/drivers/net/netdevsim/netdev.c b/drivers/net/netdevsim/netdev.c
index 017a6102be0a..9bfa593cd5dd 100644
--- a/drivers/net/netdevsim/netdev.c
+++ b/drivers/net/netdevsim/netdev.c
@@ -593,7 +593,8 @@  nsim_pp_hold_write(struct file *file, const char __user *data,
 		if (!ns->page)
 			ret = -ENOMEM;
 	} else {
-		page_pool_put_full_page(ns->page->pp, ns->page, false);
+		page_pool_put_full_page(page_pool_to_pp(ns->page), ns->page,
+					false);
 		ns->page = NULL;
 	}
 	rtnl_unlock();
@@ -788,7 +789,8 @@  void nsim_destroy(struct netdevsim *ns)
 
 	/* Put this intentionally late to exercise the orphaning path */
 	if (ns->page) {
-		page_pool_put_full_page(ns->page->pp, ns->page, false);
+		page_pool_put_full_page(page_pool_to_pp(ns->page), ns->page,
+					false);
 		ns->page = NULL;
 	}
 
diff --git a/drivers/net/wireless/mediatek/mt76/mt76.h b/drivers/net/wireless/mediatek/mt76/mt76.h
index 0b75a45ad2e8..94a277290909 100644
--- a/drivers/net/wireless/mediatek/mt76/mt76.h
+++ b/drivers/net/wireless/mediatek/mt76/mt76.h
@@ -1688,7 +1688,7 @@  static inline void mt76_put_page_pool_buf(void *buf, bool allow_direct)
 {
 	struct page *page = virt_to_head_page(buf);
 
-	page_pool_put_full_page(page->pp, page, allow_direct);
+	page_pool_put_full_page(page_pool_to_pp(page), page, allow_direct);
 }
 
 static inline void *
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 485424979254..410187133d27 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -120,7 +120,7 @@  struct page {
 			 * page_pool allocated pages.
 			 */
 			unsigned long pp_magic;
-			struct page_pool *pp;
+			struct page_pool_item *pp_item;
 			unsigned long _pp_mapping_pad;
 			unsigned long dma_addr;
 			atomic_long_t pp_ref_count;
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 39f1d16f3628..64d1ecb7a7fc 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -38,6 +38,7 @@ 
 #include <net/net_debug.h>
 #include <net/dropreason-core.h>
 #include <net/netmem.h>
+#include <net/page_pool/types.h>
 
 /**
  * DOC: skb checksums
diff --git a/include/net/libeth/rx.h b/include/net/libeth/rx.h
index 43574bd6612f..beee7ddd77a5 100644
--- a/include/net/libeth/rx.h
+++ b/include/net/libeth/rx.h
@@ -137,7 +137,8 @@  static inline bool libeth_rx_sync_for_cpu(const struct libeth_fqe *fqe,
 		return false;
 	}
 
-	page_pool_dma_sync_for_cpu(page->pp, page, fqe->offset, len);
+	page_pool_dma_sync_for_cpu(page_pool_to_pp(page), page, fqe->offset,
+				   len);
 
 	return true;
 }
diff --git a/include/net/netmem.h b/include/net/netmem.h
index 8a6e20be4b9d..5e7b4d1c1c44 100644
--- a/include/net/netmem.h
+++ b/include/net/netmem.h
@@ -23,7 +23,7 @@  DECLARE_STATIC_KEY_FALSE(page_pool_mem_providers);
 struct net_iov {
 	unsigned long __unused_padding;
 	unsigned long pp_magic;
-	struct page_pool *pp;
+	struct page_pool_item *pp_item;
 	struct dmabuf_genpool_chunk_owner *owner;
 	unsigned long dma_addr;
 	atomic_long_t pp_ref_count;
@@ -33,7 +33,7 @@  struct net_iov {
  *
  *        struct {
  *                unsigned long pp_magic;
- *                struct page_pool *pp;
+ *                struct page_pool_item *pp_item;
  *                unsigned long _pp_mapping_pad;
  *                unsigned long dma_addr;
  *                atomic_long_t pp_ref_count;
@@ -49,7 +49,7 @@  struct net_iov {
 	static_assert(offsetof(struct page, pg) == \
 		      offsetof(struct net_iov, iov))
 NET_IOV_ASSERT_OFFSET(pp_magic, pp_magic);
-NET_IOV_ASSERT_OFFSET(pp, pp);
+NET_IOV_ASSERT_OFFSET(pp_item, pp_item);
 NET_IOV_ASSERT_OFFSET(dma_addr, dma_addr);
 NET_IOV_ASSERT_OFFSET(pp_ref_count, pp_ref_count);
 #undef NET_IOV_ASSERT_OFFSET
@@ -127,9 +127,9 @@  static inline struct net_iov *__netmem_clear_lsb(netmem_ref netmem)
 	return (struct net_iov *)((__force unsigned long)netmem & ~NET_IOV);
 }
 
-static inline struct page_pool *netmem_get_pp(netmem_ref netmem)
+static inline struct page_pool_item *netmem_get_pp_item(netmem_ref netmem)
 {
-	return __netmem_clear_lsb(netmem)->pp;
+	return __netmem_clear_lsb(netmem)->pp_item;
 }
 
 static inline atomic_long_t *netmem_get_pp_ref_count_ref(netmem_ref netmem)
diff --git a/include/net/page_pool/helpers.h b/include/net/page_pool/helpers.h
index 793e6fd78bc5..f781c81f8aa9 100644
--- a/include/net/page_pool/helpers.h
+++ b/include/net/page_pool/helpers.h
@@ -83,6 +83,13 @@  static inline u64 *page_pool_ethtool_stats_get(u64 *data, const void *stats)
 }
 #endif
 
+static inline struct page_pool *page_pool_to_pp(struct page *page)
+{
+	struct page_pool_item *item = page->pp_item;
+
+	return container_of(item, struct page_pool, items[item->pp_idx]);
+}
+
 /**
  * page_pool_dev_alloc_pages() - allocate a page.
  * @pool:	pool from which to allocate
diff --git a/include/net/page_pool/types.h b/include/net/page_pool/types.h
index c022c410abe3..194006d2930f 100644
--- a/include/net/page_pool/types.h
+++ b/include/net/page_pool/types.h
@@ -102,6 +102,7 @@  struct page_pool_params {
  * @refill:	an allocation which triggered a refill of the cache
  * @waive:	pages obtained from the ptr ring that cannot be added to
  *		the cache due to a NUMA mismatch
+ * @item_full	items array is full
  */
 struct page_pool_alloc_stats {
 	u64 fast;
@@ -110,6 +111,7 @@  struct page_pool_alloc_stats {
 	u64 empty;
 	u64 refill;
 	u64 waive;
+	u64 item_full;
 };
 
 /**
@@ -142,6 +144,11 @@  struct page_pool_stats {
 };
 #endif
 
+struct page_pool_item {
+	netmem_ref pp_netmem;
+	unsigned int pp_idx;
+};
+
 /* The whole frag API block must stay within one cacheline. On 32-bit systems,
  * sizeof(long) == sizeof(int), so that the block size is ``3 * sizeof(long)``.
  * On 64-bit systems, the actual size is ``2 * sizeof(long) + sizeof(int)``.
@@ -161,6 +168,8 @@  struct page_pool {
 
 	int cpuid;
 	u32 pages_state_hold_cnt;
+	unsigned int item_mask;
+	unsigned int item_idx;
 
 	bool has_init_callback:1;	/* slow::init_callback is set */
 	bool dma_map:1;			/* Perform DMA mapping */
@@ -228,7 +237,11 @@  struct page_pool {
 	 */
 	refcount_t user_cnt;
 
-	u64 destroy_cnt;
+	/* Lock to avoid doing dma unmapping concurrently when
+	 * destroy_cnt > 0.
+	 */
+	spinlock_t destroy_lock;
+	unsigned int destroy_cnt;
 
 	/* Slow/Control-path information follows */
 	struct page_pool_params_slow slow;
@@ -239,6 +252,8 @@  struct page_pool {
 		u32 napi_id;
 		u32 id;
 	} user;
+
+	struct page_pool_item items[] ____cacheline_aligned_in_smp;
 };
 
 struct page *page_pool_alloc_pages(struct page_pool *pool, gfp_t gfp);
diff --git a/net/core/devmem.c b/net/core/devmem.c
index 11b91c12ee11..09c5aa83f12a 100644
--- a/net/core/devmem.c
+++ b/net/core/devmem.c
@@ -85,7 +85,7 @@  net_devmem_alloc_dmabuf(struct net_devmem_dmabuf_binding *binding)
 	niov = &owner->niovs[index];
 
 	niov->pp_magic = 0;
-	niov->pp = NULL;
+	niov->pp_item = NULL;
 	atomic_long_set(&niov->pp_ref_count, 0);
 
 	return niov;
@@ -380,7 +380,7 @@  bool mp_dmabuf_devmem_release_page(struct page_pool *pool, netmem_ref netmem)
 	if (WARN_ON_ONCE(refcount != 1))
 		return false;
 
-	page_pool_clear_pp_info(netmem);
+	page_pool_clear_pp_info(pool, netmem);
 
 	net_devmem_free_dmabuf(netmem_to_net_iov(netmem));
 
diff --git a/net/core/netmem_priv.h b/net/core/netmem_priv.h
index 7eadb8393e00..3173f6070cf7 100644
--- a/net/core/netmem_priv.h
+++ b/net/core/netmem_priv.h
@@ -18,9 +18,10 @@  static inline void netmem_clear_pp_magic(netmem_ref netmem)
 	__netmem_clear_lsb(netmem)->pp_magic = 0;
 }
 
-static inline void netmem_set_pp(netmem_ref netmem, struct page_pool *pool)
+static inline void netmem_set_pp_item(netmem_ref netmem,
+				      struct page_pool_item *item)
 {
-	__netmem_clear_lsb(netmem)->pp = pool;
+	__netmem_clear_lsb(netmem)->pp_item = item;
 }
 
 static inline void netmem_set_dma_addr(netmem_ref netmem,
diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index bec6e717cd22..3e041b80f62d 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -61,6 +61,7 @@  static const char pp_stats[][ETH_GSTRING_LEN] = {
 	"rx_pp_alloc_empty",
 	"rx_pp_alloc_refill",
 	"rx_pp_alloc_waive",
+	"rx_pp_alloc_item_full",
 	"rx_pp_recycle_cached",
 	"rx_pp_recycle_cache_full",
 	"rx_pp_recycle_ring",
@@ -94,6 +95,7 @@  bool page_pool_get_stats(const struct page_pool *pool,
 	stats->alloc_stats.empty += pool->alloc_stats.empty;
 	stats->alloc_stats.refill += pool->alloc_stats.refill;
 	stats->alloc_stats.waive += pool->alloc_stats.waive;
+	stats->alloc_stats.item_full += pool->alloc_stats.item_full;
 
 	for_each_possible_cpu(cpu) {
 		const struct page_pool_recycle_stats *pcpu =
@@ -139,6 +141,7 @@  u64 *page_pool_ethtool_stats_get(u64 *data, const void *stats)
 	*data++ = pool_stats->alloc_stats.empty;
 	*data++ = pool_stats->alloc_stats.refill;
 	*data++ = pool_stats->alloc_stats.waive;
+	*data++ = pool_stats->alloc_stats.item_full;
 	*data++ = pool_stats->recycle_stats.cached;
 	*data++ = pool_stats->recycle_stats.cache_full;
 	*data++ = pool_stats->recycle_stats.ring;
@@ -267,14 +270,12 @@  static int page_pool_init(struct page_pool *pool,
 		return -ENOMEM;
 	}
 
+	spin_lock_init(&pool->destroy_lock);
 	atomic_set(&pool->pages_state_release_cnt, 0);
 
 	/* Driver calling page_pool_create() also call page_pool_destroy() */
 	refcount_set(&pool->user_cnt, 1);
 
-	if (pool->dma_map)
-		get_device(pool->p.dev);
-
 	if (pool->slow.flags & PP_FLAG_ALLOW_UNREADABLE_NETMEM) {
 		/* We rely on rtnl_lock()ing to make sure netdev_rx_queue
 		 * configuration doesn't change while we're initializing
@@ -312,15 +313,91 @@  static void page_pool_uninit(struct page_pool *pool)
 {
 	ptr_ring_cleanup(&pool->ring, NULL);
 
-	if (pool->dma_map)
-		put_device(pool->p.dev);
-
 #ifdef CONFIG_PAGE_POOL_STATS
 	if (!pool->system)
 		free_percpu(pool->recycle_stats);
 #endif
 }
 
+static void page_pool_item_init(struct page_pool *pool, unsigned int item_cnt)
+{
+	struct page_pool_item *items = pool->items;
+	unsigned int i;
+
+	WARN_ON_ONCE(!is_power_of_2(item_cnt));
+
+	for (i = 0; i < item_cnt; i++)
+		items[i].pp_idx = i;
+
+	pool->item_mask = item_cnt - 1;
+}
+
+static void page_pool_item_uninit(struct page_pool *pool)
+{
+	struct page_pool_item *items = pool->items;
+	unsigned int mask = pool->item_mask;
+	unsigned int i;
+
+	if (!pool->dma_map || pool->mp_priv)
+		return;
+
+	spin_lock_bh(&pool->destroy_lock);
+
+	for (i = 0; i <= mask; i++) {
+		struct page *page;
+
+		page = netmem_to_page(READ_ONCE(items[i].pp_netmem));
+		if (!page)
+			continue;
+
+		WARN_ONCE(1, "page_pool(%u) has inflight page: %p\n",
+			  pool->user.id, page);
+
+		dma_unmap_page_attrs(pool->p.dev, page_pool_get_dma_addr(page),
+				     PAGE_SIZE << pool->p.order,
+				     pool->p.dma_dir, DMA_ATTR_SKIP_CPU_SYNC |
+				     DMA_ATTR_WEAK_ORDERING);
+		page_pool_set_dma_addr(page, 0);
+	}
+
+	pool->dma_map = false;
+	spin_unlock_bh(&pool->destroy_lock);
+}
+
+static bool page_pool_item_add(struct page_pool *pool, netmem_ref netmem)
+{
+	struct page_pool_item *items = pool->items;
+	unsigned int mask = pool->item_mask;
+	unsigned int idx = pool->item_idx;
+	unsigned int i;
+
+	for (i = 0; i <= mask; i++) {
+		unsigned int mask_idx = idx++ & mask;
+
+		if (!READ_ONCE(items[mask_idx].pp_netmem)) {
+			WRITE_ONCE(items[mask_idx].pp_netmem, netmem);
+			netmem_set_pp_item(netmem, &items[mask_idx]);
+			pool->item_idx = idx;
+			return true;
+		}
+	}
+
+	pool->item_idx = idx;
+	alloc_stat_inc(pool, item_full);
+	return false;
+}
+
+static void page_pool_item_del(struct page_pool *pool, netmem_ref netmem)
+{
+	struct page_pool_item *item = netmem_to_page(netmem)->pp_item;
+	struct page_pool_item *items = pool->items;
+	unsigned int idx = item->pp_idx;
+
+	DEBUG_NET_WARN_ON_ONCE(items[idx].pp_netmem != netmem);
+	WRITE_ONCE(items[idx].pp_netmem, (netmem_ref)NULL);
+	netmem_set_pp_item(netmem, NULL);
+}
+
 /**
  * page_pool_create_percpu() - create a page pool for a given cpu.
  * @params: parameters, see struct page_pool_params
@@ -329,10 +406,15 @@  static void page_pool_uninit(struct page_pool *pool)
 struct page_pool *
 page_pool_create_percpu(const struct page_pool_params *params, int cpuid)
 {
+#define PAGE_POOL_MIN_INFLIGHT_ITEMS		512
+	unsigned int item_cnt = (params->pool_size ? : 1024) +
+				PP_ALLOC_CACHE_SIZE + PAGE_POOL_MIN_INFLIGHT_ITEMS;
 	struct page_pool *pool;
 	int err;
 
-	pool = kzalloc_node(sizeof(*pool), GFP_KERNEL, params->nid);
+	item_cnt = roundup_pow_of_two(item_cnt);
+	pool = kvzalloc_node(struct_size(pool, items, item_cnt), GFP_KERNEL,
+			     params->nid);
 	if (!pool)
 		return ERR_PTR(-ENOMEM);
 
@@ -340,6 +422,8 @@  page_pool_create_percpu(const struct page_pool_params *params, int cpuid)
 	if (err < 0)
 		goto err_free;
 
+	page_pool_item_init(pool, item_cnt);
+
 	err = page_pool_list(pool);
 	if (err)
 		goto err_uninit;
@@ -350,7 +434,7 @@  page_pool_create_percpu(const struct page_pool_params *params, int cpuid)
 	page_pool_uninit(pool);
 err_free:
 	pr_warn("%s() gave up with errno %d\n", __func__, err);
-	kfree(pool);
+	kvfree(pool);
 	return ERR_PTR(err);
 }
 EXPORT_SYMBOL(page_pool_create_percpu);
@@ -499,19 +583,24 @@  static struct page *__page_pool_alloc_page_order(struct page_pool *pool,
 	if (unlikely(!page))
 		return NULL;
 
-	if (pool->dma_map && unlikely(!page_pool_dma_map(pool, page_to_netmem(page)))) {
-		put_page(page);
-		return NULL;
-	}
+	if (unlikely(!page_pool_set_pp_info(pool, page_to_netmem(page))))
+		goto err_alloc;
+
+	if (pool->dma_map && unlikely(!page_pool_dma_map(pool, page_to_netmem(page))))
+		goto err_set_info;
 
 	alloc_stat_inc(pool, slow_high_order);
-	page_pool_set_pp_info(pool, page_to_netmem(page));
 
 	/* Track how many pages are held 'in-flight' */
 	pool->pages_state_hold_cnt++;
 	trace_page_pool_state_hold(pool, page_to_netmem(page),
 				   pool->pages_state_hold_cnt);
 	return page;
+err_set_info:
+	page_pool_clear_pp_info(pool, page_to_netmem(page));
+err_alloc:
+	put_page(page);
+	return NULL;
 }
 
 /* slow path */
@@ -546,12 +635,18 @@  static noinline netmem_ref __page_pool_alloc_pages_slow(struct page_pool *pool,
 	 */
 	for (i = 0; i < nr_pages; i++) {
 		netmem = pool->alloc.cache[i];
+
+		if (unlikely(!page_pool_set_pp_info(pool, netmem))) {
+			put_page(netmem_to_page(netmem));
+			continue;
+		}
+
 		if (dma_map && unlikely(!page_pool_dma_map(pool, netmem))) {
+			page_pool_clear_pp_info(pool, netmem);
 			put_page(netmem_to_page(netmem));
 			continue;
 		}
 
-		page_pool_set_pp_info(pool, netmem);
 		pool->alloc.cache[pool->alloc.count++] = netmem;
 		/* Track how many pages are held 'in-flight' */
 		pool->pages_state_hold_cnt++;
@@ -623,9 +718,13 @@  s32 page_pool_inflight(const struct page_pool *pool, bool strict)
 	return inflight;
 }
 
-void page_pool_set_pp_info(struct page_pool *pool, netmem_ref netmem)
+bool page_pool_set_pp_info(struct page_pool *pool, netmem_ref netmem)
 {
-	netmem_set_pp(netmem, pool);
+	if (unlikely(!page_pool_item_add(pool, netmem)))
+		return false;
+
+	DEBUG_NET_WARN_ON_ONCE(page_pool_to_pp(netmem_to_page(netmem)) != pool);
+
 	netmem_or_pp_magic(netmem, PP_SIGNATURE);
 
 	/* Ensuring all pages have been split into one fragment initially:
@@ -637,12 +736,14 @@  void page_pool_set_pp_info(struct page_pool *pool, netmem_ref netmem)
 	page_pool_fragment_netmem(netmem, 1);
 	if (pool->has_init_callback)
 		pool->slow.init_callback(netmem, pool->slow.init_arg);
+
+	return true;
 }
 
-void page_pool_clear_pp_info(netmem_ref netmem)
+void page_pool_clear_pp_info(struct page_pool *pool, netmem_ref netmem)
 {
 	netmem_clear_pp_magic(netmem);
-	netmem_set_pp(netmem, NULL);
+	page_pool_item_del(pool, netmem);
 }
 
 static __always_inline void __page_pool_release_page_dma(struct page_pool *pool,
@@ -672,9 +773,13 @@  static __always_inline void __page_pool_release_page_dma(struct page_pool *pool,
  */
 void page_pool_return_page(struct page_pool *pool, netmem_ref netmem)
 {
+	unsigned int destroy_cnt = READ_ONCE(pool->destroy_cnt);
 	int count;
 	bool put;
 
+	if (unlikely(destroy_cnt))
+		spin_lock_bh(&pool->destroy_lock);
+
 	put = true;
 	if (static_branch_unlikely(&page_pool_mem_providers) && pool->mp_priv)
 		put = mp_dmabuf_devmem_release_page(pool, netmem);
@@ -688,9 +793,13 @@  void page_pool_return_page(struct page_pool *pool, netmem_ref netmem)
 	trace_page_pool_state_release(pool, netmem, count);
 
 	if (put) {
-		page_pool_clear_pp_info(netmem);
+		page_pool_clear_pp_info(pool, netmem);
 		put_page(netmem_to_page(netmem));
 	}
+
+	if (unlikely(destroy_cnt))
+		spin_unlock_bh(&pool->destroy_lock);
+
 	/* An optimization would be to call __free_pages(page, pool->p.order)
 	 * knowing page is not part of page-cache (thus avoiding a
 	 * __page_cache_release() call).
@@ -1034,14 +1143,14 @@  static void __page_pool_destroy(struct page_pool *pool)
 		static_branch_dec(&page_pool_mem_providers);
 	}
 
-	kfree(pool);
+	kvfree(pool);
 }
 
 static void page_pool_empty_alloc_cache_once(struct page_pool *pool)
 {
 	netmem_ref netmem;
 
-	if (pool->destroy_cnt)
+	if (pool->destroy_cnt > 1)
 		return;
 
 	/* Empty alloc cache, assume caller made sure this is
@@ -1057,7 +1166,7 @@  static void page_pool_empty_alloc_cache_once(struct page_pool *pool)
 static void page_pool_scrub(struct page_pool *pool)
 {
 	page_pool_empty_alloc_cache_once(pool);
-	pool->destroy_cnt++;
+	WRITE_ONCE(pool->destroy_cnt, pool->destroy_cnt + 1);
 
 	/* No more consumers should exist, but producers could still
 	 * be in-flight.
@@ -1139,10 +1248,14 @@  void page_pool_destroy(struct page_pool *pool)
 	if (!page_pool_put(pool))
 		return;
 
+	/* disable dma_sync_for_device */
+	pool->dma_sync = false;
+
 	page_pool_disable_direct_recycling(pool);
+	WRITE_ONCE(pool->destroy_cnt, 1);
 
-	/* Wait for the freeing side see the disabling direct recycling setting
-	 * to avoid the concurrent access to the pool->alloc cache.
+	/* Wait for the freeing side to see the new pool->dma_sync,
+	 * disable_direct and pool->destroy_cnt in page_pool_put_page.
 	 */
 	synchronize_rcu();
 
@@ -1151,6 +1264,8 @@  void page_pool_destroy(struct page_pool *pool)
 	if (!page_pool_release(pool))
 		return;
 
+	page_pool_item_uninit(pool);
+
 	page_pool_detached(pool);
 	pool->defer_start = jiffies;
 	pool->defer_warn  = jiffies + DEFER_WARN_INTERVAL;
diff --git a/net/core/page_pool_priv.h b/net/core/page_pool_priv.h
index 57439787b9c2..5d85f862a30a 100644
--- a/net/core/page_pool_priv.h
+++ b/net/core/page_pool_priv.h
@@ -36,16 +36,18 @@  static inline bool page_pool_set_dma_addr(struct page *page, dma_addr_t addr)
 }
 
 #if defined(CONFIG_PAGE_POOL)
-void page_pool_set_pp_info(struct page_pool *pool, netmem_ref netmem);
-void page_pool_clear_pp_info(netmem_ref netmem);
+bool page_pool_set_pp_info(struct page_pool *pool, netmem_ref netmem);
+void page_pool_clear_pp_info(struct page_pool *pool, netmem_ref netmem);
 int page_pool_check_memory_provider(struct net_device *dev,
 				    struct netdev_rx_queue *rxq);
 #else
-static inline void page_pool_set_pp_info(struct page_pool *pool,
+static inline bool page_pool_set_pp_info(struct page_pool *pool,
 					 netmem_ref netmem)
 {
+	return true;
 }
-static inline void page_pool_clear_pp_info(netmem_ref netmem)
+static inline void page_pool_clear_pp_info(struct page_pool *pool,
+					   netmem_ref netmem)
 {
 }
 static inline int page_pool_check_memory_provider(struct net_device *dev,
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 74149dc4ee31..d4295353ca6e 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -1033,7 +1033,8 @@  bool napi_pp_put_page(netmem_ref netmem)
 	if (unlikely(!is_pp_netmem(netmem)))
 		return false;
 
-	page_pool_put_full_netmem(netmem_get_pp(netmem), netmem, false);
+	page_pool_put_full_netmem(page_pool_to_pp(netmem_to_page(netmem)),
+				  netmem, false);
 
 	return true;
 }
diff --git a/net/core/xdp.c b/net/core/xdp.c
index bcc5551c6424..e8582036b411 100644
--- a/net/core/xdp.c
+++ b/net/core/xdp.c
@@ -384,7 +384,8 @@  void __xdp_return(void *data, struct xdp_mem_info *mem, bool napi_direct,
 		/* No need to check ((page->pp_magic & ~0x3UL) == PP_SIGNATURE)
 		 * as mem->type knows this a page_pool page
 		 */
-		page_pool_put_full_page(page->pp, page, napi_direct);
+		page_pool_put_full_page(page_pool_to_pp(page), page,
+					napi_direct);
 		break;
 	case MEM_TYPE_PAGE_SHARED:
 		page_frag_free(data);